leveldb-ruby 0.7 has been released.

This release fixed a double-freeing bug in DB#close, which caused a segfault if the GC tried to reclaim the LevelDB object or when Ruby terminated.

William Morgan, July 16, 2011.

Thanks to some prompting from Aaron Gallagher, I’ve bundled up two years of Whisper bugfixes into release 0.6.

This release requires Ruby 1.9. I have also published the gem on rubygems.org. Unfortunately the name “whisper” was already taken by some ancient Rails gem, so to get this latest release, please do:

gem install whisperblog

Send your comments and questions my way. It’s still a pretty homebrew project, obviously.

William Morgan, March 4, 2011.

Whistlepig 0.2 is out already. This time it should actually compile under OS X. I’m having a grand old time figuring out all the differences in the flags that Ruby uses when compiling gems under different architectures. But not so grand that I want to learn automake/autoconf.

As part of making sure things were compiling on different platforms, I ran make test-integration and noticed that my dual-boot laptop runs it at 8000k/s on Linux but only 6200k/s on OS X. I was expecting a difference in Linux’s favor, but not quite that large! As another point of reference, my mid-range Linux desktop gives me 9500k/s.

William Morgan, February 10, 2011.

Today I released the very first version of Whistlepig, a minimalist realtime full-text search index.

Side projects apparently take a lot longer when you have a job and a baby, because it’s taken me over 6 months to get to the point where I have something releasable. And there are so many obvious improvements to make. But all known bugs are squashed, and it’s good enough to use, so, it’s out.

The README has a good description of what Whistlepig is, so here I thought I’d talk about the why. Why write yet another inverted index?

The unfortunate fact is that you have too many choices already: Lucene, obviously, and its derivatives like SOLR, and if you’re shy of the JVM, Xapian and Sphinx. Ferret used to be a good choice in the Ruby world until Dave Balmain absconded and no one had the cojones to maintain his code. I’ve used each of these things.

But they are all very heavy-weight solutions, and they all suffer from what I call the “TREC mentality”. In early TREC competitions, you were given a big, static corpus, which you indexed at your leisure, and then you were given a bunch of queries, which were all long descriptions of what documents someone was interested in. It would be something like “I am interested in documents about Mayan architecture, but only during the pre-conquistador period, and specifically I am not interested in such and such” and so on. These competitions were great in that they spurred advances in search engineering, but the result is that almost every inverted index implementation today is optimized for precisely the case of static corpora and large queries.

In the intervening 30 years, the use case for full-text search has far exceeded the library-science-style applications of the early TREC competitions. There are many applications where you don’t need tf-idf scores and the Okapi formula or even necessarily stemming. You just want recent things that match your query, and you value control and transparency over some kind of fuzzy natural language matching. Search in GMail (or Sup, of course!) comes to mind, or searching within the posts on this blog.

That’s one part of the reason why existing solutions are not ideal. The other part is that inverted indexes are so optimized for speed and for size that even little things like wanting documents from last to first can be drastically slower than using the standard ordering. For example, Sup wants documents in reverse chronological order; Xapian is fastest in increasing docid order; so we play crazy games to map dates to docids:

DOCID_SCALE = 2.0**32
TIME_SCALE = 2.0**27
MIDDLE_DATE = Time.gm(2011)
def assign_docid m, truncated_date
  t = (truncated_date.to_i - MIDDLE_DATE.to_i).to_f
  docid = (DOCID_SCALE - DOCID_SCALE /
    (Math::E**(-(t/TIME_SCALE)) + 1)).to_i
  while docid > 0 and docid_exists? docid
    docid -= 1
  end
  docid > 0 ? docid : nil
end

This snippet is courtesy of Rich Lane, who should be credited in history books as the first person to find a use for a logistic curve in an email client.
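
To see what the curve buys you, here is a minimal self-contained sketch of the same mapping. The ToyIndex class and its Set of taken docids are hypothetical stand-ins for the real index (the snippet above relies on a docid_exists? method that lives elsewhere); only the formula and the constants are taken from the snippet.

```ruby
require 'set'

# Hypothetical stand-in for the index: it only remembers which docids are taken.
class ToyIndex
  DOCID_SCALE = 2.0**32
  TIME_SCALE  = 2.0**27
  MIDDLE_DATE = Time.gm(2011)

  def initialize
    @taken = Set.new
  end

  # Maps a date onto a docid with a logistic curve: dates far in the past
  # approach DOCID_SCALE, dates far in the future approach 0, and
  # MIDDLE_DATE lands exactly in the middle.
  def assign_docid(date)
    t = (date.to_i - MIDDLE_DATE.to_i).to_f
    docid = (DOCID_SCALE - DOCID_SCALE / (Math::E**(-(t / TIME_SCALE)) + 1)).to_i
    # On a collision, walk downwards until a free docid turns up.
    docid -= 1 while docid > 0 && @taken.include?(docid)
    return nil unless docid > 0
    @taken << docid
    docid
  end
end

index = ToyIndex.new
middle = index.assign_docid(Time.gm(2011))
newer  = index.assign_docid(Time.gm(2011, 1, 2))
older  = index.assign_docid(Time.gm(2010, 12, 31))
puts [newer, middle, older].inspect # newer dates get smaller docids
```

Because newer dates get smaller docids, walking the index in increasing docid order visits documents newest first, which is the ordering Sup is after.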

If you try and use something like Xapian or Sphinx for these applications, you have to play games like that for performance. And when new documents arrive, you have to play further games to get them into the index sooner rather than later. And all the while you’re turning off 90% of the features anyways.

So that leads us to the world of realtime search, which explicitly values recent documents over older ones. It’s “realtime” in the sense that new documents arrive on the fly and must be made available to queries as soon as they arrive. If you’re in that situation, you typically also care more about recent documents than older ones anyways. Those are the two tenets of realtime search: documents are available immediately, and recent documents are more important than older ones.

Whistlepig is my attempt to capture those two tenets in as few lines of code as possible, while still being reasonably performant. I do this by stripping away all the vestigial TREC functionality of relevance, ranking, sorting, tf-idf, etc. You get documents in LIFO order, and that’s it. Whistlepig doesn’t return anything besides the docid either: if you need something more than the id, you have to fit that into a separate store somewhere. It turns out if you throw that stuff away, you can accomplish the rest of the search problem without a tremendous amount of code. Like any C program, it’s 5% algorithm and 95% bookkeeping.

There is one wrinkle that I actually add to the model: I allow adding and removing labels from documents. Every other aspect of a document is fixed in Whistlepig—you can’t even delete it from the index once it’s been added—but labels are mutable. And of course you can intermingle labels with the other components of your query. Almost every realtime search application I can dream up would benefit from this functionality, so there you go.
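
As a rough illustration of that model, here is a toy in-memory index in Ruby. It is not Whistlepig’s API or implementation: the class and every method name are made up for this sketch, and it ignores persistence and performance entirely. It only shows the shape of the contract: documents go in and can’t be changed, queries return docids newest first, and labels are the one mutable part.

```ruby
# Toy model only; ToyRealtimeIndex and all of its methods are hypothetical.
class ToyRealtimeIndex
  def initialize
    @postings = Hash.new { |h, k| h[k] = [] } # term  => docids
    @labels   = Hash.new { |h, k| h[k] = [] } # label => docids
    @next_docid = 0
  end

  # Documents are fixed once added; the docid is all you ever get back.
  def add(text)
    docid = (@next_docid += 1)
    text.downcase.scan(/\w+/).uniq.each { |term| @postings[term] << docid }
    docid
  end

  def add_label(docid, label)
    @labels[label] << docid unless @labels[label].include?(docid)
  end

  def remove_label(docid, label)
    @labels[label].delete(docid)
  end

  # Conjunctive query over terms and labels, returned in LIFO order.
  def search(terms, labels = [])
    lists = terms.map { |t| @postings.fetch(t.downcase, []) } +
            labels.map { |l| @labels.fetch(l, []) }
    return [] if lists.empty?
    lists.reduce(:&).sort.reverse
  end
end

index = ToyRealtimeIndex.new
first  = index.add("hello world")
second = index.add("goodbye world")
index.add_label(first, "inbox")
p index.search(["world"])            # newest docid first
p index.search(["world"], ["inbox"]) # labels mix with terms
```

Anything beyond the docid (the text itself, dates, authors) would live in a separate store keyed by docid.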

My hope for Whistlepig is that it becomes the default choice for realtime search applications, especially in the Ruby world, which hasn’t had a good in-process search solution since Ferret bit the dust. And if I mysteriously disappear like Dave did, I also hope that the codebase is small enough and simple enough that taking it over doesn’t seem like a herculean effort.

William Morgan, February 9, 2011.

Most programmers are by now familiar with the difference between the number of bytes in a string and the number of characters. Depending on the string’s encoding, the relationship between these two measures can be either trivially computable or complicated and compute-heavy.

With the advent of Ruby 1.9, the Ruby world at last has this distinction formally encoded at the language level: String#bytesize is the number of bytes in the string, and String#length and String#size the number of characters.
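
A quick way to see the difference, using only core String methods. The sample strings are just illustrations, written with \u escapes so that they are UTF-8 regardless of the source file’s encoding:

```ruby
ascii = "abc"
cjk   = "\u65e5\u672c\u8a9e" # three Japanese characters, UTF-8 encoded

puts ascii.bytesize # 3 bytes
puts ascii.length   # 3 characters
puts cjk.bytesize   # 9 bytes: each of these characters takes 3 bytes in UTF-8
puts cjk.length     # 3 characters
puts cjk.size       # 3, since String#size is the same as String#length
```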

But when you’re writing console applications, there’s a third measure you have to worry about: the width of the string on the display. ASCII characters take up one column when displayed on screen, but super-ASCII characters, such as Chinese, Japanese and Korean characters, can take up multiple columns. This display width is not trivially computable from the byte size of the character.

Finding the display width of a string is critical to any kind of console application that cares about the width of the screen, i.e. is not simply printing stuff and letting the terminal wrap. Personally, I’ve been needing it forever:

  1. Trollop needs it because it tries to format the help screen nicely.
  2. Sup needs it in a million places because it is a full-fledged console application and people use it for reading mail in all sorts of funny languages.

The actual mechanics of how to compute string width make for an interesting lesson in UNIX archaeology, but suffice it to say that I’ve travelled the path for you, with help from Tanaka Akira of pp fame, and I am happy to announce the release of the Ruby console gem.

The console gem currently provides these two methods:

  • Console.display_width: calculates the display width of a string
  • Console.display_slice: returns a substring according to display offset and display width parameters.
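
To give a feel for what “display width” means, here is a rough pure-Ruby approximation. It is not how the console gem computes widths, and the method name approx_display_width is made up for this sketch. It simply treats a handful of East Asian code-point ranges as two columns wide and everything else as one, so it gets combining characters, control characters and plenty of edge cases wrong.

```ruby
# Approximation only: code-point ranges that terminals usually draw two columns wide.
WIDE_RANGES = [
  0x1100..0x115F, # Hangul Jamo initial consonants
  0x2E80..0xA4CF, # CJK radicals, kana, CJK ideographs, Yi
  0xAC00..0xD7A3, # Hangul syllables
  0xF900..0xFAFF, # CJK compatibility ideographs
  0xFF00..0xFF60, # fullwidth forms
  0xFFE0..0xFFE6  # fullwidth signs
]

# Hypothetical helper: sums an estimated column count over the characters.
def approx_display_width(str)
  str.each_char.inject(0) do |width, ch|
    wide = WIDE_RANGES.any? { |range| range.include?(ch.ord) }
    width + (wide ? 2 : 1)
  end
end

puts approx_display_width("abc")                # 3 columns
puts approx_display_width("\u65e5\u672c\u8a9e") # 6 columns for 3 characters
```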

There is one horrible caveat outstanding, which is that I haven’t managed to get it to work on Ruby 1.8. Patches to this effect are most welcome, as are, of course, comments and suggestions.

Try it out!

William Morgan, May 19, 2010.

Trollop 1.16.2 has been out for a while now, but I realized I (heavens!) haven’t yet blogged about it.

Exciting features include:

  1. Scientific notation is now supported for floating-point arguments, thanks to Will Fitzgerald.
  2. Hoe dependency dropped. Finally.
  3. Some refactoring of the standard exception-handling logic, making it easier to customize Trollop’s behavior. For example, check this out:

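# p is a Trollop::Parser built beforehand, e.g. with Trollop::Parser.new
opts = Trollop::with_standard_exception_handling p do
  o = p.parse ARGV
  raise Trollop::HelpNeeded if ARGV.empty? # show help screen
  o # return the parsed options so that opts is not nil
end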

This example shows the help screen if there are no arguments. Previous to 1.16, this was difficult to do, since the standard exception-handling was baked into Trollop::options. The help message would automatically be displayed if -h was given, but programmatically invoking it on demand was difficult.

So I’ve refactored the standard exception handling into with_standard_exception_handling, and if you want fine-grained control, instead of calling Trollop::options, you now have the option to call Trollop::Parser#parse from within with_standard_exception_handling.

You don’t really need any of this stuff, of course, unless you’re really picky about how your exception-handling works. But hey, that’s why I wrote Trollop in the first place….

William Morgan, May 11, 2010.

I’ve just released Trollop 1.15, which fixes an irritating misfeature pointed out by Rafael Sevilla: when Trollop runs out of characters while generating short option names, e.g. because you have a lot of options, it shouldn’t throw an exception and die. It should just continue peacefully.

Trollop’s reign of domination continues!

William Morgan, September 30, 2009.

I’ve released git-wtf version bf06ab7. The highlight of this release is colorized output. ANSI escape sequences are the future of the web.

Also, the feature / integration branch comparison is now only displayed when -r is supplied.

Check out the git-wtf home page for an example of the fancy colorization, or just download it now.

William Morgan, July 28, 2009.

I’ve released Ritex 0.3. No API or functionality changes; this is just a set of miscellaneous tweaks that make Ritex work on Ruby 1.9.

William Morgan, June 18, 2009.

I’ve released Whisper version 0.5. Lots of good stuff since 0.3 (I didn’t announce 0.4 because it was a minor bugfix release):

  • Nested comments are now properly supported.
  • New <pre> and <poem> blocks added.
  • A new whisper-process-email command for manually reprocessing email. You can also offload all email processing to this program instead of the main Whisper server, if you like.
  • New dependency on the 0.2 version of RiTeX, which has equation array support (see announcement for details).
  • Better mbox-splitting code, now that I’ve figured out how to do this properly in Sup.
  • RiTeX macros now properly persist throughout an entry.
  • Many other minor bugfixes: attribution lines in emails, various incorrect bits of HTML output, escaping of Ritex error messages, etc.

Try it now!

  1. sudo gem install whisper --source http://masanjin.net/
  2. whisper-init <blog directory>
  3. Follow the instructions.

William Morgan, May 20, 2009.