I’ve release Trollop 2.0. This is the first Trollop release in about two years!

Trollop is pretty stable and I don’t care to tweak it too much, but there was one behavior that has always irked me: how boolean parameters (aka flags) are handled. In pre-2.0, not specifying a flag on the commandline would result in the option being set to its default value; specifying it on the commandline would result in the option being set to the opposite of its default value. This was weird for options with a default of true:

  opt :magic, "Use magic", default: true
Using --magic with the above configuration would result in a :magic => false value in the options hash, which is a little silly. You could change the description to reflect this, but that looks silly too:
  opt :magic, "Don't use magic", default: true

Fixing this issue required breaking the API. I’ve finally bitten that bullet: in Trollop 2.0, I’ve introduced the GNU-style notion of a --no-x negation form for --x parameters. Now, specifying --x will always set the option :x to true, regardless of its default value. Specifying --no-x will always set the option :x to false, regardless of its default value. The default value only comes into play when neither form is given on the commandline.

E.g.:

 opt :magic, "Use magic", :default => true
Using --magic will result in :magic => true, and --no-magic will result in :magic => false, and not using either will result in :magic => true.

The behavior is exactly the same for flags with a default of false, or, equivalently, flags without a default value:

 opt :science, "Use science"
Using --science will result in :science => true, and --no-science will result in :science => false, and not using either will result in :science => true.

There is one special case behavior: if the option itself starts with a “no_”, then you’ll get the opposite behavior:

 opt :no_muffins, "Don't use muffins", :default => true

Using --muffins will result in :no_muffins => false, and --no-muffins will result in :no_muffins => true, and using neither will result in :no_muffins => true.

Finally, The help screen will explicitly display the --no-x form for flags with a default of true, but otherwise Trollop will accept, but not display, the negative form for all flags.

This will probably require some upgrade code for those of you who were using a flag with :default => true, but I think the end result is a much more consistent behavior.

William Morgan, August 15, 2012.

Update: I’ve published this as a gem. Install it with gem install redis-scheduler or check out the GitHub page.

Want to use Redis to schedule work items? Zsets seem like a natural fit, but they’re actually fairly tricky to use for this purpose. Unlike lists, Redis provides no blocking or atomic move primitives for zsets. Additionally, set semantics don’t allow for multiple schedulings of the same item. These problems are surmountable, but only with some effort.

So here is a basic scheduler for Redis, in Ruby. I think it does everything you’d want from a production scheduler:

  • You can schedule arbitrary items at arbitrary times.
  • It supports multiple simultaneous readers and writers.
  • An exception causes the in-process item to be rescheduled at the original time.
  • A crash leaves the item in an error queue, from which it can later be recovered.
  • You can iterate over ready items in either blocking or non-blocking mode.

An example invocation:

r = Redis.new
s = RedisScheduler.new r, blocking: true
startt = Time.now
s.schedule! "a", startt + 10
s.schedule! "b", startt + 15
s.schedule! "c", startt + 20
s.each { |item, at| puts "#{Time.now - startt}: #{item}" }

will print:

10.03481788: a
15.05255288: b
20.06058172: c
... waits forever ...

Here’s the code:

class RedisScheduler
  include Enumerable

  POLL_DELAY = 1.0 # seconds
  CAS_DELAY  = 0.5 # seconds

  ## options:
  ##   namespace: prefix for redis data, e.g. "scheduler/"
  ##   blocking: whether #each should block or return immediately if
  ##     there are items to be processed immediately.
  ##
  ## Note that nonblocking mode may still actually block as part of the
  ## check-and-set semantics, i.e. block during contention from multiple
  ## clients. "Nonblocking" mode just refers to whether the scheduler
  ## should wait until events in the schedule are ready, or only return
  ## those items that are ready currently.
  def initialize redis, opts={}
    @redis = redis
    @namespace = opts[:namespace]
    @blocking = opts[:blocking]

    @queue = [@namespace, "q"].join
    @error_queue = [@namespace, "errorq"].join
    @counter = [@namespace, "counter"].join
  end

  ## schedule an item at a specific time. item will be converted to a
  ## string.
  def schedule! item, time
    id = @redis.incr @counter
    @redis.zadd @queue, time.to_f, "#{id}:#{item}"
  end

  def reset!
    [@queue, @error_queue, @counter].each { |k| @redis.del k }
  end

  ## yields items along with their scheduled times. only returns items
  ## on or after their scheduled times. items returned as strings. if
  ## @blocking is false, will stop once there are no more items that can
  ## be processed immediately; if it's true, will wait until items
  ## become available (and never terminate).
  def each
    while(x = get)
      item, at = x
      begin
        yield item, at
      rescue Exception # back in the hole!
        schedule! item, at
        raise
      ensure
        cleanup! item
      end
    end
  end

private

  def get; @blocking ? blocking_get : nonblocking_get end

  def blocking_get
    sleep POLL_DELAY until(x = nonblocking_get)
    x
  end

  class InvalidEntryException < StandardError; end
  def nonblocking_get
    catch :cas_retry do
      @redis.watch @queue
      item, at = @redis.zrangebyscore @queue, 0, Time.now.to_f,
        :withscores => true, :limit => [0, 1]
      if item
        @redis.multi do # try and grab it
          @redis.zrem @queue, item
          @redis.lpush @error_queue, item
        end or begin
          sleep CAS_DELAY
          throw :cas_retry
        end
        item =~ /^\d+:(\S+)$/ or raise InvalidEntryException, item
        item = $1
        [item, Time.at(at.to_f)]
      end
    end
  end

  def cleanup! item
    @redis.lrem @error_queue, 1, item
  end
end

If you find this useful, please drop me a note.

William Morgan, April 18, 2012.

I’ve released Whistlepig 0.10. This release changes the index format in a backwards-incompatible way. Indexes created by pre-0.10 Whistlepigs will not be readable by Whistlepig 0.10.

The good news going is that I have also added proper versioning on the index formats now, and so providing upgrade paths will be possible in the future, should I find it necessary to introduce further backwards-incompatible changes.

This change was necessary primarily for multi-process support. As of Whistlepig 0.10, you can have multiple processes reading from, and writing to, the same index. This means that Whistlepig may finally be useful for e.g., Rails apps, where you have multiple server processes all reading from the same index.

Whistlepig provides this multi-process concurrency by using pthread read-write locks. From a performance perspective, locking is not a very good form of concurrency. The original plan, way back when I was designing Whistlepig, was to use lock-free algorithms to allow concurrent access (at least, single-writer and multi-reader) and remain performant. Unfortunately, the C memory model isn’t as robust as the JVM memory model and, as far as I can tell, doesn’t support the atomicity guarantees necessary to accomplish this (see e.g. this Quora question and my follow-up comment). FWIW, it look like the C11 standard may address these issues (Jens Gustedt has more on this) but that’s not going to help me today.

So, locking it it is, for now. The read-write locking mechanism at least allows multiple reader processes to access the same segment at the same time, but a single writer will block any other writers and any other readers. The implication is that if the proportion of your write traffic is high relative to your read traffic, you will waste a lot of time waiting to acquire locks. So if you need high-throughput writes from Whistlepig, you probably don’t want to glom it all together into a single index, and instead should shard documents across multiple indexes. High-throughput reads with occasional writes should be fine.

The benefit that locking does provide is that it frees me up to implement more sophisticated data structures for postings lists. Right now Whistlepig essentially builds giant linked lists of all postings, which is very naive and wasteful, but does lend itself to a lock-free approach. If I am guaranteed exclusive write access, I can be much smarter about how I represent the postings. The learning in traditional IR applications is that smaller postings lists can improve search time as well. I have some concrete ideas on how to do this that should find themselves into future Whistlepig releases.

William Morgan, April 1, 2012.

I’ve released version 0.9 of Whistlepig, my minimalist real-time full-text search engine. This is a beta release. There are no known bugs, but I wouldn’t use it to power the space shuttle just yet.

Changes since the previous release include:

  • A new Query#term_map method, which allows you to programmatically alter queries after they’ve been parsed. The most obvious use case for this is to do case-folding, which can be tricky to do on the unparsed string due to the “OR” operator.
  • Various bugfixes and some additional safety-check code.

Do a gem install whistlepig to get it, or download the tarball here.

William Morgan, March 14, 2012.

leveldb-ruby 0.7 has been released.

This release fixed a double-freeing bug in DB#close, which caused a segfault if the GC tried to reclaim the LevelDB object or when Ruby terminated.

William Morgan, July 16, 2011.

Found this message in my own commit logs for the console rubygem and I figured I’d post it here in case some other poor fool is stuck writing code with mbrtowc:

Well this is a fun little feature of mbrtowc that I discovered! Give it one bad input and it will barf on all future inputs for the rest of the program, unless you pass in your own cleared shift state object.

The mbrtowc manpage is helpfully vague:

“If the multibyte string starting at s contains an invalid multibyte sequence before the next complete character, mbrtowc() returns (size_t) -1 and sets errno to EILSEQ. In this case, the effects on *ps are undefined.”

and

“If ps is a NULL pointer, a static anonymous state only known to the mbrtowc function is used instead. Otherwise, *ps must be a valid mbstate_t object.”

The reader is left to infer, of course, that the “undefined effects” of the “static anonymous state only known to the mbrtowc function” are that of “breaking all successive calls for the rest of the execution of the program”.

William Morgan, July 7, 2011.

Edsger Dijkstra explains why array indexes should start at 0 rather than 1. [pdf]

Here’s the summary: first, let’s consider question of how to represent a range of natural numbers. For example, consider the range of integers . There are four compact ways to represent this by varying whether we use or :

Which is best? In the last two variants, we can subtract the first number from the last to find the size of the range. Based on that favorable property, let’s restrict our options to those two.

What if we instead wish to represent ? Our two candidates give us:

The second representation forces us into the realm of unnatural numbers, which is not ideal. So we have our winning representation: .

Let’s return to the original problem of indexing things. We have two choices for representing the entire range, depending on whether we start at or :

Indexing starting at requires a clumsy ; starting at allows us to write the number of elements directly as part of the range. We conclude that indexing starting at 0 allows us the greatest notational convenience.

William Morgan, July 6, 2011.

Since this is something I’ve had to figure out 5 times already:

#!/bin/sh

git filter-branch --env-filter '
if [ "$GIT_AUTHOR_EMAIL" = "old-email-address" ];
then
    export GIT_AUTHOR_EMAIL="new-email-address";
fi
' HEAD

… will replace all occurrences of ‘old-email-address’ in the git commit email field with ‘new-email-address’ on the current branch.

The usual caveats of rewriting history apply: your head will diverge and anyone working off your branch is in for a hassle.

William Morgan, April 17, 2011.

Thanks to some prompting from Aaron Gallagher, I’ve bundled up two years of Whisper bugfixes into release 0.6.

This release requires Ruby 1.9. I have also published the gem on rubygems.org. Unfortunately the name “whisper” was already taken by some ancient Rails gem, so to get this latest release, please do:

gem install whisperblog

Send your comments and questions my way. It’s still a pretty homebrew project, obviously.

William Morgan, March 4, 2011.

Whistlepig 0.2 is out already. This time it should actually compile under OS X. I’m having a grand old time figuring out all the differences in flags that Ruby when compiling gems under different architectures. But not so grand that I want to learn automake/autoconf.

As part of making sure things were compiling on different platforms, I ran make test-integration and noticed that on my dual-boot laptop runs it at 8000k/s on Linux but only 6200k/s on OS X. I was expecting a difference in Linux’s favor, but not quite that large! As another point of reference, my mid-range Linux desktop gives me 9500 k/s.

William Morgan, February 10, 2011.