william's blog | 2012-02-04 11:55:31 +0000
===========================================
leveldb-ruby 0.7 released
-------------------------
Date: July 16, 2011 9:04pm
Author: William Morgan
Labels: releases, leveldb
URL: http://masanjin.net/blog/leveldb-ruby-0.7-released.txt
leveldb-ruby [1] 0.7 has been released.
This release fixed a double-freeing bug in DB#close, which caused a segfault
if the GC tried to reclaim the LevelDB object or when Ruby terminated.
[1] https://github.com/wmorgan/leveldb-ruby
(Reply to this at http://masanjin.net/blog/leveldb-ruby-0.7-released.txt.)
Error handling with mbrtowc
---------------------------
Date: July 7, 2011 6:05pm
Author: William Morgan
Labels: console, ncurses, sup
URL: http://masanjin.net/blog/error-handling-with-mbrtowc.txt
Found this message in my own commit logs for the console rubygem [1] and I
figured I'd post it here in case some other poor fool is stuck writing code
with @mbrtowc@:
"Well this is a fun little feature of mbrtowc that I discovered! Give
it one bad input and it will barf on all future inputs for the rest of
the program, unless you pass in your own cleared shift state object."
"The mbrtowc manpage is helpfully vague:"
""If the multibyte string starting at s contains an invalid multibyte
sequence before the next complete character, mbrtowc() returns
(size_t) -1 and sets errno to EILSEQ. In this case, the effects on *ps
are undefined.""
"and"
""If ps is a NULL pointer, a static anonymous state only known to the
mbrtowc function is used instead. Otherwise, *ps must be a valid
mbstate_t object.""
The reader is left to infer, of course, that the "undefined effects" of the
"static anonymous state only known to the mbrtowc function" are that of
"breaking all successive calls for the rest of the execution of the program".
[1] http://masanjin.net/console
(Reply to this at http://masanjin.net/blog/error-handling-with-mbrtowc.txt.)
Why indexing should start at 0
------------------------------
Date: July 6, 2011 8:23pm
Author: William Morgan
Labels: math
URL: http://masanjin.net/blog/why-indexing-should-start-at-0.txt
Edsger Dijkstra explains why array indexes should start at 0 rather than 1
[1]. [pdf]
Here's the summary: first, let's consider the question of how to represent a
range of natural numbers. For example, consider the range of integers
$$2, 3, \cdots, 12$$. There are four compact ways to represent this by varying
whether we use $$\le$$ or $$\lt$$:
1. $$2 \le i \le 12$$
2. $$1 \lt i \lt 13$$
3. $$2 \le i \lt 13$$
4. $$1 \lt i \le 12$$
Which is best? In the last two variants, we can subtract the first number from
the last to find the size of the range. Based on that favorable property,
let's restrict our options to those two.
What if we instead wish to represent $$0, 1, \cdots, 4$$? Our two candidates give
us:
1. $$0 \le i \lt 5$$
2. $$-1 \lt i \le 4$$
The second representation forces us into the realm of unnatural numbers, which
is not ideal. So we have our winning representation: $$a \le i \lt b$$.
Let's return to the original problem of indexing $$N$$ things. We
have two choices for representing the entire range, depending on whether we
start at $$0$$ or $$1$$:
1. $$0 \le i \lt N$$
2. $$1 \le i \lt N + 1$$
Indexing starting at $$1$$ requires a clumsy $$+ 1$$;
starting at $$0$$ allows us to write the number of elements
directly as part of the range. We conclude that indexing starting at 0 allows
us the greatest notational convenience.
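A small Ruby sketch of the same argument, for illustration only (the variable names are made up, and nothing here comes from Dijkstra's note). Ruby's three-dot range @a...b@ is the half-open convention $$a \le i \lt b$$, so the size of the range and the count of indexed items both fall out of the bounds directly:

```ruby
# The half-open range 2 <= i < 13 covers the integers 2, 3, ..., 12.
a, b = 2, 13
half_open = (a...b).to_a
size_by_subtraction = b - a        # upper bound minus lower bound

# Indexing N things from 0: the upper bound is N itself.
items = %w[v w x y z]
n = items.length
zero_based = (0...n).to_a

# Indexing from 1 needs the clumsy "+ 1" in the upper bound.
one_based = (1...(n + 1)).to_a

puts "range size:  #{half_open.length} (b - a = #{size_by_subtraction})"
puts "from 0:      #{zero_based.inspect}"
puts "from 1:      #{one_based.inspect}"
```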
[1] http://www.cs.utexas.edu/users/EWD/ewd08xx/EWD831.PDF
(Six replies on this article at http://masanjin.net/blog/why-indexing-should-start-at-0.txt.)
How to fix git commit email addresses
-------------------------------------
Date: April 17, 2011 9:14pm
Author: William Morgan
Labels: git
URL: http://masanjin.net/blog/how-to-fix-git-commit-email-addresses.txt
Since this is something I've had to figure out 5 times already:
  #!/bin/sh
  git filter-branch --env-filter '
    if [ "$GIT_AUTHOR_EMAIL" = "old-email-address" ]; then
      export GIT_AUTHOR_EMAIL="new-email-address"
    fi
  ' HEAD
... will replace all occurrences of 'old-email-address' in the author email
field of each commit with 'new-email-address' on the current branch. (The
committer email lives in @GIT_COMMITTER_EMAIL@ and can be rewritten the same
way.)
The usual caveats of rewriting history apply: your head will diverge and
anyone working off your branch is in for a hassle.
(Reply to this at http://masanjin.net/blog/how-to-fix-git-commit-email-addresses.txt.)
Whisper 0.6 released
--------------------
Date: March 4, 2011 6:51pm
Author: William Morgan
Labels: whisper, releases
URL: http://masanjin.net/blog/whisper-0.6-released.txt
Thanks to some prompting from Aaron Gallagher, I've bundled up two years of
Whisper bugfixes into release 0.6.
This release requires Ruby 1.9. I have also published the gem on rubygems.org.
Unfortunately the name "whisper" was already taken by some ancient Rails gem,
so to get this latest release, please do:
  gem install whisperblog
Send your comments and questions my way. It's still a pretty homebrew project,
obviously.
(Reply to this at http://masanjin.net/blog/whisper-0.6-released.txt.)
Whistlepig 0.2 released
-----------------------
Date: February 10, 2011 5:03am
Author: William Morgan
Labels: whistlepig, releases
URL: http://masanjin.net/blog/whistlepig-0.2-released.txt
Whistlepig 0.2 is out already. This time it should actually compile under OS
X. I'm having a grand old time figuring out all the differences in the flags
that Ruby uses when compiling gems under different architectures. But not so
grand that I want to learn automake/autoconf.
As part of making sure things were compiling on different platforms, I ran
@make test-integration@ and noticed that my dual-boot laptop runs it at
8000 k/s on Linux but only 6200 k/s on OS X. I was expecting a difference in
Linux's favor, but not quite that large! As another point of reference, my
mid-range Linux desktop gives me 9500 k/s.
(Reply to this at http://masanjin.net/blog/whistlepig-0.2-released.txt.)
Whistlepig 0.1 released
-----------------------
Date: February 9, 2011 7:08am
Author: William Morgan
Labels: whistlepig, releases
URL: http://masanjin.net/blog/whistlepig-0.1-released.txt
Today I released the very first version of Whistlepig [1], a minimalist
realtime full-text search index.
Side projects apparently take a lot longer when you have a job and a baby,
because it's taken me over 6 months to get to the point where I have something
releasable. And there are so many obvious improvements to make. But all known
bugs are squashed, and it's good enough to use, so, it's out.
The README [2] has a good description of what Whistlepig is, so here I thought
I'd talk about the why. Why write yet another inverted index?
The unfortunate fact is that you have too many choices already: Lucene,
obviously, and its derivatives like SOLR, and if you're shy of the JVM, Xapian
and Sphinx. Ferret used to be a good choice in the Ruby world until Dave
Balmain absconded and no one had the cojones to maintain his code. I've used
each of these things.
But they are all very heavy-weight solutions, and they all suffer from what I
call the "TREC mentality". In early TREC competitions, you were given a big,
static corpus, which you indexed at your leisure, and then you were given a
bunch of queries, which were all long descriptions of what documents someone
was interested in. It would be something like "I am interested in documents
about Mayan architecture, but only during the pre-conquistador period, and
specifically I am not interested in such and such" and so on. These
competitions were great in that they spurred advances in search engineering,
but the result is that almost every inverted index implementation today is
optimized for precisely the case of static corpora and large queries.
In the roughly 20 years since, the use case for full-text search has far
exceeded the library-science-style applications of the early TREC
competitions. There are many applications where you don't need tf-idf scores
and the Okapi formula or even necessarily stemming. You just want recent
things that match your query, and you value control and transparency over some
kind of fuzzy natural language matching. Search in GMail (or Sup, of course!)
comes to mind, or searching within the posts on this blog.
That's one part of the reason why existing solutions are not ideal. The other
part is that inverted indexes are so optimized for speed and for size that
even little things like wanting documents from last to first can be
drastically slower than using the standard ordering. For example, Sup wants
documents in reverse chronological order; Xapian is fastest in increasing
docid order; so we play crazy games to map dates to docids:
  DOCID_SCALE = 2.0**32
  TIME_SCALE = 2.0**27
  MIDDLE_DATE = Time.gm(2011)

  def assign_docid m, truncated_date
    t = (truncated_date.to_i - MIDDLE_DATE.to_i).to_f
    docid = (DOCID_SCALE - DOCID_SCALE /
             (Math::E**(-(t/TIME_SCALE)) + 1)).to_i
    while docid > 0 and docid_exists? docid
      docid -= 1
    end
    docid > 0 ? docid : nil
  end
This snippet is courtesy of Rich Lane, who should be credited in history books
as the first person to find a use for a logistic curve in an email client.
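To see what the curve is doing, here is a standalone sketch of just the date-to-docid mapping. The constants are copied from the snippet above; @docid_for@ is a hypothetical helper that drops the collision-avoidance loop so it can run without an index behind it, and the sample dates are arbitrary.

```ruby
# Constants copied from the snippet above.
DOCID_SCALE = 2.0**32
TIME_SCALE  = 2.0**27
MIDDLE_DATE = Time.gm(2011)

# Hypothetical helper: the same logistic mapping as assign_docid, without
# the loop that steps past docids that are already taken.
def docid_for(date)
  t = (date.to_i - MIDDLE_DATE.to_i).to_f
  (DOCID_SCALE - DOCID_SCALE / (Math::E**(-(t / TIME_SCALE)) + 1)).to_i
end

# Arbitrary sample dates, oldest first.
dates  = [Time.gm(2001), Time.gm(2010), MIDDLE_DATE, Time.gm(2012), Time.gm(2021)]
docids = dates.map { |d| docid_for(d) }

# Newer dates map to smaller docids, so walking the index in increasing
# docid order visits documents newest first.
dates.zip(docids).each { |date, docid| puts "#{date.year}: #{docid}" }
```

The logistic shape squeezes all of time into the 32-bit docid range: dates near @MIDDLE_DATE@ are spread across the most docids, and dates far in the past or future are packed more tightly toward the ends.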
If you try and use something like Xapian or Sphinx for these applications, you
have to play games like that for performance. And when new documents arrive,
you have to play further games to get them into the index sooner rather than
later. And all the while you're turning off 90% of the features anyways.
So that leads us to the world of realtime search, which explicitly values
recent documents over older ones. It's "realtime" in that new documents arrive
on the fly and must be made available to queries as soon as they arrive. If
you're in that situation, you typically also care more about recent
documents than older ones anyways. Those are the two tenets of realtime
search: documents are available immediately, and recent documents are more
important than older ones.
Whistlepig is my attempt to capture those two tenets in as few lines of code
as possible, while still being reasonably performant. I do this by stripping
away all the vestigial TREC functionality of relevance, ranking, sorting,
tf-idf, etc. You get documents in LIFO order, and that's it. Whistlepig
doesn't return anything besides the docid either: if you need something more
than the id, you have to fit that into a separate store somewhere. It turns
out if you throw that stuff away, you can accomplish the rest of the search
problem without a tremendous amount of code. Like any C program, it's 5%
algorithm and 95% bookkeeping.
There is one wrinkle that I actually add to the model: I allow adding and
removing labels from documents. Every other aspect of a document is fixed in
Whistlepig--you can't even delete it from the index once it's been added--but
labels are mutable. And of course you can intermingle labels with the other
components of your query. Almost every realtime search application I can dream
up would benefit from this functionality, so there you go.
My hope for Whistlepig is that it becomes the default choice for realtime
search applications, especially in the Ruby world, which hasn't had a good
in-process search solution since Ferret bit the dust. And if I mysteriously
disappear like Dave did, I also hope that the codebase is small enough and
simple enough that taking it over doesn't seem like a herculean effort.
[1] http://masanjin.net/whistlepig
[2] http://github.com/wmorgan/whistlepig/blob/master/README
(Reply to this at http://masanjin.net/blog/whistlepig-0.1-released.txt.)
"sudo gem install" considered harmful
-------------------------------------
Date: September 22, 2010 5:36pm
Author: William Morgan
Labels: ruby
URL: http://masanjin.net/blog/sudo-gem-install-considered-harmful.txt
_Update 2010/10/02: see here [1] for a real-life example._
If you habitually type @sudo gem install@ on your development box, you are
potentially exposing yourself to nasty behavior. If you have @sudo gem
install@ as part of your automated deploy process, you are begging for
something tragic to happen.
Consider:
1. A gem can execute arbitrary code at install time.[1]
2. Anyone with the proper permissions on rubygems.org [2] can publish a new
version of a gem at any point. This code is not reviewed or audited by anyone
before publication.
3. @gem install@ pulls in the latest version of any dependencies that it can,
for the entire dependency graph.
All it takes is for one malicious or incompetent gem writer to do something
wrong, _even in a gem you don't directly depend on_, and @sudo gem install@
will destroy your box.
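As a sketch of how point 1 can come about: one standard install-time hook is the @extensions@ field of a gemspec. The gem below is entirely hypothetical and is only built in memory. It is not necessarily how any particular gem does it.

```ruby
require "rubygems"

# Hypothetical gem, for illustration only; nothing is published or installed.
spec = Gem::Specification.new do |s|
  s.name    = "innocent-looking-gem"
  s.version = "0.0.1"
  s.summary = "Shows where install-time code lives"
  s.authors = ["nobody"]
  s.files   = ["ext/extconf.rb"]

  # Files listed here are run by `gem install` on the installing machine in
  # order to build native extensions. extconf.rb is ordinary Ruby, so it runs
  # with the privileges of whoever typed the command.
  s.extensions = ["ext/extconf.rb"]
end

puts "runs at install time: #{spec.extensions.inspect}"
```

Under @sudo@, that script runs as root.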
Happily, rubygems work perfectly well in non-root mode. For local development,
you can leave out the @sudo@ and gems will be installed in your home
directory. For production use, you should be running servers and apps as
non-root users anyways.
Please, stop propagating the @sudo gem install@ meme.
fn1. See http://github.com/wmorgan/killergem.
[1] http://twitter.com/#!/timcharper/status/26202857990
[2] http://rubygems.org
(Reply to this at http://masanjin.net/blog/sudo-gem-install-considered-harmful.txt.)
Whisper Fix
-----------
Date: September 18, 2010 5:08pm
Author: William Morgan
Labels: whisper, ruby1.9
URL: http://masanjin.net/blog/whisper-fix.txt
I just noticed that comments have been backlogged for a few months because the
blog received a (spam) email with invalid UTF-8, which apparently in Ruby 1.9
causes @String#=~@ to throw the very generic @ArgumentError@.
I've caught the exception and, thanks to my high-tech mbox-based queueing
system, we're back on track. The hazards of a 1-person install base, I
suppose.
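A minimal sketch of the failure and two ways to guard against it. This is not Whisper's actual code: @safe_match@ is a hypothetical helper, and @String#scrub@ only exists in Ruby 2.1 and later, so it wasn't an option on 1.9.

```ruby
# Bytes that are not valid UTF-8 (0xE9 is "é" in Latin-1), tagged as UTF-8
# the way a message with a wrong charset header might be.
body = "caf\xE9 spam".dup.force_encoding("UTF-8")
puts "valid encoding? #{body.valid_encoding?}"

# Option 1: rescue the error around the match and treat it as "no match".
def safe_match(text, pattern)
  text =~ pattern
rescue ArgumentError
  nil
end

puts "guarded match:  #{safe_match(body, /spam/).inspect}"

# Option 2 (Ruby 2.1+): replace the invalid bytes first, then match normally.
clean = body.scrub("?")
puts "scrubbed text:  #{clean}"
puts "scrubbed match: #{(clean =~ /spam/).inspect}"
```

Rescuing keeps the queue moving but silently skips the bad message. Scrubbing keeps the message at the cost of altering its bytes.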
(Reply to this at http://masanjin.net/blog/whisper-fix.txt.)
Wall on Greatness
-----------------
Date: September 11, 2010 11:09pm
Author: William Morgan
Labels: perl, proglang
URL: http://masanjin.net/blog/wall-on-greatness.txt
bq. The very fact that it's possible to write messy programs in Perl is also
what makes it possible to write programs that are cleaner in Perl than they
could ever be in a language that attempts to enforce cleanliness. The
potential for greater good goes right along with the potential for greater
evil. A little baby has little potential for good or evil, at least in the
short term. A President of the United States has tremendous potential for both
good and evil. -- Larry Wall [1].
[1] http://www.wall.org/~larry/pm.html
(Reply to this at http://masanjin.net/blog/wall-on-greatness.txt.)
Pages
-----
* Page 1: You're reading it.
* Page 2: http://masanjin.net/blog/index/1.txt
* Page 3: http://masanjin.net/blog/index/2.txt
* Page 4: http://masanjin.net/blog/index/3.txt
* Page 5: http://masanjin.net/blog/index/4.txt
* Page 6: http://masanjin.net/blog/index/5.txt
* Page 7: http://masanjin.net/blog/index/6.txt
* Page 8: http://masanjin.net/blog/index/7.txt
* Page 9: http://masanjin.net/blog/index/8.txt
This delicious text version served up by Whisper.