william's blog | 2012-02-04 12:04:48 +0000
===========================================
Ritex 0.2 released
------------------
Date: April 2, 2009 1:02pm
Author: William Morgan
Labels: ritex, releases
URL: http://masanjin.net/blog/ritex-0.2.txt
It's been almost four years since the previous release, so I'm happy to
announce that Ritex [1] 0.2 has been released today. This version features
many bugfixes and improvements, most notably:
* Array options are now supported. (Necessary to get the @eqnarray@-style
equation alignment in this post [2].)
* Unary minus heuristics are much improved.
Here's a quick demo of the unary minus:
table{margin-left: auto; margin-right: auto;}.
| @-x@ | $$-x$$ |
| @x-x@ | $$x-x$$ |
| @x--x@ | $$x--x$$ |
| @\alpha-x@ | $$\alpha - x$$ |
| @\alpha\,-x@ | $$\alpha\, - x$$ |
Sadly, just as with LaTeX itself, there are still times where you have to hint
to get the right behavior:
table{margin-left: auto; margin-right: auto;}.
| @\sin-x@ | $$\sin-x$$ |
| @\sin{-x}@ | $$\sin{-x}$$ |
Over the years since the last release it looks like there are two new options
for generating MathML in Ruby. Itex2MML [3] has developed Ruby bindings, and
there's some other project just called MathML [4]. The big win for Ritex over
these packages, of course, is macro support:
table{margin-left: auto; margin-right: auto;}.
| @\define{\onion}{\hat{\theta}}@ | $$\define{\onion}{\hat{\theta}}$$ _-_ |
| @\define{\potato}[1]{E_\theta[#1]}@ | $$\define{\potato}[1]{E_\theta[#1]}$$ _-_ |
| @\potato{\onion}@ | $$\potato{\onion}$$ |
A quick @gem install ritex@ should get it for you, and you can see some more
example input/output pairs here [5].
[1] http://masanjin.net/ritex/
[2] http://localhost:9292/smoothing
[3] http://golem.ph.utexas.edu/~distler/blog/itex2MML.html
[4] http://mathml.rubyforge.org/
[5] http://masanjin.net/ritex/report.xml
(Reply to this at http://masanjin.net/blog/ritex-0.2.txt.)
Smoothing users' votes
----------------------
Date: March 31, 2009 11:16pm
Author: William Morgan
Labels: stats
URL: http://masanjin.net/blog/smoothing.txt
In a previous post [1] I describe how you can cook up a Bayesian framework
that results in IMDB's so-called "true Bayesian estimate", a formula which, on
its face, doesn't look particularly Bayesian.
As my astute commenters pointed out, this formula has many simpler
interpretations without needing to invoke the B word. For example, it's a
linear interpolation between two values:
$$
\define{\that}{\hat{\theta}}
\that = \lambda(v) R + (1 - \lambda(v)) \tau
$$
where $$R$$ is our mean vote, $$v$$ is the number of votes, $$\tau$$ is some
smoothing target, and $$\lambda(v)$$ is the smoothing weight. $$\lambda(v)$$ can be any
function, as long as it increases with $$v$$, stays between 0 and
1, and is 0 when $$v$$ is 0. Those constraints give you the right
behavior: with no votes, your estimate is $$\tau$$ exactly; as you add
votes, it approaches $$R$$, and $$\lambda(v)$$ controls how fast
that happens.
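To make the interpolation concrete, here's a minimal sketch. The particular weight $$\lambda(v) = v/(v+m)$$, the constant @m@, and the function names are all assumptions picked for illustration; any function meeting the three constraints above would do.

```python
def smoothing_weight(v, m=10.0):
    """One hypothetical lambda(v): 0 at v=0, increasing in v, always below 1."""
    return v / (v + m)


def smoothed_estimate(mean_vote, v, target, m=10.0):
    """Linearly interpolate between the smoothing target and the mean vote."""
    lam = smoothing_weight(v, m)
    return lam * mean_vote + (1.0 - lam) * target


# With no votes the estimate is the target; as votes pile up it nears the mean.
for v in (0, 1, 10, 100, 10000):
    print(v, round(smoothed_estimate(mean_vote=9.0, v=v, target=6.5), 3))
```

A larger @m@ makes the estimate cling to the target for longer, which is one way to see that $$\lambda(v)$$ controls the speed of the approach.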
This formulation naturally leads to the following question: if I'm smoothing
like this to deal with paucity-of-data issues, what value of $$\tau$$
should I pick? IMDB uses $$\tau=C$$, the global movie mean.
Intuitively that makes sense, but is it the right choice?
What's nice about the expression for $$\that$$ above is that the
behavior we're most interested in is when $$v=0$$, i.e. when there
are no votes. In that case, $$\that=\tau$$, because of how I've constrained
$$\lambda(v)$$. So finding the best $$\tau$$ is equivalent to
finding the best $$\that$$ when $$v=0$$.
$$
\define{\risk}{R(\theta, \that)}
\define{\loss}{L(\theta, \that)}
\define{\exp}[1]{E_\theta\left[#1\right]}
$$
Happily, we can answer the question of the best $$\that$$
analytically, at least if we're happy to imagine that there is a "true"
value of the movie $$\theta$$.
Given $$\theta$$, we can define a loss function $$\loss$$ that
describes how bad we think a particular value of $$\that$$ is. But we
don't really know what $$\theta$$ is for any movie (if we did, we
wouldn't be bothering with any of this). So we can generalize that a step
further and define a risk function $$\risk=\exp{\loss}$$ quantifying our _expected
loss_: the aggregate of the loss function across all possible values of
$$\theta$$, weighted by the probability of each value. This gives us
the tool we really need to answer the question above: the $$\that$$
that minimizes our risk is the winner.
In the absence of any specific notions about errors, we'll use the standard
loss function for reals, squared-error loss: $$\loss = (\theta-\that)^2$$. Then it's just
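a matter of churning the crank:
$$
\risk = \exp{\loss} = \exp{(\theta-\that)^2} = \exp{\theta^2} - 2\that\exp{\theta} + \that^2
$$
We can drop that first term, since it doesn't depend on $$\that$$ and we're only
interested in minimizing this as a function of $$\that$$ (which is just
$$\tau$$ when $$v=0$$). To find the minimum, set the derivative to zero:
$$
\frac{\partial}{\partial\that}\left(\that^2 - 2\that\exp{\theta}\right) = 2\that - 2\exp{\theta} = 0
\quad\Rightarrow\quad \that = \exp{\theta}
$$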
Unsurprisingly, we see that the best estimate of $$\theta$$ under
squared-error loss is the mean of the distribution of $$\theta$$. Since
we're interested in the case where $$v=0$$, this implies that the
best value to use for $$\tau$$ is also the mean.
So IMDB's choice of $$C$$ makes sense: the mean vote over all your
movies is a great estimate of the mean of the distribution of
$$\theta$$.
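If you'd rather check that numerically than trust the calculus, here's a small sketch. The distribution over $$\theta$$ is made up purely for illustration; the code scans a grid of candidate estimates and compares the lowest-risk one against the mean.

```python
# Hypothetical discrete distribution over the "true" movie value theta.
thetas = [2.0, 5.0, 6.0, 7.0, 9.0]
probs = [0.1, 0.2, 0.3, 0.3, 0.1]


def risk(estimate):
    """Expected squared-error loss E[(theta - estimate)^2]."""
    return sum(p * (t - estimate) ** 2 for t, p in zip(thetas, probs))


mean = sum(p * t for t, p in zip(thetas, probs))

# Scan candidate estimates from 0.00 to 10.00 and keep the lowest-risk one.
grid = [i / 100.0 for i in range(0, 1001)]
best = min(grid, key=risk)

print("mean of theta:", mean)
print("risk-minimizing estimate on the grid:", best)
```

The two printed values should agree to within the grid spacing.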
A few concluding points:
1. This answer is specific to squared-error loss; if you plug in another loss
function, the optimal value for $$\tau$$ might very well change. And
you might actually have a specific model in mind for how "bad" mis-estimates
are. Maybe over-estimates are worse than under-estimates, or something like
that.
2. The definition of the distribution of $$\theta$$ is actually
completely vague above. In fact we don't even talk about it; we just use it
implicitly in our $$\exp{\cdot}$$ terms. So you should feel free to plug in
(the mean of) whatever distribution you believe most accurately represents
your product/movie/whatever. IMDB could arguably do better by plugging in
per-category means, or something even fancier.
3. IMDB is actually a particularly bad case because movie opinions are extremely
subjective. If you're serious about modeling very subjective things, you
should be looking at multinomial models, Dirichlet priors, and the like.
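To illustrate point 1, here's a sketch comparing squared-error loss with absolute-error loss on a skewed, made-up distribution. Absolute-error loss is my choice of example, not something from the analysis above; under it the risk-minimizing estimate is the median rather than the mean.

```python
# Hypothetical skewed distribution over theta (mean 2.6, median 2.0).
thetas = [1.0, 2.0, 3.0, 10.0]
probs = [0.4, 0.3, 0.2, 0.1]


def expected_loss(estimate, loss):
    """Risk of an estimate: the loss averaged over the distribution of theta."""
    return sum(p * loss(t, estimate) for t, p in zip(thetas, probs))


def squared(theta, estimate):
    return (theta - estimate) ** 2


def absolute(theta, estimate):
    return abs(theta - estimate)


# Scan candidate estimates from 0.00 to 10.00 under each loss function.
grid = [i / 100.0 for i in range(0, 1001)]
best_squared = min(grid, key=lambda e: expected_loss(e, squared))
best_absolute = min(grid, key=lambda e: expected_loss(e, absolute))

print("best under squared-error loss:", best_squared)
print("best under absolute-error loss:", best_absolute)
```

So the right smoothing target really does depend on which mistakes you care about.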
But the take-home message is: in the absence of a specific loss function that
you really believe, smoothing towards the mean isn't just intuitive, it's
minimizing your risk.
[1] http://all-thing.net/bayesian-average
(Two replies on this article at http://masanjin.net/blog/smoothing.txt.)
What's cooking in Sup next
--------------------------
Date: March 25, 2009 4:51pm
Author: William Morgan
Labels: sup
URL: http://masanjin.net/blog/sup-next.txt
The 0.7 release ain't the only exciting Sup [1] news. Here's a list of
interesting features that are currently cooking in Sup next, along with the
associated branch name.
* zsh completion for sup commandline commands, thanks to Ingmar Vanhassel.
(_zsh-completion_)
* Undo support for many commands, thanks to Mike Stipicevic.
(_undo-manager_)
* You can now remove labels from multiple tagged threads using the syntax
@-label@, thanks to Nicolas Pouillard. (_multi-remove-labels_)
* Sup works on terminals with transparent backgrounds (and that's fixed
copy-and-paste for me too!), thanks to Mark Alexander. (_default-colors_)
* Pressing 'b' now lets you roll buffers both forward and backward, also
thanks to Nicolas Pouillard. (_roll-buffers_)
* Duplicate messages (including messages you send to a mailing list, and
then receive a copy of) should now have their labels merged, except for
unread and inbox labels. So if you automatically label messages from mailing
lists via the before-add-hook, that should work better for you now.
(_merge-labels_)
* Saving message state is now backgrounded, so pressing '$' after reading a
big thread shouldn't interfere with your life. It still blocks when closing
a buffer, though, so I have to make that work. (_background-save_)
* Email canonicalization has been removed, also thanks to Nicolas Pouillard.
The mapping between email addresses and names is no longer maintained across
multiple emails. (_dont-canonicalize-email-addresses_)
The canonicalization one is a weird one. There's been a long-standing problem
in Sup where names associated with email addresses are saved and reused.
Unfortunately many automated systems like JIRA, evite, blogger, etc. will send
you email on behalf of someone else, using the same email address but
different names. The issue was compounded because Sup decided that longer
names should always replace shorter ones, so receiving some spam claiming to
be from your address but with a random name would have all sorts of crazy
effects.
Addresses are still stored in the index, both for search purposes, and for
@thread-index-mode@. (Otherwise @thread-index-mode@ has to reread the headers
from the message source, which is slow.) Once @thread-view-mode@ is opened,
the headers must be read from the source anyways, so the email address is
updated to the correct version.
So, incoming new email should be fine. Sup will store whatever name is in the
headers, and won't do any canonicalization.
For older email, you can update the index manually by viewing the message in
@thread-view-mode@, and forcing Sup to re-save it, e.g. by changing the labels
and then changing them back. Marking it as unread, and then reading it, is an
easy way to accomplish this, at least for read messages.
You can also make judicious use of @sup-sync@ to do this for all messages in
your index.
[1] http://sup.rubyforge.org/
(Reply to this at http://masanjin.net/blog/sup-next.txt.)
Sup 0.7 released
----------------
Date: March 25, 2009 4:49pm
Author: William Morgan
Labels: sup, releases
URL: http://masanjin.net/blog/sup-0.7.txt
Sup 0.7 has been released.
You can read the announcement here [1].
The big win in this release is that Ferret index corruption issues should now
be fixed, thanks to an extensive program of locking and
thread-safety-adding.
The other nice change is that text entry will now scroll to the right upon
overflow, thanks to some arcane Curses magic.
[1] http://rubyforge.org/pipermail/sup-talk/2009-March/002030.html
(Three replies on this article at http://masanjin.net/blog/sup-0.7.txt.)
Sharing Conflict Resolutions in Git
-----------------------------------
Date: March 22, 2009 9:23pm
Author: William Morgan
Labels: git, sup
URL: http://masanjin.net/blog/git-conflict-resolution.txt
Development of Sup [1] is done with Git. Sup follows a _topic branch_
methodology: features and bugfixes typically start off as "topic" branches
from @master@, and are merged into an "integration"/"version" branch @next@
for integration testing. After _n_ cycles of additional bugfix commits to the
topic branch, and re-merges into @next@, the topic branches are finally merged
down to @master@, to be included in the next release.
I really like this approach because I think it evinces the real power of Git:
that merges are so foolproof that I can pick and choose, on a
feature-by-feature basis, which bits of code I want at each level of
integration. That's crazy cool. And users can stick to @master@ if they want
something stable, and @next@ if they want the latest-and-greatest features.
The biggest problem I've had, though, is that long-lived topic branches often
conflict with each other. This happens both when merging into @next@ and when
merging into @master@. I don't think there's a way around it; isolating
features in this way has all the benefits above, but it also means that when
they touch the same bits of code, you'll get a conflict.
As a lazy maintainer, the biggest question I've had is: is there a way to push
the burden of conflict resolution to the patch submitter? Is there a way for
me to say: hey, your change conflicts with Bob's. Can you resolve the conflict
and send it to me?
One option I've considered is to have contributors publish not only their
feature branches, but their @next@ branch as well. Assuming they aren't
mucking about with their @next@ branch otherwise, if it contains just the
merge commit, I can merge it into mine, and it should be a fast-forward that
gets me the merge commit, conflict resolution and all.
But I don't like that idea because, in every other case, I'm merging in the
feature branches directly. Why should I suddenly start merging in @next@ just
because you have a conflict?
Furthermore, Sup primarily receives email contributions via @git
format-patch@, and I do the dirty deed of sorting them into branches and
merging things around. Requiring everyone to host a git repo iff they produce
a conflicting patch seems silly. (And @git format-patch@, unfortunately,
produces nothing for merge commits, even if they have conflict resolution
changes. Maybe there's a good reason for this, or maybe not. I'm not sure.)
After some effort, and some git-talk discussion, I have a solution. And no, it
doesn't involve sharing @git-rerere@ caches. (Which it seems that some people
do!)
For the contributor: once you have resolved the conflict, do a @git diff
HEAD^@. This will output the conflict resolution changes. Email that to the
maintainer along with your patch.
For the maintainer:
$ git checkout next
$ git merge <topic-branch>
[... you have a conflict, yada yada ...]
$ git checkout next .
$ git apply --index <resolution-patch>
$ git commit
Running @git merge@ gets you to the point where you have a conflict. Running
@git checkout next .@ sets your working directory to the state it was in before
you merged. And @git apply@ applies the resolution changes.
You lose authorship of the conflict resolution, but you can use @git commit
--author@ to set it.
I think the ideal solution would be for @git format-patch@ to produce
something usable in this case. I see some traffic on the Git list that
suggests this is being considered, so hopefully one day this rigmarole will
not be necessary.
[1] http://sup.rubyforge.org/
(Four replies on this article at http://masanjin.net/blog/git-conflict-resolution.txt.)
No MathML in webkit
-------------------
Date: March 19, 2009 5:07pm
Author: William Morgan
Labels: mathml, whisper
URL: http://masanjin.net/blog/no-mathml-in-webkit.txt
So apparently WebKit has no real MathML support [1]. Empirically, it seems
like you get some stuff like Greek symbols, but things like sums and whatnot
don't appear. Oh well. Mac users, switch to Firefox, or ignore the math posts.
[1] http://webkit.org/projects/mathml/index.html
(Reply to this at http://masanjin.net/blog/no-mathml-in-webkit.txt.)
Trollop 1.13 released
---------------------
Date: March 16, 2009 5:54pm
Author: William Morgan
Labels: trollop, releases
URL: http://masanjin.net/blog/trollop-1.13.txt
I've released Trollop 1.13. This is a minor bugfix release. Arguments given
with ='s and with spaces in the values are now parsed correctly. (E.g.
@--name="your mom"@.)
Get it with a quick @gem install trollop@.
(Three replies on this article at http://masanjin.net/blog/trollop-1.13.txt.)
Whisper 0.3 released
--------------------
Date: March 16, 2009 5:44pm
Author: William Morgan
Labels: whisper, releases
URL: http://masanjin.net/blog/whisper-0.3.txt
I've released Whisper 0.3. This is mostly a bugfix release, with generally
better email support, including support for MIME multipart email.
How to do it:
1. @sudo gem install whisper --source http://masanjin.net/@
2. @whisper-init @
3. Follow the instructions!
(Reply to this at http://masanjin.net/blog/whisper-0.3.txt.)
git-wtf dd706855 released
-------------------------
Date: March 16, 2009 5:02pm
Author: William Morgan
Labels: git, git-wtf, releases
URL: http://masanjin.net/blog/git-wtf-dd706855-released.txt
I've released version dd706855 of git-wtf, available here:
http://git-wt-commit.rubyforge.org/git-wtf [1]
I've tweaked the output format so that branches that don't exist on the remote
server are displayed with @()@'s and those that do with @[]@'s, and @~@ is the
new symbol for a merge that only occurs on the local side.
I think this produces a better display; lots more information per line of
output.
I've also added a couple random options which you can discover by reading the
source. :)
The big next step I'd like to take with this thing is to support multiple
remote repos better. Currently it's kinda specific to your origin repo.
[1] http://git-wt-commit.rubyforge.org/git-wtf
(Reply to this at http://masanjin.net/blog/git-wtf-dd706855-released.txt.)
Understanding the "Bayesian Average"
------------------------------------
Date: March 12, 2009 4:07pm
Author: William Morgan
Labels: stats
URL: http://masanjin.net/blog/bayesian-average.txt
IMDB rates movies using a score they call the true Bayesian estimate [1]
(bottom of the page). I'm pretty sure that's a made-up term. A couple other
sites, like BoardGameGeek, use the same thing and call it a "Bayesian
average". I think that's a made-up term, too, even though there's a Wikipedia
article on it [2].
Nonetheless, the formula is simple, and it has a nice interpretation. Here it
is:
$$
\frac{Rv + Cm}{v + m}
$$
where $$C$$ is the mean vote across all movies, $$v$$ is
the number of votes, $$R$$ is the mean rating for the movie, and
$$m$$ is the "minimum number of votes required to be listed in the
top 250 (currently 1300)".
The nice interpretation is this: pretend that, in addition to the
$$v$$ votes that users give a movie, you're also throwing in
$$m$$ votes of score $$C$$ each. In effect you're
pushing the scores towards the global average, by $$m$$ votes.
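Here's a quick sketch of that equivalence, assuming the formula is the usual weighted rating $$(Rv + Cm)/(v + m)$$. The votes, @m@, and @C@ below are made-up numbers, and the function names are mine.

```python
def weighted_rating(votes, m, C):
    """IMDB-style estimate (v*R + m*C) / (v + m), with R the mean of `votes`."""
    v = len(votes)
    R = sum(votes) / v if v else 0.0
    return (v * R + m * C) / (v + m)


def mean_with_pseudo_votes(votes, m, C):
    """Plain mean after padding the real votes with m extra votes of score C."""
    padded = list(votes) + [C] * m
    return sum(padded) / len(padded)


votes = [10, 9, 10, 8]  # made-up votes for one movie
print(weighted_rating(votes, m=5, C=6.0))
print(mean_with_pseudo_votes(votes, m=5, C=6.0))
```

Both lines should print the same number. Note the padding trick needs an integer @m@ greater than zero, while the formula itself works for any positive @m@.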
Is this arbitrary? Actually, no. It's the mean (and also the mode, i.e. the MAP
estimate) of the posterior distribution you get when you have a Normal prior
with mean $$C$$ and precision $$m$$, and a Normal conditional with variance 1.0.
In other words, you're starting with a belief that, in the absence of votes, a
movie/boardgame should be ranked as average, and you're assuming that user
votes are normally-distributed around the "true" score with variance 1.0.
Then you're looking at the posterior distribution (i.e. the probability
distribution that arises as a result of those assumptions), and you're picking
the most likely value from that, which in the case of Gaussians is the mean.
Let's see how that works.
To find the posterior distribution, we could work through the math, or we
could just look at the Wikipedia article on conjugate priors [3]. We'll see
that the posterior distribution of a Normal, when the prior is also a Normal,
is a Normal with mean
$$
\frac{\tau_0 \mu_0 + \tau \sum_{i=1}^{n} x_i}{\tau_0 + n \tau}
$$
where $$\mu_0$$ and $$\tau_0$$ are the mean and precision of
the prior, respectively, $$\tau$$ is the precision of the vote
distribution, and $$n$$ is the number of votes. In the case of
IMDB, we assumed above that $$\tau=1$$, so we have
$$
\frac{\tau_0 \mu_0 + \sum_{i=1}^{n} x_i}{\tau_0 + n}
$$
Comparing the IMDB equation to this, we can see that $$v$$ above
is $$n$$ here, $$C$$ above is $$\mu_0$$ here,
$$Rv=\frac{1}{v}\left(\sum_{i=1}^v v_i\right)\ v = \sum_{i=1}^v
v_i$$ above is $$\sum_{i=1}^{n} x_i$$ here, and $$m$$ above
is the hyperparameter $$\tau_0$$. So we know that even though IMDB says
$$m$$ is the "minimum number of votes required to be listed in the
top 250 list", that's an arbitrary decision on their part: it can be anything
and the formula still works. $$m$$ is the precision of the prior
distribution; as it gets bigger, the prior distribution gets "sharper", and
thus has more of an effect on the posterior distribution.
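To see the effect of the prior precision, here's a small sketch of the posterior mean $$(\tau_0 \mu_0 + \tau \sum x_i)/(\tau_0 + n\tau)$$ for a handful of values of $$\tau_0$$. The votes and the prior mean are made-up numbers, and the function name is mine.

```python
def posterior_mean(xs, mu0, tau0, tau=1.0):
    """Posterior mean of a Normal mean under a Normal prior.

    mu0 and tau0 are the prior mean and precision; tau is the precision
    of each observation (1.0 matches the variance-1.0 assumption).
    """
    n = len(xs)
    return (tau0 * mu0 + tau * sum(xs)) / (tau0 + n * tau)


votes = [10, 9, 10, 8]  # made-up votes; their plain mean is 9.25
for tau0 in (0.1, 1, 5, 50, 1300):
    print(tau0, round(posterior_mean(votes, mu0=6.0, tau0=tau0), 3))
```

With a tiny $$\tau_0$$ the result sits near the plain mean of the votes; with $$\tau_0 = 1300$$ it barely budges from the prior mean.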
Now the assumptions we made to get to this point are almost laughable. If
nothing else, we know that Gaussians are unbounded and continuous, and user
votes on IMDB are integers in the range of 1-10. The interesting take-away
message here is that even though we made a lot of assumptions above that were
laughably wrong, the end result is a reasonable formula with a nice,
intuitive meaning.
[1] http://www.imdb.com/chart/top
[2] http://en.wikipedia.org/wiki/Bayesian_average
[3] http://en.wikipedia.org/wiki/Conjugate_prior
(13 replies on this article at http://masanjin.net/blog/bayesian-average.txt.)