william's blog | 2012-02-04 12:09:48 +0000
===========================================
Smoothing users' votes
----------------------
Date: March 31, 2009 11:16pm
Author: William Morgan
Labels: stats
URL: http://masanjin.net/blog/smoothing.txt
In a previous post [1] I described how you can cook up a Bayesian framework
that results in IMDB's so-called "true Bayesian estimate", a formula which, on
its face, doesn't look particularly Bayesian.
As my astute commenters pointed out, this formula has many simpler
interpretations without needing to invoke the B word. For example, it's a
linear interpolation between two values: $$\define{\that}{\hat{\theta}}$$
$$
\that = \lambda(v) R + (1 - \lambda(v)) \tau
$$
where $$R$$ is our mean vote, $$v$$ is the number of votes, $$\tau$$ is some smoothing
target, and $$\lambda(v)$$ is the smoothing weight. $$\lambda(v)$$ can be any
function, as long as it increases with $$v$$, stays between 0 and
1, and is 0 when $$v$$ is 0. Those constraints give you the right
behavior: with no votes, your estimate is $$\tau$$ exactly; as you add
votes, it approaches $$R$$, and $$\lambda(v)$$ controls how fast
that happens.
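As a minimal sketch of this interpolation, here is one weight function that satisfies the constraints, $$\lambda(v) = v/(v+m)$$. The constant `m` and the function names are assumptions made for illustration; the post does not fix a particular $$\lambda(v)$$. This choice has the same shape as the formula IMDB publishes.

```python
def smoothing_weight(v, m=25.0):
    """One admissible lambda(v): 0 at v=0, increasing in v, always below 1.

    m is a hypothetical tuning constant (the vote count at which the
    weight reaches 0.5); it is not defined in the post.
    """
    return v / (v + m)


def smoothed_estimate(R, v, tau, m=25.0):
    """Linear interpolation between the mean vote R and the target tau."""
    lam = smoothing_weight(v, m)
    return lam * R + (1.0 - lam) * tau


if __name__ == "__main__":
    tau = 6.9  # illustrative smoothing target, not a real IMDB figure
    for v in (0, 5, 25, 1000):
        # With no votes the estimate is tau; with many votes it nears R.
        print(v, round(smoothed_estimate(9.0, v, tau), 3))
```

With `v = 0` the function returns `tau` exactly, and raising `m` makes the estimate move away from `tau` more slowly.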
This formulation naturally leads to the following question: if I'm smoothing
like this to deal with paucity-of-data issues, what value of $$\tau$$
should I pick? IMDB uses $$\tau=C$$, the global movie mean.
Intuitively that makes sense, but is it the right choice?
What's nice about the expression for $$\that$$ above is that the
behavior we're most interested in is when $$v=0$$, i.e. when there
are no votes. In that case, $$\that=\tau$$, because of how I've constrained
$$\lambda(v)$$. So finding the best $$\tau$$ is equivalent to
finding the best $$\that$$ when $$v=0$$.
$$
\define{\risk}{R(\theta, \that)}
\define{\loss}{L(\theta, \that)}
\define{\exp}[1]{E_\theta\left[#1\right]}
$$
Happily, we can answer the question of the best $$\that$$
analytically, at least if we're willing to imagine that there is a "true"
value of the movie $$\theta$$.
Given $$\theta$$, we can define a loss function $$\loss$$ that
describes how bad we think a particular value of $$\that$$ is. But we
don't really know what $$\theta$$ is for any movie (if we did, we
wouldn't be bothering with any of this). So we can generalize that a step
further and define a risk function $$\risk=\exp{\loss}$$ quantifying our _expected
loss_: the aggregate of the loss function across all possible values of
$$\theta$$, weighted by the probability of each value. This gives us
the tool we really need to answer the question above: the $$\that$$
that minimizes our risk is the winner.
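To make the idea of risk concrete, here is a small sketch that computes expected loss over a discrete distribution of $$\theta$$. The distribution is invented purely for illustration, and the loss is passed in as an argument because the definition of risk works with any loss function; squared difference is used here only as an example.

```python
def risk(estimate, thetas, probs, loss):
    """Expected loss of `estimate`: each loss weighted by P(theta)."""
    return sum(p * loss(t, estimate) for t, p in zip(thetas, probs))


def squared_difference(theta, estimate):
    """An example loss; any function of (theta, estimate) would do."""
    return (theta - estimate) ** 2


# Hypothetical distribution over "true" movie values (made up).
THETAS = [4.0, 6.0, 7.0, 9.0]
PROBS = [0.1, 0.4, 0.4, 0.1]

if __name__ == "__main__":
    for candidate in (5.0, 6.5, 8.0):
        # A lower risk means a better candidate estimate.
        print(candidate, risk(candidate, THETAS, PROBS, squared_difference))
```

Comparing candidates by their risk is all the argument needs: whichever candidate scores lowest is the estimate to use.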
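In the absence of any specific notions about errors, we'll use the standard
loss function for reals, squared-error loss: $$\loss = (\theta-\that)^2$$. Then it's just
a matter of churning the crank. With $$v=0$$ we have $$\that=\tau$$, so:
$$
\risk = \exp{(\theta-\tau)^2} = \exp{\theta^2} - 2\tau\exp{\theta} + \tau^2
$$
We can drop that first term since we're only interested in minimizing this
as a function of $$\tau$$. To find the minimum, set the derivative to zero:
$$
\frac{d}{d\tau}\left(\tau^2 - 2\tau\exp{\theta}\right) = 2\tau - 2\exp{\theta} = 0
\quad\Rightarrow\quad \tau = \exp{\theta}
$$
(The second derivative is 2, which is positive, so this is indeed a minimum.)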
Unsurprisingly, we see that the best estimate of $$\theta$$ under
squared-error loss is the mean of the distribution of $$\theta$$. Since
we're interested in the case where $$v=0$$, this implies that the
best value to use for $$\tau$$ is also the mean.
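The result can also be checked numerically. The sketch below draws a made-up sample of $$\theta$$ values, evaluates the average squared-error loss for each candidate $$\tau$$ on a grid, and compares the winner with the sample mean. The sample distribution, grid bounds, and function names are all assumptions chosen for illustration.

```python
import random
import statistics


def empirical_risk(tau, thetas):
    """Average squared-error loss of tau over a sample of theta values."""
    return sum((t - tau) ** 2 for t in thetas) / len(thetas)


def best_tau(thetas, lo=1.0, hi=10.0, steps=901):
    """Grid search for the tau with the lowest empirical risk."""
    grid = [lo + i * (hi - lo) / (steps - 1) for i in range(steps)]
    return min(grid, key=lambda tau: empirical_risk(tau, thetas))


# Hypothetical "true" movie values on a 1-10 scale (seeded for repeatability).
rng = random.Random(0)
SAMPLE = [min(10.0, max(1.0, rng.gauss(6.5, 1.2))) for _ in range(500)]

if __name__ == "__main__":
    print("grid minimizer:", best_tau(SAMPLE))
    print("sample mean:   ", statistics.fmean(SAMPLE))
```

The two printed numbers should agree to within the grid spacing of 0.01.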
So IMDB's choice of $$C$$ makes sense: the mean vote over all your
movies is a great estimate of the mean of the distribution of
$$\theta$$.
A few concluding points:
1. This answer is specific to squared-error loss; if you plug in another loss
function, the optimal value for $$\tau$$ might very well change. And
you might actually have a specific model in mind for how "bad" mis-estimates
are. Maybe over-estimates are worse than under-estimates, or something like
that.
2. The definition of the distribution of $$\theta$$ is actually
completely vague above. In fact we don't even talk about it; we just use it
implicitly in our $$\exp{\cdot}$$ terms. So you should feel free to plug in
(the mean of) whatever distribution you believe most accurately represents
your product/movie/whatever. IMDB could arguably do better by plugging in
per-category means, or something even fancier.
3. IMDB is actually a particularly bad case because movie opinions are extremely
subjective. If you're serious about modeling very subjective things, you
should be looking at multinomial models, Dirichlet priors, and the like.
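To illustrate point 1, here is a rough sketch of how the optimal $$\tau$$ moves when over-estimates are penalized more heavily than under-estimates. The asymmetric loss, its penalty factor of 3, and the sample of $$\theta$$ values are all hypothetical, picked only to show the effect.

```python
def squared_loss(theta, estimate):
    """Symmetric squared-error loss, as used in the post."""
    return (theta - estimate) ** 2


def asymmetric_loss(theta, estimate, over_penalty=3.0):
    """Squared error, with over-estimates weighted over_penalty times more."""
    err = estimate - theta
    weight = over_penalty if err > 0 else 1.0
    return weight * err ** 2


def best_tau(thetas, loss, lo=1.0, hi=10.0, steps=901):
    """Grid search for the tau minimizing total loss over the sample."""
    grid = [lo + i * (hi - lo) / (steps - 1) for i in range(steps)]
    return min(grid, key=lambda tau: sum(loss(t, tau) for t in thetas))


# Made-up "true" values; their mean is 6.5.
THETAS = [3.0, 5.0, 6.0, 7.0, 7.5, 8.0, 9.0]

if __name__ == "__main__":
    print("squared-error optimum:", best_tau(THETAS, squared_loss))
    print("asymmetric optimum:   ", best_tau(THETAS, asymmetric_loss))
```

Under the symmetric loss the search lands on the mean; under the asymmetric loss it settles below the mean, since guessing low is the cheaper mistake.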
But the take-home message is: in the absence of a specific loss function that
you really believe, smoothing towards the mean isn't just intuitive, it's
minimizing your risk.
[1] http://all-thing.net/bayesian-average
Replies
--------
Adrian-Bogdan Morut, on December 15, 2010 10:03pm:
I'm not all that fluent in math jargon, so there's a pretty important bit of
all this that's still unclear to me: what exactly is it that you/they are
calling the "mean vote over all movies"? Is it A. The mean of all elementary
scores entered by individual users (regardless of which movies they were for)
or B. The global mean of all the movie-specific mean scores (a mean of means)?

(And as a sidenote: I'm not sure if you're using some other literature than
the Wikipedia article on the "Bayesian average", but you seem to have reversed
the meanings of C and m: they're using "m" for the prior mean and C for the
number of instances of m that are added to the numerator.)
William Morgan, on January 5, 2011 8:45pm:
Hi Adrian-Bogdan,

Thanks for the comment!

It's option B, the mean of the score of the movies. That these individual
scores also happen to be calculated as means of votes is only incidental. The
same risk function analysis would apply to any method of scoring.

I was going by IMDB's terminology at the bottom of
http://www.imdb.com/chart/top. You're right that $$C$$ and $$m$$ have the opposite
definitions in the Wikipedia page. Strange...