In a previous post I described how you can cook up a Bayesian framework that results in IMDB’s so-called “true Bayesian estimate”, a formula which, on its face, doesn’t look particularly Bayesian.
As my astute commenters pointed out, this formula has many simpler interpretations without needing to invoke the B word. For example, it’s a linear interpolation between two values:

$$\hat\theta = \alpha(v)\,R + (1 - \alpha(v))\,C$$

where $R$ is our mean vote, $C$ is some smoothing target, and $\alpha$ is the smoothing weight. $\alpha$ can be any function of the number of votes $v$, as long as it increases with $v$, stays between 0 and 1, and is 0 when $v$ is 0. Those constraints give you the right behavior: with no votes, your estimate is exactly $C$; as you add votes, it approaches $R$, and $\alpha$ controls how fast that happens.
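Here is a minimal sketch of that interpolation in Python. The weight $v / (v + m)$ is the one that appears in IMDB's published formula, but any function meeting the constraints above would do. The function names and the numbers for `C` and `m` are made up for illustration, and `m` is assumed to be positive.

```python
def smoothing_weight(v: int, m: float) -> float:
    """Weight on the movie's own mean: 0 when v == 0, rising towards 1 as v grows.

    Assumes m > 0 (m is the hypothetical "how many votes before we trust R" knob).
    """
    return v / (v + m)


def smoothed_estimate(R: float, v: int, C: float, m: float) -> float:
    """Linear interpolation between the movie's mean vote R and the target C."""
    alpha = smoothing_weight(v, m)
    return alpha * R + (1 - alpha) * C


# Illustrative values only: a movie averaging 9.0, smoothed towards C = 6.9.
C, m = 6.9, 25.0
for v in (0, 5, 25, 1000):
    print(v, round(smoothed_estimate(R=9.0, v=v, C=C, m=m), 3))
```

With zero votes the estimate is exactly `C`, and as `v` grows it climbs towards `R`; a larger `m` makes that climb slower.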
This formulation naturally leads to the following question: if I’m smoothing like this to deal with paucity-of-data issues, what value of $C$ should I pick? IMDB uses the global movie mean. Intuitively that makes sense, but is it the right choice?
What’s nice about the expression for the estimate $\hat\theta$ above is that the behavior we’re most interested in is when $v = 0$, i.e. when there are no votes. In that case, $\hat\theta = C$, because of how I’ve constrained the weight $\alpha$. So finding the best $C$ is equivalent to finding the best $\hat\theta$ when $v = 0$.
Happily, we can answer the question of the best $\hat\theta$ analytically, at least if we’re happy to imagine that there is a “true” value of the movie, $\theta$.
Given $\theta$, we can define a loss function $L(\theta, \hat\theta)$ that describes how bad we think a particular value of $\hat\theta$ is. But we don’t really know what $\theta$ is for any movie (if we did, we wouldn’t be bothering with any of this). So we can generalize that a step further and define a risk function $E[L(\theta, \hat\theta)]$ quantifying our expected loss: the aggregate of the loss function across all possible values of $\theta$, weighted by the probability of each value. This gives us the tool we really need to answer the question above: the $\hat\theta$ that minimizes our risk is the winner.
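In the absence of any specific notions about errors, we’ll use the standard loss function for reals, squared-error loss: $L(\theta, \hat\theta) = (\theta - \hat\theta)^2$. Then it’s just a matter of churning the crank:

$$E[L(\theta, \hat\theta)] = E[(\theta - \hat\theta)^2] = E[\theta^2] - 2\hat\theta\,E[\theta] + \hat\theta^2$$

We can drop that first term, $E[\theta^2]$, since we’re only interested in minimizing this as a function of $\hat\theta$. To find the minimum:

$$\frac{d}{d\hat\theta}\left(\hat\theta^2 - 2\hat\theta\,E[\theta]\right) = 2\hat\theta - 2E[\theta] = 0 \quad\Longrightarrow\quad \hat\theta = E[\theta]$$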
Unsurprisingly, we see that the best estimate of $\theta$ under squared-error loss is the mean of the distribution of $\theta$. Since we’re interested in the case where $v = 0$, in which the estimate is just $C$, this implies that the best value to use for $C$ is also the mean.
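If you'd rather check this numerically than trust the calculus, here is a rough sketch. It draws a made-up sample of "true" movie scores (the Gaussian parameters and the clipping to the 1 to 10 range are purely illustrative assumptions), evaluates the empirical squared-error risk on a grid of candidate values, and compares the minimizer with the sample mean.

```python
import random
from statistics import fmean


def risk(theta_hat, thetas):
    """Empirical squared-error risk of guessing theta_hat for every movie."""
    return fmean((t - theta_hat) ** 2 for t in thetas)


random.seed(0)
# Hypothetical distribution of true scores, clipped to the 1..10 rating scale.
thetas = [min(10.0, max(1.0, random.gauss(6.5, 1.2))) for _ in range(2000)]

# Candidate values 1.00, 1.01, ..., 10.00.
grid = [i / 100 for i in range(100, 1001)]
best = min(grid, key=lambda c: risk(c, thetas))
sample_mean = fmean(thetas)

print("grid minimizer:", best)
print("sample mean:   ", round(sample_mean, 4))
```

The grid minimizer should land on the grid point nearest the sample mean, which is all the derivation above claims.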
So IMDB’s choice of $C$ makes sense: the mean vote over all your movies is a great estimate of the mean of the distribution of $\theta$.
A couple concluding points:
- This answer is specific to squared-error loss; if you plug in another loss function, the optimal value for $C$ might very well change. And you might actually have a specific model in mind for how “bad” mis-estimates are. Maybe over-estimates are worse than under-estimates, or something like that.
- The definition of the distribution of $\theta$ is actually completely vague above. In fact we don’t even talk about it; we just use it implicitly in our $E[\cdot]$ terms. So you should feel free to plug in (the mean of) whatever distribution you believe most accurately represents your product/movie/whatever. IMDB could arguably do better by plugging in per-category means, or something even fancier.
- IMDB is actually a particularly bad case because movie opinions are extremely subjective. If you’re serious about modeling very subjective things, you should be talking about multinomial models, Dirichlet priors, and the like.
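To make the first point concrete, here is a small sketch comparing squared-error loss with absolute-error loss on a made-up, skewed set of scores. Under absolute error the risk-minimizing value is the median rather than the mean, so the "right" smoothing target shifts. The scores, the grid, and the helper names are all illustrative assumptions.

```python
from statistics import fmean, median


def empirical_risk(theta_hat, thetas, loss):
    """Average loss of guessing theta_hat for every movie in thetas."""
    return fmean(loss(t, theta_hat) for t in thetas)


def squared_error(theta, theta_hat):
    return (theta - theta_hat) ** 2


def absolute_error(theta, theta_hat):
    return abs(theta - theta_hat)


# Made-up scores: mostly good movies plus one stinker dragging the mean down.
thetas = [2.0, 7.0, 7.5, 8.0, 8.0, 8.5, 9.0]

# Candidate targets 1.00, 1.01, ..., 10.00.
grid = [i / 100 for i in range(100, 1001)]
best_squared = min(grid, key=lambda c: empirical_risk(c, thetas, squared_error))
best_absolute = min(grid, key=lambda c: empirical_risk(c, thetas, absolute_error))

print("squared-error minimizer: ", best_squared, "(mean is", round(fmean(thetas), 3), ")")
print("absolute-error minimizer:", best_absolute, "(median is", median(thetas), ")")
```

The two minimizers differ by most of a rating point here, which is the sense in which the choice of loss function is not a detail.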
But the take-home message is: in the absence of a specific loss function that you really believe, smoothing towards the mean isn’t just intuitive, it’s minimizing your risk.
On Wed, Dec 15, 2010 at 23:37, William Morgan <comments@all-thing.net> wrote:
> IMDB’s choice of $C$ makes sense: the mean vote over all your movies is a great estimate of the mean of the distribution of $\theta$.
I’m not all that fluent in math jargon, so there’s a pretty important bit of all this that’s still unclear to me: what exactly is it that you/they are calling the “mean vote over all movies”? Is it A. The mean of all elementary scores entered by individual users (regardless of which movies they were for) or B. The global mean of all the movie-specific mean scores (a mean of means)?
(And as a sidenote: I’m not sure if you’re using some other literature than the Wikipedia article on the “Bayesian average”, but you seem to have reversed the meanings of C and m: they’re using “m” for the prior mean and C for the number of instances of m that are added to the numerator.)
Hi Adrian-Bogdan,
Thanks for the comment!
> Is it A. The mean of all elementary scores entered by individual users (regardless of which movies they were for) or B. The global mean of all the movie-specific mean scores (a mean of means)?
It’s option B, the mean of the movies’ scores. That these individual scores also happen to be calculated as means of votes is only incidental. The same risk function analysis would apply to any method of scoring.
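In case the difference between the two options isn’t obvious, here is a tiny sketch with made-up votes (the movie names and numbers are purely illustrative). Option A pools every vote, so heavily-voted movies dominate; option B gives each movie one say.

```python
from statistics import fmean

# Hypothetical data: one heavily-voted movie, one barely-voted one.
votes_by_movie = {
    "popular": [9, 9, 8, 9, 10, 9, 8, 9],
    "obscure": [3, 4],
}

# Option A: mean of all individual votes, regardless of movie.
mean_of_all_votes = fmean(
    vote for votes in votes_by_movie.values() for vote in votes
)

# Option B: mean of the per-movie means (a mean of means).
mean_of_movie_means = fmean(fmean(votes) for votes in votes_by_movie.values())

print("A, mean of all votes:  ", mean_of_all_votes)
print("B, mean of movie means:", mean_of_movie_means)
```

Since $\theta$ is a per-movie quantity, it’s the per-movie version (B) that estimates the mean of its distribution.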
> (And as a sidenote: I’m not sure if you’re using some other literature than the Wikipedia article on the “Bayesian average”, but you seem to have reversed the meanings of C and m: they’re using “m” for the prior mean and C for the number of instances of m that are added to the numerator.)
I was going by IMDB’s terminology at the bottom of http://www.imdb.com/chart/top. You’re right that $C$ and $m$ have the opposite definitions in the Wikipedia page. Strange…