Is That A Good Probability Score?
Each week over on the Wagers & Tips blog, and occasionally here on the Statistical Analysis blog, I use a probability score as a measure of a probability predictor's performance.
I define the Log Probability Score (LPS) for a prediction as being:
LPS = 1 + log2(Probability Assessed for Actual Outcome)
(I add 1 to the standard definition because I think it's cruel to make a predictor strive to achieve a "perfect" score of zero. So call me a positivist ...)
So, for example, if a predictor rated a team as a 70% chance before the game and that team won, the LPS for that prediction would be 1+log2(0.7), which is about 0.49. If, instead, the team rated a 70% chance lost, the LPS of the prediction would be 1+log2(0.3), or about -0.74 since, implicitly, the probability associated with the actual outcome was 100% - 70% or 30%.
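In code, that calculation looks something like the following minimal Python sketch (the lps function is purely illustrative of the definition above, not code from MAFL itself):

```python
from math import log2

def lps(assessed_prob, team_won):
    """Log Probability Score (with the +1 adjustment used here) for a single
    prediction, given the probability assessed for the team winning."""
    prob_of_actual_outcome = assessed_prob if team_won else 1 - assessed_prob
    return 1 + log2(prob_of_actual_outcome)

print(round(lps(0.70, True), 2))   # 0.49: the team rated a 70% chance won
print(round(lps(0.70, False), 2))  # -0.74: the team rated a 70% chance lost
```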
As I write this, the leading Probability Predictor over in Wagers & Tips has an average LPS of about 0.156 per game. Last year's best Predictor finished the season with an average of 0.263. That leads naturally to a few questions:
- Are these scores, in some meaningful sense, 'good' or 'bad'?
- Why are they so different this year compared to last?
- How does the performance of a probability predictor vary depending on its bias and variability?
DEFINING THE SIMULATION CONDITIONS
To come up with some answers to these questions I concocted the following simulation scenario (a rough code sketch appears after the list):
- Create a true (but unknown to the predictor) value for the probability of a particular team winning
- Define a probability predictor whose predictions are Normally distributed with a mean equal to this true value plus some fixed bias, Bias, and with a fixed standard deviation, Sigma. (NB We truncate probability estimates to ensure that they all lie in the 1% to 99% range.)
- Generate a result for the game using the probability from Step 1
- Assess the LPS for the prediction
- Repeat 1 million times, calculating the average LPS
- Repeat for different values of Bias and Sigma
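Here's a rough Python sketch of that procedure, assuming the truncation is implemented as simple clamping to the 1% to 99% range and using fewer replicates than the 1 million used for the charts below; the function and parameter names are mine, not taken from the code behind this post.

```python
import random
from math import log2

def avg_lps(true_prob, bias, sigma, n_sims=100_000, seed=1):
    """Average LPS for a predictor whose probability assessments are Normally
    distributed around true_prob + bias with standard deviation sigma,
    clamped to the 1%-99% range, across n_sims simulated games."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_sims):
        # Step 2: draw the predictor's probability assessment
        assessed = rng.gauss(true_prob + bias, sigma)
        assessed = min(max(assessed, 0.01), 0.99)
        # Step 3: generate the game result from the true probability
        home_won = rng.random() < true_prob
        # Step 4: score the prediction
        total += 1 + log2(assessed if home_won else 1 - assessed)
    return total / n_sims

# Steps 5 and 6: repeat across a grid of Bias and Sigma values
for bias in (-0.10, 0.0, 0.10):
    for sigma in (0.0, 0.10, 0.20):
        print(bias, sigma, round(avg_lps(0.70, bias, sigma), 3))
```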
Before we dive into the (real-world) complexity of the situation where the true game probabilities are unknown for Step 1, let's consider the case where these true probabilities are known.
Simulating this situation for various values of the true home team victory probability, and of Bias and Sigma, gives the following chart:
Each panel in this chart relates to a different home team probability, and each curve within a panel tracks the relationship between the average LPS per game across the 1,000,000 simulations and the value of Sigma, which ranged from 0% to 20% in the simulations. Different curves in each panel relate to different values of the Bias parameter, which I allowed to vary from -10% to 10%.
As you'd expect, for any given home team probability, greater Bias and higher Sigma values lead to lower expected average LPS results. Put another way, more biased and more variable predictors fare worse in terms of their expected LPS, regardless of the pre-game probability of the home team. But what's immediately apparent from this chart is the dominance of the home team's probability in bounding the likely range of any predictor's probability score, no matter how biased or variable he or she is. Even the worst predictor in a contest where the home team has a 5% probability of victory - one with a 10% Bias and a 20% Sigma - is almost certain to score better than the best predictor where the home team has a 10% probability of victory.
Games involving mismatched opponents lend themselves to higher LPS values. Consider, for example, a game with a 90% favourite. A predictor assigning a 90% probability to this favourite can expect to record an LPS given by:
Expected LPS = 0.9 x (1+log2(0.9)) + 0.1 x (1+log2(0.1)) = 0.53
Contrast this with a game where the favourite is only a 55% chance. A predictor assigning a 55% probability to the favourite in this game can expect an LPS of:
Expected LPS = 0.55 x (1+log2(0.55)) + 0.45 x (1+log2(0.45)) = 0.01
That's an enormous difference.
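Both of those expectations are easy to check in a few lines of Python (again just a sketch, reusing the same +1-adjusted log score):

```python
from math import log2

def expected_lps(p):
    """Expected LPS for a predictor assessing the favourite at its true
    probability p, taken over the two possible game outcomes."""
    return p * (1 + log2(p)) + (1 - p) * (1 + log2(1 - p))

print(round(expected_lps(0.90), 2))  # 0.53 - heavy favourite
print(round(expected_lps(0.55), 2))  # 0.01 - near-equal favourite
```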
(Some of the obvious outlying curves and points on these charts are due to the impact of constraining predicted probabilities to lie in the 1% to 99% range. Where, for example, the true home team victory probability is 10% and the bias is -10%, these constraints will become relevant - "binding" in the optimisation sense - for many simulations and will alter the "natural" shape of the curve that might otherwise have been seen had the constraint not been imposed.)
The leftmost point on the highest curve in each panel relates to a predictor with zero bias and zero variability - in other words, a predictor whose probability assessments are always exactly equal to the true game probability. That's as good as you can be as a predictor, and the expected LPS of such a perfect predictor varies from about 0.71 for a game where the home team is a 95% favourite or a 5% underdog, down to a low of zero for a game where the home team is an equal favourite with the away.
RETURNING TO THE REAL WORLD: UNKNOWN TRUE GAME PROBABILITIES
Those theoretical results are interesting, but I want to ground this analysis in the reality that the MAFL Probability Predictors have faced. So, as a practical way to come up with true probabilities for actual historical games to use in Step 1 instead, I'm going to use the probability estimates that come from applying the Risk-Equalising approach to the actual TAB Sportsbet prices for particular seasons.
For a given simulation for a given year then, I'll be drawing with replacement from the set of Risk-Equalising probabilities calculated for that season.
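Relative to the earlier sketch, the only change is where the true probability comes from: rather than being fixed, it's drawn for each simulated game, with replacement, from the season's set of Risk-Equalising probabilities. A rough Python version, using an invented handful of probabilities rather than the actual figures, might look like this:

```python
import random
from math import log2

def avg_lps_for_season(season_probs, bias, sigma, n_sims=100_000, seed=1):
    """Average LPS when each simulated game's true probability is drawn,
    with replacement, from a season's Risk-Equalising probabilities."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_sims):
        true_prob = rng.choice(season_probs)           # Step 1, bootstrap-style
        assessed = rng.gauss(true_prob + bias, sigma)  # Step 2
        assessed = min(max(assessed, 0.01), 0.99)
        home_won = rng.random() < true_prob            # Step 3
        total += 1 + log2(assessed if home_won else 1 - assessed)
    return total / n_sims

# Illustrative only: invented probabilities standing in for a season's set
example_probs = [0.55, 0.62, 0.70, 0.75, 0.81, 0.88]
print(round(avg_lps_for_season(example_probs, bias=0.0, sigma=0.05), 3))
```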
Here are the results for each of the seasons from 2006 onwards.
Each panel relates to simulations for a specific season - 2006 at the top left, and 2013 at the bottom right - and each curve tracks the relationship between the average LPS per game across the 1,000,000 simulations and the value of Sigma, which again ranged from 0% to 20%. Different curves in the same panel relate to different values of the Bias parameter, which this time I allowed to vary from 0% to 10%.
The general location of the curves in each panel depends on the relative preponderance of games with near-equal favourites and games with raging longshots; the higher the rainbow sits, the greater the prevalence of longshot games in that season.
One thing to notice is how much the LPS for the point of perfection varies across the seasons, from lows around 0.12 for 2006 and 2007, to highs of over 0.25 in 2012 and 2013. In 2006, the average favourite was rated a 67% chance, and in 2007 was rated a 66% chance. In 2012 the average was 76%, and this year so far it's been 75%. That's why the rainbows in the chart above are higher for this year and last year than they are for 2006 and 2007.
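The "point of perfection" for a season is just the average, across that season's games, of the per-game expected LPS used earlier, evaluated at each game's true probability. Here's a small Python sketch, with invented probabilities standing in for a season's actual Risk-Equalising figures:

```python
from math import log2

def perfect_predictor_expected_lps(season_probs):
    """Season-level expected LPS for a predictor whose probability
    assessments always equal the true probability of each game."""
    per_game = [p * (1 + log2(p)) + (1 - p) * (1 + log2(1 - p)) for p in season_probs]
    return sum(per_game) / len(per_game)

# Illustrative only - not the actual Risk-Equalising probabilities for any season
print(round(perfect_predictor_expected_lps([0.60, 0.70, 0.76, 0.85, 0.90]), 3))
```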
Looking back at the chart then, the fact that the best Probability Predictor of 2012 recorded an average LPS of 0.26 is not entirely surprising as that's about the result you'd expect from a perfect predictor, which is in essence what we've assumed the TAB Sportsbet Bookmaker to be in using his prices to define the true probabilities.
The result for this year is a little more perplexing, however, since that same "perfect" predictor has an average LPS of just 0.156, which is well below the result we'd expect for such a "perfect" predictor. There are two ways to interpret this result:
- If we maintain the assumption that probability assessments derived from the TAB Bookmaker's prices are unbiased estimates of the true victory probabilities in every game, then it must be the case that, by chance, the outcomes of games have not been accurate reflections of these true probabilities. We've had, if you like, an unusually long run of heads from a known-to-be-fair coin.
- The TAB Bookmaker has actually been a poor judge of true team probabilities this year.
There's no definitive, empirical way that I can think of to choose between these possibilities. All we can say is that favourites have won less often than their implicit probabilities suggested they should have, which is what has dragged the average LPS below its expected level. Whether that's down to miscalibration or bad luck on the part of the Bookmaker is impossible to tell. Certainly, however, if the phenomenon persists over the remainder of the season, the case for miscalibration would strengthen.