Assessing Probability Forecasts: Beyond Brier and Log Probability Scores
Einstein once said that "No problem can be solved from the same level of consciousness that created it". In a similar spirit - but with, regrettably and demonstrably, a mere fraction of the intellect - I find that there's something deeply satisfying about discovering that an approach to a problem you've been using can be characterised as a special case of a more general approach.
So it is with the methods I've been using to assess the quality of probability forecasts for binary outcomes (i.e. a team's winning or losing), where I have, at different times, called upon the services of the Brier Score and the Log Probability Score.
Both of these scores turn out to be special cases within the Beta Family of Proper Scoring Rules (nicely described and discussed in this paper by Merkle and Steyvers), whose members are defined in terms of loss functions as follows:
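As I read that paper, for a forecast x of the probability that some binary event occurs, and for parameters α ≥ 0 and β ≥ 0, the losses are:

$$ L(x \mid \text{event occurs}) = \int_x^1 c^{\,\alpha-1}(1-c)^{\beta}\,dc $$
$$ L(x \mid \text{event does not occur}) = \int_0^x c^{\,\alpha}(1-c)^{\beta-1}\,dc $$

Setting α = β = 1 recovers the Brier Score (up to a constant multiple), while letting α = β = 0 gives the Log Probability Score.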
To perform these assessments I've used the R scoring package, which is maintained by Ed Merkle, one of the authors of the paper I alluded to earlier.
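To make the special cases concrete, here's a minimal sketch in base R - with made-up numbers, and not the scoring package's own implementation, whose calcscore() function computes beta-family scores directly under its own parameterisation - that evaluates the loss functions above by numerical integration and checks that the Brier and Log Probability Scores drop out at the parameter values just mentioned.

# Loss for a single forecast x of a binary outcome d (1 = event occurred),
# evaluated by numerical integration for beta-family parameters alpha, beta
beta_family_loss <- function(x, d, alpha, beta) {
  if (d == 1) {
    integrate(function(c) c^(alpha - 1) * (1 - c)^beta, lower = x, upper = 1)$value
  } else {
    integrate(function(c) c^alpha * (1 - c)^(beta - 1), lower = 0, upper = x)$value
  }
}

# Purely illustrative forecasts and results
forecasts <- c(0.26, 0.55, 0.83)
outcomes  <- c(0, 1, 1)

# alpha = beta = 1 reproduces the Brier Score (up to a factor of one half) ...
brier_like <- mapply(beta_family_loss, forecasts, outcomes,
                     MoreArgs = list(alpha = 1, beta = 1))
all.equal(brier_like, (forecasts - outcomes)^2 / 2, tolerance = 1e-6)

# ... and alpha = beta = 0 reproduces the Log Probability Score
log_like <- mapply(beta_family_loss, forecasts, outcomes,
                   MoreArgs = list(alpha = 0, beta = 0))
all.equal(log_like, -log(ifelse(outcomes == 1, forecasts, 1 - forecasts)),
          tolerance = 1e-6)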
INTERPRETATION AND PRACTICAL IMPLICATIONS
Each row of this table relates to games in which a particular probability forecaster has assigned a home team probability within the given range. The first row, for example, provides the results for all games where the home team was assigned a victory probability of less than 40%. For games where this was true of the Overround Equalising probabilities, home teams actually won at a 28% rate. This compares with an average probability assigned to these teams of 26%, making the average absolute difference between the assessed probability and actual winning rate - the average calibration error - equal to 2.7%.
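For anyone wanting to build a table of this kind themselves, here's a hedged sketch of one way to do it in R. The vectors home_prob and home_win, and the bin edges, are hypothetical, and the calibration error computed here is simply the absolute gap between a bin's average probability and its actual winning rate, which may differ from the exact averaging used for the table above.

# Build a calibration table from hypothetical vectors of home team
# probabilities (home_prob) and results (home_win: 1 if the home team won)
calibration_table <- function(home_prob, home_win,
                              breaks = c(0, 0.4, 0.6, 0.8, 1)) {
  bin <- cut(home_prob, breaks = breaks, include.lowest = TRUE)
  do.call(rbind, lapply(levels(bin), function(b) {
    in_bin <- bin == b
    data.frame(
      Range            = b,
      Games            = sum(in_bin),
      MeanProbability  = mean(home_prob[in_bin]),  # average assessed probability
      ActualWinRate    = mean(home_win[in_bin]),   # how often those teams won
      CalibrationError = abs(mean(home_prob[in_bin]) - mean(home_win[in_bin]))
    )
  }))
}

# Simulated example: forecasts that are well calibrated by construction
set.seed(1)
home_prob <- runif(500)
home_win  <- rbinom(500, 1, home_prob)
calibration_table(home_prob, home_win)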
The LPSO probabilities have the smallest calibration error for games in this under 40% probability range, but they also have the smallest calibration error - much smaller, in fact - for games where the home team probability is assessed as being 80% or higher.
CONCLUSION
It's conceivable that, in a wagering context, the impact of forecasting errors might be asymmetric with respect to the outcome - that errors made in assessing one outcome cost more than errors of the same size made in assessing the other. The MatterOfStats funds themselves are good examples, as they wager only on home teams and only within a given price range.
More generally, I can envisage other forecasting situations where the Brier and Log Probability Scores won't necessarily be the best indicators of a probability forecaster's practical value because of the differential implications of false positives and false negatives. In these situations, a different member of the Beta Family of Proper Scoring Rules might be a more appropriate choice.
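To make that a little more tangible, here's a hedged sketch, re-using the beta_family_loss() function from earlier and with parameter values chosen purely for illustration, of how moving away from alpha = beta changes where a rule is most demanding. The Brier and Log rules penalise an 85% forecast that misses and a 15% forecast that comes home identically; a rule with alpha larger than beta comes down harder on the miss at the favourite's end of the scale, which is the end that matters most if, say, you only ever wager on home teams at relatively short prices.

# Beta-family loss for a forecast x of a binary outcome d (as defined earlier)
beta_family_loss <- function(x, d, alpha, beta) {
  if (d == 1) integrate(function(c) c^(alpha - 1) * (1 - c)^beta, x, 1)$value
  else        integrate(function(c) c^alpha * (1 - c)^(beta - 1), 0, x)$value
}

# The same sized "surprise" at opposite ends of the probability scale:
# a team assessed at 85% that loses, and a team assessed at 15% that wins
rules <- list(Brier = c(1, 1), Log = c(0, 0), Skewed = c(4, 1))
sapply(rules, function(p) c(
  FavouriteMisses = beta_family_loss(0.85, d = 0, alpha = p[1], beta = p[2]),
  OutsiderWins    = beta_family_loss(0.15, d = 1, alpha = p[1], beta = p[2])
))

The two symmetric rules return identical numbers for both rows, while the skewed rule penalises the losing favourite getting on for twice as heavily as the winning outsider, reflecting where its Beta weight places most of its mass.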