Do Bookies Undervalue Team Performance Metrics?
In 2003 Michael Lewis' Moneyball was published, in which he related the story of Billy Beane, Oakland A's General Manager, and his discovery that the market for baseball players mispriced particular skills. Some skills that could be shown, statistically, as being associated with greater team success weren't recognised as valuable (for example, getting on base, as measured by On-Base Percentage), while other skills were over-valued because of an historical belief that they were related to success (for example, batting in runs, as measured by RBI).
Beane exploited this discovery to overcome the limitations of a small budget and assembled teams that consistently outperformed expectations.
I read Moneyball a few years ago and have always felt that there was some of the spirit of Billy Beane in the MAFL approach.
So, when long-time Friend of MAFL, Andrew, told me he'd discovered a treasure-trove of AFL player and team statistics, harvestable from the official AFL website, my interest was immediately piqued. He's kindly forwarded to me the data that he's grabbed so far, which has formed the basis of this blog and which I plan to further explore in future posts.
In today's blog I'll be looking only at team statistics, in particular to see if there's any evidence that the AFL wagering market fails to account for their predictive value.
THE DATA
As noted, for this blog I'll be using only the team level metrics, which are:
- Goals
- Behinds
- Goal Accuracy
- Kicks
- Inside 50s
- Marks Inside 50
- Disposal Efficiency
- Dream Team Points
- Rebound 50s
- Uncontested Possessions
- Contested Possessions
- Frees For
- Frees Against
- Handballs
- Bounces
- Tackles
- Marks
- Contested Marks
- Hitouts
- One Percenters
- Clangers
- Centre Clearances
- Stoppage Clearances
These are available from the AFL site for all seasons since 2001, but since I'm planning to analyse them in the context of bookmaker prices and I only have reliable prices since 2006, I'll start my analysis from then.
As well as performing the original harvest, Andrew has processed the data, calculating each team's year-to-date (YTD) per game average statistics as at the end of every game in every season. It's these YTD metrics that I'll be using for all the analysis. More specifically, I'll be using as regressors for each game the difference between the home and the away team's YTD metrics for all games prior to the game being considered.
So, for example, for a Round 5 clash between Carlton and Collingwood, one regressor would be Carlton's average goals per game over the first 4 rounds of the season minus Collingwood's average goals per game over those same first 4 rounds.
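As a rough illustration of how one such regressor is built, here is a minimal Python sketch. The function names and the goal tallies are made up for the example, and it assumes each team has played exactly one game in each earlier round; it is not Andrew's actual processing code.

```python
def ytd_average(per_game_values, game_number):
    """Average of a team's per-game values over all games before game_number (1-indexed)."""
    prior_games = per_game_values[: game_number - 1]
    if not prior_games:
        return None  # no earlier games this season, so no YTD figure exists
    return sum(prior_games) / len(prior_games)


def ytd_difference(home_values, away_values, game_number):
    """Home team's YTD average minus the away team's YTD average."""
    home_avg = ytd_average(home_values, game_number)
    away_avg = ytd_average(away_values, game_number)
    if home_avg is None or away_avg is None:
        return None
    return home_avg - away_avg


# Hypothetical goals kicked in Rounds 1 to 4 (illustrative numbers only)
carlton_goals = [14, 10, 16, 12]
collingwood_goals = [11, 15, 9, 13]

# Regressor for the Round 5 clash: 13.0 - 12.0
goals_diff = ytd_difference(carlton_goals, collingwood_goals, 5)
print(goals_diff)
```

The same calculation would be repeated for each of the 23 metrics to give the full set of YTD difference regressors for a game.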
The distributions of these YTD difference metrics for every game since 2006 are summarised in the following table:
THE ANALYSIS
The outcome I'll be modelling is the game result from the home team's perspective and the full set of regressors will be, initially, the 23 YTD difference metrics, the TAB Bookmaker's Implicit Home Team probability (assuming a risk-equalising overround approach), each team's MARS Rating and Venue Experience, and the Interstate Status of the clash.
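This post doesn't spell out the risk-equalising calculation. One reading, assumed here, is that the bookmaker's total overround is split equally in absolute terms between the two teams, so half of it is subtracted from each team's inverse price. The function name and the prices below are hypothetical, and the sketch should be treated as an illustration of that assumption rather than as the exact method used.

```python
def risk_equalising_probs(home_price, away_price):
    """Implied probabilities, assuming the overround is split equally (in absolute terms)."""
    raw_home = 1 / home_price
    raw_away = 1 / away_price
    overround = raw_home + raw_away - 1  # amount by which the raw probabilities exceed 1
    return raw_home - overround / 2, raw_away - overround / 2


# Hypothetical head-to-head prices
home_prob, away_prob = risk_equalising_probs(1.80, 2.00)
print(round(home_prob, 4), round(away_prob, 4))
```

Under this assumption the two implied probabilities always sum to 1, which is the property we need from a regressor that's meant to be read as a probability.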
Twenty-nine regressors feels a tad excessive, so let's try to cull a few via feature selection techniques.
Selecting All Relevant Features
First let's wheel in the Boruta package in R, which uses importance scores and random forests to identify all relevant features (ie variables) in a regressor set - that is, all variables that appear to do better than chance in predicting the outcome variable.
Boruta's a slow package (at least on my laptop), but it's thorough. After agonising especially long and hard over the relevance of the difference in YTD Centre Clearances, which it eventually classified as important, and Opponent (ie Away team) Venue Experience, which it eventually couldn't decide about, it came back with the following lists of "confirmed important" (ie relevant) and "confirmed unimportant" (ie irrelevant) variables:
Confirmed Important
- Bookie_Home_Prob_RE
- Own_MARS_Rating
- Opponent_MARS_Rating
- Interstate_Status
- YTD_goals_Diff
- YTD_behinds_Diff
- YTD_kicks_Diff
- YTD_inside50s_Diff
- YTD_marksInside50_Diff
- YTD_disposalEfficiency_Diff
- YTD_dreamTeamPoints_Diff
- YTD_rebound50s_Diff
- YTD_contestedPossessions_Diff
- YTD_tackles_Diff
- YTD_centreClearances_Diff
Confirmed Unimportant
- YTD_freesAgainst_Diff
- YTD_uncontestedPossessions_Diff
- YTD_goalAccuracy_Diff
- YTD_handballs_Diff
- YTD_marks_Diff
- YTD_bounces_Diff
- YTD_contestedMarks_Diff
- YTD_hitouts_Diff
- YTD_onePercenters_Diff
- YTD_clangers_Diff
- YTD_freesFor_Diff
- YTD_stoppageClearances_Diff
- Own_Venue_Experience
What Boruta is telling us is that each of the "Confirmed Important" variables performs better than all of a number of randomly created variables in predicting game results, and that each of the "Confirmed Unimportant" variables is outperformed by at least one of those randomly created variables in the same task.
As I mentioned, Boruta remains undecided about the Opponent_Venue_Experience variable. I could give the algorithm more time to come to a definitive conclusion by increasing the maxRuns parameter, but I used a value of 1,000 to get these results and that took almost an hour to finish, so I'll content myself with Boruta's Scot-like "not proven" verdict for the Opponent_Venue_Experience variable for now.
So, Boruta leaves us with 15 variables that perform better than chance and another that sometimes does and sometimes doesn't. While it's helpful to know that we can safely ignore the other 13 variables - and interesting to note what some of those variables are - being left with 16 variables isn't as parsimonious as I'd like. Maybe I've set the bar too low by allowing entry to any variable that can show that it performs better than a set of random variables.
Minimally Relevant
Where Boruta tends to give variables the benefit of the doubt, classifying them as relevant if they have anything at all to say about the target variable - including, I should add, if the relationship with the target variable is non-linear - other feature selection techniques adopt an opposite approach, constraining the inclusion of variables using information-theoretic measures such as AIC and BIC.
One such approach is available in R via the glmulti package. This package can be used to find the "best" set of regressors via an exhaustive search of all possible subsets, including subsets with interaction terms, but with 29 candidate regressors it's not feasible to do this in a practical time frame and we need to constrain and guide the search process. This we do by setting method = "g", which instructs the package to use a genetic algorithm to find the optimal set of regressors (rather than attempt an exhaustive search), and by setting level = 1, which instructs the package to review only main effects, not interactions. As the criterion by which to determine whether one model is better than another I used the default AIC measure, and the family parameter was set to "binomial", meaning that we're looking at binary logits as the functional form linking the regressors to the target variable.
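For reference, the standard definition of AIC, written out for a binary logit with k estimated parameters, is sketched below. The notation is mine, not glmulti's: y_i is 1 for a home team win and 0 otherwise, and p-hat_i is the fitted home team probability for game i.

```latex
\mathrm{AIC} = 2k - 2\ln\hat{L},
\qquad
\ln\hat{L} = \sum_{i=1}^{n}\Bigl[\, y_i \ln\hat{p}_i + (1 - y_i)\ln\bigl(1 - \hat{p}_i\bigr) \Bigr]
```

Lower AIC is better, so an extra regressor earns its place only if it lifts the log-likelihood by more than 1. As for the infeasibility of an exhaustive search: even restricted to main effects, 29 candidate regressors allow 2^29, or roughly 537 million, distinct subsets.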
This package comes back with the following set of variables that it deems to be important in predicting game results:
- Bookie_Home_Prob_RE
- YTD_behinds_Diff
- YTD_tackles_Diff
- YTD_inside50s_Diff
- YTD_disposalEfficiency_Diff
- YTD_freesAgainst_Diff
- YTD_centreClearances_Diff
- Own_MARS_Rating
- Opponent_MARS_Rating
- Interstate_Status
The underlying binary logit is the following:
The signs on the coefficients for most of the variables are as we'd expect, though the positive coefficient on Frees Against and the negative coefficient on Centre Clearances seem a bit odd. One interpretation would be that these particular metrics tend to be over-emphasised by the TAB Bookmaker - in other words that he's too harsh in pricing home teams that give away free kicks and too generous in pricing home teams that win a lot of centre clearances - and that we need to adjust for this in order to produce well-calibrated estimates of home team victory probabilities.
However, there's another reason to put a question mark over the inclusion of the Frees Against variable: Boruta classified it as "Confirmed Unimportant".
Once we've settled on a set of regressors we can estimate their relative contribution to explaining variability in the target variable by employing variable importance techniques. To make such an assessment in R, we have relaimpo for linear models and hier.part for general multivariate decomposition, the latter of which is appropriate here for our binary logit.
Running hier.part on the optimal binary logit model from glmulti, setting the goodness of fit measure to log likelihood (ie gof = "logLik"), yields the following assessment of variable importance:
- Bookie_Home_Prob_RE - 39%
- YTD_inside50s_Diff - 17%
- Opponent_MARS_Rating - 15%
- Own_MARS_Rating - 12%
- YTD_behinds_Diff - 8%
- YTD_disposalEfficiency_Diff - 5%
- Interstate_Status - 3%
- YTD_tackles_Diff - 2%
- YTD_freesAgainst_Diff - 0%
- YTD_centreClearances_Diff - 0%
So, for example, the TAB Bookmaker's home team probability assessment explains almost 40% of the total explained variance of game results.
The two variables whose signs were most troubling in the model are assessed as having minimal to no importance. Accordingly, let's drop them from our final preferred model, which becomes:
When we re-run hier.part on this smaller model, the variable importances are much as before, with the Bookmaker-based probability estimate explaining a little under 40% of the variance in home team results, the two MARS variables explaining just under 30%, differences in Inside 50s contributing about 15% more, differences in Behinds about 8%, differences in Disposal Efficiency about 6%, differences in Tackles about 4%, and Interstate Status about 2%.
One measure of the efficacy of this model relative to a model with the Bookmaker probability as the sole regressor is the average Brier Score of its probability predictions relative to that of the Bookmaker-only model.
We can calculate bias-adjusted, leave-one-out cross-validated values for the Brier Score metric via the cv.glm function in the boot package, which gives us 0.1916 for the larger model compared to 0.1925 for the model using only Bookmaker probability as a regressor. (Recall that lower Brier Scores are better.)
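For anyone unfamiliar with the Brier Score, it's just the average squared difference between the predicted probability and the actual result (1 for a home win, 0 otherwise). The following minimal Python sketch shows the raw calculation only, not the cross-validation or bias adjustment that cv.glm performs, and the probabilities and results in it are invented for illustration.

```python
def brier_score(probabilities, outcomes):
    """Mean squared difference between predicted probabilities and 0/1 outcomes."""
    if len(probabilities) != len(outcomes):
        raise ValueError("probabilities and outcomes must be the same length")
    squared_errors = [(p - y) ** 2 for p, y in zip(probabilities, outcomes)]
    return sum(squared_errors) / len(squared_errors)


# Hypothetical home team probabilities and results (1 = home team won)
predicted = [0.70, 0.60, 0.55, 0.80]
results = [1, 0, 1, 1]

model_score = brier_score(predicted, results)

# Benchmark: a forecaster who always says 50% scores exactly 0.25
coin_flip_score = brier_score([0.5] * len(results), results)

print(round(model_score, 4), coin_flip_score)
```

The 0.25 benchmark gives some context for the scores above: both models sit well below it, and the gap between them is small relative to the gap between either model and a coin flip.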
So, there's a small but non-trivial improvement in the fitted probability predictions when we include the additional variables.
INTERPRETATION AND CONCLUSION
It seems then that the TAB Sportsbet prices might not fully incorporate the information content of some team metrics, namely behinds, tackles, inside 50s and disposal efficiency. These metrics are potentially the equivalent of Billy Beane's On-Base Percentage measure in that they're statistically important predictors of the relevant outcome but not fully encapsulated in the market's professional opinion.
The coefficients in the final binary logit on each of these metrics are all positive, suggesting that the act to which they relate is undervalued in the TAB Sportsbet prices for the home team. With the behinds metric in particular it's not hard to make a case for why this might be so. A team that registers a behind has done everything necessary to generate a scoring shot but has simply failed to convert that opportunity into a goal on this occasion. The nature of AFL scoring punishes this failure fairly dramatically, providing only one-sixth of the reward that would have accrued had, for example, the final kick been straighter. So, past behind-production could well portend future goal harvests. This might not be fully understood by the wagering market, which instead focusses on the scoreboard outcomes.
You could construct a similar argument for why inside 50s might be undervalued since these also represent an opportunity that might not always be fully realised. I can't construct similarly plausible arguments for the under-appreciation of tackling and disposal efficiency, other than to make the obvious point that they might be aspects of the game where superior performance can't be so readily brought to mind. For example, if I asked you to rank the Swans, Cats, Roos and Fremantle in terms of disposal efficiency, could you even hazard a guess?
In any case, these four metrics seem to carry some predictive weight. The question is, practically, how much?
Consider the YTD_behinds_Diff variable. Across the sample it has a mean of about 0 and a standard deviation of about 2.7 behinds. So, imagine a game where:
- the Bookmaker's assessment is that the teams are equal-favourites
- the venue is neutral
- both teams are Rated 1,000 on MARS
- all difference metrics are equal to 0 apart from YTD_behinds_Diff, which we assume is 2.7 behinds higher for the home team
Using the fitted model, the assessed victory probability for the home team in such a game is 57%, which is 7 percentage points higher than the Bookmaker's assessment (and about 4 percentage points higher than the probability we'd estimate for the home team in that same game were we to set all of the difference variables to 0 - recall that we've previously assessed the TAB Bookmaker's prices for home teams as being slightly generous).
We can perform similar calculations for the other difference variables and find, coincidentally, that setting each of them in turn to plus 1 standard deviation yields a home team victory probability of about 57% in each case.
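The arithmetic behind the behinds example can be sketched on the log-odds scale. The Python below works backwards from the rounded figures quoted above (roughly 53% with all difference variables at 0, and 57% with a 2.7-behind advantage) to the per-behind coefficient those figures imply. The result is a back-of-envelope value derived from rounded percentages, not the fitted coefficient itself.

```python
import math


def logit(p):
    """Convert a probability to log-odds."""
    return math.log(p / (1 - p))


def inv_logit(x):
    """Convert log-odds back to a probability."""
    return 1 / (1 + math.exp(-x))


baseline_prob = 0.53  # approximate: all difference variables set to 0
shifted_prob = 0.57   # approximate: behinds difference at +1 standard deviation
behinds_sd = 2.7      # standard deviation of the YTD behinds difference

# Change in log-odds per extra behind of YTD difference implied by those figures
implied_coefficient = (logit(shifted_prob) - logit(baseline_prob)) / behinds_sd

# Sanity check: applying the coefficient to the baseline recovers the shifted probability
recovered_prob = inv_logit(logit(baseline_prob) + implied_coefficient * behinds_sd)

print(round(implied_coefficient, 3), round(recovered_prob, 2))
```

On these rounded inputs each extra behind of YTD advantage adds about 0.06 to the home team's log-odds, which near the 50% mark is worth about 1.5 percentage points of victory probability.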
Again these are non-trivial differences. In a future blog I'll estimate what they might mean in a wagering context.