Team Scores - Statistical Distribution and Dependence
In the most recent post on the Simulations blog I assumed that Home Team and Away Team scores were independently and Normally distributed (about their conditional means). I'll investigate both these assumptions in this blog.
For this purpose I've fitted models to game data spanning the period from the 1st round of 2006 to the end of Round 12 in 2012. The target variables are Home team and Away team scores, and the only explanatory variables are transformations of TAB bookmaker prices, specifically:
- For the "log formulation", ln(Away Team Price/Home Team Price)
- For the "probability formulation", Away Team Price/(Home Team Price + Away Team Price)
In short, I'm fitting models of the form:
- Home Team score = f(ln(Away Team Price/Home Team Price))
- Away Team score = g(ln(Away Team Price/Home Team Price))
- Home Team score = s(Away Team Price/(Home Team Price + Away Team Price))
- Away Team score = t(Away Team Price/(Home Team Price + Away Team Price))
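For anyone wanting to reproduce the two explanatory variables, here is a minimal sketch in Python (the analysis itself was done in R, so this is an illustration rather than the code behind the results). The function names are my own invention. Note that the probability formulation equals the Home team's implied win probability if the bookmaker's overround is assumed to be spread equally across both teams.

```python
import math


def log_formulation(home_price: float, away_price: float) -> float:
    """ln(Away Team Price / Home Team Price); positive when the Home team is favourite."""
    return math.log(away_price / home_price)


def probability_formulation(home_price: float, away_price: float) -> float:
    """Away Team Price / (Home Team Price + Away Team Price)."""
    return away_price / (home_price + away_price)


# A few illustrative head-to-head prices (Home, Away)
for home, away in [(1.25, 4.00), (1.90, 1.90), (6.00, 1.13)]:
    print(
        home,
        away,
        round(log_formulation(home, away), 3),
        round(probability_formulation(home, away), 3),
    )
```

Equal prices give a log formulation of 0 and a probability formulation of 0.5, which is a handy sanity check.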
To fit the models I used the VGAM package in R, which provides a convenient way of fitting a huge variety of functional forms. The performance of every fitted model was assessed by calculating its pseudo-R squared - that is the squared correlation between its predictions and the actual values - and its average absolute prediction error. For each distribution type I also took the predictions made for the Home team and the Away team scores in each game based on the models built assuming that distribution type, and then calculated the average absolute prediction error for the Home team victory margin. So, for example, if the models built assuming that Home and Away team scores were Normally distributed produced a predicted Home Team score of 94 points for a particular game and an Away team score of 87 points for that same game, and if the actual score was 102 to 100 then the absolute prediction error for that game would be abs((94-87) - (102-100)), which is 5 points.
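The three performance measures just described are simple enough to sketch in a few lines of Python. This is an illustration only, using function names I've made up, and it assumes the actual and predicted scores are held in plain lists.

```python
def pseudo_r_squared(actual, predicted):
    """Squared Pearson correlation between predictions and actual values."""
    n = len(actual)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov ** 2 / (var_a * var_p)


def average_absolute_error(actual, predicted):
    """Mean of |actual - predicted| across games."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def margin_absolute_error(pred_home, pred_away, actual_home, actual_away):
    """Absolute error in the predicted Home team victory margin for one game."""
    return abs((pred_home - pred_away) - (actual_home - actual_away))


# The worked example from the text: predicted 94 v 87, actual 102 v 100
print(margin_absolute_error(94, 87, 102, 100))  # 5
```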
Here are the results:
There are a few comments that I'd make about this table:
- The "log formulation" of relative prices is universally superior to the "probability formulation" as a way of expressing the Home team's relative chances. No matter which distribution you select, its performance is better when you use the log rather than the probability formulation.
- Every one of the distributions shown in the table provides a reasonable fit to the data. For example, consider the pseudo R-squareds for the Home team score models using the log formulation of relative prices. The difference between the best (Log) and the worst (Normal, et al) models is only 0.131% in absolute terms, or about a 0.7% increase in the proportion of explained variance in Home team scores.
- The average absolute prediction errors (APE) for Home team score models are universally larger than those for Away team scores. This could be seen as contradicting an earlier blog I wrote claiming that it was harder to predict the Away team score than the Home team score, but note that I found there too that the in-sample APE for the Home team models was larger than for the Away team models. It was only when I came to predicting for a holdout sample that predicting Away team scores proved to be more difficult.
- Using the best-performed models we can predict Home team scores with about a 19.8 point average absolute error, Away team scores with about a 19.4 point average absolute error, and the victory margin with about a 29.2 point average absolute error.
So, the Normal distribution is not the one that best fits Home team or Away team scores as a function of TAB prices, but it performs well enough.
It produces the following formulae:
- Home Team Score = 94.2286 + 11.4984 * ln(Away Team Price/Home Team Price)
- Away Team Score = 91.69645 - 11.69356 * ln(Away Team Price/Home Team Price)
One implication of these formulae is that, for games with equal favourites on the TAB, the Home team is expected to score about 2.5 points more than the Away team. This once again demonstrates the slight edge that has accrued to punters backing Home teams on the TAB across the period from 2006.
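As a quick check on that arithmetic, the two formulae can be coded directly. The sketch below is in Python, with a function name of my own choosing; the coefficients are the ones quoted above.

```python
import math


def predicted_scores(home_price: float, away_price: float):
    """Expected (Home, Away) scores from the fitted Normal models quoted above."""
    x = math.log(away_price / home_price)
    home = 94.2286 + 11.4984 * x
    away = 91.69645 - 11.69356 * x
    return home, away


# Equal favourites: the log term is zero, so only the intercepts matter
home, away = predicted_scores(1.90, 1.90)
print(round(home - away, 2))  # 2.53

# A short-priced Home team favourite
home, away = predicted_scores(1.25, 4.00)
print(round(home, 1), round(away, 1))
```

With equal prices the difference is just the gap between the two intercepts, which is where the 2.5 points comes from.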
Testing Independence
The VGAM package allows models with bivariate responses to be fitted for only some functional forms - one of them, fortunately, being the bivariate Normal. If we accept that the Normal distribution fits the Home team and the Away team scores reasonably well, then we can fit a bivariate Normal to the Home and Away team scores simultaneously to investigate the assumption of independence.
Perhaps not surprisingly, the fitted model includes a negative and statistically significant value for the correlation in the errors of the fitted Home team and Away team scores. The correlation is -0.21 meaning that, for example, when the Home team scores 10 points more than its pre-game bookmaker price would have suggested, the Away team will tend to score about 2 points fewer. Conversely, when the Home team scores 10 points fewer than expected, the Away team scores about 2 points more.
(UPDATE 08 AUG 2013: Knowing what I do now about the parameter estimates provided by VGAM I realise that I misinterpreted the "rhobit" estimate it provides by taking it as the estimate of the residual correlation between the Home and Away team scores; it's actually an estimate of a transform of that parameter, specifically of ln[(1+rho)/(1-rho)]. Unwinding that transformation gives an estimated correlation, rho, of about -0.105 instead of -0.21. Using this lower value reduces the differences in absolute game margins described below. For the typical game the mean absolute margin is now 30.5 points per game in the dependent case and 28.9 in the independent case for a difference of 1.6 points per game; for the 107.6 v 78.1 case the figures are 38.4 and 37.5 points for a difference of 0.9 points; for the 75 v 111.2 case the figures are 42.7 and 41.7 points for a difference of 1.0 points; and for the 94.2 v 91.7 case the figures are 29.6 and 28.2 points for a difference of 1.4 points.)
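The unwinding of the rhobit transform, and the "10 points more for one team means about X points fewer for the other" arithmetic, can be sketched as follows. This is illustrative Python with function names of my own; the second function uses the standard conditional-mean result for a bivariate Normal, and the standard deviations of 25.1 and 24.7 points are the error standard deviations quoted later in this post.

```python
import math


def rho_from_rhobit(rhobit: float) -> float:
    """Invert rhobit = ln[(1 + rho) / (1 - rho)] to recover rho."""
    return (math.exp(rhobit) - 1.0) / (math.exp(rhobit) + 1.0)


def expected_away_shift(rho, sd_home, sd_away, home_surprise):
    """Expected Away team error given the Home team's error (bivariate Normal errors)."""
    return rho * (sd_away / sd_home) * home_surprise


rho = rho_from_rhobit(-0.21)
print(round(rho, 3))  # -0.105

# Expected Away team shift when the Home team beats expectations by 10 points
print(round(expected_away_shift(-0.21, 25.1, 24.7, 10.0), 2))  # rhobit misread as rho
print(round(expected_away_shift(rho, 25.1, 24.7, 10.0), 2))    # corrected rho
```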
In any given contest, therefore, the two teams' scores are not independent: when one team does better than the bookmaker expects, the other tends to do worse.
Impact on Expected Absolute Margin
On finding this result my immediate intuition was that the dependence between the team scores would tend to increase the expected absolute game victory margin relative to what it would be if the team scores were independent.
This is an easy intuition to test.
To do this we need values for the mean score and the standard deviation of the error for each team score to go along with the correlation of the errors, which we've already estimated as -0.21.
For the Home team, the average score is 97.1 points per game and the standard deviation of the errors from the fitted model is 25.1 points. For the Away team the equivalent values are 88.8 points and 24.7 points.
Based on 100,000 simulations we find that:
- Assuming independence between the Home and Away team scores, the average Home team margin is +8.3 points per game and the average absolute Home team margin is 28.9 points per game
- Recognising the dependence between the Home and Away team scores, the average Home team margin is again +8.3 points per game, but the average absolute Home team margin is 31.6 points per game, or about 2.7 more points per game.
In conclusion then, for a "typical" game, which is one where the Home team is expected to score about 97.1 points and the Away team 88.8 points (which implies prices of about $1.70/$2.15), the existence of a dependency between team scores means that the expected absolute victory margin in the game is about 2.7 points more than you'd estimate if you assumed that team scores were independent. That is the direction we should expect: the variance of the margin is Var(Home) + Var(Away) - 2 x Cov(Home, Away), so a negative correlation between the errors widens the spread of margins.
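For readers who'd like to run this kind of simulation themselves, here is a minimal sketch in Python using only the standard library. It is not the code I used, and the function name, the seed and the way the correlated errors are constructed (mixing two independent standard Normal draws) are all assumptions of the sketch.

```python
import math
import random


def simulate_margins(mean_home, mean_away, sd_home, sd_away, rho, n=100_000, seed=1):
    """Return (average margin, average absolute margin) for bivariate Normal scores."""
    rng = random.Random(seed)
    total_margin = 0.0
    total_abs_margin = 0.0
    for _ in range(n):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rng.gauss(0.0, 1.0)
        home = mean_home + sd_home * z1
        # Away error has correlation rho with the Home error
        away = mean_away + sd_away * (rho * z1 + math.sqrt(1.0 - rho * rho) * z2)
        margin = home - away
        total_margin += margin
        total_abs_margin += abs(margin)
    return total_margin / n, total_abs_margin / n


independent = simulate_margins(97.1, 88.8, 25.1, 24.7, rho=0.0)
dependent = simulate_margins(97.1, 88.8, 25.1, 24.7, rho=-0.21)
print("independent:", [round(v, 1) for v in independent])
print("dependent:  ", [round(v, 1) for v in dependent])
```

Because the results come from random draws, repeated runs with different seeds will differ by a tenth of a point or so.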
Simulating games with other Home and Away team prices and hence other expected Home and Away team scores shows that the difference attributable to the dependence in scores ranges between about 1.5 and 3 points. For example:
- For games with Home team/Away team prices of $1.25/$4.00 the expected scores are 107.6 points and 78.1 points and the expected absolute victory margin is 37.4 points assuming independence and 39.5 points assuming dependence, a difference of 2.1 points
- For games with Home team/Away team prices of $6.00/$1.13 the expected scores are 75.0 points and 111.2 points and the expected absolute victory margin is 41.8 points assuming independence and 43.5 points assuming dependence, a difference of 1.7 points
- For games with Home team/Away team prices of $1.90/$1.90 the expected scores are 94.2 points and 91.7 points and the expected absolute victory margin is 28.1 points assuming independence and 31.0 points assuming dependence, a difference of 2.9 points
(Note that, for the purposes of these simulations, when I've changed the mean values for the Home and Away team scores I've assumed that the variance of the errors remains unchanged.)
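Simulation isn't strictly necessary here. If the two scores are jointly Normal then the margin is itself Normal, and the mean of its absolute value has a closed form (the mean of a folded Normal distribution). The Python sketch below, with a function name of my own, can be used to cross-check the simulated figures under the same assumption of unchanged error standard deviations (25.1 and 24.7 points).

```python
import math


def expected_abs_margin(mean_home, mean_away, rho, sd_home=25.1, sd_away=24.7):
    """Mean of |Home - Away| when the scores are bivariate Normal with correlation rho."""
    mu = mean_home - mean_away
    # Variance of a difference: Var(H) + Var(A) - 2 * Cov(H, A)
    sigma = math.sqrt(sd_home ** 2 + sd_away ** 2 - 2.0 * rho * sd_home * sd_away)
    return (
        sigma * math.sqrt(2.0 / math.pi) * math.exp(-mu ** 2 / (2.0 * sigma ** 2))
        + mu * math.erf(mu / (sigma * math.sqrt(2.0)))
    )


# Expected scores for the three price pairs discussed above
for mean_home, mean_away in [(107.6, 78.1), (75.0, 111.2), (94.2, 91.7)]:
    indep = expected_abs_margin(mean_home, mean_away, rho=0.0)
    dep = expected_abs_margin(mean_home, mean_away, rho=-0.21)
    print(mean_home, mean_away, round(indep, 1), round(dep, 1))
```

Small differences from the simulated figures are to be expected, since those carry sampling noise and were based on unrounded expected scores.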