Matter of Stats

Predicting the Lead at Every Change Part II: Quantile Regression Splines

In the previous blog I fitted four separate quantile regressions to game margins at the end of each quarter using the TAB Bookmaker probability as the sole regressor.

At the end of that blog I noted:

"The functional form that I've assumed for the regression model in this blog is linear in the home team pre-game probability and therefore assumes that every 1% increase in the home team's pre-game probability estimate has a fixed (in points terms) effect on the home team margin at any percentile."

That might be an overly restrictive assumption, so for this blog, as promised, I'll be relaxing it and, instead of fitting:

Home Team Lead at End of Quarter X = a + b x Home Team's Implicit Probability

will incorporate splines (using the bs function from R's splines package) to fit:

Home Team Lead at End of Quarter X = bs(Home Team's Implicit Probability,5)

By default, bs fits splines of degree 3 (ie cubics) without an intercept. The parameter value of 5 that I've shown above is the degrees of freedom (df) for the set of fitted splines and means that, here, bs will choose 2 knots (ie df - degree = 5 - 3) at which to smoothly join 3 cubics.

In less mathematical parlance, in my base regression equation I'm replacing the straight line that I used in the previous blog to relate home team probability to game margin with a piecewise curve comprising 3 cubics smoothly joined.
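To make the counting of knots, cubics and degrees of freedom concrete, here is a minimal sketch in Python (not the R code used for this blog) that builds a cubic B-spline basis with the Cox-de Boor recursion. The knot positions, the boundary knots of 0 and 1, and all function names are illustrative assumptions; bs itself would place the interior knots at quantiles of the observed probabilities and use the range of the data as boundary knots.

```python
DEGREE = 3  # cubic pieces, the bs default
DF = 5      # degrees of freedom requested in bs(x, 5)

def bspline(x, knots, degree, i):
    """Value at x of the i-th B-spline of the given degree (Cox-de Boor recursion)."""
    if degree == 0:
        return 1.0 if knots[i] <= x < knots[i + 1] else 0.0
    value = 0.0
    span_left = knots[i + degree] - knots[i]
    if span_left > 0:
        value += (x - knots[i]) / span_left * bspline(x, knots, degree - 1, i)
    span_right = knots[i + degree + 1] - knots[i + 1]
    if span_right > 0:
        value += (knots[i + degree + 1] - x) / span_right * bspline(x, knots, degree - 1, i + 1)
    return value

def basis_row(x, interior, lo=0.0, hi=1.0, degree=DEGREE):
    """All B-spline basis values at x for boundary knots lo and hi (valid for lo <= x < hi)."""
    knots = [lo] * (degree + 1) + sorted(interior) + [hi] * (degree + 1)
    n_basis = len(knots) - degree - 1
    return [bspline(x, knots, degree, i) for i in range(n_basis)]

interior_knots = [0.35, 0.65]               # hypothetical knot positions
full_row = basis_row(0.50, interior_knots)  # basis values for a 50% home team
design_row = full_row[1:]                   # drop one column: no intercept

print(len(interior_knots), "knots ->", len(interior_knots) + 1, "cubic pieces")
print(len(design_row), "regression columns")
```

With 2 interior knots the full cubic basis has 6 columns; dropping one because there is no intercept leaves the 5 columns that the df of 5 refers to.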

All of which should become clearer, I hope, with some examples.

RESULTS

Firstly I should point out that the charts I've produced for this blog differ from those I produced in the previous blog.

In that earlier blog each curve related to a home team with a specified pre-game probability and mapped out the cumulative distribution function for that team's game margin at the end of the specified quarter. For the current blog each curve instead relates to a particular quantile - I've charted only the 10%, 25%, 50%, 75% and 90% quantiles - and traces the value of that quantile as at the end of the specified quarter as we vary the pre-game home team probability. In other words, it traces out the piecewise splined regression function for a given quantile.

More simply expressed, where before in the charts I was holding home team probability constant and allowing the quantile to vary across each curve, now I'm holding the quantile constant and allowing the home team probability to vary along each curve.
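Each curve is a fitted quantile, and quantile regression finds such a fit by minimising the "check" (or pinball) loss rather than squared error. The Python sketch below shows only that loss at work on a single set of margins, with no regressor; the margins are hypothetical numbers chosen for illustration and the function names are not from any particular package.

```python
def pinball_loss(margins, q, tau):
    """Average check (pinball) loss of candidate value q at quantile level tau."""
    total = 0.0
    for y in margins:
        u = y - q
        # Under-predictions cost tau per point, over-predictions cost (1 - tau)
        total += tau * u if u >= 0 else (tau - 1.0) * u
    return total / len(margins)

def best_quantile(margins, tau):
    """Observed margin with the smallest pinball loss at level tau."""
    return min(sorted(set(margins)), key=lambda q: pinball_loss(margins, q, tau))

# Hypothetical end-of-quarter margins, for illustration only
margins = [-30, -12, -5, 0, 3, 8, 15, 22, 41]

for tau in (0.10, 0.50, 0.90):
    print(tau, best_quantile(margins, tau))
```

At tau = 0.5 the loss is proportional to the mean absolute error, so the minimiser is the sample median. The regressions in this blog minimise the same loss, but over the spline coefficients rather than over a single number.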

For the end of Quarter 1 the chart is as follows: 

The green line traces the median margin (ie the 50th percentile) and shows, for example, that a home team with about a 40% pre-game probability has a median Quarter Time margin of around zero. In other words, a home team priced at about $2.40, assuming a 5% overround, would be expected to lead at Quarter Time about half the time. A home team enjoying equal favouritism would, instead, be expected to record a median Quarter Time margin of about +2 points.
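The conversion from the $2.40 price to a probability of about 40% can be sketched as follows, in Python. It assumes the 5% overround is levied equally on both teams, so that the implied probability is simply 1 / (price x 1.05); the function name is illustrative.

```python
def implied_probability(price, overround=0.05):
    """Implied probability from a head-to-head price, assuming the overround
    is spread equally across both teams."""
    return 1.0 / (price * (1.0 + overround))

p = implied_probability(2.40)
print(round(p, 3))  # 0.397
```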

Note that, had I produced this same chart using the regression equation I employed in the previous blog, all of the lines on this chart would have been straight. The extent to which they diverge from linearity is a reflection of the efficacy of adopting the spline approach - though we should acknowledge that this flexibility comes with an elevated risk of overfitting the data.

It's interesting to review the tail behaviour of these curves, most notably that tracing the median margin. Consider, for example, a home team with a 90% pre-game probability, which has according to the green line a 50% chance of leading by about 13.7 points or more at Quarter Time. That team's mirror-image, one's complement underdog, a home team with a 10% pre-game probability, has a 50% chance of trailing by about 12.5 points or more. Whilst we need to be a little cautious about interpreting the tail behaviour of piecewise cubics, this asymmetric result, if real, is intriguing and worth investigating in the results for the other quarters.

Next, the curves for the Half Time game margin from the home team's perspective.

Focussing again on the green line tracing the median margin we find that, by Half Time, a home team with a 50% pre-game probability is expected to lead about 50% of the time. A heavy-favourite home team with a 90% pre-game probability shows a median Half Time lead of about 28 points, while the 10% underdog shows a median Half Time trail of about 22 points.

This asymmetry in the median margins of home teams now extends to a larger portion of home team probabilities (so we've less reason to suspect it might solely be an artefact of our use of cubic splines). For example, 70% favourites have a median Half Time margin of about +10 points while 30% underdogs have a median Half Time margin of about -6 points. Home team underdogs, it seems, stay in touch with their opponents more so than do away team underdogs.

Now the model for home team margins at Three-Quarter Time.

The 10% quantile aside, these Three-Quarter Time curves have remarkably similar shapes to those for Half Time.

What's more, the curve for the median again passes through a margin of 0 for a pre-game home team probability of 50%, and the asymmetry of the median curve about this 50% point remains. Now, for example, a 70% favourite has a median margin of about +14 points and a 30% underdog a median margin of about -9 points.
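The asymmetry can be summarised by adding each favourite's median margin to that of its mirror-image underdog: a symmetric curve would give zero. The short Python sketch below uses only the approximate medians quoted above; the variable and function names are illustrative.

```python
# (favourite %, underdog %) -> (favourite median, underdog median), read from the text
medians = {
    "Quarter Time": {(90, 10): (13.7, -12.5)},
    "Half Time": {(90, 10): (28.0, -22.0), (70, 30): (10.0, -6.0)},
    "Three-Quarter Time": {(70, 30): (14.0, -9.0)},
}

def asymmetry(fav_median, dog_median):
    """Points by which the favourite's median lead exceeds the underdog's median deficit."""
    return round(fav_median + dog_median, 1)

gaps = {}
for change, pairs in medians.items():
    for (fav, dog), (fav_median, dog_median) in pairs.items():
        gaps[(change, fav, dog)] = asymmetry(fav_median, dog_median)
        print(change, f"{fav}% v {dog}%:", gaps[(change, fav, dog)])
```

Every gap is positive, which is the pattern described above: home favourites lead by more at the median than equally priced home underdogs trail.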

We can also use these curves, just as we could have used those for Quarter Time and Half Time, to construct 80% confidence intervals for the Three-Quarter Time margins of home teams of differing abilities by taking the range between the purple and red lines (ie the 90th and 10th percentiles). For a 10% home team underdog that interval is (-78,+1), for a 25% home team underdog it's (-56,+27), for a 50% home team equal favourite it's (-41,+39), for a 75% home team favourite it's (-16,+59), and for a 90% home team favourite it's (+6,+80).
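As a small worked example, the Python sketch below takes the Three-Quarter Time intervals just quoted and computes the width of each, that is, the gap between the 90th and 10th percentile curves at that probability. The names are illustrative.

```python
# Home team probability (%) -> (10th percentile, 90th percentile) at Three-Quarter Time
intervals_3qt = {
    10: (-78, 1),
    25: (-56, 27),
    50: (-41, 39),
    75: (-16, 59),
    90: (6, 80),
}

def width(interval):
    """Width in points of an interval given as (lower, upper)."""
    lower, upper = interval
    return upper - lower

widths = {prob: width(iv) for prob, iv in intervals_3qt.items()}
for prob, w in widths.items():
    print(f"{prob}% home team: {w} points wide")
```

On these figures the intervals are between 74 and 83 points wide, and are a little narrower for the strong favourites than for the underdogs.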

Lastly, we turn to the curves for the Full Time home team margin.

The median quantile for a home team equal favourite dips very slightly below a zero margin (it's at -0.6 points) suggesting that, if we use the median margin as our measure, there's very little bias in the TAB Bookmaker's pre-game prices. The approximately 2 point bias that we've detected before in this Bookmaker's assessments is a bias in the mean game margin, not the median, and results, it appears, from an asymmetry in home team performances. 

For example, the 10th percentile for the final margin for home team equal favourites is -44 points whereas the 90th percentile is +48 points. The distribution of home team margins is skewed slightly to the right, which pulls the mean above the median.
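A rough check on the direction and size of the skew, using only the three Full Time quantiles quoted for equal favourites (10th percentile -44, median -0.6, 90th percentile +48), is a quantile-based skewness coefficient of the kind sometimes called Kelly's coefficient. The Python sketch below is illustrative only and the function name is an assumption; positive values mean the upper tail is longer than the lower one.

```python
def quantile_skewness(q10, q50, q90):
    """Quantile-based skewness: compares the distance from the median to the 90th
    percentile with the distance from the median to the 10th percentile.
    Ranges from -1 to +1; zero means the two tails are equally long."""
    return ((q90 - q50) - (q50 - q10)) / (q90 - q10)

skew = quantile_skewness(q10=-44.0, q50=-0.6, q90=48.0)
print(round(skew, 3))  # 0.057
```

The upper tail spans 48.6 points and the lower tail 43.4 points, so the coefficient is small but positive.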

Similar 80% confidence intervals for other home team probabilities are as follows:

  • For a 10% home team underdog (-105,-2)
  • For a 25% home team underdog (-68,+28)
  • For a 40% home team underdog (-50,+42)
  • For a 60% home team favourite (-38,+55)
  • For a 75% home team favourite (-20,+71)
  • For a 90% home team favourite (+10,+110)

Note how here too, the confidence intervals for an X% home team favourite are not mirror-images of those for a (100-X)% home team underdog - the wins tend to be larger and the losses smaller.
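The mirror-image comparison can be made explicit with the Full Time intervals listed above. In the Python sketch below, each underdog's interval is reflected about zero and then subtracted from the matching favourite's interval; the names are illustrative.

```python
# Home team probability (%) -> (10th percentile, 90th percentile) at Full Time
full_time = {
    10: (-105, -2),
    25: (-68, 28),
    40: (-50, 42),
    60: (-38, 55),
    75: (-20, 71),
    90: (10, 110),
}

def mirror(interval):
    """Reflect an interval about zero: (lower, upper) becomes (-upper, -lower)."""
    lower, upper = interval
    return (-upper, -lower)

shifts = {}
for fav in (60, 75, 90):
    mirrored_lower, mirrored_upper = mirror(full_time[100 - fav])
    lower, upper = full_time[fav]
    # How far the favourite's interval sits above the mirrored underdog's interval
    shifts[fav] = (lower - mirrored_lower, upper - mirrored_upper)
    print(f"{fav}% favourite v {100 - fav}% underdog:", shifts[fav])
```

Both ends of every favourite's interval sit between 3 and 8 points above the reflected underdog interval.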

EMPIRICAL FIT OF FULL TIME MEDIAN MARGIN CURVE

Given the skewed nature of the median margin curve, we'll clearly not be able to model it accurately with a Normal distribution. An empirical fit to the 19 home team probability values using Nutonian's Eureqa application yields a wide variety of candidate solutions, one of the best of which, of moderate complexity, is the following:

[The fitted equation is not reproduced here; see the original post.]

This equation yields fitted margins with a maximum absolute error of just 1.6 points across the 19 observations and with an average absolute error of just 0.9 points.

SUMMARY AND CONCLUSION

On balance, I think it's justified to replace the assumption of linearity in the relationship between home team probability and game margin at a given change with the assumption that the relationship is better represented by a polynomial spline function.

Given that, I'd be inclined to prefer the fitted models from this blog to those from the previous blog, though the differences are not large, especially for home teams with probabilities in the 30% to 70% range.

The most compelling result, for me, from these blogs is the apparent skewness in home team margins, which complicates any attempt to map a bookmaker probability to an estimated game margin, and which makes it important to explicitly consider whether it's the mean or the median margin - or, indeed, the entire margin distribution - that we care about.