In-Running Models: Their Uses, Construction and Efficacy for AFL
This week, there's been a lot of Twitter-talk about the use of in-running probability models, inspired in part no doubt by the Patriots' come-from-behind victory in the Super Bowl after some models had estimated their in-running probability as atom-close to zero.
In response, TheArc footy wrote a piece earlier this week in defence of the practice of using and creating such models, which I'd recommend you read and on which I'd like to build, touching on some similar areas, but also delving a little more into the technical side (which, I hope by now, you've come to expect from here).
WHY BOTHER?
In-running probability models are designed for one, single purpose: to provide an estimate of the likelihood of some final outcome during the course of a sporting contest. That's it. If you've ever watched a game involving your team and wondered about how safe their 20-point lead was, or how likely it was that they'd be able to run down a 13-point deficit, then, at some level, you've naturally craved an in-running model.
You might merely have thought it "unlikely" that your team would surrender their lead, or "virtually impossible" that they'd make up the gap - which is fine if you're happy with qualitative assessments, but less so if you want to say how "unlikely" and whether it's more or less unlikely than the time they previously surrendered a similar lead (which floods all-too-readily into consciousness).
And that's what in-running probability models do - provide a numerical answer to a perfectly legitimate question: what's the best estimate of my team's chances now?
Can these models be used for gambling? Sure. Are they? Of course. But just because they can be used for one purpose doesn't mean they can't be used for others. In fact, the entire field of probability was founded on the need to estimate the in-running chances of an interrupted dice game, and this field does seem to have turned out to be useful for a couple of other things. Writing football-based blogs, for one ...
So, in short, creating in-running probability models seems like a sensible thing to do even if you have quite reasonably-held qualms about the social effects of gambling.
One other implicit criticism of in-running models stems, I think, from a belief that they don't actually provide what they promise anyway. If a team rated a near-zero chance winds up winning, how could the model possibly have been correct?
Whilst it is never possible - or, I'd add, sensible - to argue the efficacy of a forecaster on the basis of a single forecast, or even those of a single game, it is possible to define an intuitive metric and measure a forecaster using that metric across a sufficiently large set of forecasts.
Exactly what that measure might be I'll come back to in a later section.
HOW?
How then might we go about building an in-running model of AFL games for ourselves?
I've written about this a number of times in the pages of the MoS website, most recently (I think) in this post from 2013, so here I'll spend only a few paragraphs describing my most-recent attempt.
The latest models once again draw on the score progression data from the remarkable afltables site, which provides the time of every scoring and quarter-end event in every game since 2008. Altogether, that spans 1,786 games. We use this data to create 200 observations for every game - the score at 0.5% intervals throughout the game, comprising 50 equally-spaced points within any given quarter. Roughly speaking, at an average of 1,800 seconds per quarter, that means we're sampling the score every 36 seconds. This discretised data is used for modelling, rather than the raw event-based scoring data, because it ensures that games with fewer scoring events carry the same weight in our modelling as games with more scoring events.
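As a rough illustration of that sampling scheme - and definitely not the code behind the actual models - something like the following sketch would resample one quarter's event-based score progression onto 50 equally-spaced points. The field names are hypothetical stand-ins rather than actual afltables fields.

```python
import numpy as np
import pandas as pd

def discretise_quarter(event_times, home_leads, quarter_length,
                       start_lead=0.0, n_points=50):
    """Resample one quarter's score progression onto n_points equally-spaced times.

    event_times : seconds into the quarter at which the score changed (ascending)
    home_leads  : the home team's lead immediately after each of those events
    start_lead  : the lead carried into the quarter (zero for the first quarter)

    All of these are illustrative stand-ins rather than actual afltables fields.
    """
    # prepend the carried-in lead at time zero so the lead is defined even
    # before the quarter's first scoring event
    times = np.concatenate(([0.0], np.asarray(event_times, dtype=float)))
    leads = np.concatenate(([start_lead], np.asarray(home_leads, dtype=float)))

    # 50 equally-spaced sample times per quarter gives the 0.5%-of-game spacing
    sample_times = np.linspace(0.0, quarter_length, n_points, endpoint=False)

    # for each sample time, take the lead after the last event at or before it
    idx = np.searchsorted(times, sample_times, side="right") - 1
    return pd.DataFrame({"seconds": sample_times, "home_lead": leads[idx]})
```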
Accurate pre-game estimates of the teams' chances are exceptionally important because, in essence, an in-running model can be thought of as a continuous updating of this pre-game Bayesian prior. To provide these estimates we'll use either the TAB bookmaker's pre-game head-to-head prices (converted to a probability using the overround-equalising method) or the pre-game probability estimates derived from the MoSHBODS Team Rating System. We'll build two models, one TAB-based, the other MoSHBODS-based.
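For anyone unfamiliar with the overround-equalising conversion, it amounts to assuming the bookmaker has levied the same proportional overround on both teams, so the implied probabilities are just the inverse prices rescaled to sum to 1. A minimal sketch, with made-up example prices:

```python
def overround_equalising_prob(home_price, away_price):
    """Convert head-to-head prices to a home win probability, assuming the
    overround is levied in equal proportion on both teams, which makes the
    implied probabilities the inverse prices rescaled to sum to 1."""
    inv_home, inv_away = 1.0 / home_price, 1.0 / away_price
    return inv_home / (inv_home + inv_away)

# made-up example: home at $1.60 and away at $2.45 implies roughly a 60% home chance
print(round(overround_equalising_prob(1.60, 2.45), 3))  # 0.605
```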
Unlike the previous in-running models I've built, for which I've used binary logit or probit formulations, today's models will employ an algorithm called quantile regression, which I used for a similar purpose way back in 2014. What's appealing about the quantile regression approach is that, rather than giving me just a simple point-estimate of a team's victory probability, it yields a distribution for the expected final margin (from which such a victory probability point-forecast can be derived, as discussed later).
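If you wanted to tinker with the general approach yourself, statsmodels' QuantReg is one readily available implementation. Here's a heavily simplified sketch - not the actual model - fitting one regression per quantile of the final home margin, using a deliberately cut-down feature set with hypothetical column names:

```python
import statsmodels.api as sm

QUANTILES = [0.05, 0.25, 0.50, 0.75, 0.95]

def fit_margin_quantiles(df, quantiles=QUANTILES):
    """Fit one quantile regression per requested quantile of the final home margin.

    `df` is assumed to hold one row per sampled game state, with (illustrative)
    columns: final_margin, home_lead, game_fraction, pregame_home_prob.
    """
    # a deliberately simple design matrix; the real models use richer,
    # (1 - Game Fraction)-weighted features as described below
    X = sm.add_constant(df[["home_lead", "game_fraction", "pregame_home_prob"]])
    y = df["final_margin"]
    return {q: sm.QuantReg(y, X).fit(q=q) for q in quantiles}
```

Each fitted quantile then provides a projected final margin for any game state, which is where the percentile bands in the charts further down come from.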
Now the efficacy of an in-running probability model is highly dependent on an appropriate definition of regressors - or "features", as machine learning adherents seem to prefer to call them. We would expect, for example, that the range of feasible final margins would contract as the game progresses, but that characteristic won't necessarily emerge from our model without some thoughtful feature design.
Which is why we end up with a quite odd-looking functional form for the in-running model, the details of which I'll spare you today - partly because, if I'm honest, I'm just not up to wrestling with MS Word's Equation editor right now - but which includes a number of terms with (1 - Game Fraction) as a multiplier or divisor. This ensures that particular components have a diminishing effect as the fraction of the game completed approaches 1, and provides the narrowing in the projected final margin estimates that we desire.
Variables other than Game Fraction that appear in the models are:
- Home Lead - the current lead from the home team's perspective
- Projected Home Score, Projected Away Score and Projected Total - a combination of the actual scores to date and the pre-game expectations of scores according to MoSHBODS
- Pre-Game Home Team Probability estimate - either the TAB-based or the MoSHBODS-based estimates
- Home Team and Away Team Scoring Shot Run - the number of consecutive scoring shots recorded by either the Home or the Away team.
The full set of features, transformed and combined in various ways, is used to explain variability in the final game margin from the Home team's perspective (a rough sketch of this kind of construction follows below).
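To make the (1 - Game Fraction) idea a little more concrete, here's a minimal, purely illustrative sketch of feature construction in that spirit. The column names are hypothetical, and the actual models use a richer and differently specified set of terms than this:

```python
import pandas as pd

def add_in_running_features(df):
    """Add illustrative in-running features to a DataFrame of sampled game states.

    Assumes (hypothetical) columns: home_lead, game_fraction (0 to 1),
    pregame_home_prob, proj_home_score, proj_away_score.
    """
    out = df.copy()
    remaining = 1.0 - out["game_fraction"]

    # pre-game expectations should matter less and less as the game runs down
    out["damped_pregame_prob"] = out["pregame_home_prob"] * remaining

    # the current lead, scaled up as the remaining game fraction shrinks, is a
    # crude proxy for how hard that lead will be to overturn
    out["lead_per_remaining_fraction"] = out["home_lead"] / remaining.clip(lower=0.01)

    # a simple projected final margin combining expectations and the game so far
    out["projected_margin"] = out["proj_home_score"] - out["proj_away_score"]
    return out
```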
As noted earlier, two models were fitted, one using TAB-based pre-game probability estimates, and the other using MoSHBODS-based estimates. We can measure the relative fit of the two variants using the Akaike Information Criterion, which reveals that the model using the TAB-based estimates is marginally superior.
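For reference - and glossing over the subtleties of defining a likelihood for quantile regressions - the AIC trades goodness of fit off against model complexity, with smaller values preferred. A minimal sketch:

```python
def aic(log_likelihood, n_params):
    """Akaike Information Criterion: 2k - 2*ln(L-hat). Smaller is better, so any
    extra parameters must buy enough additional fit to pay for themselves."""
    return 2 * n_params - 2 * log_likelihood
```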
AN EXAMPLE OUTPUT
The chart below summarises the fitted results for a single game, here the 2016 Grand Final.
In the top chart we have the score progression data itself, tracking the lead from the home team's perspective, here considered to be the Western Bulldogs (though that designation changed in the week leading up to the Grand Final).
Beneath that we have the in-running probability estimates of the two models, the one built using MoSHBODS in black, and the other built using the TAB bookmaker data in red. These probabilities are derived from the quantile regression outputs, which appear in the two lowest sections. Note that both models provide very similar in-running estimates throughout the game largely because MoSHBODS' (39%) and the TAB's (40%) pre-game estimates of a Dogs' win were very similar.
In the bottom section we have estimates for the projected final margin, these provided by the quantile regression that used the TAB prices for initial probability estimates. We've chosen to map the values of the 5th, 25th, 50th, 75th, and 95th percentiles across the game. This output allows us to say, for example, that at quarter time the model attached a 90% confidence level to the final margin lying somewhere in the -57 to +50 range, and a 50% confidence level to the final margin lying somewhere in the -23 to +19 range.
Armed with the model's fitted values for all percentiles, we derive an in-running probability estimate at any point in time by determining the percentile whose associated final margin value is nearest to 0. So, for example, at quarter time, that was the 49th percentile for the TAB-based model, so our in-running estimate is, accordingly, 49%. At the same point in time, the MoSHBODS-based model assigned the Dogs a 46% chance of victory.
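As a sketch of that last step, you could interpolate between the fitted percentiles and pick out the one whose projected margin sits closest to zero. The 5th, 25th, 75th and 95th percentile margins below come from the quarter-time example just described; the median value is made up purely so the example runs:

```python
import numpy as np

def percentile_nearest_zero(quantile_levels, predicted_margins):
    """Find the percentile whose predicted final home margin is nearest to zero,
    interpolating between the fitted percentiles for a finer-grained answer.

    quantile_levels   : e.g. [5, 25, 50, 75, 95]
    predicted_margins : the fitted final-margin values at those percentiles,
                        for a single point in the game
    """
    levels = np.asarray(quantile_levels, dtype=float)
    margins = np.asarray(predicted_margins, dtype=float)
    # evaluate an interpolated margin at every whole percentile, then pick
    # the percentile whose margin sits closest to zero
    fine_levels = np.arange(1, 100)
    fine_margins = np.interp(fine_levels, levels, margins)
    return int(fine_levels[np.argmin(np.abs(fine_margins))])

# quarter-time of the 2016 Grand Final example: the outer values are quoted in
# the text above, the median of +1 is invented for illustration; prints 49
print(percentile_nearest_zero([5, 25, 50, 75, 95], [-57, -23, 1, 19, 50]))
```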
To give you an idea of how in-running estimates look when MoSHBODS and the TAB have differing pre-game assessments, here's the output for an early-season 2014 game where MoSHBODS rated the Saints more than 40% points better chances than did the TAB bookmaker. The MoSHBODS-based model had the Saints as comfortable favourites all game, but it took the TAB-based model until the second half of Quarter 2 before it nudged them over 50% (and even then not for the remainder of the game).
From fairly early in the final term, however, the two models had become much more aligned.
I have created these in-running outputs for every game since 2008, copies of which can be downloaded from the Links & Archives page under the Downloads - In-Running Charts section.
Also, team-by-team in-running charts for all 22 home-and-away games of the 2016 season are available at the bottom of the Static Charts - Score Progressions page.
HOW ACCURATE?
As I mentioned earlier, I think some of the criticisms of in-running models come from people witnessing their occasional spectacular failures. To get a more balanced perspective of a probability forecaster, however, we need to look at more than just these top-of-mind examples.
The measure we'll use is called calibration, the intuition for which is that events assessed as having an X% chance of occurring should, in the long run, eventuate about X% of the time if the forecaster is any good. If you think about that measure for a moment, you'll realise that events assessed as having, say, a 99% chance of occurrence should fail to occur about 1% of the time.
If, instead, events rated as 99% chances by a probability forecaster never occurred, then that forecaster would not be well-calibrated. As much as they might be embarrassing for the forecaster at the time, occasional "mistakes" of this kind are actually a sign of his or her ability.
So let's assess how well calibrated our two in-running models are. We'll do this firstly by "binning" the models' probability estimates into 1% point buckets, and then calculating for each bucket the proportion of times that the home team went on to win. Ideally, for example, looking at all the occasions where a model made an assessment that the home team had a 25% chance of victory at some point in the game, we'd find that the home team actually went on to win about 25% of the time.
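In code, that binning calculation might look something like the following sketch, where home_prob and home_won are hypothetical column names for a model's in-running estimates and the eventual results:

```python
import numpy as np
import pandas as pd

def calibration_table(df, bin_width=0.01):
    """Bin in-running probability estimates and compare each bin's average
    estimate with the home team's actual winning rate within that bin.

    `df` is assumed to have a `home_prob` column (the model's in-running
    estimate, 0 to 1) and a `home_won` column (1 if the home team won).
    """
    bins = np.linspace(0.0, 1.0, int(round(1 / bin_width)) + 1)
    out = df.copy()
    out["bucket"] = pd.cut(out["home_prob"], bins=bins, include_lowest=True)
    return (out.groupby("bucket", observed=True)
               .agg(mean_estimate=("home_prob", "mean"),
                    actual_win_rate=("home_won", "mean"),
                    n=("home_won", "size")))
```

A well-calibrated model will show actual_win_rate tracking mean_estimate closely down the table.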
That desirable behaviour would manifest in this chart as the black (TAB-based) and red (MoSHBODS-based) lines tracking a 45 degree course from (0,0) to (100,100). We see that this is almost the case, though there is some suggestion that both models are slightly too pessimistic about home teams' chances when their estimates are in roughly the 15% to 40% range, where they seem to be about 5% points too low.
Overall though, across the full range of estimates, the calibration looks fairly good - certainly good enough to suggest that the models have practical efficacy. On average, when they estimate a team has an X% chance of winning at some point during a game, that team will go on to win about X% of the time.
You really can't ask for much more than that from an in-running model.
Now it might be the case that the models are better at certain times of the game than at others. For example, their estimates might be better in one Quarter than in another. That information would also be useful to know. To obtain it, we proceed to bin estimates and calculate winning rate in the same fashion as we just did, but look separately at forecasts made during Quarter 1, Quarter 2, Quarter 3, and Quarter 4.
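In terms of the earlier sketch, that's just the same calculation with an additional grouping variable:

```python
# building on the calibration_table sketch above, and assuming the same
# DataFrame also carries a (hypothetical) quarter column with values 1 to 4
per_quarter = {q: calibration_table(sub) for q, sub in df.groupby("quarter")}
```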
What we find is that the home team pessimism we saw in the overall chart is largely driven by estimates from Quarter 1. Knowing that, we might choose to go back and review the model, looking in particular at some of the terms involving (1 - Game Fraction), which could be running off at the wrong rate in the first Quarter when Game Fraction is low. As it stands, home team underdogs - or, more specifically, home teams whose estimated probability drifts into the 15% to 40% range during Quarter 1 according to both models - do a little better than the models forecast.
SO WHAT?
To summarise then:
- In-running probability models are designed to answer a fairly natural question arising during the course of a sporting contest, namely: how likely is a particular result?
- Whilst it's possible to use them for gambling in-running, that's by no means their only use (nor, in this country, where in-running wagering must be done over-the-phone, a particularly appealing one)
- It's not possible to meaningfully measure their efficacy on the basis of a single game. Someone who's just correctly predicted 5 coin-tosses probably hasn't demonstrated exceptional in-running two-up ability, but instead just got lucky. Measuring the ability of any probability forecaster requires a sufficiently large volume of forecasts to allow us to differentiate ability (or lack of it) from chance.
- Calibration is a useful way to quantify a forecaster's ability. It assesses the proportion of times that an event occurs after it has been estimated an X% chance of doing so. The closer that proportion is to X, for all possible values of X, the better-calibrated is the forecaster.
- Well-calibrated forecasters should be wrong - in the sense that the event they favoured does not occur - about 100-X% of the time for events they rate as X% chances. That means, for example, that even 99% chances should fail to eventuate 1% of the time.
- It's possible to build well-calibrated in-running models for AFL games, even without using any bookmaker data as input (though they are slightly better if bookmaker data is incorporated).