Building a Score-by-Score Men's AFL Simulator: Part II
In the normal course of things, it would have taken me months to create a simulator that I was happy with, but the current situation has given me larger blocks of time to devote to the problem than would otherwise have been the case, so the development process has been, as the business world loves to say, “fast-tracked”.
The new version is somewhat similar to the one I wrote about in this earlier blog, but different in a number of fundamental ways, each of which we’ll address during the remainder of this blog.
THE DATA
We’ll again be using score progression data sourced from the AFLtables site, which covers all games from 2001 to 2019, although this time we’ll use only the home and away games from the most recent 10 seasons, a sample of 1,946 games and just over 94,000 scoring events.
THE METHODOLOGY
STEP 1: PREDICT SCORING EVENTS
The first step is to create a predictive model that provides an estimate of the probability that the next event in the game will be a home team score, an away team score, or the end of a quarter.
This is different from the preceding blog in two ways:
We adopt an event-by-event view of each game, rather than a period-by-period view.
We attempt to estimate probabilities only for whether the next event will be a home team score, an away team score, or an end of quarter event.
For the purposes of modelling, we again need to summarise the game state, this time as at the end of each event, for which purpose we use inputs similar to those from the previous blog.
MODEL INPUTS
We use:
Event Time: the proportion of the game that had been completed when the event took place (ie a fraction between 0 and 1). Note that each quarter is defined to extend for exactly 25% of the game, so 1% will represent a larger span of actual time in longer quarters compared to shorter quarters in the same game.
Remaining Time in Quarter: the proportion of the quarter that remained when the event took place, expressed as a fraction of the entire game (ie a fraction between 0 and 0.25)
Score Type HAQ: whether the event that just occurred was a home score, away score, or end of quarter
Score Type Full: whether the event that just occurred was a home goal, home behind, away goal, away behind, or end of quarter
Current Lead: the home team’s current lead (after the event)
Current Leader: a categorical variable recording whether the home team or the away team leads, or whether the game is tied
Scoring Rate, Home Scoring Rate, Away Scoring Rate: the projected final total score, home team score, and away team score if we extrapolate based on the scoring rate so far (eg if 20 points have been scored and we’ve completed 10% of the match, then the Scoring Rate value will be 200)
Scoring Shot Rate, Home Scoring Shot Rate, Away Scoring Shot Rate: defined equivalently to those above but using scoring shots rather than points
Conversion Rate, Home Conversion Rate, Away Conversion Rate: the proportion of all scoring shots that have been registered as goals as at the end of the most-recent event (and the same figure calculated, separately, for the home team and for the away team)
Home Score Run, Away Score Run: the number of consecutive points scored by the home/away team without any score being recorded by the opposing team, as at the end of the most recent event. Note that the end of a quarter resets the run to zero.
Home Scoring Shot Run, Away Scoring Shot Run: defined equivalently to those above but using scoring shots rather than points
Expected Home Score, Expected Away Score, Expected Final Margin, Expected Home Scoring Shots, and Expected Away Scoring Shots: set to the actual figures from the game being analysed. This is a key difference from the previous blog, and the source of much of the improvement in the final simulations. In using bookmaker and MoSH2020 data previously, we were baking into the models any biases or inaccuracies those pre-game forecasts had. By instead using the actual values, we “teach” the models that the expected values are good guides to the real ones, which is how we’ll want them to behave when we set expected values for the simulations.
Expected Progressive Home Score, Away Score, Margin, Home Scoring Shots, and Away Scoring Shots: the expected values for these variables as at the end of the most-recent event, assuming a linear relationship across time (for example, if 30% of the game is completed then the expected lead is 30% of the final lead)
Home Score Last 10PC, Away Score Last 10PC, Home Scoring Shots Last 10PC, Away Scoring Shots Last 10PC: these record how many points or scoring shots each of the teams has registered in the most recent 10% of the game, as at the end of the most recent event. If less than 10% of the game has been played they are simply the scores or scoring shots so far.
Home Score Last 25PC, Away Score Last 25PC, Home Scoring Shots Last 25PC, Away Scoring Shots Last 25PC: these are defined as above but look at the most-recent 25% of the game, instead.
The modelling challenge here then is to predict the next event type - home score, away score, or quarter end - as a function of the game state at the end of the most recent event, as encapsulated by these variables.
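To make the game-state summary a little more concrete, here is a minimal sketch of how a few of these inputs might be derived from the raw score progression. It assumes a data frame events with hypothetical columns game_id, event_time (proportion of the game completed), team ("Home", "Away" or NA for quarter ends) and points (6, 1 or 0); the column names and dplyr approach are assumptions rather than the exact code used.

```r
library(dplyr)

events <- events %>%
  group_by(game_id) %>%
  arrange(event_time, .by_group = TRUE) %>%
  mutate(
    home_points     = cumsum(ifelse(team %in% "Home", points, 0)),
    away_points     = cumsum(ifelse(team %in% "Away", points, 0)),
    current_lead    = home_points - away_points,
    current_leader  = case_when(current_lead > 0 ~ "Home",
                                current_lead < 0 ~ "Away",
                                TRUE             ~ "Tied"),
    # Projected final total score if scoring continued at the rate so far
    scoring_rate    = (home_points + away_points) / event_time,
    # Proportion of all scoring shots so far registered as goals
    conversion_rate = cumsum(points == 6) /
                      pmax(cumsum(points %in% c(1, 6)), 1)
  ) %>%
  ungroup()
```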
MODEL SELECTION
We again use the caret package in R for the modelling task, and consider candidate algorithms suited to a multi-class categorical target variable.
Chosen Algorithms
We’ll use the following algorithms: kknn, knn, C50, avNNet, xgbLinear, rpart, ctree, treebag, xgbDART, xgbTree, and rf.
Train vs Test Samples
The available data will be randomly split into a 30% training sample and a 70% testing sample.
Performance Metric for Parameter Tuning
The metric used for tuning will be logLoss. As I mentioned in the previous blog, this seems a more logical metric than accuracy, because calibration is fairly important here.
Performance Metric for Model Selection
The metric used for model selection will again be the multi-class Brier Score, measured using only the test data. With it, we again choose the xgbTree model.
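By way of illustration, a sketch of the set-up in caret might look something like the following. The predictor set, resampling scheme and seed are assumptions rather than the exact settings used, next_event is taken to be a factor with syntactically valid level names (a requirement when classProbs = TRUE), and caret's mnLogLoss is used as the tuning summary function.

```r
library(caret)

set.seed(20200501)

# 30% training / 70% testing split, stratified by the target
train_idx  <- createDataPartition(model_data$next_event, p = 0.3, list = FALSE)
train_data <- model_data[train_idx, ]
test_data  <- model_data[-train_idx, ]

ctrl <- trainControl(method = "cv", number = 5,
                     classProbs = TRUE,
                     summaryFunction = mnLogLoss)

# Tune on logLoss (minimised), as described above
fit_xgb <- train(next_event ~ ., data = train_data,
                 method = "xgbTree",
                 metric = "logLoss",
                 maximize = FALSE,
                 trControl = ctrl)

# Multi-class Brier score on the test sample: the mean squared difference
# between the predicted class probabilities and the one-hot actual outcomes
# (columns of both matrices are in factor-level order, so they align)
probs  <- predict(fit_xgb, newdata = test_data, type = "prob")
actual <- model.matrix(~ next_event - 1, data = test_data)
brier  <- mean(rowSums((as.matrix(probs) - actual)^2))
```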
STEP 2: PREDICT CONVERSION
As noted in the previous blog, little is gained by trying to predict whether the next scoring event will be a goal or a behind, so we’ll just opt for fixed estimates of conversion rates.
For the seasons in question that gives us an estimated conversion rate of 52.8% for home teams, and 52.9% for away teams. Those are the values we’ll use for the simulations.
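Those figures are straightforward to obtain from the scoring event data. A minimal sketch, assuming the same hypothetical events data frame as earlier:

```r
library(dplyr)

events %>%
  filter(points %in% c(1, 6)) %>%             # scoring shots only
  mutate(is_home = team %in% "Home") %>%
  group_by(is_home) %>%
  summarise(conversion_rate = mean(points == 6))
```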
STEP 3: PREDICT TIME TO NEXT EVENT
Because we’ve shifted to an event-by-event view, we now need to estimate when the next event, as predicted by the models from Steps 1 and 2 combined, will occur. In other words, we might know that the next event will be, say, a home goal, but how long after the event that just took place will it arrive?
For this purpose we build separate models for each of 10 possible pairings of event-just-completed and next event, which are:
Quarter End to Goal (1.68%)
Quarter End to Behind (1.66%)
Goal to Goal for Same Team (2.15%)
Goal to Goal for Opposite Teams (2.15%)
Goal to Behind for Same Team (2.14%)
Goal to Behind for Opposite Teams (2.15%)
Behind to Goal for Same Team (1.69%)
Behind to Goal for Opposite Teams (1.80%)
Behind to Behind for Same Team (1.65%)
Behind to Behind for Opposite Teams (1.90%)
The rationale for building separate models for each pairing is that they might have quite different distributions, due to the different logistics of the ball movements associated with each of them. The mean time to next event for each pairing, expressed as a percentage of the game, is shown in brackets above, and does, indeed, reveal some variation.
I had hoped that a meaningful proportion of the additional variation in time to next event values for a given event pairing might be explained in terms of the game state variables we used in Step 1, but none of the models I created could account for more than about 1 or 2% of the remaining variability.
Next I tried assuming that next event times might be distributed as a (shifted) exponential random variable with a fixed mean, using the vglm() function in R’s VGAM package. By “shifted” I mean here that each distribution has a floor below which no value can be observed because of the physical logistics of the game. We’ll never, for example, witness two goals scored within a few seconds. The floors used, expressed as a proportion of the entire game, were:
Quarter End to Goal, Quarter End to Behind, Behind to Goal for Same Team, Behind to Behind for Same Team (0.15% of game)
Behind to Goal for Opposite Teams, Behind to Behind for Opposite Teams (0.25% of game)
Goal to Goal for Same Team (0.4% of game)
Goal to Behind for Same Team (0.55% of game)
Goal to Goal for Opposite Teams, Goal to Behind for Opposite Teams (0.65% of game)
That turns out to be ill-advised too, because the fit to an exponential distribution is relatively poor.
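For concreteness, here is a minimal sketch of the sort of fit that was attempted, assuming a data frame gaps_gg_same holding the observed Goal to Goal (Same Team) gaps in a column named gap, measured as a proportion of the game; both names are hypothetical.

```r
library(VGAM)

# Assumed floor for the Goal to Goal (Same Team) pairing: 0.4% of the game
floor_gg_same <- 0.004

# Fit a plain exponential to the gap in excess of the known floor, which is
# equivalent to a shifted exponential with the shift fixed at that floor
gaps_gg_same$excess <- gaps_gg_same$gap - floor_gg_same
fit <- vglm(excess ~ 1, family = exponential(), data = gaps_gg_same)

Coef(fit)                        # estimated rate
floor_gg_same + 1 / Coef(fit)    # implied mean time to next event
```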
What we end up using are empirical cumulative distribution functions (CDFs), one for each event pair. This means that, whenever we need a time to next event, we notionally dip into a bag that holds all of the historical time to event values for the relevant pairing, and select one to use.
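In code, that "bag" approach amounts to nothing more than sampling with replacement from the observed gaps. A minimal sketch, assuming a data frame gap_data with hypothetical columns pairing and gap (the latter again measured as a proportion of the game):

```r
# Split the observed gaps into one "bag" per event pairing
gap_bags <- split(gap_data$gap, gap_data$pairing)

# Draw a time to next event for a given pairing by sampling, with replacement,
# from that pairing's historical gaps
draw_gap <- function(pairing) {
  sample(gap_bags[[pairing]], size = 1, replace = TRUE)
}

draw_gap("Goal to Goal Same Team")   # hypothetical pairing label
```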
THE SIMULATIONS
Simulation 1
Let’s use our next event model (xgbTree), conversion model (fixed probabilities), and time to next event models (the empirical CDFs from Step 3) to simulate games between teams of varying ability.
Specifically, we’ll simulate 5,000 games where the expected scoring shot values for each replicate are chosen, at random, from the actual home and away team scoring shot data for games played across the period 2017 to 2019. In other words, each replicate will be the actual home and away team scoring shot results for one of the games in the sample.
This expected scoring shot data will be converted to expected scoring by assuming a conversion rate of 52.8% for the home teams, and 52.9% for the away teams. The expected margin will then be the expected difference between the home and away team scores.
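Converting scoring shots to points just uses the standard 6 points per goal and 1 per behind, so the expected points per scoring shot at a conversion rate c is 6c + (1 - c). As a small illustration:

```r
# Expected score implied by an expected scoring shot count and conversion rate
expected_score <- function(scoring_shots, conversion_rate) {
  scoring_shots * (6 * conversion_rate + 1 * (1 - conversion_rate))
}

expected_score(24, 0.528)   # about 87 points from 24 expected scoring shots
```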
Each simulation will proceed as follows (a simplified sketch of the loop appears after the steps):
Event 1: use the xgbTree model to provide estimates of the probability that the first event of the game will be a home scoring shot, an away scoring shot, or a quarter end. Use a random draw from a uniform distribution on the interval (0,1) and cutoffs based on these estimates to determine which of the three events has occurred. Note that the xgbTree model requires input values as at the “previous” event, which here is the start of the game. We initialise all of the input variables to “sensible” values, including setting the conversion rate variables to their expected values (ie 52.8% for the home team, and 52.9% for the away team).
So if, for example, the xgbTree model provides estimates of 52% for the first event to be a home team scoring shot, 47.9% for it to be an away team scoring shot, and 0.1% for it to be quarter end, and the first draw from the uniform distribution produces a value less than or equal to 0.52, then a home team scoring event has occurred.
If the next event is deemed to be a scoring shot, we use a random draw from a uniform distribution on the interval (0,1) and cutoffs based on the expected conversion rates for the home and away teams to determine whether that scoring shot was a goal or a behind.
Finally, we need to estimate the time between the previous event and the one just deemed to occur. For this purpose we use whichever of the 10 empirical CDFs we created in Step 3 is appropriate to provide us with a value for the time to the next event.
We then check to see if the time produced would push us past a quarter end (ie past 0.25, 0.5, 0.75 or 1) and, if so, ignore the scoring shot and record the event as a quarter end. Otherwise, we record the time of the event and update all of the regressor variables based on the event that just occurred.
Event 2: repeat the same steps as in Event 1.
…
Event n: repeat the same steps until an event is generated with an associated event time greater than 1.
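Here is the promised simplified sketch of one replicate of this loop. It assumes hypothetical helper functions predict_next_event(state) (returning the xgbTree probabilities for home score, away score and quarter end), draw_gap(pairing) and pairing_of(state, event, score) (for the empirical CDF step), and update_state(state, event, score, time) (recalculating the regressors), none of which are shown here.

```r
simulate_game <- function(initial_state,
                          home_conversion = 0.528,
                          away_conversion = 0.529) {
  state  <- initial_state
  events <- list()
  time   <- 0

  while (time < 1) {
    # 1. Which event comes next? A uniform draw against the model's cutoffs
    p <- predict_next_event(state)
    u <- runif(1)
    event <- if (u <= p["home_score"]) {
      "home_score"
    } else if (u <= p["home_score"] + p["away_score"]) {
      "away_score"
    } else {
      "quarter_end"
    }

    score      <- NA
    next_break <- ceiling(time / 0.25 + 1e-9) * 0.25   # 0.25, 0.5, 0.75 or 1

    if (event == "quarter_end") {
      new_time <- next_break
    } else {
      # 2. Goal or behind, using the fixed conversion rates
      conv  <- if (event == "home_score") home_conversion else away_conversion
      score <- if (runif(1) <= conv) "goal" else "behind"

      # 3. Time to the next event, drawn from the relevant empirical CDF
      new_time <- time + draw_gap(pairing_of(state, event, score))

      # If the drawn gap pushes us past a quarter break, ignore the scoring
      # shot and record a quarter end instead
      if (new_time > next_break) {
        event    <- "quarter_end"
        score    <- NA
        new_time <- next_break
      }
    }

    events[[length(events) + 1]] <- list(event = event, score = score,
                                         time = new_time)
    state <- update_state(state, event, score, new_time)
    time  <- new_time
  }

  events   # one entry per simulated event: its type, score (if any) and time
}
```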
The chart below is a scatter plot based on the 5,000 simulation replicates following this protocol.
There is very close to a one-to-one relationship between the pre-game expected margins and the actual margins in the simulated games. The fitted linear regression of actual on expected margins is:
E(Actual Margin) = -0.594 + 0.969 x Expected Margin
That’s a much better result than we obtained in the previous simulations.
Some other comparisons between actual and expected results are:
Home Team Scoring Shots per game: Actual 23.9, Expected 24.1
Away Team Scoring Shots per game: Actual 22.4, Expected 22.3
Home Team Conversion Rate: Actual 52.9%, Expected 52.8%
Away Team Conversion Rate: Actual 52.9%, Expected 52.9%
Home Team Score per game: Actual 87.2, Expected 87.8
Away Team Score per game: Actual 81.8, Expected 81.7
Correlation Between Home Team and Away Team Scoring Shots: Actual -0.45, Expected -0.39
There is one more significant wrinkle, however, and that is the relatively small number of games where the actual result differs from the expected result by more than 50 points. There are only 41 of them across the 5,000 replicates, or about 0.8%. Across the 10 seasons we’ve used as the basis for this modelling, final margins have differed from bookmaker expected margins by more than 50 points in about 17% of games.
In short, our means look good, but we have too little variability.
ADDING BACK VARIABILITY
We need to add additional variability into the results in a realistic way, and in a way that won’t overly affect the relationship between the expected and mean actual margins. Ultimately, we want to be able to specify a pre-game expected margin for a game and know that the resulting simulations will tend to produce final margins close to that expected value.
We can think of that expected margin as itself an average of the likely margins for a game (ie a Mean Expected Margin), averaged across all of the on-the-day causes of pre-game variability. So, for example, whilst a team might be expected to register 22 scoring shots in a game were that game played hundreds of times, on-the-day circumstances might mean that it actually enters the game with a higher or lower expected scoring shot count.
Similarly, on-the-day circumstances might mean that the home team’s likely conversion rate isn’t 52.8%, but some higher or lower figure.
To model this interpretation, we do two things:
draw on-the-day expected scoring shots for the two teams by means of a multivariate normal with mean equal to the underlying expected values (which we’ve been using, unperturbed, before) and a covariance matrix of matrix(c(36, -4.5, -4.5, 36), nrow = 2), which implies a standard deviation of 6 scoring shots for each team, and a correlation between their on-the-day scoring shots of -0.125. The negative correlation means that, on a day where one team generates more scoring shots than expected, the other team will tend to generate fewer.
draw on-the-day expected conversion rates for the two teams by means of a multivariate normal with mean equal to the underlying expected values (which we’ve been using, unperturbed, before) and a covariance matrix of matrix(c(0.01, 0.002, 0.002, 0.01), nrow = 2), which implies a standard deviation of 10 percentage points for each team, and a correlation between their on-the-day conversion rates of +0.2. The positive correlation means that, on a day where one team converts at a higher rate than expected, the other team will also tend to convert at a higher rate. On a wet day, for example, both teams’ conversion rates are likely to be depressed.
Both of the covariance matrices were determined by trial-and-error.
The values provided by these random draws were subject to the constraint that they not be negative and that they not imply an on-the-day expected margin of more than 100 points. Where any of the constraints were breached, a new draw was made.
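A minimal sketch of these two draws, using MASS::mvrnorm and treating the 100-point constraint as applying to the absolute margin (an assumption, as the direction isn't spelled out above):

```r
library(MASS)

# Perturb the base expected scoring shots and conversion rates for one replicate
draw_on_the_day <- function(ss_home, ss_away,
                            conv_home = 0.528, conv_away = 0.529) {
  repeat {
    ss   <- mvrnorm(1, mu = c(ss_home, ss_away),
                    Sigma = matrix(c(36, -4.5, -4.5, 36), nrow = 2))
    conv <- mvrnorm(1, mu = c(conv_home, conv_away),
                    Sigma = matrix(c(0.01, 0.002, 0.002, 0.01), nrow = 2))

    # Implied on-the-day expected scores and margin (6 points per goal, 1 per behind)
    exp_scores <- ss * (6 * conv + (1 - conv))
    margin     <- exp_scores[1] - exp_scores[2]

    # Accept the draw only if nothing is negative and the implied expected
    # margin is no more than 100 points; otherwise draw again
    if (all(ss > 0) && all(conv > 0) && abs(margin) <= 100) {
      return(list(scoring_shots = ss, conversion_rates = conv,
                  expected_margin = margin))
    }
  }
}
```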
Introducing variation by these two mechanisms interacts with the other models in such a way that home teams generate fractionally too many scoring shots per game, on average. We adjust for this by slightly scaling the probabilities coming out of the next event model: specifically, we multiply the probability of a home team score by 0.9895 and allocate the remaining 0.0105 of that probability to the quarter end event (ie we leave the away team score probabilities untouched).
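In code terms, the recalibration is just the following (with hypothetical element names for the probability vector):

```r
# Scale the home score probability down by 0.9895 and give the freed-up
# probability to the quarter end event, leaving the away score probability
# untouched; the three probabilities still sum to 1
adjust_probs <- function(p) {
  p_adj <- p
  p_adj["home_score"]  <- 0.9895 * p["home_score"]
  p_adj["quarter_end"] <- p["quarter_end"] + 0.0105 * p["home_score"]
  p_adj
}
```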
Running 5,000 simulations with this final model gave the following results:
E(Actual Margin) = 0.154 + 0.967 x On-the-Day Expected Margin
E(Actual Margin) = -0.161 + 0.825 x Mean Expected Margin
(This second result is interesting because you might recall we found that the relationship between actual margins and bookmaker expectations, which are like mean expected margins, had 0.9 as the coefficient on Mean Expected Margin).
Here are the relevant charts of the data:
We can think of the chart on the left as the view once we’ve adjusted, pre-game, for on-the-day effects, and the view on the right as the “traditional” view that we’d have because we were unaware of the on-the-day effects.
Some other comparisons between actual and (on-the-day) expected results are:
Home Team Scoring Shots per game: Actual 23.8, Expected 23.8
Away Team Scoring Shots per game: Actual 22.3, Expected 22.4
Home Team Conversion Rate: Actual 52.0%, Expected 52.8%
Away Team Conversion Rate: Actual 52.5%, Expected 52.9%
Home Team Score per game: Actual 85.7, Expected 86.2
Away Team Score per game: Actual 80.8, Expected 81.3
Correlation Between Home Team and Away Team Scoring Shots: Actual -0.56, Expected -0.39
We also find that 19% of games have a final margin that differs from the mean expected margin by more than 50 points.
EXAMPLE SIMULATION
To finish, and just to give you a concrete idea of what we can do with the simulation model as it stands, here’s the score progression for one of the 5,000 replicates.
(And just look at the momentum in that first game …)
CONCLUSION
I think we now have a viable simulation model that adequately mirrors the main characteristics of men’s AFL games played over the last decade.
It produces scores with a broadly realistic event-to-event cadence both in terms of the time between events and the likely sequence of them given what’s gone before in the same game.
It also provides results that are consistent with pre-game expectations and that mimic, I think to an acceptable level, the real-world variation around those expectations, though I will be investigating this further.
One thought for a future piece is to use the simulation model to generate a few thousand games and then to cluster those games in a manner similar to that used in this earlier blog, to see if we find similar clusters in the simulated data.