Characterising AFL Seasons

September 23, 2012 Tony Corke

I can think of a number of ways that an AFL season might be characterised, but for today's blog I'm going to call on a modelling approach that I used back in 2010, which is based on Brownian motion and was inspired by a JASA paper from Hal S Stern.

The Model

In brief, the approach estimates the probability that the home team will win at any point in a game on the basis of the lead that it has at that point and the time that's remaining in the game by fitting a binary logit of the form Prob(Home Team Wins) = A/(1+A)

where A = exp(C + L*Home Team Lead/sqrt(1-T) + D*sqrt(1-T)) and where T = Time Elapsed in Game (as a proportion of the total game, so 0<=T<=1).
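As a minimal sketch, the formula can be written as a small Python function. The function name and the coefficient values used in the example call are illustrative assumptions, not estimates from any of the fitted models.

```python
import math

def home_win_probability(lead, t, c, l, d):
    """Prob(Home Team Wins) = A / (1 + A), with
    A = exp(C + L * lead / sqrt(1 - T) + D * sqrt(1 - T))."""
    if not 0.0 <= t < 1.0:
        raise ValueError("t must satisfy 0 <= t < 1 (the formula divides by sqrt(1 - t))")
    root = math.sqrt(1.0 - t)
    z = c + l * lead / root + d * root
    # 1 / (1 + exp(-z)) is algebraically the same as A / (1 + A) with A = exp(z)
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients, chosen only to exercise the function
print(home_win_probability(lead=10, t=0.5, c=0.0, l=0.05, d=0.3))
```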

The modelling process estimates C, L and D once we provide it, for each game in our sample, the Home Team Lead at Quarter-time, Half-time and Three-Quarter time (T = 0.25, 0.5 and 0.75, respectively), and the final result from the Home Team's point of view (1 for a win and 0 for a loss; drawn games are excluded from the modelling). So, each game in the sample provides three data points to be fitted.
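The way each game contributes three data points can be sketched as follows. The function and field names are hypothetical; the only things taken from the description above are the three values of T, the two regressors implied by the formula, the 1/0 coding of the result, and the exclusion of draws.

```python
import math

# Quarter-time, Half-time and Three-Quarter time as proportions of the game
CHECKPOINTS = (0.25, 0.50, 0.75)

def game_to_rows(quarter_leads, final_margin):
    """Turn one game into three rows for the logit.

    quarter_leads: Home team lead at QT, HT and 3QT (negative if trailing)
    final_margin:  Home team's final margin (negative for a loss)
    """
    if final_margin == 0:
        return []  # drawn games are excluded from the modelling
    home_win = 1 if final_margin > 0 else 0
    rows = []
    for t, lead in zip(CHECKPOINTS, quarter_leads):
        root = math.sqrt(1.0 - t)
        rows.append({
            "t": t,
            "lead_term": lead / root,  # multiplied by L in the model
            "time_term": root,         # multiplied by D in the model
            "home_win": home_win,      # same outcome on all three rows
        })
    return rows

# A made-up game: Home team up 5, down 3, up 12, then wins by 20
for row in game_to_rows((5, -3, 12), 20):
    print(row)
```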

If we fit a separate model to the results of every game in a single season we can obtain estimates of C, L and D for each season and these values can then be used to characterise that season in a way that I'll describe next.
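For readers who want to experiment, here is a rough sketch of fitting such a logit to one season by maximum likelihood. Everything in it is an assumption made for illustration: the post does not say what software was used, Newton's method with step halving is simply one standard way to fit a logit, and the "season" is simulated as a random walk with made-up drift and spread rather than taken from real results.

```python
import math
import random

def features(lead, t):
    """Regressors for one data point: intercept, Lead/sqrt(1-T), sqrt(1-T)."""
    root = math.sqrt(1.0 - t)
    return [1.0, lead / root, root]

def log_likelihood(beta, xs, ys):
    """Bernoulli log-likelihood of the logit, written to avoid overflow."""
    total = 0.0
    for x, y in zip(xs, ys):
        z = sum(b * v for b, v in zip(beta, x))
        u = -z if y == 1 else z  # each term is -log(1 + exp(u))
        total -= max(u, 0.0) + math.log1p(math.exp(-abs(u)))
    return total

def solve(a, b):
    """Solve a small dense system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= f * m[col][k]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (m[i][n] - sum(m[i][k] * x[k] for k in range(i + 1, n))) / m[i][i]
    return x

def fit_logit(xs, ys, max_iter=50, tol=1e-10):
    """Newton's method with step halving; returns the estimates [C, L, D]."""
    k = len(xs[0])
    beta = [0.0] * k
    ll = log_likelihood(beta, xs, ys)
    for _ in range(max_iter):
        grad = [0.0] * k
        hess = [[0.0] * k for _ in range(k)]
        for x, y in zip(xs, ys):
            z = sum(b * v for b, v in zip(beta, x))
            p = 0.5 * (1.0 + math.tanh(0.5 * z))  # logistic function, overflow-free
            for i in range(k):
                grad[i] += (y - p) * x[i]
                for j in range(k):
                    hess[i][j] += p * (1.0 - p) * x[i] * x[j]
        step = solve(hess, grad)
        scale = 1.0
        while True:  # halve the step until the likelihood does not get worse
            trial = [b + scale * s for b, s in zip(beta, step)]
            trial_ll = log_likelihood(trial, xs, ys)
            if trial_ll >= ll or scale < 1e-8:
                break
            scale *= 0.5
        gain = trial_ll - ll
        beta, ll = trial, trial_ll
        if gain < tol:
            break
    return beta

def simulate_season(n_games, drift, sd, rng):
    """Hypothetical data: the Home margin as a random walk over four quarters."""
    xs, ys = [], []
    for _ in range(n_games):
        margin, leads = 0.0, []
        for _ in range(4):
            margin += rng.gauss(drift / 4.0, sd / 2.0)
            leads.append(margin)
        if margin == 0:
            continue  # draws are excluded
        for t, lead in zip((0.25, 0.50, 0.75), leads):
            xs.append(features(lead, t))
            ys.append(1 if margin > 0 else 0)
    return xs, ys

# drift and sd are invented values, not estimates from AFL data
xs, ys = simulate_season(n_games=200, drift=8.0, sd=36.0, rng=random.Random(2012))
c_hat, l_hat, d_hat = fit_logit(xs, ys)
print(f"C = {c_hat:.3f}, L = {l_hat:.3f}, D = {d_hat:.3f}")
```

Repeating the last three lines once per season, with that season's real quarter-by-quarter leads in place of the simulated ones, would give one (C, L, D) triple per season.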

Interpreting the Coefficients

The estimates of the coefficients in the binary logit tell us something about the typical progress and outcome of the games of a particular season from the viewpoint of the Home team.

C and D, together, provide an estimate of the Home teams' victory chances at the start of the game (ie when T=0). We obtain this base level Home team victory probability by calculating exp(C+D)/(1+exp(C+D)). When C+D is high it implies that Home teams are likely to have won a larger proportion of games in the season and when C+D is low it implies the opposite.
D (for Decay), on its own, measures how quickly the Home team's home ground advantage dissipates as the game progresses. Higher values imply a more rapid decaying of the advantage, and lower values a less rapid decay.
L (for Lead) measures how strongly a lead at any point in the game translates into eventual victory, and so how likely a lead is to be run down over the remainder of the game. Higher values of L imply that leads are less likely to be run down, and lower values that they are more likely to be run down.
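One way to check the direction of each effect is simply to evaluate the formula for a few coefficient values. The sketch below does that; all of the coefficient values are hypothetical and were picked only to make the comparisons easy to read.

```python
import math

def logistic(z):
    """exp(z) / (1 + exp(z)), written in the equivalent 1 / (1 + exp(-z)) form."""
    return 1.0 / (1.0 + math.exp(-z))

def base_home_probability(c, d):
    """Home team victory probability at T = 0 with scores level: exp(C+D)/(1+exp(C+D))."""
    return logistic(c + d)

def home_win_probability(lead, t, c, l, d):
    """Home team victory probability given its lead at time t (0 <= t < 1)."""
    root = math.sqrt(1.0 - t)
    return logistic(c + l * lead / root + d * root)

# Effect of C + D on the pre-game probability (hypothetical values)
for c_plus_d in (0.0, 0.2, 0.4):
    print(f"C+D = {c_plus_d:.1f}: base probability {base_home_probability(c_plus_d, 0.0):.3f}")

# Effect of L on a Home team leading by 10 points at half time (hypothetical values)
for l in (0.03, 0.05, 0.08):
    p = home_win_probability(lead=10, t=0.5, c=0.0, l=l, d=0.3)
    print(f"L = {l:.2f}: probability {p:.3f}")
```

Running this shows how the pre-game probability moves with C+D, and how the probability for a Home team leading by 10 points at half time moves as L changes.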

Estimates for 1897 to 2012

The table below provides estimates of C, L and D for every season since 1897. These appear as the leftmost three columns in the table immediately adjacent to the year, which is in red.

On the right of the table are Home team probability estimates produced using the coefficient estimates and estimated for 15 different game scenarios. In the first three scenarios the Home team trails by 20 points at Quarter-time (column 1), Half-time (column 2), and Three-Quarter time (column 3). The next three scenarios consider a Home team trailing by 10 points, the next three a drawn game, the next three a Home team lead of 10 points, and the final three a Home team lead of 20 points.
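The 15 scenarios can be generated mechanically from any one season's coefficients. Below is a small sketch; the function names are illustrative, and the coefficients in the example call are hypothetical rather than taken from the table.

```python
import math

LEADS = (-20, -10, 0, 10, 20)   # Home team lead in points
TIMES = (0.25, 0.50, 0.75)      # Quarter-time, Half-time, Three-Quarter time

def home_win_probability(lead, t, c, l, d):
    """Home team victory probability given its lead at time t (0 <= t < 1)."""
    root = math.sqrt(1.0 - t)
    return 1.0 / (1.0 + math.exp(-(c + l * lead / root + d * root)))

def scenario_grid(c, l, d):
    """Return (lead, t, probability) for the 15 scenarios, in the table's column order."""
    return [(lead, t, home_win_probability(lead, t, c, l, d))
            for lead in LEADS for t in TIMES]

# Hypothetical coefficients for a single season
for lead, t, p in scenario_grid(c=0.0, l=0.05, d=0.3):
    print(f"lead {lead:+3d} at T = {t:.2f}: {p:.3f}")
```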

(Errata: The columns headed 1/sqrt(Time Left) in all of the tables below should be headed sqrt(Time Left))

The sparklines at the bottom of each column provide a thumbnail view of the time series in that column. Very broadly, they suggest that leads have become less and less likely to be run down over the years: the sparklines slope generally downward for the scenarios where the Home team trails and, less prominently, generally upward for the scenarios where the Home team leads.

I've noted previously the relative infrequency with which teams that have led have been run down this season, and this observation is borne out by the value of the L coefficient for the 2012 model. At 0.076 it's the highest value recorded for this coefficient since 1965.

Demonstrating the Link Between Coefficients and Game Outcomes

In this next table I've recorded, alongside the coefficients for each season, two summary statistics that should be related to those coefficients in the manner I've described above: the Home team winning percentage should be related to the 1st and 3rd coefficients (C and D), and the percentage of games in which the ultimately winning team trailed at half time should be related to the 2nd coefficient (L).

Again, 2012 stands out, here in relation to the small proportion of games where the team that ultimately won trailed at half time. This was the case for only 14% of games in 2012, the lowest proportion of such games that we've seen for an entire season since 1985.

It's hard to confirm the link between the model coefficients and the summary statistics based solely on the data in the table, so here are a few charts that might help.

The top chart plots, for each season, the estimate of Home team probability at time 0 (given by exp(C+D)/(1+exp(C+D))) against the actual Home team win percentage for the season, and shows a relatively high correlation between the two (the R-squared is about 0.67).

Below that, the lower chart plots, for each season, the L coefficient against the actual proportion of games in which the team that ultimately won trailed at half time. This relationship is not quite as strong as the one above, but the R-squared is still an acceptable 0.43.
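For a chart with a single x variable and a single y variable, the R-squared of the fitted straight line equals the squared Pearson correlation, so it can be computed as in the sketch below. The two short lists are made-up placeholder numbers used only to show the call; they are not the season values behind either chart.

```python
def r_squared(xs, ys):
    """Squared Pearson correlation, equal to R-squared for a one-variable linear fit."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy * sxy / (sxx * syy)

# Placeholder values only: model probability at time 0 vs actual Home win rate
model_probability = [0.55, 0.60, 0.58, 0.63]
actual_win_rate = [0.54, 0.62, 0.57, 0.61]
print(round(r_squared(model_probability, actual_win_rate), 2))
```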

Together these charts confirm to a reasonably high level the link between the model coefficients and actual season summary statistics, with the nature of the links as I described them earlier.

Having established this link, another question that seems worth exploring is whether - and, if so, to what extent - games from the home-and-away season differ from Finals and then, within Finals, whether Grand Finals differ further still.

Comparing Home and Away Season Games with Finals and Grand Finals

Since there are only a handful of Finals each year and just one (decisive) Grand Final, if we're to estimate the binary logits for Finals and Grand Finals we need to aggregate results over a number of seasons.

Scanning the data and the sparklines in the earlier charts suggests that there was a change in the game around the end of the First World War. For one set of models, then, I used the data for all seasons from 1921 to 2012. To investigate the possibility that more recent seasons might have been different, I also fitted models only to the data for seasons from 1980 to 2012.

The results appear in the table below:

From this table we can make a few observations:

The characteristics of the home-and-away seasons for the period 1921 to 2012 are virtually identical to those for the period 1980 to 2012.
The precariousness of any given sized lead in a Final or Grand Final (ie the middle coefficient estimate, L) has also been remarkably similar across the two periods, and remarkably similar for Finals and for Grand Finals within and across those two periods.
What has differed between the two periods is the base winning rate of the Home team in Finals and Grand Finals which has been lower, though still high, in the period 1980 to 2012 compared to the period 1921 to 2012.

The relative dominance of the Home team (ie the team finishing higher on the ladder) in Grand Finals is underscored by the fact that, even based on the model fitted to game data since 1980, a Home team trailing by as much as 20 points at quarter time is still almost an even money proposition. It remains about a 2/1 prospect if it trails by this same amount at the main break, and a 4/1 prospect if it finds itself down by this margin at the final change.
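To translate those prices back into probabilities, the sketch below assumes they are fractional odds against (so 2/1 means two units won for each unit staked) and ignores any bookmaker margin. The function name is illustrative.

```python
from fractions import Fraction

def fractional_odds_to_probability(against, stake=1):
    """Implied probability of odds quoted as against/stake, with no bookmaker margin."""
    return Fraction(stake, against + stake)

# Even money is 1/1; the other two prices are those quoted above
for label, against in (("even money", 1), ("2/1", 2), ("4/1", 4)):
    p = fractional_odds_to_probability(against)
    print(f"{label}: {p} (about {float(p):.0%})")
```

On that reading, the three Grand Final scenarios correspond to Home team victory probabilities of roughly 1/2, 1/3 and 1/5.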