Do Umpires and Coaches Notice Different Things In Assigning Player Votes?
At the conclusion of each game in the men’s AFL home and away season, umpires and coaches are asked to vote on who they saw as the best players in the game. Umpires assign 3 votes to the player they rate as best, 2 votes to the next best, 1 vote to the third best, and (implicitly) 0 votes to every other player. It is these votes that are used to award the Brownlow Medal at the end of the season.
Similarly, the coaches of both teams are asked to independently cast 5-4-3-2-1 votes for the players they see as the five best, meaning that each player can end up with anywhere between 0 and 10 Coaches’ votes.
The question for today is: to what extent can the available player game statistics tell us whether, and how, coaches and umpires differ in arriving at their votes?
(Note that we’ll not be getting into the issue of individual umpire or coach quirks, snubs, or biases, and instead be looking at the data across all voting umpires and coaches.)
THE DATA
For this analysis we’ll be using all player statistics and voting data for seasons 2015 to 2024, as available from AFLTables and Footywire via the fitzRoy R package, with the following exclusions (a data-assembly sketch appears after the list):
any player game in which no statistics at all were recorded, or for which at least one statistic was missing
two games from 2015, and three from 2016, for which no Coaches’ votes data is available
the entirety of 2020 because of its clearly anomalous characteristics (and, frankly, because who wants to remember it?)
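As a rough illustration, the data assembly might look something like the sketch below. The fetch_player_stats() and fetch_coaches_votes() functions are genuine fitzRoy exports, but the statistic column names and the exact filtering logic are assumptions for illustration, not the actual code used.

```r
# A sketch only: fitzRoy's fetch_* functions are real, but the
# statistic column names below are illustrative placeholders
library(fitzRoy)
library(dplyr)
library(purrr)

seasons <- setdiff(2015:2024, 2020)   # drop the anomalous 2020 season

player_stats  <- map_dfr(seasons, ~ fetch_player_stats(season = .x, source = "afltables"))
coaches_votes <- map_dfr(seasons, ~ fetch_coaches_votes(season = .x))

# Placeholder vector standing in for all the statistics listed below
stat_cols <- c("kicks", "handballs", "disposals", "marks", "tackles")

player_stats <- player_stats %>%
  # drop player games with at least one missing statistic ...
  filter(if_all(all_of(stat_cols), ~ !is.na(.x))) %>%
  # ... and player games where no statistics were recorded at all
  filter(!if_all(all_of(stat_cols), ~ .x == 0))
```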
The player statistics we have are as follows:
Stoppage Clearances, Centre Clearances, and Total Clearances
Intercepts
Rebound 50s
Inside 50s
Time on Ground
Hit Outs
Disposals, Effective Disposals, Kicks, and Handballs
Marks, Marks Inside 50, and Contested Marks
Contested Possessions and Uncontested Possessions
Goals, Behinds, Score Involvements, and Goal Assists
Metres Gained
Clangers, Turnovers
Bounces
One Percenters
Tackles, Tackles Inside 50
Frees For, Frees Against
We supplement this data with one further metric - the Team Result, which is the final game margin from the point of view of the player’s team. So, a 10-point win for the player’s team would result in a Team Result of 10, and a 10-point loss in a Team Result of -10.
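Assuming home and away score and team columns like those below (again, assumed names, continuing the earlier sketch), the Team Result derivation is a one-liner:

```r
# Team Result: final margin from the player's team's point of view;
# home_score, away_score, home_team and player_team are assumed names
player_stats <- player_stats %>%
  mutate(margin      = home_score - away_score,
         team_result = if_else(player_team == home_team, margin, -margin))
```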
In the end, we have data for 70,611 player games.
THE MODEL
Those of you with some modelling experience are probably already wondering how to handle the high levels of correlation between many of the player statistics (and the perfect collinearity in some cases, such as Kicks + Handballs = Disposals), which will tend to confound the measurement of each metric’s unique contribution to any sort of prediction. That will always be a challenge, but some model types can at least give us partial estimates.
For this analysis we’ll be using one such model type, an ordinal forest, the R implementation of which provides, as one of its outputs, a variable importance measure designed to estimate the individual contribution of each metric to the final fitted values.
So, we will first fit an ordinal forest to the Brownlow votes data, with all of the player statistics as input variables, and then fit a second ordinal forest to the Coaches’ votes data, with the same inputs. We can then compare the variable importance outputs of the two models to get some idea of how similarly, or differently, the two groups weight player statistics in arriving at their votes.
A TECHNICAL INTERLUDE
For the more technically minded reader, I’ll note that I chose ‘probability’ as the ‘performance function’ for the ordinal forest algorithm, since my primary goal was to produce a model that did a reasonable job of assigning probabilities to the various target categories, and the notes for the ordinalForest package suggest this is the best choice for that purpose.
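For concreteness, a single fit looks something like the sketch below. The ordfor() function and its perffunction argument are from the ordinalForest package; the data frame and column names are illustrative.

```r
library(ordinalForest)

# The target must be a factor; here, 0 to 3 Brownlow votes
train_df$votes <- factor(train_df$votes, levels = 0:3, ordered = TRUE)

of_fit <- ordfor(
  depvar       = "votes",        # name of the target column
  data         = train_df,       # player statistics plus the target
  perffunction = "probability"   # optimise for class probability estimation
)

of_fit$varimp   # per-metric variable importance, the basis of the results
```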
Also, fitting a single model to the 70,611 data points proved intractable, so I instead fitted 10 models, each to a distinct, randomly selected one-tenth of the data, and then averaged the variable importance outputs (a sketch of this workaround appears below).
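Sketched under the same assumed names, that workaround looks like this:

```r
# Ten disjoint folds, one ordinal forest per fold, importances averaged
set.seed(42)
fold_id <- sample(rep(1:10, length.out = nrow(model_df)))

varimp_list <- lapply(1:10, function(k) {
  ordfor(depvar       = "votes",
         data         = model_df[fold_id == k, ],
         perffunction = "probability")$varimp
})

mean_varimp <- Reduce(`+`, varimp_list) / length(varimp_list)
```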
To get an idea of the quality of the model fits, I used the rps function from the verification R package, which outputs the Ranked Probability Score (RPS) and the Ranked Probability Skill Score (RPSS) of a fitted model (see here for a practical use of these metrics and a good non-technical explanation).
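In code, the assessment looks roughly like this, assuming some held-out data, and noting that predict() on an ordfor object returns a classprobs matrix when the ‘probability’ performance function has been used:

```r
library(verification)

preds <- predict(of_fit, newdata = test_df)

# rps() takes observed categories as integers 1..K and a matrix of
# predicted probabilities with one column per category
scores <- rps(obs = as.integer(test_df$votes), pred = preds$classprobs)

scores$rps    # Ranked Probability Score: near 0 is better
scores$rpss   # Ranked Probability Skill Score: near 1 is better
```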
(The ordinalForest package also purports to return RPS values in the model’s perffunctionvalues object, but those values seem to be wrong.)
We’re looking for RPS values near 0 and RPSS values near 1. The 10 models fitted to the Brownlow votes data had RPS values ranging from 0.0089 to 0.0105, and RPSS values from 0.7592 to 0.7820; the 10 models fitted to the Coaches’ votes data had RPS values ranging from 0.0126 to 0.0136, and RPSS values from 0.7686 to 0.7898.
There is no standard for what RPS values are low enough, or what RPSS values are high enough, to constitute an acceptable model (and I think the RPS values will benefit from the fact that the 0 category heavily predominates for both Brownlow and Coaches’ votes), but the values above seem sufficiently non-terrible to give the models at least a passing grade.
THE RESULTS
The table below records the average variable importance recorded for each metric across the 10 fitted models, normalised so that the most important variable has an importance of 100%.
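In terms of the averaged importances from the earlier sketch, that normalisation is simply:

```r
# Rescale so that the most important metric reads 100%
norm_varimp <- 100 * mean_varimp / max(mean_varimp)
round(sort(norm_varimp, decreasing = TRUE), 1)
```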
The first thing to note is that umpires collectively and coaches collectively each have Disposal counts as the metric most associated with their voting behaviour, and that no other metric is more than about 60% as important as Disposals in explaining voting behaviour using the ordinal forests.
The fact that Disposals is the most important metric for both voting groups makes direct comparisons of the other metrics easier, although we should note that the average importance score for coaches is about 1.2 times the average importance score for umpires.
Nonetheless, I think it’s fair to say that the relative importance of all metrics is fairly similar for umpires and coaches, with the following exceptions:
Coaches consider Effective Disposals, Contested Possessions, Score Involvements, the Team Result, Intercepts, Contested Marks, and One Percenters (and maybe Goal Assists, though its importance is tiny in absolute terms) relatively more important than do umpires
Umpires consider Tackles Inside 50 (and maybe Marks Inside 50) relatively more important than do coaches
Looking at the broad grouping of metrics into Highly Important (say, 10% or more), Somewhat Important (say, 5 to 10%), and Relatively Unimportant (under 5%), we can see that:
The Highly Important metrics are mostly about possession, gaining ground, and scoring goals
The Unimportant metrics are about mistakes, free kicks, bounces, tackles, and discretionary effort. They also include, unintuitively, behind scoring and assisting in the scoring of goals.
SUMMARY AND CONCLUSION
The modelling results seem to suggest that, on the whole, umpires and coaches notice similar things in arriving at their voting decisions, although there are some differences in the weightings in the case of a few metrics.
The only other thing I’d note is that the variable importance outputs we’ve used for the analysis are somewhat dependent on our choice of algorithm, and other algorithms might well produce very different relative importance scores.
That said, the fact that the models with Brownlow votes as the target and those with Coaches’ votes as the target produce fairly similar orderings and relative importance values provides some comfort that what is being measured is a genuine phenomenon, although the ordinal forest algorithm might tend to consistently favour some metrics over others because of the manner of its calculation.