Testing for In-Game Momentum
Momentum. It's something that you're almost guaranteed to hear mentioned at least once in any sporting commentary, and it's a topic that continues to pique my interest. Whether or not it exists remains a legitimate subject for debate, but detecting it, if it does exist, has been elusive.
I've written about various forms of it on a number of occasions on this website, summarised in this post from 2012, where I also looked for the first time at in-game momentum. There I found that the data, in total, was consistent with the conclusion that this form of momentum, measured using scoring sequence data, did not exist.
In that analysis, I used a runs test to investigate the phenomenon. Today, inspired by a recent piece on the hot hand phenomenon in basketball shooting, in which there is considerable discussion of the power of that and other tests, I want to investigate the power of the runs test in the context of detecting momentum in AFL scoring.
THE RATIONALE
Runs in scoring are, arguably, the most obvious potential indicia of momentum. If the team that just scored is consistently and meaningfully more likely to score next (adjusting for their underlying relative ability), then it seems reasonable to claim that they have some momentum on account of having scored last. Or, in statistical terms, if we see dependence in the sequence of scoring, such that a home team score will be more likely to be followed by another home team score (and, perhaps, ditto for an away team score), then we’ll feel more comfortable rejecting the null hypothesis of independence.
Extending that thought, if a game has fewer runs of scoring than would be the case if each scoring event was independent of the previous one, then we have evidence of momentum. (Actually, if a game has more runs of scoring than expected, that would, instead, constitute evidence for anti-momentum, which would stem from a decrease in a team's probability of scoring next if it scored last - perhaps because the team just scored against is spurred to greater effort.)
A common test of whether or not the number of runs of scoring seen in a game is consistent with each scoring event being independent of the previous scoring event is the Wald-Wolfowitz test, an exact version of which is available in the randomizeBE package in R.
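To make the mechanics of an exact runs test concrete, here is a minimal Python sketch. It is not the randomizeBE code, and the function names (count_runs, runs_pmf, runs_p_value) are hypothetical. It assumes the usual conditional form of the test: given the number of scoring shots by each team, every ordering of those shots is equally likely under the null, and the one-tail p-value is the share of orderings with no more runs than were observed.

```python
from math import comb


def count_runs(seq):
    """Number of runs: 1 plus the number of places where the scorer changes."""
    return (1 + sum(a != b for a, b in zip(seq, seq[1:]))) if seq else 0


def runs_pmf(r, n1, n2):
    """P(exactly r runs) when all orderings of n1 and n2 scores are equally likely."""
    if n1 < 1 or n2 < 1 or r < 2:
        return 0.0
    k, odd = divmod(r, 2)
    if odd:
        # r = 2k + 1: one team has k + 1 runs and the other has k
        ways = (comb(n1 - 1, k) * comb(n2 - 1, k - 1)
                + comb(n1 - 1, k - 1) * comb(n2 - 1, k))
    else:
        # r = 2k: each team has k runs, and either team can score first
        ways = 2 * comb(n1 - 1, k - 1) * comb(n2 - 1, k - 1)
    return ways / comb(n1 + n2, n1)


def runs_p_value(seq, home="H"):
    """One-tail p-value: probability of the observed number of runs or fewer."""
    n1 = sum(1 for s in seq if s == home)
    n2 = len(seq) - n1
    if n1 == 0 or n2 == 0:
        return 1.0  # assumption: a one-team sequence gives no evidence either way
    observed = count_runs(seq)
    return sum(runs_pmf(r, n1, n2) for r in range(2, observed + 1))
```

A sequence would be passed in as a string or list of scorer labels, for example runs_p_value("HHAHAAH"). Small p-values correspond to unusually few runs, which is the direction that matters for momentum.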
THE RESULTS
If we run that test on the scoring sequences for all of the AFL games from 2001 to 2019, extracted from the afltables site, we find that:
9.3% have (one-tail) p-values less than 10%
4.9% have p-values less than 5%
1.2% have p-values less than 1%
That seems entirely consistent with the view that momentum, if it exists, is insufficiently prevalent to move the all-game distribution of runs test results far from what we'd expect under the null hypothesis that in-game momentum doesn't exist - at least in the sense we've described it.
Nonetheless, there are some games that produce startlingly fewer runs than we'd expect under the null and, accordingly, produce very small (one-tail test) p-values, the 25 smallest of which appear in the table below.
In that Geelong v Adelaide game at the top of the list, for example, there were only 6 runs of scoring when, under the null hypothesis of a fixed probability of scoring for both teams throughout the contest, and given 16 scoring shots for the home team and 17 by the away team, we would expect to see 17 runs.
If momentum exists, then this game is an archetypal example, but if it does not, then this game is just one of those that, by chance, will generate such a small number of runs.
Not shown, but at the bottom of the list are the most anti-streaky games, including a 2003 game between Richmond and Collingwood that produced 33 runs from just 48 scoring shots (17 for Richmond, and 31 for Collingwood) against an expectation of 22.5 runs.
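The expected run counts quoted for these two games can be reproduced with a short calculation. The sketch below rests on an assumption about how they were derived: each scoring shot goes to the home team with a fixed probability p, estimated as its share of the game's scoring shots, so each adjacent pair of shots has different scorers with probability 2p(1 - p). With n shots there are n - 1 adjacent pairs, and the number of runs is one more than the number of changes. The function name expected_runs is hypothetical.

```python
def expected_runs(home_shots, away_shots):
    """Expected number of runs under a fixed, independent scoring probability.

    Assumption: p is taken to be the home team's share of scoring shots.
    """
    n = home_shots + away_shots
    p = home_shots / n
    # Each of the n - 1 adjacent pairs is a change of scorer with prob 2p(1 - p)
    return 1 + (n - 1) * 2 * p * (1 - p)


geelong_v_adelaide = expected_runs(16, 17)
richmond_v_collingwood = expected_runs(17, 31)
print(round(geelong_v_adelaide, 1), round(richmond_v_collingwood, 1))
```

Under that assumption the two values come out at roughly 17.0 and 22.5, in line with the figures above.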
THE ISSUE
So let’s investigate the power of the Wald-Wolfowitz test - that is, its ability to reject the null hypothesis when the null is false. In our situation that means its ability to detect a lack of independence in scoring for a given base probability of scoring, effect size and sample size.
Effect size here means the percentage point increase in a team's scoring probability above its constant, base rate, when it is the team to score last. So, for example, a team whose relative ability might see it expected to register 60% of the scoring shots in a game, would have a probability of 80% of scoring next should it have scored last in a simulation replicate where we were using a 20% effect size.
We consider three sample sizes - 25, 50 and 75 scoring shots - which together span the range of scoring shot counts seen during the 2001 to 2019 period. The average number of scoring shots per game across that period is 49.7.
THE RESULTS
The table below, each entry in which is based on 10,000 simulation replicates, records the proportion of replicates for which the exact runs test produced a one-tail p-value under 5% for varying effect sizes, sample sizes, and base probabilities.
For this first table we assume that the home team is momentum-prone in both directions, so that their probability of scoring next increases by the effect size whenever they are the last team to score, and decreases by the effect size when they are not. (Note that, in the footnote, we use X(t) = 1 to denote the fact that the home team registered the tth scoring shot, and X(t) = 0 to denote that the away team did.)
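A rough Python sketch of this kind of power simulation follows. It is a reconstruction under stated assumptions rather than the code behind the table: the first scoring shot of each game uses the base probability (there is no previous scorer), probabilities are clipped to the 0 to 1 range, a game in which one team never scores counts as a non-rejection, and all function names are hypothetical. Sequences use the X(t) coding, with 1 for a home team score and 0 for an away team score.

```python
import random
from math import comb


def simulate_game(n_shots, base, effect, rng):
    """Scoring sequence where the home team's chance moves with the last scorer."""
    seq = []
    for _ in range(n_shots):
        p_home = base
        if seq:  # assumption: the first shot of a game uses the base probability
            p_home += effect if seq[-1] == 1 else -effect
        p_home = min(1.0, max(0.0, p_home))
        seq.append(1 if rng.random() < p_home else 0)
    return seq


def lower_tail_p(seq):
    """Exact one-tail runs test p-value: P(observed number of runs or fewer)."""
    n1 = sum(seq)
    n2 = len(seq) - n1
    if n1 == 0 or n2 == 0:
        return 1.0  # assumption: treat a one-team game as a non-rejection
    runs = 1 + sum(a != b for a, b in zip(seq, seq[1:]))
    ways = 0
    for r in range(2, runs + 1):
        k, odd = divmod(r, 2)
        if odd:
            ways += (comb(n1 - 1, k) * comb(n2 - 1, k - 1)
                     + comb(n1 - 1, k - 1) * comb(n2 - 1, k))
        else:
            ways += 2 * comb(n1 - 1, k - 1) * comb(n2 - 1, k - 1)
    return ways / comb(n1 + n2, n1)


def estimate_power(n_shots, base, effect, reps=10_000, alpha=0.05, seed=1):
    """Share of simulated games in which the runs test rejects at level alpha."""
    rng = random.Random(seed)
    rejections = sum(
        lower_tail_p(simulate_game(n_shots, base, effect, rng)) < alpha
        for _ in range(reps)
    )
    return rejections / reps
```

Calling, say, estimate_power(25, 0.30, 0.05) would give an estimate for one combination of sample size, base probability and effect size. Estimates will vary with the seed and the number of replicates, so they should not be expected to match the table to the decimal place.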
The first row of the table, which is for an effect size of 0% and is therefore for the null hypothesis of no dependence in scoring, sees numbers in the 3 to 4% range for all sample sizes and base probabilities. Ideally, these should be 5%, but the number of runs in a game can take only a limited set of whole-number values, so an exact test on samples of this size can rarely achieve a rejection rate of precisely 5% and instead falls a little short of it. Practically, what this tells us is that, for the sample sizes we're investigating, the Wald-Wolfowitz test will reject the null hypothesis slightly less often than it should when the null hypothesis is true. That is, it will suggest that momentum exists when it doesn’t slightly less often than it should given the significance level chosen.
From the second row onwards, we start exploring the performance of the test when the null hypothesis is false. The values in these rows would, ideally, be 100%, reflecting the fact that we'd be rejecting the null in every case. The reality is a long way from the ideal.
In the second row, for example, we find that, for games with 25 scoring shots, where the base probability of the home team scoring is 30% and the effect size is 5%, we reject the null only 7% of the time. Put another way, we fail to detect dependence of this type in the scoring behaviour 93% of the time.
As the effect size increases, and as the base probability moves nearer 50%, we do a little better, although with a sample size of 25 and an effect size as large as 25%, the dependence is detected only two-thirds of the time.
A larger sample size certainly helps and, once we get to samples of 75 scoring shots, we detect large effect sizes most of the time. That said, an effect size of 10% is detected less than half the time.
Next we consider the situation where only the home team displays momentum, so that the home team’s probability of scoring next increases by the effect size when it scored last, but reverts to the base probability when the away team scored last.
Not surprisingly, since it’s active less often in the scoring sequence (ie only when the home team scored last), this form of momentum is detected at even lower rates by the Wald-Wolfowitz test.
For a game with the average number of scoring shots, even a 25% effect size will be detected less than half the time.
Assuming that momentum kicks in the moment a team scores is the most generous assumption we can make. Commentators more often, I’d suggest, claim that a team has the momentum only after it’s registered at least the last two scoring shots.
How good is the Wald-Wolfowitz test at detecting that form of momentum, again assuming that it has a positive effect after two consecutive scores by the home team, and a negative effect after two consecutive scores by its opponent?
As we’d expect, the detection rates are far lower in this scenario than the one where momentum kicks in after a single scoring event. They’re generally slightly better, however, than the scenario where momentum only works in the positive direction.
Finally, for completeness, let’s consider the situation where momentum only works in the positive direction and requires two consecutive scoring events to trigger it.
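The four momentum scenarios differ only in the rule that sets the home team's probability of registering the next scoring shot. One way to express that rule in Python is sketched below. The function name home_prob_next and its parameters two_way and lag are hypothetical, and clipping the result to the 0 to 1 range is an assumption. As elsewhere, 1 denotes a home team score and 0 an away team score.

```python
def home_prob_next(history, base, effect, two_way=True, lag=1):
    """Probability that the home team registers the next scoring shot.

    history : earlier scoring shots (1 = home, 0 = away)
    two_way : if True, an away streak lowers the home probability as well
    lag     : consecutive scores needed to trigger momentum (1 or 2 here)
    """
    p = base
    if len(history) >= lag:
        recent = history[-lag:]
        if all(x == 1 for x in recent):
            p = base + effect  # home team has the momentum
        elif two_way and all(x == 0 for x in recent):
            p = base - effect  # away team has the momentum
    return min(1.0, max(0.0, p))  # assumption: keep probabilities in [0, 1]


# The four scenarios, expressed as parameter settings
scenarios = {
    "one score, both directions": dict(two_way=True, lag=1),
    "one score, positive only": dict(two_way=False, lag=1),
    "two scores, both directions": dict(two_way=True, lag=2),
    "two scores, positive only": dict(two_way=False, lag=2),
}
```

With a base probability of 0.6 and an effect size of 0.2, a home team score under the first scenario lifts the probability to 0.8, as in the earlier example. Under the two-score scenarios the probability stays at the base rate until one team has registered the last two scoring shots.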
Unsurprisingly, the Wald-Wolfowitz test performs worst of all in this scenario, never detecting the dependence in scoring much more than 40% of the time.
THE CONCLUSION
The fact that the power of the Wald-Wolfowitz test is so low in many of the scenarios we’ve investigated doesn’t make it more (or less) likely that in-game momentum is a real phenomenon, but it does provide some context for interpreting the fact that the test detects in-game momentum not much more often than we’d expect if none actually existed.
It could be that:
momentum of the types we’ve investigated doesn’t exist in AFL scoring, and the lack of power in the test we’re using is irrelevant
momentum of the types we’ve investigated does exist in AFL scoring, but the test we’re using isn’t powerful enough to detect it
It’s also possible, of course, that momentum only manifests itself in the scoring for some games but, again, that the test we’re using fails to detect it.
The overarching conclusion, I’d suggest, is that AFL games produce too few scoring events for the Wald-Wolfowitz test to be adequately powered.
We might, then, need to look for signs of in-game momentum in other statistics that produce many more relevant cases in the course of a typical game (for example, contested possessions). But the practical significance of determining that, say, contested possession sequences show signs of serial correlation (aka “momentum”) is, I’d suggest, less compelling than demonstrating such a correlation in actual scoring.
If having the momentum doesn’t lead to scoring, then how much does it really matter?
The search continues …