Set of Games Ratings: A Comparison With VSRS
(Here's another post from Andrew, who you might remember as the author of several very popular posts on game statistics, including this one introducing us to the analysis of game statistics in general, this followup piece on the links between game statistics and game outcomes, and this one on kicking statistics which brought the Matter of Stats website to the attention of the respected revolutionanalytics blog.
Today, Andrew's presenting another angle on team ratings ...)
A few weeks back, Tony introduced the Very Simple Rating System (VSRS). It's an ELO-style rating system applied to the teams in the AFL, designed so that the difference in the ratings between any pair of teams can be interpreted as the expected difference in scores for a game between them at a neutral venue, with a home ground advantage (HGA) added when one team is playing at home. Tony has explored several variants of the basic VSRS approach across a number of blogs, but I'll be focussing here on the version he created in that first blog.
VSRS team ratings are designed to be updatable immediately after the completion of a game so that they reflect the latest information about the relative strengths of the teams that just played, knowledge that might be useful for any subsequent wagering.
During the course of a season, teams' VSRS ratings will drift up and down, in so doing hopefully reflecting the ‘true rating’ or ability, in some sense, of each team at each point in the season.
True Ratings?
But there is no objective way of estimating an AFL team's "true rating" at any point in time. It will, certainly, be revealed in the outcome of every game a team plays, but consider the myriad factors that also influence a team's performance on a particular day: players in and out of the squad, opposition matchups, the game venue, ground size, crowd size, crowd fan mix, team strategies, the weather, the mood of the coach - and even, some may say, the location of Mars in the zodiac. Consider also that these factors change over varying timescales: game tactics and crowd mood can shift during the course of a single game or even a single quarter, rule changes and player transfers usually play out between seasons, and styles of play adapt and evolve over the course of many seasons, perhaps spanning decades.
As well, if we think of a team's "true rating" as a measure of its underlying, reasonably persistent "class", we need to isolate this element of any performance from those aspects related merely to shorter-term issues of "form". If a "class" team loses to a less classy opposition, does it signal a permanent recalibration of the teams' strengths or instead just a temporary hiccup in form?
There are, though, only so many games in a year, nowhere near enough to allow us to separately quantify the effects of all the factors just described. In science, such scarcity of information is sometimes attributed to “poor experimental design” - there are just too many things varying simultaneously. Fortunately, it's what makes sport interesting. (Imagine what commentary and betting would be like if games were too predictable - think staged wrestling.)
ELO-style team rating systems - such as the MARS model, which has been used now for some time on Matter of Stats, and the VSRS - incorporate a tuning parameter that controls the extent to which team ratings can change on the basis of a single result. Optimal values for these tuning parameters have tended to be small - for example, most of the optimised VSRS systems have values of k of about 8 or 9%, meaning that team ratings change by less than one-tenth of the difference between the actual and the expected game margin, a shift that usually amounts to just 1 or 2 rating points.
By favouring small ratings changes when they're based on single game outcomes these rating systems implicitly assume that a team's ‘true rating’ is reasonably stable from week to week. However, it's conceivable that large changes in a team's fundamental ability - and hence 'true rating' - might take place during the course of a single season, even over periods as short as a few weeks. Rating systems like the VSRS will, however, react to these more significant changes in ability only slowly.
Another potential source of discrepancy between a team's 'true rating' and its rating as measured by, say, a VSRS, is the choice of a team's initial rating for the season. Largely, this comes down to the choice of a 'carryover' parameter, which determines how much of a team's rating in the previous season is carried into the next. In reality, the extent to which a team's ability in one year will be related to its ability in the previous season will also depend on a variety of factors, amongst them the extent and nature of changes in the player roster and coaching staff. Based again on some form of optimisation, the MARS and VSRS systems all embody about a 50% carryover of "excess" rating (ie rating over or under 1,000) from one year to the next. Whilst that proportion might be a reasonable approximation of the average effect for most teams in most years, it could be very inaccurate for a given team in a given year.
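To make those mechanics concrete, here's a minimal Python sketch of an ELO-style update of this general kind, together with the between-season carryover. The parameter values shown (k, HGA and the carryover proportion) are illustrative assumptions only, not the optimised VSRS values.

```python
# Illustrative parameters only - not the optimised VSRS values.
K = 0.085          # fraction of each game's "surprise" transferred to ratings
HGA = 8.0          # assumed home ground advantage, in points
CARRYOVER = 0.5    # share of "excess" rating retained from one season to the next

def update_ratings(home_rating, away_rating, actual_margin):
    """Update both teams' ratings after a game.

    actual_margin is the home team's score minus the away team's score.
    """
    expected_margin = (home_rating - away_rating) + HGA
    surprise = actual_margin - expected_margin
    # With a small k, even a 20-point surprise moves ratings by under 2 points.
    return home_rating + K * surprise, away_rating - K * surprise

def new_season_rating(end_of_season_rating):
    """Regress a team's rating about halfway back towards the all-team average of 1,000."""
    return 1000 + CARRYOVER * (end_of_season_rating - 1000)
```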
In the remainder of this blog I'll be exploring a way of estimating team ratings using multiple games from a single season - even as much as an entire season - using Ordinary Least Squares regression, and then investigating how different these longer-term assessments of team ability are from the week-to-week assessments made by a VSRS.
The Case of Carlton
Carlton, it turns out, are a good example of a team for which the VSRS methodology encounters some difficulties. Let's start by reviewing a graph of Carlton’s VSRS from 1999 to 2013. For those curious about the technical details, the VSRS used here is the one from the very first blog on the topic and so is optimised for Mean Absolute Error, uses a single HGA parameter for all teams and venues, and sets each team's initial rating to 1,000 as at the start of 1999, except Gold Coast and GWS whose initial ratings are treated as parameters to be optimised. The black dotted line is the VSRS as at the end of each game. (The horizontal, coloured bars will be explained later.)
The X-axis in this chart tracks time; the gaps represent off-seasons. Note how, at the start of each season, the VSRS rating is reset to about half way between the end of season rating for Carlton and an average team's rating of 1,000.
If we look firstly at Carlton's competition results, we find that, after 2nd- and then 5th-placed finishes in 2000 and 2001 that promised much, the Blues then made a dash for the bottom of the ladder across seasons 2002 through 2007 finishing, in order across those seasons: last, second-last, fifth-last, last, last and second-last.
Within the confines of most of those 6 seasons, Carlton's VSRS rating seems to be tentatively searching each week for a reassuring floor that it just can't find. Then, after entire seasons of unresolved searching, as is its way, the VSRS optimistically resets Carlton's rating back each year nearer to 1,000 only for Carlton’s subsequent on-field performances to openly mock that optimism.
Season 2008 marks the start of the turnaround for Carlton, with an 11th-placed finish in the competition and with an upwardly trending VSRS for most of the season.
Throughout seasons 2009 and 2010, Carlton's VSRS rating remains comparatively stable, suggesting both that the ratings at the start of each season were close to Carlton's ‘true rating’ and that Carlton's performances throughout those years were broadly consistent with that rating. In the competition, Carlton finishes 7th and 8th, having remained around mid-table for most of both seasons.
Set of Games Ratings (SOGRs)
As mentioned earlier, one of the attractive features of VSRS Ratings is that they can be thought of as having points scored as their natural unit. So, for example, a team rated 1,030 is expected to score 30 points more than a team rated 1,000 in a game played at a neutral venue.
A simple parameterisation of game data - for a number of games within a season or for a season in its entirety - allows us to use Ordinary Least Squares regression to derive team ratings with a similar interpretation.
Consider, for example, the following two games from 2002:
- In Round 1, St. Kilda defeated Carlton 89-65 (ie by 24 points) with St. Kilda the designated home team at Docklands.
- In Round 2, St. Kilda lost at home to Sydney by 78 points.
For each of these games we create a row of data with 17 columns, 1 for each of the (then) 16 teams and 1 for the game margin. To encode the first game we place a 1 in the column for St Kilda to signify that they are the home team, and we place a -1 in the column for Carlton to show that they are the away team. We place a 0 in the columns for all other teams to denote that they were not participants, and we put 24 in the game margin column to show that the home team, St Kilda, won this game by 24 points. For losses by the designated home team we use negative numbers for the margin, and for draws we use 0.
The second game would be encoded with a 1 in the column for St Kilda, a -1 in the column for Sydney, a -78 in the game margin column, and with zeroes in the columns for all teams other than St Kilda and Sydney.
This process is then repeated for as many games as we wish to include in the SOGR ratings assessment.
Armed with this matrix we can then use Ordinary Least Squares regression to estimate the following linear model:
Expected Margin = HGA + c1 x Carlton + c2 x St Kilda + c3 x Sydney + …
(Mathematicians and hardened data wranglers will recognise that the matrix of team-related columns is not of full rank, since every row sums to zero. To address this we arbitrarily exclude the column for one team, implicitly setting its coefficient to zero. We can choose any team to be excluded as the resulting ratings are unaffected by this selection.)
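As a sketch of how such a design matrix might be assembled and fitted - this is illustrative numpy code built around the two example games above, not the code actually used for the analysis - with Carlton chosen as the arbitrarily excluded team:

```python
import numpy as np

# The two example games from 2002; in a real fit, every game in the chosen
# set of games gets a row, and the team list covers all 16 teams.
teams = ["Carlton", "St Kilda", "Sydney"]
games = [
    {"home": "St Kilda", "away": "Carlton", "margin": 24},    # Round 1
    {"home": "St Kilda", "away": "Sydney",  "margin": -78},   # Round 2
]

# Drop Carlton's column to deal with the rank deficiency noted above;
# its coefficient is then implicitly zero.
kept = [t for t in teams if t != "Carlton"]

X = np.zeros((len(games), 1 + len(kept)))
y = np.zeros(len(games))
for i, g in enumerate(games):
    X[i, 0] = 1.0                                  # intercept column = HGA
    if g["home"] in kept:
        X[i, 1 + kept.index(g["home"])] = 1.0      # +1 for the home team
    if g["away"] in kept:
        X[i, 1 + kept.index(g["away"])] = -1.0     # -1 for the away team
    y[i] = g["margin"]

coefs, *_ = np.linalg.lstsq(X, y, rcond=None)      # Ordinary Least Squares
hga_estimate = coefs[0]
team_coefs = dict(zip(kept, coefs[1:]))
team_coefs["Carlton"] = 0.0                        # the excluded team
```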
The difference between the estimated coefficients for any two teams from this model can be interpreted, as can VSRS ratings, as the difference in relative points-scoring abilities of the teams. So if, for example, the estimated coefficients for the equation above came out with c1 equal to -10.5 and c2 equal to +4.3, we could interpret that as implying that St Kilda were a 14.8 point better team than Carlton (and hence could be expected to defeat them by this margin in a game played at a neutral venue).
To make these ratings consistent with those of the VSRS they need to average 1,000, for which purpose we:
- add 1,000 to each estimated coefficient
- calculate the average coefficient
- add or subtract the same amount from each coefficient sufficient to make them average exactly 1,000 across all teams
We'll call the ratings that emerge from this process Set of Games Ratings (SOGRs).
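A small sketch of that normalisation step, assuming team_coefs maps every team (including the excluded one, with a coefficient of zero) to its fitted coefficient:

```python
def to_sogr(team_coefs):
    """Convert raw OLS coefficients into Set of Games Ratings averaging exactly 1,000."""
    shifted = {team: 1000 + c for team, c in team_coefs.items()}
    adjustment = 1000 - sum(shifted.values()) / len(shifted)
    return {team: rating + adjustment for team, rating in shifted.items()}
```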
If we follow the entire process of model building and coefficient normalisation as just described for the 185 games of 2002 (including all the Finals), we obtain the SOGRs as shown at left.
Taking season 2002 as a whole, therefore, Brisbane were almost a 2-goal better team than any other and were about a 10-goal better team than Carlton.
We also find that the estimated HGA is 9.9 points.
Exactly one half of the teams had an SOGR above 1,000 and one half had an SOGR below 1,000.
Comparing the teams' end-of-season ladder positions with their rankings based on SOGRs shows a reasonable degree of alignment, with Sydney and the Western Bulldogs the obvious exceptions. A closer look at the results for that season reveals that these two teams finished 5th and 7th in terms of points difference, with each team's ladder position substantially affected by a series of narrow losses. Sydney were involved in 11 games decided by less than two goals, finishing on the wrong side of the outcome 7 times and drawing once, while the Bulldogs also drew once and lost 8 games by less than 4 goals.
SOGR Performance
Let's apply the estimated 2002 SOGRs to Carlton's first game of the 2002 season, which was an away game against St. Kilda. The expected game margin for the home team in this case, St. Kilda, is 978.7 (St Kilda's SOGR) less 974.1 (Carlton's SOGR) plus the 9.9 HGA. So, the fitted margin is a St. Kilda win by about 15 points. The actual result was a St. Kilda win by 24 points.
It would be awesome for the model to be this accurate for every game, but it isn’t. It's not too bad though:
- The R-squared for the model is 33.7% (meaning that the SOGRs explain about one-third of the variability in game margins across the season)
- The standard deviation of the errors is 32.3 points
- Half the estimated game margins lie between -24 and +22 points
- The largest model errors are +94 and -81 points (from the viewpoint of the home team)
- The mean absolute error is 26.2 points per game, falling inside Tony’s magical figure of 30.
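As an illustration of how those fitted margins and error summaries might be calculated - a sketch only, with the sogr dictionary, HGA and game list assumed to come from the fit above:

```python
import numpy as np

def expected_margin(sogr, hga, home, away):
    """Fitted margin from the home team's viewpoint."""
    return sogr[home] - sogr[away] + hga

# e.g. expected_margin(sogr, 9.9, "St Kilda", "Carlton")
#      = 978.7 - 974.1 + 9.9 = 14.5, against an actual margin of +24

def error_summary(games, sogr, hga):
    """games: a list of (home, away, actual_margin) tuples."""
    errors = np.array([m - expected_margin(sogr, hga, h, a) for h, a, m in games])
    return {
        "mean absolute error": np.mean(np.abs(errors)),
        "error std deviation": np.std(errors, ddof=1),
        "largest errors": (errors.max(), errors.min()),
    }
```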
SOGRs provide a better fit to the actual game results of 2002 than do VSRS Ratings. However, some caution in interpreting these results is called for because:
- both rating systems have been optimised within-sample, so their measured in-sample performance is unlikely to generalise to a post-sample environment
- the VSRS has been optimised for the entirety of the 15 seasons from 1999 to 2013, using the same optimised parameters for that entire span of history, whereas SOGRs are optimised for each season separately
Carlton Revisited
As alluded to earlier, 2002 was a poor season for Carlton. Their SOGR of 974 rates them as more than 4 goals per game poorer than an average team. Even playing at home that makes them likely 3-goal losers; playing away, the expected margin is more like 6 goals. No surprise then that they won only 3 of 22 games in that season.
Here’s a close-up of the VSRS ratings for Carlton for the season of 2002 (a zoom in of the earlier chart).
The red line indicates a full season SOGR for Carlton of 974.1 points, while the green line shows an SOGR of 975.5, obtained by considering only games from the first half of the season. Hidden behind the red line is a blue line showing an SOGR of 974.1, calculated by including only those games from the second half of the season, and suggesting that Carlton may have been slightly worse in the second half of the season than in the first. (Note though that the margin of error on these SOGRs is approximately plus or minus 10 points so the difference is well within the inherent "noise" in the data.)
This graph highlights the issue described earlier in this blog where it takes quite some time for the VSRS to attain a rating consistent with Carlton's apparent true ability. Here the source of the discrepancy is two-fold: an overly optimistic initial rating and a sluggish, downward revision of that initial assessment. (That said, it is conceivable that Carlton were playing at a level consistent with a rating higher than 974 at some points in the season, but if that's true they must also have performed in at least part of the season consistent with a rating below 974 in order for the overall season result to net out at 974.)
If it is the case that, as the SOGRs suggest, Carlton ended 2001 with a rating of about 1,022 but started 2002 with a rating of about 974, then the VSRS, in rating Carlton at 1,011.5 at the start of season 2002, was in error by over 37 rating points. It then took until week 18 of the season for the VSRS to rate Carlton at 974.
Returning to the earlier chart, we can see Carlton bouncing along the bottom of the ratings ladder for several years until the end of 2007. The 2003 season is a particularly miserable one for the Blues, whose performances earn them a full-season rating of 963.4 and a second-half-only rating of just 944.1. (Only Melbourne, GWS and Gold Coast have also plumbed these depths during this millennium.)
Then, after an awful 2007, a Carlton comeback in 2008 sees their SOGR rise dramatically, though it takes some time for the VSRS to attain the team rating estimated by the SOGR.
SOGR Ratings Uncertainty
One nice feature of using Ordinary Least Squares to estimate team ratings is that the estimation process comes, if we're willing to accept the underlying Normality assumptions, with estimates of the level of confidence we can have in those estimated ratings.
The standard error on the coefficients is broadly constant across all teams at around 9.5. So, as an example, an interval of plus or minus one standard error around the Brisbane Lions' 2002 rating (expressed as points above 1,000) would be 31.8 +/- 9.4. That's 22.4 to 41.2, which is a sizeable range, reflecting the high level of uncertainty involved in estimating a team's rating using a single value across the entirety of a season.
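For completeness, here's a sketch of how those coefficient standard errors might be derived from the same X and y used in the OLS fit (the full-season design matrix, not the two-game illustration earlier); this is the standard textbook calculation rather than the exact code used for the blog.

```python
import numpy as np

def coef_standard_errors(X, y, coefs):
    """Standard errors of the OLS coefficients, under the usual assumptions."""
    residuals = y - X @ coefs
    n, p = X.shape
    sigma2 = residuals @ residuals / (n - p)      # residual variance
    cov = sigma2 * np.linalg.inv(X.T @ X)         # coefficient covariance matrix
    return np.sqrt(np.diag(cov))

# A band of +/- 1 standard error around each coefficient then gives
# intervals like the 31.8 +/- 9.4 quoted for the Brisbane Lions.
```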
Some Musings
We've seen that SOGRs are somewhat better at fitting game results than VSRS ratings. But, as I've defined them here, SOGRs can only be determined retrospectively once game results are already known and not prospectively, as would be required for them to be useful, for example, as inputs to MAFL betting Funds.
As well, this post has focused on ratings for Carlton, which, to some extent, might be a pathological case. Nonetheless, the analysis for Carlton suggests that abrupt changes in team ability - from week to week and across seasons - may not be well-modelled by a VSRS system.
Some immediate questions that follow are:
- To what extent do VSRS ratings for other teams exhibit this same behaviour? To answer this I’m working on a post that looks across all teams for the same period of 15 years.
- Can the approach here be used to establish better initial team ratings at the start of each season?
- Can it be used to determine better ways to update team ratings from one week to the next? Should, for example, different values of k be used for different teams or for different parts of the season?
- Can changes in team roster or team coaching staff be incorporated into team ratings?
Most importantly, I wonder, how did Carlton supporters survive the early noughties?