Matter of Stats

A Comparison of SOGR & VSRS Ratings

(Here's the latest installment in Andrew's Set of Games Ratings (SOGR) analyses. In this one he compares SOGR ratings calculated for full and half-seasons with ratings produced using a version of the VSRS.)

Earlier posts on the Very Simple Rating System (VSRS) and Set of Games Ratings (SOGR) included a range of attractive graphs depicting team performance within and across seasons.

But I wondered: how do the two systems compare in terms of the team ratings they provide and the accuracy with which game outcomes can be modelled using them? And what do any differences suggest about changes in team performance within and across seasons?

Home Ground Advantage

In the first post on SOGR I noted that it provides an optimised estimate of home ground advantage (HGA) for the set of games on which it is based, this being the intercept of the model fitted to those games.
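To make that structure concrete, here's a minimal sketch of an SOGR-style fit in Python, using a handful of made-up games (the data and variable names are illustrative only, not the actual games or code behind this analysis): each game's home margin is modelled as an intercept (the HGA) plus the difference between the home and away teams' ratings, estimated by least squares.

```python
import numpy as np

# Made-up example games: (home team, away team, home team's winning margin in points)
games = [
    ("Carlton", "Geelong", -12),
    ("Geelong", "Hawthorn", 25),
    ("Hawthorn", "Carlton", 33),
    ("Carlton", "Hawthorn", -20),
    ("Geelong", "Carlton", 18),
]

teams = sorted({t for g in games for t in g[:2]})
idx = {t: i for i, t in enumerate(teams)}

# Design matrix: first column is the intercept (the estimated HGA); the remaining
# columns carry +1 for the home team and -1 for the away team in each game.
X = np.zeros((len(games), 1 + len(teams)))
y = np.zeros(len(games))
for row, (home, away, margin) in enumerate(games):
    X[row, 0] = 1.0
    X[row, 1 + idx[home]] = 1.0
    X[row, 1 + idx[away]] = -1.0
    y[row] = margin

# The team columns sum to zero, so the fit is rank-deficient; lstsq returns the
# minimum-norm solution, which centres the team ratings on zero.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
hga, ratings = coef[0], coef[1:]

print(f"Estimated HGA: {hga:.1f} points")
for t in teams:
    print(f"{t}: {1000 + ratings[idx[t]]:.1f}")  # add 1,000 to match the scale used here
```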

The chart below shows the estimated HGA for SOGR models built for each of the seasons 1999 to 2013 in full, and for the first and second halves of each of those seasons - 45 models in all. The colour scheme used in this chart and throughout the remainder of this blog post matches that used in the SOGR charts from earlier posts, as described in the following table:

The season-to-season variability in full-season estimated HGA is substantial. The minimum HGA is 1.7 points, recorded in 2001, and the maximum is 13.4 points for 2005, making the range almost two whole goals wide.

The estimated standard errors associated with these HGA coefficients are about 3 to 4 points, varying a little from season to season, so we can comfortably claim that many of the season-to-season differences in estimated HGA are statistically significant.

This high level of variability suggests that the fixed HGA employed in the original VSRS is likely to have significantly reduced its ability to accurately model game outcomes.

In fact, as well as varying from season to season, HGA sometimes appears to vary substantially within seasons too. The evidence for this is the data in the column headed H1 – H2, where we can see differences in the estimated HGAs for the first and second halves of seasons as large as about 10 or 11 points in 2002 and 2009. This suggests that HGA varies enough within a season to warrant separate modelling of each half.

Speculating on possible reasons for the variability of HGA, I came up with the following:

  • Scheduling across and within seasons may lead to differences in the average relative strength of home teams in one season versus another, or in one half of a season compared to the other (for example, we might find relatively stronger home teams in one half of a season and relatively weaker home teams in the other half, when the home and away roles are reversed).
  • A season or half-season with more games pitting Melbourne-based teams against one another would tend to reduce the estimated HGA (see this blog for some quantification of this).
  • Changes in game play, competition structure, refereeing and so on might introduce inter-seasonal effects driving variability in HGA.

Whatever the reason, the SOGR results suggest that estimating time-varying HGAs might be worthwhile. (On a related note, in a later version of the VSRS, used with the ChiPS Prediction System, Tony utilised team- and venue-varying HGAs, which demonstrated significantly superior accuracy. In a future post I'll write about some modelling I've done trialling variable HGAs, which also leads to improved accuracy.)

Rating System (Dis-)Agreement

The large chart in this previous post on SOGRs shows some level of consistency between the team ratings produced by VSRS and by SOGR, at least in the sense that the SOGR ratings tend to lie somewhere within the range of week-by-week ratings provided by the VSRS for the same team and season or half-season.

Closer inspection of that chart and the more detailed analysis of Carlton in the earlier blog post suggest that VSRS ratings generally spend some time above and some time below the half- and full-season ratings of SOGR for the same team.

Let's look at that phenomenon in some detail.

The charts below compare the ratings differences between VSRS and the three SOGR ratings for each round of the season, averaged across all teams and all seasons.

The vertical axis in these charts is the mean absolute difference between the VSRS rating and each of the SOGR ratings. The horizontal axis is the Round number. Note that in all of these analyses the 1999 season is omitted because this is the "calibration" season for VSRS, in which the majority of currently-active teams started the season with a rating of 1,000.
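For anyone wanting to reproduce this kind of summary, here's a minimal sketch, assuming hypothetical dictionaries of ratings (the names vsrs and sogr_season, and the keying by season, round and team, are mine rather than the actual data structures used for this analysis).

```python
import numpy as np

# vsrs is assumed to map (season, round, team) to that week's VSRS rating;
# sogr_season is assumed to map (season, team) to the full-season SOGR rating.
def mean_abs_diff_by_round(vsrs, sogr_season, first_season=2000):
    diffs = {}
    for (season, rnd, team), rating in vsrs.items():
        if season < first_season:   # omit 1999, the VSRS calibration season
            continue
        diffs.setdefault(rnd, []).append(abs(rating - sogr_season[(season, team)]))
    # Average the absolute differences across all teams and seasons, round by round
    return {rnd: float(np.mean(vals)) for rnd, vals in sorted(diffs.items())}
```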

The top-left chart shows the mean absolute difference between VSRS and each of the SOGRs, while the three boxplots show the distributions of those differences. In the boxplots, the line across each box marks the average absolute difference for the round and tracks the line chart values in the chart at top-left.

The red line in the chart at top-left shows that the VSRS ratings used for this blog and the SOGR-Season ratings are closest in a given season from about Round 20 onwards, at which point the average absolute difference is about 4 rating points.

VSRS and SOGR ratings, on average, start each season with a difference of about 12 rating points, though in SOGR's case it "starts" the season with the benefit of complete knowledge of the upcoming season or half-season, depending on the SOGR variant we're considering. By comparison, the VSRS ratings at any point reflect knowledge of all past and future outcomes, via the optimisation of the system's parameters, but are constrained to use those same parameter values across the entire history. In that sense, both SOGR and VSRS are fitted models, though SOGR has more degrees of freedom in relation to the games in the season or half-season to which it is being fitted.

In contrast, it is in the nature of the ELO-style approach that underpins VSRS ratings that recent results are weighted more heavily than older ones, with the relative influence of previous results decaying broadly exponentially. As such, the influence of games from earlier in the season becomes increasingly insignificant for VSRS as the season progresses.
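For readers unfamiliar with the mechanics, here's a minimal sketch of an ELO-style update of the general kind that underpins VSRS; the HGA, step size k and margin cap used here are illustrative values only, not the tuned VSRS parameters.

```python
def elo_style_update(home_rating, away_rating, actual_margin, hga=8.0, k=0.1, cap=60.0):
    """Move both teams' ratings towards the observed result by a fraction k of the
    (capped) prediction error. Because each new game corrects only a fraction of the
    error, the influence of older games decays roughly exponentially over time."""
    predicted_margin = (home_rating - away_rating) + hga
    error = max(-cap, min(cap, actual_margin)) - predicted_margin
    return home_rating + k * error, away_rating - k * error

# Example: a 1,020-rated home team beats a 990-rated visitor by 40 points
print(elo_style_update(1020.0, 990.0, 40.0))  # home rating rises slightly, away falls
```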

In most recent seasons, Finals have started after about Round 22 or 23, at which point there are fewer games each week and, in the case of VSRS, fewer team ratings being updated. I’ve left the data for these rounds off the graphs but retained the figures in the accompanying tables for those who might be interested.

The green line in the chart shows that VSRS and SOGR-H1 converge over the entirety of the first half of the season and then diverge over the remainder as the SOGR-H1 ratings become less relevant. Over the first 11 rounds of a season, on average the absolute difference in ratings drops from about 13 rating points to about 5.

During the second half of the season, VSRS ratings converge towards the SOGR-H2 ratings. The difference between the SOGR-H2 and VSRS ratings drops from a maximum of about 12 rating points after Round 11 to around 4 to 5 rating points by the end of the season.

Interestingly, SOGR-Season ratings are more similar to VSRS ratings than either the SOGR-H1 or SOGR-H2 ratings are. My expectation had been that, instead, at the end of each half-season, SOGR-H1 or SOGR-H2 would be closer to the VSRS ratings than SOGR-Season. That's not the case, though the differences are not large.

One possibility (suggested by Tony) is that VSRS week-to-week ratings track above and below SOGR-season, SOGR-H1 and SOGR-H2 ratings at different points of the season but, on average, take the same or similar values. To investigate this possibility, the following charts compare the average VSRS rating against the SOGR rating for each team and season combination. As usual the comparison is made for the season as a whole and, separately, for each half of the season.

At the season level, there is very strong agreement between the VSRS and SOGR ratings (r = +0.97). Interestingly, the slope of the relationship is significantly less than 1 (it's 0.74), meaning that the SOGR ratings are more discriminating (ie they tend to lie further from 1,000) than the VSRS ratings. Since both are designed to measure the same thing (the relative strength of teams, expressed in points scored), and since SOGR ratings produce fitted game margins with a smaller MAE, this suggests that VSRS may be underestimating the ratings of strong teams and overestimating those of weak teams.
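For concreteness, here's a minimal sketch of how the correlation and slope behind these charts can be computed, assuming two hypothetical arrays aligned by team and season (the names sogr and mean_vsrs are mine): sogr holds the SOGR-Season ratings and mean_vsrs the corresponding season-average VSRS ratings.

```python
import numpy as np

def compare_ratings(sogr, mean_vsrs):
    r = np.corrcoef(sogr, mean_vsrs)[0, 1]
    # Regressing the season-average VSRS rating on the SOGR rating: a slope below 1
    # means the VSRS ratings span a narrower range than the SOGR ratings for the
    # same teams, ie the VSRS ratings are less discriminating.
    slope, intercept = np.polyfit(sogr, mean_vsrs, 1)
    return r, slope, intercept
```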

Turning next to the comparisons for the first and second halves, we again find high, positive correlations between VSRS and SOGR ratings. The slope for the first half of the season (0.60) is considerably smaller than that for the second half (0.73). This suggests that VSRS ratings at the start of each season might be too compressed and not capturing the true variability in team abilities.

Together, these reinforce the earlier finding that VSRS tends to take most of the season to reach an appropriate team rating.  

In summary, the key findings from this analysis are that:

  • VSRS ratings and all three SOGR ratings are, generally, in good agreement
  • VSRS and SOGR-Season ratings tend to be in closer alignment than VSRS and the half-season SOGR ratings, presumably because they share more of the same underlying data
  • The largest divergence between VSRS and SOGR ratings is at the start of every season
  • There are significant differences between team ratings in the first and the second half of seasons

Error in "Predicted" Game Margins

The extent to which one set of team ratings is better or worse than another set at predicting game margins is a critical consideration. More accurate models can make for more accurate wagering - an important issue if this is why you visit the MatterOfStats website.

Since we don't have any holdout data to use for VSRS and since SOGR is, by its nature, fitted only to known data, we can't assess the true predictive abilities of the two rating systems on games that were not included in their creation. Instead, here we'll look at the relative abilities of the two rating systems to fit the results from seasons 1999 to 2013.

The four charts that follow are scatter plots of estimated game margins (X axis) against actual scoring margins (Y axis). The chart with black points is for the VSRS ratings, the other three relate to SOGR model estimates and use the colour-coding described earlier. The charts are annotated with the correlation between the estimated and actual game margins, the mean absolute error and the number of games depicted (the half seasons have roughly half the games).

On each chart I've included a solid 45-degree line depicting the ideal fit. On the VSRS chart I've also added a regression line though it differs only marginally from the 45-degree line.

In all cases, estimated game margins are formed by taking the difference between the ratings of the relevant teams and then adding the fixed HGA. SOGR-H1 ratings are used to estimate game margins from the first half of seasons only, while SOGR-H2 ratings are used to estimate game margins from the second half of seasons only.
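A minimal sketch of how those estimated margins and the associated correlation and MAE can be calculated, assuming hypothetical, equal-length arrays (the names are mine): home_rating and away_rating hold the relevant team ratings for each game, actual_margin the home team's actual winning margin, and hga the estimated home ground advantage for the rating system being assessed.

```python
import numpy as np

def fit_quality(home_rating, away_rating, actual_margin, hga):
    # Estimated margin = home rating minus away rating, plus the HGA
    estimated_margin = np.asarray(home_rating) - np.asarray(away_rating) + hga
    mae = float(np.mean(np.abs(np.asarray(actual_margin) - estimated_margin)))
    r = float(np.corrcoef(estimated_margin, actual_margin)[0, 1])
    return mae, r
```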

The best fit is associated with SOGR-H2 ratings (r = 0.71), though those associated with SOGR-H1 ratings are only slightly worse (r = 0.69). These ratings should, of course, provide good estimates of game margins since they're optimised to do exactly that for the half-season to which they apply.

The mean absolute errors of 25 and 26 points per game recorded by these two models are well below the 'magic' figure of 30, albeit that they've been achieved by SOGR-H1 and SOGR-H2 purely in-sample. The SOGR-Season model is not far behind, with a correlation of 0.65 and an MAE of 27 points per game.

By way of reference, the simple VSRS used for this blog has a correlation of 0.53 and an MAE of 30.0 points per game. Technically, its results are also achieved in-sample though, for the reasons described earlier, they are qualitatively somewhat less so.

(In passing, I note that Tony recently published a blog on the ChiPS Prediction System, which incorporates an enhanced version of VSRS and achieves an MAE of 29.3, including season 1999.)

Some findings:

  • The MAE of 25 to 26 points per game for SOGR-H1 and H2, being lower than the SOGR-Season MAE of 27 points per game, implies that team abilities (and/or HGA) shift measurably within seasons.
  • The within- and between-season differences in estimated team abilities and HGA are sufficiently large to suggest that rating systems should model these effects.

Further improvements to SOGR might be made by including additional parameters, such as team- and venue-specific HGAs as in the ChiPS System.

Round-By-Round Errors

The charts below record similar information to that charted in the previous section (ie differences between estimated and actual game margins), but do so for each round of the season separately. Unlike the earlier charts, however, those for SOGR-H1 and SOGR-H2 here include estimated game margins for both halves of the season. The estimates for the half of the season on which the model ratings were not determined (ie the second half of the season for SOGR-H1 and the first half of the season for SOGR-H2) are shaded in the respective chart.

Each chart includes the relevant MAE figure for the season and each half season. (NB The horizontal lines are meant to depict the span of the particular MAE figure, not its value in terms of the Y-axis.)

Start of Season Rating Reset

One difficult-to-model factor in AFL is how to adjust end-of-season ratings from one year for use in the following year given that:  

  • Coaches change, team rosters change, strategies change, home grounds change and so on.
  • The draft process takes place, and other AFL initiatives are implemented which are intended to increase the competitive balance of the league.
  • Pre-season games seem to hold very little predictive power for the season ahead.
  • Doubts persist about the relevance of team performances in games late in the season that don’t affect Finals positions.

VSRS captures this change via a single "carryover" parameter, which drags all teams' ratings nearer to the average rating of 1,000. In essence, this approach assumes that team ratings regress to the mean between seasons, an assumption borne out empirically: the parameter's optimised value is 50% in the simpler VSRS I've been using for this blog, and 61% in the enhanced VSRS used with ChiPS.
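Written out, the reset is just a weighted average of a team's end-of-season rating and the all-team mean of 1,000. Here's a minimal sketch (the function name is mine):

```python
def reset_rating(end_of_season_rating, carryover=0.5):
    # With carryover = 0.5, the optimised value for the simple VSRS used here,
    # a team rated 1,040 at season's end restarts the following season at 1,020.
    return 1000 + carryover * (end_of_season_rating - 1000)
```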
 
SOGR ratings provide another view on the optimal level of inter-season adjustment. The following charts show values for two inter-seasonal changes:

  • SOGR in one season (X axis) versus the SOGR for the same team for the next season (Y axis)
  • SOGR-H2 of one season (X axis) versus the SOGR-H1 for the same team in the next season (Y axis)

In these charts:

  • Each point relates to a single team's pair of ratings at the two points in time
  • The dashed grey line is the "line of no change"
  • The red line depicts the ratings reset of the simple VSRS with its 50% carryover parameter
  • The green line is an OLS fit to the data (its intercept and slope are recorded in the legend; a sketch of this fit follows the list)
  • The correlation between the two SOGR ratings is provided as an annotation
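Here's a minimal sketch of how the green OLS line and the correlation can be computed, assuming two hypothetical arrays aligned by team (the names prev and nxt are mine): prev holds a team's rating for the earlier period (the X axis) and nxt its rating for the later period (the Y axis).

```python
import numpy as np

def estimate_reset(prev, nxt):
    slope, intercept = np.polyfit(prev, nxt, 1)   # the green OLS line
    r = np.corrcoef(prev, nxt)[0, 1]
    # The slope plays the role of a carryover: a slope of about 0.46 in the
    # SOGR-H2 to SOGR-H1 fit says roughly 46% of a team's above- or below-average
    # rating persists into the new season, close to the 50% carryover used by VSRS.
    return slope, intercept, r
```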

Some observations:

  • The fit for SOGR-H2 to SOGR-H1 suggests about a 46% reset is optimal. This is very close to the VSRS and MARS ratings resets of 50% and is more empirical evidence of the efficacy of that choice despite its simplicity.
  • The SOGR-H2 to SOGR-H1 fit is more appropriate for determining the optimal inter-seasonal reset because of the superior fit of these half-season models to actual game margins, as shown earlier. However, the fact that the season-to-season fit exhibits higher correlation suggests that “reversion to the mean” takes place on a timeframe less than 12 months.

There are many factors that might help better predict the performance of teams at the start of a new season. Changes in team roster and team coach come to mind as immediate factors to investigate in the future. SOGR provides a simple way to test the efficacy of these factors, assuming they can be quantified. 

Intra-Season Rating Changes

And, finally, the same methodology used above for inter-season analysis can be applied to the comparison of the first and second halves of seasons.

Some findings:

  • The correlation between teams' ratings in the first half of a season and their ratings in the second half of the same season, at +0.72, is considerably higher than the correlation between their ratings in the second half of one season and the first half of the next (+0.52). That seems logical, since any changes for a team are likely to be less dramatic within a season than between seasons, but it's still good to confirm.
  • The slope of the best-fitting linear model (0.814) suggests that team ratings in the first and second halves of seasons are quite stable. The fact that this slope is less than 1 suggests that the second halves of seasons tend to be more competitive than first halves (since second-half team ratings are more compressed than first-half ratings in the same season, so the average rating difference between teams in a contest will tend to be smaller).
    (Note that I haven’t separated the effects of finals in this analysis. Since these games are, by design, played only between more-highly rated teams, the difference in team ratings for these games will be smaller than the average difference in the regular season.)

Postscript

Analysis of VSRS and SOGR is ongoing and I will, life permitting, be blogging about it before too long.

The analysis in this post suggests that:

  • Inter-season changes in team ratings are more substantial than intra-season changes.
  • The inter-season rating carryover value of about 50% used in VSRS and MARS ratings systems seems reasonable - at least, according to the SOGR model - but there’s still plenty of unexplained variability to explore.
  • Home Ground Advantage changes enough within and between seasons to warrant a dynamic model of its effects (and possibly team- and venue-specific treatment).
  • VSRS and SOGR ratings are broadly compatible, with the greatest differences seen at the start of each season. Over the course of half and full seasons, the two sets of ratings tend to converge.