Testing the HELP Model
It had been rankling me that I'd not come up with a way to validate any of the LAMP, HAMP or HELP models that I chronicled the development of in earlier blogs.
In retrospect, what I probably should have done is build the models using only the data for seasons 2006 to 2008 and then test the resulting models on 2009 data but you can't unscramble an egg and, statistically speaking, my hen's albumen and vitellus are well and truly curdled.
Then I realised that there is another way to test the models - well, for now at least, to test the HELP model.
Any testing needs to address the major criticism that could be levelled at the HELP model, which is a criticism stemming from its provenance. The final HELP model is the one that, amongst the many thousands of HELP-like models that my computer script briefly considered (and, as we'll see later, of the thousands of billions that it might have considered), could be made to fit the line betting data for 2008 and 2009 best, projecting one round into the future using any or all of the 47 floating window models available to it.
From an evolutionary viewpoint the HELP model represents an organism astonishingly well-adapted to the environment in which its genetic blueprint was forged, but whether HELP will be the dinosaur or the horseshoe crab equivalent in the football modelling universe is very much an open question.
Test Details
With the possible criticism I've just described in mind, what I've done to test the HELP model is to estimate the fit that could be achieved with the modelling technique used to find HELP had the line betting result history been different but similar. Specifically, I've taken the actual timeseries of line betting results, randomised it, run the same script that I used to create HELP and then calculated how well the best model the script can find fits the alternative timeseries of line betting outcomes.
Let me explain what I mean by randomising the timeseries of line betting results by using a shortened example. If, say, the real, original timeseries of line betting results were (Home Team Wins, Home Team Loses, Home Team Loses, Home Team Wins, Home Team Wins) then, for one of my simulations, I might have used the sequence (Home Team Wins, Home Team Wins, Home Team Wins, Home Team Loses, Home Team Loses), which is the same set of results but in a different order.
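As a minimal sketch of that randomisation step, assuming the results are held as a simple list of "W" and "L" labels (the function name and the labels are illustrative, not taken from the actual script):

```python
import random

def shuffle_results(results, seed=None):
    """Return a randomly reordered copy of a sequence of line betting results.

    The original sequence is left untouched; only the order changes,
    so the counts of each outcome are preserved.
    """
    rng = random.Random(seed)
    shuffled = list(results)
    rng.shuffle(shuffled)
    return shuffled

# Hypothetical shortened history: W = Home Team Wins, L = Home Team Loses
original = ["W", "L", "L", "W", "W"]
simulated = shuffle_results(original, seed=1)

# Same set of results, possibly in a different order
assert sorted(simulated) == sorted(original)
```

Because a shuffle only reorders the list, every simulated history automatically has the same number of wins and losses as the real one.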
From a statistical point of view, it's important that the randomised sequences used have the same proportions of "Home Team Wins" and "Home Team Loses" as the original, real series, because part of the predictive power of the HELP model might come from its exploiting the imbalance between these two proportions. To give a simple example, if I fitted a model to the roll of a die and its job was to predict "Roll is a 6" or "Roll is not a 6", a model that predicted "Roll is not a 6" every time would achieve an 83% hit rate solely from picking up on the imbalance in the data to be modelled. The next thing you know, someone introduces a Dungeons & Dragons 20-sided die and your previously impressive Die Prediction Engine self-immolates before your eyes.
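The die example can be checked with a few lines. This is only a sketch of the "always pick the commoner outcome" baseline, with a made-up function name:

```python
from collections import Counter

def majority_baseline_accuracy(outcomes):
    """Hit rate of a model that always predicts the most common outcome."""
    counts = Counter(outcomes)
    return max(counts.values()) / len(outcomes)

# One face in six of a fair die is a 6
die_outcomes = ["six"] + ["not six"] * 5
print(round(majority_baseline_accuracy(die_outcomes), 3))  # 0.833
```

Any model fitted to the line betting results gets this sort of head start for free, which is why the shuffled histories need to preserve the original proportions.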
Having created the new line betting results history in the manner described above, the simulation script proceeds by creating the set of floating window models - that is, the models that look only at the most recent X rounds of line betting and home team bookie price data for X ranging from 6 to 52 - then selects a subset of these models and determines the week-to-week weighting of these models that best fits the most recent 26 rounds of data. This optimal linear combination of floating window models is then used to predict the results for the following round. You might recall that this is exactly the method used to create the HELP model in the first place.
The model that is built using the currently best-fitting linear combination of floating window models is, after a fixed number of models have been considered, declared the winner and the percentage of games that it correctly predicts is recorded. The search then recommences with a different set of line betting outcomes, again generated using the method described earlier and another, new winning model is found for this line betting history and its performance noted.
In essence, the models constructed in this way tell us to what extent the technique I used to create HELP can be made to fit a number of arbitrary sequences of line betting results, each of which has the same proportion of Home Team wins and losses as the original, real sequence. The better the average fit that I can achieve to such an arbitrary sequence, the less confidence I can have that the HELP model has actually modelled something inherent in the real line betting results for seasons 2008 and 2009 and the more I should be concerned that all I've got is a chameleonic modelling technique capable of creating a model flexible enough to match any arbitrary set of results - which would be a bad thing.
Preliminary Test Results
You'll recall that there are 47 floating window models that can be included in the final model, one floating window model that uses only the past 6 weeks, another that uses the past 7 weeks, and so on up to one that uses the past 52 weeks. If you do the combinatorial maths you'll discover that there are almost 141,000 billion different models that can be constructed using one or more of the floating window models.
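For anyone who'd rather not do the combinatorial maths by hand, here's a quick check. Each of the 47 floating window models is either in or out of a candidate model, and the empty combination is excluded:

```python
from math import comb

# Floating window lengths run from 6 to 52 rounds inclusive
n_windows = len(range(6, 53))
assert n_windows == 47

# Every non-empty subset of the 47 windows is a candidate model
total_models = 2 ** n_windows - 1

# Same count, built up as "choose k windows" for k = 1..47
assert total_models == sum(comb(n_windows, k) for k in range(1, n_windows + 1))

print(f"{total_models:,}")  # 140,737,488,355,327
```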
The script I've written evaluates one candidate model about every 1.5 seconds so it would take about 6.7 million years for it to evaluate them all. That would allow us to get a few runs in before the touted heat death of the universe, but it is at the upper end of most people's tolerance for waiting. Now, undoubtedly, my code could do with some optimisation, but unless I can find a way to make it run about 13 million times faster it'll be quicker to revert to the original idea of letting the current season serve as the test of HELP rather than use the test protocol I've proposed here. Consequently I'm forced to accept that any conclusions I come to about whether or not the performance of the HELP model is all down to chance are of necessity only indicative.
That said, there is one statistical modelling 'trick' I can use to improve the potential for my script to find "good" models and that is to bias the combinations of floating window models that my script considers towards those combinations that are similar to those that have already demonstrated above-average performance during the simulation so far. So, for example, if a model using the 7-round, 9-round and 12-round floating window models looks promising, the script might next look at a model using the 7-round, 9-round and 40-round floating window models (i.e. change just one of the underlying models) or it might look at a model using the 7-round, 9-round, 12-round and 45-round floating window models (i.e. add another floating window model). This is a very rudimentary version of what the trade calls a Genetic Algorithm and it is exactly the same approach I used in finding the final HELP model.
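To make the idea of "similar combinations" concrete, here's a rough sketch of how a neighbouring model might be generated from a promising one. It isn't the actual script: the function name, the representation of a model as a set of window lengths and the "drop" move are all assumptions of mine, and only the "swap" and "add" moves are described above.

```python
import random

WINDOWS = list(range(6, 53))  # the 47 floating window lengths

def mutate(model, rng):
    """Return a neighbour of `model`, a frozenset of window lengths.

    Moves: swap one window for an unused one, add an unused window,
    or (an assumption, not described in the text) drop one window.
    """
    current = set(model)
    unused = [w for w in WINDOWS if w not in current]
    moves = []
    if unused:
        moves += ["add", "swap"]
    if len(current) > 1:
        moves.append("drop")
    move = rng.choice(moves)
    if move == "add":
        current.add(rng.choice(unused))
    elif move == "swap":
        current.remove(rng.choice(sorted(current)))
        current.add(rng.choice(unused))
    else:  # drop
        current.remove(rng.choice(sorted(current)))
    return frozenset(current)

rng = random.Random(0)
promising = frozenset({7, 9, 12})
neighbour = mutate(promising, rng)
print(sorted(neighbour))
```

A search would then score each neighbour and keep mutating whichever models have fitted best so far, rather than drawing every candidate at random.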
You might recall that HELP achieved an accuracy of 60.8% across seasons 2008 and 2009, so the statistic that I've had the script track is how often the winning model created for a given set of line betting outcomes performs as well as or better than the HELP model.
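In code, that tracking statistic might look something like the sketch below. The accuracies in the example are placeholders made up purely to show the calculation, not results from my simulations, and the standard error function is a textbook binomial approximation that I've added as an assumption about how the uncertainty might be gauged.

```python
from math import sqrt

HELP_ACCURACY = 0.608  # HELP's accuracy across seasons 2008 and 2009

def exceedance_rate(simulated_accuracies, benchmark=HELP_ACCURACY):
    """Share of simulated winning models that match or beat the benchmark."""
    hits = sum(1 for a in simulated_accuracies if a >= benchmark)
    return hits / len(simulated_accuracies)

def binomial_standard_error(rate, n):
    """Approximate standard error of a proportion estimated from n runs."""
    return sqrt(rate * (1 - rate) / n)

# Placeholder values for illustration only (NOT actual simulation output)
example = [0.55, 0.61, 0.58, 0.608, 0.57]
rate = exceedance_rate(example)
print(rate, round(binomial_standard_error(rate, len(example)), 3))
```

With only a handful of simulated histories the standard error is large, so any estimate of the rate will be correspondingly imprecise until many more runs are in.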
From the testing I've done so far my best estimate of the likelihood that the HELP model's performance can be adequately explained by chance alone is between 5% and 35%. That's maddeningly wide and spans the interval from "statistically significant" to "wholly unconvincing" so I'll continue to run tests over the next few weeks as I'm able to tie up my computer doing so.