This weekend's North London derby, we are told, is the most important game of the year. You can accept this as one of those cliched statements that are broadly true, you can snark at the cliche, or you can create half a gigabyte of spreadsheets and spend hours on data entry in order to say that the importance of this weekend's North London derby can be quantified to some X. I have chosen door number three.
What I've made, basically, is a Monte Carlo projection of the remainder of the EPL season. A Monte Carlo experiment (for more see previous wiki link) is a method for discovering the probabilities of different outcomes of an event by running a whole lot of iterations of that event. So if you had a six-sided die, and you were very very bored, you could roll it hundreds of times to discover that, yes, there is roughly a 16.7% chance of any particular number coming up. That's basically a physical Monte Carlo experiment. With a much more complicated event, like the English Premier League season, this method requires computers, and it is reasonably useful for producing probabilities that can't be simply extrapolated logically. So my spreadsheets simulate the remaining season 10,000 times to see what the likely outcomes are.
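For the programming-inclined, here is the bored-person-with-a-die experiment as a minimal Python sketch. This is only an illustration of the idea, not my actual spreadsheets; the function name, the seed, and the number of rolls are all made up for the example.

```python
import random
from collections import Counter


def estimate_die_probabilities(n_rolls, seed=0):
    """Roll a fair six-sided die n_rolls times and return the observed
    share of rolls that landed on each face (hypothetical helper)."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    counts = Counter(rng.randint(1, 6) for _ in range(n_rolls))
    return {face: counts[face] / n_rolls for face in range(1, 7)}


# 60,000 rolls is an arbitrary choice: more rolls, tighter estimates.
die_probs = estimate_die_probabilities(60_000)
for face, share in die_probs.items():
    print(f"{face}: {share:.3f}")
```

Every face should come out close to 1/6. The season projection is the same trick, just with a far more complicated "die".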
(I have put some further nerdy explanations of the method at the bottom of the post, and if folks are interested, I plan on writing more of these posts with more discussions of the underlying statistical problems. I assume there are folks in the community much more educated than myself in statistics and such, so I welcome feedback.)
First, you'll be happy to hear that the system thinks a Spurs win is reasonably likely. Here's a table of probabilities, including a kind of useless projected final score, encoded using my crappy HTML skills.
The computer thinks we're going to win! (A plurality of the time!) What is useful about the Monte Carlo projection here is that I can put some numbers to the question of how important this game is. What are the chances of Tottenham finishing top 4 depending on the outcome of the NLD?
I currently have Tottenham with a 77% chance of making the Champions League places, and Arsenal at 43%. This game, by my numbers, matters for Tottenham not because we're screwed without a win, but because a win will put us in great position for the top four. For Arsenal, this is an undeniably huge game with major implications either way.
Here's another crappy looking table, showing the implications of different outcomes for the Champions League places:
|Team||Tot W||Draw||Ars W|
That is, Tottenham increase their odds of making the Champions League to about 88% with a win, while Arsenal would see their odds of making the CL drop below one-in-three with a loss. And so on across the table. So, this is a huge game, but if Spurs are as good as my numbers say, even a loss wouldn't be crippling to our hopes. A win, though, man, that'd be nice as the run-in gets tough for the next two months. For Arsenal, this is just a huge game full stop.
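The "odds depending on the outcome" numbers come from splitting the simulated seasons by how the NLD went and then counting, within each group, how often a team finished in the top four. Here is a rough Python sketch of that bookkeeping. The function name and the record layout are my own assumptions for illustration, and the eight toy records are invented, not pulled from the real simulation.

```python
from collections import defaultdict


def top4_odds_by_outcome(iterations):
    """Given (nld_outcome, finished_top4) pairs, one per simulated season,
    return the share of seasons with a top-four finish for each outcome."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for outcome, finished_top4 in iterations:
        totals[outcome] += 1
        hits[outcome] += int(finished_top4)
    return {outcome: hits[outcome] / totals[outcome] for outcome in totals}


# Invented toy data: eight pretend seasons, NOT real simulation output.
toy_iterations = [
    ("Tot W", True), ("Tot W", True), ("Tot W", False),
    ("Draw", True), ("Draw", False),
    ("Ars W", True), ("Ars W", False), ("Ars W", False),
]
toy_odds = top4_odds_by_outcome(toy_iterations)
print(toy_odds)
```

With 10,000 simulated seasons instead of eight, each group is large enough for the shares to settle down into usable percentages.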
At the same time, nothing is locked in. In the first of my iterations of the season, Arsenal won the NLD 2-1, but Liverpool went on an insane 9-1-1 run to the finish, edging us and Arsenal out of the Champions League by one point apiece. This sort of thing didn't happen often, and it's hardly likely in the real world, but unlikely events happen all the time. The question is which unlikely events are going to happen. My hope is that the numbers can be useful in roughly quantifying some of these likelihoods and unlikelihoods.
(Nerdiness to follow. Well, nerdier nerdiness.)
You might be surprised that the stats like Tottenham more than Arsenal, even though Arsenal have a notably superior goal difference. This is because my method for estimating team quality is built mostly from underlying stats including shots on target, shots in the box, and Opta-classified "big chances". One of the major insights of football sabermetrics is that shot-on-target conversion (the percentage of shots on target which end up as goals) is highly variable within a season for both players and teams. So to account for this variation, it is generally better to estimate the quality of a team's attack or defense based on the number and quality of chances rather than the outcome of those chances. Tottenham do very well on these underlying statistical metrics, better than Arsenal (or Chelsea).
A note: this does not mean I think there's no difference between players and teams in "finishing". It means, first, that I think a lot of "finishing" skill is already contained in the shots on target numbers - just putting a shot on target takes great skill, and you can ask Gylfi Sigurdsson how well G/SoT measures the quality of the shots taken over a relatively small sample. Second, I think that there is so much variation in the G/SoT numbers that we can't isolate finishing skill from the numbers alone in only a season of data, so it's better to use the underlying stats even if we do end up missing some real variation between players in finishing skill.
The Monte Carlo projection works by projecting a score for each game. I create a "mean goals scored" for each team for each game based on projected team quality and projected quality of opposition. I then model goals scored using the Poisson distribution, which is a pretty good model for EPL goal scoring, with that mean goals scored as the mean of the Poisson. Each game thus gets a projected score in each iteration, and I take the average outcomes over 10,000 iterations as my projection.
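Here is roughly what simulating a single game looks like as a Python sketch. To be clear about what is assumed: my real work lives in spreadsheets, so the function names are invented, the two mean-goals values (1.6 and 1.2) are placeholders rather than my actual NLD inputs, and the Poisson draw uses Knuth's textbook multiplication method simply because it needs nothing beyond the standard library.

```python
import math
import random


def poisson_sample(rng, mean):
    """Draw one Poisson-distributed goal count (Knuth's method)."""
    limit = math.exp(-mean)
    goals = 0
    product = rng.random()
    while product > limit:
        goals += 1
        product *= rng.random()
    return goals


def simulate_match(home_mean, away_mean, n_iter=10_000, seed=0):
    """Simulate one fixture n_iter times; return the share of iterations
    ending in a home win, a draw, and an away win."""
    rng = random.Random(seed)
    tally = {"home_win": 0, "draw": 0, "away_win": 0}
    for _ in range(n_iter):
        home_goals = poisson_sample(rng, home_mean)
        away_goals = poisson_sample(rng, away_mean)
        if home_goals > away_goals:
            tally["home_win"] += 1
        elif home_goals < away_goals:
            tally["away_win"] += 1
        else:
            tally["draw"] += 1
    return {result: count / n_iter for result, count in tally.items()}


# Placeholder means, NOT the real inputs for any actual fixture.
match_probs = simulate_match(home_mean=1.6, away_mean=1.2)
print(match_probs)
```

The full projection would repeat this for every remaining fixture within each iteration, add up the points, and sort the table at the end. This sketch also treats the two teams' goal counts as independent draws, which is a simplifying assumption.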
Obviously there are any number of things this model misses, and I don't mean to attribute false precision to it. Teams change, tactics change, football is a wonderfully complicated thing. But I think the numbers can be useful and fun, even if they are far from definitive.
Finally, thanks to the good folks in the soccer thread at Baseball Think Factory for helping me work through the logic and programming of the spreadsheets.