Judging Win Probability Models

February 11, 2018 update: The Brier score chart at the bottom of this post had an incorrect value for the ESPN "Start of Game" score. The corrected numbers (with updates through 2/10/18) can be found in this tweet. With the update, my comments regarding the ESPN model being too reactive no longer apply.

Win probability models tend to get the most attention when they are "wrong". The Atlanta Falcons famously had a 99.7% chance to win Super Bowl LI according to ESPN, holding a 28-3 lead in the third quarter with the Patriots facing fourth down. Google search interest in "win probability" reached a five year high in the week following the Patriots' improbable comeback.



Some point to the Falcons' 99.7% chance, and other improbable results, as evidence of the uselessness of win probability models. But a 99.7% prediction is not certainty; it should be incorrect 3 out of every 1,000 times. Of course, it's not like we can replay last year's Super Bowl 1,000 times (unless you live inside the head of a Falcons fan).

So, in what sense can a probability model ever be wrong? As long as you don't predict complete certainty (0% or 100%), you can hand wave away any outcome, as I did above with the Falcons collapse. Or take another high profile win probability "failure": the November 2016 Presidential Election. On the morning of the election, Nate Silver's FiveThirtyEight gave Hillary Clinton a 71% chance of winning the presidency and Donald Trump a 29% chance.

FiveThirtyEight's favored candidate, Clinton, did not win the election, so in one sense you could say they were "wrong". And this is often how it plays out in public discourse. The media, and humans in general, have a hard time judging and understanding probabilities, and often round any probability estimate up or down to a nuance-free binary prediction.

But that 29% chance FiveThirtyEight gave Trump now looks very prescient, if you're grading on a curve:

FiveThirtyEight's model gave Donald Trump a better chance than any of the other high profile election models, which should count for something. One way to make it count, quite literally, is to calculate the Brier Score of each prediction. The Brier Score is a way to quantify the accuracy of a probabilistic prediction, and the calculation is very simple: it is the squared difference between the predicted probability and the actual outcome - where you assign 0 if the outcome did not happen and 1 if it did. The lower the Brier Score, the better the model did.

So, FiveThirtyEight's Brier Score for predicting the probability of a Trump presidency was (0.29 - 1)^2 = 0.504. In contrast, the Princeton Election Consortium projected a Trump win probability of just 7%, leading to a higher (i.e. worse) Brier Score of 0.865.
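As a concrete illustration, here is a minimal Python sketch of that calculation, applied to the two election forecasts above (the function name is my own shorthand, not from any particular library):

```python
def brier_score(predicted_prob, outcome):
    """Squared difference between a predicted probability and the
    actual outcome (1 if the event happened, 0 if it did not)."""
    return (predicted_prob - outcome) ** 2

# Trump won, so the outcome is coded as 1.
print(brier_score(0.29, 1))  # FiveThirtyEight: (0.29 - 1)^2 = 0.5041
print(brier_score(0.07, 1))  # Princeton Election Consortium: (0.07 - 1)^2 = 0.8649
```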

For this post, I will use the Brier Score to compare the accuracy of my NBA win probability model to the recently launched ESPN NBA model.

Inpredictable vs ESPN

For the past four years, I have been publishing in-game win probabilities for all NBA games on this site (with one major revamp along the way). Beginning with the 2017-18 NBA season, ESPN started publishing its own graphs, based on its own proprietary model, both on its website and in its app.

Motivated by equal parts academic curiosity and a deeply ingrained competitive streak, I wanted to see how similar the two models were, and ultimately, which one was more accurate in predicting actual outcomes.

The nice thing about in-game win probability models is that they make a lot of predictions, so I have a very large sample size to work with. To do this, I matched my model's estimates to the ESPN win probabilities on a play-by-play basis. Due to differences in data sources, I had to drop a handful of games, and for the games I did keep, I was not always able to match up every play. For the purpose of this analysis, I am only counting plays for which I have a win probability estimate from both models.
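My actual matching scripts are messier than this, but the gist can be sketched with pandas. The file names and column names below (game_id, period, clock, wp_inpred, wp_espn) are hypothetical stand-ins, not the actual formats of either data source:

```python
import pandas as pd

# Hypothetical exports of each model's play-by-play win probabilities,
# one row per play, each with a home-team win probability estimate.
inpred = pd.read_csv("inpredictable_wp.csv")  # game_id, period, clock, wp_inpred
espn = pd.read_csv("espn_wp.csv")             # game_id, period, clock, wp_espn

# An inner join keeps only the plays with an estimate from both models,
# dropping unmatched plays and unmatched games.
matched = inpred.merge(espn, on=["game_id", "period", "clock"], how="inner")
```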

For example, here is how the two models differed for this past year's Christmas matchup between the Cavaliers and the Warriors:

For this game, you can see that the probabilities were fairly consistent, as shown by their similar Brier Scores (although ESPN is the narrow winner here). Also note that both probability graphs start with the Warriors at greater than 50% probability, reflecting the fact that the Warriors were favored to beat the Cavs at home. The Inpredictable model uses the Vegas point spread to set the pre-game probability - the Warriors were a five point favorite. The ESPN model uses its own Basketball Power Index (BPI) to set its pre-game probability, which was more bullish on the Warriors (perhaps driven by the absence of Steph Curry, which the point spread would account for but BPI would not).

And for other games, the two models show more of a divergence. On December 11, the Houston Rockets fell behind the New Orleans Pelicans by as much as 13 points before mounting a fourth quarter comeback and ultimately winning by four. Here is the win probability graph comparison:

Both graphs start with the Rockets at about an 80% win probability (they were a 12.5 point favorite). But as New Orleans built their early lead, the ESPN model was more reactive and gave the Pelicans a greater probability of winning the game. For example, with 2:33 left in the second quarter, the Pelicans led by seven points. According to the ESPN model, the game was a 50/50 toss-up at that point, while Inpredictable still had the Rockets favored with a 65% win probability.

And in general, I've noticed that the ESPN model tends to be more reactive than mine, in particular when an underdog builds an early lead. Now, in this case, my model was more correct (or "less wrong"), as shown by the lower Brier score. But that is just one game, and on the face of it there is nothing wrong with being more reactive to new data. As John Maynard Keynes once said (or was it Samuelson?): "When the facts change, I change my mind. What do you do?"

To settle this, I have compiled the Brier score for every play in which I have both an Inpredictable and ESPN probability, and then summarized that score by various phases of the game.
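Building on the hypothetical `matched` table from the earlier sketch, the per-play Brier scores and the phase-of-game summary would look roughly like this (home_win and phase are again assumed columns, not actual field names):

```python
# home_win is 1 if the home team ultimately won the game, 0 otherwise;
# phase labels the portion of the game each play falls in (pre-game, Q1, Q2, ...).
matched["brier_inpred"] = (matched["wp_inpred"] - matched["home_win"]) ** 2
matched["brier_espn"] = (matched["wp_espn"] - matched["home_win"]) ** 2

# Average Brier score for each model, by phase of the game.
summary = matched.groupby("phase")[["brier_inpred", "brier_espn"]].mean()
print(summary)
```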


Thus far, the Inpredictable model holds a slight lead over ESPN, with a 0.162 Brier score compared to ESPN's 0.166. Also note that Inpredictable outperforms ESPN despite having less accurate starting win probabilities (see the "Start of Game" bar). This is a vindication of ESPN's Basketball Power Index, as it appears to be outperforming the betting market in assessing pre-game win probabilities (as noted above, my starting win probabilities are derived from the betting market point spread).

This chart also supports the notion that the ESPN win probability model is in fact too reactive early in the game. As the game progresses, one would expect Brier scores to decrease, and that is what you see for the Inpredictable scores. But for ESPN, the Brier score actually goes up in the first quarter compared to the pre-game score, and its Q2 Brier score is no better than the pre-game number.

What this means is that the ESPN model would have been better off ignoring the actual game results of the first two quarters entirely and sticking with its pre-game estimate. Sometimes new data can actually lead you astray. I'm reminded of a Ken Pomeroy study which showed that the pre-season college basketball AP poll was more accurate in predicting tournament winners than the final poll prior to the tournament. This despite the final poll reflecting a full season's worth of results, injuries, roster changes, etc.

Also note that my model is actually more reactive toward the very end of the game, projecting more certainty in the last minute than ESPN. For example, with 1:13 left in the Rockets-Pelicans game, Houston had possession and a three point lead. My model put the Rockets' chances at 95.5%, while ESPN said they only had an 83% chance. Based on spot checking of actual data, I'm inclined to believe the "correct" answer is somewhere in between. I do think there is a good chance my model becomes too confident in end-of-game situations. Building a win probability model from actual data is an exercise in finding the "Goldilocks zone" of smoothing, and I may have fallen out of that zone for the final couple of minutes of game time. Smooth too much and you pull in too many irrelevant data points; smooth too little and you create an overfit, erratic model with irrational results.
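To make that trade-off concrete, here is a toy sketch (emphatically not my actual model) of estimating win probability straight from historical plays, where the window widths control how much smoothing is applied:

```python
import numpy as np

def empirical_win_prob(history, seconds_left, margin, time_window=60, margin_window=2):
    """Toy win probability estimate: the win rate among historical plays
    that fall within a window of the current game state.

    history: array of rows (seconds_left, score_margin, team_won).
    Wider windows smooth more (pulling in less comparable plays);
    narrower windows smooth less (noisier, more erratic estimates).
    """
    t, m, won = history[:, 0], history[:, 1], history[:, 2]
    mask = (np.abs(t - seconds_left) <= time_window) & (np.abs(m - margin) <= margin_window)
    if not mask.any():
        return np.nan  # no comparable historical plays in the window
    return won[mask].mean()
```

Shrink the windows and the estimate chases noise in a handful of similar plays; widen them and late-game situations get blended with plays that are no longer comparable.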

I will continue to compile the results of future games this season and produce a final tally later this year, but I'm pleasantly surprised by the results thus far. There are a lot of smart people working on the ESPN Analytics team, and it's a nice validation of my model that it is as accurate as what that group put together. If their model had come out with a Brier score under 0.100, it would have indicated a fundamental shortcoming in my own modeling work. But for now, I am still confident in saying the Inpredictable model is a reflection of the "true" win probabilities of an NBA game.