Sunday, November 17, 2013

NBA Win Probability Calculator

This past summer, I rolled out my initial attempt at a win probability model for the NBA (introductory post | the insanity that was Game 6 | graphs for all playoff games | win probability added). Since that time, I have refined the model somewhat via more rigorous cross-validation of the model parameters. And while I may still have some tinkering to do, I'm ready to share a beta version of an online Win Probability calculator.

Much like the Win Probability tool from Advanced NFL Stats, this tool allows you to input the game state (time remaining, margin, possession) and it will return the win probability of said game state.

So what's the point? From my perspective, a win probability model has three main potential uses:

  • Entertainment Value - A win probability model allows you to put a precise number on the outcome everybody cares about most: winning. You can then track and chart that number throughout the course of the game, giving each game its own visual fingerprint. You can see where the turning points were, where the low points were, and where things just went batshit insane.
  • Player Evaluation - Win Probability Added at the player level is the most direct way to put numbers on those eternally debated questions of who is "clutch" and who is "most valuable" (although its usefulness as a predictor of future performance remains to be seen).
  • In-Game Decisions - Compared to baseball, and especially football, a basketball coach faces relatively few strategic decisions that a win probability model could inform. One that does come to mind, though, is when a team that is trailing late should start fouling, and it's a topic I intend to address in a future post.
Once again, here is a link to the calculator. I hope some of you find it useful. More updates to come regarding player evaluation and in-game decision making.

Appendix - Methodology Overview

  • Data source: All NBA games from the 2004 through 2011 seasons
  • Modeling Approach: Locally weighted logistic regression. I used R's locfit package for the heavy lifting, with bandwidths optimized via cross-validation.
  • Model inputs: Betting point spread, time remaining, margin, possession. While the tool I'm sharing here does not provide probabilities as a function of point spread, I am controlling for point spread in the modeling, such that the probabilities shared with this tool should properly reflect win probability for evenly matched teams. In other words, I am eliminating selection bias in the results and calculating probability as a function of pure game state.
  • Overtime: This is a bit of a cheat, but I am modeling overtime as if it were the fourth quarter. So, a team down 2 with 2:30 to go in the fourth quarter has the same win probability as a team down 2 with 2:30 to go in overtime.
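
For readers unfamiliar with locally weighted logistic regression, here is a minimal sketch of the idea in Python. It is not the locfit implementation and not the model behind the calculator: it uses only two inputs (seconds remaining and margin), a Gaussian kernel, a local linear fit, hand-picked bandwidths, and synthetic games generated by a made-up formula. The names `simulate_games` and `local_logistic_predict` are hypothetical.

```python
import numpy as np


def simulate_games(n, seed=0):
    """Synthetic (seconds_remaining, margin) rows and win flags; NOT real NBA data."""
    rng = np.random.default_rng(seed)
    seconds = rng.uniform(0.0, 2880.0, n)
    margin = rng.normal(0.0, 10.0, n)
    # Made-up truth: a given lead is worth more as the clock runs down.
    scale = 0.25 * np.sqrt(seconds) + 1.0
    p_win = 1.0 / (1.0 + np.exp(-1.7 * margin / scale))
    won = (rng.uniform(size=n) < p_win).astype(float)
    return np.column_stack([seconds, margin]), won


def local_logistic_predict(X, y, x0, bandwidth, n_iter=25, ridge=1e-6):
    """Estimate P(win) at game state x0 from a kernel-weighted logistic fit."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Distance from the query state, measured in bandwidth units.
    z = (X - np.asarray(x0, dtype=float)) / np.asarray(bandwidth, dtype=float)
    # Gaussian kernel: games close to x0 get the most weight.
    w = np.exp(-0.5 * np.sum(z ** 2, axis=1))
    # Design matrix centered at x0, so the intercept is the log-odds at x0.
    A = np.column_stack([np.ones(len(z)), z])
    beta = np.zeros(A.shape[1])
    # Newton-Raphson on the weighted (lightly ridge-penalized) log-likelihood.
    for _ in range(n_iter):
        eta = np.clip(A @ beta, -30.0, 30.0)
        p = 1.0 / (1.0 + np.exp(-eta))
        grad = A.T @ (w * (y - p)) - ridge * beta
        hess = (A * (w * p * (1.0 - p))[:, None]).T @ A + ridge * np.eye(A.shape[1])
        step = np.linalg.solve(hess, grad)
        beta += step
        if np.max(np.abs(step)) < 1e-8:
            break
    return 1.0 / (1.0 + np.exp(-beta[0]))


X, won = simulate_games(5000)
bandwidth = (300.0, 3.0)  # assumed: 300 seconds and 3 points, not cross-validated
p_early = local_logistic_predict(X, won, (2400.0, 10.0), bandwidth)  # up 10, 40:00 left
p_late = local_logistic_predict(X, won, (300.0, 10.0), bandwidth)    # up 10, 5:00 left
print(f"up 10 with 40:00 left: {p_early:.3f}")
print(f"up 10 with  5:00 left: {p_late:.3f}")
```

The bandwidths control how far the fit "looks" around each game state. In the real model they are chosen by cross-validation rather than set by hand as they are here.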


  1. Why not an ability to distinguish between home and away? Surely this makes a difference... or is it negligible?

  2. What does it mean to "control for point spread"?

    1. It's an explanatory variable in the model (the pre-game Vegas point spread). There's selection bias in the raw results. The data sample of games with a team leading by 10 with 5 minutes to go is going to be biased towards better teams, because better teams tend to build leads. This will affect the raw win percentages, making a 10 point lead appear "safer" than its true value for evenly matched teams.

      Including the Vegas point spread as a variable allows me to peel out that bias in the results and return a probability that is more reflective of pure game state (time remaining, margin, and possession).
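
To make the selection bias concrete, here is a toy calculation in Python. It is only a sketch under loudly stated assumptions, none of which come from the actual model: the final margin is the pre-game spread plus normal noise with a standard deviation of 12 points, both accrue evenly over the game, and spreads across games are normal with a standard deviation of 6 points. The function names are hypothetical.

```python
import math


def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def win_prob_given_spread(lead, frac_remaining, spread, game_sd=12.0):
    """P(leading team wins) when it was favored by `spread` points pre-game."""
    drift = spread * frac_remaining           # expected points gained from here on
    sd = game_sd * math.sqrt(frac_remaining)  # noise in the rest of the game
    return normal_cdf((lead + drift) / sd)


def raw_win_prob(lead, frac_remaining, spread_sd=6.0, game_sd=12.0):
    """P(leading team wins), averaged over the spreads of teams that hold this lead."""
    frac_played = 1.0 - frac_remaining
    # Current margin = spread * frac_played + noise, so (spread, margin) is bivariate normal.
    var_margin = (spread_sd * frac_played) ** 2 + game_sd ** 2 * frac_played
    cov = spread_sd ** 2 * frac_played
    mean_spread = lead * cov / var_margin                # E[spread | margin = lead]
    var_spread = spread_sd ** 2 - cov ** 2 / var_margin  # Var[spread | margin = lead]
    mean_final = lead + mean_spread * frac_remaining
    sd_final = math.sqrt(game_sd ** 2 * frac_remaining
                         + frac_remaining ** 2 * var_spread)
    return normal_cdf(mean_final / sd_final)


# Up 10 at halftime: evenly matched teams vs. all teams that hold such a lead.
even = win_prob_given_spread(lead=10, frac_remaining=0.5, spread=0.0)
raw = raw_win_prob(lead=10, frac_remaining=0.5)
print(f"evenly matched: {even:.3f}   raw average: {raw:.3f}")
```

Under these made-up parameters the raw figure comes out a little more than a percentage point higher (about 0.89 versus about 0.88), because the teams holding a 10 point lead were, on average, favored before the game. Setting the spread to zero removes that tilt.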

  3. According to this, TMac's 13 points vs the Spurs had a 0.0% chance of happening. Surely, since it has occurred, the probability should be slightly above zero?

    1. TMac is not bound by the laws of probability.

      More seriously, the model returns a non-zero probability for that situation (down 8, with possession, and 0:44 on the clock). But it rounds to 0.0%.

      I looked at the raw numbers. Since 2000, there have been 244 games in which a team had possession, down 8, with between 0:45 and 0:40 on the clock. 243 out of 244 games ended in a loss. TMac's comeback was the only victory. And oddly enough, there have been 229 games with a team down 7 with 0:45-0:40 to go, and no team has won.

      All that being said, the model smooths out actual results, and it looks like the smoothing may not be quite right there. I am working on an updated model that should be rolling out soon. The new model returns a 0.2% probability for that situation, which seems more appropriate.
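
As a rough sanity check on those raw counts, the short Python sketch below computes the empirical win rates and a 95% Wilson score interval for the down-8 bucket. The interval and the pooling of the down-7 and down-8 buckets are illustrative choices, not part of the model, and `wilson_interval` is a hypothetical helper name.

```python
from math import sqrt


def wilson_interval(wins, games, z=1.96):
    """Wilson score interval for a win rate (z = 1.96 gives roughly 95%)."""
    p = wins / games
    denom = 1.0 + z ** 2 / games
    center = (p + z ** 2 / (2.0 * games)) / denom
    half = z * sqrt(p * (1.0 - p) / games + z ** 2 / (4.0 * games ** 2)) / denom
    return center - half, center + half


# Counts quoted above: (wins, games) for teams with 0:45-0:40 left.
down_8 = (1, 244)
down_7 = (0, 229)

rate_down_8 = down_8[0] / down_8[1]
rate_pooled = (down_8[0] + down_7[0]) / (down_8[1] + down_7[1])
low, high = wilson_interval(*down_8)

print(f"down 8: {rate_down_8:.2%}, plausible range {low:.2%} to {high:.2%}")
print(f"down 7 or 8 pooled: {rate_pooled:.2%}")
```

The down-8 rate alone is about 0.4%, and pooling it with the winless down-7 bucket gives about 0.2%, in line with what the updated model returns.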