*Of course* the Warriors erased a 20-point deficit in the fourth quarter (despite being only the third playoff team to do so).

*Of course* Steph Curry hit that game-tying three from the corner to force overtime (despite having missed from the wing just three seconds prior).

But a 20-point comeback is anything but inevitable, and we tend to forget the games in which a blowout stays a blowout because, well, those games are forgettable. So what do the numbers say?

My own win probability model put the Warriors' chances as low as 0.2%. That low point occurred after a miss by the Warriors' Shaun Livingston with 6:24 left in the game and Golden State down by seventeen. Livingston would rebound his own miss for the put-back slam dunk, tripling his team's chances to 0.6%.

But that win probability estimate assumes teams that are evenly matched. The Warriors, however, are anything but an even match for the Pelicans. At home, they were 12.5-point favorites over the Pelicans in the first two games of their first round series. With game three being in New Orleans, the Warriors were still favorites, but not overwhelmingly so at just 5 points. If we use the win probability model calibrated to pre-game odds, the Warriors' comeback becomes *slightly* less improbable, with a low point of 0.4%.

How does this compare to other estimates of the Warriors' chances? I am aware of two others:

- Gambletron 2000 - A site that aggregates in-game betting data. Think of it as a stock ticker for each NBA game (along with many, many other sports)
- numberFire - The popular fantasy/analytics site which has recently begun publishing in-game probabilities for both football and basketball.

According to Gambletron, the Warriors' chances sank as low as 2.4% with around 6 or 7 minutes left in the fourth quarter. That is somewhat higher than my low point estimate of 0.4%.

numberFire's estimate is even further off from mine, with a Warriors' win probability of 6% right around the seven minute mark of the fourth quarter. While neither model is going to be objectively "right", this is clearly a significant difference.

So let's look at the raw data. I built my win probability model from play-by-play data spanning the 2000-2012 seasons. With over a thousand games per season, this makes for a fairly robust dataset. Here is how often teams came back from large deficits midway through the fourth quarter:

| trailing by | 7 min to go: games | 7 min: won | 7 min: pct | 6 min to go: games | 6 min: won | 6 min: pct | 5 min to go: games | 5 min: won | 5 min: pct |
|---|---|---|---|---|---|---|---|---|---|
| 20 | 594 | 0 | 0.0% | 613 | 0 | 0.0% | 587 | 0 | 0.0% |
| 19 | 703 | 0 | 0.0% | 724 | 0 | 0.0% | 708 | 0 | 0.0% |
| 18 | 764 | 3 | 0.4% | 760 | 1 | 0.1% | 754 | 1 | 0.1% |
| 17 | 839 | 6 | 0.7% | 887 | 1 | 0.1% | 884 | 2 | 0.2% |
| 16 | 914 | 2 | 0.2% | 960 | 3 | 0.3% | 921 | 1 | 0.1% |
| 15 | 1059 | 16 | 1.5% | 1004 | 7 | 0.7% | 1036 | 2 | 0.2% |
| 14 | 1138 | 19 | 1.7% | 1129 | 14 | 1.2% | 1194 | 4 | 0.3% |

The Warriors were down 17 with six minutes to go in their game. Over the course of 13 NBA seasons, there were 887 games in which a team trailed by that many with that much time remaining. Only __once__ did that team go on to win, a raw frequency of just 0.1%. Eyeballing these numbers, my model estimate of about 0.5% seems most in line with the actual data, compared to numberFire and Gambletron.
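One win in 887 games is a very small sample of successes, so it helps to put a rough error bar on that raw frequency. The sketch below uses a Wilson score interval, a standard interval for a binomial proportion. It is not part of the win probability model; the function name `wilson_interval` and the roughly 95% confidence level (`z=1.96`) are assumptions chosen for illustration.

```python
import math

def wilson_interval(wins, games, z=1.96):
    """Wilson score interval for a binomial proportion.

    z=1.96 corresponds to roughly 95% confidence (an assumption made
    for this illustration, not something taken from the model).
    """
    if games <= 0:
        raise ValueError("games must be positive")
    p = wins / games
    denom = 1 + z**2 / games
    center = (p + z**2 / (2 * games)) / denom
    half = z * math.sqrt(p * (1 - p) / games + z**2 / (4 * games**2)) / denom
    return max(0.0, center - half), min(1.0, center + half)

# Down 17 with six minutes to go, all games: 1 win in 887 (from the table above).
low, high = wilson_interval(1, 887)
print(f"raw frequency: {1 / 887:.2%}")
print(f"approx. 95% interval: {low:.2%} to {high:.2%}")
```

Under these assumptions the interval runs from roughly 0.02% to 0.6%. Keep in mind that this covers all games, underdogs and favorites alike, so it is only a loose yardstick for estimates that account for the point spread.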

But this dataset includes all games, underdogs and favorites alike. What if we restrict the view to favorites? The Warriors were 5 point favorites at New Orleans. The table below only looks at outcomes for trailing teams that were favored by 2.5 to 7.5 points:

2.5 to 7 point favorites:

| trailing by | 7 min to go: games | 7 min: won | 7 min: pct | 6 min to go: games | 6 min: won | 6 min: pct | 5 min to go: games | 5 min: won | 5 min: pct |
|---|---|---|---|---|---|---|---|---|---|
| 20 | 55 | 0 | 0.0% | 65 | 0 | 0.0% | 68 | 0 | 0.0% |
| 19 | 74 | 0 | 0.0% | 78 | 0 | 0.0% | 91 | 0 | 0.0% |
| 18 | 83 | 0 | 0.0% | 96 | 0 | 0.0% | 83 | 0 | 0.0% |
| 17 | 117 | 1 | 0.9% | 121 | 0 | 0.0% | 99 | 0 | 0.0% |
| 16 | 116 | 0 | 0.0% | 107 | 0 | 0.0% | 102 | 0 | 0.0% |
| 15 | 152 | 1 | 0.7% | 132 | 1 | 0.8% | 125 | 0 | 0.0% |
| 14 | 161 | 4 | 2.5% | 133 | 1 | 0.8% | 142 | 0 | 0.0% |

And here is the data for 7.5 to 12 point favorites:

| trailing by | 7 min to go: games | 7 min: won | 7 min: pct | 6 min to go: games | 6 min: won | 6 min: pct | 5 min to go: games | 5 min: won | 5 min: pct |
|---|---|---|---|---|---|---|---|---|---|
| 20 | 13 | 0 | 0.0% | 12 | 0 | 0.0% | 15 | 0 | 0.0% |
| 19 | 17 | 0 | 0.0% | 16 | 0 | 0.0% | 16 | 0 | 0.0% |
| 18 | 16 | 0 | 0.0% | 19 | 0 | 0.0% | 26 | 0 | 0.0% |
| 17 | 28 | 1 | 3.6% | 24 | 1 | 4.2% | 21 | 0 | 0.0% |
| 16 | 28 | 0 | 0.0% | 35 | 0 | 0.0% | 30 | 0 | 0.0% |
| 15 | 27 | 1 | 3.7% | 34 | 1 | 2.9% | 31 | 0 | 0.0% |
| 14 | 43 | 2 | 4.7% | 35 | 1 | 2.9% | 40 | 1 | 2.5% |

As you can see, the data gets fairly sparse and noisy once we start slicing and dicing. The art of building a win probability model is drawing smooth, rational lines through a messy cloud of data points. Feel free to draw your own conclusions, but the data above gives me confidence in my model's estimates, as well as a proper appreciation of what Steph Curry and the Warriors pulled off Thursday night in New Orleans.
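To make the noise concrete, here is a minimal sketch of one crude way to smooth a sparse column: pool each deficit with its immediate neighbors before computing a win rate. This is only an illustration, not how the win probability model itself works. The helper `pooled_rate` and the one-point window are assumptions; the counts are the six-minutes-to-go column for 7.5 to 12 point favorites from the table above.

```python
# Six minutes to go, 7.5 to 12 point favorites, copied from the table above.
# Maps deficit -> (games, wins).
SIX_MIN_HEAVY_FAVORITES = {
    20: (12, 0), 19: (16, 0), 18: (19, 0), 17: (24, 1),
    16: (35, 0), 15: (34, 1), 14: (35, 1),
}

def pooled_rate(cells, deficit, window=1):
    """Win rate after pooling a deficit with neighbors within `window` points.

    `window=1` is an arbitrary choice for illustration; a real model would
    smooth far more carefully.
    """
    games = wins = 0
    for d in range(deficit - window, deficit + window + 1):
        g, w = cells.get(d, (0, 0))  # deficits outside the table contribute nothing
        games += g
        wins += w
    return wins / games if games else float("nan")

for deficit in sorted(SIX_MIN_HEAVY_FAVORITES, reverse=True):
    games, wins = SIX_MIN_HEAVY_FAVORITES[deficit]
    raw = wins / games
    smooth = pooled_rate(SIX_MIN_HEAVY_FAVORITES, deficit)
    print(f"down {deficit}: raw {raw:.1%}, pooled {smooth:.1%}")
```

For the 17-point row, the raw rate is 1 win in 24 games, while pooling with the 16- and 18-point rows gives 1 win in 78 games. A single game moves the raw cell by several percentage points, which is why the individual cells should not be read too literally.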

> ... But that win probability estimate assumes teams that are evenly matched. ...

I was under the impression that your win probability model takes the Vegas spread as an input (http://www.inpredictable.com/2015/02/updated-nba-win-probability-calculator.html):

"... The win probability is a function of game time, point differential, possession, and the Vegas point spread. ..."

> ... How does this compare to other estimates of the Warriors' chances? ...

There's always the classic: Assume the score differential is a 1-dimensional random walk with a per-game standard deviation of 12.3 points. Then with 0.13 of a game left, the standard deviation over the rest of the game will be about 4.5 points. That makes the comeback a four-sigma event with a probability around .0001.
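The back-of-the-envelope calculation above can be sketched in a few lines. This assumes the final margin change is normally distributed, uses the 12.3-point per-game standard deviation and the 0.13 fraction of a game quoted above, and takes the 17-point deficit from the post. The function name `random_walk_comeback` is hypothetical, and the sketch ignores overtime by treating a comeback as gaining more than the deficit.

```python
import math

def random_walk_comeback(deficit, fraction_left, sigma_game=12.3):
    """Normal-approximation tail probability for a random-walk score margin.

    Returns (std dev over the remaining time, deficit in sigmas, tail probability).
    """
    # Variance of a random walk scales linearly with time, so the standard
    # deviation scales with the square root of the fraction of the game left.
    sigma_left = sigma_game * math.sqrt(fraction_left)
    z = deficit / sigma_left
    # Upper tail of the standard normal: P(Z > z).
    tail = 0.5 * math.erfc(z / math.sqrt(2))
    return sigma_left, z, tail

sigma_left, z, tail = random_walk_comeback(deficit=17, fraction_left=0.13)
print(f"std dev over remaining time: {sigma_left:.2f} points")
print(f"deficit in sigmas: {z:.2f}")
print(f"tail probability: {tail:.6f}")
```

With these inputs the remaining standard deviation comes out a little under 4.5 points and the deficit a little under four sigma, so the tail probability lands in the same order of magnitude as the .0001 figure.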

I also wonder how much market factors like the bid-ask spread and the granularity of contract sizes distort the numbers that Gambletron2000 produces in extreme situations like this.

> ...I was under the impression that your win probability model takes the Vegas spread as an input...

It does, but the default graph uses zero as the point spread. So it explicitly assumes evenly matched teams.

I agree with you that the Gambletron numbers get harder to interpret as you get closer to a 1.0 probability. You rarely get fair odds on extreme long shot bets.

Hi Mike,

Here are my brief thoughts on win probability models.

1- They are really hard to do, and I appreciate the effort. They add much to the context of the game.

2- As you describe here, you are using past results to predict the future. As a result, you are reporting estimated probabilities, not true probabilities. As with any estimates, they come with error. For example, I don't think your approaches can accurately distinguish between a 1 in 500 chance of a Warriors comeback and a 1 in 50 chance. That uncertainty is important, albeit hard to report. Even saying "The Warriors had about a 1 in 200 chance of a comeback" is far preferable to "The Warriors had a 1 in 200 chance of a comeback."

3- Not sure which of your approaches incorporate pre-game knowledge, but any probability estimate that doesn't is missing context.

-Mike

Thanks Michael, appreciate the feedback. I agree that my model doesn't represent "true" probabilities, but I doubt any model could given that we're trying to predict human behavior.

I view the model as saying "Here is how past NBA teams have fared when in a similar situation". It is up to the modeler to define "similar situations", and how granular you go (and how you smooth out the noise in the historical results).

As you call out, that is difficult to communicate. I could preface every statement with "according to my modeling", or "based on historical results", but that gets tiresome after a while. Using "about" as you suggest may be a compromise, but these caveats will be ignored by most.

To your last question, the underlying model takes in pre-game knowledge. Point spread is an independent variable in the regression. For the 50/50 graphs, I deliberately set the point spread input to 0 to arrive at an unbiased view of win probability. So it deliberately strips away pre-game knowledge, which is a bad thing if you're using the model to bet, but a good thing if you want to measure things like win probability added.