|Pythagoreans celebrate sunrise|
I first created these rankings for the NFL over at Brian Burke's Advanced NFL Stats Community site. After launching this blog in January, I created similar rankings for the NBA and College Basketball.
The basic idea is to use the information contained in the moneylines and totals used by the Vegas sports books to reverse engineer an implied power ranking. See the Methodology page for a simple example. In the same way that prediction markets like Intrade.com can be used to gain insight into things like the upcoming Supreme Court decision on the individual mandate, I'm following the money in the betting markets to gain insight on team (and pitcher) strength.
Methodology
Creating a set of rankings for baseball proved to be more challenging than my previous efforts for other sports. The NFL, NBA, and NCAAB are all point spread based markets. Point spreads were very convenient for my ranking systems because they were additive. If the Patriots were favored by 3 over the Steelers and the Steelers were favored by 3 over the Bengals, I found that the Patriots would be favored by 6 over the Bengals (give or take a point).
But in baseball, there is no point spread (or run spread, rather). There is simply not enough scoring in baseball to use run margin as a way of balancing the amount of money bet on either side. Instead, if you want to bet on the outcome of a baseball game, you bet on the moneyline. It is easy enough to convert the moneyline into an implied win probability, but I was initially at a loss as to how to convert that probability into a ranking.
Basically, I needed a way to convert probabilities (which don't add well) into runs (which do).
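As a rough sketch of the moneyline-to-probability step, here is one common way to do the conversion in Python. The function names are hypothetical, and the choice to strip the bookmaker's margin by simple proportional normalization is an assumption on my part, not necessarily the exact algebra used for these rankings.

```python
def moneyline_to_raw_prob(line):
    """Break-even win probability implied by a single American moneyline."""
    if line < 0:
        # Favorite: risk -line to win 100.
        return -line / (-line + 100.0)
    # Underdog: risk 100 to win line.
    return 100.0 / (line + 100.0)


def implied_probs(favorite_line, underdog_line):
    """Normalize both sides so the probabilities sum to 1.

    Assumption: the bookmaker's margin is spread proportionally
    across the two sides.
    """
    raw_fav = moneyline_to_raw_prob(favorite_line)
    raw_dog = moneyline_to_raw_prob(underdog_line)
    total = raw_fav + raw_dog  # exceeds 1 because of the margin
    return raw_fav / total, raw_dog / total


fav_prob, dog_prob = implied_probs(-200, 170)
print(round(fav_prob, 3), round(dog_prob, 3))
```

For a -200/+170 line, this puts the favorite at a little under 65%.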
Pythagoras to the Rescue
It should have been immediately obvious, but it took me a while to realize that one of the most famous formulas in baseball statistical analysis had exactly what I needed: Pythagorean Expectation. The Pythagorean formula in baseball takes runs scored and runs allowed by a team and converts them into an expected win probability. All I had to do was reverse the formula to derive expected runs from a win probability. Here is an example:
The moneyline for Saturday's Nationals-Orioles game was -200 for the Orioles and +170 for the Nationals. Using a little algebra, this implies that the betting market gives the Orioles about a 65% chance of winning the game.
So, that gives:
0.65 = Orioles^1.75 / (Orioles^1.75 + Nationals^1.75)
Nationals = Expected runs scored by the Nationals
Orioles = Expected runs scored by the Orioles
and 1.75 is the Pythagorean exponent (I derived this from trial and error)
But that still leaves two unknowns and only one equation. Fortunately, we can grab a second equation from another popular betting option: the run total (sometimes called the over/under). This is where you bet on whether you think the total runs scored by both teams combined will or won't exceed a specified amount. For the Nationals-Orioles game, the total was 8.5 runs. So, the second equation is:
Orioles + Nationals = 8.5
A little more algebra and we have:
Orioles = 5.0 runs expected
Nationals = 3.5 runs expected
These add up to the betting total of 8.5 runs and yield a 65% win probability for the Orioles when plugged into the Pythagorean expectation formula.
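The "little more algebra" can be sketched in a few lines of Python. Dividing the Pythagorean equation through gives a run ratio of (p / (1 - p))^(1 / 1.75), and the run total then fixes each side. The function names here are my own for illustration; the 1.75 exponent, the 65% probability, and the 8.5 total come from the example above.

```python
def pythagorean(runs_for, runs_against, exponent=1.75):
    """Win probability from expected runs scored and allowed."""
    num = runs_for ** exponent
    return num / (num + runs_against ** exponent)


def expected_runs(win_prob, run_total, exponent=1.75):
    """Split run_total so that pythagorean() reproduces win_prob."""
    if not 0.0 < win_prob < 1.0:
        raise ValueError("win_prob must be strictly between 0 and 1")
    # Ratio of this team's runs to the opponent's runs.
    ratio = (win_prob / (1.0 - win_prob)) ** (1.0 / exponent)
    team_runs = run_total * ratio / (1.0 + ratio)
    return team_runs, run_total - team_runs


orioles, nationals = expected_runs(0.65, 8.5)
print(round(orioles, 2), round(nationals, 2))  # roughly 5.0 and 3.5
print(round(pythagorean(orioles, nationals), 3))  # recovers 0.65
```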
Linear Regression
From this point, things become a bit more straightforward. I can now use the same linear regression techniques I used for other sports' rankings. Each game produces two equations:
Orioles Offense + Nationals Defense = 5.0 runs
Orioles Defense + Nationals Offense = 3.5 runs
Each team gets two variables, one for offense and one for defense/pitching. With 30 teams, that means I have 60 variables, so I just need to feed enough games into the regression model to determine the optimal values for my 60 variables. But it's actually a bit more complicated than that.
Because each team has a rotation of about 5 starting pitchers, it's as if each team is actually 5 different teams when it comes to defense. The Tigers are going to be expected to give up fewer runs (and have a higher win probability) when Justin Verlander is starting compared to when Rick Porcello is starting.
So, instead of having 1 variable for team defense, I create a variable for each starting pitcher. So, the equation for the Nationals-Orioles game looks more like this:
Orioles Offense + Nationals:Edwin Jackson = 5.0 runs
Orioles:Wei-Yin Chen + Nationals Offense = 3.5 runs
This complicates things a bit, but it's nice in that it allows me to derive a separate set of pitcher rankings from the betting market information. So, in addition to the team rankings (which will be based on average pitching strength), I will produce a separate ranking for starting pitchers. One caveat in all of this is that, technically, the starting pitcher rankings will also reflect fielding strength and bullpen strength of the given team, but there's no way to disentangle those contributions from the information I use in my rankings (moneylines and totals).
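To make the setup concrete, here is a minimal pure-Python sketch of how the two equations per game could be assembled and solved by least squares. Everything about the implementation is an assumption for illustration: the `off:` and `sp:` variable names, the dictionary layout, and especially the tiny ridge term, which I add only because offense and pitcher values are otherwise determined just up to a constant shift. It is not a description of the actual regression code behind the rankings.

```python
def build_rows(games):
    """Turn each game into two (variable names, expected runs) equations."""
    rows = []
    for g in games:
        # Home offense faces the away starter, and vice versa.
        rows.append(((f"off:{g['home']}", f"sp:{g['away']}:{g['away_sp']}"),
                     g["home_runs"]))
        rows.append(((f"off:{g['away']}", f"sp:{g['home']}:{g['home_sp']}"),
                     g["away_runs"]))
    return rows


def fit_ratings(games, ridge=1e-6):
    """Least-squares fit via the normal equations (A^T A + ridge*I) x = A^T b."""
    rows = build_rows(games)
    names = sorted({name for variables, _ in rows for name in variables})
    index = {name: i for i, name in enumerate(names)}
    n = len(names)
    ata = [[ridge if i == j else 0.0 for j in range(n)] for i in range(n)]
    atb = [0.0] * n
    for variables, target in rows:
        cols = [index[v] for v in variables]
        for i in cols:
            atb[i] += target
            for j in cols:
                ata[i][j] += 1.0
    # Gaussian elimination with partial pivoting.
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(ata[r][c]))
        ata[c], ata[pivot] = ata[pivot], ata[c]
        atb[c], atb[pivot] = atb[pivot], atb[c]
        for r in range(c + 1, n):
            factor = ata[r][c] / ata[c][c]
            if factor:
                for k in range(c, n):
                    ata[r][k] -= factor * ata[c][k]
                atb[r] -= factor * atb[c]
    # Back substitution.
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        rest = sum(ata[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (atb[r] - rest) / ata[r][r]
    return dict(zip(names, x))


# The single game from the example; a real run would use weeks of games.
games = [{"home": "Orioles", "away": "Nationals",
          "home_sp": "Wei-Yin Chen", "away_sp": "Edwin Jackson",
          "home_runs": 5.0, "away_runs": 3.5}]
ratings = fit_ratings(games)
```

With only one game, the split between an offense and the opposing starter is arbitrary; it takes many games, with teams facing different pitchers, for the individual values to become meaningful.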
Home Field Advantage
I actually left out a step in the above calculations and that is the correction for home field advantage. Home teams win about 54% of the time in baseball, so the betting market is bound to factor that in. Through trial and error experimentation, I found that home field advantage was worth about 0.4 runs. In contrast, it's worth about 2.5 points in the NFL and about 3.25 points in the NBA.
So, when building my equations for expected runs, I assume that the betting market adds 0.2 runs to the home team's expected runs and subtracts 0.2 runs from the visiting team's expected runs. To get neutral-field values, I reverse that adjustment. Going back once again to the Nationals-Orioles example, the actual equations would look like this:
Orioles Offense + Nationals:Edwin Jackson = 4.8 runs
Orioles:Wei-Yin Chen + Nationals Offense = 3.7 runs
(the Orioles were the home team on Saturday)
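In code, the correction amounts to shifting half of the home field value off the home team and onto the visitor. This is a small sketch with a hypothetical function name; the 0.4-run figure is the one quoted above.

```python
HOME_FIELD_RUNS = 0.4  # total home field advantage, in runs


def neutral_field_runs(home_runs, away_runs, home_field=HOME_FIELD_RUNS):
    """Remove the home field bump the market is assumed to have built in."""
    half = home_field / 2.0
    return home_runs - half, away_runs + half


home, away = neutral_field_runs(5.0, 3.5)
print(round(home, 1), round(away, 1))  # 4.8 3.7
```

Note that the adjustment leaves the run total unchanged; it only moves runs between the two sides.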
There's a tradeoff I need to consider when building these rankings. My goal is to determine team and pitcher strength as of today, so I'd prefer to just use recent moneylines and totals. But, in order to have enough "inter-connectedness" between the teams, I need to look back over several days and/or weeks. For the NFL, the lookback period I used was 4 weeks; for the NBA, it was 3 weeks.
For baseball, I am using 4 weeks. I initially expected that I could use a shorter lookback period than that, given that MLB teams play nearly every day. But because teams change starting pitchers every day and usually play multi-game series against the same opponent, four weeks seemed to work best (although I plan on tinkering with some alternate approaches).
On the to-do list
- Inter-league play - My method of handling inter-league play is a bit clumsy and I hope to improve on it. If there aren't any inter-league games in the lookback period of 28 days, it's as if I'm creating rankings for two completely different sports leagues and then interleaving them, with the hope that the average strength of the American and National leagues is roughly equivalent.
- Home field advantage - There are enough games in baseball that I hope to be able to derive team-specific home field advantage adjustments, as opposed to the simple 0.4 runs approach I described above.
- Designated hitter - Somewhat related to inter-league play, I think I may be able to generate rankings for each team with and without a designated hitter. When inter-league games are played, the league rules of the home team apply. So, for National League home games against an American League team, neither team is allowed a designated hitter, while for American League home games, both teams use one. By comparing moneylines with and without the DH, I should be able to derive an implied "Designated Hitter Ranking".
So, here is an overview of my ranking approach:
Step 1: Convert the moneyline for each game into an implied probability
Step 2: Use the Pythagorean expectation formula in combination with the Step 1 win probability and betting run total to derive expected runs scored for each team
Step 3: Adjust the expected runs for each team for home field advantage
Step 4: Run a linear regression for all games in the past 28 days, using team offense and starting pitcher as the independent variables
Step 5: Publish to blog
See part two of this post for the results.