Handicapping the Handicappers

Last year, I launched my thoroughbred betting market rankings, building off the same techniques I used to create similar rankings for the NFL, NBA, WNBA, MLB, College Basketball, College Football, and the NHL.

At the same time, I also launched roboCap, an automated horse race handicapping tool built from those rankings. roboCap allows you to enter any arbitrary field of horses, and it will generate predicted final odds for said field.

roboCap was backtested to maximize predictive accuracy of the final odds. In the original launch post, I found that roboCap was about as accurate as the morning line in predicting the odds of a given horse race. But that was backtesting, and the gambling world is littered with examples of methodologies and "trends" that work great when backtested, but vanish into the ether when making real world predictions.

With that in mind, I can now do a proper test of roboCap's accuracy by seeing how well it did on "out of sample" races - those races run subsequent to the November 2018 launch date. I have two goals for this post:
  1. Measure how accurate roboCap is in predicting closing odds and race results, and compare that to the accuracy of the morning line
  2. Objectively measure the accuracy of each track's morning line handicappers and report the results here (hence the post title)

How to measure roboCap's accuracy

I gave a lot of thought to this, and realized that there are many ways to measure accuracy. Rather than try to pick just one, I decided to measure accuracy in four different ways, as follows (a short code sketch illustrating all four follows the list):
  1. Kullback-Leibler divergence - Both roboCap and the morning line are, in effect, trying to predict a probability distribution (the one implied by the closing odds). Kullback-Leibler divergence is a measure that compares how closely two probability distributions match. The lower the score, the better the match. However, I think this puts the morning line at somewhat of a disadvantage, as there appears to be a convention among morning line handicappers to "herd" their odds in a way that avoids heavy favorites and extreme long shots. roboCap has no such qualms, which likely gives it an edge in predictive accuracy.
  2. Rank correlation to final odds - One way to address the "herding" problem is to simply rank the horses by the projected odds and then see how well that ranking correlates with the final odds ranking. For this measure, a higher score is better. You could argue that the morning line has an unfair edge here: if the morning line influences how the public actually bets, there is a self-fulfilling-prophecy element to measuring how well it predicts the final odds.
  3. Log-likelihood of the winning horse - We can also measure accuracy by seeing how well roboCap and the morning line predict race outcomes. A common way to judge probabilistic predictions is via log-likelihood. For each race, we see which horse won, and then calculate the logarithm of the probability that the morning line and roboCap gave for that horse to win. A higher log-likelihood means a more accurate prediction. A nice thing about this approach is that it also gives a measure of accuracy for the closing odds. However, this also likely puts the morning line at a disadvantage due to the aforementioned "herding".
  4. Rank correlation to finishing order - I feel this is probably the fairest measure. This measures how well the predicted odds correlate with the actual finishing order of the race. This avoids the morning line herding disadvantage, as well as the advantage the morning line has in possibly affecting the odds (the morning line can't affect finishing order). We can also quantify accuracy of the closing odds using this measure.
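To make these four measures concrete, here is a minimal sketch (in Python) of how each one could be computed for a single race. The six-horse field, its odds, and the finishing order are all invented for illustration, and the renormalization of implied win probabilities and the direction of the KL divergence are my assumptions rather than roboCap's exact conventions.

    # Minimal sketch of the four accuracy measures for one hypothetical race.
    import numpy as np
    from scipy.stats import spearmanr

    def implied_probs(odds_to_one):
        """Convert odds-to-one (e.g. 5/2 = 2.5) into win probabilities,
        renormalized so the field sums to 1 (strips out the takeout/overround)."""
        raw = 1.0 / (np.asarray(odds_to_one, dtype=float) + 1.0)
        return raw / raw.sum()

    def rank_corr(a, b):
        """Spearman rank correlation."""
        return spearmanr(a, b)[0]

    # One six-horse race (all numbers invented)
    morning_line = [2.5, 4.0, 6.0, 8.0, 10.0, 20.0]   # odds to one
    robocap      = [2.0, 5.0, 5.0, 9.0, 12.0, 30.0]
    closing      = [1.8, 4.5, 6.5, 9.0, 15.0, 25.0]
    finish_pos   = [1, 3, 2, 5, 4, 6]                 # 1 = winner
    winner       = finish_pos.index(1)

    p_ml, p_rc, p_close = (implied_probs(o) for o in (morning_line, robocap, closing))

    # 1. Kullback-Leibler divergence from the closing-odds distribution (lower is better)
    def kl_divergence(p, q):
        return float(np.sum(p * np.log(p / q)))
    print("KL(closing || morning line):", kl_divergence(p_close, p_ml))
    print("KL(closing || roboCap):     ", kl_divergence(p_close, p_rc))

    # 2. Rank correlation between predicted odds and closing odds (higher is better)
    print("rank corr to closing odds:", rank_corr(morning_line, closing), rank_corr(robocap, closing))

    # 3. Log-likelihood of the winning horse (higher, i.e. closer to zero, is better)
    print("winner log-likelihood:", np.log(p_ml[winner]), np.log(p_rc[winner]))

    # 4. Rank correlation between predicted odds and finishing order (higher is better)
    print("rank corr to finish order:", rank_corr(morning_line, finish_pos), rank_corr(robocap, finish_pos))

For the full sample, each per-race number would then be aggregated (presumably averaged) within each track bucket to produce the tables below.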

The results

And here are the results, with a table for each measure. Results are split by "top tracks" versus "other". I defined a "top track" to be one with an average win/place/show pool of at least $75k.

Kullback-Leibler divergence to final odds (lower is better)
Tracks        races     morning line   inpredictable
top tracks    8,218     0.088          0.106
all other     16,593    0.121          0.099
total         24,811    0.110          0.101

rank correlation to final odds (higher is better)
Tracks        races     morning line   inpredictable
top tracks    8,218     0.825          0.750
all other     16,593    0.779          0.782
total         24,811    0.794          0.771

log-likelihood of winning horse (higher is better)
Tracks        races     morning line   inpredictable   final odds
top tracks    8,218     -1.736         -1.743          -1.626
all other     16,593    -1.714         -1.659          -1.561
total         24,811    -1.722         -1.687          -1.582

rank correlation to finishing order (higher is better)
Tracks        races     morning line   inpredictable   final odds
top tracks    8,218     0.407          0.386           0.453
all other     16,593    0.391          0.415           0.463
total         24,811    0.396          0.405           0.460

roboCap outperforms the morning line under the Kullback-Leibler measure, but underperforms when rank correlated to the final odds. When it comes to actual race results, roboCap outperforms the morning line under both measures.

In general, the handicappers for the top tracks are more accurate than their lesser track counterparts, both in predicting closing odds as well as race results. For the "all other" bucket, roboCap actually outperforms the morning line in all four measures.

And the wisdom of the market remains undefeated, with significantly better performance than either roboCap or the morning line when predicting race results.

I continue to be surprised by how well roboCap performs, given the relative simplicity of the model. Everything about a horse's ability to win a race is boiled down to a single number: "Generic Lengths Advantage". Here is what roboCap does not consider when predicting odds: jockey, recency, weight, track, surface, distance, or pace. And it takes as inputs only closing odds and finishing margin. Yet, it is as accurate as your average morning line handicapper.

Results by track

What follows are a series of charts (using my favorite data visualization technique) summarizing the accuracy of each track's morning line handicappers. I will largely let the charts speak for themselves, but I will make a couple of observations:
  • In general, morning line handicapper accuracy correlates with average win/place/show handle and purse size
  • David Aragona, the official track handicapper for the NYRA, is very good at his job. He sets the line for Belmont, Aqueduct, and Saratoga. Belmont and Aqueduct consistently show up at the top of these charts, regardless of what measure we choose (Saratoga did not have enough races to qualify for the charts)

[Charts: morning line handicapper accuracy by track, one chart per accuracy measure]

Bonus Charts

The chart below shows how accurate each track's closing odds are in predicting winners. Note that I am using McFadden's Pseudo R-Squared here, which attempts to adjust for differences in the size of the field (e.g. if a certain track tends to run races with fewer horses, it would have an advantage in a metric like log-likelihood, but not necessarily McFadden's Pseudo R-Squared).
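As a rough sketch of what that adjustment looks like, here is how I would compute McFadden's pseudo R-squared over a set of races, using a uniform 1/n null model for an n-horse field. The null model choice and the race data below are my assumptions, not necessarily the exact calculation behind the chart.

    # Sketch of McFadden's pseudo R-squared across races, taken here as
    # 1 - LL(model) / LL(null), where the null model gives each horse in an
    # n-horse field a 1/n chance of winning.
    import numpy as np

    def mcfadden_r2(races):
        """races: list of (p_winner, field_size) pairs, one per race, where
        p_winner is the probability the model gave the horse that actually won."""
        ll_model = sum(np.log(p) for p, _ in races)
        ll_null  = sum(np.log(1.0 / n) for _, n in races)
        return 1.0 - ll_model / ll_null

    # Same winner probabilities, different field sizes: the raw log-likelihood is
    # identical, but larger fields face a tougher 1/n baseline and so score a
    # higher pseudo R-squared.
    small_fields = [(0.30, 6), (0.25, 6), (0.40, 7)]
    big_fields   = [(0.30, 10), (0.25, 12), (0.40, 11)]
    print(mcfadden_r2(small_fields))   # ~0.37
    print(mcfadden_r2(big_fields))     # ~0.51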

The least accurate track is Churchill Downs. I am guessing this is not due to a lack of sophistication in the Churchill parimutuel pool, but rather to Churchill running more competitive races with balanced fields that are harder to predict.

Many of the most "accurate" markets are at tracks with relatively small handles. I don't have any evidence to support this, but I would theorize that this could be due to inside information. I'm sure "insider trading" occurs at all tracks, but at the smaller tracks, the pools are small enough that insider information can materially skew the closing parimutuel odds. The chart below shows how market odds accuracy correlates with average win/place/show pool size.
[Chart: market odds accuracy vs. average win/place/show pool size]

And here are some charts on morning line odds accuracy (without the comparison to roboCap):
[Charts: morning line odds accuracy by track]