Thursday, December 31, 2015

2015 Early Season Power Rankings - The Results

In which we rank the rankings...

Power rankings are everywhere, but how do we know if they are any good? If they are intended to be predictive, we can test them for their predictive value. So, for the past three years, I have archived a cross section of power rankings from early in the NFL season (after week four). Then, once the season is done, I assess each ranking's accuracy in predicting future wins in that same season.

Each team's win percentage for weeks 5-16 (week 17 excluded) is rank ordered, with teams on identical records sharing the average of the ranks they occupy (so two teams tied for first each get 1.5). This ordering is then tested for agreement with each power ranking using the Spearman rank correlation coefficient. A Spearman value of 100% would indicate perfect agreement between two rankings (e.g. if 1-32 in win percentage lined up identically with the power ranking in question). A value of -100% would indicate complete disagreement between the two lists, meaning 1-32 in win percentage lines up with 32-1 in the power ranking. A value of 0% would mean, roughly, no correlation between the two rankings.
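For two rankings with no ties, Spearman's coefficient has a simple closed form, rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1)), where d is the gap between a team's two ranks. The Python sketch below applies that shortcut to a hypothetical four-team league (the ranks are made up purely for illustration) to show where the 100%, -100% and 0% values come from. The function name `spearman_untied` is an illustrative choice, and the shortcut is only exact when neither list contains ties.

```python
def spearman_untied(rank_a, rank_b):
    """Spearman's rho for two tie-free rankings of the same n teams.

    Uses the shortcut rho = 1 - 6 * sum(d**2) / (n * (n**2 - 1)),
    where d is the difference between a team's two ranks.
    """
    if len(rank_a) != len(rank_b):
        raise ValueError("both rankings must cover the same teams")
    n = len(rank_a)
    d_squared = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


# Hypothetical four-team league: position i holds each list's rank for team i.
win_rank = [1, 2, 3, 4]
print(spearman_untied(win_rank, [1, 2, 3, 4]))  # same order     -> 1.0
print(spearman_untied(win_rank, [4, 3, 2, 1]))  # reversed order -> -1.0
print(spearman_untied(win_rank, [2, 4, 1, 3]))  # no relation    -> 0.0
```

Real win-percentage ranks do contain ties (the 1.5s and 8.5s in the table), and in that case the safe approach is to compute the ordinary correlation of the average ranks, which the shortcut above reduces to when there are no ties.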

I archived the week 4 power rankings in this post from October. As a reminder, the rankings I evaluate are: Football Outsiders DVOA, ESPN's FPI (as a replacement for Brian Burke's AFA efficiency rankings), the Simple Ranking System, ESPN's official NFL power rankings, FiveThirtyEight's Elo rankings, and the Betting Market Rankings published here at inpredictable.

The table below ranks teams in order of week 5-16 win percentage and lays that alongside where each system ranked that team after week four. The Arizona Cardinals and the Carolina Panthers led the league in win percentage over those 12 weeks, both going 10-1. All six rankings were fairly bullish on the Cardinals (average ranking of 4). The Panthers' strong season was more of a surprise, as Carolina had an average ranking of just 12 prior to their 10-1 tear.
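The average rankings quoted above are plain means of each team's six week 4 ranks. A quick Python check, with the ranks copied from the table that follows (the dictionary name is just an illustrative choice):

```python
from statistics import mean

# Week 4 ranks in the table's column order: SRS, DVOA, 538, ESPN, MARKET, AFA/FPI.
week4_ranks = {
    "ARI": [4, 2, 7, 4, 6, 2],
    "CAR": [17, 10, 6, 8, 14, 15],
}

for team, ranks in week4_ranks.items():
    avg = mean(ranks)
    print(f"{team}: {avg:.2f} (rounds to {round(avg)})")
# ARI: 4.17 (rounds to 4)
# CAR: 11.67 (rounds to 12)
```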

The Tennessee Titans are at the bottom of the list, going 2-10. In general, the various ranking systems weren't too fond of the Titans, though the DVOA black box thought they were about an average team after week four.


      Weeks 5-16    Week 4 Rankings
Team  Record Rank   SRS  DVOA   538  ESPN  MARKET  AFA/FPI
ARI   10-1    1.5     4     2     7     4       6        2
CAR   10-1    1.5    17    10     6     8      14       15
KC    9-2       3    19    27    16    18      13       14
NE    9-3       4     1     1     1     1       1        1
MIN   8-3       5    18    20    18    14      12       20
CIN   7-4     8.5     7     4     5     5       5        5
DEN   7-4     8.5    12     7     3     3       4        8
HOU   7-4     8.5    28    30    23    28      24       24
NYJ   7-4     8.5     8     9    19     9      10        9
PIT   7-4     8.5     6     6     8    10      17       10
SEA   7-4     8.5    16    12     4     7       3        4
DET   6-5      13    27    22    21    26      20       23
GB    6-5      13    11     3     2     2       2        3
WAS   6-5      13    13    18    27    24      22       26
BUF   5-6    18.5     5     8    12    12       8       11
CHI   5-6    18.5    31    31    29    29      30       31
IND   5-6    18.5    22    21    13    16      19       13
NO    5-6    18.5    20    25    24    21      21       22
OAK   5-6    18.5    23    16    26    20      23       29
PHI   5-6    18.5     9    17    14    22      11        6
STL   5-6    18.5    14    13    22    11      18       19
TB    5-6    18.5    32    29    32    32      31       32
ATL   4-7      25     2     5     9     6       7        7
BAL   4-7      25    15    14    11    19       9       12
JAC   4-7      25    29    26    31    31      32       30
MIA   4-7      25    26    28    25    30      27       25
NYG   4-7      25     3    11    15    15      15       17
SF    3-8      28    30    32    20    27      29       28
CLE   2-9      30    25    24    28    25      26       27
DAL   2-9      30    10    19    10    13      25       16
SD    2-9      30    24    23    17    17      16       18
TEN   2-10     32    21    15    30    23      28       21

Scanning the table, each ranking system had its share of hits and misses. But the eye test can only get us so far, which is why we have the Spearman coefficient to make sense of it all.
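As a rough check, the 2015 coefficients can be recomputed from the table above. This Python sketch assumes the standard treatment of ties (tied records share the average of the ranks they span, matching the 1.5 / 8.5 / 18.5 entries in the Rank column) and computes Spearman's rho as the Pearson correlation of the two rank lists, which stays correct when ties are present. The helpers `average_ranks` and `spearman` are illustrative names written for this check, not published code; the printed values should land within about a percentage point of the 2015 column in the summary table that follows.

```python
from math import sqrt

# Week 4 ranking systems, in the column order used by the table above.
SYSTEMS = ["SRS", "DVOA", "538", "ESPN", "MARKET", "AFA/FPI"]

# Team, weeks 5-16 record, then the six week 4 ranks (transcribed from the table).
TABLE = """
ARI 10-1 4 2 7 4 6 2
CAR 10-1 17 10 6 8 14 15
KC 9-2 19 27 16 18 13 14
NE 9-3 1 1 1 1 1 1
MIN 8-3 18 20 18 14 12 20
CIN 7-4 7 4 5 5 5 5
DEN 7-4 12 7 3 3 4 8
HOU 7-4 28 30 23 28 24 24
NYJ 7-4 8 9 19 9 10 9
PIT 7-4 6 6 8 10 17 10
SEA 7-4 16 12 4 7 3 4
DET 6-5 27 22 21 26 20 23
GB 6-5 11 3 2 2 2 3
WAS 6-5 13 18 27 24 22 26
BUF 5-6 5 8 12 12 8 11
CHI 5-6 31 31 29 29 30 31
IND 5-6 22 21 13 16 19 13
NO 5-6 20 25 24 21 21 22
OAK 5-6 23 16 26 20 23 29
PHI 5-6 9 17 14 22 11 6
STL 5-6 14 13 22 11 18 19
TB 5-6 32 29 32 32 31 32
ATL 4-7 2 5 9 6 7 7
BAL 4-7 15 14 11 19 9 12
JAC 4-7 29 26 31 31 32 30
MIA 4-7 26 28 25 30 27 25
NYG 4-7 3 11 15 15 15 17
SF 3-8 30 32 20 27 29 28
CLE 2-9 25 24 28 25 26 27
DAL 2-9 10 19 10 13 25 16
SD 2-9 24 23 17 17 16 18
TEN 2-10 21 15 30 23 28 21
"""


def average_ranks(values):
    """Rank values from highest (rank 1) to lowest; ties share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend `end` across the run of values tied with order[start].
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of positions start+1 .. end+1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman's rho as the Pearson correlation of two rank lists (tie-safe)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


rows = [line.split() for line in TABLE.strip().splitlines()]
win_pct = []
for row in rows:
    wins, losses = map(int, row[1].split("-"))
    win_pct.append(wins / (wins + losses))
win_rank = average_ranks(win_pct)  # reproduces the table's Rank column

results = {}
for col, name in enumerate(SYSTEMS):
    system_rank = [int(row[2 + col]) for row in rows]
    results[name] = spearman(win_rank, system_rank)

for name, rho in results.items():
    print(f"{name:8s} {rho:.0%}")
```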

The table below summarizes each ranking's Spearman coefficient for the 2015 season, alongside the 2007-2014 results (reminder: a higher coefficient means a more accurate ranking):

Week 4 Ranking Correlation to Future Wins
ranking  average  2007  2008  2009  2010  2011  2012  2013  2014  2015
espn         49%   55%   42%   51%   55%   43%   39%   34%   72%   49%
dvoa         48%   57%   45%   47%   46%   41%   65%   30%   64%   39%
afa          39%   50%   42%   51%   15%   49%   32%   23%   44%   46%
market       55%   68%   36%   67%   45%   52%   56%   37%   75%   55%
srs          47%   70%   46%   47%   38%   56%   54%   25%   56%   31%
538          60%     -     -     -     -     -     -     -   73%   47%

Call shenanigans if you like, but for the third straight year, this site's betting market rankings performed the best at predicting future wins. That same market-based ranking system also has a commanding lead among the systems with all nine seasons of results (2007-2015). Note that the 538 Elo rankings only have two seasons' worth of results (2014 and 2015), so their average is misleading when set against the nine-season averages.
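Since 538 only covers 2014 and 2015, one fairer comparison is to average every system over just those two seasons. A small Python sketch using the values from the summary table (`None` marks seasons without a 538 ranking; the names are illustrative):

```python
# Spearman coefficients (%) by season, 2007-2015, from the summary table.
# None marks seasons with no archived 538 Elo ranking.
coefficients = {
    "espn":   [55, 42, 51, 55, 43, 39, 34, 72, 49],
    "dvoa":   [57, 45, 47, 46, 41, 65, 30, 64, 39],
    "afa":    [50, 42, 51, 15, 49, 32, 23, 44, 46],
    "market": [68, 36, 67, 45, 52, 56, 37, 75, 55],
    "srs":    [70, 46, 47, 38, 56, 54, 25, 56, 31],
    "538":    [None] * 7 + [73, 47],
}


def average(values):
    """Mean over the seasons that actually have a value."""
    present = [v for v in values if v is not None]
    return sum(present) / len(present)


for name, values in coefficients.items():
    all_seasons = average(values)
    shared = average(values[-2:])  # 2014 and 2015, the seasons 538 covers
    print(f"{name:7s} all seasons: {all_seasons:5.1f}%   2014-2015 only: {shared:5.1f}%")
```

Over the shared 2014-2015 window, the market rankings average 65% versus 60% for 538 Elo.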

10 comments:

  1. It's somewhat shenanigans because you are kind of comparing apples and oranges. SRS and DVOA and even AFA aren't really predictive tools; they are more descriptive. I don't know what the new tweaks to FPI are, so maybe it's meant to be more predictive. Not surprised to see 538's Elo do well, as it takes a longer sample of games beyond just the 4 week window, and I think there are some tweaks vs. the classic Elo that make it more predictive, if I understand it right.

    Most straight analytic systems perform better after a larger sample than 4 weeks and aren't taking into account injuries or future SOS.

    All that being said, it's always going to be tough to beat the market, which is likely using all these kinds of tools plus their own subjective/objective tweaks. I think the analytic systems would be very close if they were tweaked to be as predictive as possible early in the season.

    1. "... you are kind of comparing apples and oranges. SRS and DVOA and even AFA aren't really predictive tools they are more descriptive. ..."

      Regardless of what the design goal of a metric is, we can still sensibly test how well it predicts something. As silly as it might be, we could test how well the number of vowels in a team's name corresponds to the win rate.

      If a metric doesn't predict anything, then how do you know that it's actually measuring something meaningful? To be sure, win-count ranking in the NFL is quite noisy, so testing predictions of other things like SRS or point differentials might be more sensible.

    2. I disagree that AFA is descriptive. I think Brian Burke has been quite clear that his model was specifically designed to be predictive. FO is a bit more vague when it comes to describing what DVOA is actually intended to measure (my biggest beef with DVOA). But from what I've read on FO, they do intend it to be predictive.

      I agree with Nate's comments as well. Regardless of intent, why not test a ranking on its ability to predict wins? If your ranking can't do that, I'm not really sure what the point of a ranking is. In other words, if your ranking is "descriptive", what exactly is it trying to describe that is not already described by the W-L record?

  2. We can also look at the following: which system did best at predicting the market rankings in week 16, rather than future wins? If I compare the week 4 ranks to the market's current week 16 ranks, I get the following:

    DVOA DAVE 74%
    ESPN 72%
    MARKET 70%
    ELO 66%
    FPI 65%
    DVOA 62%
    SRS 47%

    1. Thanks for compiling this. I still think comparing to W-L record is the best comparison (even though it is noisy), but it is interesting to see that both DVOA and ESPN correlated better with the week 16 market ranking than the week 4 market ranking itself did.

    2. Good stuff. It might also be interesting to look at the rankings for weeks 5+ to see if the systems that use more than four weeks of data do any better. It might also reduce some of the noise caused by different strengths of schedule. Ultimately the best test of predictiveness is individual games, but the betting market system doesn't need rankings for that.

      I agree that systems that aren't intended to be predictive don't have much value so I have no problem measuring them against future performance.

  3. Love the concept. Just wish you'd run it later in the season. I would never expect a power ranking system to beat the betting market in assessing true team talent when it is limited to 3-4 games of data.

    1. I took a look at week 8 rankings a couple years back:
      http://www.inpredictable.com/2014/01/early-season-power-rankings-2013-update.html

      All the stat-based rankings got better. SRS was just as good as DVOA, surprisingly enough. ESPN subjective rankings didn't get any better with more data. Market rankings were still the best, although with a smaller lead. Haven't had time to rerun this with 2014 and 2015 data though.

      Although it is not too surprising that the market rankings were on top, part of the point of the exercise was to validate my methodology. The market rankings are not just sitting on the shelf somewhere. They require some methodology in order to derive them from point spread data.

  4. One additional source to consider is www.massey-peabody.com. I have no affiliation with the site, but have followed it for several years.

    1. I did a quick calc for Massey-Peabody last year. Their score was middle of the pack. But when I look at their website now, I can't find the week 4 rankings archived anywhere.
