## Saturday, August 23, 2014

### How to improve your chances of scoring a goal in soccer? Concede one first.

"If you want to be a millionaire, start with a billion dollars and launch a new airline."

Apologies for the facetious (and somewhat clickbait-y) post title. In the same way that the opening quote from Richard Branson is not intended to be serious advice on how to become a millionaire, I am not suggesting that allowing your opponent to score is a viable strategy for winning soccer matches.

What I will suggest in this post is that teams appear to play more optimally when trailing their opponent (or, equivalently, less optimally when holding a lead). I found this result interesting for two reasons:

1. Arriving at this conclusion provides a good example of the pitfalls of conflating correlation with causation.
2. The scourge of modern sports strategy is loss aversion (and its cousin, risk aversion). This result appears to show that soccer is not immune.

I had already touched on this topic in a prior post (see On the Probability of Scoring a Goal). In this post, however, I have expanded my dataset, and I will do my best to illustrate my point with the raw data. The results from the previous post were the output of a regression analysis, and somewhat of a black box from the reader's point of view. I will try to be more transparent here.

### The Data

My dataset includes 7,000+ matches from the past two seasons of 10 of the top leagues in the world: England's Premier League, Germany's Bundesliga, Spain's Primera Division, the Netherlands' Eredivisie, Italy's Serie A, Argentina's Primera Division, France's Ligue 1, Brazil's Serie A, the United States' MLS, and Ukraine's Premier League. The play-by-play data from these matches is merged with pre-game betting odds, courtesy of OddsPortal.

### A Spurious Correlation

In a 2007 paper, a team of German physicists argued for the existence of what they called "self-affirmation" of goal scoring in soccer. In their words:

> In the present context of scoring in football, goals are likely not independent events but, instead, scoring certainly has a profound feedback on the motivation and possibility of subsequent scoring of both teams (via direct motivation/demotivation of the players, but also, e.g., by a strengthening of defensive play in case of a lead).

I have a knee-jerk skepticism for these "just so" types of explanations, especially ones that resort to armchair psychoanalysis of the athletes playing the game. I find it hard to believe that professional athletes, having spent a lifetime training to play at the highest level of their sport, react like sullen teenagers in response to a temporary in-game setback. And like any good "just so" story, we can create an equally plausible explanation for the opposite result: Shouldn't teams be more motivated to score when trailing their opponent?

My hangups aside, what does the data say? The German authors used game results from some 40 years of German league play, but had limited data with respect to timing of each goal. As mentioned above, my dataset only goes back two years, but draws from a more diverse collection of leagues. I also have much more detailed data on when goals are actually scored in-match. Using that dataset, I have charted per-minute goal frequency as a function of game time. There are two lines: one for teams that are trailing by a goal, and another when leading by a goal. Goal frequencies are averaged in five minute buckets to keep the chart from getting too noisy.
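
For readers who want to build this kind of chart themselves, here is a minimal Python sketch of the bucketing step. The input format (one record per team per minute, holding the minute, the team's current goal margin, and whether the team scored in that minute) and the function name are assumptions made for illustration; they are not the actual pipeline behind the chart.

```python
from collections import defaultdict


def goal_frequency_by_bucket(team_minutes, bucket_size=5):
    """Per-minute goal frequency for teams up or down by exactly one goal.

    team_minutes: iterable of (minute, margin, scored) tuples, a hypothetical
    layout where margin is the team's lead (+1) or deficit (-1) at that minute.
    Returns {(margin, bucket_start_minute): goals per team-minute}.
    """
    exposure = defaultdict(int)  # team-minutes observed in each cell
    goals = defaultdict(int)     # goals scored in each cell
    for minute, margin, scored in team_minutes:
        if margin not in (-1, 1):
            continue  # only one-goal margins are charted
        bucket = minute // bucket_size * bucket_size  # e.g. 60..64 -> 60
        exposure[(margin, bucket)] += 1
        goals[(margin, bucket)] += int(scored)
    return {key: goals[key] / exposure[key] for key in exposure}
```

Dividing goals by team-minutes of exposure, rather than simply counting goals, matters here because teams spend very different amounts of time in each game state.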

At first glance, this graph seems to support the "self-affirmation" theory of the 2007 paper. Teams leading by one goal appear to have a slightly higher probability of scoring another goal, particularly after minute 35. But this is just correlation, not necessarily causation. Does being up a goal really motivate a team to score more? And de-motivate their opponent? Or is there an underlying variable at play here?

### Ice cream leads to murder and running the football leads to victory

A commonly used example of the correlation-causation fallacy concerns ice cream sales and the murder rate. When ice cream sales rise, there is a corresponding increase in the murder rate. But ice cream doesn't promote homicidal tendencies, and committing a murder doesn't induce cravings for a triple scoop waffle cone (as far as I know). The actual cause is warmer weather, which leads to both ice cream sales and higher crime (including murder). See below for an illustration.

Another example, perhaps more appropriate to this blog, would be the correlation between rushing attempts and victory in the NFL. There is no doubt that there is a strong and persistent correlation between the number of rushing attempts a team has in a game and the likelihood of victory. See below for average win percentage by rushing attempts for the past 10 seasons of the NFL:

The correlation is readily apparent, but as a causal relationship, this makes only slightly more sense than the ice cream->murders hypothesis. Rushes yield 4.2 yards per play on average, while passes yield 6.6 yards. Yards are the raw currency for points and wins in the NFL. Why would a less efficient play lead to victory? The answer, of course, is that it doesn't. Rushing the ball doesn't lead to victory. Instead, they are both the result of building an early lead. A team with a lead (particularly late in the game) will tend to rush the ball more because it takes more time off the clock. Once again, in flowchart form:

### Mis-matched matches

Returning now to the goal frequency graph, is it possible that there is a bias in the results that is distorting the outcome? A soccer match is not a coin flip; each one features teams with varying degrees of talent and skill. On average, the superior team in a match is more likely to find itself up one goal, as opposed to down. And it stands to reason that a superior team will be more likely to score goals in general, regardless of the game situation. See below:

If so, then the goal frequency graph is not necessarily measuring the causal impact of being up or down a goal in a soccer match. The higher goal frequency for teams up by one goal may be because we have oversampled superior teams. So, how do we determine if that is the case?

### Punters to the rescue

We can use the pre-match betting odds as a proxy for team strength, and then control for any oversampling in our data. The most common form of betting is in a '1 - X - 2' format, where you can bet on a Team 1 victory, a Draw ("X"), or a Team 2 victory, each with an associated payout and implied probability. I built a simple regression model to convert these odds into an implied "expected average goals scored per game". I have saved the details for the postscript to this post.

With that regression model, I can calculate expected goals for each team in each match. I then grouped teams into five quantiles, based on their expected goals for each individual match. Here are the quantiles:

| Quantile | Average goals |
|---|---|
| 1st | 1.02 |
| 2nd | 1.30 |
| 3rd | 1.47 |
| 4th | 1.69 |
| 5th | 2.26 |
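
As a rough illustration of the grouping step, here is a small Python sketch that splits team-match observations into five equal-sized groups by rank. The function name and the rank-based rule are assumptions; the exact cut points used for the table above are not stated.

```python
def assign_quintiles(expected_goals):
    """Label each value 1 (lowest fifth) through 5 (highest fifth) by rank.

    expected_goals: list of expected-goals values, one per team per match.
    Returns a list of labels in the same order as the input.
    """
    n = len(expected_goals)
    # Indices of the observations, ordered from weakest to strongest.
    order = sorted(range(n), key=lambda i: expected_goals[i])
    labels = [0] * n
    for rank, i in enumerate(order):
        labels[i] = rank * 5 // n + 1
    return labels
```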

What if we now run the goal frequency graph separately for each of these five quantiles? See below:

What was a slight bias in favor of being up one goal has been replaced with a slight, somewhat erratic bias in favor of being down one goal. Another way to look at this is to reweight these results so that each quantile is equally represented. For example, between minutes 60 and 64, of the games in which a team is up by one goal, 28% of those games are teams from the 5th quantile, while just 11% are from the 1st quantile. Let's re-weight the results so that each quantile gets equal weight (i.e. 20%). Here are the results:
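
To make the reweighting concrete, here is a minimal Python sketch comparing a pooled goal rate with an equal-weight one. Only the 11% and 28% shares for the 1st and 5th quantiles come from the data above; the other shares and all of the goal counts are made-up numbers chosen purely for illustration.

```python
def pooled_rate(goals, minutes):
    """Goal rate when every team-minute counts equally (strong teams oversampled)."""
    return sum(goals.values()) / sum(minutes.values())


def equal_weight_rate(goals, minutes):
    """Goal rate when each quantile gets the same weight (20% each for five groups)."""
    rates = [goals[q] / minutes[q] for q in minutes]
    return sum(rates) / len(rates)


# Hypothetical team-minutes spent up by one goal, by quantile (11% ... 28% of the total).
minutes_up = {1: 11000, 2: 17000, 3: 20000, 4: 24000, 5: 28000}
# Hypothetical goals scored during those minutes.
goals_up = {1: 110, 2: 204, 3: 280, 4: 384, 5: 560}

pooled = pooled_rate(goals_up, minutes_up)
reweighted = equal_weight_rate(goals_up, minutes_up)
```

With these invented inputs the pooled rate comes out higher than the equal-weight rate, simply because the strongest quantile contributes the most minutes. That is the oversampling effect the reweighting is meant to remove.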

Some of the game-time volatility washes out in this version, and we have a more consistent demonstration of the difference in goal probability as a function of scoring margin. On average, a team's probability of scoring a goal in a given minute is 0.16 percentage points higher when down by a goal than when up. That may not seem like much, but when that per-minute probability is compounded over a 90-minute match, the impact is closer to 0.17 goals per game. Scoring is at a premium in soccer, so a 0.17 goal advantage is significant.

This result is consistent with at least one other sport that I know of (and I suspect it to be true of others). In a recent issue of ESPN the Magazine, Ken Pomeroy took a look at scoring efficiency in college basketball as a function of point differential. He found that for evenly matched games, a team being down by 10 points scores about 0.10 points more per possession than a team that is up by 10 points.

So what could be behind this difference in goal probabilities? At the risk of creating my own "just so" narrative, I believe this is due to loss aversion on the part of both the players and coaches. On the pitch, each team must strike a balance between offensive and defensive tactics. Presumably, a more aggressive offensive strategy leaves a team more vulnerable to counterattacks by their opponent. Loss aversion theory states that people have a preference for avoiding losses over achieving gains. If so, teams may tilt their strategy away from offense to defense because they "hate" conceding goals more than they "love" scoring them.

However, once a team finds itself down a goal, perhaps the mindset changes. Teams may hate conceding goals, but they hate conceding matches even more. Being down a goal, and the unease that ensues, may nudge teams towards a more optimal balance between offense and defense. So, what we see is almost the opposite of momentum. A score by one team doesn't snowball into future goals, but rather increases the probability of an offsetting "equalizer". Momentum as a concept has its roots in Newton's First Law of Motion. But it's the colloquial version of the Third Law of Motion that may be more appropriate here: every action has an equal and opposite reaction (well, not equal, but you get the point).

### Postscript: Converting betting odds to expected goals

To convert betting odds to expected goals, you first need to convert them into implied probabilities. I will use the betting odds for this year's Germany-Argentina World Cup Final as an example. Here are the average odds:
• Germany: +135
• Draw: +225
• Argentina: +231
Converting these into implied probabilities yields 43%, 31%, and 30% respectively. Because of the vig, these add up to greater than 100%, so we divide by a common factor to get the vig-adjusted probabilities:
• Germany: 41.1%
• Draw: 29.7%
• Argentina: 29.2%
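
The two steps above (American odds to implied probabilities, then removing the vig) can be sketched in Python as follows. The function names are hypothetical, and I am assuming that "divide by a common factor" means dividing each probability by the sum of all three, which does reproduce the figures listed.

```python
def american_to_implied(odds):
    """Convert American (moneyline) odds to an implied probability."""
    if odds > 0:
        return 100 / (odds + 100)   # underdog-style odds, e.g. +135
    return -odds / (-odds + 100)    # favorite-style odds, e.g. -150


def remove_vig(implied):
    """Rescale implied probabilities so that they sum to 1."""
    total = sum(implied.values())
    return {outcome: p / total for outcome, p in implied.items()}


odds = {"Germany": 135, "Draw": 225, "Argentina": 231}
implied = {outcome: american_to_implied(o) for outcome, o in odds.items()}
fair = remove_vig(implied)  # roughly 0.411, 0.297, 0.292
```
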
I now need to associate these probabilities with goals. To do so, I built a simple linear regression model that models goals scored as a function of the implied betting probabilities. But in order to do a proper regression, I am going to convert the probabilities into log odds.

As a general rule, if you've got probabilities in a regression model, either as independent or dependent variables, it is best to convert the probabilities into log odds. Regression models, not knowing any better, will calculate coefficients that apply to any real value, even though probabilities only vary from 0 to 1. When you convert probability to odds, you go from a range of 0:1 to 0:infinity. When you then convert those odds into log odds you go from 0:infinity to -infinity:+infinity. And now your regression model is free to extrapolate along the entire number line. Here are the log odds for the match. Note that the listed values are positive, which corresponds to the log odds against each outcome, ln((1 - p) / p); for Germany, ln(0.589 / 0.411) is approximately 0.360:
• Germany: 0.360
• Draw: 0.861
• Argentina: 0.887
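
Here is a short Python sketch that reproduces these three values starting from the posted odds. One assumption to flag: the listed numbers match the log odds against each outcome, ln((1 - p) / p), rather than the more common ln(p / (1 - p)), so that is the convention used below. The helper name is hypothetical.

```python
import math


def log_odds_against(p):
    """Natural log of the odds against an outcome with probability p."""
    return math.log((1 - p) / p)


# Implied probabilities from the American odds (+135, +225, +231), all positive odds.
raw = {"Germany": 100 / 235, "Draw": 100 / 325, "Argentina": 100 / 331}
total = sum(raw.values())

# Vig-adjusted probabilities, then log odds against each outcome.
log_odds = {outcome: log_odds_against(p / total) for outcome, p in raw.items()}
# roughly 0.360, 0.861, 0.887
```

Working from the unrounded probabilities matters a little here: starting from the rounded 29.7% for the draw gives 0.862 rather than 0.861.
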
Here are the regression coefficients to convert log odds into expected goals:
• expected goals = 1.07 - (log odds of winning) * (0.5017) + (log odds of draw) * (0.5416)
Expected goals for each team:
• Germany: 1.36
• Argentina: 1.09
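
For completeness, here is the final step as a small Python sketch, plugging the log odds exactly as listed above (positive values) into the regression formula. The constant and function names are hypothetical; the coefficients are the ones given in the formula.

```python
INTERCEPT = 1.07
WIN_COEF = -0.5017   # applied to the team's own win log odds
DRAW_COEF = 0.5416   # applied to the draw log odds


def expected_goals(win_log_odds, draw_log_odds):
    """Expected goals for one team from the posted regression coefficients."""
    return INTERCEPT + WIN_COEF * win_log_odds + DRAW_COEF * draw_log_odds


germany = expected_goals(0.360, 0.861)    # about 1.36
argentina = expected_goals(0.887, 0.861)  # about 1.09
```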