Saturday, July 12, 2014

On the Probability of Scoring a Goal

In this post, I will describe my attempts to model the probability of a goal being scored in soccer. After correcting for team imbalances, I find that a trailing team has a higher probability of scoring in most situations. This result has potential implications for strategy and whether teams should be adopting a more aggressive style of play.

The Model

Using the same dataset I used for my win probability model (~3,000 matches from five of the top European Leagues), I employed LOESS smoothing to build a model that predicts the probability of a goal being scored within the next minute of game time. The model is a function of the following:
  • game time
  • goal difference
  • team strength
I derive the team strength from the pre-match betting odds, and convert it into an expected number of goals scored per game. Including team strength as a parameter is crucial for this type of analysis, because the model is also a function of goal differential. There is going to be heavy selection bias in the raw historical results: favorites are going to be over-represented in game situations in which a team has a positive goal differential. As a result, the raw goal probability is higher for teams that have a lead (favorites tend to score more). But having a lead in and of itself does not lead to a higher probability of scoring more goals. This is correlation, not causation.
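To make the selection-bias argument concrete, here is a toy simulation. It is not the model used for this post. It assumes each team scores at a constant per-minute rate that never depends on the scoreline, with hypothetical team strengths drawn uniformly between 0.8 and 2.2 expected goals per game. The function name and all parameters are made up for the illustration.

```python
import random

def simulate_selection_bias(n_matches=10000, seed=0):
    """Raw per-minute scoring rates by game state when the lead has no causal effect."""
    rng = random.Random(seed)
    # counts[state] = [team-minutes observed, goals scored in those minutes]
    counts = {"leading": [0, 0], "trailing": [0, 0]}
    for _ in range(n_matches):
        # Hypothetical strengths: expected goals per 90 minutes, as a per-minute rate.
        rates = [rng.uniform(0.8, 2.2) / 90.0, rng.uniform(0.8, 2.2) / 90.0]
        score = [0, 0]
        for _minute in range(90):
            diff = score[0] - score[1]
            # Scoring depends only on team strength, never on the current score.
            scored = [rng.random() < rates[0], rng.random() < rates[1]]
            if diff != 0:
                leader = 0 if diff > 0 else 1
                trailer = 1 - leader
                counts["leading"][0] += 1
                counts["leading"][1] += scored[leader]
                counts["trailing"][0] += 1
                counts["trailing"][1] += scored[trailer]
            score[0] += scored[0]
            score[1] += scored[1]
    return {state: goals / minutes for state, (minutes, goals) in counts.items()}

raw = simulate_selection_bias()
print(f"leading:  {raw['leading']:.4f} goals per minute")
print(f"trailing: {raw['trailing']:.4f} goals per minute")
```

Even though the scoreline has no effect on scoring in this simulation, the raw rate for leading teams comes out higher, simply because stronger teams are more likely to be the ones in front.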

In fact, once we control for the bias in the results, the exact opposite conclusion emerges: For most of the game, a team trailing by one goal is more likely to score than when leading by a goal or tied. See below for the (smoothed) goal probabilities as a function of game time. The probabilities reflect a team that would be expected to score 1.4 goals per game, on average.
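As a rough illustration of the smoothing step, the sketch below implements a one-dimensional local linear fit with tricube weights, which is the core idea behind LOESS. It is not the code behind the charts: the actual model is a function of three variables, and the `loess_point` helper, the `frac` value, and the synthetic input frequencies are all assumptions made for the example.

```python
import random

def loess_point(x0, xs, ys, frac=0.3):
    """Estimate y at x0 with a tricube-weighted local linear fit."""
    k = max(2, int(frac * len(xs)))
    dists = sorted(abs(x - x0) for x in xs)
    # Bandwidth: distance to the k-th nearest point (small offset avoids h == 0).
    h = dists[k - 1] + 1e-9
    sw = swx = swy = swxx = swxy = 0.0
    for x, y in zip(xs, ys):
        u = abs(x - x0) / h
        if u >= 1.0:
            continue  # outside the local window
        w = (1.0 - u ** 3) ** 3  # tricube weight
        sw += w
        swx += w * x
        swy += w * y
        swxx += w * x * x
        swxy += w * x * y
    denom = sw * swxx - swx * swx
    if abs(denom) < 1e-12:
        return swy / sw  # all weight on a single x value: fall back to the weighted mean
    slope = (sw * swxy - swx * swy) / denom
    intercept = (swy - slope * swx) / sw
    return intercept + slope * x0

# Synthetic first-half data: a gentle upward trend plus noise (not real match data).
rng = random.Random(1)
minutes = list(range(1, 46))
raw_freq = [0.012 + 0.0001 * m + rng.gauss(0, 0.002) for m in minutes]
smoothed = [loess_point(m, minutes, raw_freq) for m in minutes]
```

A smaller `frac` follows the raw frequencies more closely, while a larger one gives a smoother curve at the cost of flattening real features such as a late-half decline.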

The First Half

The graph above charts the per-minute goal probability as a function of game time. In general, the probability of a goal being scored increases as the half progresses, although for tied games it actually starts to decline after about the 30-minute mark. But there is a clear differential when it comes to teams separated by a goal, with the trailing team more likely to score.

The Second Half

Here is the corresponding chart for the second half.

As with the first half, teams trailing by a goal continue to have a higher goal probability. In fact, there seems to be a "sweet spot" between minutes 55 and 70, with a ~0.3% per minute higher goal probability for a team trailing by 1, compared to a team leading by 1. After that, the probabilities for all three scenarios start to converge. They become largely indistinguishable for the last five minutes of regulation and any stoppage time.
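To put the ~0.3% per minute gap in perspective, here is a back-of-the-envelope calculation. It assumes a flat baseline rate across a 90-minute game with no stoppage time, which is a simplification, since the charts show the rate changing over time.

```python
expected_goals_per_game = 1.4   # reference team strength used in the post
minutes_per_game = 90           # assumption: ignores stoppage time
baseline_per_minute = expected_goals_per_game / minutes_per_game

gap_per_minute = 0.003          # ~0.3% per minute, trailing vs. leading
window_minutes = 70 - 55        # the "sweet spot" window

extra_expected_goals = gap_per_minute * window_minutes
relative_gap = gap_per_minute / baseline_per_minute

print(f"baseline per-minute probability: {baseline_per_minute:.2%}")
print(f"extra expected goals over the window: {extra_expected_goals:.3f}")
print(f"gap relative to baseline: {relative_gap:.0%}")
```

Under these assumptions the baseline is about 1.6% per minute, so a 0.3% gap is roughly a 19% difference in scoring rate, worth about 0.045 expected goals over the 15-minute window.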

Not Entirely Random

These results both confirm and conflict with prior research on soccer goal scoring. In a 2010 paper, three German physicists found that goal scoring largely followed a random Poisson process. However, the researchers noted that draws were over-represented in actual game results, when compared against what one would expect from a true Poisson process. You can look at the above graphs to see why. There is a bias in the probabilities to either maintain a tie, or return the match to a tie score.
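For reference, the draw rate implied by a pure Poisson process is easy to compute. The sketch below assumes the two teams score independently and that both have the same expected 1.4 goals per game used for the charts above. It is an illustrative baseline, not a figure from the paper or from my dataset.

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """Probability of exactly k goals when the expected number of goals is lam."""
    return exp(-lam) * lam ** k / factorial(k)

def draw_probability(lam_a, lam_b, max_goals=15):
    """Probability that two independent Poisson scorers finish level."""
    return sum(poisson_pmf(k, lam_a) * poisson_pmf(k, lam_b)
               for k in range(max_goals + 1))

p_draw = draw_probability(1.4, 1.4)
print(f"Poisson draw probability: {p_draw:.1%}")
```

With these inputs the pure Poisson baseline gives a draw in roughly 25% of matches. A scoring process that pulls games back toward a tie, as the charts suggest, would produce more draws than that.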

These results are at odds, however, with a 2007 paper entitled "Self-affirmation model for football goal distributions". In this paper (once again by a team of German physicists), the authors argue for the existence of goal "self-affirmation", meaning that scoring a goal increases the likelihood of future goals. I have not had a chance to unpack all of the math in their paper, but the authors appear to have fallen into the correlation/causation trap I mentioned above. Scoring a goal does not "cause" future goal scoring (via some nebulous motivation/de-motivation mechanism). Being a better team leads to more goal scoring. Once you remove the bias in your results due to team imbalances, you come to the exact opposite conclusion of the 2007 paper.

Implications for Game Strategy

Assuming my modeling above is capturing real phenomena (and not just chasing noise), these results indicate that teams may be playing sub-optimally. It appears that teams would be better off adopting a more aggressive style of play throughout the game, not just when they're trailing. Whatever teams do when they are trailing by a goal, they should be doing all the time. Similarly, teams that are up by one goal should keep their foot on the gas. Their current style of play leads to a higher probability of letting the trailing team back in the game. Of course, once you get up by more than one goal, the calculus changes, and it makes sense to forgo goal opportunities for the chance to burn clock.

This is also consistent with the psychological effect known as "loss aversion", in which there is a strong tendency to avoid losses rather than acquire gains. It's an effect that has been demonstrated across several sports: NFL teams don't go for it on fourth down as often as they should, and NBA teams have historically undervalued the three-pointer, although they appear to have caught on in recent years.

For next steps, I hope to flesh this analysis out a bit in future posts and share some of the underlying data I'm using for my modeling.


  1. You also need to factor in the chance that no goals are scored the rest of the game. Certainly if there's guaranteed to be another goal scored, each team should maximize their odds of scoring it. But since there's a chance of no more goals, you have to account for that possibility. If it's high enough, it could still be a good strategy even if that team has NO chance of scoring the next goal.