June 11, 2014

World Cup Prediction Mathematics Explained

By Michael Moyer

This article was published in Scientific American’s former blog network and reflects the views of the author, not necessarily those of Scientific American

The World Cup is back, and everyone’s got a pick for the winner. Gamblers have been predicting the outcome of sporting contests since the first foot race across the savannah, but in recent years a unique type of statistical analysis has taken over the prediction business. Everyone from Goldman Sachs to Bloomberg to Nate Silver’s FiveThirtyEight has an online World Cup predictor that uses numbers, not hunches, to generate precise probabilities for match outcomes. Goldman Sachs, for instance, gives host nation Brazil a 48.5 percent chance of winning it all; FiveThirtyEight puts the odds at 45 percent while Bloomberg Sports has concluded there’s just a 19.9 percent chance of a triumph for the Seleção.

Where do these numbers come from? All statistical analysis must start with data, and these soccer prediction engines draw on results from past matches. A fair bit of judgment is necessary here. Big international soccer tournaments only come around every so often, so the analysts have to choose how to weight team performance in lesser events such as international “friendlies,” where nothing of consequence is at stake. The modelers also have to decide how far back to pull data from (does Brazil’s proud soccer history matter much when its oldest player is 34?) and how to rate the performance of individual players during their time playing for club teams such as Manchester United or Real Madrid.

Wherever the data comes from, the modeler now has to incorporate it into a model. Frequently, the modeler translates the question of “who is going to win?” into the form “how many goals will team X score against team Y?” And for this, she relies on a statistical tool called a bivariate Poisson regression.

Those are three unfamiliar words. Let’s unpack them one by one. “Bivariate” means there are two interrelated variables, here the goals scored by each of the two teams, from which we are trying to predict a single outcome: team X’s performance against team Y. “Regression” just means that we’re fitting a model to a set of data. “Poisson” is the interesting one.

Imagine that you’re standing by the side of the road and you want to know how many cars go by in a minute. First, you’d take some data. Armed with a stopwatch and a counter, you’d see that 15 go by one minute, 18 the next, just four the third minute. Do this for enough minutes and you’d begin to see a pattern build up, a Poisson distribution, named for the French mathematician who invented it in order to estimate the frequency of false convictions.
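To make the roadside example concrete, here is a minimal Python sketch. It assumes the three counts quoted above are the whole data set and uses their average as the estimate of the Poisson rate, which is the standard choice. The function name poisson_pmf is ours, chosen for illustration.

```python
import math


def poisson_pmf(k, rate):
    """Probability of seeing exactly k events when the average count is `rate`."""
    return math.exp(-rate) * rate**k / math.factorial(k)


# The three one-minute car counts from the example (far too few for a real study).
counts = [15, 18, 4]

# The sample mean is the usual estimate of the Poisson rate.
rate = sum(counts) / len(counts)
print(f"Estimated rate: {rate:.2f} cars per minute")

# How likely is each of a few possible counts under that estimate?
for k in (4, 12, 15, 18):
    print(f"P({k} cars in a minute) = {poisson_pmf(k, rate):.4f}")
```

With only three minutes of data the estimate is rough; the pattern sharpens as more minutes are recorded.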

The number of goals in a game also tends to follow the Poisson distribution. A given team may be most likely to score one or two goals, sometimes zero or three, and much less frequently four or five (or more). Modelers will map the data from a team’s previous performance onto a Poisson distribution of the number of goals they are likely to score against their opponent.
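As a rough illustration of how goal distributions turn into match predictions, the Python sketch below makes two assumptions that are ours, not the modelers’. First, the expected goals (1.8 and 0.9) are hypothetical, not fitted to any real results. Second, the two teams’ goal counts are treated as independent, which is a simplification: a full bivariate Poisson model also allows the two counts to be correlated.

```python
import math


def poisson_pmf(k, rate):
    """Probability of exactly k goals when the expected number is `rate`."""
    return math.exp(-rate) * rate**k / math.factorial(k)


def match_probabilities(rate_x, rate_y, max_goals=10):
    """Return (P(X wins), P(draw), P(Y wins)) for independent Poisson goal counts.

    Scorelines above max_goals per team are ignored; their combined
    probability is negligible for realistic soccer scoring rates.
    """
    win_x = draw = win_y = 0.0
    for goals_x in range(max_goals + 1):
        for goals_y in range(max_goals + 1):
            p = poisson_pmf(goals_x, rate_x) * poisson_pmf(goals_y, rate_y)
            if goals_x > goals_y:
                win_x += p
            elif goals_x == goals_y:
                draw += p
            else:
                win_y += p
    return win_x, draw, win_y


# Hypothetical expected goals for team X and team Y (illustrative only).
win_x, draw, win_y = match_probabilities(1.8, 0.9)
print(f"X wins: {win_x:.3f}  draw: {draw:.3f}  Y wins: {win_y:.3f}")
```

Running the same calculation for every fixture, and simulating the bracket many times, is one way to arrive at a tournament-winning probability.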

And the gamblers? As of this writing the online sportsbook Betfair has Brazil as the favorite at roughly 3-to-1, which implies about a 24.4 percent chance of winning. If you believe the analysts at Goldman Sachs or FiveThirtyEight, who put Brazil’s chances at nearly 50 percent, a betting opportunity has opened up for you. Of course, presumably all those people betting on Brazil at 3-to-1 odds have also read the Goldman Sachs and FiveThirtyEight analyses.
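The arithmetic behind that comparison can be sketched in a few lines of Python. The function names are ours, and the sketch ignores the bookmaker’s margin and any commission, so treat the numbers as illustrative. Odds of 3-to-1 against correspond to a probability of 1/(3+1), or 25 percent, close to the 24.4 percent quoted.

```python
def implied_probability(odds_against):
    """Convert 'N-to-1 against' odds into the win probability they imply."""
    return 1.0 / (odds_against + 1.0)


def fair_odds_against(probability):
    """Convert a win probability into the equivalent 'N-to-1 against' odds."""
    return (1.0 - probability) / probability


def expected_profit(true_probability, odds_against, stake=1.0):
    """Average profit on a stake if the true win probability is as given.

    A win pays stake * odds_against; a loss forfeits the stake.
    """
    win = true_probability * stake * odds_against
    loss = (1.0 - true_probability) * stake
    return win - loss


print(f"3-to-1 implies {implied_probability(3.0):.1%}")
print(f"24.4% corresponds to about {fair_odds_against(0.244):.2f}-to-1")

# If Goldman Sachs's 48.5 percent were the true probability, a 3-to-1 bet
# would have a large positive expected profit per unit staked.
print(f"Expected profit per unit: {expected_profit(0.485, 3.0):.2f}")
```

A positive expected profit only exists if the model’s probability is right and the market’s is wrong.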

The question becomes: What do they know that the statisticians don’t?

Image by Digo Souza on Flickr