Peter Smith

Publications, Lectures and Other Stuff

The Limitations of ‘Big Data’

Tim Harford has a useful article in the FT Magazine, ‘Big data: are we making a big mistake?’ [Link below]

Harford points to the enormous amount of data that is now accessible on the Internet (‘found data’), and how this can have major predictive value, as Google Flu Trends did, for example, during the first few years of its operation. Such data is often portrayed as being ‘theory free’: so massive and all-inclusive that it avoids the problem of sampling error and provides highly accurate results.

Harford claims that this view is erroneous. Users of ‘big data’ should remember the story of the polling debacle prior to the 1936 American presidential election. The Literary Digest conducted an enormous postal survey of voters’ preferences in the upcoming election, and found that out of 2.4 million returned responses (from 10 million sent out, a quarter of the electorate), 55% supported Alfred Landon, the Republican candidate, against only 41% for the Democrat, FDR. But in the actual election, Roosevelt won with 61% of the votes compared to Landon’s 37%. By contrast, a much smaller opinion poll by George Gallup (reportedly of around 3,000 people) had accurately predicted a Roosevelt win. Whilst the people at the Digest had reduced the likelihood of sampling error by having such a large sample, their results were useless because of ‘sampling bias’: they had sent out their questions to people on a list compiled from car registrations and telephone directories, which at that time comprised mostly prosperous people (who might be assumed to be more likely to vote Republican) and which excluded the poor. In addition, the 2.4 million who responded were not necessarily typical of the 10 million who had been surveyed.
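The difference between sampling error and sampling bias can be seen in a toy simulation. The sketch below is only an illustration under invented assumptions: the share of voters on the mailing list (30%) and the support levels within each group (42% and 69%) are made-up figures chosen so that overall support comes out near 61%; they are not historical data, and the function name `simulate_polls` is hypothetical.

```python
import random


def simulate_polls(seed=1936, population_size=400_000):
    """Compare a huge biased sample with a small random one (toy model)."""
    rng = random.Random(seed)

    # Hypothetical electorate: each voter is (listed, votes_roosevelt).
    # Assumed for illustration only: 30% of voters appear on the mailing
    # list; 42% of listed and 69% of unlisted voters back Roosevelt.
    population = []
    for _ in range(population_size):
        listed = rng.random() < 0.30
        support = 0.42 if listed else 0.69
        population.append((listed, rng.random() < support))

    true_share = sum(vote for _, vote in population) / population_size

    # Digest-style poll: very large, but drawn only from listed voters.
    listed_votes = [vote for listed, vote in population if listed]
    big_biased = rng.sample(listed_votes, 100_000)

    # Gallup-style poll: small, but drawn at random from everyone.
    all_votes = [vote for _, vote in population]
    small_random = rng.sample(all_votes, 3_000)

    return (true_share,
            sum(big_biased) / len(big_biased),
            sum(small_random) / len(small_random))


if __name__ == "__main__":
    true_share, biased, unbiased = simulate_polls()
    print(f"true support:          {true_share:.3f}")
    print(f"100,000 from the list: {biased:.3f}")
    print(f"3,000 at random:       {unbiased:.3f}")
```

In this model the large sample settles very precisely on the wrong answer, because it measures support among listed voters only, while the small random sample lands close to the true figure. Making the biased sample larger does not help.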

Similarly, modern-day opinion watchers might be attracted by the great number of messages posted on Twitter, say, but forget that, in the US at least, Twitter users are not a representative sample of the population as a whole, being apparently more likely to be young, urban or semi-urban, and black.

Supporters of big data often make four major claims:

1. That the results are accurate.

2. That ‘every single data point can be captured, making old statistical sampling techniques obsolete’.

3. That we should not worry about what causes what.

4. That “the numbers speak for themselves”, so there is no need for scientific or statistical models.

These are all questionable assumptions. There is no doubt that much information can be gleaned from big data, but its limitations always have to be remembered, and analyses based upon it cannot ignore difficult questions of causation.

There is also the problem of ‘multiple comparisons’, which is likely to become more pronounced in large data sets. Basically, unless the analysis is carefully controlled, the larger a data set is, the more hidden relationships between variables there are likely to be, some of which arise purely by chance and some of which might only become evident decades after a study has been made. (This was well illustrated in a 2005 paper by the epidemiologist John Ioannidis, ‘Why Most Published Research Findings Are False’.)
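A small simulation gives a feel for the multiple-comparisons problem. The sketch below is illustrative only: it generates an outcome and many candidate variables that are all pure random noise, then counts how many candidates look ‘significantly’ correlated with the outcome. The function names are hypothetical, and the cut-off of 1.96/√n is a large-sample approximation to the usual 5% significance level, not an exact test.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def count_chance_findings(n_variables, n_obs=100, seed=0):
    """Count noise variables that pass a rough 5% significance cut-off."""
    rng = random.Random(seed)
    outcome = [rng.gauss(0, 1) for _ in range(n_obs)]

    # Approximate two-sided 5% threshold for |r| when there is no real link.
    threshold = 1.96 / math.sqrt(n_obs)

    hits = 0
    for _ in range(n_variables):
        # Each candidate is independent noise, unrelated to the outcome.
        candidate = [rng.gauss(0, 1) for _ in range(n_obs)]
        if abs(pearson(outcome, candidate)) > threshold:
            hits += 1
    return hits


if __name__ == "__main__":
    for k in (10, 100, 1000):
        print(k, "variables tested ->", count_chance_findings(k), "chance 'findings'")
```

Since none of the variables is related to the outcome, every hit is a fluke. Roughly one test in twenty passes the cut-off, so the number of spurious ‘findings’ grows in step with the number of variables examined.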

Useful as more data can be, it does not of itself produce insight. To understand what is going on, possible and probable causal relationships always have to be teased out.

The full article is here:

Tim Harford’s latest book is ‘The Undercover Economist Strikes Back’.

