March 13, 2014

Why Big Data Isn't Necessarily Better Data

By Larry Greenemeier

This article was published in Scientific American’s former blog network and reflects the views of the author, not necessarily those of Scientific American

Tech companies—Facebook, Google and IBM, to name a few—are quick to tout the world-changing powers of “big data” gleaned from mobile devices, Web searches, citizen science projects and sensor networks. Never before has so much data been available covering so many areas of interest, whether it’s online shopping trends or cancer research. Still, some scientists caution that particularly when it comes to data, bigger isn’t necessarily better.

Context is often lacking when information is pulled from disparate sources, which can lead to questionable conclusions. A case in point is the difficulty Google Flu Trends (GFT) has had at times in accurately measuring influenza levels since Google launched the service in 2008. A team of researchers explains where this big-data tool falls short, and where it has much greater potential, in a Policy Forum published Friday in the journal Science.

Google designed its flu data aggregator to provide real-time monitoring of influenza cases worldwide based on Google searches that matched terms for flu-related activity. Despite some success, GFT has overestimated peak flu cases in the U.S. over the past two years. GFT overestimated the prevalence of flu in the 2012-2013 season, as well as the actual levels of flu in 2011-2012, by more than 50 percent, according to the researchers, who hail from the University of Houston, Northeastern University and Harvard University. Additionally, from August 2011 to September 2013, GFT over-predicted the prevalence of flu in 100 out of 108 weeks.

Nature reported in a February 2013 news article that GFT predicted more than twice as many doctor visits for influenza-like illness as the Centers for Disease Control and Prevention (CDC), which bases its estimates on surveillance reports from a number of U.S. laboratories. (Scientific American is part of the Nature Publishing Group.)

Google’s software “relies on data mining records of flu-related search terms entered in Google’s search engine, combined with computer modeling,” Nature reported. Even though the researchers who wrote this week’s Policy Forum for Science cite several instances where GFT has faltered, Nature pointed out that GFT’s overall body of work has “almost exactly matched the CDC’s own surveillance data over time—and it delivers them several days faster than the CDC can.”

Google itself concluded in a study last October that its algorithm for flu (as well as the one for its more recently launched Google Dengue Trends) was “susceptible to heightened media coverage” during the 2012-2013 U.S. flu season. “We review the Flu Trends model each year to determine how we can improve; our last update was made in October 2013 in advance of the 2013-2014 flu season,” according to a Google spokesperson. “We welcome feedback on how we can continue to refine Flu Trends to help estimate flu levels.”

The Policy Forum researchers recognize that increased traffic to flu-related online resources could have factored into the problem, but they question whether “a media-stoked panic last flu season” fully explains “why GFT has been missing high by wide margins for more than [two] years. A more likely culprit is changes made by Google’s search algorithm itself.”

This point is key to the researchers’ argument. They contend that two issues have contributed far more to GFT’s mistakes: algorithm dynamics and “big data hubris.”

“[GFT’s] ad hoc method of throwing out peculiar search terms failed when GFT completely missed the nonseasonal 2009 influenza A–H1N1 pandemic,” the researchers say. “In short, the initial version of GFT was part flu detector, part winter detector.”

Big data hubris is the “often implicit assumption that big data are a substitute for, rather than a supplement to, traditional data collection and analysis.” The mistake of many big data projects, the researchers note, is that they are not based on technology designed to produce valid and reliable data amenable for scientific analysis. The data comes from sources such as smartphones, search results and social networks rather than carefully vetted participants and scientific instruments.

Other studies have shown the value of big data, the researchers acknowledge, yet “we are far from a place where they can supplant more traditional methods or theories.”

They note that “greater value can be obtained by combining GFT with other near–real-time health data.” For example, “by combining GFT and lagged CDC data, as well as dynamically recalibrating GFT, we can substantially improve on the performance of GFT or the CDC alone.” Big data could likewise be an effective tool for better understanding the unknown, in areas where CDC data does not work well, such as presenting flu prevalence at very local levels.
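To make the idea of combining the two data sources concrete, here is a minimal sketch, not the researchers' actual model. It assumes a plain linear regression that predicts the current CDC figure from the current GFT estimate and the CDC figure from two weeks earlier. The function names, the two-week lag and all of the numbers are hypothetical; the data are synthetic and built so that the stand-in for GFT runs systematically high.

```python
from typing import List, Sequence


def solve_linear_system(matrix: Sequence[Sequence[float]],
                        rhs: Sequence[float]) -> List[float]:
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular system: predictors are collinear")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    solution = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * solution[c] for c in range(r + 1, n))
        solution[r] = (aug[r][n] - tail) / aug[r][r]
    return solution


def fit_combined_model(gft: Sequence[float], cdc: Sequence[float],
                       lag: int = 2) -> List[float]:
    """Least-squares fit of cdc[t] ~ w0 + w1 * gft[t] + w2 * cdc[t - lag]."""
    rows = [(1.0, gft[t], cdc[t - lag]) for t in range(lag, len(cdc))]
    targets = [cdc[t] for t in range(lag, len(cdc))]
    k = 3
    # Normal equations: (X^T X) w = X^T y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, targets)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def predict(weights: Sequence[float], gft_now: float, cdc_lagged: float) -> float:
    """Combined estimate for the current week."""
    return weights[0] + weights[1] * gft_now + weights[2] * cdc_lagged


# Synthetic illustration only: made-up weekly flu levels, and a made-up
# search-based signal that overshoots them (hypothetical bias of 1.5x + 0.2).
synthetic_cdc = [1.0, 1.2, 1.8, 2.9, 4.1, 5.0, 4.4, 3.1, 2.2, 1.6, 1.3, 1.1]
synthetic_gft = [1.5 * level + 0.2 for level in synthetic_cdc]

weights = fit_combined_model(synthetic_gft, synthetic_cdc, lag=2)
latest_estimate = predict(weights, synthetic_gft[-1], synthetic_cdc[-3])
```

In this toy setup the fitted weights absorb the search signal's constant overshoot, so the combined estimate lands closer to the reference value than the raw signal does. Refitting the weights each week on a recent window of data would be one simple way to approximate the “dynamic recalibration” the researchers describe, though their actual procedure may differ.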

Projects would also benefit from more transparency, which would improve others’ ability to replicate them, according to the researchers. Platforms such as Google, Twitter and Facebook are always re-engineering their software, so whether studies based on data collected at one time could be redone with data collected in earlier or later periods is an open question.