Observations

Opinion, arguments & analyses from the editors of Scientific American

How to Find Meaning in a Maelstrom of Data

The views expressed are those of the author and are not necessarily those of Scientific American.



All of the data in the world—and the amount is growing at a frightening rate—won’t help researchers solve the big problems if they can’t make sense of it. Which is why a team of researchers from Harvard University and the Broad Institute of Harvard and M.I.T. has developed analytical data-mining software that can find an oasis of meaning in a desert of numbers. They’ve used the software to find insights on the socioeconomic impact of obesity, bacteria in the gut and baseball.

The software teases out relationships among data points (potentially millions of them) and measures the strength of these connections. As the researchers report in a paper appearing in the December 16 issue of the journal Science, most data-mining tools used today can either find correlations between data or determine how solid those connections are—few can do both.

“When we started this project we wanted a way to summarize what was in these datasets in a very simple way, asking what were the variables in these datasets that are most strongly associated,” says David Reshef, a co-first author of the paper and graduate student in the Harvard-M.I.T. Health Sciences and Technology program. “It’s a very simple question but it turned out to be very complicated because variables can be related in lots of different ways and there are various methods for finding different patterns.”

David Reshef—working with younger brother Yakir Reshef, Broad Institute associate member Pardis Sabeti and Harvard computer science professor Michael Mitzenmacher—tested the tool on social, economic, health and political data from the World Health Organization (WHO) and its partners. The data pool was large, covering 200 countries and containing 357 data variables per country, including household income and obesity.

The tool is part of a larger program the researchers call MINE (Maximal Information-based Nonparametric Exploration). It examined every possible pairing of variables (more than 60,000 of them) and produced a list of relationships ranked by the strength of one variable’s statistical dependence on the other (i.e., how closely the two variables are related).
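To make the scale of that pairwise scan concrete, here is a minimal Python sketch. It counts the distinct pairs among 357 variables, then ranks every pair in a small table by a dependence score. The score used here is a stand-in chosen for illustration (mutual information on a fixed, equal-frequency grid); it is not the researchers' own statistic. The function names and the synthetic data are likewise assumptions made for the example.

```python
import math
import random
from collections import Counter
from itertools import combinations


def pair_count(n_variables):
    """Number of distinct unordered pairs among n_variables variables."""
    return math.comb(n_variables, 2)


def equal_frequency_bins(values, n_bins):
    """Give each value a bin index so that bins hold roughly equal counts."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(values)
    return bins


def binned_dependence(xs, ys, n_bins=4):
    """Stand-in score (NOT the authors' statistic): mutual information of the
    two variables on a fixed n_bins x n_bins grid, divided by log(n_bins)
    so that the result lies between 0 and 1."""
    bx = equal_frequency_bins(xs, n_bins)
    by = equal_frequency_bins(ys, n_bins)
    n = len(xs)
    joint = Counter(zip(bx, by))
    px = Counter(bx)
    py = Counter(by)
    mi = 0.0
    for (i, j), c in joint.items():
        # p(i, j) * log( p(i, j) / (p(i) * p(j)) ), written with raw counts
        mi += (c / n) * math.log((c * n) / (px[i] * py[j]))
    return mi / math.log(n_bins)


def rank_pairs(table, score=binned_dependence):
    """Score every pair of columns and return them strongest first."""
    scored = [
        (score(table[a], table[b]), a, b)
        for a, b in combinations(sorted(table), 2)
    ]
    return sorted(scored, reverse=True)


# Synthetic, made-up data purely for illustration.
rng = random.Random(0)
x = [rng.uniform(0, 1) for _ in range(400)]
table = {
    "x": x,
    "parabola_of_x": [-(v - 0.5) ** 2 + rng.gauss(0, 0.01) for v in x],
    "noise": [rng.random() for _ in range(400)],
}

print(pair_count(357))  # 63546 pairs for 357 variables
for score, a, b in rank_pairs(table):
    print(f"{score:.2f}  {a} ~ {b}")
```

The count for 357 variables, 63,546, matches the "more than 60,000" figure. In the toy table the curved relationship should rank first, well ahead of the two pairs involving pure noise, because a grid-based score does not assume the relationship is a straight line.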

One identified relationship, for example, was between household income and female obesity. From this pairing, the researchers saw that the data from many countries follow a parabolic curve, with obesity rates rising with income but peaking and tapering off after income reaches a certain level. However, in the Pacific Islands, where female obesity is a sign of status, the rate of obesity followed a completely separate trend from the rest of the countries in the world, climbing rapidly even at low incomes.

The idea is to use MINE to generate new ideas and connections that no one has thought to look for before, says Yakir Reshef, a co-first author of the paper and a Fulbright scholar at the Weizmann Institute of Science in Israel. “The interdisciplinary nature of the project shows to us the widespread application of this work,” he adds. “It doesn’t matter whether it’s global health data, genomic data or Internet search statistics—on some level it’s all the same.” The researchers explain their work in more detail on their Web site and in a video accompanying their paper.

In another test, they analyzed nearly 6,700 pieces of data, collected by Harvard colleague Peter Turnbaugh, on microorganisms that live in the gut. The software made more than 22 million comparisons and homed in on a few hundred patterns of interest that had not been observed before.

The researchers also tested the software on baseball. They found that the statistics most closely related to a player’s salary were hits, total bases and an aggregate statistic that reflects how many runs a player generates for a team. During the 2008 season the Tampa Bay Rays, Atlanta Braves and current world-champion St. Louis Cardinals (not surprisingly) proved to have the fewest overpaid players relative to the number of “overperforming” players on their rosters. Predictably, the New York Yankees finished dead last. It’s not easy to find overperforming players when your payroll is the highest in baseball.

Photo: Brothers David Reshef (second from left) and Yakir Reshef (right) developed the maximal information coefficient (MIC), the statistic at the core of MINE, under the guidance of advisers Michael Mitzenmacher (left) of the Harvard University School of Engineering and Applied Sciences and Pardis Sabeti (second from right) of the Broad Institute. Image courtesy of ChieYu Lin

About the Author: Larry Greenemeier is the associate editor of technology for Scientific American, covering a variety of tech-related topics, including biotech, computers, military tech, nanotech and robots. Follow on Twitter @lggreenemeier.


4 Comments

  1. jayjacobus 4:20 pm 12/16/2011

    A good use for this program would be to tell scientists what are the independent variables for world temperature.

    Apparently the temperature for 2007 (for one) is substantially off the trend line. What is the cause of this aberration?

  2. denysYeo 5:45 pm 12/16/2011

    Absolutely the way to go for future data analysis; with the power of computing today it makes no sense to keep using statistical methods that were developed, in part, to squeeze data into units that could be handled by very limited computational resources. So it is great to see people using the full capability of computers to extract patterns and trends.

  3. jtdwyer 6:11 am 12/17/2011

    Perhaps socioeconomic researchers can one day use this software to identify how they can economically contribute to society.

  4. Quinn the Eskimo 1:02 am 12/19/2011

    The punch line only:

    There is NOW!
