To find out when whooping cough started making a comeback in Ohio, or how often measles kills in America, we turn to historical records. But those records aren’t very useful when they’re squirreled away in a distant office basement. The same goes for records embedded in a report: you can look at them the way you might admire a painting, but you cannot drop the data into a spreadsheet and hunt for statistical significance. If you are only looking at a couple of years’ worth of information, that formatting dilemma is not such a big deal. You can scour the data and manually punch it into your analysis. It becomes a huge problem only when you are looking at hundreds or thousands of data points.
Such is the problem that public health experts at the University of Pittsburgh encountered when they were exploring old medical data and developing models that predict future outbreaks. “We found ourselves going back and pulling out historical datasets repeatedly. We kept doing it over and over and finally got to the point where we thought it would be not only a service to ourselves but everybody if all the data was made digital and open access,” says Donald Burke, the dean of Pittsburgh’s graduate school of public health.
Four years ago, buoyed by funds from the National Institutes of Health and the Gates Foundation, they started the process of digitizing 125 years’ worth of medical records. The endeavor was dubbed Project Tycho, named for the Danish nobleman Tycho Brahe, who made the voluminous astronomical observations that Kepler later tapped to develop the laws of planetary motion. (But no pressure, right?)
The online, open-access resource now features accounts of 47 diseases between 1888 and today. It includes data from the weekly Nationally Notifiable Disease Surveillance reports for the United States, standardized in such a way that the data can be immediately analyzed.
In the research world, that’s a big accomplishment. Making this data usable takes more than casually monitoring a scanner while sipping coffee. The data has to be made uniform, a tedious process of manual input with unenviable tasks like removing periods, dashes and other inconsistencies while identifying data gaps.
Pittsburgh researchers also gave their new data trove a test drive to illustrate what could be done with the data. They mined Tycho for information on seven common diseases detailed in the records: polio, measles, rubella, mumps, hepatitis A, diphtheria and pertussis. Looking at available records before and after vaccines were introduced for those diseases, they estimated that 103 million cases of those contagious diseases have been prevented since 1924 (assuming the reductions were all attributable to vaccination programs). Their findings are published in this week’s New England Journal of Medicine. The data also points to what can happen when communities become too lax about vaccinations (among other factors). The researchers quantified the resurgence of pertussis throughout the country in recent years, particularly from the Midwest to the Northwest and in the Northeast, as well as ongoing cases of mumps. “Reported rates of vaccine refusal or delay are increasing,” the authors write. “Failure to vaccinate is believed to have contributed to the reemergence of pertussis, including the large 2012 epidemic.”
When vaccines work well, sometimes “people no longer fear the disease and they undervalue the vaccine and in some ways that is what is going on right now,” says Burke, pointing to the discredited vaccine-autism link, which prompted some parents to turn away from childhood vaccines. With this newly available data collection, researchers can do more than simply look at where a disease is or is not occurring. They can begin looking for drivers of disease and identifying patterns in the burden of disease by, say, climate or socioeconomic status.
Flip through some of the data yourself here after it becomes searchable to the public on November 28.