The Limits of Big Data in Medical Research

It could help large institutions reach new insights into disease—but also make it harder for small labs with original ideas to compete for grants

This article was published in Scientific American’s former blog network and reflects the views of the author, not necessarily those of Scientific American.


The National Institutes of Health this spring will start recruiting one million people in the United States not only to have their genomes sequenced but also to provide their medical records and regular blood samples, and to submit to monitoring of their diet, physical activity, heart rate and blood pressure. The $1.455 billion, 10-year All of Us Research Program will create the largest and most diverse dataset of its kind and could provide new insights into disease. But such big data projects tend to overpromise, and dedicating so much of the NIH budget to large institutional players makes it harder for smaller labs with original hypotheses to win grant money and compete in science.

The building of such “biobanks” of patient data is consistent with a trend in science toward large institutional projects whose objectives are set by hierarchy and rank. These projects recall the public-private race to sequence the human genome, as well as efforts such as ENCODE, which have benefited high-profile labs over the course of decades. The Department of Veterans Affairs, another public research funder, began the Million Veteran Program and has so far recruited 650,000 veterans, with plans to sequence the DNA of 100,000 participants in the next two years through contractors such as Booz Allen Hamilton. Health insurers such as Kaiser Permanente and Geisinger Health are building genetic datasets linked to patient medical records. Companies such as Regeneron Pharmaceuticals, deCODE Genetics (a subsidiary of Amgen) and PatientsLikeMe are also creating large genetic databases.

In truth, neither public nor private groups can say for certain where the data will end up decades from now. Biobanks may in some cases resell data to third parties, generating tensions similar to those that social networking companies such as Facebook now face. More important, if genetic causes or risk factors are further elucidated for more diseases, no one knows whether insurers will treat genetic variants as a form of preexisting condition, which could affect the structure or cost of insurance coverage.


To the extent that people contribute their data to a biobank as a “public good,” the gift could backfire on future generations who face bias or discrimination from insurers. At the same time, it is not even probable that such expensive biobanks will lead to fundamental changes in treatment for most diseases.

Genes often interact with other genes in complicated ways, with positive or negative effects that depend on their genetic background. For instance, a genetic variant may contribute a small amount of risk for a complex disease, yet be harmless in the context of other variants with canceling, or “negative pleiotropic,” effects. Therefore, even sequencing the genomes of a million people, which enhances the statistical power to detect the effects of individual variants on disease, does not solve the intractable combinatorial problem of how the effect of any single gene variant is enhanced or diminished by the rest of the genome.
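The scale of that combinatorial problem is easy to check with a few lines of arithmetic. In this toy sketch, the one-million-variant figure is an illustrative order of magnitude (roughly the size of a common genotyping panel), not a number from the article:

```python
from math import comb

# Toy illustration: even a modest panel of candidate variants yields far
# more interaction terms than any cohort of one million genomes can power.
n_variants = 1_000_000  # assumed order of magnitude for common variants

pairs = comb(n_variants, 2)    # two-way (gene-gene) interaction terms
triples = comb(n_variants, 3)  # three-way interaction terms

print(f"pairwise tests:  {pairs:.3e}")   # ~5.0e11
print(f"three-way tests: {triples:.3e}") # ~1.7e17
# With ~1e6 sequenced participants there are ~500 billion pairwise and
# ~170 quadrillion three-way combinations -- vastly more hypotheses than
# samples, before considering higher-order interactions at all.
```

The point is not the exact counts but the growth rate: each added interaction order multiplies the hypothesis space by roughly another factor of the panel size, while the cohort stays fixed at one million.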

In fact, genetic variants that compromise a gene’s function are surprisingly common. One recent study showed that 3.7 percent of patients in a hospital system carried a variant linked to a single-gene, or Mendelian, disease yet remained undiagnosed, showing at most subclinical symptoms. Another study showed that nearly all of us carry at least one, and perhaps as many as six, variants for a recessive genetic disease, meaning the disease presents only if a person inherits two copies of the defective gene. Harmful mutations can persist in the population through balancing selection: they add useful genetic diversity in some cell types, such as immune cells, while also adding risk for diseases such as schizophrenia or cancer.

The more genetic data we generate, the more we find that each of us is a “mass of quirky imperfections working well enough (often admirably); a jury-rigged set of adaptations, built of curious parts, made available by past histories in different contexts,” in the words of Stephen Jay Gould. Biology is built on genetic trade-offs, and it does not comport with the progressive, neoliberal view of data evangelists who believe big data is moving us closer to human perfection, or to a world without genetic risk.

And if time is more primal than logic or statistics, biological breakdown is inevitable and irreversible. A recent study applied information theory to the epigenetic codes that control gene expression, showing how a loss of this epigenetic information (a loss of memory for when genes should turn on, a sort of entropy in gene expression) contributes to aging and cancer.
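The intuition behind that entropy argument can be sketched with Shannon’s formula. The distributions below are invented for illustration and are not taken from the study; they stand in for how often a locus is found in each of four hypothetical expression states:

```python
from math import log2

def shannon_entropy(probs):
    """Shannon entropy, in bits, of a discrete probability distribution."""
    return -sum(p * log2(p) for p in probs if p > 0)

# Hypothetical expression-state frequencies at one genomic locus.
young = [0.97, 0.01, 0.01, 0.01]  # state nearly deterministic: low entropy
aged  = [0.40, 0.25, 0.20, 0.15]  # states smeared out: the "memory" fading

print(f"young tissue: {shannon_entropy(young):.2f} bits")
print(f"aged tissue:  {shannon_entropy(aged):.2f} bits")
# Entropy rises as the signal degrades; a uniform spread over four states
# would be the maximum, 2 bits -- complete loss of the original program.
```

The sketch shows only the direction of the effect: as the distribution drifts from sharply peaked toward uniform, the entropy climbs toward its maximum, which is the information-theoretic sense in which the epigenetic “memory” is lost.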

The cost of such biobank projects is small compared with the defense budget, which is $700 billion for 2018, and certainly I and many other scientists would rather cut defense spending to bolster the NIH budget. But sinking so much of the NIH budget into big systematic projects worsens the problem of top-heavy institutional control in the life sciences and will make it harder for smaller labs with original hypotheses about biological mechanisms to win grants. And realistically, the technical limits of explaining genetic interactions, the paradoxes of genetic trade-offs and the problem of entropy in biology mean that big systematic data projects are unlikely to lead to a health care revolution.