We Need More Diversity in Our Genomic Databases

The ones we have now are too heavily skewed toward people of European descent

This article was published in Scientific American’s former blog network and reflects the views of the author, not necessarily those of Scientific American


The underrepresentation of nonwhite ethnic groups in scientific research and in clinical trials is a disturbing trend. The implications of this problem in the limited realm of clinical trials have been reported, but the roots of the disparity run much deeper. Human genomic databases—collections of all the genetic information that has been sequenced over the years—are also heavily skewed toward people of European descent.

This creates inequity in the usefulness of the information they contain, which informs everything from genetic test results to clinical trial outcomes, and it means that people of European descent stand to benefit most. If left unaddressed, the bias built into the databases will continue to contribute to the lack of diversity seen in drug trials as well as to the uneven success rates in precision medicine.

Lack of diversity in medical research stems not from an insidious bias but rather from the underlying structure of science. In the earliest days of genomics, funding for sequencing projects was often highest in predominantly white countries, so naturally those populations are better represented in public databases. Also, some ethnic minorities have been historically mistreated by scientists (the Tuskegee syphilis experiment is among the most notorious examples), and now members of those minorities are understandably reluctant to participate in research studies.


When the research community began sequencing human genomes, therefore, it was inevitable that the initial results would be a better biological match to some populations than to others. It took time to analyze enough people to understand the full extent of the diversity that exists across and within populations. Early studies were also biased by the type of genetic variation they included. Initially scientists looked only at tiny, single-base-pair DNA differences between populations, ignoring larger variation in genome sequences that was more difficult to assess.

The community has learned that there are far more of these larger, so-called structural variations than expected. Structural variations are now known to cause genetic disease and to affect the way drugs are metabolized by individuals and ethnic populations. Early sequencing technologies were not capable of detecting them accurately, but more advanced technologies are now allowing scientists to identify variations that in many cases have never been seen before.

This is an exciting step forward. We’re finding that some of these structural variations can explain diseases for which no cause had previously been found. One example is Carney complex, a rare disorder that causes tumors to appear in many parts of the body; another is a mutation that may contribute to bipolar disorder and schizophrenia. These new technologies also make comprehensive human genome sequencing much more affordable.

As a result of these advances, I am pleased to report that the genomics community can now start to address the serious problem of poor representation of ethnic diversity in our databases. As chief scientist at a DNA sequencing technology company, I get to witness these efforts every day. For example, many countries have launched population-specific genome projects that aim to produce extremely high-quality reference genomes. Excellent results from these projects in Korea, China and Japan have led to genomic resources that more accurately capture the natural genetic diversity present in those populations, with clinical implications for anyone of Korean, Chinese or Japanese descent. Such high-quality sequences are also enabling new large-scale studies of specific ethnic groups to dramatically improve their representation in genomic databases.

Already, these projects have led to new discoveries that can make clinical trials and medical care more successful for participants with these genetic backgrounds. For example, the Korean genome project found a population-specific variant in a gene that regulates how some medications are metabolized by the body; this is essential information for dosing and for gauging the likelihood that a patient will respond to a particular therapy.

As more of these projects move forward, there will be similarly important discoveries that will be relevant to any number of ethnic groups. An ongoing National Institutes of Health effort called “All of Us” aims to sequence a diverse sampling of Americans across gender, sexual orientation, ethnicity and race. Being inclusive is an essential goal of this program, and participation is free so as to open the doors to anyone who wants to join.

In the rare-disease field, genome sequencing has proven remarkably effective at increasing the diagnosis rate, giving answers to patients who might otherwise have gone undiagnosed. Today that approach is most effective for patients of European descent because more of their DNA can be interpreted using current genomic data repositories; people of other ancestries are less likely to get definitive answers. But as we build up data for people of other ethnicities, we can expect such successes in rare disease diagnosis to extend rapidly to patients of any background. Given the collective burden of rare diseases, this advance alone stands to dramatically improve the healthcare provided to hundreds of millions of people.

Achieving the vision of precision medicine that can be applied equally to people of any ethnic group requires more diverse representation in the biological repositories that underlie clinical programs. Advanced DNA sequencing technology is one tool of many needed to help generate better information about people from all ethnicities for the equitable application of that data in clinical practice.