Imagine opening Google Earth to find a large blackout zone roughly the size of Africa covering the surface of the globe. This region is completely unexplored and unknown. No satellite images exist, and our traditional mapping techniques cease to work here. Basic information about the general landscape is replaced only with darkened pixels. In a world where we can spend our lunch break staring at the surface of Mars or watching wind patterns swirl around the earth in real time, it is hard to imagine an uncharted zone of this scale. Undoubtedly, this would revitalize an exciting age of exploration, drum up media coverage and inspire modern adventurers. It seems unthinkable that we would allow this blackout zone to remain for 24 hours, let alone years.
While gaps of this size do not exist in our world maps, they are very present in our genomic map. Since the initial release of the human genome over a decade ago, researchers have extensively studied it, yet data remains missing from areas accounting for roughly 8 percent of the genome. These ‘blackout zones’ span millions of bases on each chromosome. To put this to scale: if Earth’s surface is 196.9 million square miles, the blackout zones in the human genome would collectively cover an area larger than Africa by 4 million square miles, or about four times the size of the United States.
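For readers who want to check the scale of this analogy, here is a rough sketch of the arithmetic. The Earth surface figure and the 8 percent share come from the text; the areas used for Africa and the United States are approximate values assumed for illustration.

```python
# Back-of-the-envelope check of the map analogy.
EARTH_SURFACE_SQ_MI = 196.9e6  # from the text
MISSING_FRACTION = 0.08        # "roughly 8 percent", from the text
AFRICA_SQ_MI = 11.7e6          # assumption: approximate area of Africa
USA_SQ_MI = 3.8e6              # assumption: approximate area of the United States


def blackout_area(surface_sq_mi, fraction):
    """Area of the globe that would be blacked out at the given fraction."""
    return surface_sq_mi * fraction


area = blackout_area(EARTH_SURFACE_SQ_MI, MISSING_FRACTION)
excess_over_africa = area - AFRICA_SQ_MI
ratio_to_usa = area / USA_SQ_MI

print(f"Blackout area: {area / 1e6:.1f} million square miles")
print(f"Larger than Africa by: {excess_over_africa / 1e6:.1f} million square miles")
print(f"Multiple of the United States: {ratio_to_usa:.1f}")
```

With these assumed areas, the result lands close to the figures quoted above: about 4 million square miles more than Africa, and a little over four times the United States.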
What are these mysterious blackout zones? And why are these regions absent from our map of the human genome?
To better understand the types of sequences in these regions, imagine standing in the middle of what appears to be a typical suburban neighborhood. As you walk past rows of houses with manicured yards, you note very little out of the ordinary to suggest why this neighborhood would be difficult to place on a map. You keep walking to the end of the block and quickly notice that the rows of houses ahead of you are completely identical to the block directly behind you. After a few miles of walking through what appears to be the same neighborhood again and again, you begin to feel lost, as you have little context to let you know where you are going or how far you have traveled. Over time you may notice a single red mailbox that breaks the uniformity and serves as an unexpected mile-marker. You could walk for months or even years through these neighborhoods of identical houses, only occasionally finding that one of those houses has a red mailbox, a broken window or an open garage door to let you know that you are not just walking in a circle. At some point you might get back onto the map, only to later wander into a new unmapped neighborhood with a completely different set of houses, also repeated over and over again, block after identical block.
In the human genome these neighborhood blocks are equivalent to stretches of DNA that are repeated in a head-to-tail fashion over and over again with shocking uniformity. These repeats occupy millions of bases that have only occasional sites of variation within them to differentiate one repeat from another. Researchers have called these repeat-rich regions “blackholes of the genome,” “puzzle pieces of a blue sky” and “a hall of mirrors.” Standard techniques used to put the puzzle pieces of the human genome together fail in these regions due to the inability to predict the true linear order of identical repeats, and as a result research in these areas of our map lags far behind the great advancements in the rest of our genome.
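A toy sketch can show why identical repeats defeat the puzzle-piece approach. The repeat unit, the variant and the fragment length below are all hypothetical, chosen only for illustration; real repeats and sequencing reads are far longer. The point is that short fragments drawn from 5 copies of a repeat are indistinguishable from those drawn from 50 copies, while a single changed base (the red mailbox) produces fragments that appear nowhere else.

```python
def kmers(sequence, k):
    """Return the set of all length-k fragments found in the sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


UNIT = "ACGTTGCA"     # hypothetical repeat unit (one "house")
VARIANT = "ACGTAGCA"  # the same unit with one changed base (the "red mailbox")
K = 6                 # hypothetical fragment length

short_array = UNIT * 5
long_array = UNIT * 50

# The fragments alone cannot tell 5 copies from 50 copies.
same_fragments = kmers(short_array, K) == kmers(long_array, K)

# One variant copy in the middle yields fragments seen nowhere else.
marked_array = UNIT * 20 + VARIANT + UNIT * 20
marker_fragments = kmers(marked_array, K) - kmers(long_array, K)

print("Same fragments for 5 and 50 copies:", same_fragments)
print("Fragments unique to the variant copy:", sorted(marker_fragments))
```

Those variant-specific fragments are the only pieces that carry positional information, which is why the rare sites of variation matter so much for mapping.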
This has all changed, however, with the most recent release of the human genome reference assembly, known to sequence-gazers as GRCh38. For the first time in our history, we are able to extend our maps into each of the blackout zones, describing an initial street view of roughly 60 million bases. This advance is worthy of a brief celebration, as when one breaks ground for a new building or christens a new boat before it heads out to sea: a toast that we are moving forward, while acknowledging the amount of work ahead.
As one of the contributors, I would like to take this opportunity to provide some cautionary points to those who wish to venture into these strange regions of the genome.
First, this release describes roughly a third of the missing data, offering sequence information for one type of tandem repeat family, named ‘alpha satellite DNA.’ Alpha satellite is the expected flagship sequence for these blackout zones as it benefits from prior experimental characterization and associates with the centromere, or the site responsible for proper chromosome segregation during cell division.
Every time the cell divides, DNA must be duplicated and then partitioned equally to each of the resulting daughter cells. As you can imagine, the success of this process is important for cell viability. Any misstep could lead to an unequal distribution of your genetic code. Moving and ultimately segregating each chromosome during division requires an attachment: a physical interaction between the DNA and a specific group of proteins known as the kinetochore. The DNA where this attachment takes place is called the centromere, which, in normal human cells, forms over regions enriched in alpha satellite DNA. While it is not yet clear why such a critical biological process associates with this peculiar repetitive landscape, the presence of new sequence models in the GRCh38 release will offer an opportunity to take a high-resolution look into the structure and function of these regions.
It is important to note that these are sequence models, and quite different from the rest of the map of the human genome. Much as we would identify a destination using GPS, researchers navigate the human genome by assigning each of our three billion bases to a distinct set of reference-based coordinates. Sites of potential biological interest (for example, protein coding genes) have the equivalent of a ‘street address’ in this map. This gives a spatial context to each base, allowing one to quickly determine the number of bases separating any two sites of interest, upstream or downstream. In other words, the linear ordering of these bases along the entire length of a given chromosome provides biologically meaningful information. This is not necessarily true of the new centromeric reference models, since, by design, they do not ensure the correct long-distance ordering of the tandem repeats.
Consider the problem this way: rather than walking through a series of identical neighborhood blocks, imagine that you only have a collection of aerial photographs. The majority of the images from the blackout zone provide a seemingly endless photo shoot of what appears to be the same neighborhood. Unlike pictures taken outside of these regions, where it is possible to unambiguously identify overlap in the captured frames and stitch them into a longer panorama, these images alone provide little information to arrange them in a linear manner. As an alternative, it is possible to simply array the images in a way that illustrates the full spectrum of neighborhoods observed in proportion to the entire zone. For example, if houses with red mailboxes were observed in 2 percent of the photographs, you would expect them to also be present in 2 percent of the houses represented in the final stitched-together panorama. This resulting model provides information across the region without requiring the correct linear ordering.
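The proportional idea can be sketched in a few lines. This is only an illustration of the analogy, not the method used to build the GRCh38 models; the function name, the labels and the model size are hypothetical. Given a pile of observed repeat copies, it fills a model of fixed size so that each variant appears in the same share as it was observed, and it makes no claim about order.

```python
from collections import Counter


def build_proportional_model(observed, model_size):
    """Return model_size labels whose proportions mirror the observed labels.

    The order of the returned list is arbitrary and carries no meaning.
    """
    counts = Counter(observed)
    total = len(observed)
    # Ideal (possibly fractional) number of slots for each variant.
    quotas = {label: n * model_size / total for label, n in counts.items()}
    slots = {label: int(q) for label, q in quotas.items()}
    # Hand leftover slots to the variants with the largest remainders.
    leftover = model_size - sum(slots.values())
    by_remainder = sorted(quotas, key=lambda label: quotas[label] - slots[label], reverse=True)
    for label in by_remainder[:leftover]:
        slots[label] += 1
    model = []
    for label, n in slots.items():
        model.extend([label] * n)
    return model


# Hypothetical survey: 2 percent of observed houses have a red mailbox.
observed = ["plain"] * 98 + ["red_mailbox"] * 2
model = build_proportional_model(observed, 50)
share = model.count("red_mailbox") / len(model)

print(f"Model size: {len(model)}, red mailbox share: {share:.0%}")
```

In this sketch a model of 50 houses contains exactly one red mailbox, preserving the 2 percent share while saying nothing about where along the street that house actually sits.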
Why care about these small sites of variation? As you might expect, having an inventory of potential ‘mile-markers’ is extremely useful for advancing our maps and ensuring the correct ordering of repeats. Additionally, the number of repeats and the corresponding number of variant sites are expected to differ between individuals in the human population. That is, depending on the genome you are studying, a particular variant could be represented once or hundreds of times. This offers a new source of sequence variation to study in the context of human population genetics and biomedical research. Although the current reference models provide information from one individual, this initial sequence description offers an important ‘mapping target’ to collect and study patterns of sequence variation across large cohorts of individuals and diverse datasets. For example, initial surveys of sequence variation using alpha satellite reference models from the X and Y chromosomes provided evidence that the size of these regions, or the number of repeats, can vary by an order of magnitude between individuals. The centromeric region of your X chromosome, for example, could be close to five million bases in length, while your neighbor's could be one-tenth the size, with a slightly different collection of variants.
Do these differences have any biological effect? We are just starting to understand the functional role of these genomic regions. Personally, I look forward to watching the first sunrise in these blackout zones. This is the beginning of an exciting age of genome exploration.
Image Credit: Julie Himes, Science Illustrator @ www.juliehimes.weebly.com