August 19, 2014

Similarities Between Human and Chimp Genomes Revealed by Hilbert Curve

Editor’s Note: The following is a guest post from Martin Krzywinski, a contributing artist who designed the Graphic Science illustration in the September issue of Scientific American magazine.

By Martin Krzywinski

This article was published in Scientific American’s former blog network and reflects the views of the author, not necessarily those of Scientific American.

For a graphic in the September 2014 issue of Scientific American, the editors challenged me to visually support the statement that we’re more like chimps and bonobos than gorillas, genomically speaking.

Here, we’ll look at how this information might be displayed visually, and I’ll take you through the thought process that resulted in our final product. But first, there are a few things to clear up about what a genome is and what a genome isn’t. The genome is not a blueprint. In fact, it looks nothing like one (Figure 1).

Figure 1 | 1936 Joy Oil gas station blueprints (top); sequence from human chromosome 1 (bottom).

A blueprint shows you “what” but a genome doesn’t encode “what.” The genome can instead be thought of as encoding a set of tools (proteins). It tells you nothing about what each tool does, what the tools act on, how the tools act together or what the tools are used to build.

What makes genome analysis and visualization difficult is not only this deep interplay between its parts–how, when and why the tools are used–but also its physical architecture: its size and the density and distribution of its functional regions. (Our genome is packed into 24 chromosomes, about 3 billion bases in all.)

The first thing to note is that the tools (proteins) are not necessarily encoded by neighboring regions of the genome. For example, the sequences that code for four proteins that convert tyrosine to epinephrine are located on chromosomes 3, 9, 11 and 17. When we draw the chromosomes in their natural order and orientation, this information is hidden.

Next, out of the 3 billion bases, not all have a well-defined job. Genes, which make up only about 33 percent of the genome, are segments of the genome that code for proteins. But strictly speaking, the term “coding regions” corresponds only to the specific staccato protein-coding sequences bundled within those larger genes. These segments are the exons (about 2.5 percent of the full genome, Figure 2). The rest of the genome (segments both inside and outside gene regions) doesn’t have an obvious function and has been disparagingly called “junk DNA.” However, junk DNA isn’t all junk and its role is hotly debated.

Figure 2 | Only 2.5% (75 million bases) of the human genome is translated into proteins. These regions are packaged in genes, which collectively span about 1/3 of the genome. A linear representation of the genome makes showing details difficult. Even the largest gene, Titin, cannot be discerned at this scale. Its exons would be about 0.1% of the length of the exon line, barely visible even at 100x magnification.

Filling space with a line
The basic scale of the human genome can be displayed linearly, as in Figure 2. Because the genome is big and not all of it matters to an equal extent, however, a dense visual representation is necessary to show detail in the context of the full genome.

For a static image on a magazine-sized page this is essentially impossible. As we see in Figure 3, if we represent the genome by a 1000 x 1000 pixel square, then only a 160 x 160 pixel square would hold the critical information (content related to the exons).

Figure 3 | Relative size of exons and genes in the human genome represented by area. If the genome is a 1000 x 1000 pixel image, the square representing bases in exons is only 160 x 160 pixels.
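
The numbers in Figures 2 and 3 can be checked with a little arithmetic. The following is a minimal Python sketch, assuming the rounded values quoted in the text (3 billion bases, 2.5 percent in exons, about one third in genes); the variable and function names are illustrative only. Because the genome is shown as an area, the side of each square scales with the square root of the fraction it represents.

```python
import math

GENOME_BASES = 3_000_000_000  # approximate genome size quoted in the text
EXON_FRACTION = 0.025         # about 2.5 percent of the genome is in exons
GENE_FRACTION = 1 / 3         # genes span about one third of the genome
SIDE_PIXELS = 1000            # side of the square that stands for the full genome


def square_side(fraction, full_side=SIDE_PIXELS):
    """Side of a square whose area is `fraction` of the full square."""
    return full_side * math.sqrt(fraction)


exon_bases = GENOME_BASES * EXON_FRACTION
exon_side = square_side(EXON_FRACTION)
gene_side = square_side(GENE_FRACTION)

print(f"bases in exons:   {exon_bases:,.0f}")
print(f"exon square side: {exon_side:.0f} pixels")
print(f"gene square side: {gene_side:.0f} pixels")
```

The exon square comes out a little under 160 pixels on a side, which matches the rounded figure in the caption.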

Nonetheless, a square is still ideal because it has more pixels (in print, dots) that can be used for data than a line, or a series of lines. The question is how to pack a genome (a one-dimensional object) into a square (a two-dimensional object). The answer is a space-filling curve, such as a Hilbert curve.

A Hilbert curve is easy to construct. Take a square and divide it into four quadrants. Connect three pairs of centers of the quadrants with a line to obtain a horseshoe shape. This is a Hilbert curve of order 1. One of the pairs isn’t connected (it doesn’t matter which one) so that the curve has a start and end. Higher order curves are constructed by repeatedly dividing each quadrant into sub-quadrants, as in Figure 4:

Figure 4 | Hilbert curve of order 1, with the opening on the left (backward C) so that the curve starts at the upper left and first travels to the right. Higher orders are constructed by recursively dividing each quadrant into quadrants. In each case the start of the curve is at the top left and the end at bottom left. The length of the curve approximately doubles at each order. The order 7 curve is about 128 times as long as the side of the square, roughly 85 times the length of the order 1 curve.
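
For readers who want to experiment, here is a minimal Python sketch of one common way to compute the curve: the iterative index-to-coordinate conversion found in many references, not necessarily the method used for the figures. The function names are my own, and the final coordinate swap is an assumption chosen so that the curve starts at the top left and ends at the bottom left, as in Figure 4, with rows counted downward.

```python
def hilbert_point(order, d):
    """Return (col, row) of step d along a Hilbert curve on a 2**order grid.

    Rows increase downward, so (0, 0) is the top-left cell.
    """
    side = 1 << order
    x = y = 0
    t = d
    s = 1
    while s < side:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:
            # Rotate or flip the sub-square so consecutive pieces join up.
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    # Swap axes (an orientation choice) to match the layout in Figure 4.
    return y, x


def hilbert_curve(order):
    """All 4**order cells of the curve, in the order they are visited."""
    return [hilbert_point(order, d) for d in range(4 ** order)]


def curve_length(order):
    """Length of the curve through cell centers, in units of the square's side."""
    segments = 4 ** order - 1
    return segments / 2 ** order


print(hilbert_curve(1))  # the order 1 horseshoe
for order in (1, 2, 5, 7):
    print(order, curve_length(order))
```

An order n curve visits 4^n cells joined by 4^n - 1 segments, each 1/2^n of the square’s side, so its length is just under 2^n sides. That is where the approximate doubling at each order comes from.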

Space-filling curves provide a way to pack a 1-dimensional object (the genome) into a 2-dimensional space (the page or the screen) in such a way that neighboring regions in the genome remain proximate in the 2-dimensional representation.
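
To make the proximity claim concrete, the sketch below compares a Hilbert layout with a plain row-by-row layout on a 32 x 32 grid (order 5). It is an illustration under my own assumptions: the helper names are hypothetical, and the choice of 16 positions as the test separation is arbitrary.

```python
import math


def hilbert_point(order, d):
    """(col, row) of step d along a Hilbert curve on a 2**order grid."""
    side = 1 << order
    x = y = 0
    t = d
    s = 1
    while s < side:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return y, x


def row_major_point(order, d):
    """(col, row) when the line is simply wrapped row by row, like text."""
    side = 1 << order
    return d % side, d // side


def max_step(point_fn, order):
    """Largest grid distance between two consecutive positions on the line."""
    n = 4 ** order
    steps = []
    for d in range(n - 1):
        (ax, ay), (bx, by) = point_fn(order, d), point_fn(order, d + 1)
        steps.append(abs(ax - bx) + abs(ay - by))
    return max(steps)


def mean_distance(point_fn, order, lag):
    """Average straight-line distance between positions `lag` apart on the line."""
    n = 4 ** order
    dists = [
        math.dist(point_fn(order, d), point_fn(order, d + lag))
        for d in range(n - lag)
    ]
    return sum(dists) / len(dists)


for name, fn in (("hilbert", hilbert_point), ("row-major", row_major_point)):
    print(name, max_step(fn, 5), round(mean_distance(fn, 5, 16), 2))
```

On the Hilbert layout consecutive positions always land in adjacent cells, whereas the row-by-row layout jumps across the full width at the end of every row. The guarantee runs in one direction only: cells that touch on the page are not necessarily close together in the genome.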

Creating the graphic
For the Scientific American graphic comparing the human genome to those of other primates, there wasn’t a great deal of space in the print layout – about 5 x 5 inches. Working with such a small area it was important to illustrate the concept quickly, ideally at first glance, and then provide another, more subtle and rich layer of information.

I make quite a lot of Circos figures–a method for visualizing data in a circular layout–often appropriate for showing similarities across genomes. But here the idea was to show differences, so a different form was required.

I thought that a Hilbert curve was a great approach. It’s somewhat of a boutique visualization that takes some getting used to. If you’re looking at it for the first time the multi-level square patterns can be a little distracting, but it’s a powerful way to compress information coherently into a small space. You don’t actually need to know anything about the intricacies of the curve to see pattern differences.

To compare gorilla, bonobo, chimp and Denisovan genomes to that of a human, I used order 5 Hilbert curves, striking a balance between legibility and detail. Figure 5 is a chromosome map of the full human genome on an order 5 curve.

Figure 5 | The chromosomes of the human genome mapped on an order 5 Hilbert curve.
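
One way to think about a chromosome map like Figure 5 is to lay the chromosomes end to end and give each a share of the curve proportional to its length. The Python sketch below assumes the curve is divided into the 4^5 = 1,024 cells of an order 5 grid, which may differ from how the figure was actually drawn. The chromosome names and lengths are hypothetical toy values, not real chromosome sizes.

```python
from itertools import accumulate

ORDER = 5
N_CELLS = 4 ** ORDER  # 1,024 cells on an order 5 curve

# Hypothetical toy lengths in bases, chosen for easy arithmetic.
# These are NOT real chromosome sizes.
TOY_LENGTHS = {"chrA": 500_000, "chrB": 300_000, "chrC": 224_000}


def cell_ranges(lengths, n_cells=N_CELLS):
    """Map each chromosome to the (first, last) curve cell it occupies.

    Chromosomes are concatenated in dictionary order, and each base is
    assigned to the cell at position * n_cells // total.
    """
    total = sum(lengths.values())
    ends = list(accumulate(lengths.values()))
    starts = [0] + ends[:-1]
    ranges = {}
    for name, start, end in zip(lengths, starts, ends):
        first = start * n_cells // total
        last = (end - 1) * n_cells // total
        ranges[name] = (first, last)
    return ranges


ranges = cell_ranges(TOY_LENGTHS)
for name, (first, last) in ranges.items():
    print(f"{name}: cells {first} to {last}")
```

With a real genome of about 3 billion bases, each of the 1,024 cells would stand for roughly 3 million bases.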

For the final graphic, I was only concerned with how primate genomes differ from the gene regions in the human genome. On the Hilbert curve shown in Figure 6, you can see which parts of the sequenced portions of the human genome (colored regions) are genes (black rectangles).

Figure 6 | Gene regions (black rectangles) superimposed on sequenced regions of the human genome (colored lines).

Since we were only concerned with the gene regions, the areas outside of the black rectangles above could be omitted. Figure 7 shows only the gene regions of the genome. The black rectangles now represent exons (the critical 2.5 percent of the genome I wrote about earlier). Notice that the color boundaries are different here than in Figures 5 and 6 because this winnowing down results in the removal of different lengths of various chromosomes.

Figure 7 | Density of exons in the gene regions in the human genome. The curve represents all the gene regions (black regions in Figure 6), about 1/3 of the genome, colored by their chromosome. The density map encodes the fraction of bases at that location that fall within an exon. In total about 2.5% of the genome (about 7.5% of genes) is in exons.

Figure 7 is exactly how the final magazine graphic is set up, except that instead of the fraction of bases in exons, what is shown is the fraction of bases that have an alignment (a region of sequence similarity) to another genome (e.g. chimp). Because of space constraints I decided to condense the bases along the Hilbert curve into 2,048 bins, each represented by a circle, rather than a density map of the type shown above. This would give the figure a more stylized and geometrical look. When density maps use a lot of tones, graphics can look blurry and, without strong regions of contrast, don’t sit firmly on the page.
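
The binning step can be sketched as follows. This is a simplified illustration in Python, not the pipeline used for the magazine graphic: it assumes the aligned regions arrive as non-overlapping half-open (start, end) intervals along the concatenated gene regions, and the function name and toy data are hypothetical.

```python
from bisect import bisect_right


def aligned_fraction_per_bin(intervals, total_bases, n_bins=2048):
    """Fraction of bases in each bin covered by an aligned interval.

    Assumes 0 <= start < end <= total_bases, non-overlapping intervals,
    and total_bases >= n_bins so that no bin is empty.
    """
    edges = [i * total_bases // n_bins for i in range(n_bins + 1)]
    covered = [0] * n_bins
    for start, end in intervals:
        i = bisect_right(edges, start) - 1  # bin that contains `start`
        while i < n_bins and edges[i] < end:
            # Add the part of the interval that falls inside bin i.
            covered[i] += min(end, edges[i + 1]) - max(start, edges[i])
            i += 1
    return [c / (edges[i + 1] - edges[i]) for i, c in enumerate(covered)]


# Toy example: 8 bases in 4 bins of 2 bases each.
toy = aligned_fraction_per_bin([(0, 2), (3, 5)], total_bases=8, n_bins=4)
print(toy)
```

Each value in the result would then drive the appearance of one circle along the curve.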

Steps Along the Way
I have developed graphics of full genome panels for multiple species before, such as for the British Library’s Beautiful Science exhibit, but those highlighted similarities between species, not differences. Whatever approach I took had to produce a graphic in which the differences were visually obvious.

I started by looking at how the alignments between the human and the primate genomes corresponded in terms of location. We know that as we look at species that are evolutionarily more distant from us, the segments get mixed up (or shuffled) between chromosomes–perhaps this shuffling would be a good way to visually compare the genomes, I thought. But the resulting figures were too complicated.

For our purposes there was also a question about how much the shuffling of genes between chromosomes matters, as long as discrete sequences are kept intact. For example, if you wanted to compare two libraries you wouldn’t necessarily care about the order in which they shelved their books. If both libraries had exactly the same books, then you might say that they’re identical. The rest is organization. (In the genome some of this organization is implicated in function, but this is getting into too much detail.)

To determine the difference between libraries, you might instead ask which books one library is missing that the other one has. This brought me closer to a useful form for my Hilbert curve. I shifted to looking at only the unaligned bases (base sequences in the human genome that were not represented in the other genomes), as seen in Figure 8.

Figure 8 | Comparing differences using fraction of unaligned bases. Larger red dots indicate a larger percent difference between the species listed and the human genome.

I refined things further by using different colors and adjusting the scale to emphasize differences. We were getting reasonably close to a candidate figure, as shown in Figure 9.

Figure 9 | Comparing differences in unaligned bases using a 5-color scheme.

At this point, we removed the mouse in favor of adding the Denisovan genome, which seemed more appropriate (Figure 10).

Figure 10 | Comparing differences in unaligned bases using a 5-color scheme. The mouse was replaced with the Denisovan, which was more relevant to the article.

Ultimately, instead of the Brewer spectral palette shown above, we decided to go with a yellow-red palette in the final version. With this approach the differences across the genomes are more intuitive.

I’m happy with the result. It is clean, symmetric and unadorned, and I think it gives a fairly good appreciation of the extent of the differences between the genomes. Yes, about a metric ton of detail has been left unsaid, but we’re out of space.

Not the least part of the reason for my satisfaction was the fact that I didn’t wind up making a circular representation. It was a much-needed break from round – until I looked more closely at the final figure and realized that 8,188 little circles were staring back at me.

For art based on the Hilbert curve (meet the Hilbertonians!) and this project on my website, click here.