In part 1 of this series, I talked about what DNA sequencing is and why it's an important tool. In part 2, I explained some of the technologies that scientists currently use to actually "read" the letters of DNA sequences from organisms. In this final piece, I'll explain how we go from the reads that come off a sequencer to understanding something about the organisms in the sequenced sample.

Assembling Genomes

The genome of an organism contains all* of the instructions for living and replicating, written in the language of DNA (or RNA in the case of some viruses). The first organisms to have their entire genomes sequenced were bacteriophages - viruses that infect bacteria - in the 1970s. These sequencing projects were incredibly laborious, and the genomes were only a few thousand bases long. In the 1990s, the first genome of a bacterium (nearly 2 million bases) and the genome of the yeast Saccharomyces cerevisiae (12.5 million bases) were sequenced. The first human genome was finished in 2004, at a whopping 3.3 billion base-pairs.

I described sequencing the genomes of microbes in a complex community as being like walking into a library and reading books off of the shelf. But here's the thing - the analogy breaks down a bit here, because you can't just start from the beginning and read each letter. A better analogy would be throwing a book into a wood chipper and trying to assemble it again from the pieces. Even more accurately, you've only got the text that was on each scrap; you don't have the shape of the edges, so you can't know what fits with what. In fact, this project would be impossible. Take a look at the last sentence, and imagine it fragments to:


impossible

in fact

would be

this project

This could be "impossible this project would be in fact," or "in fact impossible would be this project." Nothing in the fragments tells you which order is right. Instead, let's throw the same sentence into the wood chipper a couple of times - each time it would fragment randomly, so the second time we might get:

be impossible


in

project would

fact this

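Now, by aligning overlapping fragments ("in fact" overlaps "fact this," which overlaps "this project," and so on), we have enough information to reconstruct the entire sentence: "in fact this project would be impossible."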

Now multiply that by a few million. Hopefully this analogy makes a couple of things clear: the longer the reads (the larger the fragments coming out of the wood chipper), the easier this will be, and you need to read each letter more than once on average - in some cases many more than once.
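For the curious, here is a minimal sketch of that overlap trick in Python, using the sentence fragments from above in place of DNA reads. It is a toy "greedy" approach (always merge the two pieces with the biggest overlap), not how any particular assembler works, and the function names are made up for this example.

```python
def overlap(left, right):
    """Number of words at the end of `left` that match the start of `right`."""
    for size in range(min(len(left), len(right)), 0, -1):
        if left[-size:] == right[:size]:
            return size
    return 0


def greedy_assemble(fragments):
    """Repeatedly merge the pair of fragments with the largest overlap."""
    pieces = [tuple(fragment.split()) for fragment in fragments]
    while len(pieces) > 1:
        best_size, best_i, best_j = 0, None, None
        for i, left in enumerate(pieces):
            for j, right in enumerate(pieces):
                if i != j and overlap(left, right) > best_size:
                    best_size, best_i, best_j = overlap(left, right), i, j
        if best_size == 0:
            break  # nothing overlaps any more, so leave the pieces separate
        merged = pieces[best_i] + pieces[best_j][best_size:]
        pieces = [p for k, p in enumerate(pieces) if k not in (best_i, best_j)]
        pieces.append(merged)
    return [" ".join(piece) for piece in pieces]


first_pass = ["in fact", "would be", "this project"]
second_pass = ["be impossible", "project would", "fact this"]

# One pass alone has no overlaps, so nothing can be joined.
print(greedy_assemble(first_pass))
# Two passes together overlap enough to rebuild the sentence.
print(greedy_assemble(first_pass + second_pass))
```

With only the first pass, the function hands back the three fragments untouched. With both passes it returns a single piece, "in fact this project would be impossible". Real genomes are much nastier than this sentence, mostly because they contain repeated stretches, which is like a book that uses the same phrase on many different pages.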

16S Ribosomal Profiling

Modern sequencing methods can generate millions or even billions of these short sequence "reads" at a time, but as I said above you need many more than one read per base, and even a single genome often has millions of bases. If you try to read the entire genome of every microbial member of a complex community, you're going to need several sequencing runs. And though prices have dropped a lot since the days of the Human Genome Project, sequencing is still far from cheap.
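To put rough numbers on this, average coverage (how many times each base gets read) is just the number of reads times the read length, divided by the genome size. The sketch below turns that around to ask how many reads you'd need. The specific numbers (a 5-million-base genome, 150-base reads, 30-fold coverage, 100 equally common species) are assumptions picked for illustration, not figures from any real study.

```python
import math


def average_coverage(num_reads, read_length, genome_size):
    """Average number of times each base in the genome is read."""
    return num_reads * read_length / genome_size


def reads_needed(genome_size, read_length, coverage):
    """Number of reads required to hit a target average coverage."""
    return math.ceil(genome_size * coverage / read_length)


# Assumed example values, for illustration only.
genome_size = 5_000_000   # bases in one hypothetical bacterial genome
read_length = 150         # bases per read
target_coverage = 30      # read each base about 30 times on average

per_genome = reads_needed(genome_size, read_length, target_coverage)
print(per_genome)         # 1000000 reads for a single genome

# A community of 100 equally abundant species needs 100 times as many.
print(per_genome * 100)   # 100000000
```

In a real community the species are not equally abundant, so getting enough reads from the rare ones takes far more sequencing than this estimate suggests.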

But if you want to know the information in a library, you don't necessarily have to read every page of every book - just getting a list of the titles is probably sufficient. Sure, some of the books are probably obscure, so knowing the title alone won't tell you everything, but if you're interested in comparing, say, the diversity of information in New York public libraries vs those in Massachusetts, a list of titles is plenty. This is the idea behind 16S ribosomal profiling - essentially going through the pile of fragments that came out of your wood chipper and only looking at the spines.

In the same way that every book has a title, every bacterium has a gene for 16S ribosomal RNA, the RNA component of the small subunit of the ribosome - the molecular machine that makes proteins. Also useful: this gene changes very slowly over bacterial evolution, so the degree of difference between the 16S genes of two microbes is a good proxy for how distantly related they are. And the best part: sequencing a few hundred bases of this gene is plenty to extract the necessary information.
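The "degree of difference" idea can be sketched very simply: line two 16S sequences up and count the fraction of positions where the letters match. The example below assumes the two sequences are already aligned and the same length (real pipelines have to do that alignment first), and the short sequences are invented for the example, not real 16S genes.

```python
def percent_identity(seq_a, seq_b):
    """Percentage of positions at which two pre-aligned sequences match."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    if not seq_a:
        raise ValueError("sequences must not be empty")
    matches = sum(1 for a, b in zip(seq_a, seq_b) if a == b)
    return 100 * matches / len(seq_a)


# Made-up 10-base snippets standing in for a few hundred bases of 16S gene.
microbe_1 = "ACGTACGTAC"
microbe_2 = "ACGTTCGTAC"  # differs from microbe_1 at one position

print(percent_identity(microbe_1, microbe_1))  # 100.0
print(percent_identity(microbe_1, microbe_2))  # 90.0
```

The higher the identity, the more closely related the two microbes are likely to be.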

This is how a huge number of microbial ecology studies are done - if you see a pie chart with different colors representing different microbes, it was probably made with 16S ribosomal profiling.
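The numbers behind one of those pie charts are nothing more than a tally: assign each 16S read to a microbe, then divide each count by the total. Here is a minimal sketch, assuming the reads have already been assigned to a genus by some upstream step. The genus names are real skin bacteria, but the counts are made up.

```python
from collections import Counter


def relative_abundance(assignments):
    """Fraction of reads assigned to each taxon."""
    counts = Counter(assignments)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {taxon: count / total for taxon, count in counts.items()}


# Invented read assignments, one genus label per 16S read.
reads = ["Staphylococcus"] * 5 + ["Cutibacterium"] * 3 + ["Corynebacterium"] * 2

profile = relative_abundance(reads)
print(profile["Staphylococcus"])   # 0.5
print(profile["Cutibacterium"])    # 0.3
print(profile["Corynebacterium"])  # 0.2
```

Each fraction becomes one slice of the pie.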

The 16S profile of human skin [Image from Wikimedia Commons]


Metagenomics

Sometimes, we want more information. Extending our metaphor, let's say we're comparing the New York and Massachusetts public libraries again, but instead of just looking at the diversity of titles, we want to know a little bit more about the content. For example, we want to know the average level of sophistication of the books on the shelves. Instead of going through our wood-chipper pile to assemble each book cover-to-cover, we could just try to build complete sentences and analyze those. It's not necessarily important to know which book each sentence came from; we just need to know, on average, what the reading level of a sentence in that library is.

Metagenomics is a happy medium between assembling whole genomes and 16S profiling. It requires more "sequencing depth" - more copies of each book thrown into the wood chipper - than 16S, but not nearly as much as trying to assemble whole genomes. The metagenome of a sample is a representation of all the genes present in an environment, without necessarily knowing which genes are present in which microbes. 16S genes will be revealed in metagenomic sequencing as well, so the only advantage of 16S profiling at this point is cost.

Sequencing RNA

Maybe knowing the level of sophistication of the books in NY vs MA libraries isn't what we're after; we want to know the level of sophistication of the patrons. Maybe NY libraries have a bunch of Shakespeare and Rumi, but the people going into the library are only reading E.L. James. What we really want to be able to do is analyze what books are being pulled off the shelves.

This is the idea behind RNAseq, which looks at the relative abundance of - you guessed it! - RNA. When a gene is turned on in a cell, the cell copies that gene's DNA sequence into molecules of RNA, and the amount of a particular RNA sequence in a cell is a measure of how strongly the gene is turned on. It's as if patrons of our metaphorical library aren't allowed to check out books; they're only allowed to photocopy the pages of the books they want to read.

Which genes are on or off governs the behavior of a cell more than which genes it has (after all, your heart cells and skin cells have the same genes, but very different behavior). But sequencing the DNA tells us nothing about which genes are actually being expressed, any more than knowing the books in a library tells you about the reading behavior of its patrons. The same technologies I described in part 2 of this series can be turned on the RNA extracted from cells.
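Here is a small sketch of what "relative abundance of RNA" means in practice. Because two samples rarely yield the same total number of reads, a common first step is to rescale each gene's read count to "counts per million" so the samples can be compared. The gene names and counts below are hypothetical, and real analyses use more careful normalization than this.

```python
def counts_per_million(read_counts):
    """Rescale raw per-gene read counts to counts per million reads."""
    total = sum(read_counts.values())
    if total == 0:
        return {gene: 0.0 for gene in read_counts}
    return {gene: count * 1_000_000 / total for gene, count in read_counts.items()}


# Hypothetical RNA read counts: both cell types carry the same two genes,
# but each makes very different amounts of RNA from them.
heart = {"gene_A": 900, "gene_B": 100}
skin = {"gene_A": 50, "gene_B": 950}

heart_cpm = counts_per_million(heart)
skin_cpm = counts_per_million(skin)

print(heart_cpm["gene_A"])  # 900000.0
print(skin_cpm["gene_A"])   # 50000.0
```

Sequencing the DNA of these two cell types would show the same two genes in both. Only the RNA counts reveal that the hypothetical gene_A is far more active in the heart sample.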


I recognize that the preceding posts lack anything specific to hang your hat on, but trust me, it's going to pay off. Next month, I'm going to start talking about some research that uses these methods, and hopefully these explainers will be worthwhile references to return to frequently.


*This isn't strictly true - there's other information (see "epigenetics") that can have important consequences - but it's mostly true.

Part 1: DNA Sequencing Introduction

Part 2: Next Generation Sequencing

Part 3: From Genes to Genomes (Current)