In Part 1 of this series, I described why sequencing the DNA of microbes is a useful way to study them.
An individual microbe is like a single book in a vast library. Over the last 100 years, we’ve learned to read and interpret, at least to some extent, the language of biological systems. But for most of that time, our investigations have been limited to pulling individual books off of the shelf and investigating them in isolation.
Sequencing DNA is like reading a book whose sentences are written in DNA - sequences of the chemical bases A, T, G and C. Now it's time to talk about how this is actually done.
As I mentioned in the last post, the first generation of sequencing technologies was pioneered by Frederick Sanger in the 1970s. His method relied on the same DNA-copying chemistry that later became the basis of another hugely important technique - the polymerase chain reaction (PCR) - which allows scientists to duplicate strands of DNA in a test tube. Normally when performing PCR, you mix your DNA template (what you want to duplicate), primers (not important here - we'll get to those later), an enzyme called "polymerase" that will build the new strands of DNA, and the building blocks of DNA (those DNA bases, A, T, G and C) in a form that the enzyme can use. When these four things are mixed together, the primers stick to the template DNA, and the polymerase adds the A's, T's, G's and C's according to the sequence of the template.
But Sanger used a clever trick: in addition to the normal bases that were used in PCR, Sanger included a small amount of a base that could be added to the chain, but couldn't be added to. In other words, if the polymerase enzyme grabbed this block, the reaction would stop.
Imagine we're going to PCR the sequence AATCCCGTCAGT. We include mostly the normal bases A, T, G and C in the reaction, but include a very small amount of a modified T*. If this nucleotide gets grabbed by the enzyme, it will get added to the chain, but the reaction will stop. If we didn't include any of the normal T, we'd get out a bunch of AAT, and every reaction would stop there. But since we've also added the normal T, we'll also get AATCCCGT and the full sequence AATCCCGTCAGT. Sanger couldn't read the sequences directly, but he could determine the length of each fragment, so he'd see that there was a T at positions 3, 8 and 12. By running separate reactions with terminator bases of each type, one could determine the lengths of the fragments that end in each base, and thus the complete sequence.
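To make the fragment-length logic concrete, here is a minimal Python sketch of the four-tube experiment. It is a toy model, not Sanger's actual protocol: the function names, the number of copies and the 20% chance of grabbing a terminator are all made up for illustration, and, like the example above, it ignores the fact that the new strand is really the complement of the template.

```python
import random

BASES = "ATGC"

def terminated_lengths(sequence, terminator, n_copies=2000, p_stop=0.2, seed=0):
    """Simulate one reaction tube and return the set of fragment lengths seen.

    Each copy grows along `sequence`. Whenever the next base matches
    `terminator`, the copy stops there with probability `p_stop`
    (the enzyme grabbed the modified base instead of the normal one).
    p_stop and n_copies are illustrative assumptions, not real values.
    """
    rng = random.Random(seed)
    lengths = set()
    for _ in range(n_copies):
        for position, base in enumerate(sequence, start=1):
            if base == terminator and rng.random() < p_stop:
                lengths.add(position)
                break  # chain terminated; this copy is finished
    return lengths

def read_sequence(lengths_by_base):
    """Rebuild the sequence by ordering fragment lengths across all four tubes."""
    base_at = {length: base
               for base, lengths in lengths_by_base.items()
               for length in lengths}
    return "".join(base_at[pos] for pos in sorted(base_at))

template = "AATCCCGTCAGT"
tubes = {base: terminated_lengths(template, base) for base in BASES}
print(sorted(tubes["T"]))    # fragment lengths from the T* tube
print(read_sequence(tubes))  # sequence recovered from lengths alone
```

The T* tube reports fragments of length 3, 8 and 12, and sorting the lengths from all four tubes spells out the original sequence without ever "reading" a fragment directly.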
This process is quite laborious, but later iterations of the Sanger method used fluorescent labels of different colors for each base, so that their addition to the chain can be visualized with a laser and a microscope (it's technically more complicated, but this is the gist of it). These innovations transformed DNA sequencing from a highly specialized technique to something used every day by most biology labs in the world. Sanger sequencing is still used today, but is impractical for sequencing the billions of bases required by many modern applications, like studying microbial communities. For that, we need to turn to "next gen" sequencing techniques.
Sequencing by Synthesis (Illumina)
These days, the next-gen sequencing market is dominated by Illumina. That graph of falling prices for gene sequencing has largely been driven by this one company. Just last year Illumina announced that they'd achieved a $1000 human genome. But conceptually, the process isn't all that different from Sanger sequencing - they're still looking at the fluorescent labels of bases that are added one by one to a template strand. The key to their success is maximizing the number of templates that can be read at one time.
This relatively jargony video explains the process in more detail, but the gist of it is that template strands are fixed to a solid surface and amplified in place. In other words, many copies of a single strand are made by a method similar to PCR, except they're stuck in place rather than floating around in a soup. On a single surface (called a flow cell), you can have a bunch of different DNA strands, each in its own spot. Then, when sequencing occurs, you're watching fluorescent signals from a single location on a physical surface, rather than needing a separate tube for each strand of DNA.
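The "many templates, one picture per cycle" idea can be sketched in a few lines of Python. This is only a cartoon of sequencing by synthesis: the flow cell is a dictionary whose keys are made-up (x, y) locations, each "image" is just the base added at every location in that cycle, and the function name is my own invention rather than anything from Illumina.

```python
def sequence_by_synthesis(clusters, n_cycles):
    """Read every cluster in parallel, one base per cycle.

    `clusters` maps a flow-cell location to the template amplified there.
    Each cycle stands in for one image of the whole flow cell.
    """
    reads = {location: "" for location in clusters}
    for cycle in range(n_cycles):
        # One "image": the color (base) seen at each location this cycle.
        image = {location: template[cycle]
                 for location, template in clusters.items()
                 if cycle < len(template)}
        for location, base in image.items():
            reads[location] += base
    return reads

# Three hypothetical clusters at different spots on the same flow cell.
flow_cell = {(0, 0): "AATCCCGTCAGT", (0, 1): "GGCATTAC", (5, 3): "TTTTACGGA"}
reads = sequence_by_synthesis(flow_cell, n_cycles=6)
for location, read in reads.items():
    print(location, read)
```

The point of the sketch is that the number of cycles sets the read length, while the number of locations sets how many reads you get, and the two are independent of each other.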
The newest generation of sequencers (called "HiSeq") can sequence 3 billion pieces of DNA at a time. The key limitation to this technology is the length of each individual sequence - they're only 150 bases long. Depending on how the sequences are being used, this can cause issues, which I'll address in the next post. Sometimes, fewer but longer sequences are a better way to go.
Take a look at the first ~3 min of this video (the rest is an ad pitch for a different technology explained below):
Ion Torrent Sequencing
Once again, the idea behind Ion Torrent sequencing is similar, in that bases are read as they are added to a DNA template. But Ion Torrent shares more with Sanger's initial method of a separate reaction for each base type. However, instead of running each reaction out on a gel, Ion Torrent measures the electrical change in a tiny volume of liquid. As each base is added, hydrogen ions are released, subtly changing the pH of the solution. Reactions are performed on a semiconductor chip that can read these minute changes, and register that a base has been added. The system cycles through each base once every 15 minutes, and records which individual wells had pH changes.
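Here is a rough Python sketch of what one well on the chip "sees" as the four bases are washed over it in turn. The flow order "TACG" and the function names are assumptions for illustration. The sketch also assumes that a run of identical bases is added in a single flow and gives a proportionally bigger signal, which goes a little beyond what I described above, and it again ignores the template/complement distinction.

```python
def ion_flow_read(template, flow_order="TACG", n_rounds=20):
    """Flow one base type at a time over a single well.

    Returns a list of (flowed base, number of bases incorporated).
    A count of 0 means no pH change was seen for that flow.
    """
    position = 0
    signals = []
    for _ in range(n_rounds):
        for base in flow_order:
            count = 0
            # The polymerase adds this base for as long as it is the next one needed.
            while position < len(template) and template[position] == base:
                count += 1
                position += 1
            signals.append((base, count))
            if position == len(template):
                return signals
    return signals

def call_bases(signals):
    """Turn the per-flow signals back into a sequence."""
    return "".join(base * count for base, count in signals)

signals = ion_flow_read("AATCCCGTCAGT")
print(signals[:4])          # first round of flows: T, A, C, G
print(call_bases(signals))  # sequence recovered from the pH changes
```

In the first round only the A flow produces a signal (two A's are added at once), so most flows are "empty". That is the price of not needing labeled bases: the machine has to ask about each base type in turn.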
This technology is theoretically faster than Illumina sequencing, produces marginally longer sequence reads and doesn't require modified bases (no need for fluorescent labels). Life Technologies is pushing Ion Torrent as potentially putting sequencing in the hands of more labs and hospitals, since the analyzers are significantly cheaper than giant Illumina machines and don't require a huge amount of technical training to use. However, the output from these systems still lags behind Illumina's technology, and the cost per base read is still higher than Illumina's. This isn't an issue if you're just trying to, say, identify a virus in a clinical sample at a remote hospital, but it makes the technology impractical for huge microbial community samples, where getting as many reads as possible for the lowest price is the priority.
Single Molecule Real Time (SMRT) Sequencing
SMRT technology, commercialized by the company PacBio, is quite different. As the name implies, this sequencing method detects the sequence of single DNA molecules (rather than relying on amplified pools). Rather than immobilizing the DNA strand and adding polymerase, in a SMRT cell the polymerase is immobilized and the DNA strand is attached to it. DNA can also be circularized, so that the same piece will keep looping through the same enzyme, allowing multiple reads through each molecule as the reaction proceeds. The bases are read, again by fluorescence, but there's some fancy optics and physics involved in the SMRT cell that, quite frankly, I don't understand.
The upshot of this technology is that PacBio can generate enormously long sequence reads with high accuracy. In other words, while Illumina's sequencing by synthesis generates reads that are 100-200 bases long, and Ion Torrent can generate reads up to ~400 bases long, SMRT sequencing can generate average read lengths of several thousand bases. This comes at the cost of fewer individual reads per base (which, if you watched the video above, means error checking is more difficult), but having very long reads can be critical to resolve the sequences in certain types of genomic regions - more detail on that coming up in the next post.
There are a lot of smart people working on ways to sequence DNA, and I'm sure there are other technologies that hold promise that I'm less familiar with. In general, the most important features of any given technology for large-scale sequencing efforts are:
- Read length
- Read number
- Cost per base read
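These three features are tied together by simple arithmetic: total output is read number times read length, and cost per base is the cost of a run divided by that output. A small Python sketch, using the HiSeq figures quoted earlier (3 billion reads of 150 bases); the run cost in the second example is a made-up placeholder, not a real price.

```python
def total_bases(read_number, read_length):
    """Total sequence output of a run, in bases."""
    return read_number * read_length

def cost_per_base(run_cost, read_number, read_length):
    """Cost per base read, in whatever currency run_cost is in."""
    return run_cost / total_bases(read_number, read_length)

# Figures quoted above for HiSeq: 3 billion reads, 150 bases each.
hiseq_yield = total_bases(3_000_000_000, 150)
print(f"{hiseq_yield:,} bases per run")

# Hypothetical run: the cost here is a placeholder for illustration only.
example_cost = cost_per_base(run_cost=1000.0, read_number=1_000_000, read_length=100)
print(f"{example_cost:.6f} per base")
```

The same total output can come from many short reads or fewer long ones, which is why read length and read number have to be weighed separately from cost.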
In the next post, I'll explain how we go from individual DNA sequences to understanding the structure of individual organisms or entire communities.
Part 2: Next Generation Sequencing (Current)
Part 3: From Genes to Genomes (coming soon!)