Full Disclosure: The author was involved in the research study discussed in this article.

Over four thousand people infected, nine hundred of them suffering hemolytic-uremic syndrome, a disorder whose first symptoms are vomiting and diarrhea. For fifty individuals, it ended in death. Such is the grim toll that the 2011 E. coli epidemic wrought upon Europe, and in particular France and Germany.

When Yonatan Grad, an infectious disease physician at Brigham and Women’s Hospital and research fellow at the Harvard School of Public Health (HSPH), first read of the outbreak in the New York Times, he thought he and his colleague, Bill Hanage, could contribute to the understanding of the outbreak.

Grad was right. Hanage had taken Francis Crick’s ‘gossip test’ and combined his knowledge of infectious disease with his interest in evolution to pursue a career in infectious disease epidemiology. As a result, Hanage’s career crossed paths with the world of genomic sequencing.

Speaking of the E. coli O104:H4 strain responsible for the outbreak, Hanage says, “It’s often not clear exactly how virulent an emerging infection is. But that said, it was pretty plain that this was a fairly vicious strain. My first reaction was that this would be a really interesting project for sequencing.”

Hanage, also an associate professor at HSPH, knew that conventional molecular epidemiology methods wouldn’t do for the scale of the project he had in mind. He also knew that the Sanger Institute in the United Kingdom had been using next-generation technologies to sequence large pathogen samples consisting of hundreds of strains for several years.

“This sort of work represents a basic shift in the way we think about sequencing, from thinking about one representative isolate, to thinking about many. In an outbreak like the one we are talking about here, if you sequence just one isolate you can find out useful things like how the outbreak relates to the rest of the species. However if you sequence more than one isolate from an outbreak you can define things like the diversity of the outbreak, and any individual lineages within it. Using this information, you can say things about transmission, or how things relate to one another.”

Together, Grad and Hanage devised an epidemiological study to sequence sixteen isolates of E. coli O104:H4. This study would’ve been impossible a decade ago. In 2001, the cost to sequence a human-sized genome was $100,000,000. In a feat that would leave Moore’s Law in the dust, companies like Roche and Illumina helped bring that cost to below $10,000 in 2010. Now consider that the E. coli genome is a seventh of one percent the size of the human genome.
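The arithmetic behind that comparison can be sketched in a few lines. The two dollar figures and the "seventh of one percent" ratio come from the text above; the two-year halving period used for Moore's Law is an assumption (it is often quoted as anywhere from 18 to 24 months), and scaling cost by genome size alone is a deliberately naive simplification that ignores coverage depth, sample preparation, and labor.

```python
import math

# Figures quoted in the article.
COST_2001 = 100_000_000  # dollars per human-sized genome in 2001
COST_2010 = 10_000       # dollars per human-sized genome in 2010 (upper bound)
YEARS = 2010 - 2001

# Assumption: Moore's Law modeled as a halving every two years.
MOORE_HALVING_YEARS = 2.0

# Article's ratio: E. coli genome is one seventh of one percent of a human genome.
ECOLI_FRACTION = (1 / 7) / 100


def fold_reduction(start_cost, end_cost):
    """How many times cheaper the end cost is than the start cost."""
    return start_cost / end_cost


def moore_fold_reduction(years, halving_years=MOORE_HALVING_YEARS):
    """Fold reduction expected if cost simply halved every `halving_years`."""
    return 2 ** (years / halving_years)


def implied_halving_time(start_cost, end_cost, years):
    """Halving time (in years) implied by the observed cost drop."""
    return years / math.log2(start_cost / end_cost)


observed = fold_reduction(COST_2001, COST_2010)
moore = moore_fold_reduction(YEARS)
halving = implied_halving_time(COST_2001, COST_2010, YEARS)
naive_ecoli_cost = COST_2010 * ECOLI_FRACTION

print(f"Observed drop: {observed:,.0f}-fold over {YEARS} years")
print(f"Moore's Law pace would give about {moore:.1f}-fold")
print(f"Implied halving time: about {halving * 12:.0f} months")
print(f"Naive size-scaled cost per E. coli genome: about ${naive_ecoli_cost:.2f}")
```

Under these assumptions the observed drop is several hundred times larger than a Moore's Law pace would predict, which is the sense in which the sequencing industry left that benchmark "in the dust."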

It wasn’t enough to sequence the isolates cheaply. The researchers also wanted to do it rapidly. “At the time we started, the outbreak had been going for three weeks, and health authorities were starting to understand the key elements of its origins and scope,” Grad says. “Our hope was that we would be able to provide insights into the outbreak that would be helpful, ideally in real time, and, even if not in time to impact this outbreak then as demonstration of the utility of this approach for future outbreaks.”

With the reduction in sequencing costs came a reduction in sequencing time. While the Human Genome Project used a process known as shotgun sequencing, in which bits of the human genome were sequenced in piecemeal fashion, the next-generation technologies opened up the possibility of whole-genome sequencing. Whereas the original Human Genome Project took over a decade to complete, a single person’s genome can now be sequenced in one month on a single machine.

Without these advances, Grad and Hanage’s proposed study (and other, earlier rapid E. coli O104:H4 sequencing projects, such as those conducted by the Beijing Genomics Institute and Pacific Biosciences, as well as crowd-sourced efforts) wouldn’t have been possible. But quick and cheap would not be good enough if the data weren’t accurate.

The Broad Institute, a non-profit research center minutes from MIT and Harvard University, had what Grad and Hanage needed: the machines to rapidly sequence a number of the isolates (some isolates were sequenced in Europe), and the expertise to ensure the data were accurate. But a sequencing and analysis powerhouse couldn’t help them without DNA samples. With one cog remaining for their study, the pair turned to Karen Krogfelt of the Statens Serum Institut in Denmark, and Francois-Xavier Weill of Institut Pasteur. “While those of us in Boston were watching (the outbreak) from afar,” Hanage says, “colleagues in Europe were faced with a massive challenge to public health, with the associated interest and scrutiny of the media. Frankly, preparing DNA was understandably not a priority!”

But prepare it they did. The E. coli isolate genomes progressed from DNA in a test-tube to data on a computer screen in a few days. The analysis that followed was critical. “Sequencing and comparing very closely related isolates is a very different proposition from comparing divergent lineages,” Hanage says. “If isolates differ at only a handful of SNPs (single nucleotide polymorphisms, representing mutations in genetic sequence), it is really important to get your SNP calling right if the results are not going to be swamped with false positives. Developing methods for that and checking the results were a big deal.”
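A back-of-the-envelope calculation shows why false positives loom so large. The sketch below assumes a genome of roughly five million bases (an approximation), a hypothetical five true SNPs separating two isolates, and two hypothetical per-base false-call rates. None of these numbers come from the study; they are chosen only to illustrate the scale of the problem.

```python
# Assumption: approximate genome length in bases (not a figure from the study).
GENOME_SIZE = 5_000_000

# Assumption: a "handful" of real differences between two outbreak isolates.
TRUE_SNPS = 5


def expected_false_positives(genome_size, false_call_rate):
    """Expected number of wrongly called SNPs if each base is
    independently miscalled with probability `false_call_rate`."""
    return genome_size * false_call_rate


def precision(true_snps, false_positives):
    """Fraction of called SNPs that are real, assuming every true SNP is found."""
    return true_snps / (true_snps + false_positives)


# Hypothetical false-call rates: one per million and one per hundred million bases.
for rate in (1e-6, 1e-8):
    fp = expected_false_positives(GENOME_SIZE, rate)
    print(f"rate={rate:g}: about {fp:.2f} false SNPs, "
          f"precision about {precision(TRUE_SNPS, fp):.2f}")
```

At one false call per million bases, errors are expected to be as numerous as the real differences, so half of the reported SNPs would be spurious. At one per hundred million, they all but vanish. When isolates are separated by thousands of SNPs the same error rates barely matter, which is the contrast Hanage is drawing.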

The scientists at the Broad Institute employed a combination of computational algorithms and lab chemistry to determine where mutations occurred and to validate their results. What they found astonished them. Though the outbreak in Germany was more widespread and originated from a larger quantity of fenugreek seeds, the isolates from France had more mutations. In other words, the isolates of the larger, more widespread German outbreak were more closely related to each other than the isolates of the smaller French outbreak.
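One common way to put a number on "more closely related" is the mean count of SNP differences across every pair of isolates in a group. The sketch below illustrates that measure on short, made-up sequences. It is not the Broad Institute's pipeline, the group names are hypothetical, and the toy data do not represent the actual French or German isolates.

```python
from itertools import combinations


def snp_distance(seq_a, seq_b):
    """Number of positions at which two aligned sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for a, b in zip(seq_a, seq_b) if a != b)


def mean_pairwise_distance(seqs):
    """Average SNP distance over all unordered pairs in a group."""
    pairs = list(combinations(seqs, 2))
    if not pairs:
        raise ValueError("need at least two sequences")
    return sum(snp_distance(a, b) for a, b in pairs) / len(pairs)


# Toy, hypothetical aligned sequences (10 bases each).
tight_group = ["ACGTACGTAC", "ACGTACGTAC", "ACGTACGTAT"]
diverse_group = ["ACGTACGTAC", "ACGAACGTAC", "ACGTACCTAT"]

print("tight group:  ", mean_pairwise_distance(tight_group))
print("diverse group:", mean_pairwise_distance(diverse_group))
```

A group with the lower mean pairwise distance is the more tightly related one, regardless of how many people the corresponding outbreak infected. That independence between outbreak size and genetic diversity is what made the result surprising.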

In their study published in the Proceedings of the National Academy of Sciences, Grad and Hanage propose several possible explanations for these results. In the bottleneck hypothesis, the authors suggest that a bottleneck filtered out the diversity of the German strains before the outbreak proliferated throughout the country. For example, the outbreak could have begun from a single infected human associated with the German sprout farm.

Another hypothesis posed by Grad and Hanage is that sprouting conditions for the French seeds were more favorable for diversification than those on the German sprout farm. The French seeds were germinated one and a half days longer than their German counterparts, and were watered with tap water of varying temperatures as opposed to the consistent, 20 °C well water used on the German farm. It’s possible, then, that variable conditions produced greater diversity, though further research would be needed to support this hypothesis.

There is more to this study than these proposed hypotheses. "For genome-based epidemiology to be useful in an outbreak it will need to be done quickly," Hanage says. "Part of the goal of this work was to show that it can be done quickly, and convince people of the value of sequencing more than just one isolate."

Presuming enough people are convinced, this kind of large-scale, rapid, and inexpensive sequencing and analysis forecasts the future of epidemiology. “This is going to be the way epidemiology is done in the future,” Hanage says. “Imagine if we found that there were multiple deep lineages, suggesting more than one source of contamination? That is something we would want to tell people pretty sharpish. Depending on what we found, we might want more strains, and it would be easier to get them from work that is more relevant to an ongoing epidemic than one that happened a year ago.”

In other words, epidemiology is shifting from a reactive process to a proactive one, and that should mean fewer lives lost to disease outbreaks.


1. Grad, Yonatan H., et al. Genomic epidemiology of the Escherichia coli O104:H4 outbreaks in Europe, 2011. PNAS 2012 109: 3065-3070.

2. Wetterstrand, K. A. DNA Sequencing Costs: Data from the NHGRI Large-Scale Genome Sequencing Program. Available at: www.genome.gov/sequencingcosts. Accessed 4/2/2012.