February 21, 2013

ENCODE, Apple Maps and function: Why definitions matter

This article was published in Scientific American’s former blog network and reflects the views of the author, not necessarily those of Scientific American

Remember that news-making ENCODE study with its claims that “80% of the genome is functional”? Remember how those claims were the starting point for a public relations disaster which pronounced (for the umpteenth time) the "death of junk DNA"? Even mainstream journalists bought into this misleading claim. I wrote a post on ENCODE where I expressed surprise at why anyone would be surprised by junk DNA to begin with.

Now Dan Graur and his co-workers from the University of Houston have published a meticulous critique of the entire set of interpretations from ENCODE. Actually let me rephrase that. Dan Graur and his co-workers have published a devastating takedown of ENCODE in which they pick apart ENCODE’s claims with the tenacity and aplomb of a vulture picking apart a wildebeest carcass. Anyone who is interested in ENCODE should read this paper, and it’s thankfully free.

First let me comment a bit on the style of the paper which is slightly different from that in your garden variety sleep-inducing technical article. The title – On the Immortality of Television Sets: Function in the Human Genome According to the Evolution-Free Gospel of ENCODE – makes it clear that the authors are pulling no punches, and this impression carries over into the rest of the article. The language in the paper is peppered with targeted sarcasm, digs at Apple (the ENCODE results are compared to AppleMaps), a paean to Robert Ludlum and an appeal to an ENCODE scientist to play the protagonist in a movie named "The Encode Incongruity". And we are just getting warmed up here. The authors spare little expense in telling us what they think about ENCODE, often using colorful language. Let me just say that if half of all papers were this entertainingly written, the scientific literature would be so much more accessible to the general public.

On to the content now. The gist of the article is to pick apart the extremely liberal, misleading and scarcely useful definition of “functional” that the ENCODE group has used. The paper starts by pointing out the distinction between function that’s selected for and function that’s merely causal. The former definition is evolutionary (in terms of conferring a useful survival advantage) while the latter is not. As a useful illustration, the function of the human heart that is selected for is to pump blood while the function that’s causal is an additional weight of 300 grams and a capacity for producing thumping sounds.

The problem with the ENCODE data is that it features causal functions, not selected ones. Thus for instance, ENCODE assigns function to any DNA sequence that displays a reproducible signature like binding to a transcription factor protein. As this paper points out, this definition is just too liberal and often flawed. For instance a DNA sequence may bind to a transcription factor without inducing transcription. In fact the paper asks why the study singled out transcription as a function: “But, what about DNA polymerase and DNA replication? Why make a big fuss about 74.7% of the genome that is transcribed, and yet ignore the fact that 100% of the genome takes part in a strikingly “reproducible biochemical signature” – it replicates!”

Indeed, one of the major problems with the ENCODE study seems to be its emphasis on transcription as a central determinant of “function”. This is problematic since, as the authors note, there are lots of sequences that are transcribed but are known to have no function. But before we move on to this, it’s worth highlighting what the authors call “The Encode Incongruity” in homage to Robert Ludlum. The Encode Incongruity points to an important assumption in the study: the implication that a biological function can be maintained without selection and that the sequences with “causal function” identified by ENCODE will not accumulate deleterious mutations. This assumption is unjustified.

The paper then revisits the five central criteria used by ENCODE to define “function” and carefully takes them apart:

1. “Function” as transcription.

This is perhaps the biggest bee in the bonnet. First of all, it seems that ENCODE used pluripotent stem cells and cancer cells for its core studies. The problem with these cells is that they display a much higher level of transcription than other cells, so any deduction of function from transcription in these cells would be exaggerated to begin with. But more importantly, as the article explains, we already know that there are three classes of sequences that are transcribed without function: introns, pseudogenes and mobile elements (“jumping genes”). Pseudogenes are an especially interesting example since they are known to be inactive copies of protein-coding genes that have been rendered dead by mutation. Over the past few years, as experiments and computational algorithms have annotated more and more genes, the number of pseudogenes has gone up even as the number of protein-coding genes has gone down. We also know that pseudogenes can be transcribed and even translated in some cells, especially of the kind used in ENCODE, just as we know that they are non-functional by definition. Similar arguments apply to introns and mobile elements, and the article cites papers which demonstrate that knocking these sequences out doesn't impair function. So why would any study label these three classes of sequences as functional just because they are transcribed? This seems to be a central flaw in ENCODE.

A related point made by the authors is statistical: they argue that the ENCODE project has sacrificed selectivity for sensitivity. Some simple numerical arguments show how many false positives this trade-off invites. In fact this is a criticism that goes to the heart of the whole purpose of the ENCODE study:

“At this point, we must ask ourselves, what is the aim of ENCODE: Is it to identify every possible functional element at the expense of increasing the number of elements that are falsely identified as functional? Or is it to create a list of functional elements that is as free of false positives as possible. If the former, then sensitivity should be favored over selectivity; if the latter then selectivity should be favored over sensitivity. ENCODE chose to bias its results by excessively favoring sensitivity over specificity. In fact, they could have saved millions of dollars and many thousands of research hours by ignoring selectivity altogether, and proclaiming a priori that 100% of the genome is functional. Not one functional element would have been missed by using this procedure.”
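The trade-off in that passage can be made concrete with a small sketch. The counts below are entirely hypothetical (a toy "genome" of 1,000 elements, 100 of them truly functional) and are not taken from ENCODE or from the paper; the function name `rates` is likewise just an illustration. The point is only that declaring everything functional maximizes sensitivity while driving selectivity (specificity) to zero.

```python
def rates(tp, fp, tn, fn):
    """Return (sensitivity, specificity, precision) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0   # share of real elements found
    specificity = tn / (tn + fp) if (tn + fp) else 0.0   # share of non-functional DNA correctly rejected
    precision = tp / (tp + fp) if (tp + fp) else 0.0     # share of "functional" calls that are right
    return sensitivity, specificity, precision


# Hypothetical toy genome: 1,000 elements, 100 of which are truly functional.

# Strategy A: proclaim a priori that every element is functional.
call_everything = rates(tp=100, fp=900, tn=0, fn=0)

# Strategy B: a stricter, made-up classifier that misses some real elements.
call_cautiously = rates(tp=60, fp=20, tn=880, fn=40)

print("everything functional:", call_everything)
print("cautious calls:       ", call_cautiously)
```

Under these made-up numbers, strategy A misses nothing but nine out of ten of its calls are false positives, which is the authors' point about the "100% functional" shortcut.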

2. “Function” as histone modification

Histones are proteins that pack DNA into chromatin. The histones then undergo certain chemical modifications called post-translational modifications that cause the DNA to unpack and be expressed. ENCODE used the presence of 12 histone modifications as evidence of “function”. This paper cites a study that found a very small proportion of possible histone modifications associated with function. Personally I think this is an evolving area of research but I too question the assumption of having a function associated with most histone modifications.

3. “Function” as proximity to regions of open chromatin

In contrast to histone-packaged DNA, open chromatin regions are not bound by histones. ENCODE found that 80% of transcription sites were within open chromatin regions. But then they seem to have committed the classic logical fallacy of inferring the converse, that most open chromatin regions are functional transcription sites (there’s that association between transcription and function again). As the authors note, only 30% or so of open chromatin sites are even in the neighborhood of transcription sites, so associating most open chromatin sites with transcription seems to be a big leap to say the least.
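A toy calculation shows why the two percentages above are not interchangeable. The counts here are hypothetical, chosen only so that they reproduce the 80% and 30% figures quoted in the text, and the sketch simplifies by treating "within" and "in the neighborhood of" as the same overlap.

```python
def conditional_fractions(n_tss, n_open, n_both):
    """Return (P(open | transcription site), P(transcription site | open))."""
    return n_both / n_tss, n_both / n_open


# Hypothetical counts: 3,000 transcription sites, 8,000 open chromatin regions,
# and 2,400 cases where the two coincide.
p_open_given_tss, p_tss_given_open = conditional_fractions(
    n_tss=3000, n_open=8000, n_both=2400
)

print(f"P(open | transcription site) = {p_open_given_tss:.0%}")
print(f"P(transcription site | open) = {p_tss_given_open:.0%}")
```

The same overlap yields a high fraction in one direction and a low one in the other simply because open chromatin regions outnumber transcription sites in this made-up example.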

4. “Function” as transcription-factor binding.

This to me is another huge assumption inherent in the ENCODE study, especially as a chemist. As I mentioned in my earlier post, there are regions of DNA that might bind transcription factors (TFs) just by chance through a few weak chemical interactions. The binding might be extremely weak and may be a quick association-dissociation event. To me it seemed that in associating any kind of transcription-factor binding with function, the ENCODE team had inferred biology from chemistry. The current analysis gives voice to my suspicions. As the authors say, TF-binding sites are usually very short, which means that TF-binding “look-alikes” may arise in a large genome purely by chance. Any binding to these sites may be confused with real TF-binding sites. The authors also cite a study in which only 86% of TF-binding sites in a small sample of 14 sites showed experimental binding to a TF. Extrapolating to the entire genome, it could mean that only a fraction of the conjectured TF-binding sites may actually bind TFs.
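A back-of-the-envelope sketch gives a feel for how often short "look-alike" sites could arise by chance. It rests on loud simplifying assumptions that are mine, not the paper's: bases are independent and equally likely, only exact matches count, and the genome length is a round 3 billion base pairs. Real genomes are not random sequence, so the numbers are only indicative.

```python
def expected_chance_matches(genome_length, motif_length, both_strands=True):
    """Expected exact matches of one fixed motif in a random sequence.

    Assumes independent, equiprobable bases (A, C, G, T each at 0.25).
    """
    positions = genome_length - motif_length + 1   # possible start positions
    per_position = 0.25 ** motif_length            # chance of a match at one position
    strands = 2 if both_strands else 1             # optionally scan the reverse strand too
    return strands * positions * per_position


GENOME_LENGTH = 3_000_000_000  # assumed round figure, not a measured value

for k in (6, 8, 10, 12):
    print(f"motif length {k:2d}: ~{expected_chance_matches(GENOME_LENGTH, k):,.0f} chance matches")
```

Because the expectation shrinks by a factor of four with every extra base, short motifs are expected to turn up by chance many times in a genome of this size, which is why a match alone says little about function.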

5. “Function” as DNA methylation.

This is another instance in which it seems to me that biology is being inferred from chemistry. DNA methylation is one of the dominant mechanisms of epigenetics. But by itself DNA methylation is only a chemical reaction. The ENCODE team built on a finding that negatively correlated gene expression with methylation in CpG (cytosine-guanine) sites. Based on this they concluded that 96% of all CpGs in the genome are methylated, and therefore functional. But again, in the absence of explicit experimental verification, CpG methylation cannot be equated with gene expression. At the very least this calls for follow-up work to confirm the relationship. Until then the hypothesis that CpG methylation implies function will have to remain a hypothesis.

So what do we make of all this? It’s clear that many of the conclusions from ENCODE have been extrapolations devoid of hard evidence. But the real fly in the ointment is the idea of “junk DNA” which seems to have evoked rather extreme opinions that have ranged from proclaiming junk DNA as extinct to proclaiming it as God. Both these opinions perform a great disservice to the true nature of the genome. The former reaction virtually rolls out the red carpet for “designer” creationists who can now enthusiastically remind us of how each and every base pair in the genome has been lovingly designed. At the same time, asserting that junk DNA must be God is tantamount to declaring that every piece of currently designated junk DNA must forever be non-functional. While the former transgression is much worse, it’s important to amend the latter belief. To do this the authors remind us of a distinction made by Sydney Brenner between “junk DNA” and “garbage DNA”. There’s the rubbish we keep and the rubbish we discard, but some rubbish may turn out to be useful in the future. At the same time, rubbish that may be useful in the future is not rubbish that’s useful in the present. Just because some “junk DNA” may turn out to have a function in the future does not mean most junk DNA will be functional. In fact as I mentioned in my post, the presence of large swathes of non-functional DNA in our genomes is perfectly consistent with standard evolutionary arguments.

The paper ends with an interesting discussion about “small” and “big” science that may explain some of the errors in the ENCODE study. The authors point out that big science has generally been in the business of generating and delivering data in an easy-to-access format. Small science has been much more competent in then interpreting the data. This does not mean that scientists working on big science are incapable of data interpretation; what it means is that the very nature of big data (and the time and resource allocation inherent in it) may make it very difficult for these scientists to launch the kinds of targeted projects that would do the job of careful data interpretation. Perhaps, the paper suggests, ENCODE’s mistake was in trying to act as both the deliverer and the interpreter of data. In the authors’ considered opinion, ENCODE “tried to perform a kind of textual hermeneutics on the 3.5 billion base-pair genome, disregarded the rules of scientific interpretation and adopted a position of theological hermeneutics, whereby every letter in a text is assumed a priori to have a meaning”. In other words, ENCODE seems to have succumbed to an unfortunate case of ubiquitous pattern seeking from which humans often suffer.

In any case, there are valuable lessons in this whole episode. The mountains of misleading publicity it generated, even in journals like Science and Nature, were a textbook study in media hype. As the authors say:

“The ENCODE results were predicted by one of its lead authors to necessitate the rewriting of textbooks (Pennisi 2012). We agree, many textbooks dealing with marketing, mass-media hype, and public relations may well have to be rewritten.”

From a scientific viewpoint, the biggest lesson here may be to always keep fundamental evolutionary principles in mind when interpreting large amounts of noisy biological data under controlled laboratory conditions. It’s worth remembering the last line of the paper:

“Evolutionary conservation may be frustratingly silent on the nature of the functions it highlights, but progress in understanding the functional significance of DNA sequences can only be achieved by not ignoring evolutionary principles…Those involved in Big Science will do well to remember the depressingly true popular maxim: “If it is too good to be true, it is too good to be true.””

The authors compare ENCODE to AppleMaps, the direction-finding app in the iPhone that notoriously bombed when it came out. Yet AppleMaps also provides a useful metaphor. Software can evolve into a useful state. Hopefully, so will our understanding of the genome.