Copyright © 2012 Elsevier Ltd. All rights reserved.
Current Biology, Volume 22, Issue 21, R898-R899, 6 November 2012
doi:10.1016/j.cub.2012.10.002
The C-value paradox, junk DNA and ENCODE
Sean R. Eddy
HHMI Janelia Farm Research Campus, Ashburn VA 20147, USA
What is the C-value paradox? You might expect more complex organisms to have progressively larger genomes, but eukaryotic genome size fails to correlate well with apparent complexity, and instead varies wildly over more than a 100,000-fold range. Single-celled amoebae have some of the largest genomes, up to 100-fold larger than the human genome. This variation suggested that genomes can contain a substantial fraction of DNA beyond that needed for genes and their regulatory sequences. C.A. Thomas Jr dubbed it the ‘C-value paradox’ in 1971.
The C-value paradox is related to another puzzling observation, called ‘mutational load’: the human genome seems too large, given the observed human mutation rate. If the entire human genome were functional (in the sense of being under selective pressure), each generation would carry more new deleterious mutations than selection could plausibly purge. By 1970, rough calculations had suggested to several authors that maybe only 1–20% of the human genome could be genic, with the rest evolving neutrally or nearly so.
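To see the shape of that load calculation, here is a minimal back-of-the-envelope sketch in Python. The mutation rate, the tolerable number of new deleterious mutations per generation, and the chance that a mutation in functional DNA is harmful are all round numbers assumed purely for illustration; they are not the figures the 1970s authors used.

# Back-of-the-envelope mutational load argument. All numbers are assumed,
# illustrative round values, not the original 1970s inputs.
mu = 1.2e-8           # assumed mutation rate per base pair per generation
genome_bp = 3.2e9     # haploid human genome size in base pairs
new_muts = 2 * mu * genome_bp          # new mutations per diploid genome per generation
print(f"new mutations per generation: {new_muts:.0f}")    # ~77

p_del = 0.1           # assumed chance that a mutation in functional DNA is deleterious
tolerable = 1.0       # assumed tolerable number of new deleterious mutations per generation

# If a fraction f of the genome is functional, the deleterious load per generation
# is roughly f * new_muts * p_del; keeping it below the tolerable load bounds f.
f_max = tolerable / (new_muts * p_del)
print(f"implied upper bound on the functional fraction: {f_max:.0%}")    # ~13%

With these particular assumed inputs the bound falls within the 1–20% range quoted above; the point is the structure of the argument, not the exact numbers.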
So why not call it the ‘genome size paradox’? What is a ‘C-value’ anyway? ‘C-value’ means the ‘constant’ (or ‘characteristic’) value of haploid DNA content per nucleus, typically measured in picograms (1 picogram is roughly 1 gigabase). Around 1950, the observation that different cell types in the same organism generally have the same C-value was part of the evidence supporting the idea that DNA was responsible for heredity.
Why is it a paradox? Maybe we just don’t understand how to measure complexity? For sure, we don’t understand how to meaningfully measure an organism’s complexity, and we don’t have any theoretical basis for predicting how many genes or regulatory regions one needs. But the C-value paradox isn’t just an observation that different species have different genome sizes; it’s the observation that even similar species can have quite different genome sizes. For example, there are many examples of related species in the same genus whose haploid genome sizes differ by three- to eight-fold; this is particularly common in plants, as seen in species of rice (Oryza), Sorghum, or onions (Allium). The maize (Zea mays) genome expanded by about 50% in just 140,000 years since its divergence from Zea luxurians (and not merely by polyploidization). Unlike genes and regulatory sequences, which generally evolve slowly and conservatively, genome size can for some reason change rapidly on evolutionary timescales.
OK, cool; I’ve already come up with some hypotheses — maybe the extra DNA has a structural role in the nucleus? Remember, the C-value paradox is old. Many hypotheses have been proposed and carefully weighed in the literature. At first, people looked for explanations in terms of some functional significance of the extra DNA — an adaptive function that would maintain nongenic, nonregulatory DNA by natural selection. But to explain mutational load — and more modern observations from comparative genomics, showing that only a small fraction of most eukaryotic genome sequence is conserved and under selective pressure — you have to posit an adaptive role in which only the bulk amount of the DNA matters, not its specific sequence. To explain the C-value paradox, you also have to explain why this bulk amount would vary substantially even between similar species. Although some such adaptive roles have been proposed, a rather different line of thinking, starting with Ohno and others in the early 1970s, ultimately led to a reasonably well-accepted explanation of the C-value paradox.
So what is the explanation for the C-value paradox? Genomes carry some fraction of DNA that has little or no adaptive advantage for the organism at all. Some genomes carry more than others, and some genomes carry quite a lot of it. Ohno, who believed that strongly polarizing statements clarify scientific debate, called this ‘junk DNA’.
So the idea is that all noncoding DNA is junk DNA? No. Of course we’ve also known since the earliest days of molecular biology (including the Jacob/Monod lac operon paradigm) that genes are regulated by sequences that often occur in noncoding DNA. Rather, the idea is that there is a fraction of DNA that is useful and functional for the organism (genes and regulatory regions), which more or less scales with organismal complexity, and a ‘junk’ fraction that varies widely in amount, creating the C-value paradox.
I’m having a hard time with your derogatory term ‘junk’… Ohno’s zest for polarizing provocation went too far. Far from clarifying, his term tends to incense people, and the science behind the idea gets muddled. If you like, call it ‘nonfunctional’ DNA instead — and by nonfunctional, we mean ‘having little or no selective advantage for the organism’. These words, especially ‘for the organism’, will become important.
How much nonfunctional DNA an organism harbors reflects a tradeoff between how deleterious it is to carry and how easy it is to get rid of. It’s actually not obvious that extra DNA would be all that deleterious; DNA replication is a relatively small part of the energy budget of most organisms. Still, DNA deletions are common enough mutations. If there were even a small selective disadvantage to having a junky genome, especially in species with large population sizes (where small selection coefficients have more effect) and fast growth rates (where an obese genome might be a particular hindrance), it would be surprising to see a lot of nonfunctional DNA.
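The parenthetical about population size reflects a standard population-genetics rule of thumb: a fitness cost s is effectively invisible to selection when it is small relative to 1/(2Ne), where Ne is the effective population size. The minimal sketch below, using assumed numbers only, shows how the same small cost of carrying extra DNA can be effectively neutral in one species and visible to selection in another.

# Rule of thumb: selection dominates drift only when |s| is large relative to 1/(2*Ne).
# All numbers below are assumed, for illustration only.
def nearly_neutral(s, Ne):
    """Return True if selection coefficient s is too small to be effective at population size Ne."""
    return abs(s) < 1.0 / (2 * Ne)

s_extra_dna = 1e-6    # assumed tiny fitness cost of carrying a chunk of nonfunctional DNA

for Ne in (1e4, 1e6, 1e8):    # assumed effective population sizes, from small to very large
    regime = "effectively neutral" if nearly_neutral(s_extra_dna, Ne) else "visible to selection"
    print(f"Ne = {Ne:.0e}: a cost of {s_extra_dna} is {regime}")

On this logic, species with very large populations and fast growth would be expected to keep their genomes lean, while junk can accumulate more easily where effective population sizes are small.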
That’s what I mean: natural selection wouldn’t tolerate junk; if you can’t explain how this extra DNA got there and why it’s maintained, ‘junk DNA’ is an argument from ignorance — you can’t just assume it’s junk. Ohno was mostly focused on pseudogenes, which do occur, but not nearly in large enough numbers to explain the C-value paradox. So indeed, what Ohno’s idea lacked to make it convincing was an observable mechanism that creates large amounts of junk DNA rapidly, faster than natural selection deletes it. In 1980, two landmark papers, by Orgel and Crick and by Doolittle and Sapienza, established a strong case for such a mechanism. They proposed that ‘selfish DNA’ elements, such as transposons, essentially act as molecular parasites, replicating and increasing their numbers at the (usually slight) expense of a host genome. Selfish DNA elements function for themselves, rather than having an adaptive function for their host.
The massive prevalence of transposable elements in eukaryotic genomes was only just becoming appreciated at the time. One transposable element in humans, called Alu, occurs in about a million copies and accounts for about 10% of our genome. Almost all copies of transposons in genomes are partial or defective elements that were inserted in the evolutionary past and are now decaying away, largely by neutral mutational drift. Active transposons (one kind of ‘selfish DNA’) generate a mass of dead, decaying transposon copies (one source of ‘junk DNA’).
We can affirmatively identify transposon relics by computational sequence analysis. These studies show that transposable elements invade in waves over evolutionary time, sweeping into a genome in large numbers, then dying and decaying away. About 45% of the human genome is detectably derived from transposable elements. The true fraction of transposon-derived DNA in our genome must be greater, because neutrally evolving sequence decays quickly: after only a hundred million years or so, a relic becomes too degraded to recognize. The C-value paradox is mostly (though not entirely) explained by different loads of decaying husks of transposable elements. Larger genomes have a larger fraction of transposon relics.
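As one concrete, hypothetical illustration of what such a census looks like in practice, the short Python sketch below estimates the repeat-derived fraction of a genome from a soft-masked FASTA file, in which repeat-annotation tools such as RepeatMasker conventionally write transposon-derived bases in lowercase; the file name is a placeholder, and the estimate is only as good as the upstream repeat annotation.

# Estimate the fraction of a genome annotated as repeat/transposon-derived,
# assuming a soft-masked FASTA in which repeat-derived bases are lowercase.
# "genome.softmasked.fa" is a placeholder path.
def masked_fraction(fasta_path):
    masked = 0
    total = 0
    with open(fasta_path) as fh:
        for line in fh:
            if line.startswith(">"):     # skip sequence header lines
                continue
            for c in line.strip():
                if c in "ACGTacgt":      # count unambiguous bases only
                    total += 1
                    if c.islower():      # lowercase = annotated as repeat-derived
                        masked += 1
    return masked / total if total else 0.0

if __name__ == "__main__":
    frac = masked_fraction("genome.softmasked.fa")
    print(f"repeat-masked fraction: {frac:.1%}")    # roughly 45% for a human assembly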
...