What is a gene, post-ENCODE? History and updated definition
Mark B. Gerstein1,2,3,9, Can Bruce2,4, Joel S. Rozowsky2, Deyou Zheng2, Jiang Du3, Jan O. Korbel2,5, Olof Emanuelsson6, Zhengdong D. Zhang2, Sherman Weissman7, and Michael Snyder2,8
1 Program in Computational Biology & Bioinformatics, Yale University, New Haven, Connecticut 06511, USA;
2 Molecular Biophysics & Biochemistry Department, Yale University, New Haven, Connecticut 06511, USA;
3 Computer Science Department, Yale University, New Haven, Connecticut 06511, USA;
4 Center for Medical Informatics, Yale University, New Haven, Connecticut 06511, USA;
5 European Molecular Biology Laboratory, 69117 Heidelberg, Germany;
6 Stockholm Bioinformatics Center, Albanova University Center, Stockholm University, SE-10691 Stockholm, Sweden;
7 Genetics Department, Yale University, New Haven, Connecticut 06511, USA;
8 Molecular, Cellular, & Developmental Biology Department, Yale University, New Haven, Connecticut 06511, USA
While sequencing of the human genome surprised us with how few protein-coding genes there are, it did not fundamentally change our perspective on what a gene is. In contrast, the complex patterns of dispersed regulation and pervasive transcription uncovered by the ENCODE project, together with non-genic conservation and the abundance of noncoding RNA genes, have challenged the notion of the gene. To illustrate this, we review the evolution of operational definitions of a gene over the past century, from the abstract elements of heredity of Mendel and Morgan to the present-day ORFs enumerated in the sequence databanks. We then summarize the current ENCODE findings and provide a computational metaphor for the complexity. Finally, we propose a tentative update to the definition of a gene: A gene is a union of genomic sequences encoding a coherent set of potentially overlapping functional products. Our definition sidesteps the complexities of regulation and transcription by removing the former altogether from the definition and arguing that final, functional gene products (rather than intermediate transcripts) should be used to group together entities associated with a single gene. It also shows how integral the concept of biological function is to defining genes.
The classical view of a gene as a discrete element in the genome has been shaken by ENCODE
The ENCODE consortium recently completed its characterization of 1% of the human genome, using various high-throughput experimental and computational techniques designed to identify functional elements (The ENCODE Project Consortium 2007). This project represents a major milestone in the characterization of the human genome, and the current findings show a striking picture of complex molecular activity. While the landmark human genome sequencing surprised many with the small number (relative to simpler organisms) of protein-coding genes that sequence annotators could identify (∼21,000, according to the latest estimate [see www.ensembl.org]), ENCODE highlighted the number and complexity of the RNA transcripts that the genome produces. In this regard, ENCODE has changed our view of “what is a gene” considerably more than the sequencing of the Haemophilus influenzae and human genomes did (Fleischmann et al. 1995; Lander et al. 2001; Venter et al. 2001). The discrepancy between our previous protein-centric view of the gene and the one revealed by the extensive transcriptional activity of the genome prompts us to reconsider what a gene is. Here, we review how the concept of the gene has changed over the past century, summarize the current thinking based on the latest ENCODE findings, and propose an updated gene definition that takes these findings into account.