Desafiando a Nomenklatura Científica: Most "dark matter" transcripts are associated with known genes: did ENCODE drop the ball?

Friday, July 30, 2010

Most “Dark Matter” Transcripts Are Associated With Known Genes

Harm van Bakel1, Corey Nislow1,2, Benjamin J. Blencowe1,2, Timothy R. Hughes1,2*

1 Banting and Best Department of Medical Research, University of Toronto, Toronto, Ontario, Canada, 2 Department of Molecular Genetics, University of Toronto, Toronto, Ontario, Canada

Abstract

A series of reports over the last few years have indicated that a much larger portion of the mammalian genome is transcribed than can be accounted for by currently annotated genes, but the quantity and nature of these additional transcripts remains unclear. Here, we have used data from single- and paired-end RNA-Seq and tiling arrays to assess the quantity and composition of transcripts in PolyA+ RNA from human and mouse tissues. Relative to tiling arrays, RNA-Seq identifies many fewer transcribed regions (“seqfrags”) outside known exons and ncRNAs. Most nonexonic seqfrags are in introns, raising the possibility that they are fragments of pre-mRNAs. The chromosomal locations of the majority of intergenic seqfrags in RNA-Seq data are near known genes, consistent with alternative cleavage and polyadenylation site usage, promoter- and terminator-associated transcripts, or new alternative exons; indeed, reads that bridge splice sites identified 4,544 new exons, affecting 3,554 genes. Most of the remaining seqfrags correspond to either single reads that display characteristics of random sampling from a low-level background or several thousand small transcripts (median length = 111 bp) present at higher levels, which also tend to display sequence conservation and originate from regions with open chromatin. We conclude that, while there are bona fide new intergenic transcripts, their number and abundance is generally low in comparison to known exons, and the genome is not as pervasively transcribed as previously reported.

Author Summary

The human genome was sequenced a decade ago, but its exact gene composition remains a subject of debate. The number of protein-coding genes is much lower than initially expected, and the number of distinct transcripts is much larger than the number of protein-coding genes. Moreover, the proportion of the genome that is transcribed in any given cell type remains an open question: results from “tiling” microarray analyses suggest that transcription is pervasive and that most of the genome is transcribed, whereas new deep sequencing-based methods suggest that most transcripts originate from known genes. We have addressed this discrepancy by comparing samples from the same tissues using both technologies. Our analyses indicate that RNA sequencing appears more reliable for transcripts with low expression levels, that most transcripts correspond to known genes or are near known genes, and that many transcripts may represent new exons or aberrant products of the transcription process. We also identify several thousand small transcripts that map outside known genes; their sequences are often conserved and are often encoded in regions of open chromatin. We propose that most of these transcripts may be by-products of the activity of enhancers, which associate with promoters as part of their role as long-range gene regulatory sites. Overall, however, we find that most of the genome is not appreciably transcribed.

Citation: van Bakel H, Nislow C, Blencowe BJ, Hughes TR (2010) Most “Dark Matter” Transcripts Are Associated With Known Genes. PLoS Biol 8(5): e1000371. doi:10.1371/journal.pbio.1000371

Academic Editor: Sean R. Eddy, HHMI Janelia Farm, United States of America

Received: December 3, 2009; Accepted: April 9, 2010; Published: May 18, 2010

Copyright: © 2010 van Bakel et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Funding: This work was supported by Genome Canada (http://www.genomecanada.ca) through the Ontario Genomics Institute, the Ontario Research Fund, and March of Dimes (http://www.marchofdimes.com). HvB was supported by the Netherlands Organization for Scientific Research (NWO; http://www.nwo.nl) (grant no. 825.06.033) and the Canadian Institutes of Health Research (CIHR; http://www.cihr-irsc.gc.ca/) (grant no. 193588). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.

Abbreviations: APA, alternative cleavage and polyadenylation; BW, bandwidth parameter; CAGE, capped analysis of gene expression; CNV, copy number variation; lincRNAs, large intervening noncoding RNAs; ncRNAs, noncoding RNAs; ORF, open reading frames; pasRNA, promoter-associated RNA; TSS, transcription start site; TTS, transcription termination site; TU, transcript unit; TUF, transcript of unknown function

* E-mail: t.hughes@utoronto.ca

+++++

FREE PDF [OPEN ACCESS]

+++++

THE BLOGGER'S INDISCREET QUESTION:

So does this mean the ENCODE crowd dropped the ball on genome transcription??? Perish the thought.
