Biology's dirty little secret: genome data is corrupt or misannotated

Monday, May 12, 2014

Tuesday, May 06, 2014

Biology's Dirty Little Secret

Biology has a dirty little secret. It's a well-known secret among those who deal with sequenced-genome data intensively, but I suspect many non-biologists are unaware of the problem, which is: Much of the existing genome data (for sequenced genomes, ranging from bacteria to human DNA) is either corrupt or misannotated.

"Junk DNA" probably doesn't exist in living cells. But it certainly exists in published genomes. 

A substantial portion of published genome data is suspect at this point, because of contamination issues, technical problems surrounding DNA sequencing technology, or faulty gene annotation. An example is the Oryza sativa indica (rice) genome, which inexplicably contains at least 10% of the genome of the bacterium Acidovorax citrulli. There's also a Culex (mosquito) genome with a complete copy of Wolbachia embedded. The genome of Rothia mucilaginosa DY-18 contains over 300 genes incorrectly annotated in antisense orientation (as does the genome of Burkholderia pseudomallei strain 1710b, a truly execrable train-wreck of a genome).

Another example of a genome gone wrong (arguably) is that of the bacterium Ktedonobacter racemifer, which is filled with forward and backward copies of transposases. Incredibly, one in 13 Ktedonobacter genes is a transposase, integrase, or resolvase (and that's not counting the many "hypothetical proteins" with "transposase-like" mentioned in the gene ontology notes). Disregarding the 40% of that organism's genes that are marked as hypothetical proteins, one can say that in Ktedonobacter, one in four genes of known function is a transposase, integrase, or resolvase. (Some of the organism's 4000+ "hypothetical proteins" are actually transposases incorrectly annotated in an antisense orientation.) Common sense says something's amiss.
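To see why an antisense annotation matters, here is a minimal Python sketch. It assumes the standard genetic code and uses a made-up 15-base toy sequence (not taken from any of the genomes above); the names `reverse_complement` and `translate` are illustrative helpers, not part of any annotation pipeline. It shows that reading a gene off the wrong strand yields a completely different peptide, which is one plausible route to a spurious "hypothetical protein."

```python
# Illustrative sketch only: the toy sequence below is invented, not real genome data.

# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the opposite strand, read 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons from position 0; '*' marks a stop codon."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE[c] for c in codons)


sense = "ATGGCTAAAGGTTAA"              # hypothetical toy gene on the correct strand
antisense = reverse_complement(sense)  # what a wrong-strand annotation would read

print(translate(sense))      # peptide from the correct strand
print(translate(antisense))  # unrelated peptide from the wrong strand
```

The two printed peptides share nothing, even though they come from the same stretch of double-stranded DNA, so a similarity search on the wrong-strand product would likely find no match.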


The "dark matter" problem in microbial genetics is widespread and openly acknowledged. Some 28.3% (according to the Joint Genome Institute) of bacterial genes are annotated as "hypothetical protein," and most of these are so annotated because they have no sequence similarity match to any known protein. In many cases, there's no match because the sequences are in the wrong reading frame, or have an improperly located start codon (or other serious issues). When Ely and Scott (PLoS ONE, 2014) manually reannotated the genome of the bacterium Caulobacter crescentus, they identified 11 new genes, modified the start site of 113 genes, changed the reading frame of 38 genes, and found that 112 "hypothetical proteins" were actually non-coding DNA (not genes at all). A recent transcriptome analysis of the archaeon Sulfolobus solfataricus resulted in correction of 162 gene annotations and the addition of 80 new open reading frames. But these numbers barely hint at the extent of gene misannotation. In examining the Gene Ontology database (GOSeqLite), Jones et al. found:

Annotations made without use of sequence similarity based methods (non-ISS) had an estimated error rate of between 13% and 18%. Annotations made with the use of sequence similarity methodology (ISS) had an estimated error rate of 49%.

Surprisingly, the use of sequence similarity as a guide to function identification is less reliable than non-ISS methods. This is no doubt partly a reflection of the fact that gene databases contain a great deal of aberrant data. Gene-annotation programs like the widely used Glimmer (Gene Locator and Interpolated Markov Modeler) have to be trained, using a training set. If the training set contains faulty data, it's a classic GIGO situation.
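The start-codon and reading-frame problems mentioned above can be illustrated with a naive open-reading-frame scan. This is only a sketch under stated assumptions: it is not how Glimmer works (Glimmer scores candidates with interpolated Markov models), the function name `candidate_orfs` and the `min_codons` parameter are hypothetical, and the toy sequence is invented. It scans the three forward frames and reports every in-frame ATG that runs to the next stop codon.

```python
# Naive ORF scan, for illustration only; real gene finders score candidates statistically.
STOPS = {"TAA", "TAG", "TGA"}


def candidate_orfs(dna, min_codons=2):
    """Return (frame, start, end) for each ATG-to-stop stretch in the forward frames.

    'end' is the index just past the stop codon. Every in-frame ATG upstream
    of a stop is reported, so one stop can have several candidate starts.
    """
    orfs = []
    for frame in range(3):
        starts = []  # in-frame ATG positions seen since the last stop codon
        for pos in range(frame, len(dna) - 2, 3):
            codon = dna[pos:pos + 3]
            if codon == "ATG":
                starts.append(pos)
            elif codon in STOPS:
                for s in starts:
                    if (pos - s) // 3 >= min_codons:
                        orfs.append((frame, s, pos + 3))
                starts = []
    return orfs


toy = "ATGAAAATGCCCTAA"  # hypothetical toy sequence with two in-frame ATGs
for frame, start, end in candidate_orfs(toy):
    print(frame, start, end, toy[start:end])
```

On this toy input the scan reports two candidates in frame 0 that end at the same stop codon but begin at different ATGs. Sequence alone does not say which start is real, which is the kind of ambiguity behind the start-site corrections described above.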
...

Read more here: AssertTrue ()

+++++

NOTE FROM THIS BLOGGER: Kas Thomas is an HONEST evolutionary scientist!!!