The Many Lost Branches of the Tree of Life

Wednesday, August 26, 2015

Lost Branches on the Tree of Life

Bryan T. Drew, Romina Gazis, Patricia Cabezas, Kristen S. Swithers, Jiabin Deng, Roseana Rodriguez, Laura A. Katz, Keith A. Crandall, David S. Hibbett, Douglas E. Soltis

Published: September 3, 2013
DOI: 10.1371/journal.pbio.1001636


Given that reproducibility is a pillar of scientific research, the preservation of scientific knowledge (underlying data) is of paramount importance. The standard of reproducibility can be evaluated based on criteria of methodological rigor and legitimacy, which is sometimes used to distinguish “hard” from “soft” sciences. In phylogenetics, a discipline that routinely uses DNA sequences to build trees reflecting organismal relationships, the scale of data collection and the complexity of analytical software have both increased dramatically during the past decade. Consequently, the ability to navigate publications and reproduce analyses is more challenging than ever. When DNA sequencing was initially employed in systematics during the late 1980s, there was some reluctance to deposit nucleotide sequences in open repositories such as GenBank [1]. This ultimately changed when high-impact journals (e.g., Proceedings of the National Academy of Sciences, Nature, Science) began requiring GenBank submission as a prerequisite for publication [1],[2]; now virtually every evolutionary biology journal observes this requirement (but see [3]).
Until recently, uploading sequences to GenBank (or EMBL) was generally considered sufficient to ensure reproducibility of phylogenetic studies using DNA sequence data. Increasingly, however, the systematics community is realizing that archiving raw DNA sequences is not adequate, and that the underlying alignments of DNA sequences as well as the resulting phylogenetic trees are pivotal for reproducibility, comparative purposes, meta-analyses, and ultimately synthesis. Indeed, there has been a growing clamor for journals to adopt and enforce more rigorous data archiving practices across diverse disciplines [4]–[8]. As a result, about 35 evolutionary journals [5],[9] have adopted policies to encourage or require authors to upload alignments, phylogenetic trees, and other files requisite for study reproducibility [5] to TreeBASE (http://treebase.org/) and/or other public repositories such as Dryad (http://datadryad.org). Unfortunately, enforcement of such data deposition policies is generally lax, and most journals in systematics and evolution still do not require DNA sequence alignment or tree deposition. As a result, the alignments and trees underlying most published papers in systematics/phylogenetics and evolutionary biology remain inaccessible to the scientific community at large [8],[10].

Scope of the Problem

As DNA sequencing has become easier, faster, and cheaper, and as scientists have come to realize that phylogenies inform diverse areas of inquiry, phylogenetic trees have permeated virtually every facet of biology, including disparate subdisciplines such as medicine (e.g., [11],[12]), climate change research (e.g., [13],[14]), organismal evolution (e.g., [15]), conservation efforts (e.g., [16]), and linguistics (e.g., [17]). In building phylogenetic trees, researchers implicitly acknowledge that alignments and trees are important. However, archiving these data has been largely ignored, perhaps because researchers have considered the actual raw sequence data as the sole information necessary to replicate a phylogenetic study, while alignments and phylogenetic trees have been treated as the resulting outcome from sequence data analyses. The latter view of alignments and trees is certainly correct, but the underlying sequence alignments and associated trees should also be recognized as crucial data in their own right. Indeed, the increasing use of published trees and their underlying sequence alignments as the framework for evolutionary inference and downstream hypothesis testing dictates that alignments and trees are data and need to be archived with a diligence on par with raw sequence data.
The call for ensuring reproducibility and data sharing in systematics is not new. The fundamental importance of archiving scientific datasets across numerous subdisciplines including climate change research, evolutionary biology, and medicine has received increasing attention over the past five years [5]–[8],[10],[18]–[22]. Several of these studies have examined the proportion of publications that archived data in a manner that affords public access [6],[8],[18], and all concluded that we have entered an age in which scientific journals should require and enforce data archiving policies.
Some researchers, including [23] for psychology and [4] for medical research, have taken the next step and have contacted authors directly when data of interest have not been available, which highlighted an additional problem. These workers found that data are not easily obtained via direct author contact. More recently, Stoltzfus et al. [8] examined deposition practices within the molecular systematic community, and estimated alignment/tree deposition rates to be remarkably low (~4%). Stoltzfus et al. [8] focused on only two journals (American Journal of Botany and Evolution), and searched literature over just a 2-year period (2010–2011). Although the study of Stoltzfus et al. [8] represents a good first step, no analysis has attempted to evaluate how often alignments/trees are deposited over a broad range of evolutionary biology journals that span organismal diversity representing the tree of life, or how archiving tendencies have changed over time.
In the process of gathering data to build the first tree of life for all ~1.9 million named species (the Open Tree of Life Project; http://opentreeoflife.org), we examined 7,539 peer-reviewed papers to evaluate data depositional practices of foundational DNA sequence alignments and phylogenetic trees by the systematic community between 2000 and 2012. Our broad survey of the literature covered animals, fungi, seed plants, microbial eukaryotes, archaea, and bacteria, and included publications from more than 100 journals (see Tables S1–S4). To assess the rigor of data that were deposited in a public archive, we also examined the quality (e.g., Did deposited trees match publication figure(s)? Were there branch lengths in deposited trees?) of ca. 350 files deposited in TreeBASE (described in Text S1). Additionally, we attempted to acquire data by randomly contacting 375 authors directly (see Text S1 and Table S4). Furthermore, to evaluate depositional practices of other data critical for study replication, we surveyed 100 randomly selected publications that implemented the popular evolutionary analysis package BEAST (Bayesian Evolutionary Analysis Sampling Trees [24]; 4,153 citations as of July 17, 2013), which is widely used to obtain divergence times and phylogenies that are used to test hypotheses and draw conclusions regarding broad biological questions (e.g., phylogeography, lineage origins).
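One of the quality questions above, whether a deposited tree carries branch lengths, can be illustrated with a short script. The following is only a minimal sketch and not the procedure described in Text S1: it assumes the tree is available as a plain Newick string without quoted labels, and the function names are hypothetical.

```python
import re

# Matches a Newick branch length such as ":0.12" or ":1e-3".
# Assumption: labels are unquoted, so a colon always introduces a length.
_BRANCH_LENGTH = re.compile(r":\s*[-+]?(?:\d+\.?\d*|\.\d+)(?:[eE][-+]?\d+)?")


def count_branch_lengths(newick: str) -> int:
    """Count branch-length annotations in a Newick string."""
    # Drop square-bracket comments first; they may contain colons.
    without_comments = re.sub(r"\[[^\]]*\]", "", newick)
    return len(_BRANCH_LENGTH.findall(without_comments))


def has_branch_lengths(newick: str) -> bool:
    """Return True if at least one branch length is present."""
    return count_branch_lengths(newick) > 0


if __name__ == "__main__":
    with_lengths = "((A:0.1,B:0.2):0.05,C:0.3);"
    topology_only = "((A,B),C);"
    print(has_branch_lengths(with_lengths))
    print(has_branch_lengths(topology_only))
```

A tree that fails this check preserves only the topology, so analyses that depend on branch lengths (e.g., divergence time estimates) cannot be reproduced from the deposited file alone.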
Surprisingly, only 16.7% (1,262 of the 7,539 publications surveyed) provided accessible alignments/trees (Figures 1 and 2). Our attempts to obtain datasets directly from authors were only 16% successful (61/375; see Table S4), and we estimate that approximately 70% of existing alignments/trees are no longer accessible. Thus, we conclude that most of the underlying sequence alignments and phylogenetic trees produced by the systematic community during the past several decades are essentially lost, accessible only as static figures in a published journal article with no capacity for subsequent manipulation. Furthermore, when data are deposited, they are often incomplete (e.g., what characters were excluded, accepted taxon names; see Text S1 and Figure S1). Our survey of publications that implemented BEAST revealed that only 11 out of 100 (11%) examined studies provided access to the underlying xml input file, which is critical for reproducing BEAST results. Although funding agencies often require all data to be accessible from funded publications, our results reveal this is more the exception than the rule.
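The percentages reported above follow directly from the counts given in the text. As a quick check, the arithmetic can be reproduced with the sketch below; the helper name `percent` and the dictionary labels are illustrative only, while the counts are taken from the paragraph above.

```python
def percent(part: int, total: int) -> float:
    """Return part/total expressed as a percentage."""
    return 100.0 * part / total


# Counts as reported in the text: (numerator, denominator).
reported = {
    "publications with accessible alignments/trees": (1262, 7539),
    "successful author data requests": (61, 375),
    "BEAST studies providing the xml input file": (11, 100),
}

for label, (part, total) in reported.items():
    print(f"{label}: {part}/{total} = {percent(part, total):.1f}%")
```

The first two rates round to 16.7% and 16%, matching the figures quoted in the text.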