1 Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, New Jersey, United States of America, 2 Department of Molecular Biology, Princeton University, Princeton, New Jersey, United States of America, 3 Department of Genome Sciences, University of Washington, Seattle, Washington, United States of America, 4 Department of Computer Science, Princeton University, Princeton, New Jersey, United States of America
Abstract
Genome-scale datasets have been used extensively in model organisms to screen for specific candidates or to predict functions for uncharacterized genes. However, despite the availability of extensive knowledge in model organisms, the planning of genome-scale experiments in poorly studied species is still based on the intuition of experts or heuristic trials. We propose that computational and systematic approaches can be applied to drive the experiment planning process in poorly studied species based on available data and knowledge in closely related model organisms. In this paper, we suggest a computational strategy for recommending genome-scale experiments based on their capability to interrogate diverse biological processes to enable protein function assignment. To this end, we use the data-rich functional genomics compendium of the model organism to quantify the accuracy of each dataset in predicting each specific biological process and the overlap in such coverage between different datasets. Our approach uses an optimized combination of these quantifications to recommend an ordered list of experiments for accurately annotating most proteins in the poorly studied related organisms to most biological processes, as well as a set of experiments that target each specific biological process. The effectiveness of this experiment-planning system is demonstrated for two related yeast species: the model organism Saccharomyces cerevisiae and the comparatively poorly studied Saccharomyces bayanus. Our system recommended a set of S. bayanus experiments based on an S. cerevisiae microarray data compendium. In silico evaluations estimate that less than 10% of the experiments could achieve similar functional coverage to the whole microarray compendium. This estimation was confirmed by performing the recommended experiments in S. bayanus, thereby significantly reducing the labor devoted to characterizing the poorly studied genome.
This experiment-planning framework could readily be adapted to the design of other types of large-scale experiments as well as other groups of organisms.
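The idea of combining per-process accuracy with between-dataset overlap can be illustrated with a minimal sketch. This is not the optimization used in the paper, whose details are not given here. It is a simple greedy heuristic chosen for illustration, and the function name `rank_experiments`, the accuracy scores, and the dataset and process labels are all hypothetical. At each step it selects the dataset that most improves the best accuracy achieved so far across processes, so a dataset that is accurate but redundant with earlier selections contributes little.

```python
from typing import Dict, List


def rank_experiments(accuracy: Dict[str, Dict[str, float]]) -> List[str]:
    """Greedily order datasets by marginal gain in per-process accuracy.

    `accuracy[dataset][process]` is a hypothetical score in [0, 1] for how
    well a dataset predicts a biological process. This is an illustrative
    heuristic, not the method described in the paper.
    """
    processes = sorted({p for scores in accuracy.values() for p in scores})
    best = {p: 0.0 for p in processes}  # best accuracy reached so far
    remaining = set(accuracy)
    order: List[str] = []

    while remaining:
        def gain(dataset: str) -> float:
            # Improvement over what the already selected datasets provide.
            return sum(
                max(0.0, accuracy[dataset].get(p, 0.0) - best[p])
                for p in processes
            )

        # Sort first so that ties are broken deterministically by name.
        choice = max(sorted(remaining), key=gain)
        if gain(choice) <= 0.0:
            break  # every remaining dataset is redundant
        order.append(choice)
        remaining.remove(choice)
        for p in processes:
            best[p] = max(best[p], accuracy[choice].get(p, 0.0))

    return order


# Made-up scores for three hypothetical treatments.
toy = {
    "heat_shock": {"stress response": 0.9, "protein folding": 0.8},
    "osmotic_stress": {"stress response": 0.85, "protein folding": 0.1},
    "cell_cycle": {"cell cycle": 0.9, "DNA replication": 0.7},
}
print(rank_experiments(toy))
```

In this toy example, `osmotic_stress` is accurate on its own but adds nothing once `heat_shock` has been selected, so it is left out of the recommended list. This mirrors the role that overlap between datasets plays in the strategy summarized above.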
Author Summary
Microarray expression experiments allow fast functional profiling of an organism's entire genome, and significant efforts are devoted to analyzing the resulting data. The number of available genome sequences is also increasing quickly. However, how to use available functional genomics data to direct large-scale experiments in newly sequenced but poorly studied species remains unexplored. In this paper, we propose a strategy to systematically plan experimental treatments in the poorly studied species based on their model organism relatives. We consider both the accuracy of the datasets in capturing different biological processes and the redundancy between datasets. Quantifying this information allows us to recommend a list of experimental treatments. We demonstrate the efficacy of this approach by designing, performing and evaluating S. bayanus microarray experiments using an available S. cerevisiae data repository. We show that this systematic planning process could reduce the labor of performing microarray experiments 10-fold while achieving similar functional coverage.
Citation: Guan Y, Dunham M, Caudy A, Troyanskaya O (2010) Systematic Planning of Genome-Scale Experiments in Poorly Studied Species. PLoS Comput Biol 6(3): e1000698. doi:10.1371/journal.pcbi.1000698
Editor: David B. Searls, Philadelphia, United States of America
Received: August 3, 2009; Accepted: January 30, 2010; Published: March 5, 2010
Copyright: © 2010 Guan et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Funding: This research was partially supported by NIH grant R01 GM071966, NSF CAREER award DBI-0546275, and NSF grant IIS-0513552. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
* E-mail: ogt@cs.princeton.edu (OT); acaudy@Princeton.edu (AC); maitreya@u.washington.edu (MD)