Unifying the known and unknown microbial coding sequence space
Chiara Vanni, Matthew S Schechter, Silvia G Acinas, Albert Barberán, Pier Luigi Buttigieg, Emilio O Casamayor, Tom O Delmont, Carlos M Duarte, A Murat Eren, Robert D Finn, Renzo Kottmann, Alex Mitchell, Pablo Sánchez, Kimmo Siren, Martin Steinegger, Frank Oliver Gloeckner, Antonio Fernàndez-Guerra (corresponding author)
Microbial Genomics and Bioinformatics Research G, Max Planck Institute for Marine Microbiology, Germany; Jacobs University Bremen, Germany; Department of Medicine, University of Chicago, United States; Department of Marine Biology and Oceanography, Institut de Ciències del Mar (CSIC), Spain; Department of Environmental Science, University of Arizona, United States; Alfred Wegener Institute, Helmholtz Centre for Polar and Marine Research, Alfred Wegener Institute, Germany; Center for Advanced Studies of Blanes CEAB-CSIC, Spanish Council for Research, Spain; Génomique Métabolique, Genoscope, Institut François Jacob, CEA, CNRS, Univ Evry, Université Paris-Saclay, France; Red Sea Research Centre and Computational Bioscience Research Center, King Abdullah University of Science and Technology, Saudi Arabia; Josephine Bay Paul Center, Marine Biological Laboratory, United States; European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, United Kingdom; Section for Evolutionary Genomics, The GLOBE Institute, University of Copenhagen, Denmark; School of Biological Sciences, Seoul National University, Republic of Korea; Institute of Molecular Biology and Genetics, Seoul National University, Republic of Korea; University of Bremen and Life Sciences and Chemistry, Germany; Computing Center, Helmholtz Center for Polar and Marine Research, Germany; Lundbeck Foundation GeoGenetics Centre, GLOBE Institute, University of Copenhagen, Denmark.
Mar 31, 2022
https://doi.org/10.7554/eLife.67667
Abstract
Genes of unknown function are among the biggest challenges in molecular biology, especially in microbial systems, where 40–60% of the predicted genes are unknown. Despite previous attempts, systematic approaches to include the unknown fraction in analytical workflows are still lacking. Here, we present a conceptual framework, its translation into the computational workflow AGNOSTOS, and a demonstration of how we can bridge the known-unknown gap in genomes and metagenomes. By analyzing 415,971,742 genes predicted from 1749 metagenomes and 28,941 bacterial and archaeal genomes, we quantify the extent of the unknown fraction, its diversity, and its relevance across multiple organisms and environments. The unknown sequence space is exceptionally diverse, phylogenetically more conserved than the known fraction and predominantly taxonomically restricted at the species level. From the 71 M genes identified to be of unknown function, we compiled a collection of 283,874 lineage-specific genes of unknown function for Cand. Patescibacteria (also known as Candidate Phyla Radiation, CPR), which provides a significant resource to expand our understanding of their unusual biology. Finally, by identifying a target gene of unknown function for antibiotic resistance, we demonstrate how we can enable the generation of hypotheses that can be used to augment experimental data.
Editor's evaluation
In this paper, the authors develop a sensitive and specific computational workflow for comprehensively summarizing known and unknown gene content across large collections of genomes and metagenomes. In addition to clustering and categorizing genes on a large scale, the authors show how to use their approach to both explore lineage-specific genes and generate hypotheses for the function of unknown genes.
https://doi.org/10.7554/eLife.67667.sa0