Long-read sequence assembly of the gorilla genome
David Gordon1,2,*, John Huddleston1,2,*, Mark J. P. Chaisson1,*, Christopher M. Hill1,*, Zev N. Kronenberg1,*, Katherine M. Munson1, Maika Malig1, Archana Raja1,2, Ian Fiddes3, LaDeana W. Hillier4, Christopher Dunn5, Carl Baker1, Joel Armstrong3, Mark Diekhans3, Benedict Paten3, Jay Shendure1,2, Richard K. Wilson4, David Haussler3, Chen-Shan Chin5, Evan E. Eichler1,2,†
†Corresponding author. E-mail: eee@gs.washington.edu
*These authors contributed equally to this work.
Science 01 Apr 2016: Vol. 352, Issue 6281,
Improving on the gorilla genome
Access to complete, high-quality genomes of nonhuman primates will help us understand human biology. Gordon et al. used long-read sequencing technology to improve genome data on our close relative the gorilla. Sequencing from a single individual decreased assembly fragmentation and recovered previously missed genes and noncoding loci. Mapping short-read sequences from additional gorillas helped reconstruct a “pan” gorilla sequence documenting genetic variation. Comparison with human genomes revealed species-specific differences ranging in length from one to thousands of bases, including some that are likely to affect gene regulation.
Science, this issue p. 10.1126/science.aae0344
Structured Abstract
INTRODUCTION
Accurate sequencing and assembly of genomes are critical to our understanding of evolution and genetic variation. Despite advances in short-read sequencing technology that have decreased cost and increased throughput, whole-genome assembly of mammalian genomes remains problematic because of the presence of repetitive DNA.
RATIONALE
The goal of this study was to sequence and assemble the genome of the western lowland gorilla by using primarily single-molecule, real-time (SMRT) sequencing technology and a novel assembly algorithm that takes advantage of long (>10 kbp) sequence reads. We specifically compare the properties of this assembly to gorilla genome assemblies that were generated by using more routine short sequence read approaches in order to determine the value and biological impact of a long-read genome assembly.
RESULTS
We generated 74.8-fold SMRT whole-genome shotgun sequence from peripheral blood DNA isolated from a western lowland gorilla (Gorilla gorilla gorilla) named Susie. We applied a string graph assembly algorithm, Falcon, and a consensus algorithm, Quiver, to generate a 3.1-Gbp assembly with a contig N50 of 9.6 Mbp. Short-read sequence data from an additional six gorilla genomes were mapped to reduce indel errors and improve the accuracy of the final assembly. We estimate that 98.9% of the gorilla euchromatin has been assembled into 1854 sequence contigs. The assembly represents an improvement in contiguity: >800-fold with respect to the published gorilla genome assembly and >180-fold with respect to a more recently released upgrade of the gorilla assembly. Most of the sequence gaps are now closed, considerably increasing the yield of complete gene models. We estimate that 87% of the missing exons and 94% of the incomplete genes are recovered. We find that the sequence of most full-length common repeats is resolved, with the most significant gains occurring for the longest and most G+C–rich retrotransposons. Although complex regions such as the major histocompatibility locus are accurately sequenced and assembled, both heterochromatin and large, high-identity segmental duplications are not, because read lengths are too short to traverse these repetitive structures. The long-read assembly produces a much finer map of structural variation down to 50 bp in length, facilitating the discovery of thousands of lineage-specific structural variant differences that have occurred since divergence from the human and chimpanzee lineages. This includes the disruption of specific genes and loss of predicted regulatory regions between the two species.
We show that use of the new gorilla genome assembly changes estimates of divergence and diversity, resulting in subtle but substantial effects on previous population genetic inferences, such as the timing of species bottlenecks and changes in the effective population size over the course of evolution.
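The contig N50 quoted in the Results is a standard contiguity statistic: the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the total assembled bases. As a minimal sketch of that definition (not the authors' pipeline), the Python below computes it from a list of contig lengths. The function name `contig_n50` and the toy lengths are assumptions made for illustration and are not taken from the gorilla assembly.

```python
def contig_n50(lengths):
    """Return the N50 of a collection of contig lengths (in bp).

    Standard definition: sort contigs from longest to shortest and
    return the length of the contig at which the running total first
    reaches at least half of the total assembled length.
    """
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("need at least one contig length")
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Hypothetical toy assembly, not real data: 300 bp in seven contigs.
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
print(contig_n50(toy_contigs))  # prints 70 (80 + 70 = 150, half of 300)
```

Because the statistic depends only on the sorted length distribution, closing gaps that merge many short contigs into a few long ones raises N50 sharply even when the total assembled length barely changes.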
CONCLUSION
The genome assembly that results from using the long-read data provides a more complete picture of gene content, structural variation, and repeat biology, improving population genetic and evolutionary inferences. Long-read sequencing technology now makes it practical for individual laboratories to generate high-quality reference genomes for complex mammalian genomes.