GAGE: A critical evaluation of genome assemblies and assembly algorithms

Steven L. Salzberg; Adam M. Phillippy; Aleksey Zimin; Daniela Puiu; Tanja Magoc; Sergey Koren; Todd J. Treangen; Michael C. Schatz; Arthur L. Delcher; Michael Roberts; Guillaume Marçais; Mihai Pop; James A. Yorke

doi:10.1101/gr.131383.111

¹McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University School of Medicine, Baltimore, Maryland 21205, USA;
²National Biodefense Analysis and Countermeasures Center, Battelle National Biodefense Institute, Frederick, Maryland 21702, USA;
³Institute for Physical Sciences and Technology, University of Maryland, College Park, Maryland 20742, USA;
⁴Center for Bioinformatics and Computational Biology, University of Maryland, College Park, Maryland 20742, USA;
⁵Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York 11724, USA;
⁶Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, Maryland 21201, USA

Abstract

New sequencing technology has dramatically altered the landscape of whole-genome sequencing, allowing scientists to initiate numerous projects to decode the genomes of previously unsequenced organisms. The lowest-cost technology can generate deep coverage of most species, including mammals, in just a few days. The sequence data generated by one of these projects consist of millions or billions of short DNA sequences (reads) that range from 50 to 150 nt in length. These sequences must then be assembled de novo before most genome analyses can begin. Unfortunately, genome assembly remains a very difficult problem, made more difficult by shorter reads and unreliable long-range linking information. In this study, we evaluated several of the leading de novo assembly algorithms on four different short-read data sets, all generated by Illumina sequencers. Our results describe the relative performance of the different assemblers as well as other significant differences in assembly difficulty that appear to be inherent in the genomes themselves. Three overarching conclusions are apparent: first, that data quality, rather than the assembler itself, has a dramatic effect on the quality of an assembled genome; second, that the degree of contiguity of an assembly varies enormously among different assemblers and different genomes; and third, that the correctness of an assembly also varies widely and is not well correlated with statistics on contiguity. To enable others to replicate our results, all of our data and methods are freely available, as are all assemblers used in this study.

Footnotes

⁷Corresponding author.

E-mail salzberg@jhu.edu.
[Supplemental material is available for this article.]
Article published online before print. Article, supplemental material, and publication date are at http://www.genome.org/cgi/doi/10.1101/gr.131383.111.

Received September 1, 2011.
Accepted November 11, 2011.
