A framework for variation discovery and genotyping using next-generation DNA sequencing data

DePristo, Mark A; Banks, Eric; Poplin, Ryan; Garimella, Kiran V; Maguire, Jared R; Hartl, Christopher; Philippakis, Anthony A; del Angel, Guillermo; Rivas, Manuel A; Hanna, Matt; McKenna, Aaron; Fennell, Tim J; Kernytsky, Andrew M; Sivachenko, Andrey Y; Cibulskis, Kristian; Gabriel, Stacey B; Altshuler, David; Daly, Mark J

doi:10.1038/ng.806

Technical Report
Published: 10 April 2011


Nature Genetics volume 43, pages 491–498 (2011)

Abstract

Recent advances in sequencing technology make it possible to comprehensively catalog genetic variation in population samples, creating a foundation for understanding human disease, ancestry and evolution. The amounts of raw data produced are prodigious, and many computational steps are required to translate this output into high-quality variant calls. We present a unified analytic framework to discover and genotype variation among multiple samples simultaneously that achieves sensitive and specific results across five sequencing technologies and three distinct, canonical experimental designs. Our process includes (i) initial read mapping; (ii) local realignment around indels; (iii) base quality score recalibration; (iv) SNP discovery and genotyping to find all potential variants; and (v) machine learning to separate true segregating variation from machine artifacts common to next-generation sequencing technologies. We here discuss the application of these tools, instantiated in the Genome Analysis Toolkit, to deep whole-genome, whole-exome capture and multi-sample low-pass (∼4×) 1000 Genomes Project datasets.
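
To make step (iii) more concrete, the following is a minimal sketch of the idea behind base quality score recalibration: reported Phred-scaled qualities are compared with the mismatch rate actually observed for bases in the same bin, and the empirical rate is converted back to the Phred scale. It is an illustration only, not the Genome Analysis Toolkit implementation. The function names, the add-one smoothing and the toy counts are assumptions introduced here, and binning by reported quality alone is a simplification.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Convert a Phred-scaled quality score Q to an error probability, p = 10^(-Q/10)."""
    return 10.0 ** (-q / 10.0)


def error_prob_to_phred(p: float) -> float:
    """Convert an error probability p to a Phred-scaled quality score, Q = -10 * log10(p)."""
    return -10.0 * math.log10(p)


def empirical_quality(mismatches: int, observations: int) -> float:
    """Empirical Phred quality for one bin of bases.

    Assumption: add-one smoothing (mismatches + 1) / (observations + 2)
    keeps the estimate finite when a bin has no mismatches or no data.
    """
    p = (mismatches + 1) / (observations + 2)
    return error_prob_to_phred(p)


# Hypothetical toy counts, keyed by reported quality:
# (mismatches to the reference at presumed non-variant sites, total bases observed).
toy_bins = {
    30: (50, 10_000),   # reported Q30, but about 0.5% of bases mismatch
    20: (100, 10_000),  # reported Q20, and about 1% of bases mismatch
}

for reported_q, (mismatches, observations) in toy_bins.items():
    recalibrated_q = empirical_quality(mismatches, observations)
    print(f"reported Q{reported_q} -> empirical Q{recalibrated_q:.1f}")
```

In this toy example the bin reported as Q30 has an observed mismatch rate closer to Q23, so its reported qualities would be considered over-confident, while the Q20 bin is roughly well calibrated.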

**Figure 1: Framework for variation discovery and genotyping from next-generation DNA sequencing.**

Figure 2: Integrative genomics viewer (IGV) visualization of alignments in region chr.1: 1,510,530–1,510,589 from the Trio NA12878 Illumina reads from the 1000 Genomes Project (a) and NA12878 HiSeq reads before (left) and after (right) multiple sequence realignment (b).

Figure 3: Raw (pink) and recalibrated (blue) base quality scores for NGS paired-end read sets of NA12878 of Illumina/GA (a), Roche/454 (b) and Life/SOLiD (c) lanes from the 1000 Genomes Project and Illumina/HiSeq (d).

**Figure 4: Results of variant quality recalibration on HiSeq, exome and low-pass data sets.**

**Figure 5: Variation discovered among 60 individuals from the CEPH population from the 1000 Genomes Project pilot phase plus low-pass NA12878.**

**Figure 6: Sensitivity and specificity of multi-sample discovery of variation in NA12878 with increasing cohort size for low-pass NA12878 read sets processed with N additional CEPH samples.**

Acknowledgements

Many thanks to our colleagues in Medical and Population Genetics and Cancer Informatics and the 1000 Genomes Project who encouraged and supported us during the development of the Genome Analysis Toolkit and associated tools. This work was supported by grants from the National Human Genome Research Institute, including the Large Scale Sequencing and Analysis of Genomes grant (54 HG003067) and the Joint SNP and CNV calling in 1000 Genomes sequence data grant (U01 HG005208). We would also like to thank our excellent anonymous reviewers for their thoughtful comments.

Author information

Authors and Affiliations

Program in Medical and Population Genetics, Broad Institute of Harvard and MIT, Cambridge, Massachusetts, USA
Mark A DePristo, Eric Banks, Ryan Poplin, Kiran V Garimella, Jared R Maguire, Christopher Hartl, Anthony A Philippakis, Guillermo del Angel, Manuel A Rivas, Matt Hanna, Aaron McKenna, Tim J Fennell, Andrew M Kernytsky, Andrey Y Sivachenko, Kristian Cibulskis, Stacey B Gabriel, David Altshuler & Mark J Daly
Brigham and Women's Hospital, Boston, Massachusetts, USA
Anthony A Philippakis
Harvard Medical School, Boston, Massachusetts, USA
Anthony A Philippakis, David Altshuler & Mark J Daly
Center for Human Genetic Research, Massachusetts General Hospital, Richard B. Simches Research Center, Boston, Massachusetts, USA
Manuel A Rivas, David Altshuler & Mark J Daly

Contributions

M.A.D., E.B., R.P., K.V.G., J.R.M., C.H., A.A.P., G.d.A., M.A.R., T.J.F., A.Y.S. and K.C. conceived of, implemented and performed analytic approaches. M.A.D., E.B., R.P., K.V.G., G.d.A., A.M.K. and M.J.D. wrote the manuscript. M.A.D., M.H. and A.M. developed Picard and GATK infrastructure underlying the tools implemented here. M.A.D., S.B.G., D.A. and M.J.D. led the team.

Corresponding author

Correspondence to Mark A DePristo.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Supplementary information

Supplementary Text and Figures

Supplementary Figure 1, Supplementary Tables 1–7 and Supplementary Note (PDF 806 kb)

About this article

Cite this article

DePristo, M., Banks, E., Poplin, R. et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet 43, 491–498 (2011). https://doi.org/10.1038/ng.806

Received: 27 August 2010
Accepted: 17 March 2011
Published: 10 April 2011
Issue Date: May 2011
DOI: https://doi.org/10.1038/ng.806
