A framework for variation discovery and genotyping using next-generation DNA sequencing data


Recent advances in sequencing technology make it possible to comprehensively catalog genetic variation in population samples, creating a foundation for understanding human disease, ancestry and evolution. The amounts of raw data produced are prodigious, and many computational steps are required to translate this output into high-quality variant calls. We present a unified analytic framework to discover and genotype variation among multiple samples simultaneously that achieves sensitive and specific results across five sequencing technologies and three distinct, canonical experimental designs. Our process includes (i) initial read mapping; (ii) local realignment around indels; (iii) base quality score recalibration; (iv) SNP discovery and genotyping to find all potential variants; and (v) machine learning to separate true segregating variation from machine artifacts common to next-generation sequencing technologies. We here discuss the application of these tools, instantiated in the Genome Analysis Toolkit, to deep whole-genome, whole-exome capture and multi-sample low-pass (∼4×) 1000 Genomes Project datasets.
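Step (iii), base quality score recalibration, rests on Phred-scale arithmetic: a reported quality Q corresponds to an error probability of 10^(-Q/10), and recalibration replaces the reported value with one estimated from observed mismatches. The Python sketch below illustrates only that idea. It is not the Genome Analysis Toolkit implementation: the function names, the add-one smoothing and the toy data are assumptions made for illustration, and a real recalibrator also conditions on covariates such as machine cycle and sequence context.

```python
import math
from collections import defaultdict


def phred_to_error_prob(q):
    """Convert a Phred-scaled quality score to an error probability."""
    return 10 ** (-q / 10.0)


def error_prob_to_phred(p):
    """Convert an error probability to a Phred-scaled quality score."""
    return -10.0 * math.log10(p)


def empirical_qualities(observations):
    """Estimate an empirical quality for each reported quality score.

    `observations` is an iterable of (reported_quality, is_mismatch) pairs,
    one per base at a site assumed not to be truly variant. Add-one
    smoothing (an assumption of this sketch) keeps the estimate finite
    when a bin contains no mismatches.
    """
    counts = defaultdict(lambda: [0, 0])  # reported Q -> [mismatches, total bases]
    for reported_q, is_mismatch in observations:
        counts[reported_q][0] += int(is_mismatch)
        counts[reported_q][1] += 1
    return {
        q: error_prob_to_phred((mismatches + 1) / (total + 2))
        for q, (mismatches, total) in counts.items()
    }


# Hypothetical toy data: 1,000 bases reported as Q30, of which 10 mismatch
# the reference, so the observed error rate is about 1%, not 0.1%.
toy = [(30, i < 10) for i in range(1000)]
table = empirical_qualities(toy)
```

In this toy input the bases reported as Q30 receive an empirical quality a little under Q20, which is the kind of gap between raw and recalibrated scores that Figure 3 visualizes.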

Figure 1: Framework for variation discovery and genotyping from next-generation DNA sequencing.
Figure 2: Integrative genomics viewer (IGV) visualization of alignments in region chr.1: 1,510,530–1,510,589 from the Trio NA12878 Illumina reads from the 1000 Genomes Project (a) and NA12878 HiSeq reads before (left) and after (right) multiple sequence realignment (b).
Figure 3: Raw (pink) and recalibrated (blue) base quality scores for NGS paired-end read sets of NA12878 of Illumina/GA (a), Roche/454 (b) and Life/SOLiD (c) lanes from the 1000 Genomes Project and Illumina/HiSeq (d).
Figure 4: Results of variant quality recalibration on HiSeq, exome and low-pass data sets.
Figure 5: Variation discovered among 60 individuals from the CEPH population from the 1000 Genomes Project pilot phase plus low-pass NA12878.
Figure 6: Sensitivity and specificity of multi-sample discovery of variation in NA12878 with increasing cohort size for low-pass NA12878 read sets processed with N additional CEPH samples.

Acknowledgements

Many thanks to our colleagues in Medical and Population Genetics and Cancer Informatics and the 1000 Genomes Project who encouraged and supported us during the development of the Genome Analysis Toolkit and associated tools. This work was supported by grants from the National Human Genome Research Institute, including the Large Scale Sequencing and Analysis of Genomes grant (54 HG003067) and the Joint SNP and CNV calling in 1000 Genomes sequence data grant (U01 HG005208). We would also like to thank our excellent anonymous reviewers for their thoughtful comments.

Author information



Contributions

M.A.D., E.B., R.P., K.V.G., J.R.M., C.H., A.A.P., G.d.A., M.A.R., T.J.F., A.Y.S. and K.C. conceived of, implemented and performed analytic approaches. M.A.D., E.B., R.P., K.V.G., G.d.A., A.M.K. and M.J.D. wrote the manuscript. M.A.D., M.H. and A.M. developed Picard and GATK infrastructure underlying the tools implemented here. M.A.D., S.B.G., D.A. and M.J.D. led the team.

Corresponding author

Correspondence to Mark A DePristo.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Cite this article

DePristo, M., Banks, E., Poplin, R. et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet 43, 491–498 (2011). https://doi.org/10.1038/ng.806


