Key Points
- Haplotype phase may be generated through either computational or experimental methods.
- Computational phasing is simple and inexpensive and results in good accuracy for common variants over small regions.
- Computational phasing of closely related individuals (such as parent–offspring trios) results in high accuracy at a high proportion of sites because of the additional information provided by Mendelian constraints.
- Although specialized software for analysing complex relationships is somewhat limited, good results can be obtained by treating the related individuals as if they were unrelated when performing computational phasing.
- A new development in computational phasing of unrelated individuals is the detection and use of segments of identity-by-descent that arise from distant relationships. In their current form, these methods are only suitable for small, isolated populations, but improvements in algorithms may lead to applicability to large samples from outbred populations.
- Experimental phasing has a very high accuracy at a high proportion of sites and can phase de novo or very rare variants without the need to obtain data from closely related individuals.
- Experimental phasing currently adds substantially to the cost of generating the genotype or sequence data (at least doubling the cost) and requires technical expertise, additional preparation time and, in some cases, specialized equipment.
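The role of Mendelian constraints in trio phasing can be illustrated with a minimal sketch. The function name `phase_child` and the encoding of genotypes as alternate-allele counts (0, 1 or 2) at a biallelic site are assumptions made for this illustration and are not taken from any phasing program. The sketch enumerates the alleles that each parent could have transmitted and reports phase only when exactly one transmission is consistent with the child's genotype.

```python
from typing import Optional, Set, Tuple


def transmissible(genotype: int) -> Set[int]:
    """Alleles (0 = reference, 1 = alternate) that a parent with this genotype can pass on."""
    if genotype == 0:
        return {0}
    if genotype == 2:
        return {1}
    return {0, 1}  # heterozygous parent can transmit either allele


def phase_child(child: int, mother: int, father: int) -> Optional[Tuple[int, int]]:
    """Return (maternal_allele, paternal_allele) for the child at one biallelic site.

    Genotypes are alternate-allele counts (hypothetical encoding: 0, 1 or 2).
    Returns None if the trio is Mendelian-inconsistent at this site or if
    phase cannot be determined (child and both parents heterozygous).
    """
    options = [
        (m, f)
        for m in transmissible(mother)
        for f in transmissible(father)
        if m + f == child
    ]
    # Exactly one consistent transmission pattern means the phase is known.
    return options[0] if len(options) == 1 else None


print(phase_child(1, 0, 1))  # alternate allele must come from the father
print(phase_child(1, 1, 1))  # all three heterozygous: phase stays ambiguous
```

Sites at which all three individuals are heterozygous return `None` here; in practice such sites are the ones for which population information is still needed, even when trio data are available.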
Abstract
Determination of haplotype phase is becoming increasingly important as we enter the era of large-scale sequencing because many of its applications, such as imputing low-frequency variants and characterizing the relationship between genetic variation and disease susceptibility, are particularly relevant to sequence data. Haplotype phase can be generated through laboratory-based experimental methods, or it can be estimated using computational approaches. We assess the haplotype phasing methods that are available, focusing in particular on statistical methods, and we discuss the practical aspects of their application. We also describe recent developments that may transform this field, particularly the use of identity-by-descent for computational phasing.
Acknowledgements
This study was supported by the US National Institutes of Health (NIH) awards R01HG005701 and R01HG004960. This study makes use of data generated by the Wellcome Trust Case Control Consortium. A full list of the investigators who contributed to the generation of the data is available from http://www.wtccc.org.uk. Funding for the project was provided by the Wellcome Trust under awards 076113 and 085475. The content of this study is solely the responsibility of the authors and does not necessarily represent the official views of the NIH or the Wellcome Trust.
Ethics declarations
Competing interests
The authors declare no competing financial interests.
Glossary
- Imputation
-
In the context of this article, this is the estimation of missing genotype values by using the genotypes at nearby SNPs and the haplotype frequencies seen in other individuals.
- Calling genotypes
-
Estimating genotype values from raw data. Genotyping technology provides information about the underlying genotype, typically in the form of signal intensities or read counts of the two alleles. Statistical techniques are used to resolve this information into genotype calls. Typically, information across individuals is used, and correlation across SNPs (that is, haplotype phase) is also helpful.
- Identical-by-descent
-
Two haplotypes are identical-by-descent if they are identical copies of a haplotype inherited from a common ancestor.
- Cryptic relatedness
-
The undocumented existence of relatives within a sample.
- Posterior distribution
-
Probabilities that account for the prior information and the information in the data. For haplotype phase estimation, the posterior distribution accounts for all available information, including the genotypes and the estimated haplotype frequencies in the population.
- Expectation maximization algorithm
-
(EM algorithm). An iterative approach for finding the values of the unobserved data (such as haplotype phase) that maximize the statistical likelihood of the observed incomplete data. Although the likelihood increases with each iteration, the approach is not guaranteed to find the global maximum.
- Partition–ligation
-
A divide-and-conquer strategy that is designed to reduce the computational burden for phasing methods that do not scale well with increasing region size. A large region is divided up into smaller regions, and haplotype phase estimates from the smaller regions are used to limit the possibilities when phasing the large region.
- Hidden Markov model
-
(HMM). A mathematically elegant and computationally tractable class of models in which the observed data are generated by an unobserved Markov process. A Markov process is a probabilistic process in which the distribution of future states (for example, states that are further along the chromosome) depends only on the current state and not on previous states.
- Haplotype block
-
A short genomic region within which inter-marker linkage disequilibrium is strong.
- Approximate coalescent
-
The coalescent is a model for the process by which the ancestry of alleles converges when looking back in time. An approximate coalescent is a model that generates patterns of genetic variation that are similar to patterns generated by the coalescent but that is computationally simpler.
- Linkage disequilibrium
-
(LD). Non-independence (correlation) between genetic variants at the population level. In general, LD decreases with genomic distance and is not present between variants on different chromosomes.
- Effective population size
-
The size of a population of randomly mating individuals that would show the same amount of genetic drift as is found in the actual population. The effective population size is usually smaller than the actual population size.
- Compound heterozygosity
-
The presence of two deleterious variants located in the same gene but on different chromosome copies of an individual. It is possible to distinguish between compound heterozygosity and the occurrence of two variants on the same chromosome copy by determining the haplotype phase.
- D′
-
A measure of linkage disequilibrium (LD) between two markers. D′ takes values between 0 and 1. Absence of LD is indicated by 0, and 1 indicates maximum possible LD given the allele frequency of the markers.
- Reference panel
-
A collection of samples that are not of direct interest but that are included in an analysis for the purposes of increasing statistical power or accuracy for the samples of interest. Reference panels are commonly used for genotype imputation and can also be used for haplotype phasing.
- Genotype likelihoods
-
Statistical likelihoods that encapsulate the relative evidence for each possible genotype call.
- Fluorescence-activated cell sorting
-
(FACS). A type of flow cytometry in which individual particles (such as chromosomes) are separated and fluorescence intensities (from earlier staining) are measured.
- Barcode labelling
-
Tagging of each sample with a unique short sequence (barcode) before pooling samples. After sequencing, the sample corresponding to each read can be determined from the barcode.
- Admixed ancestry
-
An individual has admixed ancestry if he or she has recent ancestors deriving from different continental populations.
- Large-insert clones
-
Large haplotype fragments that are inserted into, for example, bacterial artificial chromosomes (BACs).
- Shotgun sequencing
-
A sequencing method in which DNA is randomly sheared into small fragments before being sequenced.
- Fosmid
-
A type of hybrid DNA molecule comprising bacterial DNA and a section of genomic DNA of ~40 kb in length.
- Microfluidics
-
The manipulation of fluids on a very small scale. This approach can be used to separate individual chromosomes before sequencing for experimental phasing.
- Metaphase
-
A stage of mitosis at which chromosomes are highly condensed, facilitating their separation for some experimental phasing methods.
- Paired-end sequencing
-
Sequencing of haplotype fragments from each end. The two sequenced ends are typically separated by a gap.
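Three of the glossary entries above (the expectation maximization algorithm, linkage disequilibrium and D′) can be illustrated together with a small sketch for the textbook case of two biallelic SNPs. The function names `em_haplotype_freqs` and `d_prime`, the genotype encoding (alternate-allele counts 0, 1 or 2) and the fixed iteration count are assumptions made for this illustration; this is not the implementation of any particular phasing program. Only individuals who are heterozygous at both SNPs have ambiguous phase, so the E-step splits them between the two possible haplotype pairs in proportion to the current frequency estimates.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

Hap = Tuple[int, int]  # (allele at SNP 1, allele at SNP 2), each 0 or 1


def em_haplotype_freqs(genotypes: Iterable[Tuple[int, int]], n_iter: int = 100) -> Dict[Hap, float]:
    """Estimate two-SNP haplotype frequencies from unphased genotypes by EM."""
    counts = Counter(genotypes)
    n_hap = 2 * sum(counts.values())
    if n_hap == 0:
        raise ValueError("at least one genotype is required")

    # Haplotype counts from individuals whose phase is unambiguous.
    base: Dict[Hap, float] = {(0, 0): 0.0, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 0.0}
    for (g1, g2), n in counts.items():
        if (g1, g2) == (1, 1):
            continue  # double heterozygotes are handled in the E-step
        base[(g1 // 2, g2 // 2)] += n
        base[(g1 - g1 // 2, g2 - g2 // 2)] += n

    n_double_het = counts[(1, 1)]
    freqs = {h: 0.25 for h in base}  # uninformative starting values
    for _ in range(n_iter):
        # E-step: probability that a double heterozygote carries 00/11 rather than 01/10.
        cis = freqs[(0, 0)] * freqs[(1, 1)]
        trans = freqs[(0, 1)] * freqs[(1, 0)]
        w = cis / (cis + trans) if cis + trans > 0 else 0.5
        # M-step: re-estimate frequencies from expected haplotype counts.
        expected = dict(base)
        expected[(0, 0)] += n_double_het * w
        expected[(1, 1)] += n_double_het * w
        expected[(0, 1)] += n_double_het * (1 - w)
        expected[(1, 0)] += n_double_het * (1 - w)
        freqs = {h: c / n_hap for h, c in expected.items()}
    return freqs


def d_prime(freqs: Dict[Hap, float]) -> float:
    """Absolute D' between the two SNPs, a value between 0 and 1."""
    p = freqs[(1, 0)] + freqs[(1, 1)]  # alternate-allele frequency at SNP 1
    q = freqs[(0, 1)] + freqs[(1, 1)]  # alternate-allele frequency at SNP 2
    d = freqs[(1, 1)] - p * q
    if d >= 0:
        d_max = min(p * (1 - q), (1 - p) * q)
    else:
        d_max = min(p * q, (1 - p) * (1 - q))
    return abs(d) / d_max if d_max > 0 else 0.0


freqs = em_haplotype_freqs([(0, 0), (2, 2), (1, 1)])
print(freqs, d_prime(freqs))
```

As the glossary entry notes, EM is not guaranteed to find the global maximum, so a more careful version would try several starting values rather than the single uniform start assumed here.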
Cite this article
Browning, S., Browning, B. Haplotype phasing: existing methods and new developments. Nat Rev Genet 12, 703–714 (2011). https://doi.org/10.1038/nrg3054
DOI: https://doi.org/10.1038/nrg3054