Abstract
A commonly used tool in disease association studies is the search for discrepancies between the haplotype distributions of the case and control populations. To find such a discrepancy, the haplotype frequencies in each population are estimated from the genotypes.
We present a new method, HAPLOFREQ, to estimate haplotype frequencies over a short genomic region, given genotypes or haplotypes that may contain missing data or sequencing errors. Our approach uses a maximum likelihood model based on a simple random generative model, which assumes that the genotypes are sampled independently from the population. We first show that if the phased haplotypes are given, possibly with missing data, we can estimate the frequencies of the haplotypes in the population by finding the global optimum of the likelihood function in polynomial time. If the haplotypes are not phased, finding the maximum value of the likelihood function is NP-hard. In this case we define an alternative likelihood function, which can be thought of as a relaxed likelihood function. We show that the maximum of the relaxed likelihood can be found in polynomial time, and that the optimal solution of the relaxed likelihood asymptotically approaches the haplotype frequencies in the population.
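To make the phased case concrete, here is a minimal sketch under stated assumptions; it is not the HAPLOFREQ algorithm, whose details are not given in this abstract. It assumes biallelic sites encoded as '0' and '1', missing calls marked '?', no sequencing errors, and a region short enough to enumerate every candidate haplotype. Under the independent-sampling model, the log-likelihood is the sum over observations of the log of the total frequency of the haplotypes compatible with each observation, which is concave in the frequencies. The sketch maximizes it with plain EM updates, a swapped-in technique chosen for brevity. The names `compatible`, `estimate_frequencies`, and `log_likelihood` are hypothetical.

```python
from itertools import product
from math import log


def compatible(observed, full):
    """True if `full` matches `observed` at every non-missing site ('?' = missing)."""
    return all(o == "?" or o == f for o, f in zip(observed, full))


def estimate_frequencies(observations, n_sites, iterations=200):
    """EM estimate of haplotype frequencies from phased haplotypes with missing data.

    Illustrative assumption: all 2**n_sites binary haplotypes are candidates,
    so this is only practical for very short regions.
    """
    haplotypes = ["".join(bits) for bits in product("01", repeat=n_sites)]
    freqs = {h: 1.0 / len(haplotypes) for h in haplotypes}
    # Precompute, for each observation, the full haplotypes it could be.
    compat = [[h for h in haplotypes if compatible(obs, h)] for obs in observations]
    for _ in range(iterations):
        counts = dict.fromkeys(haplotypes, 0.0)
        for candidates in compat:
            total = sum(freqs[h] for h in candidates)
            for h in candidates:
                # E-step: share one observation among its compatible haplotypes.
                counts[h] += freqs[h] / total
        # M-step: normalized expected counts become the new frequencies.
        freqs = {h: c / len(observations) for h, c in counts.items()}
    return freqs


def log_likelihood(observations, freqs):
    """Sum over observations of log(total frequency of compatible haplotypes)."""
    return sum(
        log(sum(p for h, p in freqs.items() if compatible(obs, h)))
        for obs in observations
    )


observations = ["00", "00", "11", "1?"]
freqs = estimate_frequencies(observations, n_sites=2)
```

In this toy input the partially observed haplotype `1?` could be `10` or `11`; because `11` is also observed directly, the estimate assigns essentially all of that observation's weight to `11`.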
In contrast to previous approaches, our algorithms are guaranteed to converge in polynomial time to a global maximum of the respective likelihood functions. We compared the performance of our algorithm to that of the widely used program PHASE and found that our estimates are at least 10% more accurate and that our algorithm is about ten times faster.
Our techniques involve new algorithms in convex optimization, which may be of independent interest. In particular, they may be helpful in other maximum likelihood problems arising from survey sampling.
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Halperin, E., Hazan, E. (2005). HAPLOFREQ – Estimating Haplotype Frequencies Efficiently. In: Miyano, S., Mesirov, J., Kasif, S., Istrail, S., Pevzner, P.A., Waterman, M. (eds) Research in Computational Molecular Biology. RECOMB 2005. Lecture Notes in Computer Science, vol 3500. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11415770_42
DOI: https://doi.org/10.1007/11415770_42
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-25866-7
Online ISBN: 978-3-540-31950-4
eBook Packages: Computer Science, Computer Science (R0)