Abstract
Recently, Peres and Shields introduced a new method for estimating the order of a stationary fixed-order Markov chain [15]. They showed that the estimator is consistent by proving a threshold result. While this threshold is valid asymptotically, it is of limited use for DNA sequence analysis, where data sizes are moderate. In this paper we give a novel interpretation of the Peres-Shields estimator as a sharp transition phenomenon. This yields a precise and powerful estimator that quickly identifies the core dependencies in data. We show that it compares favorably to other estimators, especially in the presence of noise and/or variable dependencies. Motivated by this last point, we extend the Peres-Shields estimator to variable length Markov chains. We give an application to the problem of detecting DNA sequence similarity using genomic signatures.
Abbreviations: Mk = Fixed order Markov model of order k, PST = Prediction suffix tree, MC = Markov chain, VLMC = Variable length Markov chain.
References
Akaike, H.: A new look at the statistical model identification. IEEE Trans. Auto. Cont. 19, 716–723 (1974)
Bejerano, G., Yona, G.: Variations on probabilistic suffix trees: statistical modeling and prediction of protein families. Bioinformatics 17(1), 23–43 (2001)
Borodovsky, M., McIninch, J.: Recognition of genes in DNA sequence with ambiguities. Biosystems 30, 161–171 (1993)
Bühlmann, P., Wyner, A.: Variable length Markov chains. Ann. Statist. 27(2), 480–513 (1999)
Bühlmann, P., Wyner, A.: Model selection for variable length Markov chains and tuning the context algorithm. Annals of the Inst. of Stat. Math. 52(2), 287–315 (2000)
Csiszár, I., Shields, P.: The Consistency of the BIC Markov Order Estimator. The Annals of Statistics 28(6), 1601–1619 (2000)
Dalevi, D., Dubhashi, D.: Bayesian Classifiers for Detecting HGT Using Fixed and Variable Length Markov Chains (submitted)
Durbin, R., Eddy, S., Krogh, A., Mitchison, G.: Biological Sequence Analysis. Cambridge University Press, Cambridge (2004)
Ellrott, K., Yang, C., Sladek, F., Jiang, T.: Identifying transcription factor binding sites through Markov chain optimization. Bioinformatics 18(2), 100–109 (2002)
Fan, T.-H., Tsai, C.: A Bayesian Method in Determining the Order of a Finite State Markov Chain. Comm. Statist. Theory and Methods 28(7), 1711–1730 (1999)
Forsdyke, D.: Different Biological Species “Broadcast” Their DNAs at Different (G+C)% “Wavelengths”. J. Theor. Biol. 178, 405–417 (1996)
Karlin, S., Burge, C.: Dinucleotide relative abundance extremes: a genomic signature. Trends Genet 11(7), 283–290 (1995)
Mächler, M., Bühlmann, P.: Variable Length Markov Chains: Methodology, Computing, and Software. J Comp Graph Stat 13(2), 435–455 (2004)
McDiarmid, C.: Concentration. In: Habib, M., McDiarmid, C., Ramirez-Alfonsin, J., Reed, B. (eds.) Probabilistic Methods for Algorithmic Discrete Mathematics. Algorithms and Combinatorics, vol. 16, pp. 195–248. Springer, Berlin (1998)
Peres, Y., Shields, P.: Two New Markov Order Estimators (to appear), see http://www.math.utoledo.edu/~pshields/latex.html
Pride, D., Meinersmann, R., Wassenaar, T., Blaser, M.: Evolutionary implications of microbial genome tetranucleotide frequency biases. Genome Res. 13, 145–158 (2003)
Ron, D., Singer, Y., Tishby, N.: The Power of Amnesia: Learning Probabilistic Automata with Variable Memory Length. Machine Learning 25(2-3), 117–149 (1996)
Sandberg, R., Winberg, G., Branden, C.I., Kaske, A., Ernberg, I., Coster, J.: Capturing whole-genome characteristics in short sequences using a naïve Bayesian classifier. Genome Res. 11(8), 1404–1409 (2001)
Schwarz, G.: Estimating the dimension of a model. Annals of Statistics 6, 461–464 (1978)
Zhao, X., Huang, H., Speed, T.: Finding Short DNA motifs using Permuted Markov models. In: RECOMB, pp. 68–75 (2004)
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Dalevi, D., Dubhashi, D. (2005). The Peres-Shields Order Estimator for Fixed and Variable Length Markov Models with Applications to DNA Sequence Similarity. In: Casadio, R., Myers, G. (eds) Algorithms in Bioinformatics. WABI 2005. Lecture Notes in Computer Science, vol 3692. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11557067_24
DOI: https://doi.org/10.1007/11557067_24
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-29008-7
Online ISBN: 978-3-540-31812-5
eBook Packages: Computer Science (R0)