Abstract
Finding approximate overlaps is the first phase of many sequence assembly methods. Given a set of r strings of total length n and an error-rate ε, the goal is to find, for all-pairs of strings, their suffix/prefix matches (overlaps) that are within edit distance k = ⌈εℓ⌉, where ℓ is the length of the overlap. We propose new solutions for this problem based on backward backtracking (Lam et al. 2008) and suffix filters (Kärkkäinen and Na, 2008). Techniques use nH k + o(nlogσ) + rlogr bits of space, where H k is the k-th order entropy and σ the alphabet size. In practice, methods are easy to parallelize and scale up to millions of DNA reads.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Preview
Unable to display preview. Download preview PDF.
Similar content being viewed by others
References
Burrows, M., Wheeler, D.: A block sorting lossless data compression algorithm. Technical Report 124, Digital Equipment Corporation (1994)
Roche Company. 454 life sciences, http://www.454.com/
Simpson, J.T., et al.: Abyss: A parallel assembler for short read sequence data. Genome Res. 19, 1117–1123 (2009)
Morin, R.D., et al.: Profiling the hela s3 transcriptome using randomly primed cdna and massively parallel short-read sequencing. BioTechniques 45(1), 81–94 (2008)
Li, R., et al.: Soap2. Bioinformatics 25(15), 1966–1967 (2009)
Wicker, T., et al.: 454 sequencing put to the test using the complex genome of barley. BMC Genomics 7(1), 275 (2006)
Ferragina, P., Manzini, G.: Indexing compressed texts. Journal of the ACM 52(4), 552–581 (2005)
Ferragina, P., Manzini, G., Mäkinen, V., Navarro, G.: Compressed representations of sequences and full-text indexes. ACM Transactions on Algorithms (TALG) 3(2), article 20 (2007)
Gusfield, D.: Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology. Cambridge University Press, Cambridge (1997)
Hyyrö, H., Navarro, G.: Bit-parallel witnesses and their applications to approximate string matching. Algorithmica 41(3), 203–231 (2005)
Kärkkäinen, J., Na, J.C.: Faster filters for approximate string matching. In: Proc. ALENEX 2007, pp. 84–90. SIAM, Philadelphia (2007)
Kececioglu, J.D., Myers, E.W.: Combinatorial algorithms for dna sequence assembly. Algorithmica 13, 7–51 (1995)
Lam, T.W., Sung, W.K., Tam, S.L., Wong, C.K., Yiu, S.M.: Compressed indexing and local alignment of dna. Bioinformatics 24(6), 791–797 (2008)
Langmead, B., Trapnell, C., Pop, M., Salzberg, S.L.: Ultrafast and memory-efficient alignment of short dna sequences to the human genome. Genome Biology 10(3), R25 (2009)
Levenshtein, V.I.: Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady 10(8), 707–710 (1966)
Li, H., Durbin, R.: Fast and accurate short read alignment with burrows-wheeler transform. Bioinformatics (2009), Advance access
Mäkinen, V., Välimäki, N., Laaksonen, A., Katainen, R.: Unifying view of backward backtracking in short read mapping. In: Elomaa, T., Mannila, H., Orponen, P. (eds.) LNCS Festschrifts. Springer, Heidelberg (to appear 2010)
Mäkinen, V., Navarro, G.: Dynamic entropy-compressed sequences and full-text indexes. ACM Transactions on Algorithms 4(3) (2008)
Manber, U., Myers, G.: Suffix arrays: a new method for on-line string searches. SIAM Journal on Computing 22(5), 935–948 (1993)
Myers, G.: A fast bit-vector algorithm for approximate string matching based on dynamic programming. J. ACM 46(3), 395–415 (1999)
Navarro, G.: A guided tour to approximate string matching. ACM Comput. Surveys 33(1), 31–88 (2001)
Navarro, G., Mäkinen, V.: Compressed full-text indexes. ACM Computing Surveys 39(1), article 2 (2007)
Pevzner, P., Tang, H., Waterman, M.: An eulerian path approach to dna fragment assembly. Proc. Natl. Acad. Sci. 98(17), 9748–9753 (2001)
Pop, M., Salzberg, S.L.: Bioinformatics challenges of new sequencing technology. Trends Genet. 24, 142–149 (2008)
Salmela, L.: Personal communication (2010)
Sellers, P.: The theory and computation of evolutionary distances: Pattern recognition. Journal of Algorithms 1(4), 359–373 (1980)
Wang, Z., Gerstein, M., Snyder, M.: Rna-seq: a revolutionary tool for transcriptomics. Nature Reviews Genetics 10(1), 57–63 (2009)
Weiner, P.: Linear pattern matching algorithm. In: Proc. 14th Annual IEEE Symposium on Switching and Automata Theory, pp. 1–11 (1973)
Zerbino, D.R., Birney, E.: Velvet: algorithms for de novo short read assembly using de bruijn graphs. Genome Research 18(5), 821–829 (2008)
Author information
Authors and Affiliations
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2010 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Välimäki, N., Ladra, S., Mäkinen, V. (2010). Approximate All-Pairs Suffix/Prefix Overlaps. In: Amir, A., Parida, L. (eds) Combinatorial Pattern Matching. CPM 2010. Lecture Notes in Computer Science, vol 6129. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-13509-5_8
Download citation
DOI: https://doi.org/10.1007/978-3-642-13509-5_8
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-13508-8
Online ISBN: 978-3-642-13509-5
eBook Packages: Computer ScienceComputer Science (R0)