Abstract
The compact directed acyclic word graph (CDAWG) of a string T of length n takes space proportional just to the number e of right extensions of the maximal repeats of T, and it is thus an appealing index for highly repetitive datasets, like collections of genomes from similar species, in which e grows significantly more slowly than n. We reduce from \(O(m\log {\log {n}})\) to \(O(m)\) the time needed to count the number of occurrences of a pattern of length m, using an existing data structure that takes an amount of space proportional to the size of the CDAWG. This implies a reduction from \(O(m\log {\log {n}}+\mathtt {occ})\) to \(O(m+\mathtt {occ})\) in the time needed to locate all the \(\mathtt {occ}\) occurrences of the pattern. We also reduce from \(O(k\log {\log {n}})\) to \(O(k)\) the time needed to read the k characters of the label of an edge of the suffix tree of T, and we reduce from \(O(m\log {\log {n}})\) to \(O(m)\) the time needed to compute the matching statistics between a query of length m and T, using an existing representation of the suffix tree based on the CDAWG. All such improvements derive from extracting the label of a vertex or of an arc of the CDAWG using a straight-line program induced by the reversed CDAWG.
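To make the notion of a straight-line program concrete, the following is a minimal Python sketch of a grammar in which every rule is either a single character or the concatenation of two other symbols, together with a routine that reads the first k characters of a symbol's expansion. The grammar (a Fibonacci-word grammar), the rule names, and the function `extract_prefix` are hypothetical and chosen only for illustration; they are not the grammar induced by the reversed CDAWG. This naive left-first descent costs time proportional to k plus the height of the grammar, so it does not by itself achieve the \(O(k)\) bound stated above, which requires additional machinery not shown here.

```python
from typing import Dict, Tuple, Union

# A rule is either one terminal character or a pair of symbol names
# (left, right) whose expansions are concatenated.
Rule = Union[str, Tuple[str, str]]


def extract_prefix(rules: Dict[str, Rule], start: str, k: int) -> str:
    """Return the first k characters of the expansion of `start`.

    Naive sketch: an explicit stack replaces recursion, and the descent
    stops as soon as k characters have been emitted.
    """
    out = []
    stack = [start]  # symbols still to expand, leftmost on top
    while stack and len(out) < k:
        rule = rules[stack.pop()]
        if isinstance(rule, str):
            out.append(rule)      # terminal rule: emit one character
        else:
            left, right = rule
            stack.append(right)   # right child is expanded later
            stack.append(left)    # left child is expanded first
    return "".join(out)


# Hypothetical grammar: F6 expands to the Fibonacci word "abaababa".
rules: Dict[str, Rule] = {
    "F1": "b",
    "F2": "a",
    "F3": ("F2", "F1"),
    "F4": ("F3", "F2"),
    "F5": ("F4", "F3"),
    "F6": ("F5", "F4"),
}

prefix = extract_prefix(rules, "F6", 5)
full = extract_prefix(rules, "F6", 100)
```

Here `prefix` holds the first five characters of the expansion of `F6`, and asking for more characters than the expansion contains simply returns the whole string. The six rules describe a string of length eight; in general a straight-line program can be much smaller than the string it generates when that string is repetitive.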
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Belazzougui, D., Cunial, F. (2017). Fast Label Extraction in the CDAWG. In: Fici, G., Sciortino, M., Venturini, R. (eds) String Processing and Information Retrieval. SPIRE 2017. Lecture Notes in Computer Science, vol 10508. Springer, Cham. https://doi.org/10.1007/978-3-319-67428-5_14
DOI: https://doi.org/10.1007/978-3-319-67428-5_14
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-67427-8
Online ISBN: 978-3-319-67428-5
eBook Packages: Computer Science (R0)