Fast Label Extraction in the CDAWG

Belazzougui, Djamal; Cunial, Fabio

doi:10.1007/978-3-319-67428-5_14

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 10508)

Included in the following conference series:

International Symposium on String Processing and Information Retrieval

Abstract

The compact directed acyclic word graph (CDAWG) of a string $T$ of length $n$ takes space proportional just to the number $e$ of right extensions of the maximal repeats of $T$, and it is thus an appealing index for highly repetitive datasets, like collections of genomes from similar species, in which $e$ grows significantly more slowly than $n$. We reduce from $O(m \log\log n)$ to $O(m)$ the time needed to count the number of occurrences of a pattern of length $m$, using an existing data structure that takes an amount of space proportional to the size of the CDAWG. This implies a reduction from $O(m \log\log n + \mathtt{occ})$ to $O(m + \mathtt{occ})$ in the time needed to locate all the $\mathtt{occ}$ occurrences of the pattern. We also reduce from $O(k \log\log n)$ to $O(k)$ the time needed to read the $k$ characters of the label of an edge of the suffix tree of $T$, and we reduce from $O(m \log\log n)$ to $O(m)$ the time needed to compute the matching statistics between a query of length $m$ and $T$, using an existing representation of the suffix tree based on the CDAWG. All such improvements derive from extracting the label of a vertex or of an arc of the CDAWG using a straight-line program induced by the reversed CDAWG.
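
To make the last sentence concrete, here is a minimal sketch of what reading characters from a straight-line program (SLP) can look like. It is not the construction of the paper: the class names `Leaf` and `Concat`, the helper `concat`, the function `extract_prefix`, and the toy grammar are all assumptions introduced for illustration. An SLP is a grammar in which every rule is either a single character or the concatenation of two earlier rules; the sketch reads the first $k$ characters of the string a rule generates without expanding the whole string. This plain traversal visits a number of nodes proportional to $k$ plus the height of the grammar, which is weaker than the $O(k)$ bound stated in the abstract.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Leaf:
    """Hypothetical terminal rule: generates exactly one character."""
    char: str
    length: int = 1


@dataclass(frozen=True)
class Concat:
    """Hypothetical binary rule: generates left's string followed by right's."""
    left: "Node"
    right: "Node"
    length: int


Node = Union[Leaf, Concat]


def concat(left: Node, right: Node) -> Concat:
    """Build a binary rule and cache the length of the string it generates."""
    return Concat(left, right, left.length + right.length)


def extract_prefix(root: Node, k: int) -> str:
    """Return the first min(k, root.length) characters generated by root.

    Uses an explicit stack for a left-to-right traversal and stops as soon
    as k characters have been emitted, so the rest is never expanded.
    """
    out = []
    stack = [root]
    while stack and len(out) < k:
        node = stack.pop()
        if isinstance(node, Leaf):
            out.append(node.char)
        else:
            # Push right first so that left is processed first.
            stack.append(node.right)
            stack.append(node.left)
    return "".join(out)


# Toy grammar (an assumption, unrelated to any CDAWG): x5 generates "ababab".
x1 = Leaf("a")
x2 = Leaf("b")
x3 = concat(x1, x2)   # "ab"
x4 = concat(x3, x3)   # "abab"
x5 = concat(x4, x3)   # "ababab"

print(extract_prefix(x5, 3))  # prints "aba"
```

Because rules are shared (`x3` is reused three times), the grammar can be much smaller than the string it generates, which is the property that makes such a representation attractive for repetitive texts.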

References

1. Belazzougui, D., Cunial, F.: Representing the suffix tree with the CDAWG. In: CPM 2017. Leibniz International Proceedings in Informatics (LIPIcs), vol. 78, pp. 7:1–7:13. Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik (2017)
2. Belazzougui, D., Cunial, F., Gagie, T., Prezza, N., Raffinot, M.: Composite repetition-aware data structures. In: Cicalese, F., Porat, E., Vaccaro, U. (eds.) CPM 2015. LNCS, vol. 9133, pp. 26–39. Springer, Cham (2015). doi:10.1007/978-3-319-19929-0_3
3. Bender, M.A., Farach-Colton, M.: The level ancestor problem simplified. Theor. Comput. Sci. 321(1), 5–12 (2004)
4. Berkman, O., Vishkin, U.: Finding level-ancestors in trees. J. Comput. Syst. Sci. 48(2), 214–230 (1994)
5. Blumer, A., Blumer, J., Haussler, D., McConnell, R., Ehrenfeucht, A.: Complete inverted files for efficient text retrieval and analysis. J. ACM 34(3), 578–595 (1987)
6. Crochemore, M., Epifanio, C., Grossi, R., Mignosi, F.: Linear-size suffix tries. Theor. Comput. Sci. 638, 171–178 (2016)
7. Crochemore, M., Hancart, C.: Automata for matching patterns. In: Rozenberg, G., Salomaa, A. (eds.) Handbook of Formal Languages, pp. 399–462. Springer, Heidelberg (1997). doi:10.1007/978-3-662-07675-0_9
8. Crochemore, M., Mignosi, F., Restivo, A.: Automata and forbidden words. Inf. Proc. Lett. 67(3), 111–117 (1998)
9. Crochemore, M., Vérin, R.: Direct construction of compact directed acyclic word graphs. In: Apostolico, A., Hein, J. (eds.) CPM 1997. LNCS, vol. 1264, pp. 116–129. Springer, Heidelberg (1997). doi:10.1007/3-540-63220-4_55
10. Gagie, T.: Large alphabets and incompressibility. Inf. Proc. Lett. 99(6), 246–251 (2006)
11. Gasieniec, L., Kolpakov, R.M., Potapov, I., Sant, P.: Real-time traversal in grammar-based compressed files. In: DCC 2005, p. 458 (2005)
12. Gasieniec, L., Potapov, I.: Time/space efficient compressed pattern matching. Fundam. Informaticae 56(1–2), 137–154 (2003)
13. Gusfield, D.: Algorithms on strings, trees and sequences: computer science and computational biology. Cambridge University Press, New York (1997)
14. Lohrey, M., Maneth, S., Reh, C.P.: Traversing grammar-compressed trees with constant delay. In: DCC 2016, pp. 546–555 (2016)
15. Russo, L.S., Navarro, G., Oliveira, A.L.: Fully-compressed suffix trees. ACM Trans. Algorithms 7(4), 53 (2011)
16. Mäkinen, V., Navarro, G.: Succinct suffix arrays based on run-length encoding. In: Apostolico, A., Crochemore, M., Park, K. (eds.) CPM 2005. LNCS, vol. 3537, pp. 45–56. Springer, Heidelberg (2005). doi:10.1007/11496656_5
17. Mäkinen, V., Navarro, G., Sirén, J., Välimäki, N.: Storage and retrieval of highly repetitive sequence collections. J. Comput. Biol. 17(3), 281–308 (2010)
18. Navarro, G., Russo, L.M.: Fast fully-compressed suffix trees. In: DCC 2014, pp. 283–291. IEEE (2014)
19. Raffinot, M.: On maximal repeats in strings. Inf. Proc. Lett. 80(3), 165–169 (2001)
20. Sirén, J., Välimäki, N., Mäkinen, V., Navarro, G.: Run-length compressed indexes are superior for highly repetitive sequence collections. In: Amir, A., Turpin, A., Moffat, A. (eds.) SPIRE 2008. LNCS, vol. 5280, pp. 164–175. Springer, Heidelberg (2008). doi:10.1007/978-3-540-89097-3_17
21. Takagi, T., Goto, K., Fujishige, Y., Inenaga, S., Arimura, H.: Linear-size CDAWG: new repetition-aware indexing and grammar compression. In: SPIRE (2017, to appear). arXiv:1705.09779

Acknowledgements

We thank the anonymous reviewers for simplifying some parts of the paper, for improving its overall clarity, and for suggesting references [11, 12, 14] and the current version of Lemma 13.

Author information

Authors and Affiliations

DTISI-CERIST, 16306, Algiers, Algeria
Djamal Belazzougui
MPI-CBG, Pfotenhauerstr. 108, 01307, Dresden, Germany
Fabio Cunial

Corresponding author

Correspondence to Fabio Cunial.

Editor information

Editors and Affiliations

Università di Palermo, Palermo, Italy
Gabriele Fici
Università di Palermo, Palermo, Italy
Marinella Sciortino
Università di Pisa, Pisa, Italy
Rossano Venturini

About this paper

Cite this paper

Belazzougui, D., Cunial, F. (2017). Fast Label Extraction in the CDAWG. In: Fici, G., Sciortino, M., Venturini, R. (eds) String Processing and Information Retrieval. SPIRE 2017. Lecture Notes in Computer Science, vol 10508. Springer, Cham. https://doi.org/10.1007/978-3-319-67428-5_14

DOI: https://doi.org/10.1007/978-3-319-67428-5_14
Published: 06 September 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-67427-8
Online ISBN: 978-3-319-67428-5
eBook Packages: Computer Science; Computer Science (R0)
