Fast Label Extraction in the CDAWG

  • Conference paper
String Processing and Information Retrieval (SPIRE 2017)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 10508)

Abstract

The compact directed acyclic word graph (CDAWG) of a string \(T\) of length \(n\) takes space proportional just to the number \(e\) of right extensions of the maximal repeats of \(T\), and it is thus an appealing index for highly repetitive datasets, like collections of genomes from similar species, in which \(e\) grows significantly more slowly than \(n\). We reduce from \(O(m \log\log n)\) to \(O(m)\) the time needed to count the number of occurrences of a pattern of length \(m\), using an existing data structure that takes an amount of space proportional to the size of the CDAWG. This implies a reduction from \(O(m \log\log n + \mathtt{occ})\) to \(O(m + \mathtt{occ})\) in the time needed to locate all the \(\mathtt{occ}\) occurrences of the pattern. We also reduce from \(O(k \log\log n)\) to \(O(k)\) the time needed to read the \(k\) characters of the label of an edge of the suffix tree of \(T\), and we reduce from \(O(m \log\log n)\) to \(O(m)\) the time needed to compute the matching statistics between a query of length \(m\) and \(T\), using an existing representation of the suffix tree based on the CDAWG. All such improvements derive from extracting the label of a vertex or of an arc of the CDAWG using a straight-line program induced by the reversed CDAWG.
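
To make the idea of extracting a label from a straight-line program (SLP) concrete, the following is a minimal Python sketch. It is not the construction used in the paper: the toy grammar `RULES`, the rule encoding, and the function name `extract_prefix` are all assumptions made for illustration. Each rule is either a single terminal character or a pair of rule names, and the first k characters of the expansion are read with an explicit stack. This simple traversal visits a number of rules proportional to k plus the height of the grammar, whereas the O(k) bound stated in the abstract relies on additional machinery not shown here.

```python
from typing import Dict, List, Tuple, Union

# Hypothetical rule encoding: a rule is either one terminal character
# or a pair (left, right) of rule names whose expansions are concatenated.
Rule = Union[str, Tuple[str, str]]


def extract_prefix(rules: Dict[str, Rule], start: str, k: int) -> str:
    """Return the first k characters of the string generated by `start`.

    The grammar is never fully expanded: the traversal stops as soon as
    k characters have been produced.
    """
    out: List[str] = []
    stack: List[str] = [start]
    while stack and len(out) < k:
        symbol = stack.pop()
        rule = rules[symbol]
        if isinstance(rule, str):
            # Terminal rule: emit its character.
            out.append(rule)
        else:
            # Binary rule: push the right child first so that the left
            # child is expanded first.
            left, right = rule
            stack.append(right)
            stack.append(left)
    return "".join(out)


# Toy grammar (an assumption, not taken from the paper):
# S -> X Y, X -> A B, Y -> X A, A -> 'a', B -> 'b', so S generates "ababa".
RULES: Dict[str, Rule] = {
    "A": "a",
    "B": "b",
    "X": ("A", "B"),
    "Y": ("X", "A"),
    "S": ("X", "Y"),
}

if __name__ == "__main__":
    print(extract_prefix(RULES, "S", 3))
```

Using an explicit stack instead of recursion keeps the sketch safe for deep grammars, and stopping once `len(out)` reaches k means the cost depends on the requested prefix rather than on the full length of the expanded string.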

References

  1. Belazzougui, D., Cunial, F.: Representing the suffix tree with the CDAWG. In: CPM 2017. Leibniz International Proceedings in Informatics (LIPIcs), vol. 78, pp. 7:1–7:13. Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik (2017)

  2. Belazzougui, D., Cunial, F., Gagie, T., Prezza, N., Raffinot, M.: Composite repetition-aware data structures. In: Cicalese, F., Porat, E., Vaccaro, U. (eds.) CPM 2015. LNCS, vol. 9133, pp. 26–39. Springer, Cham (2015). doi:10.1007/978-3-319-19929-0_3

  3. Bender, M.A., Farach-Colton, M.: The level ancestor problem simplified. Theor. Comput. Sci. 321(1), 5–12 (2004)

  4. Berkman, O., Vishkin, U.: Finding level-ancestors in trees. J. Comput. Syst. Sci. 48(2), 214–230 (1994)

  5. Blumer, A., Blumer, J., Haussler, D., McConnell, R., Ehrenfeucht, A.: Complete inverted files for efficient text retrieval and analysis. J. ACM 34(3), 578–595 (1987)

  6. Crochemore, M., Epifanio, C., Grossi, R., Mignosi, F.: Linear-size suffix tries. Theor. Comput. Sci. 638, 171–178 (2016)

  7. Crochemore, M., Hancart, C.: Automata for matching patterns. In: Rozenberg, G., Salomaa, A. (eds.) Handbook of Formal Languages, pp. 399–462. Springer, Heidelberg (1997). doi:10.1007/978-3-662-07675-0_9

  8. Crochemore, M., Mignosi, F., Restivo, A.: Automata and forbidden words. Inf. Proc. Lett. 67(3), 111–117 (1998)

  9. Crochemore, M., Vérin, R.: Direct construction of compact directed acyclic word graphs. In: Apostolico, A., Hein, J. (eds.) CPM 1997. LNCS, vol. 1264, pp. 116–129. Springer, Heidelberg (1997). doi:10.1007/3-540-63220-4_55

  10. Gagie, T.: Large alphabets and incompressibility. Inf. Proc. Lett. 99(6), 246–251 (2006)

  11. Gasieniec, L., Kolpakov, R.M., Potapov, I., Sant, P.: Real-time traversal in grammar-based compressed files. In: DCC 2005, p. 458 (2005)

  12. Gasieniec, L., Potapov, I.: Time/space efficient compressed pattern matching. Fundam. Informaticae 56(1–2), 137–154 (2003)

  13. Gusfield, D.: Algorithms on strings, trees and sequences: computer science and computational biology. Cambridge University Press, New York (1997)

  14. Lohrey, M., Maneth, S., Reh, C.P.: Traversing grammar-compressed trees with constant delay. In: DCC 2016, pp. 546–555 (2016)

  15. Russo, L.S., Navarro, G., Oliveira, A.L.: Fully-compressed suffix trees. ACM Trans. Algorithms 7(4), 53 (2011)

  16. Mäkinen, V., Navarro, G.: Succinct suffix arrays based on run-length encoding. In: Apostolico, A., Crochemore, M., Park, K. (eds.) CPM 2005. LNCS, vol. 3537, pp. 45–56. Springer, Heidelberg (2005). doi:10.1007/11496656_5

  17. Mäkinen, V., Navarro, G., Sirén, J., Välimäki, N.: Storage and retrieval of highly repetitive sequence collections. J. Comput. Biol. 17(3), 281–308 (2010)

  18. Navarro, G., Russo, L.M.: Fast fully-compressed suffix trees. In: DCC 2014, pp. 283–291. IEEE (2014)

  19. Raffinot, M.: On maximal repeats in strings. Inf. Proc. Lett. 80(3), 165–169 (2001)

  20. Sirén, J., Välimäki, N., Mäkinen, V., Navarro, G.: Run-length compressed indexes are superior for highly repetitive sequence collections. In: Amir, A., Turpin, A., Moffat, A. (eds.) SPIRE 2008. LNCS, vol. 5280, pp. 164–175. Springer, Heidelberg (2008). doi:10.1007/978-3-540-89097-3_17

  21. Takagi, T., Goto, K., Fujishige, Y., Inenaga, S., Arimura, H.: Linear-size CDAWG: new repetition-aware indexing and grammar compression. In: SPIRE (2017, to appear). arXiv:1705.09779

Acknowledgements

We thank the anonymous reviewers for simplifying some parts of the paper, for improving its overall clarity, and for suggesting references [11, 12, 14] and the current version of Lemma 13.

Author information

Corresponding author

Correspondence to Fabio Cunial.

Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Belazzougui, D., Cunial, F. (2017). Fast Label Extraction in the CDAWG. In: Fici, G., Sciortino, M., Venturini, R. (eds) String Processing and Information Retrieval. SPIRE 2017. Lecture Notes in Computer Science, vol 10508. Springer, Cham. https://doi.org/10.1007/978-3-319-67428-5_14

  • DOI: https://doi.org/10.1007/978-3-319-67428-5_14

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-67427-8

  • Online ISBN: 978-3-319-67428-5

  • eBook Packages: Computer Science, Computer Science (R0)
