New algorithms on wavelet trees and applications to information retrieval

Published: 01 April 2012

Abstract

Wavelet trees are widely used in the representation of sequences, permutations, text collections, binary relations, discrete points, and other succinct data structures. We show, however, that these uses still fall short of exploiting all the virtues of this versatile data structure. In particular, we show how to use wavelet trees to solve fundamental algorithmic problems such as range quantile queries, range next value queries, and range intersection queries. We explore several applications of these queries in Information Retrieval, in particular document retrieval in hierarchical and temporal documents, and in the representation of inverted lists.
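
To make the first of these queries concrete, the following is a minimal Python sketch of a range quantile query (the k-th smallest value in a range of positions) on a wavelet tree. It is an illustration under stated assumptions, not the paper's implementation: the class name `WaveletTree` and method name `quantile` are hypothetical, the tree is pointer-based, and plain prefix-count lists stand in for the succinct rank-enabled bitmaps a real wavelet tree would use. The input sequence is assumed to be a non-empty list of integers.

```python
class WaveletTree:
    """Pointer-based wavelet tree over integers in [lo, hi] (illustrative only)."""

    def __init__(self, seq, lo=None, hi=None):
        if lo is None:
            lo, hi = min(seq), max(seq)  # assumes seq is non-empty
        self.lo, self.hi = lo, hi
        self.left = self.right = None
        if lo == hi or not seq:
            return  # leaf, or an empty subtree that is never queried
        mid = (lo + hi) // 2
        # zeros[i] = number of values <= mid among the first i positions.
        # This prefix-count list plays the role of rank on a bitmap.
        self.zeros = [0]
        for x in seq:
            self.zeros.append(self.zeros[-1] + (x <= mid))
        self.left = WaveletTree([x for x in seq if x <= mid], lo, mid)
        self.right = WaveletTree([x for x in seq if x > mid], mid + 1, hi)

    def quantile(self, i, j, k):
        """Return the k-th smallest value (k >= 1) in seq[i:j] (half-open)."""
        if self.lo == self.hi:
            return self.lo
        zi, zj = self.zeros[i], self.zeros[j]
        in_left = zj - zi  # how many values of the range go to the left child
        if k <= in_left:
            return self.left.quantile(zi, zj, k)
        # Otherwise skip the in_left smaller values and continue on the right.
        return self.right.quantile(i - zi, j - zj, k - in_left)


seq = [3, 1, 4, 1, 5, 9, 2, 6]
wt = WaveletTree(seq)
print(wt.quantile(2, 6, 2))  # 2nd smallest of [4, 1, 5, 9] -> 4
print(wt.quantile(0, 8, 4))  # 4th smallest of the whole sequence -> 3
```

Each query descends one root-to-leaf path, so its cost is proportional to the height of the tree, which is logarithmic in the size of the value range in this sketch.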

    Publisher

    Elsevier Science Publishers Ltd., United Kingdom

    Author Tags

    1. 1D range queries
    2. Data structures
    3. Document retrieval
    4. Information retrieval
    5. Wavelet trees

    Qualifiers

    • Article

    Cited By

    • (2024) The Ring: Worst-case Optimal Joins in Graph Databases using (Almost) No Extra Space. ACM Transactions on Database Systems 49:2 (1-45). doi:10.1145/3644824. Online publication date: 23-Mar-2024.
    • (2024) Optimizing RPQs over a compact graph representation. The VLDB Journal — The International Journal on Very Large Data Bases 33:2 (349-374). doi:10.1007/s00778-023-00811-2. Online publication date: 1-Mar-2024.
    • (2024) Computing Longest Lyndon Subsequences and Longest Common Lyndon Subsequences. Algorithmica 86:3 (735-756). doi:10.1007/s00453-023-01125-z. Online publication date: 1-Mar-2024.
    • (2022) Improved structures to solve aggregated queries for trips over public transportation networks. Information Sciences: an International Journal 584:C (752-783). doi:10.1016/j.ins.2021.10.079. Online publication date: 1-Jan-2022.
    • (2022) Edge minimization in de Bruijn graphs. Information and Computation 285:PB. doi:10.1016/j.ic.2021.104795. Online publication date: 1-May-2022.
    • (2022) Computing All-vs-All MEMs in Run-Length-Encoded Collections of HiFi Reads. String Processing and Information Retrieval (198-213). doi:10.1007/978-3-031-20643-6_15. Online publication date: 8-Nov-2022.
    • (2020) Tailoring r-index for Document Listing Towards Metagenomics Applications. String Processing and Information Retrieval (291-306). doi:10.1007/978-3-030-59212-7_21. Online publication date: 13-Oct-2020.
    • (2019) Lempel–Ziv compressed structures for document retrieval. Information and Computation 265:C (1-25). doi:10.1016/j.ic.2019.01.006. Online publication date: 1-Apr-2019.
    • (2017) IDF for Word N-grams. ACM Transactions on Information Systems 36:1 (1-38). doi:10.1145/3052775. Online publication date: 5-Jun-2017.
    • (2017) Practical Compact Indexes for Top-k Document Retrieval. ACM Journal of Experimental Algorithmics 22 (1-37). doi:10.1145/3043958. Online publication date: 2-Mar-2017.

    View Options

    View options

    Get Access

    Login options

    Media

    Figures

    Other

    Tables

    Share

    Share

    Share this Publication link

    Share on social media