Indexing Word Sequences for Ranked Retrieval

Authors:

J. Shane Culpepper,

W. Bruce Croft

ACM Transactions on Information Systems (TOIS), Volume 32, Issue 1

Article No.: 3, Pages 1 - 26

https://doi.org/10.1145/2559168

Published: 01 January 2014

Abstract

Formulating and processing phrases and other term dependencies to improve query effectiveness is an important problem in information retrieval. However, accessing word-sequence statistics using inverted indexes requires unreasonable processing time or substantial space overhead. Establishing a balance between these competing space and time trade-offs can dramatically improve system performance.

In this article, we present and analyze a new index structure designed to improve query efficiency in dependency retrieval models. By adapting a class of (ε, δ)-approximation algorithms originally proposed for sketch summarization in networking applications, we show how to accurately estimate statistics important in term-dependency models with low, probabilistically bounded error rates. The space requirements for the vocabulary of the index depend only logarithmically on the size of the vocabulary.

Empirically, we show that the sketch index can reduce the space requirements of the vocabulary component of an index of n-grams consisting of between 1 and 4 words extracted from the GOV2 collection to less than 0.01% of the space requirements of the vocabulary of a full index. We also show that larger n-gram queries can be processed considerably more efficiently than in current alternatives, such as positional and next-word indexes.
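The abstract does not spell out the index structure itself, so the following is only a minimal sketch of the general idea, assuming a count-min sketch as one representative of the (ε, δ)-approximation family. It estimates n-gram frequencies so that, with probability at least 1 − δ, an estimate exceeds the true count by at most ε times the total number of inserted items, and it never underestimates. The class name `CountMinSketch`, the helper `ngrams`, the hashing scheme, and the parameter values are all illustrative assumptions, not the authors' implementation.

```python
import hashlib
import math


class CountMinSketch:
    """Illustrative count-min sketch (an assumption, not the article's index)."""

    def __init__(self, eps: float, delta: float) -> None:
        self.width = math.ceil(math.e / eps)         # counters per row
        self.depth = math.ceil(math.log(1 / delta))  # number of hash rows
        self.table = [[0] * self.width for _ in range(self.depth)]
        self.total = 0                               # total items inserted

    def _columns(self, key: str):
        # One salted hash per row; the salt makes each row a distinct hash function.
        for row in range(self.depth):
            digest = hashlib.blake2b(
                key.encode("utf-8"),
                digest_size=8,
                salt=row.to_bytes(8, "little"),
            ).digest()
            yield row, int.from_bytes(digest, "little") % self.width

    def add(self, key: str, count: int = 1) -> None:
        self.total += count
        for row, col in self._columns(key):
            self.table[row][col] += count

    def estimate(self, key: str) -> int:
        # Collisions only inflate counters, so the minimum is the tightest estimate.
        return min(self.table[row][col] for row, col in self._columns(key))


def ngrams(tokens, max_n=4):
    """Yield every word n-gram of length 1..max_n as a space-joined string."""
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])


# Hypothetical toy input; the article's experiments use the GOV2 collection.
tokens = "the quick brown fox jumps over the lazy dog the quick brown fox".split()
sketch = CountMinSketch(eps=0.001, delta=0.01)
for gram in ngrams(tokens):
    sketch.add(gram)

print(sketch.estimate("the quick brown"))  # at least the true count of 2
```

In this sketch the table size is fixed by ε and δ alone and no n-gram strings are stored, which is the kind of property that lets the vocabulary component stay small. How the article's structure attaches posting data to the sketch is not described in the abstract and is not modeled here.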

Index Terms

Indexing Word Sequences for Ranked Retrieval
1. Information systems
  1. Information retrieval

Information & Contributors

Information

Published In

ACM Transactions on Information Systems Volume 32, Issue 1

January 2014

123 pages

ISSN:1046-8188

EISSN:1558-2868

DOI:10.1145/2576772

Editor:
Jamie Callan
Carnegie Mellon University, USA

Copyright © 2014 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 01 January 2014

Accepted: 01 October 2013

Revised: 01 June 2013

Received: 01 January 2013

Published in TOIS Volume 32, Issue 1

Qualifiers

Research-article
Research
Refereed

Funding Sources

Australian Research Council
Division of Computer and Network Systems
Center for Intelligent Information Retrieval
Division of Information and Intelligent Systems

Bibliometrics & Citations

Article Metrics

Total Citations: 11
Total Downloads: 548
Downloads (Last 12 months): 9
Downloads (Last 6 weeks): 0

Reflects downloads up to 01 Sep 2024

Cited By

Singh V (2023). Promoting Document Relevance Using Query Term Proximity for Exploratory Search. International Journal of Information Retrieval Research 13:1 (1-22). Online publication date: 11-Jul-2023.
https://dl.acm.org/doi/10.4018/IJIRR.325072
Chai C (2022). Comparison of text preprocessing methods. Natural Language Engineering (1-45). Online publication date: 13-Jun-2022.
https://doi.org/10.1017/S1351324922000213
Mahdaouy A, Gaussier E, Alaoui S (2019). Should one Use Term Proximity or Multi-Word Terms for Arabic Information Retrieval? Computer Speech & Language. Online publication date: Apr-2019.
https://doi.org/10.1016/j.csl.2019.04.002
Sha L, Lucey P, Yue Y, Wei X, Hobbs J, Rohlf C, Sridharan S (2018). Interactive Sports Analytics. ACM Transactions on Computer-Human Interaction 25:2 (1-32). Online publication date: 11-Apr-2018.
https://dl.acm.org/doi/10.1145/3185596
Chen R, Gallagher L, Blanco R, Culpepper J, Kando N, Sakai T, Joho H, Li H, de Vries A, White R (2017). Efficient Cost-Aware Cascade Ranking in Multi-Stage Retrieval. Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval (445-454). Online publication date: 7-Aug-2017.
https://dl.acm.org/doi/10.1145/3077136.3080819
Cai H, Xu B, Jiang L, Vasilakos A (2017). IoT-Based Big Data Storage Systems in Cloud Computing: Perspectives and Challenges. IEEE Internet of Things Journal 4:1 (75-87). Online publication date: Feb-2017.
https://doi.org/10.1109/JIOT.2016.2619369
Lu X, Moffat A, Culpepper J, Carterette B, Fang H, Lalmas M, Nie J (2016). Efficient and Effective Higher Order Proximity Modeling. Proceedings of the 2016 ACM International Conference on the Theory of Information Retrieval (21-30). Online publication date: 12-Sep-2016.
https://dl.acm.org/doi/10.1145/2970398.2970404
Sha L, Lucey P, Yue Y, Carr P, Rohlf C, Matthews I, Nichols J, Mahmud J, O'Donovan J, Conati C, Zancanaro M (2016). Chalkboarding. Proceedings of the 21st International Conference on Intelligent User Interfaces (336-347). Online publication date: 7-Mar-2016.
https://dl.acm.org/doi/10.1145/2856767.2856772
Dimitrov G, Panayotova G, Garvanov I, Os B, Petrov P, Angelov A (2016). Performance analysis of the method for social search of information in university information systems. 2016 Third International Conference on Artificial Intelligence and Pattern Recognition (AIPR) (1-5). Online publication date: Sep-2016.
https://doi.org/10.1109/ICAIPR.2016.7585228
Fernández-Reyes F, Valadez J, Suárez Y (2015). Term Dependence Statistical Measures for Information Retrieval Tasks. Advances in Artificial Intelligence and Soft Computing (83-94). Online publication date: 30-Dec-2015.
https://doi.org/10.1007/978-3-319-27060-9_7
