
Indexing Word Sequences for Ranked Retrieval

Published: 01 January 2014

Abstract

Formulating and processing phrases and other term dependencies to improve query effectiveness is an important problem in information retrieval. However, accessing word-sequence statistics using inverted indexes requires unreasonable processing time or substantial space overhead. Establishing a balance between these competing space and time trade-offs can dramatically improve system performance.
In this article, we present and analyze a new index structure designed to improve query efficiency in dependency retrieval models. By adapting a class of (ε, δ)-approximation algorithms originally proposed for sketch summarization in networking applications, we show how to accurately estimate statistics important in term-dependency models with low, probabilistically bounded error rates. The space requirement for the vocabulary of the index is only logarithmically linked to the size of the vocabulary.
Empirically, we show that the sketch index can reduce the space requirements of the vocabulary component of an index of n-grams consisting of between 1 and 4 words extracted from the GOV2 collection to less than 0.01% of the space requirements of the vocabulary of a full index. We also show that larger n-gram queries can be processed considerably more efficiently than in current alternatives, such as positional and next-word indexes.
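
As a rough illustration of what an (ε, δ)-approximation sketch looks like, the following minimal Python sketch implements a count-min sketch over word n-grams. It is not the index structure described in the article: the choice of the count-min sketch, the class name `CountMinSketch`, the helper `word_ngrams`, the salted BLAKE2b hashing (a stand-in for a pairwise-independent hash family), and the toy word list are all assumptions made for illustration only.

```python
import hashlib
import math
from collections import Counter


class CountMinSketch:
    """Hypothetical count-min sketch for approximate frequency counts.

    An estimate is never below the true count. Assuming pairwise-independent
    hash functions, it exceeds the true count by more than epsilon * total
    with probability at most delta.
    """

    def __init__(self, epsilon, delta):
        # Table size depends only on the error parameters, not on how many
        # distinct keys are inserted.
        self.width = math.ceil(math.e / epsilon)
        self.depth = math.ceil(math.log(1.0 / delta))
        self.table = [[0] * self.width for _ in range(self.depth)]
        self.total = 0

    def _cells(self, key):
        # One salted hash per row; BLAKE2b is an illustrative stand-in.
        data = key.encode("utf-8")
        for row in range(self.depth):
            salt = row.to_bytes(8, "little")
            digest = hashlib.blake2b(data, digest_size=8, salt=salt).digest()
            yield row, int.from_bytes(digest, "little") % self.width

    def add(self, key, count=1):
        self.total += count
        for row, col in self._cells(key):
            self.table[row][col] += count

    def estimate(self, key):
        # Collisions only inflate counters, so the minimum is the best guess.
        return min(self.table[row][col] for row, col in self._cells(key))


def word_ngrams(words, max_n=4):
    """Yield every n-gram of 1 to max_n consecutive words as a string."""
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            yield " ".join(words[i:i + n])


# Toy data, not drawn from any collection used in the article.
words = "to be or not to be that is the question".split()
sketch = CountMinSketch(epsilon=0.001, delta=0.01)
exact = Counter()
for gram in word_ngrams(words):
    sketch.add(gram)
    exact[gram] += 1

print(sketch.width, sketch.depth, sketch.estimate("to be"), exact["to be"])
```

The point of the example is that the n-gram strings themselves are never stored: only a fixed grid of counters is kept, which is the kind of trade between a small, bounded estimation error and vocabulary space that the abstract refers to.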


Published In

ACM Transactions on Information Systems, Volume 32, Issue 1
January 2014
123 pages
ISSN:1046-8188
EISSN:1558-2868
DOI:10.1145/2576772
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 01 January 2014
Accepted: 01 October 2013
Revised: 01 June 2013
Received: 01 January 2013
Published in TOIS Volume 32, Issue 1

Author Tags

  1. sketching
  2. indexing
  3. scalability
  4. term-dependency models

Qualifiers

  • Research-article
  • Research
  • Refereed

Article Metrics

  • Downloads (Last 12 months): 9
  • Downloads (Last 6 weeks): 0
Reflects downloads up to 01 Sep 2024

Cited By

  • (2023) Promoting Document Relevance Using Query Term Proximity for Exploratory Search. International Journal of Information Retrieval Research 13:1 (1-22). DOI: 10.4018/IJIRR.325072. Online publication date: 11-Jul-2023
  • (2022) Comparison of text preprocessing methods. Natural Language Engineering (1-45). DOI: 10.1017/S1351324922000213. Online publication date: 13-Jun-2022
  • (2019) Should one Use Term Proximity or Multi-Word Terms for Arabic Information Retrieval? Computer Speech & Language. DOI: 10.1016/j.csl.2019.04.002. Online publication date: Apr-2019
  • (2018) Interactive Sports Analytics. ACM Transactions on Computer-Human Interaction 25:2 (1-32). DOI: 10.1145/3185596. Online publication date: 11-Apr-2018
  • (2017) Efficient Cost-Aware Cascade Ranking in Multi-Stage Retrieval. Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval (445-454). DOI: 10.1145/3077136.3080819. Online publication date: 7-Aug-2017
  • (2017) IoT-Based Big Data Storage Systems in Cloud Computing: Perspectives and Challenges. IEEE Internet of Things Journal 4:1 (75-87). DOI: 10.1109/JIOT.2016.2619369. Online publication date: Feb-2017
  • (2016) Efficient and Effective Higher Order Proximity Modeling. Proceedings of the 2016 ACM International Conference on the Theory of Information Retrieval (21-30). DOI: 10.1145/2970398.2970404. Online publication date: 12-Sep-2016
  • (2016) Chalkboarding. Proceedings of the 21st International Conference on Intelligent User Interfaces (336-347). DOI: 10.1145/2856767.2856772. Online publication date: 7-Mar-2016
  • (2016) Performance analysis of the method for social search of information in university information systems. 2016 Third International Conference on Artificial Intelligence and Pattern Recognition (AIPR) (1-5). DOI: 10.1109/ICAIPR.2016.7585228. Online publication date: Sep-2016
  • (2015) Term Dependence Statistical Measures for Information Retrieval Tasks. Advances in Artificial Intelligence and Soft Computing (83-94). DOI: 10.1007/978-3-319-27060-9_7. Online publication date: 30-Dec-2015
