Indexing Word Sequences for Ranked Retrieval

Published: 01 January 2014


Formulating and processing phrases and other term dependencies to improve query effectiveness is an important problem in information retrieval. However, accessing word-sequence statistics using inverted indexes requires unreasonable processing time or substantial space overhead. Establishing a balance between these competing space and time trade-offs can dramatically improve system performance.
In this article, we present and analyze a new index structure designed to improve query efficiency in dependency retrieval models. By adapting a class of (ε, δ)-approximation algorithms originally proposed for sketch summarization in networking applications, we show how to accurately estimate statistics important in term-dependency models with low, probabilistically bounded error rates. The space requirement for the vocabulary of the index is only logarithmically linked to the size of the vocabulary.
Empirically, we show that the sketch index can reduce the space requirements of the vocabulary component of an index of n-grams consisting of between 1 and 4 words extracted from the GOV2 collection to less than 0.01% of the space requirements of the vocabulary of a full index. We also show that larger n-gram queries can be processed considerably more efficiently than in current alternatives, such as positional and next-word indexes.
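
To make the idea of an (ε, δ)-approximate sketch concrete, here is a minimal Python illustration using a Count-Min sketch, one well-known member of that class. The abstract does not name the specific algorithm, so treating Count-Min as the underlying structure is an assumption, and this is not the authors' index. The names `CountMinSketch` and `ngrams`, the use of BLAKE2b as the hash family, and the choice to count collection-level n-gram frequencies are all hypothetical choices made for illustration. With width ⌈e/ε⌉ and depth ⌈ln(1/δ)⌉, a Count-Min estimate never falls below the true count and exceeds it by more than ε times the total number of insertions with probability at most δ.

```python
import hashlib
import math


class CountMinSketch:
    """Fixed-size counter table that may over-estimate, but never under-estimates."""

    def __init__(self, epsilon: float, delta: float) -> None:
        if not (0 < epsilon < 1 and 0 < delta < 1):
            raise ValueError("epsilon and delta must lie in (0, 1)")
        # Standard Count-Min sizing: width from epsilon, depth from delta.
        self.width = math.ceil(math.e / epsilon)
        self.depth = math.ceil(math.log(1.0 / delta))
        self.rows = [[0] * self.width for _ in range(self.depth)]

    def _cells(self, item: str):
        # One salted hash per row (illustrative choice of hash family).
        for row in range(self.depth):
            digest = hashlib.blake2b(
                item.encode("utf-8"),
                digest_size=8,
                salt=row.to_bytes(8, "little"),
            ).digest()
            yield row, int.from_bytes(digest, "little") % self.width

    def add(self, item: str, count: int = 1) -> None:
        for row, col in self._cells(item):
            self.rows[row][col] += count

    def estimate(self, item: str) -> int:
        # Collisions only add to counters, so the row minimum is the tightest bound.
        return min(self.rows[row][col] for row, col in self._cells(item))


def ngrams(tokens, max_n=4):
    """Yield every word n-gram of length 1..max_n as a space-joined string."""
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])


sketch = CountMinSketch(epsilon=0.001, delta=0.01)
tokens = "to be or not to be".split()
for gram in ngrams(tokens):
    sketch.add(gram)

# "to be" occurs twice in the toy text; the estimate can only be >= 2.
print(sketch.estimate("to be"))
```

The table size depends only on ε and δ, not on how many distinct n-grams are inserted, which is the property that lets a sketch stand in for an explicit n-gram vocabulary.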


Published In

ACM Transactions on Information Systems, Volume 32, Issue 1
January 2014
Publication History

Published: 01 January 2014
Accepted: 01 October 2013
Revised: 01 June 2013
Received: 01 January 2013


Author Tags

  1. sketching
  2. indexing
  3. scalability
  4. term-dependency models


Cited By

  • (2023) Promoting Document Relevance Using Query Term Proximity for Exploratory Search. International Journal of Information Retrieval Research 13:1 (1-22). DOI: 10.4018/IJIRR.325072. Online publication date: 11-Jul-2023
  • (2022) Comparison of text preprocessing methods. Natural Language Engineering (1-45). DOI: 10.1017/S1351324922000213. Online publication date: 13-Jun-2022
  • (2019) Should one Use Term Proximity or Multi-Word Terms for Arabic Information Retrieval? Computer Speech & Language. DOI: 10.1016/j.csl.2019.04.002. Online publication date: Apr-2019
  • (2018) Interactive Sports Analytics. ACM Transactions on Computer-Human Interaction 25:2 (1-32). DOI: 10.1145/3185596. Online publication date: 11-Apr-2018
  • (2017) Efficient Cost-Aware Cascade Ranking in Multi-Stage Retrieval. Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval (445-454). DOI: 10.1145/3077136.3080819. Online publication date: 7-Aug-2017
  • (2017) IoT-Based Big Data Storage Systems in Cloud Computing: Perspectives and Challenges. IEEE Internet of Things Journal 4:1 (75-87). DOI: 10.1109/JIOT.2016.2619369. Online publication date: Feb-2017
  • (2016) Efficient and Effective Higher Order Proximity Modeling. Proceedings of the 2016 ACM International Conference on the Theory of Information Retrieval (21-30). DOI: 10.1145/2970398.2970404. Online publication date: 12-Sep-2016
  • (2016) Chalkboarding. Proceedings of the 21st International Conference on Intelligent User Interfaces (336-347). DOI: 10.1145/2856767.2856772. Online publication date: 7-Mar-2016
  • (2016) Performance analysis of the method for social search of information in university information systems. 2016 Third International Conference on Artificial Intelligence and Pattern Recognition (AIPR) (1-5). DOI: 10.1109/ICAIPR.2016.7585228. Online publication date: Sep-2016
  • (2015) Term Dependence Statistical Measures for Information Retrieval Tasks. Advances in Artificial Intelligence and Soft Computing (83-94). DOI: 10.1007/978-3-319-27060-9_7. Online publication date: 30-Dec-2015

