Efficient index-based snippet generation

H Bast, M Celikik - ACM Transactions on Information Systems (TOIS), 2014 - dl.acm.org
Ranked result lists with query-dependent snippets have become state of the art in text search. They are typically implemented by searching, at query time, for occurrences of the query words in the top-ranked documents. This document-based approach has three inherent problems: (i) when a document is indexed by terms which it does not contain literally (e.g., related words or spelling variants), localization of the corresponding snippets becomes problematic; (ii) each query operator (e.g., phrase or proximity search) has to be implemented twice, once on the index side to compute the correct result set and once on the snippet-generation side to generate the appropriate snippets; and (iii) in the worst case, the whole document needs to be scanned for occurrences of the query words, which can be problematic for very long documents.
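To make the document-based approach concrete, the following is a minimal Python sketch, not code from the paper. The function name `document_based_snippets` and the `window` parameter are hypothetical, chosen only for illustration. It scans the full document text for literal matches, which is exactly where problems (i) and (iii) come from.

```python
import re


def document_based_snippets(doc_text, query_words, window=30):
    """Scan the whole document for literal occurrences of each query word.

    Illustrative baseline only: returns one text window per literal match.
    """
    snippets = []
    for word in query_words:
        # Whole-word, case-insensitive match of the literal query word.
        pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
        # finditer walks the entire text, so the cost grows with document length.
        for match in pattern.finditer(doc_text):
            start = max(0, match.start() - window)
            end = min(len(doc_text), match.end() + window)
            snippets.append(doc_text[start:end])
    return snippets
```

A document that was indexed under "color" but literally contains only "colour" yields no snippet for the query word "color" here, because the scan sees only the literal text and knows nothing about the terms the index assigned to the document.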
We present a new index-based method that localizes snippets using only information computed from the index and that overcomes all three problems. Unlike previous index-based methods, we show how to achieve this at essentially no extra cost in query processing time, by a technique we call operator inversion. We also show how our index-based method allows the caching of individual segments instead of complete documents, which enables a significantly higher cache hit ratio than the document-based approach. We have fully integrated our implementation with the CompleteSearch engine.
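The general idea of localizing snippets from the index and caching individual segments can be sketched as follows. This is a toy illustration under stated assumptions, not the operator-inversion technique itself, whose details the abstract does not give. The corpus `SEGMENTS`, the `variants` mapping, and the functions `build_index`, `fetch_segment`, and `index_based_snippets` are all hypothetical names introduced for this example.

```python
from collections import defaultdict
from functools import lru_cache

# Hypothetical toy corpus, pre-split into segments keyed by (doc_id, segment_id).
SEGMENTS = {
    (1, 0): "Colour names vary between dialects.",
    (1, 1): "The sky is blue today.",
    (2, 0): "Proximity search needs word positions.",
}


def build_index(segments, variants=None):
    """Map each index term to the set of (doc_id, segment_id) keys where it applies."""
    variants = variants or {}
    index = defaultdict(set)
    for key, text in segments.items():
        for raw in text.lower().split():
            token = raw.strip(".,;:")
            index[token].add(key)
            # Also index the segment under related terms (e.g. spelling variants)
            # that do not occur literally in the text.
            for alias in variants.get(token, ()):
                index[alias].add(key)
    return index


@lru_cache(maxsize=1024)
def fetch_segment(doc_id, segment_id):
    """Load one segment; the cache holds segments rather than whole documents."""
    return SEGMENTS[(doc_id, segment_id)]


def index_based_snippets(index, query_words, doc_id):
    """Locate snippet segments for one document from index information only."""
    hits = set()
    for word in query_words:
        # index.get does not create missing keys on a defaultdict.
        hits |= {key for key in index.get(word.lower(), ()) if key[0] == doc_id}
    # Only the matching segments are fetched; the document is never scanned.
    return [fetch_segment(*key) for key in sorted(hits)]
```

Because the segment locations come from the index, a query for "color" can retrieve a segment that literally contains only "colour", provided the index was built with that variant. Caching at segment granularity means a repeated request for a popular segment is served from the cache without loading the rest of its document.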