research-article

Efficient Index-Based Snippet Generation

Published: 01 April 2014

    Abstract

    Ranked result lists with query-dependent snippets have become state of the art in text search. They are typically implemented by searching, at query time, for occurrences of the query words in the top-ranked documents. This document-based approach has three inherent problems: (i) when a document is indexed by terms which it does not contain literally (e.g., related words or spelling variants), localization of the corresponding snippets becomes problematic; (ii) each query operator (e.g., phrase or proximity search) has to be implemented twice, on the index side in order to compute the correct result set, and on the snippet-generation side to generate the appropriate snippets; and (iii) in the worst case, the whole document needs to be scanned for occurrences of the query words, which can be problematic for very long documents.
    We present a new index-based method that localizes snippets by information solely computed from the index and that overcomes all three problems. Unlike previous index-based methods, we show how to achieve this at essentially no extra cost in query processing time, by a technique we call operator inversion. We also show how our index-based method allows the caching of individual segments instead of complete documents, which enables a significantly higher cache hit ratio than the document-based approach. We have fully integrated our implementation with the CompleteSearch engine.
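
    The contrast drawn in the abstract can be illustrated with a small toy sketch. It is not the paper's operator inversion technique or the CompleteSearch implementation; it only shows, under simplifying assumptions, how positions stored in an index can localize snippet segments without scanning the document, and how a cache can be keyed by segment rather than by document. The corpus `DOCS`, the sentence-level segmentation, and all function names are hypothetical.

```python
from collections import defaultdict
from functools import lru_cache

# Hypothetical toy corpus: each document is pre-split into segments (sentences).
DOCS = {
    1: ["Snippets help users judge results.",
        "An index stores word positions.",
        "Caching segments saves disk reads."],
    2: ["Long documents are slow to scan.",
        "Positions in the index point to segments."],
}

def tokenize(text):
    """Lowercase whitespace tokenizer that strips trailing punctuation."""
    return [w.strip(".,;:!?").lower() for w in text.split()]

def build_positional_index(docs):
    """Map term -> doc id -> list of segment numbers that contain the term."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, segments in docs.items():
        for seg_no, segment in enumerate(segments):
            for term in set(tokenize(segment)):
                index[term][doc_id].append(seg_no)
    return index

def document_based_snippets(doc_id, query_terms):
    """Baseline: scan every segment of the document for the query words."""
    wanted = set(query_terms)
    return [seg for seg in DOCS[doc_id] if wanted & set(tokenize(seg))]

@lru_cache(maxsize=1024)
def fetch_segment(doc_id, seg_no):
    """Stand-in for a disk read; the cache is keyed per segment, not per document."""
    return DOCS[doc_id][seg_no]

def index_based_snippets(index, doc_id, query_terms):
    """Localize snippet segments from index positions alone, then fetch only those."""
    seg_nos = sorted({s for t in query_terms
                      for s in index.get(t, {}).get(doc_id, [])})
    return [fetch_segment(doc_id, s) for s in seg_nos]

index = build_positional_index(DOCS)
snippets = index_based_snippets(index, 1, ["index", "segments"])
```

    In this sketch both approaches return the same segments for a plain keyword query; the difference is that the index-based variant never touches segments that contain no query word, and repeated queries hit the per-segment cache in `fetch_segment`.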

    Published In

    ACM Transactions on Information Systems, Volume 32, Issue 2
    April 2014
    131 pages
    ISSN:1046-8188
    EISSN:1558-2868
    DOI:10.1145/2610992

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 01 April 2014
    Accepted: 01 December 2013
    Revised: 01 December 2013
    Received: 01 August 2012
    Published in TOIS Volume 32, Issue 2

    Author Tags

    1. Snippets
    2. advanced search
    3. caching
    4. document summarization
    5. efficiency

    Qualifiers

    • Research-article
    • Research
    • Refereed

    Cited By

    • (2023) A Lightweight Constrained Generation Alternative for Query-focused Summarization. Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, 1745-1749. DOI: 10.1145/3539618.3591936. Online publication date: 19-Jul-2023
    • (2021) A Method for Solving Quasi-Identifiers of Single Structured Relational Data. IEEE Access 9, 166293-166302. DOI: 10.1109/ACCESS.2021.3135946. Online publication date: 2021
    • (2019) To index or not to index: Time-space trade-offs for positional ranking functions in search engines. Information Systems, 101466. DOI: 10.1016/j.is.2019.101466. Online publication date: Nov-2019
    • (2018) Pseudo Descriptions for Meta-Data Retrieval. Proceedings of the 2018 ACM SIGIR International Conference on Theory of Information Retrieval, 139-146. DOI: 10.1145/3234944.3234957. Online publication date: 10-Sep-2018
    • (2017) A Study of Snippet Length and Informativeness. Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, 135-144. DOI: 10.1145/3077136.3080824. Online publication date: 7-Aug-2017
    • (2016) NBLucene: Flexible and Efficient Open Source Search Engine. Web-Age Information Management, 504-516. DOI: 10.1007/978-3-319-39937-9_39. Online publication date: 28-May-2016
