DOI: 10.1007/11687238_21
Article

Indexing shared content in information retrieval systems

Published: 26 March 2006

Abstract

Modern document collections often contain groups of documents with overlapping or shared content. However, most information retrieval systems process each document separately, causing shared content to be indexed multiple times. In this paper, we describe a new document representation model where related documents are organized as a tree, allowing shared content to be indexed just once. We show how this representation model can be encoded in an inverted index and we describe algorithms for evaluating free-text queries based on this encoding. We also show how our representation model applies to web, email, and newsgroup search. Finally, we present experimental results showing that our methods can provide a significant reduction in the size of an inverted index as well as in the time to build and query it.
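
The idea of indexing shared content once can be illustrated with a small sketch. The code below is not the paper's encoding or query algorithm, which the abstract does not spell out. It is a minimal illustration that assumes an email-style thread in which each message is a tree node holding only its new text, so that a full document is the text along the path from the root to that node. The class name `SharedContentIndex` and its methods are hypothetical.

```python
from collections import defaultdict


class SharedContentIndex:
    """Toy inverted index over a tree of documents (illustrative only)."""

    def __init__(self):
        self.parent = {}                  # node id -> parent id (None for a root)
        self.postings = defaultdict(set)  # term -> ids of nodes whose OWN text has it

    def add(self, node_id, parent_id, own_text):
        # Index only the text that is new at this node; inherited text
        # is already indexed at an ancestor and is not repeated.
        self.parent[node_id] = parent_id
        for term in own_text.lower().split():
            self.postings[term].add(node_id)

    def _path(self, node_id):
        # Yield the node and all of its ancestors up to the root.
        while node_id is not None:
            yield node_id
            node_id = self.parent[node_id]

    def search(self, query):
        # Conjunctive free-text query: a document matches if every term
        # occurs somewhere on its root-to-node path.
        terms = query.lower().split()
        if not terms:
            return []
        hits = []
        for doc in self.parent:
            path = set(self._path(doc))
            if all(self.postings.get(t, set()) & path for t in terms):
                hits.append(doc)
        return sorted(hits)


# Hypothetical thread: m2 and m3 both quote m1 in full.
index = SharedContentIndex()
index.add("m1", None, "budget meeting friday")
index.add("m2", "m1", "moved to monday")
index.add("m3", "m1", "friday works")

print(index.search("budget monday"))  # only m2 contains both terms
print(index.search("budget"))         # m1 and every reply that quotes it
```

In this sketch the term "budget" has a single posting (for `m1`) even though it is part of three documents. The naive scan over all nodes in `search` is chosen for clarity and makes no attempt to reflect the query-time savings reported in the abstract.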


    Published In

    EDBT'06: Proceedings of the 10th international conference on Advances in Database Technology
    March 2006
    1204 pages
    ISBN: 3540329609
    Editors: Yannis Ioannidis, Marc H. Scholl, Joachim W. Schmidt, Florian Matthes, Mike Hatzopoulos

    Publisher

    Springer-Verlag, Berlin, Heidelberg

    Cited By

    • (2022) Search and Discovery in Personal Email Collections. Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining, pp. 1617-1619. doi:10.1145/3488560.3501393. Online publication date: 11-Feb-2022
    • (2016) Hybrid Indexing for Versioned Document Search with Cluster-based Retrieval. Proceedings of the 25th ACM International on Conference on Information and Knowledge Management, pp. 377-386. doi:10.1145/2983323.2983733. Online publication date: 24-Oct-2016
    • (2016) Temporal Information Retrieval. Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval, pp. 1235-1238. doi:10.1145/2911451.2914805. Online publication date: 7-Jul-2016
    • (2012) Optimizing positional index structures for versioned document collections. Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval, pp. 245-254. doi:10.1145/2348283.2348319. Online publication date: 12-Aug-2012
    • (2011) Indexes for highly repetitive document collections. Proceedings of the 20th ACM international conference on Information and knowledge management, pp. 463-468. doi:10.1145/2063576.2063646. Online publication date: 24-Oct-2011
    • (2011) Faster temporal range queries over versioned text. Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval, pp. 565-574. doi:10.1145/2009916.2009993. Online publication date: 24-Jul-2011
    • (2010) Improved index compression techniques for versioned document collections. Proceedings of the 19th ACM international conference on Information and knowledge management, pp. 1239-1248. doi:10.1145/1871437.1871594. Online publication date: 26-Oct-2010
    • (2010) Durable top-k search in document archives. Proceedings of the 2010 ACM SIGMOD International Conference on Management of data, pp. 555-566. doi:10.1145/1807167.1807228. Online publication date: 6-Jun-2010
    • (2009) Compact full-text indexing of versioned document collections. Proceedings of the 18th ACM conference on Information and knowledge management, pp. 415-424. doi:10.1145/1645953.1646008. Online publication date: 2-Nov-2009
    • (2009) Inverted index compression and query processing with optimized document ordering. Proceedings of the 18th international conference on World wide web, pp. 401-410. doi:10.1145/1526709.1526764. Online publication date: 20-Apr-2009
