DOI: 10.1145/1277741.1277857
Article

Web text retrieval with a P2P query-driven index

Published: 23 July 2007

Abstract

In this paper, we present a query-driven indexing/retrieval strategy for efficient full text retrieval from large document collections distributed within a structured P2P network. Our indexing strategy is based on two important properties: (1) the generated distributed index stores posting lists for carefully chosen indexing term combinations, and (2) the posting lists containing too many document references are truncated to a bounded number of their top-ranked elements. These two properties guarantee acceptable storage and bandwidth requirements, essentially because the number of indexing term combinations remains scalable and the transmitted posting lists never exceed a constant size. However, as the number of generated term combinations can still become quite large, we also use term statistics extracted from available query logs to index only such combinations that are frequently present in user queries. Thus, by avoiding the generation of superfluous indexing term combinations, we achieve an additional substantial reduction in bandwidth and storage consumption. As a result, the generated distributed index corresponds to a constantly evolving query-driven indexing structure that efficiently follows current information needs of the users. More precisely, our theoretical analysis and experimental results indicate that, at the price of a marginal loss in retrieval quality for rare queries, the generated index size and network traffic remain manageable even for web-size document collections. Furthermore, our experiments show that at the same time the achieved retrieval quality is fully comparable to the one obtained with a state-of-the-art centralized query engine.
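
The two index properties and the query-log filter described in the abstract can be illustrated with a small, single-machine sketch. It is not the authors' implementation: the function names, the thresholds (`min_query_freq`, `max_key_size`, `max_postings`), and the toy term-frequency score are all assumptions made for illustration, and the distribution of keys over a structured P2P network is left out entirely.

```python
from collections import Counter
from itertools import combinations


def popular_keys(query_log, min_query_freq=2, max_key_size=2):
    """Return the term combinations that occur often enough in the query log.

    Both thresholds are hypothetical parameters, not values from the paper.
    """
    counts = Counter()
    for query in query_log:
        terms = sorted(set(query.split()))
        for size in range(1, max_key_size + 1):
            for combo in combinations(terms, size):
                counts[combo] += 1
    return {key for key, n in counts.items() if n >= min_query_freq}


def build_index(docs, keys, max_postings=2):
    """Build one truncated posting list per key.

    A document matches a key if it contains every term of the key.
    The score is a toy sum of term frequencies (an assumption; a real
    system would use a proper ranking function).
    """
    index = {}
    for key in keys:
        scored = []
        for doc_id, text in docs.items():
            words = text.split()
            if all(term in words for term in key):
                score = sum(words.count(term) for term in key)
                scored.append((score, doc_id))
        # Highest score first; ties are broken by document id.
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        # Truncate to a bounded number of top-ranked documents.
        index[key] = [doc_id for _, doc_id in scored[:max_postings]]
    return index


# Toy data, invented for this example.
docs = {
    "d1": "peer to peer text retrieval",
    "d2": "peer index peer network peer",
    "d3": "text index retrieval",
}
query_log = ["peer index", "peer index", "text retrieval", "network"]

keys = popular_keys(query_log)
index = build_index(docs, keys)
```

In this sketch, combinations that appear only once in the log (such as "text retrieval") never receive a posting list, which mirrors the idea of not generating superfluous keys. Lowering `max_postings` shortens every stored list, which bounds the amount of data sent per key lookup.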

Published In

SIGIR '07: Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval
July 2007
946 pages
ISBN: 9781595935977
DOI: 10.1145/1277741

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. DHT
  2. IR
  3. P2P
  4. TREC
  5. precision
  6. query-driven indexing
  7. text retrieval

Qualifiers

  • Article

Conference

SIGIR07: The 30th Annual International SIGIR Conference
July 23 - 27, 2007
Amsterdam, The Netherlands

Acceptance Rates

Overall Acceptance Rate 792 of 3,983 submissions, 20%
