DOI: 10.1145/1526709.1526764
research-article

Inverted index compression and query processing with optimized document ordering

Published: 20 April 2009

Abstract

Web search engines use highly optimized compression schemes to decrease inverted index size and improve query throughput, and many index compression techniques have been studied in the literature. One approach taken by several recent studies first renumbers the document IDs in the collection so that similar documents are grouped together, and then applies standard compression techniques. This is known to significantly improve index compression compared to a random document ordering. We study index compression and query processing techniques for such reordered indexes. Previous work has focused on determining the best possible ordering of documents. In contrast, we assume that such an ordering is already given, and focus on how to optimize compression methods and query processing for this case. We perform an extensive study of compression techniques for document IDs and present new optimizations of existing techniques that achieve significant improvements in both compression and decompression performance. We also propose and evaluate techniques for compressing frequency values for this case. Finally, we study the effect of this approach on query processing performance. Our experiments show very significant improvements in index size and query processing speed on the TREC GOV2 collection of 25.2 million web pages.
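To illustrate why grouping similar documents under nearby IDs helps, here is a minimal Python sketch. It is not one of the optimized methods studied in the paper: it assumes plain gap (delta) encoding followed by variable-byte coding as a stand-in for a "standard compression technique", and the posting lists, function names, and ID ranges are all made up for the example.

```python
import random

def gaps(doc_ids):
    """Turn a sorted list of document IDs into gaps between neighbours."""
    out, prev = [], 0
    for d in doc_ids:
        out.append(d - prev)
        prev = d
    return out

def from_gaps(gap_list):
    """Inverse of gaps(): rebuild document IDs by a running sum."""
    out, total = [], 0
    for g in gap_list:
        total += g
        out.append(total)
    return out

def vbyte_encode(numbers):
    """Variable-byte code: 7 data bits per byte, low-order bits first;
    the high bit is set on the last byte of each number."""
    out = bytearray()
    for n in numbers:
        while n >= 128:
            out.append(n & 0x7F)
            n >>= 7
        out.append(n | 0x80)
    return bytes(out)

def vbyte_decode(data):
    """Decode a byte string produced by vbyte_encode()."""
    numbers, n, shift = [], 0, 0
    for b in data:
        if b & 0x80:            # terminator byte: finish this number
            numbers.append(n | ((b & 0x7F) << shift))
            n, shift = 0, 0
        else:                   # continuation byte
            n |= b << shift
            shift += 7
    return numbers

# Hypothetical posting lists for one term, 1000 postings each.
# "clustered": the matching documents received consecutive IDs after reordering.
# "scattered": the same number of documents spread over 1,000,000 IDs.
clustered = list(range(1000, 2000))
rng = random.Random(0)  # fixed seed so the example is repeatable
scattered = sorted(rng.sample(range(1, 1_000_001), 1000))

clustered_bytes = vbyte_encode(gaps(clustered))
scattered_bytes = vbyte_encode(gaps(scattered))

# The first gap (1000) needs 2 bytes and the 999 gaps of 1 need 1 byte each.
print(len(clustered_bytes))                          # 1001
print(len(scattered_bytes) > len(clustered_bytes))   # True
```

The clustered list yields small gaps, which a byte-aligned code cannot shrink below one byte each; bit-oriented or block-based codes can exploit such runs of small gaps further.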

    Published In

    WWW '09: Proceedings of the 18th international conference on World wide web
    April 2009
    1280 pages
    ISBN: 9781605584874
    DOI: 10.1145/1526709

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. IR query processing
    2. document ordering
    3. index compression
    4. inverted index
    5. search engines

    Conference

    WWW '09
    Acceptance Rates

    Overall Acceptance Rate 1,899 of 8,196 submissions, 23%
