
Compressing Inverted Files

Published: 01 January 2003

Abstract

Research into inverted file compression has focused on compression ratio, that is, how small the indexes can be. Compression ratio is considered important for fast interactive searching: it is taken as read that the smaller the index, the faster the search.
The premise that “smaller is better” may not hold. To build truly faster indexes it is often necessary to forfeit compression. For inverted lists consisting of only 128 occurrences, compression may only add overhead. Perhaps the inverted list could be stored in 128 bytes in place of 128 words, but it must still be stored on disk. If the minimum disk sector read size is 512 bytes and the word size is 4 bytes, then the raw postings occupy 128 × 4 = 512 bytes, exactly one sector, so both the compressed and raw postings require one disk seek and one disk sector read. A less efficient compression technique may increase the file size but decrease load/decompress time, thereby increasing throughput.
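The sector arithmetic in this paragraph can be checked with a short sketch. The 512-byte sector and 4-byte word come from the abstract; the function name `sectors_needed` and the one-byte-per-posting compressed size are illustrative assumptions, not the authors' code.

```python
import math

SECTOR_BYTES = 512  # minimum disk read size stated in the abstract
WORD_BYTES = 4      # word size stated in the abstract


def sectors_needed(n_bytes: int, sector_bytes: int = SECTOR_BYTES) -> int:
    """Whole sectors that must be read to fetch n_bytes (hypothetical helper)."""
    return max(1, math.ceil(n_bytes / sector_bytes))


postings = 128
raw_bytes = postings * WORD_BYTES   # 128 words of 4 bytes each
compressed_bytes = postings * 1     # assumed: one byte per posting after compression

# Both sizes fit in a single sector, so the I/O cost is the same.
print(sectors_needed(raw_bytes), sectors_needed(compressed_bytes))
```

Under these assumptions the compressed list saves space on paper but no sector reads, which is the case where decompression time is pure overhead.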
Five compression techniques are examined here: Golomb, Elias gamma, Elias delta, Variable Byte Encoding, and Binary Interpolative Coding. The effects on file size, file seek time, and file read time are all measured, as is decompression time. A quantitative measure of throughput is developed, and the performance of each method is determined.
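To give a feel for two of the techniques named above, here is a minimal sketch of variable byte encoding and Elias gamma coding applied to the gaps between sorted document numbers. It is not the implementation measured in the paper. The byte layout (7 data bits per byte, least significant group first, high bit set on the final byte) and the gamma convention (unary prefix of zeros) are assumptions, since several variants exist, and all function names are hypothetical.

```python
from typing import Iterable, List


def to_gaps(doc_ids: Iterable[int]) -> List[int]:
    """Convert a sorted list of document numbers into differences (d-gaps)."""
    gaps, prev = [], 0
    for d in doc_ids:
        gaps.append(d - prev)
        prev = d
    return gaps


def vbyte_encode(numbers: Iterable[int]) -> bytes:
    """Variable byte encoding: 7 data bits per byte, low-order group first.
    Assumed convention: the high bit marks the last byte of each integer."""
    out = bytearray()
    for n in numbers:
        if n < 0:
            raise ValueError("only non-negative integers can be encoded")
        while n >= 128:
            out.append(n & 0x7F)  # continuation byte, high bit clear
            n >>= 7
        out.append(n | 0x80)      # terminating byte, high bit set
    return bytes(out)


def vbyte_decode(data: bytes) -> List[int]:
    """Inverse of vbyte_encode."""
    numbers, n, shift = [], 0, 0
    for b in data:
        if b & 0x80:              # last byte of this integer
            numbers.append(n | ((b & 0x7F) << shift))
            n, shift = 0, 0
        else:
            n |= b << shift
            shift += 7
    return numbers


def elias_gamma_encode(n: int) -> str:
    """Elias gamma code as a bit string: (len - 1) zeros, then n in binary."""
    if n < 1:
        raise ValueError("Elias gamma encodes integers >= 1")
    binary = bin(n)[2:]
    return "0" * (len(binary) - 1) + binary


doc_ids = [3, 7, 300, 301]
gaps = to_gaps(doc_ids)       # small gaps need fewer bits than raw ids
encoded = vbyte_encode(gaps)
print(gaps, len(encoded), vbyte_decode(encoded))
print([elias_gamma_encode(g) for g in gaps])
```

Variable byte codes stay byte-aligned and decode with simple shifts and masks, while gamma codes are bit-aligned and more compact for small gaps. That trade-off between size and decode cost is the one the abstract sets out to measure.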

Published In

Information Retrieval, Volume 6, Issue 1
Jan 2003
97 pages

Publisher

Kluwer Academic Publishers, United States

Author Tags

1. index compression
2. inverted files
3. document indexing
4. text searching

Qualifiers

• Research-article
