A General SIMD-Based Approach to Accelerating Compression Algorithms

Published: 23 March 2015

Abstract

Compression algorithms are important for data-oriented tasks, especially in the era of “Big Data.” Modern processors equipped with powerful SIMD instruction sets provide us with an opportunity for achieving better compression performance. Previous research has shown that SIMD-based optimizations can multiply decoding speeds. Following these pioneering studies, we propose a general approach to accelerate compression algorithms. By instantiating the approach, we have developed several novel integer compression algorithms, called Group-Simple, Group-Scheme, Group-AFOR, and Group-PFD, and implemented their corresponding vectorized versions. We evaluate the proposed algorithms on two public TREC datasets, a Wikipedia dataset, and a Twitter dataset. With competitive compression ratios and encoding speeds, our SIMD-based algorithms outperform state-of-the-art nonvectorized algorithms with respect to decoding speeds.
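
As a rough illustration of why group-wise layouts suit SIMD decoding, the sketch below packs integers lane by lane so that every lane shares one bit width and one shift per step. It is not the paper's Group-Simple, Group-Scheme, Group-AFOR, or Group-PFD; the names `pack_group` and `unpack_group`, the four 32-bit lanes, and the single fixed width per call are all assumptions made for this example, and plain Python lists stand in for vector registers.

```python
from typing import List

LANES = 4        # assumption: four 32-bit lanes, as in a 128-bit SIMD register
LANE_BITS = 32   # assumption: each lane is one 32-bit word


def pack_group(values: List[int], width: int) -> List[int]:
    """Pack non-negative integers lane-wise using `width` bits each.

    Within a vector, value number k * LANES + lane is stored in lane `lane`
    at bit offset k * width, so all lanes use the same shift at each step.
    Returns a flat list of 32-bit words, LANES words per vector.
    """
    if not 0 < width <= LANE_BITS:
        raise ValueError("width must be in 1..32")
    if any(v < 0 or v >> width for v in values):
        raise ValueError("value does not fit in width bits")
    per_lane = LANE_BITS // width          # integers stored in one lane
    per_vector = per_lane * LANES          # integers stored in one vector
    padded = values + [0] * (-len(values) % per_vector)
    words: List[int] = []
    for base in range(0, len(padded), per_vector):
        lanes = [0] * LANES
        for k in range(per_lane):
            for lane in range(LANES):
                lanes[lane] |= padded[base + k * LANES + lane] << (k * width)
        words.extend(lanes)
    return words


def unpack_group(words: List[int], width: int, count: int) -> List[int]:
    """Inverse of pack_group; returns the first `count` decoded integers."""
    mask = (1 << width) - 1
    per_lane = LANE_BITS // width
    out: List[int] = []
    for base in range(0, len(words), LANES):
        vector = words[base:base + LANES]
        for k in range(per_lane):
            # The same shift and mask are applied to every lane; on real
            # hardware this step would be one vector shift and one vector AND.
            out.extend((w >> (k * width)) & mask for w in vector)
    return out[:count]


values = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3]
words = pack_group(values, width=4)
decoded = unpack_group(words, width=4, count=len(values))
```

Here ten 4-bit values fit in a single four-word vector, and `decoded` equals `values`. Because no lane needs a different shift or mask from its neighbours, the inner decoding step has no per-integer branching, which is the property that vectorized decoders rely on.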


Published In

ACM Transactions on Information Systems, Volume 33, Issue 3
March 2015
184 pages
ISSN:1046-8188
EISSN:1558-2868
DOI:10.1145/2737814
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 23 March 2015
Accepted: 02 January 2015
Revised: 01 January 2015
Received: 01 July 2014
Published in TOIS Volume 33, Issue 3

Author Tags

  1. SIMD
  2. index compression
  3. integer encoding
  4. inverted index

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

  • National Key Basic Research Program (973 Program) of China
  • Fundamental Research Funds for the Central Universities and the Research Funds of Renmin University of China
  • National Natural Science Foundation of China
  • Natural Sciences and Engineering Research Council of Canada

Article Metrics

  • Downloads (last 12 months): 27
  • Downloads (last 6 weeks): 4

Reflects downloads up to 25 Dec 2024

Cited By

  • (2024) SIMDified Data Processing - Foundations, Abstraction, and Advanced Techniques. Companion of the 2024 International Conference on Management of Data, 613-621. DOI: 10.1145/3626246.3654694. Online publication date: 9-Jun-2024
  • (2023) Accelerating Huffman Encoding Using 512-Bit SIMD Instructions. IEEE Transactions on Consumer Electronics 70:1, 554-563. DOI: 10.1109/TCE.2023.3347229. Online publication date: 25-Dec-2023
  • (2023) BOUNCE: memory-efficient SIMD approach for lightweight integer compression. Distributed and Parallel Databases 41:3, 439-466. DOI: 10.1007/s10619-023-07426-0. Online publication date: 10-May-2023
  • (2022) BOUNCE: Memory-Efficient SIMD Approach for Lightweight Integer Compression. 2022 IEEE 38th International Conference on Data Engineering Workshops (ICDEW), 123-128. DOI: 10.1109/ICDEW55742.2022.00025. Online publication date: May-2022
  • (2022) Partition-based SIMD Processing and its Application to Columnar Database Systems. Datenbank-Spektrum 23:1, 53-63. DOI: 10.1007/s13222-022-00431-0. Online publication date: 7-Dec-2022
  • (2022) To share or not to share vector registers? The VLDB Journal 31:6, 1215-1236. DOI: 10.1007/s00778-022-00744-2. Online publication date: 28-Apr-2022
  • (2021) LCTL: Lightweight Compression Template Library. 2021 IEEE International Conference on Big Data (Big Data), 2966-2975. DOI: 10.1109/BigData52589.2021.9671706. Online publication date: 15-Dec-2021
  • (2021) Transcoding billions of Unicode characters per second with SIMD instructions. Software: Practice and Experience 52:2, 555-575. DOI: 10.1002/spe.3036. Online publication date: 13-Oct-2021
  • (2020) MorphStore. Proceedings of the VLDB Endowment 13:12, 2396-2410. DOI: 10.14778/3407790.3407833. Online publication date: 14-Sep-2020
  • (2020) FPGA-Accelerated compression of integer vectors. Proceedings of the 16th International Workshop on Data Management on New Hardware, 1-10. DOI: 10.1145/3399666.3399932. Online publication date: 15-Jun-2020
