research-article

FCBench: Cross-Domain Benchmarking of Lossless Compression for Floating-Point Data

Published: 03 May 2024

    Abstract

    While both the database and high-performance computing (HPC) communities utilize lossless compression methods to minimize floating-point data size, a disconnect persists between them. Each community designs and assesses methods in a domain-specific manner, making it unclear if HPC compression techniques can benefit database applications or vice versa. With the HPC community increasingly leaning towards in-situ analysis and visualization, more floating-point data from scientific simulations are being stored in databases like Key-Value Stores and queried using in-memory retrieval paradigms. This trend underscores the urgent need for a collective study of these compression methods' strengths and limitations, not only based on their performance in compressing data from various domains but also on their runtime characteristics. Our study extensively evaluates the performance of eight CPU-based and five GPU-based compression methods developed by both communities, using 33 real-world datasets assembled in the Floating-point Compressor Benchmark (FCBench). Additionally, we utilize the roofline model to profile their runtime bottlenecks. Our goal is to offer insights into these compression methods that could assist researchers in selecting existing methods or developing new ones for integrated database and HPC applications.
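    As a rough illustration of what roofline profiling measures, the minimal Python sketch below applies the standard roofline bound: attainable throughput is the smaller of the compute peak and the memory bandwidth multiplied by operational intensity (operations per byte moved). The function names and the hardware figures are hypothetical placeholders chosen for the example; they are not values or code from FCBench.

```python
def attainable_gops(peak_gops: float, bandwidth_gbs: float, intensity: float) -> float:
    """Roofline bound in Gop/s: min(compute peak, bandwidth * operational intensity)."""
    return min(peak_gops, bandwidth_gbs * intensity)


def bottleneck(peak_gops: float, bandwidth_gbs: float, intensity: float) -> str:
    """Classify a kernel by comparing its intensity with the ridge point."""
    ridge = peak_gops / bandwidth_gbs  # ops/byte where the two roofs meet
    return "memory-bound" if intensity < ridge else "compute-bound"


# Hypothetical machine: 1000 Gop/s peak, 100 GB/s memory bandwidth (assumed figures).
PEAK_GOPS = 1000.0
BANDWIDTH_GBS = 100.0

# Hypothetical compressor kernels with low and high operational intensity.
for name, intensity in [("low-intensity kernel", 0.5), ("high-intensity kernel", 20.0)]:
    bound = attainable_gops(PEAK_GOPS, BANDWIDTH_GBS, intensity)
    kind = bottleneck(PEAK_GOPS, BANDWIDTH_GBS, intensity)
    print(f"{name}: at most {bound} Gop/s ({kind})")
```

    Under these assumed figures the ridge point is 10 operations per byte, so a kernel that performs fewer operations per byte than that is limited by memory traffic rather than by arithmetic.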

    Cited By

    • (2024) Revisiting B-tree Compression: An Experimental Study. Proceedings of the ACM on Management of Data 2:3 (1-25). DOI: 10.1145/3654972. Online publication date: 30-May-2024

    Published In

    Proceedings of the VLDB Endowment, Volume 17, Issue 6
    February 2024
    369 pages
    ISSN: 2150-8097

    Publisher

    VLDB Endowment


