High-performance Effective Scientific Error-bounded Lossy Compression with Auto-tuned Multi-component Interpolation

Published: 26 March 2024
  • Abstract

    Error-bounded lossy compression has been identified as a promising solution for significantly reducing scientific data volumes according to users' requirements on data distortion. Among existing scientific error-bounded lossy compressors, some (such as SPERR and FAZ) reach fairly high compression ratios, and others (such as SZx, SZ, and ZFP) feature high compression speeds, but few achieve both high ratio and high speed at the same time. In this paper, we propose HPEZ, with newly designed interpolations and quality-metric-driven auto-tuning, which significantly improves compression quality over existing high-performance compressors while running substantially faster than high-ratio compressors. The key contributions are as follows: (1) We develop a series of advanced techniques, such as interpolation re-ordering, multi-dimensional interpolation, and natural cubic splines, to significantly improve the compression quality of interpolation-based data prediction. (2) The auto-tuning module in HPEZ has been carefully designed with novel strategies, including but not limited to block-wise interpolation tuning, dynamic dimension freezing, and Lorenzo tuning. (3) We thoroughly evaluate HPEZ against many other compressors on six real-world scientific datasets. Experiments show that HPEZ outperforms other high-performance error-bounded lossy compressors in compression ratio by up to 140% under the same error bound, and by up to 360% under the same PSNR. In parallel data-transfer experiments on a distributed database, HPEZ achieves a significant performance gain, reducing time cost by up to 40% over the second-best compressor.
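    The interpolation-based prediction underlying such compressors can be sketched in a few lines: anchor points are kept, the remaining points are predicted by interpolating their already-reconstructed neighbors, and only the quantized residual is encoded, with the quantization bin width chosen so that the reconstruction error never exceeds the user's bound. The following is a minimal, illustrative 1D sketch of this general idea, not HPEZ's actual multi-level, multi-dimensional natural-cubic-spline scheme; the function name and simple midpoint (linear) predictor are hypothetical stand-ins.

    ```python
    import numpy as np

    def interp_predict_quantize(data, eb):
        """Illustrative 1D interpolation-based error-bounded prediction.

        Even-indexed samples act as anchors (assumed stored losslessly).
        Each odd-indexed sample is predicted from its reconstructed
        neighbors, and the residual is quantized to an integer bin of
        width 2*eb, which keeps the reconstruction error within eb.
        """
        recon = data.astype(float).copy()          # anchors kept exactly
        bins = np.zeros(len(data), dtype=np.int64)  # what an encoder would emit
        for i in range(1, len(data) - 1, 2):
            pred = 0.5 * (recon[i - 1] + recon[i + 1])      # midpoint predictor
            bins[i] = int(np.round((data[i] - pred) / (2 * eb)))
            recon[i] = pred + bins[i] * 2 * eb      # value the decompressor sees
        return bins, recon

    data = np.sin(np.linspace(0, 3, 17))
    bins, recon = interp_predict_quantize(data, eb=1e-3)
    # Reconstruction honors the absolute error bound at every point.
    assert float(np.max(np.abs(recon - data))) <= 1e-3
    ```

    Because the predictor sees only reconstructed values, compressor and decompressor stay in lockstep; the compression gain comes from the residual bins being small integers that entropy-code far better than raw floats.
    
    
    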

    References

    [1]
    [n. d.]. Miranda application. https://wci.llnl.gov/simulation/computer-codes/miranda
    [2]
    [n. d.]. NSTX-GPI. https://w3.pppl.gov/~szweben/NSTX%20Blob%20Library/NSTXblobs.html
    [3]
    [n. d.]. Scalable Computing for Advanced Library and Environment (SCALE) -- LETKF. https://github.com/gylien/scale-letkf.
    [4]
    [n. d.]. SEGSalt. https://wiki.seg.org/wiki/SEG/EAGE_Salt_and_Overthrust_Models.
    [5]
    Mark Ainsworth, Ozan Tugluk, Ben Whitney, and Scott Klasky. 2018. Multilevel techniques for compression and reduction of scientific data-the univariate case. Computing and Visualization in Science 19, 5 (2018), 65--76.
    [6]
    Rachana Ananthakrishnan, Kyle Chard, Ian Foster, and Steven Tuecke. 2015. Globus platform-as-a-service for collaborative science applications. Concurrency and Computation: Practice and Experience 27, 2 (2015), 290--305.
    [7]
    Rafael Ballester-Ripoll, Peter Lindstrom, and Renato Pajarola. 2019. TTHRESH: Tensor compression for multidimensional visual data. IEEE transactions on visualization and computer graphics 26, 9 (2019), 2891--2903.
    [8]
    Dor Bank, Noam Koenigstein, and Raja Giryes. 2020. Autoencoders. arXiv preprint arXiv:2003.05991 (2020).
    [9]
    Franck Cappello, Sheng Di, Sihuan Li, Xin Liang, Ali M. Gok, Dingwen Tao, Chun Hong Yoon, Xin-Chuan Wu, Yuri Alexeev, and Frederic T. Chong. 2019. Use cases of lossy compression for floating-point data in scientific datasets. International Journal of High Performance Computing Applications (IJHPCA) 33 (2019), 1201--1220.
    [10]
    Kyle Chard, Jim Pruyne, Ben Blaiszik, Rachana Ananthakrishnan, Steven Tuecke, and Ian Foster. 2015. Globus data publication as a service: Lowering barriers to reproducible science. In 2015 IEEE 11th International Conference on e-Science. IEEE, 401--410.
    [11]
    Kyle Chard, Steven Tuecke, and Ian Foster. 2016. Globus: Recent enhancements and future plans. In Proceedings of the XSEDE16 Conference on Diversity, Big Data, and Science at Scale. 1--8.
    [12]
    Yann Collet. 2015. Zstandard -- Real-time data compression algorithm. http://facebook.github.io/zstd/ (2015).
    [13]
    Ziquan Fang, Yuntao Du, Lu Chen, Yujia Hu, Yunjun Gao, and Gang Chen. 2021. E2DTC: An end-to-end deep trajectory clustering framework via self-training. In 2021 IEEE 37th International Conference on Data Engineering (ICDE). IEEE, 696--707.
    [14]
    Andrew Glaws, Ryan King, and Michael Sprague. 2020. Deep learning for in situ data compression of large turbulent flow simulations. Physical Review Fluids 5, 11 (2020), 114602.
    [15]
    Salman Habib, Adrian Pope, Hal Finkel, Nicholas Frontiere, Katrin Heitmann, David Daniel, Patricia Fasel, Vitali Morozov, George Zagaris, Tom Peterka, et al. 2016. HACC: Simulating sky surveys on state-of-the-art supercomputing architectures. New Astronomy 42 (2016), 49--65.
    [16]
    Jun Han and Chaoli Wang. 2022. Coordnet: Data generation and visualization generation for time-varying volumes via a coordinate-based neural network. IEEE Transactions on Visualization and Computer Graphics (2022).
    [17]
    Lucas Hayne, John Clyne, and Shaomeng Li. 2021. Using Neural Networks for Two Dimensional Scientific Data Compression. In 2021 IEEE International Conference on Big Data (Big Data). IEEE, 2956--2965.
    [18]
    Søren Kejser Jensen, Torben Bach Pedersen, and Christian Thomsen. 2018. Modelardb: Modular model-based time series management with spark and cassandra. Proceedings of the VLDB Endowment 11, 11 (2018), 1688--1701.
    [19]
    Pu Jiao, Sheng Di, Hanqi Guo, Kai Zhao, Jiannan Tian, Dingwen Tao, Xin Liang, and Franck Cappello. 2022. Toward Quantity-of-Interest Preserving Lossy Compression for Scientific Data. Proceedings of the VLDB Endowment 16, 4 (2022), 697--710.
    [20]
    J. E. Kay et al. 2015. The Community Earth System Model (CESM) large ensemble project: A community resource for studying climate change in the presence of internal climate variability. Bulletin of the American Meteorological Society 96, 8 (2015), 1333--1349.
    [21]
    Suha Kayum et al. 2020. GeoDRIVE -- a high performance computing flexible platform for seismic applications. First Break 38, 2 (2020), 97--100.
    [22]
    Søren Kejser Jensen, Torben Bach Pedersen, and Christian Thomsen. 2019. Scalable Model-Based Management of Correlated Dimensional Time Series in ModelarDB. arXiv e-prints (2019), arXiv--1903.
    [23]
    Diederik P Kingma and Max Welling. 2013. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114 (2013).
    [24]
    Soheil Kolouri, Phillip E Pope, Charles E Martin, and Gustavo K Rohde. 2018. Sliced Wasserstein auto-encoders. In International Conference on Learning Representations.
    [25]
    Sriram Lakshminarasimhan, Neil Shah, Stephane Ethier, Scott Klasky, Rob Latham, Rob Ross, and Nagiza F. Samatova. 2011. Compressing the Incompressible with ISABELA: In-situ Reduction of Spatio-temporal Data. In Euro-Par 2011 Parallel Processing, Emmanuel Jeannot, Raymond Namyst, and Jean Roman (Eds.). Springer Berlin Heidelberg, Berlin, Heidelberg, 366--379.
    [26]
    Sihuan Li, Sheng Di, Kai Zhao, Xin Liang, Zizhong Chen, and Franck Cappello. 2021. Resilient Error-Bounded Lossy Compressor for Data Transfer. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (St. Louis, Missouri) (SC '21). Article 94, 14 pages.
    [27]
    Shaomeng Li, Peter Lindstrom, and John Clyne. 2023. Lossy scientific data compression with SPERR. In 2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS). IEEE, 1007--1017.
    [28]
    Xiucheng Li, Kaiqi Zhao, Gao Cong, Christian S Jensen, and Wei Wei. 2018. Deep representation learning for trajectory similarity computation. In 2018 IEEE 34th international conference on data engineering (ICDE). IEEE, 617--628.
    [29]
    Yi Li, Eric Perlman, Minping Wan, Yunke Yang, Charles Meneveau, Randal Burns, Shiyi Chen, Alexander Szalay, and Gregory Eyink. 2008. A public turbulence database cluster and applications to study Lagrangian evolution of velocity increments in turbulence. Journal of Turbulence 9 (2008), N31.
    [30]
    Xin Liang, Sheng Di, Dingwen Tao, Sihuan Li, Shaomeng Li, Hanqi Guo, Zizhong Chen, and Franck Cappello. 2018. Error-Controlled Lossy Compression Optimized for High Compression Ratios of Scientific Datasets. In 2018 IEEE International Conference on Big Data. IEEE.
    [31]
    Xin Liang, Ben Whitney, Jieyang Chen, Lipeng Wan, Qing Liu, Dingwen Tao, James Kress, David R Pugmire, Matthew Wolf, Norbert Podhorszki, et al. 2021. MGARD: Optimizing multilevel methods for error-bounded scientific data reduction. IEEE Trans. Comput. (2021).
    [32]
    Xin Liang, Kai Zhao, Sheng Di, Sihuan Li, Robert Underwood, Ali M Gok, Jiannan Tian, Junjing Deng, Jon C Calhoun, Dingwen Tao, et al. 2022. SZ3: A modular framework for composing prediction-based error-bounded lossy compressors. IEEE Transactions on Big Data (2022).
    [33]
    Peter Lindstrom. 2014. Fixed-rate compressed floating-point arrays. IEEE transactions on visualization and computer graphics 20, 12 (2014), 2674--2683.
    [34]
    Jinyang Liu, Sheng Di, Kai Zhao, Sian Jin, Dingwen Tao, Xin Liang, Zizhong Chen, and Franck Cappello. 2021. Exploring Autoencoder-based Error-bounded Compression for Scientific Data. In 2021 IEEE International Conference on Cluster Computing (CLUSTER). IEEE, 294--306.
    [35]
    Jinyang Liu, Sheng Di, Kai Zhao, Xin Liang, Zizhong Chen, and Franck Cappello. 2022. Dynamic quality metric oriented error bounded lossy compression for scientific datasets. In 2022 SC22: International Conference for High Performance Computing, Networking, Storage and Analysis (SC). IEEE Computer Society, 892--906.
    [36]
    Jinyang Liu, Sheng Di, Kai Zhao, Xin Liang, Zizhong Chen, and Franck Cappello. 2023. FAZ: A flexible auto-tuned modular error-bounded compression framework for scientific data. In Proceedings of the 37th International Conference on Supercomputing. 1--13.
    [37]
    Tong Liu, Jinzhen Wang, Qing Liu, Shakeel Alibhai, Tao Lu, and Xubin He. 2021. High-Ratio Lossy Compression: Exploring the Autoencoder to Compress Scientific Data. IEEE Transactions on Big Data (2021).
    [38]
    Yuanjian Liu, Sheng Di, Kyle Chard, Ian Foster, and Franck Cappello. 2023. Optimizing Scientific Data Transfer on Globus with Error-bounded Lossy Compression. arXiv:2307.05416 [cs.DC]
    [39]
    Yuzhe Lu, Kairong Jiang, Joshua A Levine, and Matthew Berger. 2021. Compressive neural representations of volumetric scalar fields. In Computer Graphics Forum, Vol. 40. Wiley Online Library, 135--146.
    [40]
    Tuomas Pelkonen et al. 2015. Gorilla: A Fast, Scalable, in-Memory Time Series Database. Proc. VLDB Endow. 8, 12 (Aug. 2015), 1816--1827.
    [41]
    Tjerk P Straatsma, Katerina B Antypas, and Timothy J Williams. 2017. Exascale scientific applications: Scalability and performance portability. CRC Press.
    [42]
    Gary J Sullivan, Jens-Rainer Ohm, Woo-Jin Han, and Thomas Wiegand. 2012. Overview of the high efficiency video coding (HEVC) standard. IEEE Transactions on circuits and systems for video technology 22, 12 (2012), 1649--1668.
    [43]
    Dingwen Tao, Sheng Di, Hanqi Guo, Zizhong Chen, and Franck Cappello. 2019. Z-checker: A framework for assessing lossy compression of scientific data. The International Journal of High Performance Computing Applications 33, 2 (2019), 285--303. https://doi.org/10.1177/1094342017737147
    [44]
    David S Taubman, Michael W Marcellin, and Majid Rabbani. 2002. JPEG2000: Image compression fundamentals, standards and practice. Journal of Electronic Imaging 11, 2 (2002), 286--287.
    [45]
    Jiannan Tian et al. 2020. CuSZ: An Efficient GPU-Based Error-Bounded Lossy Compression Framework for Scientific Data. In Proceedings of the ACM International Conference on Parallel Architectures and Compilation Techniques (PACT '20). 3--15.
    [46]
    Jiannan Tian, Sheng Di, Xiaodong Yu, Cody Rivera, Kai Zhao, Sian Jin, Yunhe Feng, Xin Liang, Dingwen Tao, and Franck Cappello. 2021. cuSZ(x): Optimizing Error-Bounded Lossy Compression for Scientific Data on GPUs. CoRR (2021).
    [47]
    Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. 2004. Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing 13, 4 (2004), 600--612.
    [48]
    Xinyang Yu et al. 2020. Two-Level Data Compression using Machine Learning in Time Series Database. In 36th IEEE International Conference on Data Engineering. 1333--1344.
    [49]
    Xiaodong Yu, Sheng Di, Kai Zhao, Jiannan Tian, Dingwen Tao, Xin Liang, and Franck Cappello. 2022. SZx: an Ultra-fast Error-bounded Lossy Compressor for Scientific Datasets. arXiv preprint arXiv:2201.13020 (2022).
    [50]
    Boyuan Zhang, Jiannan Tian, Sheng Di, Xiaodong Yu, Yunhe Feng, Xin Liang, Dingwen Tao, and Franck Cappello. 2023. FZ-GPU: A Fast and High-Ratio Lossy Compressor for Scientific Computing Applications on GPUs. arXiv preprint arXiv:2304.12557 (2023).
    [51]
    Dongxiang Zhang, Mengting Ding, Dingyu Yang, Yi Liu, Ju Fan, and Heng Tao Shen. 2018. Trajectory simplification: an experimental study and quality analysis. Proceedings of the VLDB Endowment 11, 9 (2018), 934--946.
    [52]
    Kai Zhao, Sheng Di, Perez Danny, Zizhong Chen, and Franck Cappello. 2022. MDZ: An Efficient Error-bounded Lossy Compressor for Molecular Dynamics Simulations. In 2022 IEEE 38th International Conference on Data Engineering (ICDE).
    [53]
    Kai Zhao, Sheng Di, Maxim Dmitriev, Thierry-Laurent D. Tonellot, Zizhong Chen, and Franck Cappello. 2021. Optimizing Error-Bounded Lossy Compression for Scientific Data by Dynamic Spline Interpolation. In 2021 IEEE 37th International Conference on Data Engineering (ICDE). 1643--1654. https://doi.org/10.1109/ICDE51399.2021.00145
    [54]
    Kai Zhao, Sheng Di, Xin Liang, Sihuan Li, Dingwen Tao, Julie Bessac, Zizhong Chen, and Franck Cappello. 2020. SDRBench: Scientific Data Reduction Benchmark for Lossy Compressors. In 2020 IEEE International Conference on Big Data (Big Data). 2716--2724.
    [55]
    Kai Zhao, Sheng Di, Xin Liang, Sihuan Li, Dingwen Tao, Zizhong Chen, and Franck Cappello. 2020. Significantly Improving Lossy Compression for HPC Datasets with Second-Order Prediction and Parameter Optimization. In Proceedings of the 29th International Symposium on High-Performance Parallel and Distributed Computing (Stockholm, Sweden) (HPDC '20). Association for Computing Machinery, New York, NY, USA, 89--100. https://doi.org/10.1145/3369583.3392688


    Published In

    Proceedings of the ACM on Management of Data, Volume 2, Issue 1 (SIGMOD). February 2024. 1874 pages.
    EISSN: 2836-6573
    DOI: 10.1145/3654807
    Publication rights licensed to ACM. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 26 March 2024
    Published in PACMMOD Volume 2, Issue 1


    Author Tags

    1. error-bounded lossy compression
    2. interpolation
    3. scientific database

    Qualifiers

    • Research-article

    Funding Sources

    • U.S. Department of Energy (DOE)
    • National Science Foundation (NSF)

