High-performance Effective Scientific Error-bounded Lossy Compression with Auto-tuned Multi-component Interpolation

Published: 26 March 2024
  • Abstract

    Error-bounded lossy compression has been identified as a promising solution for significantly reducing scientific data volumes according to users' requirements on data distortion. Among existing scientific error-bounded lossy compressors, some (such as SPERR and FAZ) reach fairly high compression ratios, and others (such as SZx, SZ, and ZFP) feature high compression speeds, but few achieve both high ratio and high speed at the same time. In this paper, we propose HPEZ, with newly designed interpolations and quality-metric-driven auto-tuning, which significantly improves compression quality over existing high-performance compressors while running substantially faster than high-ratio compressors. The key contributions are as follows: (1) We develop a series of advanced techniques, such as interpolation re-ordering, multi-dimensional interpolation, and natural cubic splines, to significantly improve the compression quality of interpolation-based data prediction. (2) The auto-tuning module in HPEZ has been carefully designed with novel strategies, including but not limited to block-wise interpolation tuning, dynamic dimension freezing, and Lorenzo tuning. (3) We thoroughly evaluate HPEZ against many other compressors on six real-world scientific datasets. Experiments show that HPEZ outperforms other high-performance error-bounded lossy compressors in compression ratio by up to 140% under the same error bound, and by up to 360% under the same PSNR. In parallel data-transfer experiments on a distributed database, HPEZ achieves a significant performance gain, reducing time cost by up to 40% over the second-best compressor.
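    The interpolation-based prediction underlying such compressors can be sketched in a few lines: anchor points are kept, the remaining points are predicted by interpolating their already-reconstructed neighbors, and only the quantized residual is encoded, with the quantization bin width chosen so that the reconstruction error never exceeds the user's bound. The following is a minimal, illustrative 1D sketch of this general idea, not HPEZ's actual multi-level, multi-dimensional natural-cubic-spline scheme; the function name and simple midpoint (linear) predictor are hypothetical stand-ins.

    ```python
    import numpy as np

    def interp_predict_quantize(data, eb):
        """Illustrative 1D interpolation-based error-bounded prediction.

        Even-indexed samples act as anchors (assumed stored losslessly).
        Each odd-indexed sample is predicted from its reconstructed
        neighbors, and the residual is quantized to an integer bin of
        width 2*eb, which keeps the reconstruction error within eb.
        """
        recon = data.astype(float).copy()          # anchors kept exactly
        bins = np.zeros(len(data), dtype=np.int64)  # what an encoder would emit
        for i in range(1, len(data) - 1, 2):
            pred = 0.5 * (recon[i - 1] + recon[i + 1])      # midpoint predictor
            bins[i] = int(np.round((data[i] - pred) / (2 * eb)))
            recon[i] = pred + bins[i] * 2 * eb      # value the decompressor sees
        return bins, recon

    data = np.sin(np.linspace(0, 3, 17))
    bins, recon = interp_predict_quantize(data, eb=1e-3)
    # Reconstruction honors the absolute error bound at every point.
    assert float(np.max(np.abs(recon - data))) <= 1e-3
    ```

    Because the predictor sees only reconstructed values, compressor and decompressor stay in lockstep; the compression gain comes from the residual bins being small integers that entropy-code far better than raw floats.
    
    
    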

    References

    [1]
    [n. d.]. Miranda application. https://wci.llnl.gov/simulation/computer-codes/miranda
    [2]
    [n. d.]. NSTX-GPI. https://w3.pppl.gov/~szweben/NSTX%20Blob%20Library/NSTXblobs.html
    [3]
    [n. d.]. Scalable Computing for Advanced Library and Environment (SCALE) -- LETKF. https://github.com/gylien/scale-letkf.
    [4]
    [n. d.]. SEGSalt. https://wiki.seg.org/wiki/SEG/EAGE_Salt_and_Overthrust_Models.
    [5]
    Mark Ainsworth, Ozan Tugluk, Ben Whitney, and Scott Klasky. 2018. Multilevel techniques for compression and reduction of scientific data-the univariate case. Computing and Visualization in Science 19, 5 (2018), 65--76.
    [6]
    Rachana Ananthakrishnan, Kyle Chard, Ian Foster, and Steven Tuecke. 2015. Globus platform-as-a-service for collaborative science applications. Concurrency and Computation: Practice and Experience 27, 2 (2015), 290--305.
    [7]
    Rafael Ballester-Ripoll, Peter Lindstrom, and Renato Pajarola. 2019. TTHRESH: Tensor compression for multidimensional visual data. IEEE transactions on visualization and computer graphics 26, 9 (2019), 2891--2903.
    [8]
    Dor Bank, Noam Koenigstein, and Raja Giryes. 2020. Autoencoders. arXiv preprint arXiv:2003.05991 (2020).
    [9]
    Franck Cappello, Sheng Di, Sihuan Li, Xin Liang, Ali M. Gok, Dingwen Tao, Chun Hong Yoon, Xin-Chuan Wu, Yuri Alexeev, and Frederic T. Chong. 2019. Use cases of lossy compression for floating-point data in scientific datasets. International Journal of High Performance Computing Applications (IJHPCA) 33 (2019), 1201--1220.
    [10]
    Kyle Chard, Jim Pruyne, Ben Blaiszik, Rachana Ananthakrishnan, Steven Tuecke, and Ian Foster. 2015. Globus data publication as a service: Lowering barriers to reproducible science. In 2015 IEEE 11th International Conference on e-Science. IEEE, 401--410.
    [11]
    Kyle Chard, Steven Tuecke, and Ian Foster. 2016. Globus: Recent enhancements and future plans. In Proceedings of the XSEDE16 Conference on Diversity, Big Data, and Science at Scale. 1--8.
    [12]
    Yann Collet. 2015. Zstandard -- Real-time data compression algorithm. http://facebook.github.io/zstd/ (2015).
    [13]
    Ziquan Fang, Yuntao Du, Lu Chen, Yujia Hu, Yunjun Gao, and Gang Chen. 2021. E2DTC: An end-to-end deep trajectory clustering framework via self-training. In 2021 IEEE 37th International Conference on Data Engineering (ICDE). IEEE, 696--707.
    [14]
    Andrew Glaws, Ryan King, and Michael Sprague. 2020. Deep learning for in situ data compression of large turbulent flow simulations. Physical Review Fluids 5, 11 (2020), 114602.
    [15]
    Salman Habib, Adrian Pope, Hal Finkel, Nicholas Frontiere, Katrin Heitmann, David Daniel, Patricia Fasel, Vitali Morozov, George Zagaris, Tom Peterka, et al. 2016. HACC: Simulating sky surveys on state-of-the-art supercomputing architectures. New Astronomy 42 (2016), 49--65.
    [16]
    Jun Han and Chaoli Wang. 2022. Coordnet: Data generation and visualization generation for time-varying volumes via a coordinate-based neural network. IEEE Transactions on Visualization and Computer Graphics (2022).
    [17]
    Lucas Hayne, John Clyne, and Shaomeng Li. 2021. Using Neural Networks for Two Dimensional Scientific Data Compression. In 2021 IEEE International Conference on Big Data (Big Data). IEEE, 2956--2965.
    [18]
    Søren Kejser Jensen, Torben Bach Pedersen, and Christian Thomsen. 2018. Modelardb: Modular model-based time series management with spark and cassandra. Proceedings of the VLDB Endowment 11, 11 (2018), 1688--1701.
    [19]
    Pu Jiao, Sheng Di, Hanqi Guo, Kai Zhao, Jiannan Tian, Dingwen Tao, Xin Liang, and Franck Cappello. 2022. Toward Quantity-of-Interest Preserving Lossy Compression for Scientific Data. Proceedings of the VLDB Endowment 16, 4 (2022), 697--710.
    [20]
    J. E. Kay et al. 2015. The Community Earth System Model (CESM) large ensemble project: A community resource for studying climate change in the presence of internal climate variability. Bulletin of the American Meteorological Society 96, 8 (2015), 1333--1349.
    [21]
    Suha Kayum et al. 2020. GeoDRIVE -- a high performance computing flexible platform for seismic applications. First Break 38, 2 (2020), 97--100.
    [22]
    Søren Kejser Jensen, Torben Bach Pedersen, and Christian Thomsen. 2019. Scalable Model-Based Management of Correlated Dimensional Time Series in ModelarDB. arXiv e-prints (2019), arXiv--1903.
    [23]
    Diederik P Kingma and Max Welling. 2013. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114 (2013).
    [24]
    Soheil Kolouri, Phillip E Pope, Charles E Martin, and Gustavo K Rohde. 2018. Sliced Wasserstein auto-encoders. In International Conference on Learning Representations.
    [25]
    Sriram Lakshminarasimhan, Neil Shah, Stephane Ethier, Scott Klasky, Rob Latham, Rob Ross, and Nagiza F. Samatova. 2011. Compressing the Incompressible with ISABELA: In-situ Reduction of Spatio-temporal Data. In Euro-Par 2011 Parallel Processing, Emmanuel Jeannot, Raymond Namyst, and Jean Roman (Eds.). Springer Berlin Heidelberg, Berlin, Heidelberg, 366--379.
    [26]
    Sihuan Li, Sheng Di, Kai Zhao, Xin Liang, Zizhong Chen, and Franck Cappello. 2021. Resilient Error-Bounded Lossy Compressor for Data Transfer. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (St. Louis, Missouri) (SC '21). Article 94, 14 pages.
    [27]
    Shaomeng Li, Peter Lindstrom, and John Clyne. 2023. Lossy scientific data compression with SPERR. In 2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS). IEEE, 1007--1017.
    [28]
    Xiucheng Li, Kaiqi Zhao, Gao Cong, Christian S Jensen, and Wei Wei. 2018. Deep representation learning for trajectory similarity computation. In 2018 IEEE 34th international conference on data engineering (ICDE). IEEE, 617--628.
    [29]
    Yi Li, Eric Perlman, Minping Wan, Yunke Yang, Charles Meneveau, Randal Burns, Shiyi Chen, Alexander Szalay, and Gregory Eyink. 2008. A public turbulence database cluster and applications to study Lagrangian evolution of velocity increments in turbulence. Journal of Turbulence 9 (2008), N31.
    [30]
    Xin Liang, Sheng Di, Dingwen Tao, Sihuan Li, Shaomeng Li, Hanqi Guo, Zizhong Chen, and Franck Cappello. 2018. Error-Controlled Lossy Compression Optimized for High Compression Ratios of Scientific Datasets. In 2018 IEEE International Conference on Big Data. IEEE.
    [31]
    Xin Liang, Ben Whitney, Jieyang Chen, Lipeng Wan, Qing Liu, Dingwen Tao, James Kress, David R Pugmire, Matthew Wolf, Norbert Podhorszki, et al. 2021. MGARD: Optimizing multilevel methods for error-bounded scientific data reduction. IEEE Trans. Comput. (2021).
    [32]
    Xin Liang, Kai Zhao, Sheng Di, Sihuan Li, Robert Underwood, Ali M Gok, Jiannan Tian, Junjing Deng, Jon C Calhoun, Dingwen Tao, et al. 2022. SZ3: A modular framework for composing prediction-based error-bounded lossy compressors. IEEE Transactions on Big Data (2022).
    [33]
    Peter Lindstrom. 2014. Fixed-rate compressed floating-point arrays. IEEE transactions on visualization and computer graphics 20, 12 (2014), 2674--2683.
    [34]
    Jinyang Liu, Sheng Di, Kai Zhao, Sian Jin, Dingwen Tao, Xin Liang, Zizhong Chen, and Franck Cappello. 2021. Exploring Autoencoder-based Error-bounded Compression for Scientific Data. In 2021 IEEE International Conference on Cluster Computing (CLUSTER). IEEE, 294--306.
    [35]
    Jinyang Liu, Sheng Di, Kai Zhao, Xin Liang, Zizhong Chen, and Franck Cappello. 2022. Dynamic quality metric oriented error bounded lossy compression for scientific datasets. In 2022 SC22: International Conference for High Performance Computing, Networking, Storage and Analysis (SC). IEEE Computer Society, 892--906.
    [36]
    Jinyang Liu, Sheng Di, Kai Zhao, Xin Liang, Zizhong Chen, and Franck Cappello. 2023. FAZ: A flexible auto-tuned modular error-bounded compression framework for scientific data. In Proceedings of the 37th International Conference on Supercomputing. 1--13.
    [37]
    Tong Liu, Jinzhen Wang, Qing Liu, Shakeel Alibhai, Tao Lu, and Xubin He. 2021. High-Ratio Lossy Compression: Exploring the Autoencoder to Compress Scientific Data. IEEE Transactions on Big Data (2021).
    [38]
    Yuanjian Liu, Sheng Di, Kyle Chard, Ian Foster, and Franck Cappello. 2023. Optimizing Scientific Data Transfer on Globus with Error-bounded Lossy Compression. arXiv:2307.05416 [cs.DC]
    [39]
    Yuzhe Lu, Kairong Jiang, Joshua A Levine, and Matthew Berger. 2021. Compressive neural representations of volumetric scalar fields. In Computer Graphics Forum, Vol. 40. Wiley Online Library, 135--146.
    [40]
    Tuomas Pelkonen et al. 2015. Gorilla: A Fast, Scalable, in-Memory Time Series Database. Proc. VLDB Endow. 8, 12 (Aug. 2015), 1816--1827.
    [41]
    Tjerk P Straatsma, Katerina B Antypas, and Timothy J Williams. 2017. Exascale scientific applications: Scalability and performance portability. CRC Press.
    [42]
    Gary J Sullivan, Jens-Rainer Ohm, Woo-Jin Han, and Thomas Wiegand. 2012. Overview of the high efficiency video coding (HEVC) standard. IEEE Transactions on circuits and systems for video technology 22, 12 (2012), 1649--1668.
    [43]
    Dingwen Tao, Sheng Di, Hanqi Guo, Zizhong Chen, and Franck Cappello. 2019. Z-checker: A framework for assessing lossy compression of scientific data. The International Journal of High Performance Computing Applications 33, 2 (2019), 285--303. https://doi.org/10.1177/1094342017737147
    [44]
    David S Taubman, Michael W Marcellin, and Majid Rabbani. 2002. JPEG2000: Image compression fundamentals, standards and practice. Journal of Electronic Imaging 11, 2 (2002), 286--287.
    [45]
    Jiannan Tian et al. 2020. CuSZ: An Efficient GPU-Based Error-Bounded Lossy Compression Framework for Scientific Data. In Proceedings of the ACM International Conference on Parallel Architectures and Compilation Techniques (PACT '20). 3--15.
    [46]
    Jiannan Tian, Sheng Di, Xiaodong Yu, Cody Rivera, Kai Zhao, Sian Jin, Yunhe Feng, Xin Liang, Dingwen Tao, and Franck Cappello. 2021. cuSZ(x): Optimizing Error-Bounded Lossy Compression for Scientific Data on GPUs. CoRR (2021).
    [47]
    Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. 2004. Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing 13, 4 (2004), 600--612.
    [48]
    Xinyang Yu et al. 2020. Two-Level Data Compression using Machine Learning in Time Series Database. In 36th IEEE International Conference on Data Engineering. 1333--1344.
    [49]
    Xiaodong Yu, Sheng Di, Kai Zhao, Jiannan Tian, Dingwen Tao, Xin Liang, and Franck Cappello. 2022. SZx: an Ultra-fast Error-bounded Lossy Compressor for Scientific Datasets. arXiv preprint arXiv:2201.13020 (2022).
    [50]
    Boyuan Zhang, Jiannan Tian, Sheng Di, Xiaodong Yu, Yunhe Feng, Xin Liang, Dingwen Tao, and Franck Cappello. 2023. FZ-GPU: A Fast and High-Ratio Lossy Compressor for Scientific Computing Applications on GPUs. arXiv preprint arXiv:2304.12557 (2023).
    [51]
    Dongxiang Zhang, Mengting Ding, Dingyu Yang, Yi Liu, Ju Fan, and Heng Tao Shen. 2018. Trajectory simplification: an experimental study and quality analysis. Proceedings of the VLDB Endowment 11, 9 (2018), 934--946.
    [52]
    Kai Zhao, Sheng Di, Perez Danny, Zizhong Chen, and Franck Cappello. 2022. MDZ: An Efficient Error-bounded Lossy Compressor for Molecular Dynamics Simulations. In 2022 IEEE 38th International Conference on Data Engineering (ICDE).
    [53]
    Kai Zhao, Sheng Di, Maxim Dmitriev, Thierry-Laurent D. Tonellot, Zizhong Chen, and Franck Cappello. 2021. Optimizing Error-Bounded Lossy Compression for Scientific Data by Dynamic Spline Interpolation. In 2021 IEEE 37th International Conference on Data Engineering (ICDE). 1643--1654. https://doi.org/10.1109/ICDE51399.2021.00145
    [54]
    Kai Zhao, Sheng Di, Xin Liang, Sihuan Li, Dingwen Tao, Julie Bessac, Zizhong Chen, and Franck Cappello. 2020. SDRBench: Scientific Data Reduction Benchmark for Lossy Compressors. In 2020 IEEE International Conference on Big Data (Big Data). 2716--2724.
    [55]
    Kai Zhao, Sheng Di, Xin Liang, Sihuan Li, Dingwen Tao, Zizhong Chen, and Franck Cappello. 2020. Significantly Improving Lossy Compression for HPC Datasets with Second-Order Prediction and Parameter Optimization. In Proceedings of the 29th International Symposium on High-Performance Parallel and Distributed Computing (Stockholm, Sweden) (HPDC '20). Association for Computing Machinery, New York, NY, USA, 89--100. https://doi.org/10.1145/3369583.3392688


    Published In

    Proceedings of the ACM on Management of Data, Volume 2, Issue 1 (SIGMOD). February 2024. 1874 pages.
    EISSN: 2836-6573
    DOI: 10.1145/3654807
    Publication rights licensed to ACM. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 26 March 2024
    Published in PACMMOD Volume 2, Issue 1


    Author Tags

    1. error-bounded lossy compression
    2. interpolation
    3. scientific database

    Qualifiers

    • Research-article

    Funding Sources

    • U.S. Department of Energy (DOE)
    • National Science Foundation (NSF)

