
Z-checker: A framework for assessing lossy compression of scientific data

Published: 01 March 2019

Abstract

Because of the vast volume of data being produced by today's scientific simulations and experiments, lossy data compressors that allow user-controlled loss of accuracy during compression are a relevant solution for significantly reducing the data size. However, lossy compressor developers and users lack a tool for exploring the features of scientific data sets and understanding how the data are altered by compression in a systematic and reliable way. To address this gap, we have designed and implemented a generic framework called Z-checker. On the one hand, Z-checker combines a battery of data analysis components for data compression. On the other hand, Z-checker is implemented as an open-source community tool to which users and developers can contribute new analysis components based on their additional analysis demands. In this article, we present a survey of existing lossy compressors. We then describe the design of Z-checker, into which we integrated evaluation metrics proposed in prior work as well as other analysis tools. Specifically, for lossy compressor developers, Z-checker can characterize critical properties of any data set (such as entropy, distribution, power spectrum, principal component analysis, and autocorrelation) to improve compression strategies. For lossy compression users, Z-checker can assess compression quality (compression ratio and bit rate), provide various global distortion analyses comparing the original data with the decompressed data (peak signal-to-noise ratio, normalized mean squared error, rate–distortion, rate-compression error, spectral, distribution, and derivatives), and perform statistical analysis of the compression error (maximum, minimum, and average error; autocorrelation; and distribution of errors). Z-checker can perform the analysis at either coarse granularity (across the whole data set) or fine granularity (over user-defined blocks), so that users and developers can select the best-fit, adaptive compressors for different parts of the data set. Z-checker features a visualization interface that displays all analysis results, in addition to basic views of the data sets such as time series. To the best of our knowledge, Z-checker is the first tool designed to assess lossy compression of scientific data sets comprehensively.
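
To make these metrics concrete, the minimal sketch below computes several of them (compression ratio, bit rate, maximum and average error, NMSE, PSNR, and the lag-1 autocorrelation of the compression error) for an original array and its lossy-decompressed counterpart. This is our own illustrative Python/NumPy sketch, not Z-checker's implementation; the function name and the metric conventions chosen here (range-based PSNR, energy-normalized NMSE) are assumptions for illustration.

    # Minimal illustrative sketch (not Z-checker's code): computes a few of the
    # assessment metrics listed in the abstract for one lossy compression run.
    import numpy as np

    def distortion_metrics(original, decompressed, compressed_bytes):
        """Compare original data with its decompressed version (hypothetical helper)."""
        raw = np.asarray(original)
        orig = raw.astype(np.float64).ravel()
        err = orig - np.asarray(decompressed, dtype=np.float64).ravel()
        mse = float(np.mean(err ** 2))
        value_range = float(orig.max() - orig.min())
        # Range-based PSNR, a common convention for floating-point scientific
        # data (images typically use 255 in place of the value range).
        psnr = (float("inf") if mse == 0.0
                else 20.0 * np.log10(value_range) - 10.0 * np.log10(mse))
        return {
            "compression_ratio": raw.nbytes / compressed_bytes,
            "bit_rate": 8.0 * compressed_bytes / orig.size,  # bits per value
            "max_error": float(np.abs(err).max()),
            "avg_error": float(np.mean(err)),
            # NMSE normalized by the mean squared magnitude of the original
            # data; other normalizations (e.g. by value range) are also used.
            "nmse": mse / float(np.mean(orig ** 2)),
            "psnr_db": psnr,
            # Lag-1 autocorrelation of the compression error: values near zero
            # indicate white-noise-like error.
            "error_autocorr_lag1": (float(np.corrcoef(err[:-1], err[1:])[0, 1])
                                    if err.std() > 0 else 0.0),
        }

Applied per user-defined block rather than over the whole array, the same computation corresponds to the fine-granularity analysis described above.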


Author biographies
Dingwen Tao is a fifth-year doctoral candidate in computer science at the University of California, Riverside, advised by Dr. Zizhong Chen. He received his bachelor's degree in information and computing science from the University of Science and Technology of China. He is currently working at Argonne National Laboratory in the Extreme Scale Resilience Group led by Dr. Franck Cappello. Prior to this, he worked in the High Performance Computing Group at Pacific Northwest National Laboratory in summer 2015. His research interests include high-performance computing, parallel and distributed computing, big data analytics, resilience and fault tolerance, data compression algorithms and software, numerical algorithms and software, and high-performance computing on heterogeneous systems. He has published 10+ peer-reviewed papers at top HPC and parallel and distributed computing conferences during his PhD program, such as HPDC, IPDPS, PPoPP, and SC. E-mail: [email protected].
Sheng Di received his master's degree from the Huazhong University of Science and Technology in 2007 and his PhD degree from the University of Hong Kong in 2011. He is currently working at Argonne National Laboratory. His research interests involve resilience in high-performance computing (such as silent data corruption, optimization of checkpoint models, characterization and analysis of supercomputing logs, and in situ data compression) and broad research topics in cloud computing (including optimization of resource allocation, cloud network topology, and prediction of cloud workload/hostload). He is the author of 17 papers published in international journals and 37 papers published at international conferences. He has served as a program committee member 10+ times for different conferences and as an external conference/journal reviewer over 50 times. E-mail: [email protected].
Hanqi Guo is a postdoctoral appointee in the Mathematics and Computer Science Division, Argonne National Laboratory. He received his PhD degree in computer science from Peking University in 2014 and his BS degree in mathematics and applied mathematics from the Beijing University of Posts and Telecommunications in 2009. His research interests are mainly in large-scale scientific data visualization. E-mail: [email protected].
Zizhong Chen received a bachelor's degree in mathematics from Beijing Normal University, a master's degree in economics from the Renmin University of China, and a PhD degree in computer science from the University of Tennessee, Knoxville. He is an associate professor of computer science at the University of California, Riverside. His research interests include high-performance computing, parallel and distributed systems, big data analytics, cluster and cloud computing, algorithm-based fault tolerance, power- and energy-efficient computing, numerical algorithms and software, and large-scale computer simulations. His research has been supported by the National Science Foundation, the Department of Energy, the CMG Reservoir Simulation Foundation, the Abu Dhabi National Oil Company, Nvidia, and Microsoft Corporation. He received a CAREER Award from the US National Science Foundation and a Best Paper Award from the International Supercomputing Conference. He is a Senior Member of the IEEE and a Life Member of the ACM. He currently serves as a subject area editor for the Elsevier Parallel Computing journal and an associate editor for the IEEE Transactions on Parallel and Distributed Systems.
Franck Cappello is the director of the Joint Laboratory on Extreme Scale Computing, which gathers six of the world's leading high-performance computing institutions: Argonne National Laboratory, the National Center for Supercomputing Applications, Inria, the Barcelona Supercomputing Center, the Jülich Supercomputing Centre, and RIKEN AICS. He is a senior computer scientist at Argonne National Laboratory and an adjunct associate professor in the Department of Computer Science at the University of Illinois at Urbana–Champaign. He is an expert in resilience and fault tolerance for scientific computing and data analytics. Recently he started investigating lossy compression for scientific data sets to respond to the pressing needs of scientists performing large-scale simulations and experiments. His contribution to this domain is one of the best lossy compressors for scientific data sets that respects user-set error bounds. He is a member of the editorial board of the IEEE Transactions on Parallel and Distributed Systems and of the ACM HPDC and IEEE CCGRID steering committees. He is a fellow of the IEEE. E-mail: [email protected].



Published In

International Journal of High Performance Computing Applications, Volume 33, Issue 2
Mar 2019
200 pages

Publisher

Sage Publications, Inc.

United States


Author Tags

  1. framework
  2. lossy compression
  3. assessment tool
  4. data analytics
  5. scientific data
  6. visualization

Qualifiers

  • Research-article


Cited By

  • (2024) GWLZ: A Group-wise Learning-based Lossy Compression Framework for Scientific Data. Proceedings of the 14th Workshop on AI and Scientific Computing at Scale using Flexible Computing Infrastructures, pp. 34–41. DOI: 10.1145/3659995.3660041. Online publication date: 3-Jun-2024.
  • (2024) High-performance Effective Scientific Error-bounded Lossy Compression with Auto-tuned Multi-component Interpolation. Proceedings of the ACM on Management of Data 2(1): 1–27. DOI: 10.1145/3639259. Online publication date: 26-Mar-2024.
  • (2024) CereSZ: Enabling and Scaling Error-bounded Lossy Compression on Cerebras CS-2. Proceedings of the 33rd International Symposium on High-Performance Parallel and Distributed Computing, pp. 309–321. DOI: 10.1145/3625549.3658691. Online publication date: 3-Jun-2024.
  • (2023) Black-box statistical prediction of lossy compression ratios for scientific data. International Journal of High Performance Computing Applications 37(3–4): 412–433. DOI: 10.1177/10943420231179417. Online publication date: 1-Jul-2023.
  • (2023) cuSZp: An Ultra-fast GPU Error-bounded Lossy Compression Framework with Optimized End-to-End Performance. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1–13. DOI: 10.1145/3581784.3607048. Online publication date: 12-Nov-2023.
  • (2023) ADT-FSE: A New Encoder for SZ. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1–13. DOI: 10.1145/3581784.3607044. Online publication date: 12-Nov-2023.
  • (2023) FAZ: A flexible auto-tuned modular error-bounded compression framework for scientific data. Proceedings of the 37th International Conference on Supercomputing, pp. 1–13. DOI: 10.1145/3577193.3593721. Online publication date: 21-Jun-2023.
  • (2022) Dynamic quality metric oriented error bounded lossy compression for scientific datasets. Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, pp. 1–15. DOI: 10.5555/3571885.3571967. Online publication date: 13-Nov-2022.
  • (2022) Ultrafast Error-bounded Lossy Compression for Scientific Datasets. Proceedings of the 31st International Symposium on High-Performance Parallel and Distributed Computing, pp. 159–171. DOI: 10.1145/3502181.3531473. Online publication date: 27-Jun-2022.
  • (2021) Online data analysis and reduction. International Journal of High Performance Computing Applications 35(6): 617–635. DOI: 10.1177/10943420211023549. Online publication date: 1-Nov-2021.
