
Z-checker: A framework for assessing lossy compression of scientific data

Published: 01 March 2019

Abstract

Because of the vast volume of data being produced by today's scientific simulations and experiments, lossy data compressors that allow user-controlled loss of accuracy during compression are a relevant solution for significantly reducing the data size. However, lossy compressor developers and users lack a tool for exploring the features of scientific data sets and understanding how the data are altered by compression in a systematic and reliable way. To address this gap, we have designed and implemented a generic framework called Z-checker. On the one hand, Z-checker combines a battery of data analysis components for data compression. On the other hand, Z-checker is implemented as an open-source community tool to which users and developers can contribute new analysis components based on their additional analysis demands. In this article, we present a survey of existing lossy compressors. We then describe the design of Z-checker, into which we integrated evaluation metrics proposed in prior work as well as other analysis tools. Specifically, for lossy compressor developers, Z-checker can characterize critical properties of any data set (such as entropy, distribution, power spectrum, principal component analysis, and autocorrelation) to improve compression strategies. For lossy compression users, Z-checker can assess compression quality (compression ratio and bit rate), provide various global distortion analyses comparing the original data with the decompressed data (peak signal-to-noise ratio, normalized mean squared error, rate–distortion, rate-compression error, spectral, distribution, and derivatives), and perform statistical analysis of the compression error (maximum, minimum, and average error; autocorrelation; and distribution of errors). Z-checker can perform the analysis at either coarse granularity (across the whole data set) or fine granularity (over user-defined blocks), so that users and developers can select the best-fit, adaptive compressors for different parts of the data set. Z-checker features a visualization interface that displays all analysis results, in addition to basic views of the data sets such as time series. To the best of our knowledge, Z-checker is the first tool designed to assess lossy compression of scientific data sets comprehensively.
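
To make these metrics concrete, the minimal sketch below computes several of them (compression ratio, bit rate, maximum and average error, NMSE, PSNR, and the lag-1 autocorrelation of the compression error) for an original array and its lossy-decompressed counterpart. This is our own illustrative Python/NumPy sketch, not Z-checker's implementation; the function name and the metric conventions chosen here (range-based PSNR, energy-normalized NMSE) are assumptions for illustration.

    # Minimal illustrative sketch (not Z-checker's code): computes a few of the
    # assessment metrics listed in the abstract for one lossy compression run.
    import numpy as np

    def distortion_metrics(original, decompressed, compressed_bytes):
        """Compare original data with its decompressed version (hypothetical helper)."""
        raw = np.asarray(original)
        orig = raw.astype(np.float64).ravel()
        err = orig - np.asarray(decompressed, dtype=np.float64).ravel()
        mse = float(np.mean(err ** 2))
        value_range = float(orig.max() - orig.min())
        # Range-based PSNR, a common convention for floating-point scientific
        # data (images typically use 255 in place of the value range).
        psnr = (float("inf") if mse == 0.0
                else 20.0 * np.log10(value_range) - 10.0 * np.log10(mse))
        return {
            "compression_ratio": raw.nbytes / compressed_bytes,
            "bit_rate": 8.0 * compressed_bytes / orig.size,  # bits per value
            "max_error": float(np.abs(err).max()),
            "avg_error": float(np.mean(err)),
            # NMSE normalized by the mean squared magnitude of the original
            # data; other normalizations (e.g. by value range) are also used.
            "nmse": mse / float(np.mean(orig ** 2)),
            "psnr_db": psnr,
            # Lag-1 autocorrelation of the compression error: values near zero
            # indicate white-noise-like error.
            "error_autocorr_lag1": (float(np.corrcoef(err[:-1], err[1:])[0, 1])
                                    if err.std() > 0 else 0.0),
        }

Applied per user-defined block rather than over the whole array, the same computation corresponds to the fine-granularity analysis described above.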


Author biographies
Dingwen Tao is a fifth-year doctoral candidate in computer science at the University of California, Riverside, advised by Dr. Zizhong Chen. He received his bachelor's degree in information and computing science from the University of Science and Technology of China. He is currently working at Argonne National Laboratory in the Extreme Scale Resilience Group led by Dr. Franck Cappello. Prior to this, he worked in the High Performance Computing Group at Pacific Northwest National Laboratory in summer 2015. His research interests include high-performance computing, parallel and distributed computing, big data analytics, resilience and fault tolerance, data compression algorithms and software, numerical algorithms and software, and high-performance computing on heterogeneous systems. He has published 10+ peer-reviewed papers at top HPC and parallel and distributed computing conferences during his PhD program, such as HPDC, IPDPS, PPoPP, and SC. E-mail: [email protected].
Sheng Di received his master's degree from the Huazhong University of Science and Technology in 2007 and his PhD degree from the University of Hong Kong in 2011. He is currently working at Argonne National Laboratory. His research interests involve resilience in high-performance computing (such as silent data corruption, optimization of checkpoint models, characterization and analysis of supercomputing logs, and in situ data compression) and broad research topics in cloud computing (including optimization of resource allocation, cloud network topology, and prediction of cloud workload/hostload). He is the author of 17 papers published in international journals and 37 papers published at international conferences. He has served as a program committee member 10+ times for different conferences and as an external conference/journal reviewer over 50 times. E-mail: [email protected].
Hanqi Guo is a postdoctoral appointee in the Mathematics and Computer Science Division, Argonne National Laboratory. He received his PhD degree in computer science from Peking University in 2014 and his BS degree in mathematics and applied mathematics from the Beijing University of Posts and Telecommunications in 2009. His research interests are mainly in large-scale scientific data visualization. E-mail: [email protected].
Zizhong Chen received a bachelor's degree in mathematics from Beijing Normal University, a master's degree in economics from the Renmin University of China, and a PhD degree in computer science from the University of Tennessee, Knoxville. He is an associate professor of computer science at the University of California, Riverside. His research interests include high-performance computing, parallel and distributed systems, big data analytics, cluster and cloud computing, algorithm-based fault tolerance, power- and energy-efficient computing, numerical algorithms and software, and large-scale computer simulations. His research has been supported by the National Science Foundation, the Department of Energy, the CMG Reservoir Simulation Foundation, the Abu Dhabi National Oil Company, Nvidia, and Microsoft Corporation. He received a CAREER Award from the US National Science Foundation and a Best Paper Award from the International Supercomputing Conference. He is a Senior Member of the IEEE and a Life Member of the ACM. He currently serves as a subject area editor for the Elsevier Parallel Computing journal and an associate editor for the IEEE Transactions on Parallel and Distributed Systems.
Franck Cappello is the director of the Joint Laboratory on Extreme Scale Computing, which gathers six of the world's leading high-performance computing institutions: Argonne National Laboratory, the National Center for Supercomputing Applications, Inria, the Barcelona Supercomputing Center, the Jülich Supercomputing Centre, and RIKEN AICS. He is a senior computer scientist at Argonne National Laboratory and an adjunct associate professor in the Department of Computer Science at the University of Illinois at Urbana–Champaign. He is an expert in resilience and fault tolerance for scientific computing and data analytics. Recently he started investigating lossy compression for scientific data sets to respond to the pressing needs of scientists performing large-scale simulations and experiments. His contribution to this domain is one of the best lossy compressors for scientific data sets that respects user-set error bounds. He is a member of the editorial board of the IEEE Transactions on Parallel and Distributed Systems and of the ACM HPDC and IEEE CCGRID steering committees. He is a fellow of the IEEE. E-mail: [email protected].



Published In

International Journal of High Performance Computing Applications, Volume 33, Issue 2
Mar 2019
200 pages

Publisher

Sage Publications, Inc.

United States


Author Tags

  1. framework
  2. lossy compression
  3. assessment tool
  4. data analytics
  5. scientific data
  6. visualization

Qualifiers

  • Research-article


Cited By

  • (2024) GWLZ: A Group-wise Learning-based Lossy Compression Framework for Scientific Data. Proceedings of the 14th Workshop on AI and Scientific Computing at Scale using Flexible Computing Infrastructures, pp. 34–41. DOI: 10.1145/3659995.3660041. Online publication date: 3-Jun-2024.
  • (2024) High-performance Effective Scientific Error-bounded Lossy Compression with Auto-tuned Multi-component Interpolation. Proceedings of the ACM on Management of Data 2(1): 1–27. DOI: 10.1145/3639259. Online publication date: 26-Mar-2024.
  • (2024) CereSZ: Enabling and Scaling Error-bounded Lossy Compression on Cerebras CS-2. Proceedings of the 33rd International Symposium on High-Performance Parallel and Distributed Computing, pp. 309–321. DOI: 10.1145/3625549.3658691. Online publication date: 3-Jun-2024.
  • (2023) Black-box statistical prediction of lossy compression ratios for scientific data. International Journal of High Performance Computing Applications 37(3–4): 412–433. DOI: 10.1177/10943420231179417. Online publication date: 1-Jul-2023.
  • (2023) cuSZp: An Ultra-fast GPU Error-bounded Lossy Compression Framework with Optimized End-to-End Performance. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1–13. DOI: 10.1145/3581784.3607048. Online publication date: 12-Nov-2023.
  • (2023) ADT-FSE: A New Encoder for SZ. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1–13. DOI: 10.1145/3581784.3607044. Online publication date: 12-Nov-2023.
  • (2023) FAZ: A flexible auto-tuned modular error-bounded compression framework for scientific data. Proceedings of the 37th International Conference on Supercomputing, pp. 1–13. DOI: 10.1145/3577193.3593721. Online publication date: 21-Jun-2023.
  • (2022) Dynamic quality metric oriented error bounded lossy compression for scientific datasets. Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, pp. 1–15. DOI: 10.5555/3571885.3571967. Online publication date: 13-Nov-2022.
  • (2022) Ultrafast Error-bounded Lossy Compression for Scientific Datasets. Proceedings of the 31st International Symposium on High-Performance Parallel and Distributed Computing, pp. 159–171. DOI: 10.1145/3502181.3531473. Online publication date: 27-Jun-2022.
  • (2021) Online data analysis and reduction. International Journal of High Performance Computing Applications 35(6): 617–635. DOI: 10.1177/10943420211023549. Online publication date: 1-Nov-2021.
