DOI: 10.1145/3503646.3524294
research-article
Open access

Data-aware compression for HPC using machine learning

Published: 05 April 2022

Abstract

While compression can provide significant storage and cost savings, its use within HPC applications is often only a secondary concern. This is partly due to the inflexibility of existing approaches, where a single compression algorithm has to be used throughout the whole application, and partly because insights into the behaviour of the algorithms within the context of individual applications are missing.
There are several different compression algorithms available, with each one also having a unique set of options. These options have a direct influence on the achieved performance and compression results. Furthermore, the algorithms and options to use for a given dataset are highly dependent on the characteristics of said dataset.
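To make the dependence on algorithm, options, and data concrete, the following is a minimal sketch that measures compression ratio and time for a few candidates. It is not taken from the paper: the candidate set (Python's standard-library `zlib`, `bz2`, and `lzma` at a handful of levels), the helper names `candidates`, `measure`, and `sample_data`, and the synthetic inputs are all assumptions chosen for illustration.

```python
import bz2
import lzma
import random
import time
import zlib


def candidates():
    """Return (name, compress_fn) pairs. The algorithms and levels are illustrative."""
    found = []
    for level in (1, 6, 9):
        found.append((f"zlib-{level}", lambda d, l=level: zlib.compress(d, l)))
    for level in (1, 9):
        found.append((f"bz2-{level}", lambda d, l=level: bz2.compress(d, l)))
    found.append(("lzma-0", lambda d: lzma.compress(d, preset=0)))
    return found


def measure(data):
    """Map each candidate name to (compression ratio, elapsed seconds) for data."""
    results = {}
    for name, compress in candidates():
        start = time.perf_counter()
        compressed = compress(data)
        elapsed = time.perf_counter() - start
        results[name] = (len(data) / len(compressed), elapsed)
    return results


def sample_data(size=30_000, seed=0):
    """Build two synthetic inputs: highly repetitive bytes and pseudo-random bytes."""
    rng = random.Random(seed)
    return {
        "repetitive": (b"abc" * size)[:size],
        "random": bytes(rng.getrandbits(8) for _ in range(size)),
    }


if __name__ == "__main__":
    for label, data in sample_data().items():
        for name, (ratio, seconds) in measure(data).items():
            print(f"{label:10s} {name:8s} ratio={ratio:9.2f} time={seconds * 1e3:8.3f} ms")
```

On inputs like these, the repetitive buffer should compress far better than the pseudo-random one for every candidate, while timings vary with the algorithm and level. Real HPC datasets fall between these extremes, which is why the choice has to be made per dataset.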
This paper explores how machine learning can help identify fitting compression algorithms and corresponding options based on the actual data structures encountered during I/O. To do so, a data collection and training pipeline is introduced. Inferencing is performed during regular application runs and shows promising results. Moreover, it provides valuable insights into the benefits of using certain compression algorithms and options for specific data. Further investigations into more advanced machine learning techniques and a deeper integration into existing I/O paths will provide additional benefits.
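The collect, train, and infer flow described above can be sketched in a few lines. The abstract does not state which features, model, or candidate algorithms the paper uses, so everything here is an assumption: the two features (normalised byte entropy and the fraction of distinct byte values), the 1-nearest-neighbour lookup standing in for the learned model, the `CANDIDATES` table, and the function names are hypothetical placeholders, and "best" simply means the smallest output.

```python
import bz2
import math
import zlib
from collections import Counter

# Hypothetical candidate set; "none" stores the chunk uncompressed.
CANDIDATES = {
    "none": lambda d: d,
    "zlib-1": lambda d: zlib.compress(d, 1),
    "zlib-9": lambda d: zlib.compress(d, 9),
    "bz2-9": lambda d: bz2.compress(d, 9),
}


def features(chunk):
    """Return (byte entropy / 8, distinct byte values / 256) for a chunk."""
    n = len(chunk)
    if n == 0:
        return (0.0, 0.0)
    counts = Counter(chunk)
    entropy = -sum(c / n * math.log2(c / n) for c in counts.values())
    return (entropy / 8.0, len(counts) / 256.0)


def best_candidate(chunk):
    """Data collection step: try every candidate and keep the smallest output."""
    return min(CANDIDATES, key=lambda name: len(CANDIDATES[name](chunk)))


def train(chunks):
    """'Training' for a 1-NN model is just storing (features, label) pairs."""
    return [(features(chunk), best_candidate(chunk)) for chunk in chunks]


def predict(model, chunk):
    """Inference step: return the label of the nearest stored feature vector."""
    target = features(chunk)
    _, label = min(model, key=lambda row: math.dist(row[0], target))
    return label
```

In this sketch, inference only computes two cheap features per chunk instead of running every compressor, which is the property that makes prediction during a regular application run plausible. A production pipeline would also need to weigh compression time against ratio, which this sketch ignores.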

Cited By

  • (2022) Automated performance analysis tools framework for HPC programs. Procedia Computer Science 207:C (1067-1076). DOI: 10.1016/j.procs.2022.09.162. Online publication date: 1-Jan-2022

Published In

CHEOPS '22: Proceedings of the Workshop on Challenges and Opportunities of Efficient and Performant Storage Systems
April 2022
44 pages
ISBN: 9781450392099
DOI: 10.1145/3503646
This work is licensed under a Creative Commons Attribution International 4.0 License.

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. HDF5
  2. compression
  3. file systems
  4. machine learning

Qualifiers

  • Research-article

Conference

EuroSys '22
Acceptance Rates

Overall Acceptance Rate 6 of 8 submissions, 75%

Article Metrics

  • Downloads (last 12 months): 220
  • Downloads (last 6 weeks): 30

Reflects downloads up to 15 Feb 2025.
