DOI: 10.1145/3652892.3700780
Research article · Open access · Middleware Conference Proceedings

Towards Affordable Reproducibility Using Scalable Capture and Comparison of Intermediate Multi-Run Results

Published: 02 December 2024

Abstract

Ensuring reproducibility in high-performance computing (HPC) applications is a significant challenge, particularly when nondeterministic execution can lead to untrustworthy results. Traditional methods that compare the final results of multiple runs often fall short: they reveal the sources of discrepancies only a posteriori and require substantial resources, making them impractical. This paper introduces a method that addresses this issue through scalable capture and comparison of intermediate multi-run results. By capitalizing on intermediate checkpoints and hash-based techniques with user-defined error bounds, our method identifies divergences early in the execution paths. We employ Merkle trees over checkpoint data to reduce the I/O overhead associated with loading historical data. Our evaluations on the nondeterministic HACC cosmology simulation show that our method effectively captures differences above a predefined error bound while significantly reducing I/O overhead. Our solution provides a robust and scalable way to improve reproducibility, ensuring that scientific applications on HPC systems yield trustworthy and reliable results.
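The core idea sketched in the abstract — hashing checkpoint data under a user-defined error bound and organizing the hashes in a Merkle tree so two runs can be compared cheaply — can be illustrated with a minimal example. This is not the paper's implementation: the `quantize` scheme, the block size, and the use of SHA-256 are all illustrative assumptions. Values that differ by less than the error bound collapse to the same hash, and comparing tree roots first means matching runs are dismissed without inspecting any leaves.

```python
# Hypothetical sketch of error-bounded hashing over checkpoint blocks,
# combined into a Merkle tree; NOT the authors' actual implementation.
import hashlib

def quantize(values, error_bound):
    # Map each float to a bucket index so that values within the error
    # bound of the same bucket center hash identically.
    return [round(v / error_bound) for v in values]

def leaf_hashes(values, error_bound, block_size=4):
    # Hash fixed-size blocks of quantized values (the Merkle leaves).
    q = quantize(values, error_bound)
    leaves = []
    for i in range(0, len(q), block_size):
        block = ",".join(map(str, q[i:i + block_size])).encode()
        leaves.append(hashlib.sha256(block).hexdigest())
    return leaves

def merkle_tree(leaves):
    # Build levels bottom-up; levels[-1] holds the single root hash.
    levels = [leaves]
    while len(levels[-1]) > 1:
        cur, nxt = levels[-1], []
        for i in range(0, len(cur), 2):
            pair = cur[i] + (cur[i + 1] if i + 1 < len(cur) else cur[i])
            nxt.append(hashlib.sha256(pair.encode()).hexdigest())
        levels.append(nxt)
    return levels

def first_divergent_leaf(tree_a, tree_b):
    # Compare roots first: if they match, the runs agree within the
    # error bound and no leaf-level comparison (or I/O) is needed.
    if tree_a[-1][0] == tree_b[-1][0]:
        return None
    for i, (a, b) in enumerate(zip(tree_a[0], tree_b[0])):
        if a != b:
            return i  # index of the first diverging checkpoint block
    return None

# Two runs: the first value drifts within the bound, the last exceeds it.
run1 = [1.000, 2.000, 3.000, 4.000, 5.000, 6.000, 7.000, 8.000]
run2 = [1.0004, 2.000, 3.000, 4.000, 5.000, 6.000, 7.000, 8.002]
bound = 1e-3
t1 = merkle_tree(leaf_hashes(run1, bound))
t2 = merkle_tree(leaf_hashes(run2, bound))
print(first_divergent_leaf(t1, t2))  # → 1 (only the second block diverges)
```

The quantization step is what makes the comparison error-bounded rather than bitwise: sub-bound floating-point noise between runs produces identical hashes, while the tree structure localizes any real divergence to a specific checkpoint block without loading both runs' full data.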



Published In
    Middleware '24: Proceedings of the 25th International Middleware Conference
    December 2024, 515 pages
    ISBN: 9798400706233
    DOI: 10.1145/3652892
    This work is licensed under a Creative Commons Attribution 4.0 International License.

    In-Cooperation

    • IFIP
    • USENIX

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. results reproducibility
    2. checkpoint analysis
    3. high-performance computing
    4. error-bounded hashing

    Qualifiers

    • Research-article

    Conference

    Middleware '24
    Middleware '24: 25th International Middleware Conference
    December 2 - 6, 2024
    Hong Kong, Hong Kong

    Acceptance Rates

    Overall Acceptance Rate 203 of 948 submissions, 21%
