Multi-versioning Performance Opportunities in BGAS System for Resilience

Dun, Nan; Pleiter, Dirk; Fang, Aiman; Vandenbergen, Nicolas; Chien, Andrew A.

doi:10.1007/978-3-319-41321-1_25

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 9697)

Included in the following conference series:

International Conference on High Performance Computing

Abstract

Resilience has become a major concern in high-performance computing (HPC) systems. Addressing the increasing risk of latent errors (or silent data corruption) is one of the biggest challenges. Multi-version checkpointing, which keeps multiple versions of the application state, has been proposed as a solution and has been implemented in Global View Resilience (GVR). The resulting more sophisticated data management introduces overheads, and the impact on performance needs to be investigated. In this paper we explore the performance of GVR on an HPC system with integrated non-volatile memory, namely Blue Gene Active Storage (BGAS). Our empirical study shows that the BGAS system provides a significantly more efficient basis for flexible error recovery using GVR multi-versioning features than a standard external storage system attached to the same Blue Gene/Q installation. In particular, BGAS achieves at least a $10\times$ performance boost for random traversal across multiple versions, owing to its significantly better performance for small random I/O operations.
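To illustrate why keeping several versions helps with latent errors, the following is a minimal sketch, not the GVR API: the class `VersionStore`, its methods, and the validity check are all hypothetical names introduced here. It assumes an error that corrupts the state silently at one step and is only detected later, so the most recent checkpoints are already corrupted and recovery has to walk back through older versions.

```python
from copy import deepcopy


class VersionStore:
    """Toy in-memory multi-version checkpoint store (hypothetical, not GVR)."""

    def __init__(self, max_versions=4):
        self.max_versions = max_versions
        self._versions = []  # (step, state) pairs, oldest first

    def checkpoint(self, step, state):
        # Keep an independent copy so later mutations do not alter the version.
        self._versions.append((step, deepcopy(state)))
        if len(self._versions) > self.max_versions:
            self._versions.pop(0)  # discard the oldest version

    def restore_latest_valid(self, is_valid):
        # Traverse versions from newest to oldest until one passes the check.
        for step, state in reversed(self._versions):
            if is_valid(state):
                return step, deepcopy(state)
        raise RuntimeError("no uncorrupted version retained")


def run(max_versions):
    store = VersionStore(max_versions=max_versions)
    state = {"x": 0}
    for step in range(1, 7):
        state["x"] += 1
        if step == 4:
            state["x"] = -100  # assumed latent error: silent corruption
        store.checkpoint(step, state)
    # The error is detected only after step 6; valid states are non-negative.
    return store.restore_latest_valid(lambda s: s["x"] >= 0)


if __name__ == "__main__":
    print(run(max_versions=4))
```

With four retained versions the sketch falls back to the checkpoint taken at step 3, the last one before the corruption. With `max_versions=1` the only retained checkpoint is already corrupted and recovery fails, which is the situation multi-versioning is meant to avoid. The backward walk over versions is also a simple instance of the multi-version traversal whose I/O cost the abstract discusses.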

Acknowledgments

This work was supported by the Office of Advanced Scientific Computing Research, Office of Science, U.S. Department of Energy, under Award DE-SC0008603 and Contract DE-AC02-06CH11357. This work was completed in part with resources provided by Jülich Supercomputing Centre, and we would like to thank in particular Michael Stephan for his support. We gratefully acknowledge the collaboration with IBM Research on the BGAS architecture in the context of the Exascale Innovation Center (EIC). In particular, we want to thank Blake Fitch for his continuous support and for many helpful discussions. Part of the work has been done within the Joint Laboratory for Extreme Scale Computing (JLESC).

Author information

Authors and Affiliations

Department of Computer Science, University of Chicago, Chicago, USA
Nan Dun, Aiman Fang & Andrew A. Chien
Jülich Research Centre, JSC, 52425, Jülich, Germany
Dirk Pleiter & Nicolas Vandenbergen

Corresponding author

Correspondence to Dirk Pleiter.

Editor information

Editors and Affiliations

Deutsches Klimarechenzentrum, Hamburg, Germany
Julian M. Kunkel
Argonne National Laboratory, Lemont, Illinois, USA
Pavan Balaji
University of Tennessee, Knoxville, Tennessee, USA
Jack Dongarra

About this paper

Cite this paper

Dun, N., Pleiter, D., Fang, A., Vandenbergen, N., Chien, A.A. (2016). Multi-versioning Performance Opportunities in BGAS System for Resilience. In: Kunkel, J., Balaji, P., Dongarra, J. (eds) High Performance Computing. ISC High Performance 2016. Lecture Notes in Computer Science, vol 9697. Springer, Cham. https://doi.org/10.1007/978-3-319-41321-1_25

DOI: https://doi.org/10.1007/978-3-319-41321-1_25
Published: 15 June 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-41320-4
Online ISBN: 978-3-319-41321-1
eBook Packages: Computer Science (R0)
