Abstract
Resilience has become a major concern in high-performance computing (HPC) systems. Addressing the increasing risk of latent errors (or silent data corruption) is one of the biggest challenges. Multi-version checkpointing system, which keeps multi-version of the application states, has been proposed as a solution and has been implemented in Global View Resilience (GVR). The resulting more sophisticated management of data introduces overheads and the resulting impact on performance need to be investigated. In this paper we explore the performance of GVR for an HPC system with integrated non-volatile memories, namely Blue Gene Active Storage (BGAS). Our empirical study shows that the BGAS system provides a significantly more efficient basis for flexible error recovery by using GVR multi-versioning features compared to using a standard external storage system attached to the same Blue Gene/Q installation. Using BGAS especially achieves at least \(10\times \) performance boost for random traversal across multiple versions due to significantly better performance for small random I/O operations.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Similar content being viewed by others
References
IOR benchmark. http://ior-sio.sourceforge.net
Scalable checkpoint/restart (SCR) library. https://github.com/hpc/scr
Summit compute system. https://www.olcf.ornl.gov/summit/
Antypas, K., Wright, N., Cardo, N.P., Andrews, A., Cordery, M.: Cori: a Cray XC pre-exascale system for NERSC. In: Cray User Group Proceedings. Cray (2014)
Bariuso, R., Knies, A.: SHMEM user’s guide for C. Cray Research, Inc. (1994)
Bautista-Gomez, L., Tsuboi, S., Komatitsch, D., Cappello, F., Maruyama, N., Matsuoka, S.: FTI: high performance fault tolerance interface for hybrid systems. In: Proceedings of the 2011 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2011 (2011)
Bent, J., Grider, G., Kettering, B., Manzanares, A., McClelland, M., Torres, A., Torrez, A.: Storage challenges at Los Alamos National Lab. In: IEEE 28th Symposium on Mass Storage Systems and Technologies, pp. 1–5, April 2012
Bergman, K., Borkar, S., Campbell, D., Carlson, W., Dally, W., Denneau, M., Franzon, P., Harrod, W., Hiller, J., Karp, S., Keckler, S., Klein, D., Lucas, R., Richards, M., Scarpelli, A., Scott, S., Snavely, A., Sterling, T., Williams, R.S., Yelick, K.: Exascale computing study: technology challenges in achieving exascale systems. Technical report DARPA IPTO (2008)
Borkar, S., Chien, A.A.: The future of microprocessors. Commun. ACM 54, 67–77 (2011)
Brown, D.L., Messina, P., Keyes, D., Morrison, J., Lucas, R., Shalf, J., Beckman, P., Brightwell, R., Geist, A., Vetter, J., et al.: Scientific grand challenges: crosscutting technologies for computing at the exascale. Office of Science, U.S. Department of Energy, pp. 2–4, February 2010
Cappello, F.: Fault tolerance in petascale/exascale systems: current knowledge, challenges and research opportunities. Int. J. High Perform. Comput. Appl. 23(3), 212–226 (2009)
Cappello, F., Casanova, H., Robert, Y.: Preventive migration vs. preventive checkpointing for extreme scale supercomputers. Parallel Process. Lett. 21(02), 111–132 (2011)
Carns, P., Latham, R., Ross, R., Iskra, K., Lang, S., Riley, K.: 24/7 characterization of petascale I/O workloads. In: IEEE International Conference on Cluster Computing and Workshops, pp. 1–10, August 2009
Chien, A.A., Balaji, P., Beckman, P., Dun, N., Fang, A., Fujita, H., Iskra, K., Rubenstein, Z., Zheng, Z., Schreiber, R., Hammond, J., Dinan, J., Laguna, I., Dubey, A., Hoemmen, M., Heroux, M., Teranishi, K., Siegel, A.: Versioned distributed arrays for resilience in scientific applications: global view resilience. In: Proceedings of International Conference on Computational Science (2015)
Daly, J.T.: A higher order estimate of the optimum checkpoint interval for restart dumps. Future Gener. Comput. Syst. 22(3) (2006)
Dong, X., Muralimanohar, N., Jouppi, N., Kaufmann, R., Xie, Y.: Leveraging 3D PCRAM technologies to reduce checkpoint overhead for future exascale systems. In: Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, SC 2009, pp. 57:1–57:12 (2009)
Dun, N., Fujita, H., Tramm, J., Chien, A.A., Siegel, A.R.: Data decomposition in Monte Carlo particle transport simulations using global view arrays. Int. J. High Perform. Comput. Appl. March 2015
Egwutuoha, I.P., Levy, D., Selic, B., Chen, S.: A survey of fault tolerance mechanisms and checkpoint/restart implementations for high performance computing systems. J. Supercomput. 65(3), 1302–1326 (2013)
Fang, A., Chien, A.A.: How much SSD is useful for resilience in supercomputers. In: Proceedings of the 5th Workshop on Fault Tolerance for HPC at eXtreme Scale (2015)
Ferreira, K., Stearley, J., Laros III, J.H., Oldfield, R., Pedretti, K., Brightwell, R., Riesen, R., Bridges, P.G., Arnold, D.: Evaluating the viability of process replication reliability for exascale systems. In: Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (2011)
Fiala, D., Mueller, F., Engelmann, C., Riesen, R., Ferreira, K., Brightwell, R.: Detection and correction of silent data corruption for large-scale high-performance computing. In: Proceedings of 2012 International Conference on High Performance Computing, Networking, Storage and Analysis, pp. 78:1–78:12 (2012)
Fitch, B.G.: Exploring the capabilities of a massively scalable, compute-in-storage architecture (2013). http://www.hpdc.org/2013/site/files/HPDC13_Fitch_BlueGeneActiveStorage.pdf
Fitch, B.G., Rayshubskiy, A., Pitman, M.C., Ward, T.J.C., Germain, R.S.: Using the active storage fabrics model to address petascale storage challenges. In: Proceedings of the 4th Annual Workshop on Petascale Data Storage (2009)
Fujita, H., Dun, N., Rubenstein, Z.A., Chien, A.A.: Log-structured global array for efficient multi-version snapshots. In: Proceedings of 2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, pp. 281–291 (2015)
Fujita, H., Iskra, K., Balaji, P., Chien, A.A.: Empirical comparison of three versioning architectures. In: Proceedings of IEEE Cluster 2015 (2015)
Gao, S., He, B., Xu, J.: Real-time in-memory checkpointing for future hybrid memory systems. In: Proceedings of the 29th ACM on International Conference on Supercomputing, pp. 263–272 (2015)
GVR Team.: Global View Resilience (GVR) API documentation, version 1.0.1. Technical report, University of Chicago, Department of Computer Science, October 2015
Hargrove, P.H., Duell, J.C.: Berkeley lab checkpoint/restart (BLCR) for Linux clusters. J. Phys. Conf. Ser. 46, 494 (2006)
Heger, D., Shah, G.: IBM’s general parallel file system (GPFS) 1.4 for AIX. Technical report, IBM Corporation, November 2001
Huang, K.H., Abraham, J.: Algorithm-based fault tolerance for matrix operations. IEEE Trans. Comput. C–33(6), 518–528 (1984)
IBM Blue Gene Team: The IBM Blue Gene project. IBM J. Res. Dev. 57 (2013)
Jones, T., Koniges, A., Yates, R.K.: Performance of the IBM general parallel file system. In: Proceedings of 2000 IEEE International Parallel and Distributed Processing Symposium (2000)
Jülich Supercomputing Centre: BGAS user documentation. https://trac.version.fz-juelich.de/EIC/wiki/bgas-user
Jülich Supercomputing Centre: Blue Gene Active Storage boosts I/O performance at JSC. http://www.fz-juelich.de/SharedDocs/Pressemitteilungen/UK/EN/2013/13-11-18bgas.html
Kulkarni, A., Manzanares, A., Ionkov, L., Lang, M., Lumsdaine, A.: The design and implementation of a multi-level content-addressable checkpoint file system. In: 2012 19th International Conference on High Performance Computing, pp. 1–10, December 2012
Li, D., Vetter, J.S., Marin, G., McCurdy, C., Cira, C., Liu, Z., Yu, W.: Identifying opportunities for byte-addressable non-volatile memory in extreme-scale scientific applications. In: Proceedings of the 2012 IEEE 26th International Parallel and Distributed Processing Symposium, pp. 945–956 (2012)
Liu, N., Cope, J., Carns, P., Carothers, C., Ross, R., Grider, G., Crume, A., Maltzahn, C.: On the role of burst buffers in leadership-class storage systems. In: Proceedings of the 2012 IEEE Conference on Massive Data Storage (2012)
Lu, G., Zheng, Z., Chien, A.A.: When is multi-version checkpointing needed? In: Proceedings of the 3rd Workshop on Fault-Tolerance for HPC at Extreme Scale, pp. 49–56 (2013)
Martino, C.D., Kalbarczyk, Z., Iyer, R.K., Baccanico, F., Fullop, J., Kramer, W.: Lessons learned from the analysis of system failures at petascale: the case of Blue Waters. In: Proceedings of the 2014 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, pp. 610–621 (2014)
Metzler, B., Trivedi, A.: Prototyping byte-addressable NVM access. In: Proceedings of 11th OpenFabrics Developers Workshop (2015)
Moody, A., Bronevetsky, G., Mohror, K., de Supinski, B.R.: Design, modeling, and evaluation of a scalable multi-level checkpointing system. In: Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1–11 (2010)
Nieplocha, J., Palmer, B., Tipparaju, V., Krishnan, M., Trease, H., Aprà, E.: Advances, applications and performance of the global arrays shared memory programming toolkit. Int. J. High Perform. Comput. Appl. 20(2), 203–231 (2006)
Numrich, R.W., Reid, J.: Co-Array Fortran for parallel programming. SIGPLAN Fortran Forum 17(2) (1998)
Ouyang, X., et al.: Enhancing checkpoint performance with staging I/O and SSD. In: Proceedings of 2010 International Workshop on Storage Network Architecture and Parallel I/Os, May 2010
Romano, P.K., Forget, B.: The OpenMC Monte Carlo particle transport code. Ann. Nucl. Energy 51, 274–281 (2013)
Sato, K., Mohror, K., Moody, A., Gamblin, T., de Supinski, B.R., Maruyama, N., Matsuoka, S.: A user-level InfiniBand-based file system and checkpoint strategy for burst buffers. In: Proceedings of 2014 14th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (2014)
Schlichting, R.D., Schneider, F.B.: Fail-stop processors: an approach to designing fault-tolerant computing systems. ACM Trans. Comput. Syst. 1(3), 222–238 (1983)
Schroeder, B., Gibson, G.A.: A large-scale study of failures in high-performance computing systems. In: Proceedings of 2006 IEEE/IFIP International Conference on Dependable Systems and Networks (2006)
Shantharam, M., Srinivasmurthy, S., Raghavan, P.: Characterizing the impact of soft errors on iterative methods in scientific computing. In: Proceedings of Supercomputing (2011)
Young, J.W.: A first order approximation to the optimum checkpoint interval. Commun. ACM 17(9) (1974)
Zheng, Z., Yu, L., Tang, W., Lan, Z., Gupta, R., Desai, N., Coghlan, S., Buettner, D.: Co-analysis of RAS log and job log on Blue Gene/P. In: Proceedings of 2011 IEEE International Parallel and Distributed Processing Symposium (2011)
Zhou, M., Du, Y., Childers, B.R., Melhem, R., Mosse, D.: Writeback-aware bandwidth partitioning for multi-core systems with PCM. In: Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques, pp. 113–122 (2013)
Acknowledgments
This work was supported by the Office of Advanced Scientific Computing Research, Office of Science, the U.S. Department of Energy, under Award DE-SC0008603 and Contract DE-AC02-06CH11357. This work was completed in part with resources provided by Jülich Supercomputing Centre and we would like to thank in particular Michael Stephan for his support. We gracefully acknowledge the collaboration with IBM Research on the BGAS architecture in the context of the Exascale Innovation Center (EIC). In particular, we want to thank Blake Fitch for his continuous support and for many helpful discussions. Part of the work has been done within the Joint Laboratory for Extreme Scale Computing (JLESC).
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2016 Springer International Publishing Switzerland
About this paper
Cite this paper
Dun, N., Pleiter, D., Fang, A., Vandenbergen, N., Chien, A.A. (2016). Multi-versioning Performance Opportunities in BGAS System for Resilience. In: Kunkel, J., Balaji, P., Dongarra, J. (eds) High Performance Computing. ISC High Performance 2016. Lecture Notes in Computer Science(), vol 9697. Springer, Cham. https://doi.org/10.1007/978-3-319-41321-1_25
Download citation
DOI: https://doi.org/10.1007/978-3-319-41321-1_25
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-41320-4
Online ISBN: 978-3-319-41321-1
eBook Packages: Computer ScienceComputer Science (R0)