DOI: 10.1145/2966884.2966903

The MIG Framework: Enabling Transparent Process Migration in Open MPI

Published: 25 September 2016

Abstract

This paper introduces the mig framework: an Open MPI extension that transparently supports the migration of application processes across the nodes of a distributed High-Performance Computing (HPC) system. The framework provides mechanisms on top of which suitable resource managers can implement policies to react to hardware faults, address performance variability, improve resource utilization, perform fine-grained load balancing, and carry out power and thermal management.
Compared to other state-of-the-art approaches, the mig framework does not require changes to the application code. Moreover, it is highly maintainable, since it is largely a self-contained solution that required very few changes to existing Open MPI frameworks. Experimental results show that the proposed extension does not introduce significant overhead in application execution, while the penalty of performing a migration can be properly taken into account by a resource manager.


Cited By

  • (2023) Software Fault Tolerance in Real-Time Systems: Identifying the Future Research Questions. ACM Computing Surveys, 55(14s), 1-30. DOI: 10.1145/3589950. Online publication date: 17-Jul-2023.


Published In

EuroMPI '16: Proceedings of the 23rd European MPI Users' Group Meeting
September 2016
225 pages
ISBN:9781450342346
DOI:10.1145/2966884

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Distributed Systems
  2. HPC
  3. MPI
  4. Message Passing
  5. Open MPI
  6. Process Migration
  7. Runtime
  8. Scheduling

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

EuroMPI 2016: The 23rd European MPI Users' Group Meeting
September 25 - 28, 2016
Edinburgh, United Kingdom

Acceptance Rates

Overall Acceptance Rate 66 of 139 submissions, 47%

