DOI: 10.1145/2966884.2966903

The MIG Framework: Enabling Transparent Process Migration in Open MPI

Published: 25 September 2016

Abstract

This paper introduces the mig framework: an Open MPI extension that transparently supports the migration of application processes across the nodes of a distributed High-Performance Computing (HPC) system. The framework provides mechanisms on top of which suitable resource managers can implement policies to react to hardware faults, address performance variability, improve resource utilization, perform fine-grained load balancing, and carry out power and thermal management.
Compared to other state-of-the-art approaches, the mig framework does not require changes to the application code. Moreover, it is highly maintainable, since it is largely a self-contained solution that required very few changes to existing Open MPI frameworks. Experimental results show that the proposed extension does not introduce significant overhead in application execution, while the penalty of performing a migration can be properly taken into account by a resource manager.


Cited By

  • (2023) Software Fault Tolerance in Real-Time Systems: Identifying the Future Research Questions. ACM Computing Surveys, 55(14s), 1-30. DOI: 10.1145/3589950. Online publication date: 17-Jul-2023.


Published In

EuroMPI '16: Proceedings of the 23rd European MPI Users' Group Meeting
September 2016
225 pages
ISBN:9781450342346
DOI:10.1145/2966884

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Distributed Systems
  2. HPC
  3. MPI
  4. Message Passing
  5. Open MPI
  6. Process Migration
  7. Runtime
  8. Scheduling

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

EuroMPI 2016: The 23rd European MPI Users' Group Meeting
September 25 - 28, 2016
Edinburgh, United Kingdom

Acceptance Rates

Overall Acceptance Rate 66 of 139 submissions, 47%

