DOI: 10.1145/2600212.2600713

MIC-Check: a distributed checkpointing framework for the Intel Many Integrated Cores architecture

Published: 23 June 2014

Abstract

The advent of many-core architectures like Intel MIC is enabling the design of increasingly capable supercomputers within reasonable power budgets. Fault tolerance is becoming more important as the number of components and the complexity of these heterogeneous clusters increase. Checkpoint-restart mechanisms have traditionally been used to enhance the dependability of applications and to enable dynamic task rescheduling in the face of system failures. Naive checkpointing protocols, which are predominantly I/O-intensive, face severe performance bottlenecks on the Xeon Phi architecture due to several inherent and acquired limitations. Consequently, existing checkpointing frameworks cannot serve distributed MPI applications that leverage heterogeneous hardware architectures. This paper discusses the I/O limitations of the Xeon Phi system and describes the architecture and design of MIC-Check, a novel distributed checkpointing framework for HPC applications running on it.

Published In

HPDC '14: Proceedings of the 23rd international symposium on High-performance parallel and distributed computing
June 2014
334 pages
ISBN: 9781450327497
DOI: 10.1145/2600212
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. checkpointing
  2. I/O
  3. many integrated cores
  4. MPI
  5. Xeon Phi

Qualifiers

  • Research-article

Conference

HPDC'14

Acceptance Rates

HPDC '14 Paper Acceptance Rate 21 of 130 submissions, 16%;
Overall Acceptance Rate 166 of 966 submissions, 17%

Cited By

  • (2023) PreFlush: Lightweight Hardware Prediction Mechanism for Cache Line Flush and Writeback. 2023 32nd International Conference on Parallel Architectures and Compilation Techniques (PACT), pp. 74-85. DOI: 10.1109/PACT58117.2023.00015. Online publication date: 21-Oct-2023
  • (2021) BBB: Simplifying Persistent Programming using Battery-Backed Buffers. 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA), pp. 111-124. DOI: 10.1109/HPCA51647.2021.00019. Online publication date: Feb-2021
  • (2020) Checkpoint Restart Support for Heterogeneous HPC Applications. 2020 20th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (CCGRID), pp. 242-251. DOI: 10.1109/CCGrid49817.2020.00-69. Online publication date: May-2020
  • (2019) Efficient Checkpointing with Recompute Scheme for Non-volatile Main Memory. ACM Transactions on Architecture and Code Optimization 16:2, pp. 1-27. DOI: 10.1145/3323091. Online publication date: 29-May-2019
  • (2018) Hardware supported permission checks on persistent objects for performance and programmability. Proceedings of the 45th Annual International Symposium on Computer Architecture, pp. 466-478. DOI: 10.1109/ISCA.2018.00046. Online publication date: 2-Jun-2018
  • (2017) Hiding the Long Latency of Persist Barriers Using Speculative Execution. ACM SIGARCH Computer Architecture News 45:2, pp. 175-186. DOI: 10.1145/3140659.3080240. Online publication date: 24-Jun-2017
  • (2017) Proteus. Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 178-190. DOI: 10.1145/3123939.3124539. Online publication date: 14-Oct-2017
  • (2017) Hardware supported persistent object address translation. Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 800-812. DOI: 10.1145/3123939.3123981. Online publication date: 14-Oct-2017
  • (2017) Hiding the Long Latency of Persist Barriers Using Speculative Execution. Proceedings of the 44th Annual International Symposium on Computer Architecture, pp. 175-186. DOI: 10.1145/3079856.3080240. Online publication date: 24-Jun-2017
  • (2017) Toward fault-tolerant hybrid programming over large-scale heterogeneous clusters via checkpointing/restart optimization. The Journal of Supercomputing. DOI: 10.1007/s11227-017-2116-5. Online publication date: 20-Aug-2017
