Application-Transparent Checkpoint/Restart for MPI Programs over InfiniBand
Qi Gao Weikuan Yu Wei Huang Dhabaleswar K. Panda
Network-Based Computing Laboratory
Department of Computer Science and Engineering
The Ohio State University, Columbus, OH 43210
{gaoq, yuw, huanwei, panda}@cse.ohio-state.edu

Abstract

Ultra-scale computer clusters with high-speed interconnects, such as InfiniBand, are being widely deployed for their excellent performance and cost effectiveness. However, the failure rate on these clusters also increases along with their augmented number of components. Thus, it becomes critical for such systems to be equipped with fault tolerance support. In this paper, we present our design and implementation of a checkpoint/restart framework for MPI programs running over InfiniBand clusters. Our design enables low-overhead, application-transparent checkpointing. It uses a coordinated protocol to save the current state of the whole MPI job to reliable storage, which allows users to perform rollback recovery if the system later runs into a faulty state. Our solution has been incorporated into MVAPICH2, an open-source high performance MPI-2 implementation over InfiniBand. Performance evaluation of this implementation has been carried out using the NAS benchmarks, the HPL benchmark, and a real-world application called GROMACS. Experimental results indicate that in our design the overhead to take checkpoints is low, and the performance impact of checkpointing applications periodically is insignificant. For example, the time for checkpointing GROMACS is less than 0.3% of its execution time, and its performance only decreases by 4% with checkpoints taken every minute. To the best of our knowledge, this work is the first report of checkpoint/restart support for MPI over InfiniBand clusters in the literature.

1 Introduction

High End Computing (HEC) systems are quickly gaining in speed and size. In particular, more and more computer clusters with multiple thousands of nodes have been deployed in recent years because of their low price/performance ratio.
*This research is supported in part by a DOE grant #DE-FC02-01ER25506 and NSF grants #CNS-0403342 and #CNS-0509452; grants from Intel, Mellanox, Cisco Systems, Linux Networx, and Sun Microsystems; and equipment donations from Intel, Mellanox, AMD, Apple, Appro, Dell, Microway, PathScale, IBM, SilverStorm, and Sun Microsystems.

While the failure rate of an entire system grows rapidly with the number of its components, few of these large-scale systems are equipped with built-in fault tolerance support. The applications running over these systems also tend to be more error-prone, because the failure of any single component can cascade widely to other components due to the interaction and dependence among them.

The Message Passing Interface (MPI) [21] is the de facto programming model in which parallel applications are typically written. However, it has no specification for the fault tolerance support that a particular implementation must provide. As a result, most MPI implementations are designed without fault tolerance support, providing only two working states: RUNNING or FAILED. Faults that occur during execution often abort the program, and the program then has to restart from the beginning. For long-running programs, this can waste a large amount of computing resources because all the computation that has already been accomplished is lost. To save these valuable computing resources, it is desirable that a parallel application be able to restart from some state saved before a failure occurs and continue its execution. Checkpointing and rollback recovery is thus one of the most commonly used techniques for fault recovery.

The InfiniBand Architecture (IBA) [18] has recently been standardized in industry for designing next-generation high-end clusters for both data centers and high performance computing. Large cluster systems with InfiniBand are being deployed. For example, in the Top500 list released in November 2005 [31], the 5th, 20th, and 51st most powerful supercomputers use InfiniBand as the communication interconnect for their parallel applications. These systems can have as many as 8,000 processors. It becomes critical for such large-scale systems to be deployed with checkpoint/restart support so that long-running MPI parallel programs are able to recover from failures. However, it is still an open challenge to provide checkpoint/restart support for MPI programs over InfiniBand clusters.

In this paper, we take on this challenge to enable checkpoint/restart for MPI programs over InfiniBand clusters. Based on the capability of Berkeley Lab Checkpoint/Restart (BLCR) [12] to take snapshots of processes on a single node, we design a checkpoint/restart framework to take global checkpoints of the entire MPI program while ensuring global consistency. We have implemented our design in MVAPICH2 [24], an open-source high performance MPI-2 implementation over InfiniBand that is widely used by the high performance computing community. Checkpoint/restart-capable MVAPICH2 enables low-overhead, application-transparent checkpointing for MPI applications with only insignificant performance impact. For example, the time for checkpointing GROMACS [11] is less than 0.3% of its execution time, and its performance only decreases by 4% with checkpoints taken every minute.
The rest of the paper is organized as follows: In Section 2 and Section 3, we describe the background of our work and identify the challenges involved in checkpointing InfiniBand parallel applications. In Section 4, we present our design in detail with discussions of some key design issues. In Section 5, we describe the experimental results of our current implementation. In Section 6, we discuss related work. Finally, we provide our conclusions and describe future work in Section 7.

2 Background

2.1 InfiniBand and MVAPICH2

InfiniBand [18] is an open standard for next-generation high-speed interconnects. In addition to send/receive semantics, the native transport services, a.k.a. InfiniBand verbs, provide memory-based semantics, Remote Direct Memory Access (RDMA), for high performance interprocess communication. By directly accessing and/or modifying the contents of remote memory, RDMA operations are one-sided and do not incur CPU overhead on the remote side. Because of its high performance, InfiniBand is gaining wider deployment on high-end computing platforms [31].

Designed and implemented based on its predecessor MVAPICH [20] and on MPICH2 [1], MVAPICH2 is an open-source high performance implementation of the MPI-2 standard. MVAPICH2, along with MVAPICH, is currently being used by more than 355 organizations across the world. Currently enabling several large-scale InfiniBand clusters, MVAPICH2 includes a high performance transport device over InfiniBand which takes advantage of its RDMA capabilities.

2.2 Checkpointing and Rollback Recovery

Checkpointing and rollback recovery is one of the most commonly used techniques for failure recovery in distributed computing. A detailed comparison of various rollback recovery protocols, including both checkpointing and message logging, can be found in [13]. In our work we choose coordinated checkpointing because: (a) message logging can potentially impose considerable overhead in the environment of high-bandwidth interconnects such as InfiniBand, and (b) purely uncoordinated checkpointing is susceptible to the domino effect [26], where the dependencies between processes can force all processes to roll back to the initial state.

With respect to transparency to the user application, checkpoint/restart techniques can be divided into two categories: application-level checkpointing and system-level checkpointing. The former usually involves the user application in the checkpointing procedure. While it gains efficiency with assistance from the user application, this approach has a major drawback: the source code of the user application needs to be tailored to the checkpointing interface, which often involves a significant amount of work for each application. The latter is application-transparent, because the OS takes care of saving the state of running processes. Although it may involve more overhead, it does not need any code modification of applications. We therefore follow this approach.
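To make the one-sided semantics described in Section 2.1 concrete, the following sketch (our own illustration, not MVAPICH2 code) shows how a buffer is registered on one side and how the other side posts an RDMA write against it using the verbs interface; connection setup, completion handling, and error checking are omitted, and the function names are hypothetical.

    #include <infiniband/verbs.h>
    #include <stdint.h>
    #include <string.h>

    /* Target side: register a buffer so the HCA may access it remotely.
     * The returned MR carries the rkey that, together with the buffer
     * address, must be communicated to the peer before any RDMA access. */
    struct ibv_mr *expose_buffer(struct ibv_pd *pd, void *buf, size_t len)
    {
        return ibv_reg_mr(pd, buf, len,
                          IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE);
    }

    /* Initiator side: post a one-sided RDMA write into the peer's buffer.
     * The peer's CPU is not involved; its HCA validates remote_addr/rkey
     * against the registration made by expose_buffer(). */
    int post_rdma_write(struct ibv_qp *qp, struct ibv_mr *local_mr,
                        void *local_buf, uint32_t len,
                        uint64_t remote_addr, uint32_t rkey)
    {
        struct ibv_sge sge;
        struct ibv_send_wr wr, *bad_wr;

        memset(&sge, 0, sizeof(sge));
        sge.addr   = (uintptr_t) local_buf;
        sge.length = len;
        sge.lkey   = local_mr->lkey;

        memset(&wr, 0, sizeof(wr));
        wr.wr_id               = 1;
        wr.sg_list             = &sge;
        wr.num_sge             = 1;
        wr.opcode              = IBV_WR_RDMA_WRITE;
        wr.send_flags          = IBV_SEND_SIGNALED;
        wr.wr.rdma.remote_addr = remote_addr;
        wr.wr.rdma.rkey        = rkey;

        return ibv_post_send(qp, &wr, &bad_wr);   /* 0 on success */
    }

Note that the rkey is a property of a particular registration: if the buffer is deregistered and registered again, a different key is returned. This is at the heart of the third challenge discussed in the next section.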
3 Challenges

Most studies on checkpointing parallel applications assume that communication is based on the TCP/IP stack. Although InfiniBand also provides TCP/IP support using IP over IB (IPoIB), IPoIB does not deliver performance as good as the native InfiniBand verbs. In this section, we identify the challenges in checkpointing parallel programs that are built over native InfiniBand protocols.

First, parallel processes over InfiniBand communicate via an OS-bypass, user-level protocol. In regular TCP/IP networks, the operating system (OS) kernel handles all network activities, so these activities can be temporarily stopped in an application-transparent manner. InfiniBand, however, provides its high performance communication via OS-bypass capabilities in its user-level protocol [6]. The use of such user-level protocols has a side effect: the operating system is bypassed in the actual communication and does not maintain complete information about ongoing network activities. Because of this gap in information about communication activities between the OS kernel and the user space of the application process, it becomes difficult for the operating system to directly stop network activities and take checkpoints without losing consistency.

Second, the context of a network connection is available only in the network adapter. In regular TCP/IP networks, the network communication context is stored in kernel memory, which can be saved to checkpoints. In contrast, the InfiniBand network adapter stores the network connection context in adapter memory. This information is volatile by design and thus very difficult for a restarted process to reuse. Therefore, the network connection context has to be released before a checkpoint and rebuilt afterwards. Because InfiniBand uses a user-level protocol, some network context information, such as Queue Pairs (QPs), is also cached in user memory and must be reconstructed according to the new network connection context before a process can continue communication. Moreover, the releasing and rebuilding of network connections should be totally transparent to applications.

Third, some network connection context is even cached on the remote node. Because of their high performance, many applications take advantage of the RDMA operations provided by InfiniBand. Unlike some other RDMA-capable networks such as Myrinet [22], InfiniBand requires authentication for accessing remote memory. Before process A accesses remote memory in process B, process B must register the memory with its network adapter and then inform process A of the virtual address of the registered memory and the remote key needed to access it. Process A must cache that key and include it in its RDMA requests so that the network adapter can match the key and authorize the memory access. Since these keys become invalid when the network connection context is rebuilt, the stale keys are a potential source of inconsistency.
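To illustrate the second and third challenges, the fragment below shows, under our own naming and with error handling omitted, the kind of verbs calls that have to bracket a checkpoint: the HCA-resident objects are destroyed before the process image is saved and recreated afterwards, at which point the new QP identifiers and rkeys must be re-exchanged with the peers. This is a hedged sketch, not MVAPICH2's actual channel code.

    #include <infiniband/verbs.h>
    #include <string.h>

    /* Per-connection InfiniBand objects. They live in the HCA and its
     * driver, so they cannot simply be written into a checkpoint file. */
    struct ib_conn {
        struct ibv_context *ctx;   /* device context           */
        struct ibv_pd      *pd;    /* protection domain        */
        struct ibv_cq      *cq;    /* completion queue         */
        struct ibv_qp      *qp;    /* reliable-connection QP   */
        struct ibv_mr      *mr;    /* registered buffers       */
    };

    /* Release HCA state before the process image is saved. */
    void release_hca_state(struct ib_conn *c)
    {
        ibv_destroy_qp(c->qp);
        ibv_destroy_cq(c->cq);
        ibv_dereg_mr(c->mr);
        ibv_dealloc_pd(c->pd);
    }

    /* Rebuild HCA state after the checkpoint (or after a restart). The
     * QP number, LID, and rkey obtained here differ from the old ones,
     * so any values cached locally or by the peer must be refreshed.  */
    int rebuild_hca_state(struct ib_conn *c, void *buf, size_t len)
    {
        struct ibv_qp_init_attr attr;

        c->pd = ibv_alloc_pd(c->ctx);
        c->cq = ibv_create_cq(c->ctx, 128, NULL, NULL, 0);
        c->mr = ibv_reg_mr(c->pd, buf, len,
                           IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE);

        memset(&attr, 0, sizeof(attr));
        attr.send_cq          = c->cq;
        attr.recv_cq          = c->cq;
        attr.cap.max_send_wr  = 64;
        attr.cap.max_recv_wr  = 64;
        attr.cap.max_send_sge = 1;
        attr.cap.max_recv_sge = 1;
        attr.qp_type          = IBV_QPT_RC;
        c->qp = ibv_create_qp(c->pd, &attr);

        /* The new QP must still be transitioned to RTS and its QPN/LID
         * exchanged with the peer over an out-of-band channel.         */
        return (c->pd && c->cq && c->mr && c->qp) ? 0 : -1;
    }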
4 Checkpoint/Restart Framework and Design Issues

In this section, we present the checkpoint/restart framework for MPI over InfiniBand in detail, along with some key design issues. We focus on two questions in particular: (a) how to bring an MPI program to a state that can be consistently saved to a checkpoint, and (b) how to resume an MPI program from a checkpoint. There are three design objectives for this framework:

Consistency: the global consistency of the MPI program must be preserved.

Transparency: checkpoints must be taken transparently to the MPI application.

Responsiveness: requests for checkpointing can be issued at any point of the execution, and upon a request the checkpoint must be taken as soon as possible.

We design a protocol that coordinates all MPI processes in the MPI job to consistently and transparently suspend all InfiniBand communication channels between them and to preserve the communication channel states in the checkpoint, which provides high responsiveness.

In the remainder of this section, we start with the Checkpoint/Restart (C/R) framework, describing its components and their functionalities. Then we provide a global view of the C/R procedure by describing the overall state diagram and state transitions. Finally, we discuss some design issues from a local view to show how InfiniBand communication channels are suspended and reactivated.

4.1 Checkpoint/Restart Framework

Figure 1. Checkpoint/Restart Framework

In a cluster environment, a typical MPI job consists of a front-end MPI job console, a process manager spanning multiple nodes, and individual MPI processes running on these nodes. Multi-Purpose Daemon (MPD) [8] is the default process manager for MVAPICH2; all the MPD daemons are connected as a ring. As depicted in Figure 1, the proposed C/R framework is built upon this MPI job structure and contains five key components:

Global C/R Coordinator is part of the MPI job console and is responsible for the global management of checkpoint/restart for the whole MPI job. It can be configured to initiate checkpoints periodically and/or to handle checkpoint/restart requests from users or administrators.

Control Message Manager provides an interface between the global C/R coordinator and the local C/R controllers. It utilizes the process manager already deployed in the cluster to provide out-of-band messaging between the MPI processes and the job console. In our current implementation, we extend the functionality of MPD to support C/R control messages.

Local C/R Controller takes responsibility for the local management of C/R operations in each MPI process. Its functionality is as follows: (a) to take C/R requests from, and report results to, the global C/R coordinator; (b) to cooperate with the communication channel managers and with the C/R controllers in peer MPI processes to bring the whole MPI job to a state that can be consistently checkpointed; and (c) to invoke the C/R library to take checkpoints locally. In our current design, the C/R controller is implemented as a separate thread, which wakes up only when a checkpoint request arrives (see the sketch at the end of this subsection).

C/R Library is responsible for checkpointing/restarting the local process. Checkpointing a single process on a single node has been studied extensively, and several packages are available to the community. In our current implementation, we use the Berkeley Lab Checkpoint/Restart (BLCR) [15] package.

Communication Channel Manager controls in-band message passing. In the C/R framework, it has extended functionality for suspending/reactivating the communication channel; the temporary suspension does not impair channel consistency and is transparent to the upper layers. Currently, we implement the C/R functionality on the InfiniBand channel based on the OpenIB [25] Gen2 stack.
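As a rough illustration of how the local C/R controller thread mentioned above could be organized, the sketch below shows a thread that sleeps until the control message manager delivers a checkpoint request, locks the channel against the MPI main thread, and then walks through the coordination and local checkpointing steps. All names are ours; the calls into the channel manager and into the C/R library (BLCR) are represented by placeholder functions.

    #include <pthread.h>

    extern pthread_mutex_t channel_lock;   /* keeps the main thread out of
                                            * the IB channel during C/R   */
    extern pthread_mutex_t req_lock;
    extern pthread_cond_t  req_cond;
    extern int             ckpt_requested; /* set by control msg manager  */

    extern void suspend_channels(void);      /* pre-checkpoint coordination */
    extern void take_local_checkpoint(void); /* invoke the C/R library      */
    extern void reactivate_channels(void);   /* post-checkpoint coordination*/
    extern void report_to_coordinator(int ok);

    void *cr_controller_thread(void *arg)
    {
        (void) arg;
        for (;;) {
            /* Sleep until the global C/R coordinator sends a request. */
            pthread_mutex_lock(&req_lock);
            while (!ckpt_requested)
                pthread_cond_wait(&req_cond, &req_lock);
            ckpt_requested = 0;
            pthread_mutex_unlock(&req_lock);

            /* Initial synchronization: lock the communication channels. */
            pthread_mutex_lock(&channel_lock);

            suspend_channels();        /* drain and release connections */
            take_local_checkpoint();   /* e.g., via BLCR                */
            reactivate_channels();     /* rebuild connections, fix keys */

            pthread_mutex_unlock(&channel_lock);
            report_to_coordinator(1);
        }
        return NULL;
    }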
4.2 Overall Checkpoint/Restart Procedure

Figure 2. State Diagram for Checkpoint/Restart

Figure 2 shows the state diagram of our checkpoint/restart framework. During a normal run, the job goes through the checkpointing cycle upon user requests or periodically. The cycle consists of four phases:

Initial Synchronization Phase: All processes in the MPI job synchronize with each other and prepare for pre-checkpoint coordination. First, the global C/R coordinator in the job console propagates a checkpoint request to all local C/R controllers running in the individual MPI processes. Then, upon arrival of the request, each local C/R controller wakes up and locks the communication channels to prevent the main thread from accessing them during the checkpointing procedure.

Pre-checkpoint Coordination Phase: The C/R controllers coordinate with each other to make all MPI processes individually checkpointable. To do so, they cooperate with the communication channel managers to suspend all communication channels temporarily and release the network connections in these channels.

Local Checkpointing Phase: The C/R controllers invoke the C/R library to save the current state of the local MPI process, including the state of the suspended communication channels, to a checkpoint file.

Post-checkpoint Coordination Phase: The C/R controllers cooperate with the communication channel managers to reactivate the communication channels. This step involves rebuilding the low-level network connections and resolving the possible inconsistency introduced by the potentially different network connection information. How communication channels are suspended and reactivated consistently and transparently is discussed in Section 4.3.

To restart from a checkpoint, a restarting procedure is performed, which consists of the restarting phase and the post-checkpoint coordination phase. In the Restarting Phase, the global C/R coordinator first propagates the restart request to each node, where the C/R library is responsible for restarting the local MPI process from the checkpoint file. Then, the local C/R controller reestablishes the connection between the MPI process and its process manager and performs the necessary coordination between them. At this point, the MPI job has been restored to a state identical to its previous state in the local checkpointing phase. Therefore, to continue running, it first goes to the post-checkpoint coordination phase, and once all communication channels are reactivated, it returns to the running state. A sketch of this cycle from the coordinator's point of view is given below.
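The sketch below gives one possible shape of this cycle as seen from the job console. The out-of-band messaging helpers and message names stand in for the extended MPD functionality and are purely illustrative; they are not actual MVAPICH2 or MPD interfaces.

    /* Hypothetical view of one checkpointing cycle as driven by the
     * global C/R coordinator in the MPI job console. */

    enum cr_msg {
        CR_CKPT_REQUEST,   /* wake every local C/R controller            */
        CR_CKPT_DONE,      /* a process finished its local checkpoint    */
        CR_RESUME,         /* proceed with post-checkpoint coordination  */
        CR_RUNNING         /* a process is back in the running state     */
    };

    /* Illustrative out-of-band messaging over the MPD ring. */
    extern void cr_broadcast(enum cr_msg m);
    extern void cr_wait_all(enum cr_msg m, int nprocs);

    void checkpoint_whole_job(int nprocs)
    {
        /* Initial synchronization: propagate the checkpoint request.   */
        cr_broadcast(CR_CKPT_REQUEST);

        /* Pre-checkpoint coordination and local checkpointing happen
         * inside the processes; wait until every image is written.     */
        cr_wait_all(CR_CKPT_DONE, nprocs);

        /* Post-checkpoint coordination: channels are rebuilt and stale
         * remote keys refreshed, after which the job keeps running.    */
        cr_broadcast(CR_RESUME);
        cr_wait_all(CR_RUNNING, nprocs);
    }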
4.3 Suspension/Reactivation of the InfiniBand Channel

Figure 3. InfiniBand Channel for MVAPICH2

During the checkpoint/restart procedure described in the previous section, consistency and transparency are the two key requirements. In this section, we explain how we transparently suspend/reactivate the InfiniBand communication channel while preserving channel consistency.

The structure of the InfiniBand communication channel in MVAPICH2 is shown in Figure 3. Below the MVAPICH2 InfiniBand channel is the InfiniBand Host Channel Adapter (HCA), which maintains the network connection context, such as Queue Pairs (QPs), Completion Queues (CQs), Memory Regions (MRs), and Protection Domains (PDs). The MVAPICH2 InfiniBand channel state consists of four parts:

Network connection information is the set of user-level data structures corresponding to the network connection context.

Dedicated communication buffers are the registered buffers that can be directly accessed by the HCA for sending/receiving small messages.

Channel progress information is the set of data structures used for bookkeeping and flow control, such as pending requests, credits, etc.

Registered user buffers are memory allocated by the user application. These buffers are registered with the HCA by the communication channel for zero-copy transmission of large messages.

These four parts need to be handled differently, according to their different natures, when performing checkpointing. Network connection information needs to be cleaned before checkpointing and re-initialized afterwards, as the network connection context in the HCA is released and rebuilt. Dedicated communication buffers and channel progress information need to be mostly kept the same but partially updated, because they are closely coupled with the network connection information. Registered user buffers need to be re-registered, but their contents must be preserved in full.

We now explain the protocol for suspending and reactivating a communication channel, including a discussion of some design issues.

In the pre-checkpoint coordination phase, to suspend the communication channels, the channel managers first drain all in-transit messages, because these messages would otherwise be lost when the network context is released. The protocol must therefore guarantee that, relative to a certain point, all messages before this point have been delivered and no message after this point has been posted to the network. Two things should be noted here: (a) the word "messages" refers to network-level messages rather than MPI-level messages, and one MPI-level message may involve several network-level messages; and (b) the synchronization points for different channels do not need to correspond to the same point in time, and each channel can have its own synchronization point.

Due to the First-In-First-Out (FIFO) and reliable nature of the InfiniBand Reliable Connection (RC) based channel, draining in-transit messages can be achieved by exchanging flag messages between each pair of channel managers: each process sends flags through all the channels it has and waits to receive flags from all of these channels. Once these send and receive operations complete, all in-transit messages are known to be drained from the network, because the flag messages are the last ones in the channels. The channel manager then releases the underlying network connection. A sketch of this drain step is shown below.
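The following hedged sketch illustrates the drain step for a single RC channel, assuming a receive buffer has already been posted for the peer's flag; the flag encoding and the treatment of the ordinary messages polled along the way (discussed next) are only indicated, and the names are ours rather than MVAPICH2 internals.

    #include <infiniband/verbs.h>
    #include <stdint.h>
    #include <string.h>

    #define FLAG_WRID 0xC0DEULL      /* marks our own flag send         */

    /* Illustrative helper: inspects a receive completion and reports
     * whether the arrived message is the peer's flag message.          */
    extern int is_flag_message(struct ibv_wc *wc);

    static int post_flag_send(struct ibv_qp *qp, struct ibv_mr *mr,
                              char *flag_buf)
    {
        struct ibv_sge sge;
        struct ibv_send_wr wr, *bad_wr;

        memset(&sge, 0, sizeof(sge));
        sge.addr   = (uintptr_t) flag_buf;
        sge.length = 1;
        sge.lkey   = mr->lkey;

        memset(&wr, 0, sizeof(wr));
        wr.wr_id      = FLAG_WRID;
        wr.sg_list    = &sge;
        wr.num_sge    = 1;
        wr.opcode     = IBV_WR_SEND;
        wr.send_flags = IBV_SEND_SIGNALED;
        return ibv_post_send(qp, &wr, &bad_wr);
    }

    /* Returns once our flag has left and the peer's flag has arrived;
     * because the RC channel is FIFO and reliable, no earlier message
     * can still be in flight at that point.  Ordinary messages polled
     * in the meantime are handed to the caller, which treats them as
     * described in the text below.                                     */
    static void drain_channel(struct ibv_cq *cq, struct ibv_qp *qp,
                              struct ibv_mr *mr, char *flag_buf,
                              void (*handle_drained)(struct ibv_wc *))
    {
        int flag_sent = 0, flag_rcvd = 0;
        struct ibv_wc wc;

        post_flag_send(qp, mr, flag_buf);
        while (!flag_sent || !flag_rcvd) {
            if (ibv_poll_cq(cq, 1, &wc) <= 0)
                continue;                     /* nothing completed yet  */
            if (wc.status != IBV_WC_SUCCESS)
                continue;                     /* error handling elided  */
            if (wc.opcode == IBV_WC_SEND && wc.wr_id == FLAG_WRID)
                flag_sent = 1;
            else if (wc.opcode == IBV_WC_RECV && is_flag_message(&wc))
                flag_rcvd = 1;
            else
                handle_drained(&wc);          /* drained in-transit msg */
        }
    }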
One issue is when the channel manager should handle the drained in-transit messages. Because the communication channel is designed to execute the transmission protocol chosen by the upper layers of the MPI library, processing an incoming message may cause a response message to be sent, which would also need to be drained and, in the worst case, may lead to an infinite ping-pong livelock. To avoid this, the channel manager has to either buffer the drained messages for future processing, or process these messages but buffer the response messages instead of sending them immediately. We choose the latter approach because: (a) some control messages need to be processed immediately, and these control messages do not lead to any response message; and (b) the overhead of buffering is lower, as the number of response messages is generally smaller than the number of incoming messages.

In the post-checkpoint coordination phase, after rebuilding the underlying network connections, the channel manager first updates the local communication channel as described above, and then sends control messages to update the other side of the channel. This remote update resolves the potential inconsistency introduced by invalid remote keys for RDMA operations, the issue discussed in Section 3. For example, for performance reasons, the rendezvous protocol for transmitting large messages is implemented with RDMA write operations. To achieve high responsiveness and transparency, our design allows the rendezvous protocol to be interrupted by checkpointing. The remote keys cached on the sender side for the RDMA write therefore become invalid because of the re-registration on the receiver side. Hence, the channel manager on the receiver side needs to capture the refreshed remote keys and send them to the sender side, as sketched below.
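A minimal sketch of this remote-key refresh follows; the message layout and names are hypothetical and stand in for MVAPICH2's actual control messages, but they capture the bookkeeping involved: the receiver re-registers the user buffer, ships the new (address, rkey) pair, and the sender patches its pending rendezvous descriptor before resuming the RDMA write.

    #include <stdint.h>

    /* Hypothetical control message sent by the receiver side after it
     * re-registers the user buffer in the post-checkpoint phase.       */
    struct rkey_update_msg {
        uint32_t rndv_id;      /* identifies the interrupted rendezvous */
        uint64_t new_addr;     /* buffer address after re-registration  */
        uint32_t new_rkey;     /* fresh remote key from ibv_reg_mr()    */
    };

    /* Sender-side bookkeeping for one in-progress rendezvous transfer. */
    struct pending_rndv {
        uint32_t rndv_id;
        uint64_t remote_addr;
        uint32_t rkey;
    };

    /* Sender side: patch the cached descriptor so the RDMA write can be
     * reissued against the new registration.                           */
    void apply_rkey_update(struct pending_rndv *tbl, int n,
                           const struct rkey_update_msg *m)
    {
        for (int i = 0; i < n; i++) {
            if (tbl[i].rndv_id == m->rndv_id) {
                tbl[i].remote_addr = m->new_addr;  /* address may change */
                tbl[i].rkey        = m->new_rkey;  /* old key is stale   */
                break;
            }
        }
    }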
5 Performance Evaluation

In this section, we describe experimental results and analyze the performance of our current implementation, which is based on MVAPICH2-0.9.2. The experiments are conducted on an InfiniBand cluster of 12 nodes. Each node is equipped with dual Intel Xeon 3.4 GHz CPUs, 2 GB of memory, and a Mellanox MT25208 PCI-Express InfiniBand HCA. The operating system used is RedHat Linux AS4 with kernel 2.6.11. The file system we use is ext3 on top of a local SATA disk.

We evaluate the performance of our implementation using the NAS Parallel Benchmarks [32], the High-Performance Linpack (HPL) benchmark [2], and GROMACS [11]. First, we analyze the overhead of taking checkpoints and restarting from checkpoints, and then we show the performance impact on applications of taking checkpoints periodically.

5.1 Overhead Analysis for Checkpoint/Restart

In this section, we analyze the overhead of C/R in terms of checkpoint file size, checkpointing time, and restarting time. We choose BT, LU, and SP from the NAS Parallel Benchmarks and the HPL benchmark, because they reflect computation kernels commonly used in scientific applications.

Because checkpointing involves saving the current state of running processes to reliable storage, taking a system-level full checkpoint involves writing all used memory pages within the process address space to the checkpoint file. The checkpoint file size is therefore determined by the memory footprint of the process, in this case the MPI process. Table 1 shows the checkpoint file sizes per process for BT, LU, and SP, class C, using 8 or 9 processes.

Table 1. Checkpoint File Size per Process
  Benchmark                   lu.C.8   bt.C.9   sp.C.9
  Checkpoint File Size (MB)   126      213      193

The time for checkpointing/restarting is determined mainly by two factors: the time for coordination, which increases with the system size, and the time for writing/reading the checkpoint file to/from the file system, which depends on both the checkpoint file size and the performance of the underlying file system.

Figure 4 shows the time for checkpointing/restarting the NAS benchmarks. It also provides the access time for the checkpoint file for comparison.

Figure 4. Overall Time for Checkpointing/Restarting NAS

On our testbed, we have observed that the file access time is the dominating factor. In a real-world deployment, file writing can be designed to be non-blocking and to overlap with program execution, and incremental checkpointing techniques can also be applied to reduce the checkpoint file size. We plan to further investigate how to speed up the commitment of checkpoint files.

To further analyze the coordination overhead, we excluded the file access time and broke the coordination time down into individual phases. As shown in Figure 5, for checkpointing, post-checkpoint coordination consumes most of the time. The reason is that this step involves a relatively time-consuming component, the establishment of InfiniBand connections, which has been explored in our previous study [34]. For restarting, the post-checkpoint coordination consumes almost the same amount of time as for checkpointing, but the major part of the time is in the restarting phase, mainly spent by MPD and BLCR in spawning processes on multiple nodes.

Figure 5. Coordination Time for Checkpointing/Restarting NAS

To evaluate the scalability of our design, we measure the average coordination time for checkpointing the HPL benchmark using 2, 4, 6, 8, 10, and 12 processes, with one process on each node. In this experiment we choose the problem size such that HPL consumes around 800 MB of memory per process.

Figure 6. Coordination Time for Checkpointing HPL

To improve the scalability, we adopt the bootstrap channel technique described in [34] to reduce the InfiniBand connection establishment time from the order of O(N^2) to O(N), where N is the number of connections. As shown in Figure 6, because the dominating factor, the post-checkpoint coordination time, is O(N), the overall coordination time is also in the order of O(N). To further improve the scalability of checkpoint/restart, we plan to utilize the adaptive connection management model [33] to reduce the number of active InfiniBand connections.

Nonetheless, with the current performance, we believe our design is sufficient for checkpointing many applications, because the time for checkpointing/restarting is insignificant compared to the execution time of the applications.

5.2 Performance Impact for Checkpointing

Figure 7. Performance Impact for Checkpointing NAS

In this section, we evaluate the performance of our system in a working scenario. In the real world, periodically checkpointing MPI applications is a commonly used method to achieve fault tolerance. We conduct experiments to analyze the performance impact of taking checkpoints at different frequencies during the execution of applications. We use LU, BT, and SP from the NAS benchmarks and the HPL benchmark to simulate user applications, and we also include a real-world application, GROMACS [11], a package that performs molecular dynamics simulations for biochemical analysis.

Figure 8. Performance Impact for Checkpointing HPL
In our design, very little extra bookkeeping overhead on data communication is introduced by the C/R functionality, so our checkpoint/restart-capable MVAPICH2 has almost the same performance as the original MVAPICH2 if no checkpoint is taken.

As shown in Figure 7, the total running time of LU, BT, and SP decreases as the checkpointing interval increases. The additional execution time caused by checkpointing matches the theoretical value: the checkpointing time multiplied by the number of checkpoints. Figure 8 shows the impact on the calculated performance in GFLOPS of the HPL benchmark for 8 processes. Because these benchmarks load all data into memory at the beginning of execution, their checkpoint file sizes are relatively large. Therefore, in our experiments, the dominating part of the checkpointing overhead, the file writing time, is relatively large. But even with this overhead, the performance does not decrease much with a reasonably long checkpointing interval, for example 4 minutes.

On the other hand, many real-world applications may spend hours or even days processing many thousands of datasets in a single run. Normally only a small portion of the datasets is loaded into memory at any point in time, so the memory footprints of these applications are relatively small. The overhead of checkpointing is therefore lower with respect to their running time. To evaluate this case, we run GROMACS on the DPPC benchmark dataset [11].

Figure 9. Performance Impact for Checkpointing GROMACS

As shown in Figure 9, the time for checkpointing GROMACS is less than 0.3% of its execution time, and even if GROMACS is checkpointed every minute, the performance degradation is still only around 4%. From these experiments we can conclude that for long-running applications the performance impact of checkpointing is negligible, and that even for memory-intensive applications, with a reasonable checkpointing frequency, the performance impact is insignificant.

6 Related Work

Algorithms for coordinated checkpointing have been proposed since the mid-1980s [30, 9], and a detailed comparison of various rollback recovery protocols, including both checkpointing and message logging, can be found in [13].

Many efforts have been made to provide fault tolerance to message-passing-based parallel programs. FT-MPI [14] has extended the MPI specification to support fault tolerance at the application level; an architecture has been designed and an implementation has been made based on HARNESS [5]. LAM/MPI [28] has incorporated checkpoint/restart capabilities based on Berkeley Lab Checkpoint/Restart (BLCR) [12], providing a framework to checkpoint MPI programs running over TCP/IP networks. Another approach to achieving fault tolerance, using uncoordinated checkpointing and message logging, is studied in the MPICH-V project [7]: it uses the Condor checkpoint library [19] to checkpoint MPI processes, and designs and evaluates a variety of message logging protocols for uncoordinated checkpointing of MPI programs over TCP/IP networks. In [17], the design of a fault-tolerant MPI over Myrinet based on MPICH-GM [23] is described. Other research efforts toward fault-tolerant message passing systems include Starfish [3], CoCheck [29], LA-MPI [4], Egida [27], CLIP [10], etc. Recently, a low-overhead, kernel-level checkpointer called TICK [16] has been designed for parallel computers with incremental checkpointing support.
Our work differs from the previous related work in that we address the challenges of checkpointing MPI programs over InfiniBand, which are discussed in Section 3. Although Myrinet is another high performance interconnect similar to InfiniBand in some respects, its network API, Myrinet GM [23], follows a connectionless model, which is quite different from the commonly used InfiniBand transport service, Reliable Connection (RC). Additionally, unlike InfiniBand, the RDMA operations provided by Myrinet GM do not require remote keys for authentication. Therefore, the solution over Myrinet GM is not readily applicable to InfiniBand.

7 Conclusions and Future Work

In this paper, we have presented our design of a checkpoint/restart framework for MPI over InfiniBand. Our design enables application-transparent, coordinated checkpointing to save the state of the whole MPI program into checkpoints stored in reliable storage for future restart. We have evaluated our design using the NAS benchmarks, the HPL benchmark, and GROMACS. Experimental results indicate that our design incurs a low overhead for checkpointing, and that the performance impact of checkpointing on long-running applications is insignificant. To the best of our knowledge, this work is the first report of checkpoint/restart support for MPI over InfiniBand clusters in the literature.

In the future, we plan to work on the issues related to SMP channels and to incorporate adaptive connection management [33] to reduce the checkpointing overhead. In the longer term, we plan to investigate the management of checkpoint files and automatic recovery from failures, and then conduct a more thorough study of the overall performance of checkpointing and rollback recovery under different MTBF values.

References

[1] MPICH2, Argonne National Laboratory. http://www-unix.mcs.anl.gov/mpi/mpich2/.
[2] A. Petitet, R. C. Whaley, J. Dongarra, and A. Cleary. HPL. http://www.netlib.org/benchmark/hpl/.
[3] A. Agbaria and R. Friedman. Starfish: Fault-tolerant dynamic MPI programs on clusters of workstations. In Proceedings of the IEEE Symposium on High Performance Distributed Computing (HPDC) 1999, pages 167-176, August 1999.
[4] R. T. Aulwes, D. J. Daniel, N. N. Desai, R. L. Graham, L. D. Risinger, M. W. Sukalski, M. A. Taylor, and T. S. Woodall. Architecture of LA-MPI, a network-fault-tolerant MPI. In Proceedings of the Int'l Parallel and Distributed Processing Symposium, Santa Fe, NM, April 2004.
[5] M. Beck, J. J. Dongarra, G. E. Fagg, G. A. Geist, P. Gray, J. Kohl, M. Migliardi, K. Moore, T. Moore, P. Papadopoulous, S. L. Scott, and V. Sunderam. HARNESS: A Next Generation Distributed Virtual Machine. Future Generation Computer Systems, 15(5-6):571-582, 1999.
[6] R. A. F. Bhoedjang, T. Rühl, and H. E. Bal. User-Level Network Interface Protocols. IEEE Computer, 31(11):53-60, 1998.
[7] G. Bosilca, A. Bouteiller, F. Cappello, S. Djilali, G. Magniette, V. Néri, and A. Selikhov. MPICH-V: Toward a Scalable Fault Tolerant MPI for Volatile Nodes. In Proceedings of IEEE/ACM SC2002, Baltimore, MD, November 2002.
[8] R. Butler, W. Gropp, and E. Lusk. Components and Interfaces of a Process Management System for Parallel Programs. Parallel Computing, 27(11):1417-1429, 2001.
[9] M. Chandy and L. Lamport. Distributed Snapshots: Determining Global States of Distributed Systems. ACM Transactions on Computer Systems, 3(1), 1985.
[10] Y. Chen, K. Li, and J. S. Plank.
CLIP: A Checkpointing Tool for Message-passing Parallel Programs. In Proceedings of IEEE/ACM SC97, November 1997.
[11] D. Van Der Spoel, E. Lindahl, B. Hess, G. Groenhof, A. E. Mark, and H. J. C. Berendsen. GROMACS: Fast, flexible, and free. Journal of Computational Chemistry, 26:1701-1718, 2005.
[12] J. Duell, P. Hargrove, and E. Roman. The Design and Implementation of Berkeley Lab's Linux Checkpoint/Restart. Technical Report LBNL-54941, Berkeley Lab, 2002.
[13] E. N. Elnozahy, L. Alvisi, Y. M. Wang, and D. B. Johnson. A Survey of Rollback-Recovery Protocols in Message Passing Systems. Technical Report CMU-CS-96-181, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, 1996.
[14] G. E. Fagg, E. Gabriel, G. Bosilca, T. Angskun, Z. Chen, J. Pjesivac-Grbovic, K. London, and J. J. Dongarra. Extending the MPI Specification for Process Fault Tolerance on High Performance Computing Systems. In Proceedings of the International Supercomputer Conference (ICS), Heidelberg, Germany, 2003.
[15] Future Technologies Group (FTG). http://ftg.lbl.gov/CheckpointRestart/CheckpointRestart.shtml.
[16] R. Gioiosa, J. C. Sancho, S. Jiang, and F. Petrini. Transparent incremental checkpointing at kernel level: A foundation for fault tolerance for parallel computers. In Proceedings of ACM/IEEE SC2005, Seattle, WA, November 2005.
[17] H. Jung, D. Shin, H. Han, J. W. Kim, H. Y. Yeom, and J. Lee. Design and Implementation of Multiple Fault-Tolerant MPI over Myrinet (M^3). In Proceedings of ACM/IEEE SC2005, Seattle, WA, November 2005.
[18] InfiniBand Trade Association. http://www.infinibandta.org.
[19] M. Litzkow, T. Tannenbaum, J. Basney, and M. Livny. Checkpoint and migration of UNIX processes in the Condor distributed processing system. Technical Report UW-CS-TR-1346, University of Wisconsin - Madison, Computer Sciences Department, April 1997.
[20] J. Liu, J. Wu, S. P. Kini, P. Wyckoff, and D. K. Panda. High Performance RDMA-Based MPI Implementation over InfiniBand. In 17th Annual ACM International Conference on Supercomputing (ICS '03), June 2003.
[21] Message Passing Interface Forum. MPI: A Message-Passing Interface Standard. The International Journal of Supercomputer Applications and High Performance Computing, 1994.
[22] Myricom. http://www.myri.com.
[23] Myricom. Myrinet Software and Customer Support. http://www.myri.com/scs/, 2003.
[24] Network-Based Computing Laboratory. MVAPICH: MPI for InfiniBand. http://nowlab.cse.ohio-state.edu/projects/mpi-iba/.
[25] Open InfiniBand Alliance. http://www.openib.org.
[26] B. Randell. System structure for software fault tolerance. IEEE Transactions on Software Engineering, SE-1(2):220-232, 1975.
[27] S. Rao, L. Alvisi, and H. M. Vin. Egida: An extensible toolkit for low-overhead fault-tolerance. In Symposium on Fault-Tolerant Computing, pages 48-55, 1999.
[28] S. Sankaran, J. M. Squyres, B. Barrett, A. Lumsdaine, J. Duell, P. Hargrove, and E. Roman. The LAM/MPI Checkpoint/Restart Framework: System-Initiated Checkpointing. International Journal of High Performance Computing Applications, pages 479-493, 2005.
[29] G. Stellner. CoCheck: Checkpointing and process migration for MPI. In Proceedings of the International Parallel Processing Symposium, pages 526-531, April 1996.
[30] Y. Tamir and C. H. Sequin. Error Recovery in Multicomputers Using Global Checkpoints. In Proceedings of the Int'l Conference on Parallel Processing, pages 32-41, 1984.
[31] TOP500 Supercomputers. http://www.top500.org/.
[32] F. C. Wong, R. P. Martin, R. H.
Arpaci-Dusseau, and D. E. Culler. Architectural Requirements and Scalability of the NAS Parallel Benchmarks. In Proceedings of Supercomputing, 1999.
[33] W. Yu, Q. Gao, and D. K. Panda. Adaptive Connection Management for Scalable MPI over InfiniBand. In Proceedings of the International Parallel and Distributed Processing Symposium (IPDPS) 2006, Rhodes Island, Greece, April 2006.
[34] W. Yu, J. Wu, and D. K. Panda. Fast and Scalable Startup of MPI Programs in InfiniBand Clusters. In Proceedings of the International Conference on High Performance Computing 2004, Bangalore, India, December 2004.