Adaptive Erasure Coded Fault Tolerant Linear System Solver

Published: 09 December 2021

    Abstract

    As parallel and distributed systems scale, fault tolerance is an increasingly important problem—particularly on systems with limited I/O capacity and bandwidth. Erasure coded computations address this problem by augmenting a given problem instance with redundant data and then solving the augmented problem in a fault oblivious manner in a faulty parallel environment. In the event of faults, a computationally inexpensive procedure is used to compute the true solution from a potentially fault-prone solution. These techniques are significantly more efficient than conventional solutions to the fault tolerance problem.
    In this article, we show how we can minimize, to optimality, the overhead associated with our problem augmentation techniques for linear system solvers. Specifically, we present a technique that adaptively augments the problem only when faults are detected. At any point in execution, we only solve a system whose size is identical to the original input system. This has several advantages in terms of maintaining the size and conditioning of the system, as well as in only adding the minimal amount of computation needed to tolerate observed faults. We present, in detail, the augmentation process, the parallel formulation, and evaluation of performance of our technique. Specifically, we show that the proposed adaptive fault tolerance mechanism has minimal overhead in terms of FLOP counts with respect to the original solver executing in a non-faulty environment, has good convergence properties, and yields excellent parallel performance. We also demonstrate that our approach significantly outperforms an optimized application-level checkpointing scheme that only checkpoints needed data structures.

    1 Introduction

    Fault tolerance for parallel and distributed computations is a significant challenge, particularly at scale. The corresponding problem for storage has been effectively addressed with development of erasure codes, which augment data with redundant blocks so that in the event of an erasure (e.g., disk failure), data can be reconstituted using remaining (non-erased) coded blocks. A number of codes have been developed that minimize storage overhead, computational cost for coding and reconstitution of data, and communication in distributed environments.
    In a recent set of results [26], we have developed a novel concept called erasure coded computations that generalizes erasure codes from storage to computations. The theory underlying erasure coded computations has been developed in the context of linear system solvers and eigenvalue problems. The basic idea is to augment an input matrix with a suitable set of coded rows and columns to generate an augmented problem. This augmented problem is then solved using conventional solvers (e.g., Conjugate Gradient (CG)) on a faulty ensemble of parallel processors in a fault-oblivious manner. For instance, when a processor fails (we assume a fail stop failure), the rest of the processors simply continue with their computation. In this manner, a solution is computed for the augmented problem. In the event of a fault, a computationally inexpensive reconstruction procedure is used to recover the original solution (to the original problem instance) from the augmented solution. In the work of Zhu et al. [26], we derive a number of theoretical results relating to necessary conditions for augmentation blocks, recovery algorithm, and associated costs.
    While providing a proof of concept for erasure coded computations, past results [16] make several assumptions that are problematic in practice. First, they rely on static coding blocks, based on the assumption of a known maximum number of faults. The solver fails if the actual number of faults exceeds this bound. Second, the overhead of the coding blocks is paid irrespective of the number of errors. Stated otherwise, even if the actual number of faults is lower, the cost associated with redundant blocks corresponds to the worst case. Third, the addition of redundant blocks irrespective of erasures adds a null space into the coded matrix. This often manifests in slower convergence rates for the solver operating on the coded matrix. Finally, the coding blocks induce communication in distributed execution. This communication pattern does not follow the communication induced by the original matrix. Having a large coding block complicates parallel formulations of the solver and degrades parallel performance.
    Motivated by these shortcomings, in this article we present the next step in the development of practical and efficient erasure coded computations, which we refer to as Adaptive Erasure Coded Computations (AECC). The idea behind AECC is that coded rows and columns are added only as faults are detected. Specifically, the solver operates with no coding blocks until a fault is detected. In the event of faults, only the required number of rows and columns are added. In this manner, the solver only ever operates on a system of size $n \times n$. This has a number of desirable properties: (i) the method is computation-optimal in the sense that the system size stays the same as the input system; (ii) the convergence properties are maintained (i.e., the system is always full rank, and if the input is symmetric positive definite (SPD), the coded system is also SPD); and (iii) the coded block adds negligible computational and parallel overhead to the base solver. We argue that a combination of these properties makes AECC an ideal fault tolerant solver.
    We present the theoretical underpinnings of AECC, coding and solution reconstruction techniques, as well as parallel formulation of the AECC solver. We support all of our theoretical constructions using a parallel implementation and validate various desirable features of our AECC solver. We also present comparisons to a highly optimized checkpointing scheme, which only stores necessary data structures, at optimally tuned intervals. We show significant improvements in efficiency and scalability for our erasure coded scheme over checkpointing.
    We make the following specific contributions in the article:
    • We present a novel AECC scheme for linear system solvers that is computation-optimal with respect to base solver.
    • We derive theoretical underpinnings of AECC, including coding and reconstruction techniques.
    • We present a parallel implementation of AECC and demonstrate excellent performance in terms of convergence rates, scalability, and robustness to different fault arrival models.
    • We present a comparison with an optimized checkpointing scheme and show significant performance benefits of AECC.

    2 Background

    Techniques such as checkpoint-restart and active replicas are commonly used for fault tolerance in parallel systems. Checkpoint-restart relies on consistent checkpoints, which are often hard to find without significant rollbacks, particularly in scalable parallel programs that attempt to minimize global synchronizations. They also require significant bandwidth (to store checkpoints in persistent storage, or across multiple nodes for in-memory checkpoints [12, 18]) and available disk or extra memory capacity. Scalable systems with hundreds of thousands of cores and beyond are often constrained in all of these resources. Active replicas [22], however, are typically used in real-time systems, where worst-case execution times have to be guaranteed. In this case, critical computations are replicated and a consensus protocol detects and corrects errors. Active replicas have significant resource overheads.
    In the domain of distributed computations, faults are often handled through techniques such as deterministic replay in systems such as MapReduce [4, 10]. In such systems, the reduce step provides intermediate checkpoints. Computations (maps) are executed and monitored by a runtime system. Failed maps are rescheduled to ensure fault tolerance. Such techniques are dependent on two critical aspects: periodic reduce phases that act as consistent checkpoints (global synchronizations) so that maps are only rescheduled from the last reduce point, and committing output of reduce phases to persistent storage (typically to a redundant distributed file system). The two overheads of these schemes are also apparent. First, the makespan of jobs increases as faults increase, since the runtime system must reschedule maps, thus slowing down the entire computation. Second, the overhead of committing output of reduce phases can itself be significant.
    An alternate class of techniques, broadly classified as algorithm-based fault tolerance (ABFT), redesigns the algorithm to make it resilient to faults [1, 2, 5, 6, 7, 8, 11, 15]. As an example, for linear system solvers, a faulty solver may be viewed as a preconditioner for an outer solve, which validates the solution of the potentially faulty inner solve. Although these techniques alleviate many of the performance and resource constraints of system-supported fault tolerance techniques, they require intricate design of the algorithm, along with associated proofs of correctness and characterization of performance.
    In the domain of storage, fault tolerance has been addressed using replication and erasure coding techniques. Replication techniques store data at distinct sites to tolerate failures. The resource overheads of such techniques are significant. Erasure coding techniques [20], however, augment data with codes that enable recovery in the event of a failure. Erasure codes may be viewed as multiplication of the data (viewed as a vector) by a matrix that satisfies certain rank properties. Given an $n$-dimensional data vector, erasure codes multiply the data vector by an $(n+k) \times n$ matrix (i.e., a matrix with $n+k$ rows and $n$ columns) to generate an $(n+k)$-dimensional coded vector. The coding matrix is designed in such a way that any subset of $n$ rows of the matrix is linearly independent. In the event of up to $k$ failures, one simply inverts the rest of the matrix and multiplies with the available data items to extract the original $n$-dimensional data vector. Such techniques have been used with great success in systems such as RAID (Redundant Array of Inexpensive Disks) and wide area file systems such as Google’s Colossus. The precise structure of the coding matrix is determined by desired level of fault tolerance, as well as the overhead of constituting the coded data and computing original data in the event of a failure. The key benefit of erasure coding is that it requires $O(n+k)$ storage to tolerate $k$ faults, as compared to $O(nk)$ storage in the case of replication. We realize the same benefit for computation, through our erasure coded computation analog.
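    As a minimal sketch of the storage scheme described above, the following Python snippet encodes a 4-item data vector into 6 coded items and recovers it after two erasures. The Vandermonde generator with nodes 1 through 6, the helper names (`vandermonde`, `encode`, `recover`, `solve`), and the sample data are illustrative assumptions rather than the construction of any particular storage system. Exact rational arithmetic is used only to keep the check free of rounding.

```python
from fractions import Fraction


def vandermonde(rows, cols):
    """rows x cols matrix with entry (i, j) = (i + 1) ** j.

    The nodes 1..rows are distinct, so any `cols` rows are linearly independent.
    """
    return [[Fraction(i + 1) ** j for j in range(cols)] for i in range(rows)]


def solve(M, y):
    """Solve the square system M z = y by Gauss-Jordan elimination."""
    n = len(M)
    aug = [list(M[i]) + [y[i]] for i in range(n)]
    for c in range(n):
        # Find a row with a non-zero pivot and move it into place.
        p = next(r for r in range(c, n) if aug[r][c] != 0)
        aug[c], aug[p] = aug[p], aug[c]
        piv = aug[c][c]
        aug[c] = [v / piv for v in aug[c]]
        for r in range(n):
            if r != c and aug[r][c] != 0:
                f = aug[r][c]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[c])]
    return [aug[i][n] for i in range(n)]


def encode(G, data):
    """Coded vector = G * data (one coded item per row of G)."""
    return [sum(g * d for g, d in zip(row, data)) for row in G]


def recover(G, coded, erased):
    """Rebuild the data from any n surviving coded items."""
    n = len(G[0])
    alive = [i for i in range(len(G)) if i not in erased][:n]
    return solve([G[i] for i in alive], [coded[i] for i in alive])


data = [Fraction(v) for v in (3, 1, 4, 1)]    # n = 4 data items
G = vandermonde(6, 4)                         # n + k = 6 coded items, k = 2
coded = encode(G, data)
restored = recover(G, coded, erased={1, 4})   # lose two coded items
```

    In this sketch, tolerating two erasures costs two extra stored items, whereas replication would need two extra full copies of the data.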
    Our overall approach is best characterized as adapting the ideas behind erasure coded fault-tolerant storage to create fault tolerant problem formulations, and using the same base algorithm to solve these augmented problems. As a further advance, this article presents a novel technique that adaptively codes problems as faults occur, thus minimizing overhead of fault tolerance.

    3 An Erasure Coded Linear System Solver

    We begin with a brief description of the base erasure coded linear solver to provide necessary background. We refer readers to the work of Zhu et al. [26] for the theoretical underpinnings of erasure coded computations. The purpose here is simply to provide necessary background and motivation for the parallel adaptive fault tolerant solver, which is the topic of our current work.
    Given a linear system, $Ax = b$, we first construct a coding matrix $E$ so that $E^T$ has Kruskal rank $k$, to tolerate a maximum of $k$ faults. Note that Kruskal rank $k$ of $E^T$ implies that any subset of $k$ columns of $E^T$ is guaranteed to be linearly independent. Such coding matrices can be constructed using traditional coding techniques, using Vandermonde matrices or sparse low-density parity codes (LDPC). Using matrices $A$ and $E$, we build an encoded system to provide fault tolerance. The augmented or encoded system, one solution to the augmented system, and the augmented right hand side, are given by the following equation.
    $$\tilde{A} = \begin{bmatrix} A & AE \\ E^T A & E^T A E \end{bmatrix}, \quad \tilde{x} = \begin{bmatrix} x \\ 0 \end{bmatrix}, \quad \tilde{b} = \begin{bmatrix} b \\ E^T b \end{bmatrix}.$$
    Note that this system is singular, and there are multiple solutions. With these augmented data structures, we find any solution of the new system,
    $$\tilde{A}\tilde{x} = \tilde{b}, \tag{1}$$
    using traditional parallel techniques. In the event of a fault (we assume a fail-stop failure), the non-faulty processors continue their computations in a fault-oblivious manner until the solver converges.
    When $E^T$ has Kruskal rank $k$, Zhu et al. [26] prove a number of properties of this augmented system that guarantee that solution recovery is possible. If a subset of up to $k$ rows and columns of matrix $\tilde{A}$ were to be erased (due to a fail-stop processor failure), along with associated elements of intermediate solution $\tilde{x}$ and right-hand side $\tilde{b}$, the solver continues computations on remaining parts of the augmented system. If the original matrix $A$ is SPD, it can be shown that the remaining system that survives erasures is symmetric semi-definite, and that Krylov subspace methods converge to a solution for this partially erased augmented system. We can recover true solution $x$ from the solution returned by the fault-oblivious solver. Specifically, let $\tilde{x} = \begin{bmatrix} c \\ r \end{bmatrix}$ be any solution to the augmented system, where $c \in \mathbb{R}^n$ and $r \in \mathbb{R}^k$. To recover the true solution in the presence of errors, we only need to compute [26]
    $$x = c + E r. \tag{2}$$
    This gives us a straightforward and computationally inexpensive recovery algorithm.
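    The recovery step can be checked on a tiny example. The sketch below assumes the augmented system has the form $\tilde{A} = Z^T A Z$ and $\tilde{b} = Z^T b$ with $Z = [I \; E]$, and that the recovery is $x = c + Er$ for a solution split into its first $n$ and last $k$ entries. The 3-by-3 matrix, the dense 3-by-2 coding matrix `E`, and all helper names are hypothetical choices made for illustration. The particular augmented solution used here has a zero in position 1, mimicking a solution whose erased component was never computed.

```python
from fractions import Fraction as Fr


def matmul(X, Y):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]


def transpose(X):
    return [list(col) for col in zip(*X)]


def matvec(X, v):
    return [sum(a * b for a, b in zip(row, v)) for row in X]


n, k = 3, 2
A = [[Fr(4), Fr(1), Fr(0)],
     [Fr(1), Fr(3), Fr(1)],
     [Fr(0), Fr(1), Fr(2)]]                    # small SPD matrix (assumed)
x_true = [Fr(1), Fr(-2), Fr(3)]
b = matvec(A, x_true)
E = [[Fr(1), Fr(2)],
     [Fr(3), Fr(5)],
     [Fr(7), Fr(11)]]                          # hypothetical n x k coding matrix

# Z = [I E]; the augmented system is Z^T A Z and Z^T b.
Z = [[Fr(int(i == j)) for j in range(n)] + E[i] for i in range(n)]
A_aug = matmul(matmul(transpose(Z), A), Z)
b_aug = matvec(transpose(Z), b)

# The augmented system is singular: [x - E v; v] solves it for every v.
# v is chosen so that component 1 of the first block is zero.
v = [Fr(1), Fr(-1)]
x_aug = [xi - ei for xi, ei in zip(x_true, matvec(E, v))] + v

# Recovery: split into c (first n entries) and r (last k), then x = c + E r.
c, r = x_aug[:n], x_aug[n:]
x_rec = [ci + ei for ci, ei in zip(c, matvec(E, r))]
```

    Any other choice of `v` gives a different augmented solution but the same recovered vector, which is why the solver may return any solution of the singular system.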
    Note that the solver operates on the augmented system of size $(n+k) \times (n+k)$. For large values of $k$, this can add significant overhead, particularly when the augmentation blocks have significant fill (non-zeros). This overhead is incurred by the solver, independent of the number of faults encountered in the execution, since the value of $k$ is chosen based on the worst-case estimate of the number of faults.

    4 Coding Matrix for Parallel Implementation

    The results from Zhu et al. [26] are designed such that any combination of failures can be handled. Existing simple techniques to achieve this involve matrices with Kruskal rank $k$, which then cause large degrees of fill-in for the coded system $\tilde{A}$. Given that over short-enough time windows, we expect computations to be successful, we investigate strategies to relax this strict requirement to enable methods that will work better in practice. This motivates the following weaker definition used in the work of Kang et al. [16].
    Definition. A -by- matrix satisfies the recovery-at-random property if a random subset of rows (selected uniformly with replacement) is rank with probability .
    This definition admits a much wider class of coding matrices. A simple construction for one such matrix is to use staggered matrices as in Figure 2(a). These have columns and runs of length elements that wrap-around. An example with , is as follows.
    These are motivated by a random coding of a small window of the system to promote sparsity in the final result. Formally, for the row, the elements for ranging from 1 to are set to random reals in the range . We call this -by- matrix a -staggered distribution.
    These matrices have a number of useful properties, including the recovery-at-random property.
    Proposition 4.1.
    Let be a submatrix of formed by selecting at most rows of matrix , then the matrix has full row rank. This shows that any collection of up to rows of matrix are linearly independent.
    Proof.
    The two keys to our proof are that wide matrices of random entries (more columns than rows) with the same non-zero structure are non-singular and that matrices with distinct non-zero patterns are non-singular if any subset of the rows with the same non-zero pattern are non-singular. The full argument simply combines these two pieces. Consider a matrix of up to rows where all of the non-zero structures are the same. This full row rank -by- matrix of random uniform entries is rank if . If there are distinct non-zero patterns, we simply repeat the argument on each distinct non-zero pattern, which shows that it is non-zero. The base case is a single row, which is the simplest case of a full row-rank matrix.□
    Proposition 4.1 shows that any submatrix of at most rows of matrix has full rank. But because we want to use this to correct failures, to use theory similar to Zhu et al. [26], we would need that any submatrix of rows of matrix must have rank . Clearly, there exist degenerate cases where this is not true; specifically, if we select rows, each with the same non-zero structure, we end up with a submatrix of rank only , which is less than . Here, it is important to appreciate that when this result is used, the rows chosen correspond to failures or faults. We now show that random collections are likely to be full rank, which will help us prove the recovery-at-random result.
    Theorem 4.2.
    Let . The probability that a random set of rows of a matrix drawn from the -by- -staggered distribution is linearly dependent is less than .
    Proof. A necessary and sufficient condition for rows to be linearly dependent is that some selection of of these rows have the same non-zero structure.
    Note that there are distinct non-zero structures in the rows of matrix . Furthermore, since rows are uniformly assigned one of these distinct row structures, the probability that a row has a specific row structure is and the probability that rows have the same row structure is . Since there are () ways to select rows out of the selected block of rows, the probability that a selected block of rows is linearly dependent is given by
    As increases, it is easy to see that this probability rapidly approaches 0. Stated otherwise, matrix satisfies recovery-at-random for chosen as a suitable function of . The following analysis generalizes the situations. Note that there are distinct non-zero structures in the -staggered distribution, and this was key to the proof of the previous result. We can analyze the expected maximum number of duplicate non-zero structures from a set of in the following result.
    Theorem 4.3.
    Let be a matrix where each row has one of non-zero patterns and the pattern in a random row occurs with probability . The maximum number of rows from among randomly selected rows of matrix that have same non-zero structure is with high probability.
    Proof.
    This result is equivalent to existing arguments for balls-and-bins problems that arise in the study of hashing. In this equivalent setup there are bins (one for each type of row non-zero pattern). Then our selection of random rows consists of “dropping” balls into these bins, where each ball chooses a bin with probability . Our question is then what is the maximum number of balls in any bin—the maximum number of rows with the same non-zero structure. This was originally shown in the work of Gonnet [14] and later re-derived in the work of Raab and Steger [21]. Both show that the maximum is with probability .□
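    The balls-and-bins picture in the proof is easy to explore empirically. The following sketch is an illustration only: the function name `max_duplicate_patterns`, the choice of 1,000 patterns and 1,000 selected rows, and the uniform pattern probabilities are assumptions made for the example. It drops rows into pattern bins uniformly at random and records the largest number of rows that share one non-zero pattern.

```python
import random
from collections import Counter


def max_duplicate_patterns(num_patterns, num_rows, rng):
    """Assign each selected row a uniformly random pattern (bin) and
    return the largest number of rows that landed on the same pattern."""
    counts = Counter(rng.randrange(num_patterns) for _ in range(num_rows))
    return max(counts.values())


rng = random.Random(0)          # fixed seed so runs are repeatable
trials = [max_duplicate_patterns(1000, 1000, rng) for _ in range(200)]
smallest, largest = min(trials), max(trials)
```

    Across repeated trials the maximum load stays in the single digits, far below the number of rows drawn, which is the behavior the theorem relies on.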
    These results show that we will have recovery at random for up to faults using an -by- coding matrix constructed using a staggered non-zero pattern with -non-zeros, when is larger than . Notably, since , we need , which occurs when . Consequently, we use , which, with high probability, guarantees recovery at random for smaller than a few hundred thousand.

    5 Parallel Implementation of Erasure Coded CG

    A simple approach to solve the augmented system with $\tilde{A}$ in parallel is to use a version of CG that will enable us to reset the recurrence whenever a fault is detected. The two-term CG [19], instead of the three-term CG, is such a method. This was used by Zhu et al. [26] to solve systems where $A$ is SPD. The algorithm and the reset procedure are given in Algorithm 1.
    When this is executed in parallel [16], there are a few relevant details that motivate our new work on adaptive solvers. First, consider the case where the augmented matrix and the augmented vectors can be distributed among multiple processes by rows. Alternate formulations with 2D partitioning are also feasible within our coding framework, although we limit our discussion for simplicity. Let the index set associated with process be , then and the set of faulty indices is . Now consider what happens to the operations of Algorithm 1 and how they are affected by faults in a distributed environment. The two non-local steps are the aggregation operations: inner products and the matrix-vector multiplication. After a fault, the viable processes carry out the all-reduce operation in an inner product or by simply skipping the faulty components in the vectors; see other works [16, 26] for additional details. For the matrix-vector multiplication , each process has a block of the matrix (using Matlab notation), a viable process carries out its local aggregation operation for computing by again skipping the faulty components in .
    Another technical issue we need to consider when faults happen is the update to the search direction $p$. Here, when we observe a fault, we truncate the update to be $p_{t+1} = r_{t+1}$, dropping the contribution of the previous search direction. This corresponds to a reset of the Krylov process and is described in the caption of Algorithm 1.
    We now reiterate the recovery of the solution to the original system. Suppose the erasure-coded CG converges on the augmented system (1). Then, we simply compute the expression given by the recovery Equation (2) on that solution.
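    Since Algorithm 1 is not reproduced in this text, the following is a minimal serial sketch of a two-term CG iteration with a reset hook, written under stated assumptions rather than as the authors' implementation. The function name `cg_two_term`, the `reset_at` parameter (iterations at which a fault is pretended to be detected), and the tridiagonal test matrix are all hypothetical. On a reset, the residual is recomputed from the current iterate and the search direction is truncated to the residual, which restarts the Krylov process.

```python
def matvec(A, v):
    return [sum(a * b for a, b in zip(row, v)) for row in A]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cg_two_term(A, b, reset_at=(), tol=1e-10, max_iter=1000):
    """Two-term CG for SPD A. At iterations listed in `reset_at`, the
    recurrence is reset: r is recomputed and p is truncated to r."""
    x = [0.0] * len(b)
    r = list(b)                 # residual for the zero initial guess
    p = list(r)
    rho = dot(r, r)
    for t in range(max_iter):
        if t in reset_at:
            # Reset: rebuild the residual from x and drop the old direction.
            r = [bi - qi for bi, qi in zip(b, matvec(A, x))]
            rho = dot(r, r)
            p = list(r)
        if rho ** 0.5 <= tol:
            return x, t
        q = matvec(A, p)
        alpha = rho / dot(p, q)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * qi for ri, qi in zip(r, q)]
        rho_new = dot(r, r)
        beta = rho_new / rho
        p = [ri + beta * pi for ri, pi in zip(r, p)]   # two-term update
        rho = rho_new
    return x, max_iter


# Hypothetical test problem: tridiagonal SPD matrix, exact solution all ones.
n = 20
A = [[4.0 if i == j else -1.0 if abs(i - j) == 1 else 0.0 for j in range(n)]
     for i in range(n)]
b = matvec(A, [1.0] * n)
x_plain, iters_plain = cg_two_term(A, b)
x_reset, iters_reset = cg_two_term(A, b, reset_at={3, 7})
```

    Both runs reach the same solution. The reset discards the accumulated search directions, so the run with resets may need more iterations. In the parallel setting described above, the inner products and matrix-vector products would additionally skip the faulty components.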
    Partitioning considerations. Even with the sparse matrix $E$ described in Section 4, the augmentation blocks in $\tilde{A}$ potentially introduce dense blocks if not suitably computed. For this reason, we use a two-step process. In the first step, we order the input matrix (Figure 1(a)) using a traditional matrix/graph partitioning technique such as Metis (Figure 1(b)). We then use this ordered matrix to compute $\tilde{A}$ (Figure 1(c)). The resulting matrix $\tilde{A}$ is then reordered once again to partition across various nodes in a parallel/distributed platform (Figure 1(d)). The first step minimizes non-zeros in $\tilde{A}$, and the second partitioning step minimizes communication for the solver applied to $\tilde{A}$.
    Fig. 1. Illustration of the process of computing and repartitioning augmented matrix $\tilde{A}$. From Kang et al. [16].

    6 An Adaptive Fault Tolerant Conjugate Gradient Solver

    The underlying principle of our adaptive solver is to add redundant blocks into the matrix only when errors are encountered. (Solvers from Section 5 operate on $(n+k) \times (n+k)$ matrices with up to $k$ errors handled.) The solver always operates on $n \times n$ matrices, which are guaranteed to be SPD if the input matrix is SPD. Redundant blocks are precomputed and stored, but only utilized in the event of faults. We now describe our adaptive fault tolerant scheme built on top of the CG solver described earlier.
    We assume that the original matrix and right-hand side are initially distributed among multiple processes by rows. This assumption does not restrict parallelization to 1D; rather, the assumption is merely for exposition. The solver runs on the input system until a fault occurs. As before, we assume a fail-stop failure model (other fault models can be reduced to fail-stop models through predicate checks to detect failures). We also assume that other processors can detect the identity of the failed processor. This is typically implemented in systems through periodic heartbeats. When a fault happens, the rows (and columns) assigned to the processor are erased. These erased blocks are compensated for by the addition of an identical number of rows (and columns) selected from the precomputed coding blocks (Figure 2). We elaborate on this selection strategy in the next section. For now, we assume that the coding blocks are dense and that the selection of which rows/columns to add from the coding blocks is arbitrary.
    Fig. 2. Illustration of the coding and compensation process for erasures (fail-stop failures).
    To describe the overall methodology, let so that we can write the augmented matrix as
    We assume that the matrix and the vector are available at all processors. In our adaptive solver, the matrix and vector are held in reserve and initially unused. Consequently, on the first fault, which we represent as erasures, we conceptually permute the matrix as follows to identify the correct and faulty entries in blocks:
    Thus, when we lose elements to erasures, we lose the rows associated with and along with the elements . We also assume that other processors have cached the most recent values from so that this information is not lost. The idea is that we are going to use information from and to quickly add information back to the matrix. Consider, again conceptually, the full augmented system (1) at this point with the same partitioning:
    (3)
    Here,
    and
    By assumption, the still-running processors have available the precomputed matrices and vector involved in block partitioned form of the augmented system (1). The vector has not been used at this point, so its values are zero. To be abundantly clear, we do not form the system in (3) explicitly. This is handled algorithmically. When an erasure occurs, rows of the system are lost. We select an arbitrary set of columns from the precomputed coding data corresponding to an matrix . (In the event of a sequence of erasures, a column can only be selected once.) Let be the data selected. Then, we form and solve the new system:
    (4)
    Note that all of these matrix blocks are available to us. Given any solution of this problem, we can recreate the true solution to the original system following the reconstruction procedure from Zhu et al. [26].
    The corresponding procedure is described in Algorithm 2 in terms of explicit algorithmic primitives. Steps 3 through 9 correspond to the standard CG method. Step 10 corresponds to the adaptive fault tolerance mechanism. After erasures, we need to update the corresponding right-hand side to account for the fault. To compute , we use the saved vector . We also set the initial value of . Then, we reset the Krylov subspace process for this new system to ensure that the computed recurrences represent the changes to the system.
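    The displayed systems (3) and (4) are not reproduced in this text, so the following sketch shows only one plausible reading of the compensation step, under loudly stated assumptions: the precomputed augmented data are $Z^T A Z$ and $Z^T b$ with $Z = [I \; E]$, an erased index is dropped, and one coded index whose column of $E$ has a non-zero entry in the erased row is swapped in. The matrix, the dense coding matrix `E`, and the names `faulty`, `spare`, and `keep` are hypothetical. The sketch uses a direct solve instead of CG and ignores the warm start from cached solution values.

```python
from fractions import Fraction as Fr


def matmul(X, Y):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]


def transpose(X):
    return [list(col) for col in zip(*X)]


def matvec(X, v):
    return [sum(a * b for a, b in zip(row, v)) for row in X]


def solve(M, y):
    """Gauss-Jordan elimination in exact arithmetic (stand-in for CG)."""
    n = len(M)
    aug = [list(M[i]) + [y[i]] for i in range(n)]
    for c in range(n):
        p = next(r for r in range(c, n) if aug[r][c] != 0)
        aug[c], aug[p] = aug[p], aug[c]
        piv = aug[c][c]
        aug[c] = [v / piv for v in aug[c]]
        for r in range(n):
            if r != c and aug[r][c] != 0:
                f = aug[r][c]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[c])]
    return [aug[i][n] for i in range(n)]


n, k = 4, 2
A = [[Fr(4 if i == j else 1 if abs(i - j) == 1 else 0) for j in range(n)]
     for i in range(n)]                          # small SPD matrix (assumed)
x_true = [Fr(1), Fr(-2), Fr(3), Fr(5)]
b = matvec(A, x_true)
E = [[Fr(1), Fr(2)], [Fr(3), Fr(5)], [Fr(7), Fr(11)], [Fr(13), Fr(17)]]

# Precomputed coding data, held in reserve until a fault occurs.
Z = [[Fr(int(i == j)) for j in range(n)] + E[i] for i in range(n)]
A_aug = matmul(matmul(transpose(Z), A), Z)
b_aug = matvec(transpose(Z), b)

faulty = [1]        # index erased by a fail-stop failure
spare = [n + 0]     # coded index swapped in; E[1][0] is non-zero
keep = [i for i in range(n) if i not in faulty] + spare

# The compensated system has the same size as the original one (n x n).
A_new = [[A_aug[i][j] for j in keep] for i in keep]
b_new = [b_aug[i] for i in keep]
y = solve(A_new, b_new)

# Embed y into an augmented vector (zeros elsewhere) and recover x = c + E r.
x_aug = [Fr(0)] * (n + k)
for idx, val in zip(keep, y):
    x_aug[idx] = val
c, r = x_aug[:n], x_aug[n:]
x_rec = [ci + ei for ci, ei in zip(c, matvec(E, r))]
```

    Under these assumptions, the compensated system never references the erased row of the original matrix, only blocks that are already precomputed, and the recovered vector matches the solution of the original system.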
    How this differs from reforming the original system. One subtle aspect of this idea is that the precomputed coding data allows us to easily look up data that will render the system non-singular and equivalent for any possible erasure. In contrast, without this information, we would have to recreate precisely those elements of the matrix that were erased. Given that forming linear systems can itself be a complicated process, where we are unlikely to maintain random access to all the data in future, this represents a distinct advantage to our approach. Even if such a random access structure were to be available, large-scale solvers typically run at the limit of memory, and storing multiple copies of the matrix (as many copies as faults to be tolerated) would typically not be feasible.
    Coding matrix. We use the structured sparse coding matrix described in Section 4. The coding matrix is illustrated in Figure 2(a). Furthermore, owing to its structure and sparsity, it does not induce dense blocks into the augmented matrix. Please note that the relative dimensions of matrices and in Figure 2 are selected for illustration purposes. The choices of values of and relative to the size of the matrix make the augmentation blocks appear dense. In real experiments, the augmentation blocks constitute a small fraction of the total matrix, and are much sparser. In Figure 2(c), we show how parts of the augmentation blocks are swapped in to compensate for erasures. Finally, Figure 2(d) illustrates the compensated matrix.
    Using a sparse coding matrix has important implications for the adaptive fault tolerant solver. Recall that coding rows (or blocks) are added only as faults are detected. If the coding blocks are dense and Kruskal rank-, we can select arbitrary columns from matrix , corresponding to columns of the matrices from the last section, to compensate for erasures (see detailed discussion in the work of Zhu et al. [26]). However, if the coding blocks are sparse, we cannot arbitrarily select columns from the coding blocks, because column of may not involve row or column from the matrix . Consequently, we need to ensure that if an element is erased, then the column we select from must have a non-zero entry . This is easily done, since by assumption, all processors are aware of the indices of erased elements and the elements of are structured.
    We reiterate that we do not add the erased row of matrix itself because we would have to maintain multiple copies of the matrix (one more than the number of node failures we wish to tolerate) to be able to replace erased rows with original rows in the matrix. In contrast, using coded rows, we can significantly reduce the storage requirement for coded blocks. Note that this reduction in required memory is identical to that of storing erasure coded data in storage systems, as compared to replication.
    Communication overhead of coded rows. An important question on parallel performance of our method relates to the communication overhead and computational cost introduced by the coded rows. The volume of communication and consequently the communication overhead are determined by the structure and density of the coding block. Rows in the coding block are constructed as scaled summations of sparse subsets of columns of the input matrix. For this reason, the coding block structure, the fill in each coded row, and the resulting communication overheads are highly dependent on the matrix structure. We note, however, that for minimizing communication and maximizing cache performance, sparse matrices are ordered in such a way that rows with similar non-zero structure are ordered to be in proximity of each other. Since our coding matrix is constructed as a scaled sum of selected contiguous rows, the fill structure of the coding block is similar to that of other rows in the partition (other rows assigned to the processor). What this implies is that the coded rows typically do not increase the number of processors that need to exchange data over the base matrix-vector product. The increase in volume of communication is minimal, as we demonstrate in our experiments, resulting in minimal loss in efficiency compared to the solver on the original uncoded matrix.
    Comparison with the static coding scheme. The base erasure coded linear solver of Kang et al. [16] estimates worst-case execution faults and augments the input matrix to account for this worst case. Although this static coding scheme outperforms other fault tolerance mechanisms, it has two performance drawbacks:
    • The added computational/communication overhead of the static coding block that must be paid even when there are fewer faults.
    • The slower convergence rate from the null space that is added to the system.
    These performance characteristics are observed in the experimental results in the work of Kang et al. [16]. In contrast, experimental results from our adaptive coding scheme clearly show that the convergence of our solver closely tracks the base (uncoded matrix) case (i.e., there is no loss in convergence, since there is no null space), and that the parallel performance is indistinguishable from the base (uncoded matrix) solver.

    7 Experimental Validation

    We present a comprehensive experimental validation of our proposed adaptive fault tolerance scheme. We aim to demonstrate the following key aspects of our scheme: (i) the CG solver is fault tolerant, converging to a solution at low relative residual; (ii) the iteration and time overhead of our fault tolerant solver is low compared to state-of-the-art techniques; and (iii) the parallel overhead induced by our augmentation rows is small, leading to highly efficient and scalable performance.
    We select matrices from the University of Florida Matrix Collection for our tests, in which cbuckle and gyro_m are used to validate the convergence of adaptive fault tolerant linear solver; larger matrices consph and ldoor are used to validate parallel scalability and robustness to different fault arrival models. The matrix sizes and sparsity are shown in Table 1.
    Table 1. Matrices Used in Testing
    Matrix     Rows       Non-Zeros
    cbuckle    13,681     676,515
    gyro_m     17,361     340,431
    consph     83,334     6,010,480
    ldoor      952,203    42,493,817
    All of the test matrices are SPD (i.e., both CG on the original and augmented system converge on these matrices). The right-hand side is initialized as where is the column vector with all 1s and normalized (i.e., ). Therefore, the relative residual is equal to the residual for our problems. During execution, we compute the residual at each iteration and set the termination condition as for all matrices. The maximum number of iterations of CG is set to 10,000. Of our test matrices, only ldoor does not achieve a residual of before reaching the iteration bound, for either the fault-free or faulty cases.
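The termination rule described above can be sketched as a plain conjugate gradient loop. This is a minimal illustration, not the authors' MPI solver: the tolerance `tol=1e-10` is a placeholder (the paper's threshold is not reproduced here), the matrix is a small dense SPD example, and the right-hand side is simply taken to be a unit-norm vector, so the relative residual equals the residual.

```python
import math


def cg(A, b, tol=1e-10, max_iter=10_000):
    """Conjugate gradient on a dense SPD matrix given as a list of rows.

    Stops when the relative residual ||r|| / ||b|| drops below tol
    (tol is an assumed value) or when max_iter iterations have run.
    Returns (x, iterations, relative_residual).
    """
    n = len(b)
    x = [0.0] * n
    r = list(b)                      # r = b - A x with x = 0
    p = list(r)
    rs = sum(v * v for v in r)
    b_norm = math.sqrt(sum(v * v for v in b))
    for k in range(max_iter):
        if math.sqrt(rs) / b_norm < tol:
            return x, k, math.sqrt(rs) / b_norm
        Ap = [sum(A[i][j] * p[j] for j in range(n)) for i in range(n)]
        alpha = rs / sum(p[i] * Ap[i] for i in range(n))
        x = [x[i] + alpha * p[i] for i in range(n)]
        r = [r[i] - alpha * Ap[i] for i in range(n)]
        rs_new = sum(v * v for v in r)
        p = [r[i] + (rs_new / rs) * p[i] for i in range(n)]
        rs = rs_new
    return x, max_iter, math.sqrt(rs) / b_norm


s = 1.0 / math.sqrt(2.0)             # unit-norm right-hand side [s, s]
A = [[4.0, 1.0], [1.0, 3.0]]
x, iters, rel = cg(A, [s, s])
print(iters, rel)
```

When the iteration bound is hit first, as reported for ldoor, the function returns `max_iter` together with the residual reached at that point.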
    For parallel performance, the matrices are first reordered using Metis [17]. To construct the augmented system, we use an encoding matrix as described in the previous section and use this matrix to generate the augmented system.
    In our tests, we use two different fault arrival models: faults arriving instantaneously and faults arriving at different points in time according to an exponential distribution.
    The motivation for the instantaneous fault model derives from coordinated failures. For instance, studies have shown that the components with the three highest failure rates in data centers are disk, memory, and power supply [23]. In each of these cases, one or more sockets (i.e., the processes allocated to those cores) may fail instantaneously. This is modeled by our instantaneous fault model, where at a selected time, a specific number of processes (i.e., those on a single socket or a blade) fail.
    An exponential distribution is a commonly used fault arrival model [3]. The probability density function (PDF) of the time to failure is given by
    $$ f(t) = \lambda e^{-\lambda t}, \qquad t \ge 0. \tag{5} $$
    Here, $\lambda$ is the failure rate. We define orig_iter as the number of iterations the original system needs to converge to a residual norm of less than or to reach the iteration bound. Different fault rates ($\lambda$) ranging from 1/orig_iter to 3/orig_iter are tested. Note that for an exponential distribution, the mean number of iterations between two consecutive faults is $1/\lambda$. This means the average number of faults in orig_iter iterations is 1, 2, or 3 for $\lambda$ = 1/orig_iter, 2/orig_iter, and 3/orig_iter, respectively. However, since the number of iterations to convergence may be increased by the addition of coding rows, the actual number of faults (even on average) may be higher. In our tests, we set the first fault to happen at . This is to ensure that the fault process starts; otherwise, in some runs, there are no faults at all for the exponential fault process. Finally, in our experiments, for the exponential fault model, we report the number of faults to be the mean of this distribution.
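The exponential fault process can be sketched as follows. This is an illustration under stated assumptions: the function name `fault_iterations`, the value `orig_iter = 1000`, and the optional `first_fault` argument are hypothetical, and fault times are measured in solver iterations. It shows that inter-arrival gaps drawn with rate `rate` have mean `1/rate`, so a window of `horizon` iterations sees `rate * horizon` faults on average.

```python
import random


def fault_iterations(rate, horizon, first_fault=None, seed=0):
    """Sample fault times (in iterations) from an exponential arrival process.

    rate:        failure rate (faults per iteration)
    horizon:     only faults strictly before this iteration are kept
    first_fault: optional forced time of the first fault (an assumption,
                 mirroring the forced first fault described in the text)
    """
    rng = random.Random(seed)
    t = 0.0 if first_fault is None else float(first_fault)
    faults = [] if first_fault is None else [t]
    while True:
        t += rng.expovariate(rate)   # inter-arrival gap with mean 1/rate
        if t >= horizon:
            return faults
        faults.append(t)


orig_iter = 1000                     # hypothetical fault-free iteration count
rate = 2.0 / orig_iter               # expect about 2 faults per orig_iter iterations
counts = [len(fault_iterations(rate, orig_iter, seed=s)) for s in range(2000)]
print(sum(counts) / len(counts))
```

Individual runs can contain zero faults, which is why forcing the first fault (the `first_fault` argument in this sketch) is useful when every run is expected to exercise the recovery path.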

    7.1 Convergence Rates

    Our first set of experiments focuses on the convergence rate of the adaptive linear system solver in the presence of faults. We plot convergence rates and compare them with the no-fault case. The relative error is calculated with respect to the original system.
    Figure 3 shows the convergence rate of the adaptive fault tolerant linear solver with different fault rates. For test matrices cbuckle and gyro_m, we observe that our solver can reach the same relative error as the original system (without any faults) while tolerating a number of faults. As the fault rate increases, the solver needs more iterations to converge. However, we note that the overhead in terms of increased iterations is significantly lower than comparable fault tolerance techniques based on replicated execution or deterministic replay for the same number of faults.
    Fig. 3. Convergence rate for cbuckle and gyro_m with different fault rates and fault starting points. The black circles indicate instances where faults are encountered in execution.
    Figure 4 shows the convergence of the solver on larger matrices, consph and ldoor, for different fault rates. We observe that our solver achieves good convergence rates for both matrices with different fault rates. Furthermore, we note that the convergence curves for the faulty and no-fault cases track each other very closely, indicating that use of the augmented system does not impact solver convergence adversely.
    Fig. 4. Convergence rate for consph and ldoor with different fault rates and fault starting points.

    7.2 Parallel Performance

    We now demonstrate the parallel performance and the time overhead of our augmented system solver. We benchmark the parallel performance of our solver on a 192-core (8 sockets) Intel Xeon Platinum 8168 processor operating at 2.70 GHz. We use MPI to implement our solver. We simulate faults in the system by inducing a selected number of erasures. Processors communicate via non-blocking communications, and processors where faults have been induced stop communicating from the time of induced fault. In our experiments, we use two fault models: the exponential fault model and the instantaneous fault model. Upon detecting a fault, the solver updates the erased rows and continues with the solve, as described in our algorithm. We report on the parallel performance for different sizes of augmenting blocks (varying number of faults, ; here, corresponds to the original system without faults during execution). The parallel speedup is defined as the ratio of the time taken by one processor (to convergence or to reach iteration bound) to the corresponding time taken by the parallel execution. Since we aim to quantify the parallel overhead of the coding blocks, both serial and parallel executions are assumed to have the same number of faults.
    Exponential fault model. For the exponential fault model, we use the fault arrival rate of for all matrices. Figure 5 shows the speedup for large matrices with different number of faults. With increasing number of processors, the speedup increases nearly linearly for all augmentation sizes. Note that the speedup starts to saturate for matrix consph as the number of processors increases. However, this saturation also happens for , indicating that our augmentation blocks add negligible parallel overhead (over and above the base solver).
    Fig. 5. Parallel performance of consph and ldoor for the exponential fault model.
    Instantaneous fault model. For the instantaneous fault model, all faults happen at the same time—we select this to be for all matrices. This point is the same as the occurrence of the first fault in exponential fault model with a fault rate equal to .
    Figure 6 shows the speedup for large matrices with different numbers of faults under the instantaneous fault model. We observe near-linear speedup for our solver and, most importantly, the speedup is very close to that of the base solver with no faults, indicating that under this fault model as well, we do not introduce any significant parallel overhead.
    Fig. 6. Parallel performance for matrices consph and ldoor for the instantaneous fault model.

    7.3 Time Overhead Using Coding Blocks

    We analyze the time overhead of the augmented system solver with respect to original system. We let a fixed number of faults happen during execution () and calculate the ratio of the time to solution for the augmented system with faults and the original system with no faults. As before, time to solution either corresponds to the time to convergence or to reach the iteration bound. Since we are interested in quantifying the compute overhead of our coding block, all results here are obtained on a single core. Ideally, we want this ratio to be as close to 1 as possible.
    Figure 7 shows the time overhead of our solver for different numbers of faults. As expected, with increasing number of faults, we need more time to converge. This time overhead comes from two factors: (i) the impact of denser coding rows in the augmentation blocks and (ii) the increased number of iterations to convergence. We observe for our test matrices that the time overhead for each case tested is less than a factor of 1.2. This is highly efficient, particularly when the number of faults increases, especially in comparison to competing replicated execution or deterministic replay schemes.
    Fig. 7. Time overhead with different numbers of faults under different fault arrival models.

    7.4 Comparison with Checkpointing

    We compare AECC with traditional checkpointing [12, 24]. Checkpointing periodically saves the intermediate program state and rolls back to the last consistent checkpoint when it detects a fault. We implement an optimized form of application checkpointing, where we only store data structures that are needed for restarting the iterative solver, as opposed to all ancillary data structures that may have been updated. Specifically, we save vectors , , and into persistent storage (temporary files) periodically. We collect this data at a designated storage node using a gather operation and save it to disk. Checkpointing intervals are important, because infrequent checkpoints result in significant rollback, whereas frequent checkpoints in the presence of infrequent faults result in significant checkpointing overhead [9, 13, 25]. For this reason, we experiment with different checkpointing intervals: every 1,000, 100, 50, or 10 iterations of the solver. When a fault is detected, the solver reads from the file containing the most recent intermediate results and continues.
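The rollback behavior described above can be sketched with a small counting model. It is a simplified illustration, not the authors' implementation: it tracks only iteration counts (no solver state or file I/O), the function name `iterations_executed` is hypothetical, and fault times are given as counts of executed iterations. It shows how the checkpoint interval trades the number of checkpoints written against the number of iterations redone after a fault.

```python
def iterations_executed(work, interval, fault_at):
    """Count iterations run under periodic checkpointing with rollback.

    work:     useful iterations needed to finish the solve
    interval: a checkpoint is written every `interval` completed iterations
    fault_at: executed-iteration counts at which a fault strikes
    Returns (total iterations executed, number of checkpoints written).
    """
    faults = sorted(set(fault_at), reverse=True)
    progress = 0        # useful iterations completed so far
    saved = 0           # progress captured by the most recent checkpoint
    executed = 0        # all iterations run, including redone ones
    checkpoints = 0
    while progress < work:
        executed += 1
        progress += 1
        if faults and executed == faults[-1]:
            faults.pop()
            progress = saved        # roll back to the last checkpoint
            continue
        if progress % interval == 0:
            saved = progress
            checkpoints += 1
    return executed, checkpoints


# One fault after 250 executed iterations, for the intervals used in the text.
for cp in (1000, 100, 50, 10):
    print(cp, iterations_executed(1000, cp, [250]))
```

With an interval of 1,000 the single fault discards all 250 iterations but only one checkpoint is written; with an interval of 10 only 10 iterations are redone, at the price of 100 checkpoints.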

    7.4.1 Comparison of Speedup.

    Figures 8 and 9 show speedup comparison of AECC with checkpointing for different checkpoint intervals (CP) and fault rates. With increasing number of processors, AECC achieves nearly linear speedup, whereas the speedup of checkpointing saturates. This is particularly true for smaller checkpoint intervals, where the overhead of checkpointing becomes a dominant bottleneck.
    Fig. 8. Comparison of AECC with checkpointing for different fault rates on matrix consph. Each figure shows a different checkpoint interval—indicated in the title. We see the best speedup for checkpointing with infrequent checkpointing, but this still lags our method. None of the methods greatly differ in the number of faults.
    Fig. 9. Comparison of AECC with checkpointing for different fault rates on matrix ldoor. Each figure shows a different checkpoint interval—indicated in the title. We see the best speedup for checkpointing with infrequent checkpointing, but this still lags our method. None of the methods greatly differ in the number of faults.

    7.4.2 Comparison of Time Overhead.

    We compare the time overhead of AECC and checkpointing in Figure 10. The time overheads of AECC come primarily from increased computation on dense augmentation blocks and a larger number of iterations to converge, whereas the overheads of checkpointing come primarily from storing and retrieving intermediate results to and from persistent storage. When the checkpoint interval, CP, is small, checkpointing incurs significant storage overhead but does not lose much in wasted computation, since it does not roll back too far. Conversely, when the checkpoint interval is large, the storage overhead is low, but the rollback overhead is high. The optimal checkpoint interval is therefore a function of the storage cost (which in turn depends on the size of the program state, as well as the performance of the underlying I/O subsystem), the fault rate, and the cost of finding a consistent checkpoint (the latter is not a significant problem in our case, since the iterative synchronized nature of our solver makes it easy to find consistent checkpoints).
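The dependence of the best interval on storage cost and fault rate can be made concrete with Young's first-order approximation [25], which balances the two costs discussed above. The sketch below is illustrative only: the function names and the numeric values are assumptions, and all quantities are expressed in the same time unit (for example, solver iterations).

```python
import math


def young_interval(checkpoint_cost, mtbf):
    """Young's first-order optimum checkpoint interval: sqrt(2 * C * M)."""
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


def overhead_fraction(interval, checkpoint_cost, mtbf):
    """First-order overhead per unit of useful work.

    checkpoint_cost / interval : cost of writing checkpoints
    interval / (2 * mtbf)      : expected rework, half an interval per fault
    """
    return checkpoint_cost / interval + interval / (2.0 * mtbf)


C = 2.0      # assumed cost of one checkpoint
M = 100.0    # assumed mean time between faults (1 / failure rate)
tau = young_interval(C, M)
print(tau, overhead_fraction(tau, C, M))
```

For these assumed values the interval is 20 time units; halving or doubling it raises the modeled overhead from 0.20 to 0.25, which mirrors the storage-versus-rollback tradeoff described in the text.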
    Fig. 10. Comparison of time overhead of AECC and checkpointing for matrices consph and ldoor demonstrating significantly lower overhead of AECC. The AECC results are the same as Figure 7.
    Figure 10 clearly demonstrates that the overhead of AECC is significantly lower than that of checkpointing across a range of fault models and checkpoint intervals. This is particularly true for the larger benchmark matrix, ldoor.

    8 Conclusion

    In this article, we present an adaptive fault tolerant linear system solver capable of scaling to large numbers of processors and associated faults. The solver works by augmenting the input matrix using erasure coded blocks as faults are detected, instead of augmenting the input matrix a priori. Our technique has the following significant advantages: (i) coding blocks are only used when faults are detected—at any time, the linear system is always identical in size to the input system; (ii) the convergence properties of the augmented system closely follow those of the original (input) matrix; and (iii) the parallel performance of the solver scales well with increasing numbers of processors. We investigate the effect of fault rates and fault arrival patterns (instantaneous versus exponential) and show that our scheme is robust to a wide range of fault characteristics and significantly outperforms traditional fault tolerance mechanisms.
    In terms of avenues for future research, the proof of concept presented in this work strongly establishes adaptive erasure coded fault tolerance as a powerful new technique, particularly at scale. An important question that arises in this context is the design of the coding matrix. Although a structured sparse matrix is used in our experiments, an open question relates to the best coding matrix structure that optimally trades off fill in augmentation block, communication in parallel execution, and conditioning of augmented matrix.

    References

    [1]
    George Bosilca, Rémi Delmas, Jack Dongarra, and Julien Langou. 2009. Algorithm-based fault tolerance applied to high performance computing. J. Parallel Distrib. Comput. 69, 4 (April 2009), 410–416.
    [2]
    Patrick G. Bridges, Kurt B. Ferreira, Michael A. Heroux, and Mark Hoemmen. 2012. Fault-tolerant linear solvers via selective reliability. CoRR abs/1206.1390 (2012).
    [3]
    Xavier Castillo, Stephen McConnel, and Daniel Siewiorek. 1982. Derivation and calibration of a transient error reliability model. IEEE Trans. Comput. C-31 (1982), 658–671.
    [4]
    Yunji Chen, Shijin Zhang, Qi Guo, Ling Li, Ruiyang Wu, and Tianshi Chen. 2015. Deterministic replay: A survey. ACM Comput. Surv. 48, 2 (Sept. 2015), Article 17, 47 pages.
    [5]
    Zizhong Chen. 2009. Optimal real number codes for fault tolerant matrix operations. In Proceedings of the ACM/IEEE Conference on High Performance Computing.
    [6]
    Zizhong Chen. 2011. Algorithm-based recovery for iterative methods without checkpointing. In Proceedings of the 20th ACM International Symposium on High Performance Distributed Computing. 73–84.
    [7]
    Zizhong Chen and Jack Dongarra. 2005. Numerically stable real number codes based on random matrices. In Proceedings of the International Conference on Computational Science (ICCS’05). 115–122.
    [8]
    Zizhong Chen, Graham E. Fagg, Edgar Gabriel, Julien Langou, Thara Angskun, George Bosilca, and Jack Dongarra. 2005. Fault tolerant high performance computing by a coding approach. In Proceedings of the 10th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP’05). 213–223.
    [9]
    J. T. Daly. 2006. A higher order estimate of the optimum checkpoint interval for restart dumps. Future Gener. Comput. Syst. 22, 3 (Feb. 2006), 303–312.
    [10]
    Jeffrey Dean and Sanjay Ghemawat. 2004. MapReduce: Simplified data processing on large clusters. In Proceedings of the 6th Conference on Symposium on Operating Systems Design and Implementation—Volume 6. 10.
    [11]
    Jack Dongarra and Zizhong Chen. 2008. Algorithm-based fault tolerance for fail-stop failures. IEEE Trans. Parallel Distrib. Syst. 19, 12 (2008), 1628–1641.
    [12]
    Ifeanyi P. Egwutuoha, David Levy, Bran Selic, and Shiping Chen. 2013. A survey of fault tolerance mechanisms and checkpoint/restart implementations for high performance computing systems. J. Supercomput. 65, 3 (Sept. 2013), 1302–1326.
    [13]
    Nosayba El-Sayed and Bianca Schroeder. 2016. Understanding practical tradeoffs in HPC checkpoint-scheduling policies. IEEE Trans. Dependable Secure Comput. PP (March 2016), 1.
    [14]
    Gaston H. Gonnet. 1981. Expected length of the longest probe sequence in hash code searching. J. ACM 28, 2 (April 1981), 289–304.
    [15]
    Kuang-Hua Huang and J. A. Abraham. 1984. Algorithm-based fault tolerance for matrix operations. IEEE Trans. Comput. 33, 6 (June 1984), 518–528.
    [16]
    Xuejiao Kang, David F. Gleich, Ahmed Sameh, and Ananth Grama. 2017. Distributed fault tolerant linear system solvers based on erasure coding. In Proceedings of the 2017 IEEE 37th International Conference on Distributed Computing Systems (ICDCS’17). 2478–2485.
    [17]
    George Karypis and Vipin Kumar. 1998. A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM J. Sci. Comput. 20, 1 (Dec. 1998), 359–392.
    [18]
    Richard Koo and Sam Toueg. 1986. Checkpointing and rollback-recovery for distributed systems. In Proceedings of 1986 ACM Fall Joint Computer Conference (ACM’86). IEEE, Los Alamitos, CA, 1150–1158.
    [19]
    Gerard Meurant. 2006. The Lanczos and Conjugate Gradient Algorithms: From Theory to Finite Precision Computations (Software, Environments, and Tools). SIAM.
    [20]
    J. S. Plank. 2013. Erasure codes for storage systems: A brief primer. Login 38, 6 (Dec. 2013).
    [21]
    Martin Raab and Angelika Steger. 1998. “Balls into bins”: A simple and tight analysis. In Randomization and Approximation Techniques in Computer Science (RANDOM). Lecture Notes in Computer Science, Vol. 1518. Springer, 159–170.
    [22]
    Fred B. Schneider. 1990. Implementing fault-tolerant services using the state machine approach: A tutorial. ACM Comput. Surv. 22, 4 (Dec. 1990), 299–319.
    [23]
    Guosai Wang, Lifei Zhang, and Wei Xu. 2017. What can we learn from four years of data center hardware failures? In Proceedings of the 47th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN’17). IEEE, Los Alamitos, CA, 25–36.
    [24]
    Tang Xiongchao, Jidong Zhai, Bowen Yu, Wenguang Chen, Weiming Zheng, and Keqin Li. 2017. An efficient in-memory checkpoint method and its practice on fault-tolerant HPL. IEEE Trans. Parallel. Distrib. Systems PP (Dec. 2017), 1.
    [25]
    John W. Young. 1974. A first order approximation to the optimum checkpoint interval. Commun. ACM 17, 9 (Sept. 1974), 530–531.
    [26]
    Yao Zhu, Ananth Grama, and David F. Gleich. 2017. Erasure coding for fault oblivious linear system solvers. SIAM J. Sci. Comput. 39, 1 (2017), C48–C64.

      Published In

      ACM Transactions on Parallel Computing  Volume 8, Issue 4
      December 2021
      118 pages
      ISSN:2329-4949
      EISSN:2329-4957
      DOI:10.1145/3481693

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 09 December 2021
      Accepted: 01 September 2021
      Revised: 01 August 2021
      Received: 01 February 2021
      Published in TOPC Volume 8, Issue 4

      Author Tags

      1. Fault tolerance
      2. linear solver
      3. adaptive fault tolerance

      Qualifiers

      • Research-article
      • Refereed

      Funding Sources

      • U.S. Department of Energy
