These approximation techniques have opposing effects on performance: they reduce the time and memory consumption of the factorization, but they increase the number of refinement iterations.
5.4.1 BLR Factorization.
In Table
3, we present the execution time and memory consumption of four iterative refinement variants using a low-rank BLR factorization for different values of the compression threshold
\(\tau _b\). All variants provide a forward error on the solution equivalent to that of DMUMPS. If
\(\tau _b=\) “full-rank,” then the factorization is run without BLR; this is the standard factorization of Section
5.3. It should be noted that in this case the double-precision factorization LU-IR variant is equivalent to DMUMPS, and we will refer to it as DMUMPS in the text. We denote by “—” the cases where convergence is not reached and, for each matrix, we highlight in bold the execution times and memory consumptions that are within 10% of the best execution time or memory consumption. We choose to work with the default BLR settings of MUMPS, in which case the data in the active memory is not compressed with low-rank approximations. It should be noted, however, that MUMPS also offers the possibility to compress the data in the active memory; this would result in additional memory savings but would adversely affect the execution time, because it adds the compression overhead without reducing the operational complexity.
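To make the structure of these variants concrete, the following is a minimal dense sketch of LU-IR with a single-precision factorization (\(u_f=\) s) and double-precision refinement; NumPy/SciPy stand in for the sparse BLR-capable solver used in the experiments, and the function name and stopping test are ours.

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

def lu_ir(A, b, tol=1e-14, max_iter=100):
    """Minimal dense sketch of LU-IR with u_f = s: the factorization is
    computed in single precision and the refinement iterations are carried
    out in double precision. Illustrative stand-in for the sparse (BLR)
    factorization used in the experiments."""
    lu, piv = lu_factor(A.astype(np.float32))      # approximate factorization (u_f = s)
    x = lu_solve((lu, piv), b.astype(np.float32)).astype(np.float64)
    for it in range(1, max_iter + 1):
        r = b - A @ x                              # residual in double precision
        if np.linalg.norm(r, np.inf) <= tol * np.linalg.norm(b, np.inf):
            return x, it                           # converged
        d = lu_solve((lu, piv), r.astype(np.float32))   # correction via the LU factors
        x = x + d.astype(np.float64)               # update in double precision
    return x, max_iter
```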
The experimental results of Table
3 are in good agreement with the theoretical convergence conditions of Theorem
4.1 developed in Section
4. We can clearly see how the robustness of the presented variants is related to both the condition number
\(\kappa (A)\) of the matrix (reported in Table
1) and the BLR threshold
\(\tau _b\). Convergence is not achieved for excessively large values of the BLR threshold; the largest
\(\tau _b\) value for which convergence is achieved depends on the matrix condition number and, in general, is smaller for badly conditioned problems. The GMRES-IR variants, which are more robust, allow the BLR threshold to be pushed to larger values without breaking convergence. Note that every variant converges whenever the theoretical convergence condition is met. However, as the theoretical convergence conditions in Theorem
4.1 can be pessimistic, there are several cases where convergence is achieved even though the condition is not met.
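Schematically, and with all constants and norm choices omitted, these observations can be summarized by a hedged reading of Theorem 4.1 (our simplification, not its precise statement): the LU-IR variants can be expected to converge when
\[
\kappa (A)\,\bigl(\rho _n u_f + \max (\tau _s,\tau _b)\bigr) \lesssim 1,
\]
with \(\tau _s = 0\) when static pivoting is not used, as here, while the GMRES-IR variants obey a weaker condition, which is why they tolerate larger values of \(\tau _b\) for a given \(\kappa (A)\).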
The use of BLR with a good choice of compression threshold
\(\tau _b\) generally results in substantial reductions of the LU-IR and GMRES-IR execution times. As the BLR threshold increases, the operational complexity and, consequently, the execution time of the factorization and solve operations decrease; conversely, the number of iterations increases, up to the point where convergence may no longer be achieved. The optimal BLR threshold value that delivers the lowest execution time obviously depends on the problem. It must be noted that even though the GMRES-IR variants achieve convergence for larger
\(\tau _b\) values, doing so leads to an excessive number of iterations whose cost exceeds the improvement provided by BLR; as a result, these variants are always slower than the faster of the two LU-IR variants (
\(u_f=\) s or
\(u_f=\) d). Consequently, the single-precision factorization LU-IR variant generally achieves the best execution time on this set of problems, similarly to what was observed in Section
5.3, with a few exceptions. On perf009ar, the double-precision factorization LU-IR variant is the best because, as in the full-rank case (see Section
5.3.1), the BLR factorization is less than twice as fast in single precision as in double precision for the same
\(\tau _b\) value; additionally, a substantial number of iterations is needed to achieve convergence. It is worth mentioning that on this matrix the GMRES-IR variant with
\(u_g=u_p=\) d is faster than the single-precision factorization LU-IR variant (36.9 s versus 40.3 s) and consumes less memory than the double-precision factorization LU-IR variant (20.0 GB versus 37.1 GB). On CarBody25M, DMUMPS is the fastest variant, as in the full-rank case; this is because, on this problem, BLR does not achieve a good reduction of the operational complexity and, therefore, of the execution time.
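As a reminder of what the compression threshold controls, the sketch below compresses a single block to a low-rank product accurate to \(\tau _b\). It is illustrative only: we use a truncated SVD for clarity, whereas the solver's internal compression kernel and truncation rule may differ.

```python
import numpy as np

def compress_block(B, tau_b):
    """Sketch of BLR block compression: replace a block B by a rank-k
    product X @ Y.T such that ||B - X @ Y.T||_2 <= tau_b * ||B||_2.
    Illustrative only: an SVD stands in for the solver's internal
    rank-revealing compression kernel."""
    U, s, Vt = np.linalg.svd(B, full_matrices=False)
    k = int(np.sum(s > tau_b * s[0]))   # smallest rank meeting the target accuracy
    X = U[:, :k] * s[:k]                # absorb the singular values into X
    Y = Vt[:k, :].T
    return X, Y                         # stored in place of B when k is small
```

Larger \(\tau _b\) values yield smaller ranks \(k\), reducing flops and storage at the price of a larger perturbation of the factors, which is exactly the time/iteration trade-off observed above.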
As for the storage, the use of BLR leads to a different outcome from the full-rank case (see Section
5.3), where the single-precision factorization LU-IR variant is the best. This is due to the combination of two factors. First, when BLR is used, the relative weight of the active memory is higher, because it corresponds to data that is not compressed under our choice of parameters; consequently, the peak memory consumption is often reached during the factorization rather than during the iterative refinement. Second, the memory consumption of the factorization decreases monotonically as the BLR threshold is increased. As a result of these two effects, the GMRES-IR variants achieve the lowest memory consumption on this set of problems, because they can preserve convergence for larger values of
\(\tau _b\) than the LU-IR variants can. For example, on tminlet3M the GMRES-IR variant with
\(u_g=u_p=\) d consumes almost
\(15\%\) less memory than the LU-IR variant with
\(u_f=\) s (70.9 GB versus 82.4 GB), on thmgas the GMRES-IR variant with
\(u_g=u_p=\) d consumes almost
\(30\%\) less memory than the LU-IR variant with
\(u_f=\) s (43.7 GB versus 61.4 GB), and on matrix 24 the GMRES-IR variant with
\(u_g=u_p=\) d consumes more than
\(35\%\) less memory than the LU-IR variant with
\(u_f=\) d (41.8 GB versus 64.8 GB). It is worth pointing out that the value of
\(\tau _b\) for which GMRES-IR achieves the lowest possible memory consumption is not always the largest value for which convergence is still possible. This is because, when the number of iterations is large, the memory needed to store the Krylov basis may exceed the savings obtained with BLR. This problem can be overcome or mitigated by choosing an appropriate value for the
\(\tau _g\) threshold or, similarly, by using a restarted GMRES method; we leave this analysis for future work.
We finally compare the two GMRES-IR variants \(u_g=u_p=\) d and \(u_g=u_p=\) s. When \(u_g=u_p=\) s, GMRES-IR avoids casting the LU factors from single to double precision and thus reduces memory consumption compared with \(u_g=u_p=\) d. However, as explained above, the relative weight of the factors with respect to the active memory decreases as \(\tau _b\) increases, and so the reduction achieved by GMRES-IR with \(u_g=u_p=\) s shrinks until both variants achieve a similar memory consumption. On our matrix set, for the values of \(\tau _b\) where LU-IR with \(u_f=\) s does not converge, GMRES-IR with \(u_g=u_p=\) s does not achieve significant memory reductions compared with GMRES-IR with \(u_g=u_p=\) d (at best 7% on perf009ar, 18.6 GB versus 20.0 GB).
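For concreteness, this is a minimal dense sketch of the GMRES-IR variant with \(u_g=u_p=\) d, again with SciPy standing in for the sparse solver. Casting the factors to double precision before preconditioning is exactly the extra memory cost discussed above; the function name and tolerances are ours.

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve
from scipy.sparse.linalg import LinearOperator, gmres

def gmres_ir(A, b, tau_g=1e-4, tol=1e-14, max_iter=50):
    """Sketch of GMRES-IR with u_f = s and u_g = u_p = d: each correction
    equation A d = r is solved by GMRES preconditioned with the LU factors,
    which are cast from single to double precision (the memory overhead
    discussed in the text). tau_g plays the role of the inner tolerance."""
    n = A.shape[0]
    lu, piv = lu_factor(A.astype(np.float32))          # u_f = s
    lu_d = lu.astype(np.float64)                       # cast the factors: u_g = u_p = d
    M = LinearOperator((n, n), matvec=lambda v: lu_solve((lu_d, piv), v))
    x = lu_solve((lu_d, piv), b)
    for it in range(1, max_iter + 1):
        r = b - A @ x                                  # residual in double precision
        if np.linalg.norm(r, np.inf) <= tol * np.linalg.norm(b, np.inf):
            return x, it
        d, info = gmres(A, r, M=M, rtol=tau_g)         # 'rtol' in SciPy >= 1.12 ('tol' before)
        x = x + d
    return x, max_iter
```

With \(u_g=u_p=\) s, one would instead keep and apply the float32 factors directly, saving the memory of `lu_d` at the price of a less accurate preconditioner.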
5.4.2 Static Pivoting Factorization.
We now turn our attention to the use of static pivoting. We report in Table
4 the execution time and memory consumption of three iterative refinement variants for different values of the static pivoting threshold
\(\tau _s\). All variants are stopped when they reach a forward error on the solution equivalent to that of DMUMPS. If
\(\tau _s=\) “partial,” then the factorization is run with standard MUMPS threshold partial pivoting. It should be noted that in this case the double-precision factorization LU-IR variant is equivalent to DMUMPS.
Once again, the observed convergence behaviors are in good agreement with Theorem
4.1 as explained below. In the case of static pivoting, the execution time of the factorization does not depend on
\(\tau _s\); to minimize the overall solution time, the goal is therefore to achieve the fastest possible convergence. This is a complex issue: A smaller perturbation
\(\tau _s\) does not always mean a faster convergence, because the value of
\(\tau _s\) also directly impacts the growth factor
\(\rho _n\). Thus, there is an optimal value of
\(\tau _s\), which is clearly problem dependent, that leads to the fastest convergence by balancing the
\(\rho _n u_f\) and
\(\tau _s\) terms in the convergence condition. To confirm this, Table
4 reports
\(\underline{\rho _n}\), a lower bound on the true growth factor that can be used as a cheap but rough indicator of the behavior of
\(\rho _n\) (the true
\(\rho _n\) would be extremely expensive to compute for such large matrices). There is a clear trend of
\(\underline{\rho _n}\) decreasing as
\(\tau _s\) increases, which explains, for example, why on lfm_aug5M convergence is achieved for large
\(\tau _s\). For many matrices in our set, such as tminlet3M in Table
4, static pivoting slightly accelerates the factorization without excessively deteriorating the convergence, and so allows for modest time gains overall. However, for some matrices such as lfm_aug5M, static pivoting requires many iterations and can lead to significant slowdowns compared with partial pivoting. It is, however, interesting to note that, on lfm_aug5M, if the use of partial pivoting is impossible (for instance, because the available solver does not support it), then the GMRES-IR variant provides the best overall execution time.
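To fix ideas, the following dense sketch shows the two ingredients of this discussion: an LU factorization without row exchanges in which tiny pivots are replaced by a perturbation of size \(\tau _s\Vert A\Vert\), and the cheap growth-factor lower bound computed from the final factors. Both are our illustrative reconstructions; the exact perturbation rule and the precise definition of \(\underline{\rho _n}\) used by the solver may differ.

```python
import numpy as np

def lu_static_pivoting(A, tau_s):
    """Sketch of LU with static pivoting: no row exchanges; any pivot with
    magnitude below tau_s * ||A|| is replaced by a perturbation of that
    size, preserving its sign. Dense analogue of the sparse kernel."""
    F = A.astype(np.float64).copy()
    n = F.shape[0]
    thresh = tau_s * np.linalg.norm(A, np.inf)
    for k in range(n):
        if abs(F[k, k]) < thresh:
            F[k, k] = np.copysign(thresh, F[k, k])   # static pivot perturbation
        F[k+1:, k] /= F[k, k]                        # column of L
        F[k+1:, k+1:] -= np.outer(F[k+1:, k], F[k, k+1:])
    return F                                         # L (unit diagonal) and U packed together

def growth_lower_bound(A, F):
    """Cheap lower bound on the growth factor rho_n, here taken as
    max|u_ij| / max|a_ij| (our reading of underline-rho_n); the true
    rho_n involves all intermediate entries and is far more expensive."""
    return np.abs(np.triu(F)).max() / np.abs(A).max()
```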
5.4.3 BLR Factorization with Static Pivoting.
Finally, in Table
5, we present the execution time and memory consumption of three iterative refinement variants (the same as in Section
5.4.2) for different values of the BLR compression threshold
\(\tau _b\) and a fixed value of the static pivoting perturbation
\(\tau _s\). All variants are stopped when they reach a forward error on the solution equivalent to that of DMUMPS. If
\(\tau _s=\) “partial,” then the factorization is run with standard MUMPS threshold partial pivoting and the results are equivalent to the BLR results of Section
5.4.1.
Theorem
4.1 applied to the case where BLR and static pivoting are used together states that the convergence conditions are affected by the larger of the two perturbations,
\(\max (\tau _s, \tau _b)\), and by the term
\(\rho _n u_f\), which depends on the growth factor. Our experiments confirm this: For values of
\(\tau _s\) or
\(\tau _b\) for which a given variant did not converge with BLR or static pivoting alone, it still does not converge when they are combined; conversely, variants that converged for BLR and static pivoting alone still converge when these two approximations are used together. lfm_aug5M with
\(\tau _s=10^{-6}\) illustrates an interesting point about the error bound
\(\max (\tau _s,\tau _b)+\rho _n u_f\): convergence is only achieved for the variant that uses a double-precision factorization (
\(u_f=\) d), even for values of
\(\tau _b\) that are much larger than the unit roundoff of single precision. This shows that the rule of thumb that the factorization precision should be chosen as low as possible as long as
\(u_f \le \tau _b\) does not hold in the presence of large element growth, since a smaller value of
\(u_f\) can be beneficial to absorb a particularly large
\(\rho _n\).
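To see why, consider illustrative orders of magnitude (our numbers, chosen only to expose the mechanism): with \(u_f = u_{\rm s} \approx 6\times 10^{-8}\), \(\tau _b = \tau _s = 10^{-6}\), and a growth factor \(\rho _n = 10^{4}\), we get
\[
\rho _n u_f \approx 6\times 10^{-4} \gg \max (\tau _s,\tau _b) = 10^{-6},
\]
so the growth term dominates the bound even though \(u_f \le \tau _b\) holds; switching to \(u_f = u_{\rm d} \approx 1.1\times 10^{-16}\) shrinks it to about \(10^{-12}\) and leaves the perturbations \(\max (\tau _s,\tau _b)\) as the dominant term.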
While the reductions in execution time obtained by using static pivoting instead of partial pivoting were modest for the full-rank factorization, they are larger for the BLR factorization. Taking tminlet3M as an example, in full rank the single-precision factorization LU-IR variant is only 1.12
\(\times\) (136 s/121 s) faster when static pivoting is activated (see Table
4), whereas in BLR it is 1.26
\(\times\) (88 s/70 s) faster (see Table
5). These larger reductions are explained by the fact that in the BLR factorization, static pivoting also allows the panel reduction operation to be processed with low-rank operations [
7], which leads to a reduction of flops and thus a faster execution time.