These approximation techniques have opposing effects on performance: they reduce the time and memory consumption of the factorization, but they increase the number of refinement iterations.
5.4.1 BLR Factorization.
In Table
3, we present the execution time and memory consumption of four iterative refinement variants using a low-rank BLR factorization for different values of the compression threshold
\(\tau _b\). All variants provide a forward error on the solution equivalent to that of DMUMPS. If
\(\tau _b=\) “full-rank,” then the factorization is run without BLR; this is the standard factorization of Section
5.3. It should be noted that in this case the double-precision factorization LU-IR variant is equivalent to DMUMPS, and we will refer to it as DMUMPS in the text. We denote by “—” the cases where convergence is not reached and, for each matrix, we highlight in bold the execution times and memory consumptions that are within 10% of the best execution time or memory consumption. We choose to work with the default BLR settings of MUMPS, in which case the data in the active memory is not compressed with low-rank approximations. It should be noted, however, that MUMPS also offers the possibility to compress the data in the active memory; this would result in additional memory savings but would adversely affect the execution time, because it adds the compression overhead without reducing the operational complexity.
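To make the structure of these variants concrete, the following is a minimal dense sketch of LU-IR with a single-precision factorization (\(u_f=\) s) and double-precision refinement; NumPy/SciPy stand in for the sparse BLR-capable solver used in the experiments, and the function name and stopping test are ours.

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

def lu_ir(A, b, tol=1e-14, max_iter=100):
    """Minimal dense sketch of LU-IR with u_f = s: the factorization is
    computed in single precision and the refinement iterations are carried
    out in double precision. Illustrative stand-in for the sparse (BLR)
    factorization used in the experiments."""
    lu, piv = lu_factor(A.astype(np.float32))      # approximate factorization (u_f = s)
    x = lu_solve((lu, piv), b.astype(np.float32)).astype(np.float64)
    for it in range(1, max_iter + 1):
        r = b - A @ x                              # residual in double precision
        if np.linalg.norm(r, np.inf) <= tol * np.linalg.norm(b, np.inf):
            return x, it                           # converged
        d = lu_solve((lu, piv), r.astype(np.float32))   # correction via the LU factors
        x = x + d.astype(np.float64)               # update in double precision
    return x, max_iter
```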
The experimental results of Table
3 are in good agreement with the theoretical convergence conditions of Theorem
4.1 developed in Section
4. We can clearly see how the robustness of the presented variants is related to both the condition number
\(\kappa (A)\) of the matrix (reported in Table
1) and the BLR threshold
\(\tau _b\). Convergence is not achieved for excessively large values of the BLR threshold; the largest
\(\tau _b\) value for which convergence is achieved depends on the matrix condition number and, in general, is smaller for badly conditioned problems. The GMRES-IR variants, which are more robust, allow the BLR threshold to be pushed to larger values without breaking convergence. Note that every variant converges whenever the theoretical convergence condition is met. However, as the theoretical convergence conditions in Theorem
4.1 can be pessimistic, there are several cases where convergence is achieved even though the condition is not met.
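Schematically, and with all constants and norm choices omitted, these observations can be summarized by a hedged reading of Theorem 4.1 (our simplification, not its precise statement): the LU-IR variants can be expected to converge when
\[
\kappa (A)\,\bigl(\rho _n u_f + \max (\tau _s,\tau _b)\bigr) \lesssim 1,
\]
with \(\tau _s = 0\) when static pivoting is not used, as here, while the GMRES-IR variants obey a weaker condition, which is why they tolerate larger values of \(\tau _b\) for a given \(\kappa (A)\).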
The use of BLR with a good choice of compression threshold
\(\tau _b\) generally results in substantial reductions of the LU-IR and GMRES-IR execution times. As the BLR threshold increases, the operational complexity and, consequently, the execution time of the factorization and solve operations decrease; conversely, the number of iterations increases, up to the point where convergence may no longer be achieved. The optimal BLR threshold value that delivers the lowest execution time obviously depends on the problem. It must be noted that even though the GMRES-IR variants achieve convergence for larger
\(\tau _b\) values, doing so leads to an excessive number of iterations whose cost exceeds the improvement provided by BLR; as a result, these variants are always slower than the faster of the two LU-IR variants (
\(u_f=\) s or
\(u_f=\) d). Consequently, the single-precision factorization LU-IR variant generally achieves the best execution time on this set of problems, similarly to what was observed in Section
5.3, with a few exceptions. On perf009ar, the double-precision factorization LU-IR variant is the best because, as in the full-rank case (see Section
5.3.1), the BLR factorization is less than twice as fast in single precision as in double precision for the same
\(\tau _b\) value; additionally, a substantial number of iterations is needed to achieve convergence. It is worth mentioning that on this matrix the GMRES-IR variant with
\(u_g=u_p=\) d is faster than the single-precision factorization LU-IR variant (36.9 s versus 40.3 s) and consumes less memory than the double-precision factorization LU-IR variant (20.0 GB versus 37.1 GB). On CarBody25M, DMUMPS is the fastest variant, as in the full-rank case; this is because, on this problem, BLR does not achieve a good reduction of the operational complexity and, therefore, of the execution time.
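As a reminder of what the compression threshold controls, the sketch below compresses a single block to a low-rank product accurate to \(\tau _b\). It is illustrative only: we use a truncated SVD for clarity, whereas the solver's internal compression kernel and truncation rule may differ.

```python
import numpy as np

def compress_block(B, tau_b):
    """Sketch of BLR block compression: replace a block B by a rank-k
    product X @ Y.T such that ||B - X @ Y.T||_2 <= tau_b * ||B||_2.
    Illustrative only: an SVD stands in for the solver's internal
    rank-revealing compression kernel."""
    U, s, Vt = np.linalg.svd(B, full_matrices=False)
    k = int(np.sum(s > tau_b * s[0]))   # smallest rank meeting the target accuracy
    X = U[:, :k] * s[:k]                # absorb the singular values into X
    Y = Vt[:k, :].T
    return X, Y                         # stored in place of B when k is small
```

Larger \(\tau _b\) values yield smaller ranks \(k\), reducing flops and storage at the price of a larger perturbation of the factors, which is exactly the time/iteration trade-off observed above.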
As for the storage, the use of BLR leads to a different outcome from the full-rank case (see Section
5.3), where the single-precision factorization LU-IR variant is the best. This is due to the combination of two factors. First, when BLR is used, the relative weight of the active memory is higher, because it corresponds to data that is not compressed under our choice of parameters; consequently, the peak memory consumption is often reached during the factorization rather than during the iterative refinement. Second, the memory consumption of the factorization decreases monotonically as the BLR threshold is increased. As a result of these two effects, the GMRES-IR variants achieve the lowest memory consumption on this set of problems, because they can preserve convergence for larger values of
\(\tau _b\) than the LU-IR variants can. For example, on tminlet3M the GMRES-IR variant with
\(u_g=u_p=\) d consumes almost
\(15\%\) less memory than the LU-IR variant with
\(u_f=\) s (70.9 GB versus 82.4 GB), on thmgas the GMRES-IR variant with
\(u_g=u_p=\) d consumes almost
\(30\%\) less memory than the LU-IR variant with
\(u_f=\) s (43.7 GB versus 61.4 GB), and on matrix 24 the GMRES-IR variant with
\(u_g=u_p=\) d consumes more than
\(35\%\) less memory than the LU-IR variant with
\(u_f=\) d (41.8 GB versus 64.8 GB). It is worth pointing out that the value of
\(\tau _b\) for which GMRES-IR achieves the lowest possible memory consumption is not always the largest value for which convergence is still possible. This is because, when the number of iterations is large, the memory needed to store the Krylov basis may exceed the savings obtained with BLR. This problem can be overcome or mitigated by choosing an appropriate value for the
\(\tau _g\) threshold or, similarly, by using a restarted GMRES method; we leave this analysis for future work.
We finally compare the two GMRES-IR variants \(u_g=u_p=\) d and \(u_g=u_p=\) s. When \(u_g=u_p=\) s, GMRES-IR avoids casting the LU factors from single to double precision and thus reduces memory consumption compared with \(u_g=u_p=\) d. However, as explained above, the relative weight of the factors with respect to the active memory decreases as \(\tau _b\) increases, and so the reduction achieved by GMRES-IR with \(u_g=u_p=\) s shrinks until both variants achieve a similar memory consumption. On our matrix set, for the values of \(\tau _b\) where LU-IR with \(u_f=\) s does not converge, GMRES-IR with \(u_g=u_p=\) s does not achieve significant memory reductions compared with GMRES-IR with \(u_g=u_p=\) d (at best 7% on perf009ar, 18.6 GB versus 20.0 GB).
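For concreteness, this is a minimal dense sketch of the GMRES-IR variant with \(u_g=u_p=\) d, again with SciPy standing in for the sparse solver. Casting the factors to double precision before preconditioning is exactly the extra memory cost discussed above; the function name and tolerances are ours.

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve
from scipy.sparse.linalg import LinearOperator, gmres

def gmres_ir(A, b, tau_g=1e-4, tol=1e-14, max_iter=50):
    """Sketch of GMRES-IR with u_f = s and u_g = u_p = d: each correction
    equation A d = r is solved by GMRES preconditioned with the LU factors,
    which are cast from single to double precision (the memory overhead
    discussed in the text). tau_g plays the role of the inner tolerance."""
    n = A.shape[0]
    lu, piv = lu_factor(A.astype(np.float32))          # u_f = s
    lu_d = lu.astype(np.float64)                       # cast the factors: u_g = u_p = d
    M = LinearOperator((n, n), matvec=lambda v: lu_solve((lu_d, piv), v))
    x = lu_solve((lu_d, piv), b)
    for it in range(1, max_iter + 1):
        r = b - A @ x                                  # residual in double precision
        if np.linalg.norm(r, np.inf) <= tol * np.linalg.norm(b, np.inf):
            return x, it
        d, info = gmres(A, r, M=M, rtol=tau_g)         # 'rtol' in SciPy >= 1.12 ('tol' before)
        x = x + d
    return x, max_iter
```

With \(u_g=u_p=\) s, one would instead keep and apply the float32 factors directly, saving the memory of `lu_d` at the price of a less accurate preconditioner.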
5.4.2 Static Pivoting Factorization.
We now turn our attention to the use of static pivoting. We report in Table
4 the execution time and memory consumption of three iterative refinement variants for different values of the static pivoting threshold
\(\tau _s\). All variants are stopped when they reach a forward error on the solution equivalent to that of DMUMPS. If
\(\tau _s=\) “partial,” then the factorization is run with standard MUMPS threshold partial pivoting. It should be noted that in this case the double-precision factorization LU-IR variant is equivalent to DMUMPS.
Once again, the observed convergence behaviors are in good agreement with Theorem
4.1 as explained below. In the case of static pivoting, the execution time of the factorization does not depend on
\(\tau _s\); to minimize the overall solution time, the goal is therefore to achieve the fastest possible convergence. This is a complex issue: A smaller perturbation
\(\tau _s\) does not always mean a faster convergence, because the value of
\(\tau _s\) also directly impacts the growth factor
\(\rho _n\). Thus, there is an optimal value of
\(\tau _s\), which is clearly problem dependent, that leads to the fastest convergence by balancing the
\(\rho _n u_f\) and
\(\tau _s\) terms in the convergence condition. To confirm this, Table
4 reports
\(\underline{\rho _n}\), a lower bound on the true growth factor that can be used as a cheap but rough indicator of the behavior of
\(\rho _n\) (the true
\(\rho _n\) would be extremely expensive to compute for such large matrices). There is a clear trend of
\(\underline{\rho _n}\) decreasing as
\(\tau _s\) increases, which explains, for example, why on lfm_aug5M convergence is achieved for large
\(\tau _s\). For many matrices in our set, such as tminlet3M in Table
4, static pivoting slightly accelerates the factorization without excessively deteriorating the convergence, and so allows for modest time gains overall. However, for some matrices such as lfm_aug5M, static pivoting requires many iterations and can lead to significant slowdowns compared with partial pivoting. It is, however, interesting to note that, on lfm_aug5M, if the use of partial pivoting is impossible (for instance, because the available solver does not support it), then the GMRES-IR variant provides the best overall execution time.
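To fix ideas, the following dense sketch shows the two ingredients of this discussion: an LU factorization without row exchanges in which tiny pivots are replaced by a perturbation of size \(\tau _s\Vert A\Vert\), and the cheap growth-factor lower bound computed from the final factors. Both are our illustrative reconstructions; the exact perturbation rule and the precise definition of \(\underline{\rho _n}\) used by the solver may differ.

```python
import numpy as np

def lu_static_pivoting(A, tau_s):
    """Sketch of LU with static pivoting: no row exchanges; any pivot with
    magnitude below tau_s * ||A|| is replaced by a perturbation of that
    size, preserving its sign. Dense analogue of the sparse kernel."""
    F = A.astype(np.float64).copy()
    n = F.shape[0]
    thresh = tau_s * np.linalg.norm(A, np.inf)
    for k in range(n):
        if abs(F[k, k]) < thresh:
            F[k, k] = np.copysign(thresh, F[k, k])   # static pivot perturbation
        F[k+1:, k] /= F[k, k]                        # column of L
        F[k+1:, k+1:] -= np.outer(F[k+1:, k], F[k, k+1:])
    return F                                         # L (unit diagonal) and U packed together

def growth_lower_bound(A, F):
    """Cheap lower bound on the growth factor rho_n, here taken as
    max|u_ij| / max|a_ij| (our reading of underline-rho_n); the true
    rho_n involves all intermediate entries and is far more expensive."""
    return np.abs(np.triu(F)).max() / np.abs(A).max()
```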
5.4.3 BLR Factorization with Static Pivoting.
Finally, in Table
5, we present the execution time and memory consumption of three iterative refinement variants (the same as in Section
5.4.2) for different values of the BLR compression threshold
\(\tau _b\) and a fixed value of the static pivoting perturbation
\(\tau _s\). All variants are stopped when they reach a forward error on the solution equivalent to that of DMUMPS. If
\(\tau _s=\) “partial,” then the factorization is run with standard MUMPS threshold partial pivoting and the results are equivalent to the BLR results of Section
5.4.1.
Theorem
4.1 applied to the case where BLR and static pivoting are used together states that the convergence conditions are affected by the larger of the two perturbations,
\(\max (\tau _s, \tau _b)\), and by the term
\(\rho _n u_f\), which depends on the growth factor. Our experiments confirm this: For values of
\(\tau _s\) or
\(\tau _b\) for which a given variant did not converge with BLR or static pivoting alone, it still does not converge when they are combined; conversely, variants that converged for BLR and static pivoting alone still converge when these two approximations are used together. lfm_aug5M with
\(\tau _s=10^{-6}\) illustrates an interesting point about the error bound
\(\max (\tau _s,\tau _b)+\rho _n u_f\): convergence is only achieved for the variant that uses a double-precision factorization (
\(u_f=\) d), even for values of
\(\tau _b\) that are much larger than the unit roundoff of single precision. This shows that the rule of thumb that the factorization precision should be chosen as low as possible as long as
\(u_f \le \tau _b\) does not hold in the presence of large element growth, since a smaller value of
\(u_f\) can be beneficial to absorb a particularly large
\(\rho _n\).
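To see why, consider illustrative orders of magnitude (our numbers, chosen only to expose the mechanism): with \(u_f = u_{\rm s} \approx 6\times 10^{-8}\), \(\tau _b = \tau _s = 10^{-6}\), and a growth factor \(\rho _n = 10^{4}\), we get
\[
\rho _n u_f \approx 6\times 10^{-4} \gg \max (\tau _s,\tau _b) = 10^{-6},
\]
so the growth term dominates the bound even though \(u_f \le \tau _b\) holds; switching to \(u_f = u_{\rm d} \approx 1.1\times 10^{-16}\) shrinks it to about \(10^{-12}\) and leaves the perturbations \(\max (\tau _s,\tau _b)\) as the dominant term.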
While the reductions in execution time obtained by using static pivoting instead of partial pivoting were modest for the full-rank factorization, they are larger for the BLR factorization. Taking tminlet3M as an example, in full rank the single-precision factorization LU-IR variant is only 1.12
\(\times\) (136 s/121 s) faster when static pivoting is activated (see Table
4), whereas in BLR it is 1.26
\(\times\) (88 s/70 s) faster (see Table
5). These larger reductions are explained by the fact that in the BLR factorization, static pivoting also allows the panel reduction operation to be processed with low-rank operations [
7], which leads to a reduction of flops and thus a faster execution time.