
O(N) distributed direct factorization of structured dense matrices using runtime systems.

Sameer Deshmukh, School of Computing, Tokyo Institute of Technology, Japan, sameer.deshmukh@rio.gsic.titech.ac.jp
Rio Yokota, Global Scientific Information and Computing Center, Tokyo Institute of Technology, Japan, rioyokota@gsic.titech.ac.jp
George Bosilca, Innovative Computing Laboratory, University of Tennessee at Knoxville, United States of America, bosilca@icl.utk.edu
Qinxiang Ma, School of Computing, Tokyo Institute of Technology, Japan, ma@rio.gsic.titech.ac.jp

Structured dense matrices result from boundary integral problems in electrostatics and geostatistics, and also from Schur complements in sparse preconditioners such as multi-frontal methods. Exploiting the structure of such matrices can reduce the time for dense direct factorization from $O(N^3)$ to $O(N)$. The Hierarchically Semi-Separable (HSS) matrix is one such low rank matrix format that can be factorized using a Cholesky-like algorithm called ULV factorization. The HSS-ULV algorithm is highly parallel because it removes the dependency on trailing sub-matrices at each HSS level. However, a key merge step that links two successive HSS levels remains a challenge for efficient parallelization. In this paper, we use the asynchronous runtime system PaRSEC with the HSS-ULV algorithm. We compare our work with STRUMPACK and LORAPO, both state-of-the-art implementations of dense direct low rank factorization, and achieve up to 2x better factorization time on up to 128 nodes of Fugaku, with similar or better accuracy, for matrices arising from a diverse set of applications.

CCS Concepts: • Mathematics of computing → Solvers; • Computing methodologies → Massively parallel algorithms; • Computing methodologies → Shared memory algorithms

Keywords: low rank approximation, runtime systems, HSS matrix, distributed

ACM Reference Format:
Sameer Deshmukh, Rio Yokota, George Bosilca, and Qinxiang Ma. 2023. O(N) distributed direct factorization of structured dense matrices using runtime systems. In 52nd International Conference on Parallel Processing (ICPP 2023), August 07–10, 2023, Salt Lake City, UT, USA. ACM, New York, NY, USA, 10 pages. https://doi.org/10.1145/3605573.3605606

1 INTRODUCTION

Various scientific problems in fluid dynamics, structural mechanics, and electromagnetics are governed by partial differential equations (PDEs), which result in a sparse matrix when discretized with finite difference and finite element methods. The boundary element method (BEM) has an advantage over such volume discretization methods, since it only discretizes the boundary and the number of elements can be drastically reduced. However, BEM requires the solution of a dense linear system, which has cubic complexity if solved directly. For very large dense matrices, the Cholesky factorization can be computed using distributed memory implementations from SCALAPACK, DPLASMA [6], SLATE [9] or Elemental [11]. Although modern parallel computer architectures offer significant speedups due to the use of multiple threads, the cubic complexity of dense factorization remains prohibitive for large matrices.

Table 1: Comparison of dense direct factorization methods depending on the matrix format, factorization algorithm and distributed programming paradigm.
Library | Format | Algorithm | Compute complexity | Distributed paradigm | Comm. complexity
DPLASMA [6] | Dense | Tile Cholesky | $O(N^3)$ | Asynchronous | $O(N^3)$
SLATE [9] | Dense | Panel Cholesky | $O(N^3)$ | Fork-join | $O(N^3)$
LORAPO [7] | BLR | Tile Cholesky | $O(N^2)$ | Asynchronous | $O(N^3)$
$\mathcal {H}$-LU [4] | $\mathcal {H}$-matrix | $\mathcal {H}$-LU | $O(N \log N)$ | Asynchronous | $O(N \log N)$
STRUMPACK [13] | HSS | ULV | $O(N)$ | Fork-join | $O(N^2)$
Ma et al. [10] | $\mathcal {H}^2$-matrix | Modified ULV | $O(N)$ | Fork-join | $O(N)$
$\mathcal {H}$ATRIX-DTD | HSS | ULV | $O(N)$ | Asynchronous | $O(N)$

Table 1 shows some of the state-of-the-art implementations of distributed dense direct factorization using various matrix formats and algorithms. Going from top to bottom, we can see libraries such as DPLASMA [6] and SLATE [9] (and SLATE's predecessor SCALAPACK) making use of standard dense matrix formats and their associated algorithms that do not make use of low rank approximation. Although SLATE and DPLASMA use the same dense Cholesky factorization algorithm, DPLASMA makes use of asynchronous distributed execution whereas SLATE uses fork-join parallelism. The remaining libraries all make use of low rank representations of the dense matrix. This is achieved by compressing the off-diagonal blocks in the dense matrix that correspond to far interactions in the physical domain. The compression can be performed using a suitable algorithm such as Randomized Singular Value Decomposition (RSVD) or Adaptive Cross Approximation (ACA) [12]. Hierarchical matrices exploit this property to reduce the cost of computation from $O(N^3)$ to almost $O(N^2)$ or even $O(N)$. The complexity of the factorization is determined by the format of the hierarchical matrix and the algorithm used for the factorization. Various formats such as BLR [2], BLR2 [3], HODLR [1], $\mathcal {H}$-matrix, HSS [8] and $\mathcal {H}^2$-matrix [5] have been proposed, varying in their conditions of admissibility and their use of nested bases.

The BLR format used by LORAPO [7] subdivides the dense matrix into blocks of uniform size and approximates each block individually. The BLR format reduces the time complexity of the tile Cholesky factorization to $O(N^2)$. The use of an asynchronous runtime system such as PaRSEC allows LORAPO [7] to prioritize the execution of the critical path of the tile Cholesky factorization and resolve off-diagonal dependencies asynchronously. Large, adjacent low rank blocks of the BLR format can be combined to form the multi-level $\mathcal {H}$-matrix format. The tile LU (or Cholesky) factorization can then be extended to the $\mathcal {H}$-LU (or $\mathcal {H}$-Cholesky) algorithm, which costs $O(N \log N)$. The use of an asynchronous runtime system with $\mathcal {H}$-LU has been shown to achieve good strong scaling for distributed computation, since it allows for greater parallelism between the recursive blocks of the $\mathcal {H}$-matrix.

The use of nested bases in formats such as BLR2, HSS and $\mathcal {H}^2$-matrix can be combined with the ULV factorization [8] to further reduce the time complexity of factorization to close to $O(N)$. The ULV factorization of the HSS matrix (HSS-ULV) exploits the nested basis property to remove the dependency on the off-diagonal blocks during the Cholesky factorization. This means that there is no longer a need to perform the triangular solve and trailing sub-matrix update, which leads to an embarrassingly parallel factorization within each level of the HSS matrix. Therefore, dependencies only exist between the levels. Distributed HSS-ULV factorization has been implemented by STRUMPACK [13] using the fork-join programming paradigm, a consequence of relying on SCALAPACK for the computation. STRUMPACK [13] distributes each block of the HSS matrix with a block cyclic distribution and relies on collective communication to shuffle data. Ma et al. [10] extend the ULV factorization to the $\mathcal {H}^2$-matrix by modifying it to precompute the fill-ins before the factorization. Even though the $\mathcal {H}^2$-matrix has off-diagonal dense blocks, the method from Ma et al. [10] is able to achieve an embarrassingly parallel factorization of each level by performing the factorization twice: once for precomputing the fill-ins and once for the actual factorization. Although this method is highly parallel, the fact that it factorizes twice results in a large overhead.

In this paper, we propose $\mathcal {H}$ATRIX-DTD, an implementation of the HSS-ULV factorization with the PaRSEC runtime system. $\mathcal {H}$ATRIX-DTD uses the HSS-ULV algorithm to factorize matrices arising from a variety of Green's functions with comparable accuracy to LORAPO and STRUMPACK. We choose these codes as references because they share certain traits with our code, besides being the most popular libraries in this field. Similar to our code, LORAPO uses the PaRSEC runtime system, but its matrix structure is BLR. Conversely, STRUMPACK uses the HSS structure like $\mathcal {H}$ATRIX-DTD, but uses a bulk-synchronous model for parallelism instead of a runtime system. By comparing with these two references, we can isolate the effect of choosing HSS over BLR from the effect of using a runtime system over a bulk-synchronous approach. The HSS-ULV factorization is computationally cheaper than the BLR-Cholesky factorization, and hence $\mathcal {H}$ATRIX-DTD can outperform LORAPO while making use of the same runtime system. The use of an asynchronous runtime system such as PaRSEC for handling the communication and dependencies between successive levels of the HSS-ULV leads to better overlap of communication and computation, and hence $\mathcal {H}$ATRIX-DTD can outperform STRUMPACK, which uses a similar HSS-ULV algorithm. As a result, we experimentally validate two key hypotheses about the factorization algorithms and distributed memory implementations of low rank matrix formats in this paper:

  1. The use of the HSS matrix format and HSS-ULV algorithm has lower computational complexity than the BLR-tile Cholesky algorithm implemented by LORAPO  [7].
  2. The use of an asynchronous runtime system such as PaRSEC leads to a lower overhead of communication than the fork-join parallelism implemented by STRUMPACK  [13].

We experimentally show that $\mathcal {H}$ATRIX-DTD can outperform both LORAPO and STRUMPACK [13] over a large number of nodes. $\mathcal {H}$ATRIX-DTD shows good weak scaling efficiency and achieves up to 2x faster factorization on up to 128 nodes for a variety of Green's functions. We first introduce our notation and construction of HSS matrices, and then elaborate on the HSS-ULV algorithm. We then describe our implementation of the HSS-ULV using the PaRSEC runtime system. Finally, we demonstrate with rigorous experimental evidence the performance of $\mathcal {H}$ATRIX-DTD against STRUMPACK [13] and LORAPO.

2 CONSTRUCTION AND NOTATION

A symmetric positive definite BLR2 matrix can be constructed from a block dense matrix as shown in Fig. 1. A single block of this matrix at index (row, column) is denoted by $A_{row,column}$. We use a single shared basis, denoted by $U_{row}$, for the admissible blocks in a row. For example, the block $A_{2,1}$ in Fig. 1 is denoted as

\begin{align} A_{2,1} = U_2 \cdot S_{2,1} \cdot U_1^T \end{align}
(1)
where $S_{2,1}$ denotes the skeleton block shown in green.

The shared basis for each column is generated by concatenating the admissible blocks in the column, denoted by $A_{+,0}$. The shared basis can then be obtained by computing a pivoted QR factorization

Math 26
(2)
where the S and R superscripts denote the skeleton part and the redundant part of the basis, respectively. In order to make it convenient to represent the ULV factorization, we permute the skeleton and redundant parts as shown in Eq. (3).
Math 27
(3)
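
As an illustration of this construction step (not code from the actual library), the following numpy/scipy sketch builds a shared column basis from a stacked block $A_{+,0}$ via a pivoted QR and splits it into skeleton and redundant parts. The block sizes and the synthetic low rank data are assumptions for the example.

```python
import numpy as np
from scipy.linalg import qr

# Minimal sketch: shared column basis for column 0 of a weakly admissible
# BLR2 matrix, obtained via a pivoted QR of the stacked admissible blocks.
rng = np.random.default_rng(0)
n, rank = 64, 8

# Stack the admissible (off-diagonal) blocks of column 0; here they are
# faked as numerically rank-8 random data with n columns.
A_plus_0 = rng.standard_normal((3 * n, rank)) @ rng.standard_normal((rank, n))

# A column basis spans the row space of A_{+,0}, so factorize its transpose.
Q, R, piv = qr(A_plus_0.T, mode="full", pivoting=True)

U_S = Q[:, :rank]   # skeleton part of the basis (kept)
U_R = Q[:, rank:]   # redundant part (nullified later by the ULV complements)

# The skeleton basis reproduces the admissible blocks up to truncation error.
err = np.linalg.norm(A_plus_0 - (A_plus_0 @ U_S) @ U_S.T) / np.linalg.norm(A_plus_0)
print(f"relative truncation error of the shared basis: {err:.2e}")
```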

Figure 1: Construction of an SPD BLR2 matrix from a block dense matrix.

Any dense block $A_{i,j}$ of the BLR2 matrix in Fig. 1 can be represented as Eq. (4).

Math 28
(4)
Similarly, any low rank block $A_{i,j}$ can be represented as Eq. (5).
Math 29
(5)

Low rank matrix formats that have dense blocks in their off-diagonals are termed strongly admissible, and those that have dense blocks only on the diagonal are termed weakly admissible. The notion of the weakly admissible BLR2 matrix described above can be extended to the HSS matrix. The HSS matrix introduces multiple levels in the matrix by sharing the basis between levels. This should not be confused with the recursive hierarchical structure of the HODLR [1] matrix, which does not share the basis but instead uses recursive low rank blocks in the off-diagonals.

We introduce the notion of a level in order to represent the blocks of the HSS matrix at various levels. At the leaf level, the dense block $A_{i,j}$ of the BLR2 matrix is represented as $A_{level;i,j}$ for the HSS matrix. The shared bases at the leaf level use a similar notation and are denoted by $U_{level;i}$. Fig. 2 introduces the notation and construction of a 2-level HSS matrix from the BLR2 matrix shown in Fig. 1. As an example, the block $A_{1;1,0}$ can be represented with the nested basis as shown in Eq. (6).

Figure 2: HSS construction and notation.
Math 30
(6)

3 THE ULV FACTORIZATION ALGORITHM

BLR2 and HSS matrices can be factorized with the ULV factorization. The ULV factorization can be thought of as a modified Cholesky factorization where L represents the lower triangular dense blocks and U and V represent the bases. The ULV factorization works on the principle of nullifying the low rank off-diagonal blocks by multiplying each row and column with its shared bases. This means that there is no need to perform the triangular solve and trailing sub-matrix updates for the admissible blocks.

Multiplication of the dense block shown in Eq. (4) with its respective row and column basis from Eq. (3) leads to

Math 31
(7)
Likewise, multiplication of the low rank block from Eq. (5) leads to Eq. (8).
Math 32
(8)
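
The nullification principle behind Eqs. (7) and (8) can be checked numerically: multiplying a low rank block of the form $U_i^S S_{i,j} {U_j^S}^T$ on the left and right by the permuted bases of its row and column leaves only the SS corner nonzero. The sketch below uses randomly generated orthogonal bases and sizes chosen purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 32, 4  # block size and rank (illustrative)

def permuted_basis(n, k, rng):
    # Random orthogonal basis, permuted as [U^R | U^S] in the spirit of Eq. (3).
    Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
    return np.hstack([Q[:, k:], Q[:, :k]])  # redundant part first, skeleton last

Ui_hat = permuted_basis(n, k, rng)
Uj_hat = permuted_basis(n, k, rng)
Ui_S, Uj_S = Ui_hat[:, -k:], Uj_hat[:, -k:]

S_ij = rng.standard_normal((k, k))
A_ij = Ui_S @ S_ij @ Uj_S.T            # low rank admissible block

A_hat = Ui_hat.T @ A_ij @ Uj_hat       # multiply row and column by the bases

print(np.allclose(A_hat[:-k, :], 0))       # RR and RS parts vanish   -> True
print(np.allclose(A_hat[:, :-k], 0))       # SR part vanishes          -> True
print(np.allclose(A_hat[-k:, -k:], S_ij))  # SS corner equals S_{i,j}  -> True
```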

As shown in Sec. 2, the HSS matrix is a multi-level matrix format where each level consists of a single BLR2 matrix. We first introduce the ULV algorithm for the BLR2 matrix in Sec. 3.1. The BLR2-ULV can then be computed at each level of the HSS matrix in order to obtain the HSS-ULV algorithm as shown in Sec. 3.2.

3.1 Weak admissibility BLR2-ULV factorization

Alg. 1 summarizes the BLR2-ULV with weak admissibility. The block diagonal matrix $U_F$ on line 1 is composed of the basis matrices of each row of the BLR2 matrix, as shown in Eq. (9). The resulting product $\hat{A}$ has dense and low rank blocks split into RR, RS and SS parts as shown in Eq. (7) and Eq. (8), respectively. This is demonstrated in Fig. 3 on a 2x2 BLR2 matrix.

Figure 3: ULV factorization of a weakly admissible BLR2 matrix.
\begin{align} U_F = \begin{bmatrix} U_0 & & \\ & U_1 & \\ & & \ddots \end{bmatrix} \end{align}
(9)

The partial Cholesky factorization of $\hat{A}$ at line 2 works on the RR, SR and SS blocks of each diagonal block $\hat{A}_{i,i}$. The partial factorization is only performed on the diagonal blocks as a result of the multiplication with the complements in the preceding step:

\begin{align} L_{i,i}^{RR}{L_{i,i}^{RR}}^{T} \leftarrow Cholesky(\hat{A}_{i,i}^{RR}) \end{align}
(10)
\begin{align} L_{i,i}^{SR} \leftarrow {{L_{i,i}^{RR}}^T}^{-1} \cdot \hat{A}_{i,i}^{SR} \end{align}
(11)
\begin{align} \hat{A}_{i,i}^{SS} \leftarrow \hat{A}_{i,i}^{SS} - L_{i,i}^{SR} \cdot {L_{i,i}^{SR}}^T \end{align}
(12)
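
A numpy sketch of this partial factorization on a single diagonal block, split into RR, SR and SS parts as in Eqs. (10)–(12), is shown below; the block sizes and the right-looking form of the triangular solve are illustrative assumptions.

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

rng = np.random.default_rng(2)
n, k = 32, 4                    # block size and rank (illustrative)
r = n - k                       # size of the redundant part

# SPD stand-in for a diagonal block already multiplied by the complements.
M = rng.standard_normal((n, n))
A_hat = M @ M.T + n * np.eye(n)
A_RR, A_SR, A_SS = A_hat[:r, :r], A_hat[r:, :r], A_hat[r:, r:]

# Eq. (10): Cholesky of the RR part.
L_RR = cholesky(A_RR, lower=True)
# Eq. (11): triangular solve against the SR coupling block.
L_SR = solve_triangular(L_RR, A_SR.T, lower=True).T   # L_SR = A_SR * L_RR^{-T}
# Eq. (12): Schur complement update confined to the small SS corner.
A_SS_new = A_SS - L_SR @ L_SR.T

# The updated SS corner equals the Schur complement of A_RR in the block.
schur = A_SS - A_SR @ np.linalg.inv(A_RR) @ A_SR.T
print(np.allclose(A_SS_new, schur))   # -> True
```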

Line 3, demonstrated by Fig. 4, involves a permutation of the partially factorized matrix, which brings all the SS blocks $\hat{A}^{SS}$ to the lower right corner. This is followed by a dense Cholesky factorization of a smaller matrix of order $NB \times rank$. The Cholesky factorization is then performed as follows:

\begin{align} \hat{L}^{SS}{\hat{L}^{SS^T}} \leftarrow Cholesky \left(\begin{bmatrix} \hat{A}_{00}^{SS} & 0 \\ \hat{A}_{10}^{SS} & \hat{A}_{11}^{SS} \end{bmatrix} \right) \end{align}
(13)

Finally, the matrix A is expressed as the following factorization, where $\hat{L}$ represents a partially factorized lower diagonal matrix:

Math 42
(14)

The solve step for the BLR2-ULV is shown by:

Math 43
(15)

Algorithm 1

Figure 4: Permutation and Cholesky factorization of partially factorized BLR2 matrix.

The final dense matrix in Alg. 1 can become large for a large rank and problem size. Therefore, even though the leaf level blocks of size $n_{leaf}$ are factorized in $O(N)$ time, the overall complexity of the algorithm can approach $O(N^2)$. The HSS-ULV in the next section shows how the $O(N)$ time complexity can be preserved by exploiting the nested basis.

3.2 HSS-ULV factorization algorithm

Algorithm 2

Alg. 2 describes the ULV factorization of an HSS matrix. The HSS-ULV applies the BLR2-ULV algorithm to each level of the HSS matrix. However, instead of factorizing a dense matrix in the merge step (line 3 in Alg. 1), the HSS-ULV algorithm iteratively applies the same procedure to the leftover blocks.

Fig. 5 illustrates the steps taken by the HSS-ULV for 2 iterations of the factorization of a 2 level HSS matrix. A factorization and permutation of the $\hat{A}_2$ matrix leads to the generation of another HSS matrix $P^T_{2} \cdot \hat{A}^{SS}_{2} \cdot P_{2}$, whose diagonal block has a dimension of $2 \times rank$. Another iteration of the ULV factorization of this smaller HSS matrix results in the matrix $A_0$ of size $2 \times rank$. Finally, a dense Cholesky factorization is performed on $A_0$ at line 6 in Alg. 2. Unlike the merge step of the BLR2-ULV, the final resulting dense matrix for the HSS-ULV is much smaller. This leads to $O(N)$ time complexity for this algorithm.

The final factorized form of the matrix A after the HSS-ULV factorization is very similar to that of the BLR2-ULV in Eq. (14). Each lower triangular matrix $\hat{L}$ is permuted and multiplied by a $U_F$ corresponding to that level of the matrix. The fully factorized form of the HSS-ULV is shown in Eq. (16).

Math 47
(16)

Figure 5: Diagonal product, factorization and permutation steps of the HSS-ULV factorization for a 2 level HSS matrix.

The solve step of the HSS-ULV is similar to that of the BLR2-ULV from Eq. (15). Multi-level terms, similar to those in the factorization, are again introduced. This solve step is shown below:

Math 48
(17)

4 DISTRIBUTED MEMORY EXECUTION

In this section, we elaborate on the distributed memory implementation of $\mathcal {H}$ATRIX-DTD, and outline the differences from STRUMPACK and LORAPO. Sec. 4.1 provides a generic description of runtime systems such as PaRSEC. Sec. 4.2 then shows how we map the HSS-ULV algorithm to PaRSEC. Finally, Sec. 4.3 gives an overview of the process distribution strategies used by $\mathcal {H}$ATRIX-DTD, STRUMPACK and LORAPO.

4.1 Runtime systems.

Fig. 6 shows a 3x3 block Cholesky factorization as a directed acyclic graph (DAG). A runtime system such as PaRSEC works with such a graph representation of the algorithm in order to factorize the matrix. Each node in the Directed Acyclic Graph in Fig. 6 is a ‘task’ with dependencies on some other tasks (nodes) in the graph, shown by the arrows. A task cannot begin execution unless all preceding tasks it depends on have finished their execution. Tasks that do not depend on each other can be executed in parallel. The color of the boundary of each node in the DAG corresponds to the block of the dense matrix that the node updates as a result of the computation taking place in the node.

To illustrate, consider the GEMM task inside the dotted black box. This task has READ dependencies on the (2,1) and (3,1) blocks, and a WRITE dependency on (3,2). The (2,1) and (3,1) blocks that the GEMM must read are WRITE dependencies for the two TRSM tasks that GEMM depends on. Unless both the TRSM tasks before the GEMM finish execution and hand over their blocks to GEMM, it cannot begin execution. Notice that the other two SYRK tasks within the dotted black box also have their respective READ dependencies (2,1) and (3,1) satisfied by the preceding TRSM tasks. Since they have no dependency on the GEMM, the two SYRK tasks and the GEMM inside the dotted black box can be executed in parallel.
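
To make the dependency structure concrete, the sketch below writes a 3x3 blocked Cholesky DAG as a plain Python dictionary (0-indexed task names chosen for illustration; this is not PaRSEC code) and prints which tasks become ready together.

```python
# Task graph of a 3x3 blocked (right-looking) Cholesky, independent of any
# runtime system: task -> set of tasks it must wait for.
deps = {
    "POTRF(0)":   set(),
    "TRSM(1,0)":  {"POTRF(0)"},
    "TRSM(2,0)":  {"POTRF(0)"},
    "SYRK(1,1)":  {"TRSM(1,0)"},                 # update of diagonal block (1,1)
    "GEMM(2,1)":  {"TRSM(1,0)", "TRSM(2,0)"},    # update of off-diagonal block (2,1)
    "SYRK(2,2)":  {"TRSM(2,0)"},                 # first update of block (2,2)
    "POTRF(1)":   {"SYRK(1,1)"},
    "TRSM(2,1)":  {"POTRF(1)", "GEMM(2,1)"},
    "SYRK(2,2)b": {"TRSM(2,1)", "SYRK(2,2)"},    # second update of block (2,2)
    "POTRF(2)":   {"SYRK(2,2)b"},
}

done, wave = set(), 0
while len(done) < len(deps):
    # every task whose predecessors have all finished is ready to run
    ready = [t for t, d in deps.items() if t not in done and d <= done]
    print(f"wave {wave}: {ready}")
    done |= set(ready)
    wave += 1
```

Running this prints the GEMM and the two SYRK updates in the same wave, which is exactly the parallelism available inside the dotted box of Fig. 6.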

Figure 6: Directed Acyclic Graph (DAG) representation of the block Cholesky factorization of a 3x3 dense matrix. Each node has an associated computation and depends on certain blocks of the matrix (red or green boxes). The red boxes represent RW (Read-Write) dependencies and green boxes represent R (Read) dependencies. The nodes in the dotted blue box shows nodes that can be executed in parallel.

4.2 Distributed HSS-ULV using the PaRSEC runtime system.

Fig. 8 shows the mapping of tasks in PaRSEC for the HSS-ULV algorithm shown in Alg. 2. We show how the steps from Fig. 5 are expressed as tasks within PaRSEC. The HSS-ULV algorithm operates on the dense, skeleton and basis block matrices of the HSS matrix. These blocks form the dependencies between the tasks. The "Diagonal Product" step on line 2 of Alg. 2 results in zeroing of the off-diagonal low rank blocks, which makes the partial factorization of each dense block independent of all the other blocks on the same level. This means that the dependencies in the HSS-ULV only come from the merge step on line 4 of Alg. 2.

Figure 7: Process distribution used by $\mathcal {H}$ATRIX-DTD for a 2 level HSS matrix. Each distinct color represents a separate process. The dense, skeleton and basis blocks are distributed in a row cyclic process distribution at every level.

PaRSEC is able to exploit the inherently parallel factorization of each level. The "Diagonal Product" and "Partial Factorization" steps of each dense block on the same level can be executed in an embarrassingly parallel manner. The dependency between the levels exists as a result of the "Merge" step. As a result of asynchronous execution, the "Merge" step can begin right after the corresponding partial factorizations it depends on have finished.

This asynchronous approach is in contrast to STRUMPACK, where each level executes only after the entire previous level has finished factorization. Although the use of SCALAPACK allows STRUMPACK to perform each level of the HSS-ULV in an embarrassingly parallel manner (assuming different compute resources), the use of fork-join parallelism for the "Merge" step means that the parent level cannot begin execution unless the child level has completely finished.

Figure 8: Mapping of the HSS-ULV algorithm to tasks within PaRSEC for a 2 level HSS matrix. Each block in this diagram represents some computation and its associated dependencies. These tasks represent the factorization of only the first two nodes of the leaf and its subsequent merging into the upper level. Similar steps are followed for other nodes and levels.

PaRSEC provides multiple Domain Specific Languages (DSLs) for expressing algorithms as DAGs, including Dynamic Task Discovery (DTD) and the Parameterized Task Graph (PTG). The DTD interface is similar to a distributed version of OpenMP task programming: it submits all tasks on every process, which allows every process to gather all the necessary knowledge about the algorithm. However, this means that each process discovers the entire task graph, trims the non-local tasks by removing those that do not interact with local tasks, converts those that depend on data from other processes into communications, and finally executes only the tasks that are local to it according to their data dependencies. The PTG interface is a custom DSL that allows for a concise, parameterized description of the algorithm, and supports an event-driven execution where only local tasks are generated by each process and communications are automatically inferred from the tasks' dependencies. The PTG interface results in lower runtime overhead, especially when the number of tasks and processes is large, since the entire task graph does not have to be generated on every process. In this paper we focus our efforts on a DTD-based implementation of the algorithm, resulting in $\mathcal {H}$ATRIX-DTD.
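
The following sketch illustrates the trimming behaviour of the DTD model described above using plain Python data structures; the task names, block names, and owner map are hypothetical, and this is not the PaRSEC DTD API.

```python
# Conceptual illustration of DTD-style trimming, not PaRSEC's actual API.
# Each task declares the block it writes and the blocks it reads.
tasks = {
    "FACTORIZE(0)": {"writes": "D0", "reads": []},
    "FACTORIZE(1)": {"writes": "D1", "reads": []},
    "MERGE(0)":     {"writes": "M0", "reads": ["D0", "D1"]},
}
owner = {"D0": 0, "D1": 1, "M0": 0}   # hypothetical block-to-process mapping

def local_view(rank):
    """Every rank walks the full graph, keeps its own tasks, and records
    which blocks must be received from (or sent to) other ranks."""
    mine, recv, send = [], [], []
    for name, t in tasks.items():
        if owner[t["writes"]] == rank:
            mine.append(name)
            recv += [b for b in t["reads"] if owner[b] != rank]
        elif any(owner[b] == rank for b in t["reads"]):
            send += [b for b in t["reads"] if owner[b] == rank]
    return mine, recv, send

for rank in (0, 1):
    print(rank, local_view(rank))
# rank 0 executes FACTORIZE(0) and MERGE(0) and receives D1;
# rank 1 executes FACTORIZE(1) and sends D1 to the merge.
```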

4.3 Process distribution.

Fig. 7 shows the process distribution strategy of $\mathcal {H}$ATRIX-DTD for the HSS matrix from Fig. 2. Unlike libraries such as SCALAPACK and Elemental [11], which make use of block-cyclic and element-cyclic process distributions respectively, $\mathcal {H}$ATRIX-DTD uses a row-cyclic process distribution, which is a better fit for the HSS-ULV with PaRSEC. Each block of the HSS matrix that is involved in the HSS-ULV is assigned to a single task. This keeps the number of tasks small, which means that the runtime system overhead is better controlled. The blocks owned by P0 and P1 at level 2 are merged into P0 at level 1 as a result of the "Merge" step in Alg. 2. Merging blocks into a single process is necessary in order to balance the number of tasks against the time consumed by each task: too many tasks of very short duration would be generated if we used a block-cyclic distribution for the merged block $A_{1;0,0}$.
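
A minimal sketch of such a row-cyclic assignment is shown below, under the assumption that level l of the HSS matrix has 2^l block rows and that block row i is simply mapped to process i mod P (an illustrative simplification of Fig. 7).

```python
def owner(level: int, block_row: int, nprocs: int) -> int:
    """Row-cyclic owner of `block_row` at `level`, which has 2**level block rows."""
    return block_row % nprocs

nprocs, leaf_level = 4, 2
for level in range(leaf_level, 0, -1):
    rows = 2 ** level
    mapping = {i: owner(level, i, nprocs) for i in range(rows)}
    print(f"level {level}: {mapping}")
# level 2: {0: 0, 1: 1, 2: 2, 3: 3}
# level 1: {0: 0, 1: 1}  -> the children of block row 0 (owned by P0 and P1)
#                           are merged onto P0, as described for Fig. 7.
```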

In contrast to the row cyclic distribution used in $\mathcal {H}$ATRIX-DTD, STRUMPACK [13] makes use of a block cyclic distribution for every matrix block of the HSS matrix. This is necessary since STRUMPACK relies on SCALAPACK for computation on the matrix blocks. Such a process distribution would not be effective in $\mathcal {H}$ATRIX-DTD because it would generate too much communication between tasks on the same row (in a block cyclic distribution, tasks on the same row are placed on different processes according to the process grid).

LORAPO [7] implements a tile Cholesky algorithm on a block low rank matrix. This means that resolving the off-diagonal triangular solves and trailing sub-matrix updates is the primary bottleneck in the execution of the critical path. A modified block cyclic distribution, where each tile is block-cyclically distributed on a process grid, is experimentally found to be the best process distribution strategy for LORAPO. Even though $\mathcal {H}$ATRIX-DTD makes use of PaRSEC, such a data distribution is not necessary since the HSS-ULV algorithm ensures that the critical path along the diagonal can be executed in an embarrassingly parallel manner.

Table 2: Impact of rank and kernel for the methods we tested for a constant problem size of 65536.
Library | Construct Max Rank | Leaf Size | Laplace Const. Err. | Laplace Solve Err. | Yukawa Const. Err. | Yukawa Solve Err. | Matern Const. Err. | Matern Solve Err.
HATRIX | 100 | 256 | 1.54e-06 | 4.78e-12 | 2.73e-08 | 3.04e-15 | 9.95e-05 | 3.90e-13
HATRIX | 200 | 256 | 2.89e-07 | 5.48e-12 | 7.63e-09 | 3.50e-15 | 2.34e-05 | 4.51e-12
HATRIX | 200 | 512 | 1.82e-07 | 6.06e-12 | 5.85e-09 | 3.83e-15 | 1.6e-05 | 4.89e-12
HATRIX | 400 | 512 | 5.51e-10 | 7.00e-12 | 5.07e-10 | 4.42e-15 | 1.0e-06 | 5.85e-12
LORAPO | 1024 | 2048 | 1e-8 | 2.21e-13 | 1e-8 | 1.33e-13 | 1e-8 | 1.01e-09
LORAPO | 1500 | 2048 | 1e-8 | 2.21e-13 | 1e-8 | 1.33e-13 | 1e-8 | 1.01e-09
LORAPO | 1250 | 4096 | 1e-8 | 2.21e-13 | 1e-8 | 1.33e-13 | 1e-8 | 7.84e-10
LORAPO | 3000 | 4096 | 1e-8 | 2.21e-13 | 1e-8 | 1.33e-13 | 1e-8 | 7.84e-10
STRUMPACK | 100 | 256 | 1e-8 | 5.76e-14 | 1e-8 | 2.13e-15 | 1e-8 | 1.50e-12
STRUMPACK | 200 | 256 | 1e-8 | 9.05e-11 | 1e-8 | 1.48e-14 | 1e-8 | 2.35e-09
STRUMPACK | 200 | 512 | 1e-8 | 3.37e-11 | 1e-8 | 1.10e-14 | 1e-8 | 4.44e-10
STRUMPACK | 400 | 512 | 1e-8 | 1.71e-11 | 1e-8 | 4.04e-14 | 1e-8 | 9.71e-09

5 RESULTS

We run distributed memory tests for 3 implementations of low rank matrix factorization: $\mathcal {H}$ATRIX-DTD, STRUMPACK and LORAPO. Every implementation uses a uniform 2D grid geometry. Distributed memory tests are run on the Fugaku supercomputer at RIKEN, Japan. Each node of Fugaku has a single A64FX CPU with 48 physical cores divided into 4 NUMA nodes of 12 cores each. Each node has 32 GB of HBM.

We implemented $\mathcal {H}$ATRIX-DTD using the DTD programming interface from PaRSEC. We make comparisons with STRUMPACK  [13] and LORAPO  [7] as described in Sec. 1. We compare three Green's functions from diverse applications such as electrostatics and statistics as shown in Table 3 for each implementation.

Table 3: Kernels used for evaluation and their constants.
Kernel | Equation | Constants
Laplace 2D | $f(x, y) = -\ln (\epsilon + dist(x, y))$ | $\epsilon = 10^{-9}$
Yukawa | $f(x,y) = \frac{e^{-\alpha (\theta + dist(x,y))}}{\theta + dist(x,y)}$ | $\alpha = 1$, $\theta = 10^{-9}$
Matern | $f(x,y) = \begin{cases} \frac{\sigma ^2}{2^{\rho -1}\,\Gamma (\rho)} \left(\frac{dist(x,y)}{\mu }\right)^{\sigma } K_{\nu }\!\left(\sigma, \frac{dist(x,y)}{\mu }\right), & \text{otherwise} \\ \sigma ^2, & \text{if } dist(x,y) = 0 \end{cases}$ | $\sigma = 1$, $\mu = 0.03$, $\rho = 0.5$
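
For reference, the kernels in Table 3 can be evaluated pointwise as functions of the distance between two points. The sketch below uses the constants listed above; the Matérn variant follows the common $\frac{\sigma^2}{2^{\rho-1}\Gamma(\rho)}\left(\frac{d}{\mu}\right)^{\rho} K_{\rho}\!\left(\frac{d}{\mu}\right)$ form, which may differ in minor details from the exact expression used in the experiments.

```python
import numpy as np
from scipy.special import gamma, kv

def laplace_2d(d, eps=1e-9):
    # f(x, y) = -ln(eps + dist(x, y))
    return -np.log(eps + d)

def yukawa(d, alpha=1.0, theta=1e-9):
    # f(x, y) = exp(-alpha * (theta + dist)) / (theta + dist)
    return np.exp(-alpha * (theta + d)) / (theta + d)

def matern(d, sigma=1.0, mu=0.03, rho=0.5):
    # Common Matern form; treat as an approximation of the paper's definition.
    d = np.asarray(d, dtype=float)
    out = np.full_like(d, sigma ** 2)            # value at dist == 0
    nz = d > 0
    scaled = d[nz] / mu
    out[nz] = (sigma ** 2 / (2 ** (rho - 1) * gamma(rho))
               * scaled ** rho * kv(rho, scaled))
    return out

d = np.array([0.0, 0.05, 0.2, 1.0])
for name, f in [("laplace", laplace_2d), ("yukawa", yukawa), ("matern", matern)]:
    print(name, f(d))
```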

5.1 Effect of rank on accuracy

Table 2 shows the impact of changing the maximum rank and leaf size on the construction and solve error for each tested kernel. The errors are calculated by first generating a normally distributed random vector b, then computing the construction error as shown in Eq. (18) and the error of the forward and backward solve as shown in Eq. (19). $A_{dense}$ denotes the full dense matrix, and A denotes the corresponding compressed HSS matrix. The maximum rank of the HSS matrix and the size of the leaf level nodes are capped for all three libraries surveyed. We use the solve error measurements for different leaf sizes and maximum ranks from Table 2 to determine parameters for the weak scaling experiments in Sec. 5.2, in order to obtain sufficient accuracy for all kernels we benchmark.

\begin{equation} err_{construct} = \frac{\left\Vert A_{dense} \cdot b - A \cdot b\right\Vert }{\left\Vert A_{dense} \cdot b\right\Vert } \end{equation}
(18)
\begin{equation} err_{solve} = \frac{\left\Vert b - A^{-1} \cdot A \cdot b\right\Vert }{\left\Vert b\right\Vert } \end{equation}
(19)
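
The two error measures can be illustrated on a small dense stand-in, where a rank-truncated off-diagonal block plays the role of the compressed HSS matrix and a dense solve stands in for the ULV solve; this is only an illustration, not the benchmarked code.

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 256, 20
h = n // 2

# Dense stand-in for A_dense: a 1D Laplace-like kernel matrix plus a shift,
# so the off-diagonal blocks are numerically low rank and A is nonsingular.
pts = np.sort(rng.uniform(0.0, 1.0, n))
A_dense = -np.log(1e-9 + np.abs(pts[:, None] - pts[None, :])) + n * np.eye(n)

# Stand-in for the compressed matrix: keep the diagonal blocks dense and
# replace the off-diagonal block by its rank-k truncation (not a real HSS).
A = A_dense.copy()
U, s, Vt = np.linalg.svd(A_dense[:h, h:])
low = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
A[:h, h:], A[h:, :h] = low, low.T

b = rng.standard_normal(n)

# Eq. (18): construction error measured through a matrix-vector product.
err_construct = np.linalg.norm(A_dense @ b - A @ b) / np.linalg.norm(A_dense @ b)
# Eq. (19): solve error, with a dense solve standing in for the ULV solve.
err_solve = np.linalg.norm(b - np.linalg.solve(A, A @ b)) / np.linalg.norm(b)

print(f"err_construct = {err_construct:.2e}, err_solve = {err_solve:.2e}")
```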

The construction error for all cases of $\mathcal {H}$ATRIX-DTD decreases as the rank increases, which is to be expected given that a greater rank means a greater number of basis vectors that can be incorporated into the low rank approximation. However, the solve error seems to increase slightly as the rank increases. This slight increase can be attributed to numerical errors. Since LORAPO uses adaptive ranks, specifying the maximum rank leads to choosing ranks that are just enough to satisfy the construction error of $10^{-8}$. As a result, increasing the maximum rank from 1024 to 1500 for a leaf size of 2048 does not change the solve error for any kernel. STRUMPACK allows for a similar tuning of ranks as $\mathcal {H}$ATRIX-DTD since both use the HSS matrix. The HSS-ULV algorithm used within STRUMPACK shows better solve error than $\mathcal {H}$ATRIX-DTD for a maximum rank of 100 and leaf size 256 for the Laplace 2D kernel. However, the solve error increases when going from rank 100 to 200 with leaf size 256. The reasons behind this drop in accuracy are unknown.

Figure 9: Weak scaling of factorization time for all the kernels shown in the Table 3 for varying problem sizes.

5.2 Distributed memory weak scaling

Fig. 9 shows weak scaling for factorization using STRUMPACK, LORAPO and $\mathcal {H}$ATRIX-DTD on up to 128 nodes of Fugaku. For $\mathcal {H}$ATRIX-DTD and STRUMPACK, the size of the matrix begins at 4096 for 2 nodes and then increases linearly with the number of nodes, until it reaches 262,144 with 128 nodes. The linear increase in problem size and number of processors is done in order to maintain constant work per process, given the $O(N)$ time complexity of the HSS-ULV. The tile Cholesky with the BLR matrix used by LORAPO, as shown in Table 1, has $O(N^2)$ time complexity. Therefore, for LORAPO we start from a problem size of 4096 with 2 nodes and increase the number of nodes by a factor of 16 for every four-fold increase in problem size, in order to maintain constant work per node. This means that the problem size reaches 65,536 for 512 nodes. We report the 95% confidence interval of the mean of the results.

The rank and leaf size are chosen from Sec. 5.1 in order to maintain accuracy better than $10^{-11}$ for the Laplace 2D kernel, $10^{-14}$ for the Yukawa kernel, and $10^{-9}$ for the Matern kernel. We then experiment with combinations of rank and accuracy that provide an acceptable solve error for each problem size and kernel, and show the least time to solution in Fig. 9.

The results in Fig. 9 show that $\mathcal {H}$ATRIX-DTD exhibits better weak scalability than both STRUMPACK and LORAPO. LORAPO and $\mathcal {H}$ATRIX-DTD both make use of the PaRSEC runtime system; however, the tile Cholesky algorithm of LORAPO involves almost $O(N^3)$ communication for the update of the trailing sub-matrix. This, coupled with the fact that the tile Cholesky is constrained by the execution of the critical path along the diagonal, adds to the poor weak scaling of LORAPO. Further analysis of LORAPO's weak scaling is done in Sec. 5.3.1. $\mathcal {H}$ATRIX-DTD and STRUMPACK both use the HSS-ULV algorithm; however, $\mathcal {H}$ATRIX-DTD is faster than STRUMPACK. This is a result of the asynchronous execution of PaRSEC, which allows $\mathcal {H}$ATRIX-DTD to begin the factorization of the parent level before the entire child level has been factorized. STRUMPACK, on the other hand, makes use of fork-join parallelism with collective communication, which requires that each level of the HSS matrix be factorized fully before the next level can begin.

Figure 10: Performance breakdown for the 3 implementations in Fig. 9b

5.3 Performance breakdown of weak scaling

In this section, we further analyse the reasons behind the weak scaling performance seen in Sec. 5.2. Since all the kernels show similar performance characteristics, we investigate only the Yukawa kernel in further detail.

5.3.1 Performance breakdown of LORAPO. Fig. 10a shows the performance breakdown for LORAPO [7] for the weak scaling graph of the Yukawa kernel shown in Fig. 9b. We obtain these measurements from the PaRSEC instrumentation tools, which measure the time spent inside the actual computational kernels and the time spent on various runtime system management activities.

As pointed out in Sec. 1, LORAPO uses the tile Cholesky algorithm with the BLR matrix format. The “COMPUTE TASK TIME” corresponds to the average time per worker spent inside the actual computational kernels for the Cholesky factorization. The “RUNTIME OVERHEAD” corresponds to the average time per worker spent on runtime system management activities such as scheduling, memory management, submitting and executing tasks and deleting previously executed tasks. This also includes various MPI activities such as sending, receiving and polling for messages. The number of workers is the number of physical cores being used for the computation across all the nodes.

It can be seen that the overhead of the runtime system far outweighs the amount of time taken for the computation. Moreover, the growth of the overhead is proportional to the time taken for factorization in Fig. 9b, whereas the growth of “COMPUTE TASK TIME” is not. This means that the poor weak scaling of LORAPO can be attributed mainly to the runtime overhead. LORAPO would show better weak scaling if the runtime overhead remained constant as the problem size and number of resources are increased.

5.3.2 Performance breakdown of STRUMPACK. Fig. 10b shows the breakdown of the time that STRUMPACK spends on actual computation vs. MPI for the STRUMPACK weak scaling plot in Fig. 9b.

The performance statistics in Fig. 10b are obtained from the mpiP tool from LLNL (https://github.com/LLNL/mpiP). The time measurements are averaged over the total number of physical cores used by each experiment. The “MPI TIME” shows the time spent by STRUMPACK inside MPI functions such as collective communication. This time does not include the time spent on synchronization. Therefore, the breakdown of computation and communication in Fig. 10b will not add up to the weak scaling performance in Fig. 9b. The “COMPUTE TIME” shows the time spent on useful computation. The “COMPUTE TIME” remains almost the same for every measurement. However, note that the time spent in MPI by each process increases as the number of nodes increases. This means that the MPI communication overhead of the fork-join paradigm leads to inefficient execution in STRUMPACK. Note that the “COMPUTE TIME” graph does not show a flat profile in spite of the HSS-ULV being embarrassingly parallel at each level of the HSS matrix. Apart from the increasing MPI communication time, the increasing per-process compute time also contributes to the worsening weak scaling of STRUMPACK. We show in Sec. 5.3.3 that our implementation in $\mathcal {H}$ATRIX-DTD can overcome these limitations.

5.3.3 Performance breakdown of $\mathcal {H}$ATRIX-DTD. Fig. 10c shows the performance breakdown of $\mathcal {H}$ATRIX-DTD for Fig. 9b. The measurements are taken in a similar manner to LORAPO in Sec. 5.3.1, i.e., with the PaRSEC instrumentation tools. The “COMPUTE TASK TIME” and “RUNTIME OVERHEAD” have exactly the same meaning as for LORAPO in Sec. 5.3.1, since both $\mathcal {H}$ATRIX-DTD and LORAPO make use of the PaRSEC runtime system.

Note that the “COMPUTE TASK TIME” of $\mathcal {H}$ATRIX-DTD is almost completely flat. This means that exactly the same amount of work is being done by each worker when the problem size is increased in proportion to the number of available resources. The final data point shows slightly higher compute time as a result of using a leaf size of 512. This means that the HSS-ULV as implemented in $\mathcal {H}$ATRIX-DTD will show perfect weak scaling in the absence of runtime overhead. The increase in run time of $\mathcal {H}$ATRIX-DTD in Fig. 9 can be attributed to the runtime overhead that shows upward growth as the number of resources is increased. The runtime overhead can be explained by the fact that PaRSEC's DTD interface generates the entire task graph on every node. This leads to redundant work on each node, which becomes non-trivial as the number of available resources increases.

Note that the compute time per worker for $\mathcal {H}$ATRIX-DTD and compute time per thread for STRUMPACK in Sec. 5.3.2 are very similar. The runtime overhead of $\mathcal {H}$ATRIX-DTD appears higher than that of STRUMPACK since the MPI barrier and synchronization time is not accounted for in the STRUMPACK results. However, the $\mathcal {H}$ATRIX-DTD overhead shows the entire overhead including synchronization, communication, scheduling and other work by the PaRSEC runtime system.

5.4 Increasing problem size with constant resources.

Fig. 11 shows the time taken for factorization for varying problem sizes using 64 nodes of Fugaku. STRUMPACK shows almost uniform time: since we are using a large number of processes and the computation per process is not very large, the communication time dominates the computation for all cases, and what little computation needs to be done by each process is done in a short time. LORAPO shows $O(N^2)$ scaling up to a problem size of 65,536. STRUMPACK has an advantage over $\mathcal {H}$ATRIX-DTD in this case. Since the runtime overhead in $\mathcal {H}$ATRIX-DTD increases as the number of tasks increases, the factorization time of $\mathcal {H}$ATRIX-DTD grows as $O(N)$ even though the amount of computation per process is small.

Figure 11: Varying problem sizes with 64 nodes on Fugaku.

5.5 Impact of leaf size on performance.

Fig. 12 shows the impact on the time for factorization for $\mathcal {H}$ATRIX-DTD, STRUMPACK and LORAPO when using a problem size of 262,144 and 128 nodes of Fugaku. The rank is kept constant at 100 for $\mathcal {H}$ATRIX-DTD and STRUMPACK and the maximum rank is half the leaf size for LORAPO. The optimal leaf size for LORAPO changes depending on the problem size. The use of low rank approximation for compressing the dense frontal matrices in the multi-frontal method is an important application of such matrices. The selection of leaf size of the HSS matrix, which correlates to the front size in the multi-frontal solver, is a crucial parameter in justifying the cost of the algorithm. Large leaf sizes can lead to very poor performance of the multi-frontal solver. The fact that $\mathcal {H}$ATRIX-DTD is faster than STRUMPACK when using small leaf sizes shows that $\mathcal {H}$ATRIX-DTD can also be used in place of STRUMPACK to factorize the dense, structured fronts in multi-frontal solvers. Larger leaf sizes for $\mathcal {H}$ATRIX-DTD lead to worse performance due to reduction in the amount of available parallelism and more work to do per thread.

Figure 12: Performance impact of leaf size using 128 nodes and constant problem size of 262,144 for the Yukawa kernel.

6 CONCLUSION

We have proposed a ULV factorization for HSS matrices, and provided an implementation, $\mathcal {H}$ATRIX-DTD, using the PaRSEC runtime system. We have shown that factorization of structured dense matrices arising from a diverse set of Green's functions for a 2D domain can be performed faster using our implementation. This is achieved as a result of the asynchronous runtime system and the lower computational intensity of the HSS-ULV factorization. Using $\mathcal {H}$ATRIX-DTD, we show that our implementation has comparable or better accuracy than established state-of-the-art implementations such as STRUMPACK and LORAPO. Using performance analysis of weak scaling experiments, we highlight that our implementation is faster because of a combination of less computation and asynchronous resolution of dependencies across the multiple levels of the HSS matrix. We have also shown that, as a result of the runtime overhead of PaRSEC, STRUMPACK can achieve better performance than $\mathcal {H}$ATRIX-DTD for large problem sizes on a limited number of nodes.

ACKNOWLEDGMENTS

This work was supported by JSPS KAKENHI Grant Number JP20K20624, JP21H03447, JP22H03598. This work is supported by ”Joint Usage/Research Center for Interdisciplinary Large-scale Information Infrastructures” in Japan (Project ID: jh230009-NAHI). This research was also supported by US NSF grant 1909015.

REFERENCES

  • S. Ambikasaran and E. Darve. 2013. An O(N log N) Fast Direct Solver for Partial Hierarchically Semi-separable Matrices. Journal of Scientific Computing 57 (2013), 477–501.
  • P. Amestoy, C. Ashcraft, O. Boiteau, A. Buttari, J.-Y. L'Excellent, and C. Weisbecker. 2015. Improving Multifrontal Methods by Means of Block Low-Rank Representations. SIAM Journal on Scientific Computing 37, 3 (2015), A1451–A1474.
  • Cleve Ashcraft, Alfredo Buttari, and Theo Mary. 2021. Block Low-Rank Matrices with Shared Bases: Potential and Limitations of the BLR$^2$ Format. SIAM J. Matrix Anal. Appl. 42, 2 (Jan. 2021), 990–1010. https://doi.org/10.1137/20M1386451
  • C. Augonnet, D. Goudin, M. Kuhn, X. Lacoste, R. Namyst, and P. Ramet. 2019. A Hierarchical Fast Direct Solver for Distributed Memory Machines with Manycore Nodes. Technical Report. Université de Bordeaux.
  • S. Borm. 2006. $\mathcal {H}^2$-Matrix Arithmetics in Linear Complexity. Computing 77 (2006), 1–28.
  • George Bosilca, Aurelien Bouteiller, Anthony Danalis, Mathieu Faverge, Azzam Haidar, Thomas Herault, Jakub Kurzak, Julien Langou, Pierre Lemarinier, Hatem Ltaief, Piotr Luszczek, Asim YarKhan, and Jack Dongarra. 2011. Flexible Development of Dense Linear Algebra Algorithms on Massively Parallel Architectures with DPLASMA. In IEEE International Symposium on Parallel and Distributed Processing Workshops and Phd Forum. IEEE, Anchorage, AK, USA, 1432–1441. https://doi.org/10.1109/IPDPS.2011.299
  • Qinglei Cao, Rabab Alomairy, Yu Pei, George Bosilca, Hatem Ltaief, David Keyes, and Jack Dongarra. 2022. A Framework to Exploit Data Sparsity in Tile Low-Rank Cholesky Factorization. In 2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS). IEEE, Lyon, France, 414–424. https://doi.org/10.1109/IPDPS53621.2022.00047
  • S. Chandrasekaran, M. Gu, and T. Pals. 2006. A Fast ULV Decomposition Solver for Hierarchically Semiseparable Representations. SIAM J. Matrix Anal. Appl. 28, 3 (2006), 603–622. https://doi.org/10.1137/S0895479803436652
  • M. Gates, J. Kurzak, A. Charara, A. YarKhan, and J. Dongarra. 2019. SLATE: Design of a Modern Distributed and Accelerated Linear Algebra Library. In Proceedings of the 2019 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis. ACM, Denver, USA. https://doi.org/10.1145/3295500.3356223
  • Qianxiang Ma, Sameer Deshmukh, and Rio Yokota. 2022. Scalable Linear Time Dense Direct Solver for 3-D Problems without Trailing Sub-Matrix Dependencies. In 2022 SC22: International Conference for High Performance Computing, Networking, Storage and Analysis (SC). IEEE Computer Society, Dallas, TX, USA, 1198–1209.
  • Jack Poulson, Bryan Marker, Robert A. van de Geijn, Jeff R. Hammond, and Nichols A. Romero. 2013. Elemental: A New Framework for Distributed Memory Dense Matrix Computations. ACM Trans. Math. Softw. 39, 2 (Feb. 2013), 13:1–13:24. https://doi.org/10.1145/2427023.2427030
  • S. Rjasanow. 2002. Adaptive Cross Approximation of Dense Matrices. In International Association for Boundary Element Methods. UT Austin, TX, USA.
  • François-Henry Rouet, Xiaoye S. Li, Pieter Ghysels, and Artem Napov. 2016. A Distributed-Memory Package for Dense Hierarchically Semi-Separable Matrix Computations Using Randomization. ACM Trans. Math. Softw. 42, 4, Article 27, 35 pages. arXiv:1503.05464 [cs]

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

ICPP 2023, August 07–10, 2023, Salt Lake City, UT, USA

© 2023 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ACM ISBN 979-8-4007-0843-5/23/08…$15.00.
DOI: https://doi.org/10.1145/3605573.3605606