1 Introduction

Parallel sparse linear algebra solvers are often the innermost numerical kernels in scientific and engineering applications; consequently, they are one of the most time-consuming parts. In order to cope with the hierarchical hardware design of modern large-scale supercomputers, the HPC solver community has proposed new sparse methods. One promising approach to the high-performance, scalable solution of large sparse linear systems in parallel scientific computing is to combine direct and iterative methods. To achieve high scalability, algebraic domain decomposition methods are commonly employed to split a large linear system into smaller linear systems that can be efficiently and concurrently handled by a sparse direct solver, while the solution along the interfaces is computed iteratively [20, 22, 37, 40]. Such a hybrid approach exploits the advantages of both direct and iterative methods. The iterative component requires a small amount of memory and provides a natural way for parallelization. The direct part brings favorable numerical properties; furthermore, this combination provides opportunities to exploit several levels of parallelism, as we do in this paper. In this study we consider an actual fully-featured parallel sparse hybrid (direct/iterative) linear solver, MaPHyS [3].

Starting from a baseline MPI version of the considered hybrid solver, the objective of this study is to propose a prototype extension in which each MPI process can handle heterogeneous processing units with a task-based approach, delegating the task management to a runtime system. A preliminary experimental study assesses the potential of the approach.

This paper is organized as follows. Section 2 presents the solver considered in this study and its baseline parallel design. Section 3 presents background on task-based linear algebra and sparse hybrid solvers. Section 4 presents the design of the task-based extension proposed for our sparse hybrid solver. Preliminary results are discussed in Sect. 5 while concluding remarks on this work and perspectives are discussed in Sect. 6.

2 Baseline MPI Hybrid (Direct/Iterative) Solver

We now present the sparse hybrid (direct/iterative) method (Sect. 2.1) considered in this study and its baseline parallel design (Sect. 2.2).

2.1 Method

Let \(\mathcal{A}_{} x_{} = b_{} \) be the linear problem and \(\mathcal{G}_{} = \{V, E \}\) the adjacency graph associated with \(\mathcal{A}_{}\). In this graph, each vertex is associated with a row or column of the matrix \(\mathcal{A}_{}\), and there is an edge between vertices i and j if the entry \(a_{i,j}\) is nonzero. In the sequel, to simplify the exposition and limit the notation, we deliberately identify a vertex of \(\mathcal{G}_{}\) with its index, depending on the context of the description. The governing idea behind substructuring or Schur complement methods is to split the unknowns into two categories: interior and interface vertices. We assume that the vertices of the graph \(\mathcal{G}_{}\) are partitioned into \(\mathcal{N}\) disconnected subgraphs \(\mathcal{I}_{1}\), ..., \(\mathcal{I}_{\mathcal{N}}\) separated by the global vertex separator \(\varGamma _{}\). We also decompose the vertex separator \(\varGamma _{}\) into non-disjoint subsets \(\varGamma _{i}\), where \(\varGamma _{i}\) is the set of vertices in \(\varGamma _{}\) that are connected to at least one vertex of \(\mathcal{I}_{i}\). Notice that this decomposition is not a partition, since \(\varGamma _{i} \cap \varGamma _{j} \ne \emptyset \) when the set of vertices in this intersection defines the separator of \(\mathcal{I}_{i} \) and \(\mathcal{I}_{j} \). By analogy with classical domain decomposition in a finite element framework, \(\varOmega _{i} =\mathcal{I}_{i} \cup \varGamma _{i} \) will be referred to as a subdomain with internal unknowns \(\mathcal{I}_{i}\) and interface unknowns \(\varGamma _{i}\). If we denote \(\mathcal{I}_{} = \cup \mathcal{I}_{i} \) and order the vertices in \(\mathcal{I}_{}\) first, we obtain the following block reordered linear system

$$\begin{aligned} \begin{pmatrix} \mathcal{A}_{\mathcal{I}_{} \mathcal{I}_{}} &{} \mathcal{A}_{\mathcal{I}_{} \varGamma _{}} \\ \mathcal{A}_{\varGamma _{} \mathcal{I}_{}} &{} \mathcal{A}_{\varGamma _{} \varGamma _{}} \end{pmatrix} \begin{pmatrix} x_{\mathcal{I}_{}} \\ x_{\varGamma _{}} \end{pmatrix} = \begin{pmatrix} b_{\mathcal{I}_{}} \\ b_{\varGamma _{}} \end{pmatrix} \end{aligned}$$
(1)

where \(x_{\varGamma _{}}\) contains all unknowns associated with the separator and \(x_{\mathcal{I}_{}}\) contains the unknowns associated with the interiors. Because the interior vertices are connected only to interior vertices in the same subgraph or to vertices on the interface, the matrix \(\mathcal{A}_{\mathcal{I}_{} \mathcal{I}_{}}\) has a block diagonal structure, where each diagonal block corresponds to one subgraph \(\mathcal{I}_{i}\). Eliminating \(x_{\mathcal{I}_{}}\) from the second block row of Eq. (1) leads to the reduced system

$$\begin{aligned} \mathcal{S}_{} x_{\varGamma _{}} = f \end{aligned}$$
(2)

where

$$\begin{aligned} \mathcal{S}_{} =\mathcal{A}_{\varGamma _{} \varGamma _{}}-\mathcal{A}_{\varGamma _{} \mathcal{I}_{}} \mathcal{A}_{\mathcal{I}_{} \mathcal{I}_{}} ^{-1} \mathcal{A}_{\mathcal{I}_{} \varGamma _{}} \; \text { and} \; f = b_{\varGamma _{}} -\mathcal{A}_{\varGamma _{} \mathcal{I}_{}} \mathcal{A}_{\mathcal{I}_{} \mathcal{I}_{}} ^{-1} b_{\mathcal{I}_{}}. \end{aligned}$$
(3)

The matrix \(\mathcal{S}_{}\) is referred to as the Schur complement matrix. This reformulation leads to a general strategy for solving (1). Specifically, an iterative method can be applied to solve (2). Once \(x_{\varGamma _{}}\) is known, \(x_{\mathcal{I}_{}}\) can be computed with one additional solve for the interior unknowns via

$$\begin{aligned} x_{\mathcal{I}_{}} = \mathcal{A}_{\mathcal{I}_{} \mathcal{I}_{}} ^{-1} \left( b_{\mathcal{I}_{}} - \mathcal{A}_{\mathcal{I}_{} \varGamma _{}} x_{\varGamma _{}} \right) . \end{aligned}$$
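For completeness, the reduction itself is a one-line computation (spelled out here only to make the elimination explicit): substituting this expression for \(x_{\mathcal{I}_{}}\) into the second block row of Eq. (1) yields

$$\begin{aligned} \mathcal{A}_{\varGamma _{} \mathcal{I}_{}} \mathcal{A}_{\mathcal{I}_{} \mathcal{I}_{}} ^{-1} \left( b_{\mathcal{I}_{}} - \mathcal{A}_{\mathcal{I}_{} \varGamma _{}} x_{\varGamma _{}} \right) + \mathcal{A}_{\varGamma _{} \varGamma _{}} x_{\varGamma _{}} = b_{\varGamma _{}} \quad \Longleftrightarrow \quad \left( \mathcal{A}_{\varGamma _{} \varGamma _{}} - \mathcal{A}_{\varGamma _{} \mathcal{I}_{}} \mathcal{A}_{\mathcal{I}_{} \mathcal{I}_{}} ^{-1} \mathcal{A}_{\mathcal{I}_{} \varGamma _{}} \right) x_{\varGamma _{}} = b_{\varGamma _{}} - \mathcal{A}_{\varGamma _{} \mathcal{I}_{}} \mathcal{A}_{\mathcal{I}_{} \mathcal{I}_{}} ^{-1} b_{\mathcal{I}_{}}, \end{aligned}$$

which is exactly \(\mathcal{S}_{} x_{\varGamma _{}} = f\) with \(\mathcal{S}_{}\) and f as defined in (3).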

We illustrate in Fig. 1a all these notations for a decomposition into 4 subdomains. The local interiors are disjoint and form a partition of the interior \(\mathcal{I}_{} = \sqcup \mathcal{I}_{i} \) (blue vertices in Fig. 1b). This is not necessarily the case for the boundaries. Indeed, two subdomains \(\varOmega _{i}\) and \(\varOmega _{j}\) may share part of their interface (\(\varGamma _{i} \bigcap \varGamma _{j} \ne \emptyset \)), such as \(\varOmega _{1}\) and \(\varOmega _{2}\) in Fig. 1b, which share eleven vertices. Altogether, the local boundaries form the overall interface \(\varGamma _{} = \cup \varGamma _{i} \) (red vertices in Fig. 1b), which is not a disjoint union. Because interior vertices are connected only to vertices of their subset (either in the interior or on the boundary), the matrix \(\mathcal{A}_{\mathcal{I}_{} \mathcal{I}_{}}\) associated with the interior has a block diagonal structure, as shown in Fig. 1c. Each diagonal block \(\mathcal{A}_{\mathcal{I}_{i} \mathcal{I}_{i}}\) corresponds to a local interior.

Fig. 1. Domain decomposition into four subdomains \(\varOmega _{1}\), ..., \(\varOmega _{4}\). The initial domain \(\varOmega _{}\) may be algebraically represented with the graph \(\mathcal{G}_{}\) associated with the sparsity pattern of matrix \(\mathcal{A}_{}\) (a). The local interiors \(\mathcal{I}_{1}\), ..., \(\mathcal{I}_{\mathcal{N}}\) form a partition of the interior \(\mathcal{I}_{} = \sqcup \mathcal{I}_{i} \) (blue vertices in (b)). They interact with each other through the interface \(\varGamma _{}\) (red vertices in (b)). The block reordered matrix (c) has a block diagonal structure for the variables associated with the interior \(\mathcal{A}_{\mathcal{I}_{} \mathcal{I}_{}}\). (Color figure online)
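To make the reduction in Eqs. (1)–(3) concrete, the following self-contained C program works through it on a tiny dense example using CBLAS/LAPACKE (assumed available): the interior block is factorized, the Schur complement \(\mathcal{S}_{}\) and the right-hand side f are formed explicitly, and the interface and interior unknowns are recovered. For brevity the interface system is solved directly rather than iteratively, and everything is dense, so this is only an illustration of the algebra, not of the MaPHyS implementation.

```c
#include <stdio.h>
#include <string.h>
#include <cblas.h>
#include <lapacke.h>

int main(void)
{
    /* Block reordered SPD system (column-major), nI interior and nG
     * interface unknowns:  [ AII AIG ; AGI AGG ] [ xI ; xG ] = [ bI ; bG ] */
    const int nI = 2, nG = 1;
    double AII[4] = { 4, 1, 1, 3 };
    double AIG[2] = { 1, 1 };          /* nI x nG */
    double AGI[2] = { 1, 1 };          /* nG x nI */
    double AGG[1] = { 3 };
    double bI[2]  = { 1, 2 }, bG[1] = { 3 };

    /* Factorize the interior block (done by a sparse direct solver in MaPHyS). */
    LAPACKE_dpotrf(LAPACK_COL_MAJOR, 'L', nI, AII, nI);

    /* W = AII^{-1} AIG and S = AGG - AGI * W  (Eq. (3)). */
    double W[2], S[1];
    memcpy(W, AIG, sizeof(W));
    LAPACKE_dpotrs(LAPACK_COL_MAJOR, 'L', nI, nG, AII, nI, W, nI);
    memcpy(S, AGG, sizeof(S));
    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans, nG, nG, nI,
                -1.0, AGI, nG, W, nI, 1.0, S, nG);

    /* f = bG - AGI * AII^{-1} bI  (Eq. (3)). */
    double v[2], f[1];
    memcpy(v, bI, sizeof(v));
    LAPACKE_dpotrs(LAPACK_COL_MAJOR, 'L', nI, 1, AII, nI, v, nI);
    memcpy(f, bG, sizeof(f));
    cblas_dgemv(CblasColMajor, CblasNoTrans, nG, nI, -1.0, AGI, nG,
                v, 1, 1.0, f, 1);

    /* Solve S xG = f (done iteratively in MaPHyS, directly here). */
    double xG[1];
    memcpy(xG, f, sizeof(xG));
    LAPACKE_dpotrf(LAPACK_COL_MAJOR, 'L', nG, S, nG);
    LAPACKE_dpotrs(LAPACK_COL_MAJOR, 'L', nG, 1, S, nG, xG, nG);

    /* Back solve: xI = AII^{-1} (bI - AIG xG). */
    double xI[2];
    memcpy(xI, bI, sizeof(xI));
    cblas_dgemv(CblasColMajor, CblasNoTrans, nI, nG, -1.0, AIG, nI,
                xG, 1, 1.0, xI, 1);
    LAPACKE_dpotrs(LAPACK_COL_MAJOR, 'L', nI, 1, AII, nI, xI, nI);

    printf("xI = (%g, %g), xG = %g\n", xI[0], xI[1], xG[0]);
    return 0;
}
```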

While the Schur complement system is significantly smaller and better conditioned than the original matrix \(\mathcal{A}\) [34, Lemma 3.11], it is important to consider further preconditioning when employing a Krylov method. We introduce the general form of the preconditioner considered in MaPHyS. The preconditioner presented below was originally proposed in [17] and successfully applied to large problems in real-life applications in [21, 24, 35]. To describe the main preconditioner in MaPHyS, considering the restriction operator \(\mathcal{R}_{\varGamma _{i}} \) from \(\varGamma _{} \) to \(\varGamma _{i} \), we define \(\bar{\mathcal{S}_{i}} = \mathcal{R}_{\varGamma _{i}} \mathcal{S}_{} \mathcal{R}_{\varGamma _{i}} ^T\), which corresponds to the restriction of the Schur complement to the interface \(\varGamma _{i}\). If \(\mathcal{I}_{i}\) is a fully connected subgraph of \(\mathcal{G}_{}\), the matrix \(\bar{\mathcal{S}_{i}}\) is dense.

With these notations the Additive Schwarz preconditioner reads

$$\begin{aligned} \mathcal{M}_{AS} = \sum _{i=1}^{\mathcal{N}} \mathcal{R}_{\varGamma _{i}} ^T \bar{\mathcal{S}_{i}} ^{-1} \mathcal{R}_{\varGamma _{i}}. \end{aligned}$$
(4)
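As an illustration of how Eq. (4) is applied to a vector, the following sketch (not the MaPHyS code) restricts a global interface vector to each \(\varGamma _{i}\), solves with the Cholesky factor of the local block \(\bar{\mathcal{S}_{i}}\), and accumulates the prolonged results. The index maps gamma_i and the factored blocks Sbar are hypothetical data structures, and LAPACKE is assumed for the dense solves.

```c
#include <string.h>
#include <lapacke.h>

/* z = M_AS * r, where M_AS = sum_i R_i^T Sbar_i^{-1} R_i (Eq. (4)).
 * ni[i]       : size of Gamma_i
 * gamma_i[i]  : local-to-global interface index map of subdomain i
 * Sbar[i]     : ni[i] x ni[i] block, already Cholesky-factorized ('L')
 * r, z        : vectors of size nGamma on the global interface          */
void apply_additive_schwarz(int nsub, const int *ni, int **gamma_i,
                            double **Sbar, const double *r,
                            double *z, int nGamma)
{
    memset(z, 0, nGamma * sizeof(double));          /* z = 0               */
    for (int i = 0; i < nsub; i++) {
        double zi[ni[i]];                           /* C99 VLA for brevity */
        for (int k = 0; k < ni[i]; k++)             /* restriction: R_i r  */
            zi[k] = r[gamma_i[i][k]];
        /* local solve: zi <- Sbar_i^{-1} zi using the Cholesky factor     */
        LAPACKE_dpotrs(LAPACK_COL_MAJOR, 'L', ni[i], 1, Sbar[i], ni[i],
                       zi, ni[i]);
        for (int k = 0; k < ni[i]; k++)             /* prolongation: R_i^T */
            z[gamma_i[i][k]] += zi[k];
    }
}
```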

2.2 Baseline MPI Parallelization

With all these components, the classical parallel implementation of MaPHyS can be decomposed into four main phases:

  • the partitioning step consists of partitioning the adjacency graph \(\mathcal{G}_{}\) of \(\mathcal{A}_{}\) into several subdomains and distributing the local matrices \(\mathcal{A}_{i}\) to different processes. For this we can use two state-of-the-art partitioners, Scotch [36] and METIS [28];

  • the factorization of the interiors and the computation of the local Schur complement factorizes \(\mathcal{A}_{i}\) with the PaStiX [25] or the Mumps [9] sparse direct solver and also provides the associated local Schur complement \(\mathcal{S}_{i}\), thanks to recent progress from the development teams of those sparse direct solvers;

  • the setup of the preconditioner assembles the local Schur complements \(\mathcal{S}_{i}\) into the diagonal blocks \(\bar{\mathcal{S}_{i}}\) via a few neighbour-to-neighbour communications (sketched after this list) and then factorizes the dense local \(\bar{\mathcal{S}_{i}}\) using Mkl;

  • the solve step consists of two stages: a parallel preconditioned Krylov method performed on the reduced system (Eq. 2) to compute \(x_{\varGamma _{i}}\), where all BLAS operations are provided by Mkl, followed by the back solve on the interior to compute \(x_{\mathcal{I}_{i}}\), performed by the sparse direct solver.
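The neighbour-to-neighbour exchange used in the preconditioner setup can be pictured with plain point-to-point MPI, as in the sketch below. This is only an illustration under simplifying assumptions: the neighbour lists, buffer sizes and index maps (which in MaPHyS come out of the partitioning step) are taken as given, and the actual implementation may organize the exchange differently.

```c
#include <mpi.h>

/* Each process sends the entries of its local Schur complement S_i that lie
 * on a shared interface to the corresponding neighbour and adds the received
 * contributions into its assembled block Sbar_i (stored here as a flat array
 * addressed through target_idx). */
void assemble_Sbar(int nneigh, const int *neigh_rank, const int *buf_len,
                   double **send_buf, double **recv_buf, int **target_idx,
                   double *Sbar, MPI_Comm comm)
{
    MPI_Request reqs[2 * nneigh];

    for (int p = 0; p < nneigh; p++) {
        MPI_Isend(send_buf[p], buf_len[p], MPI_DOUBLE, neigh_rank[p],
                  0, comm, &reqs[2 * p]);
        MPI_Irecv(recv_buf[p], buf_len[p], MPI_DOUBLE, neigh_rank[p],
                  0, comm, &reqs[2 * p + 1]);
    }
    MPI_Waitall(2 * nneigh, reqs, MPI_STATUSES_IGNORE);

    /* Sbar_i starts as a copy of S_i restricted to Gamma_i; accumulate the
     * contribution of each neighbour at the mapped positions. */
    for (int p = 0; p < nneigh; p++)
        for (int k = 0; k < buf_len[p]; k++)
            Sbar[target_idx[p][k]] += recv_buf[p][k];
}
```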

3 Related Work

To cope with the complexity of modern architectures, programming paradigms are being revisited. Among others, one major trend consists in writing the algorithms in terms of task graphs and delegating to a runtime system both the management of data consistency and the orchestration of the actual execution. This paradigm has been intensively studied in the context of dense linear algebra and is now a common feature of related state-of-the-art libraries such as Plasma [2], Magma [1], DPLASMA [14], Chameleon [4] and FLAME [39]. Dense linear algebra algorithms were indeed excellent candidates for pioneering in this direction. First, their computational pattern allows one to design very wide task graphs so that many computational units can execute tasks concurrently. Second, the building block operations they rely on, essentially level-three Basic Linear Algebra Subroutines (BLAS), are compute intensive, which makes it possible to split the work into relatively fine-grain tasks while fully benefiting from GPU acceleration. As a result, these algorithms are particularly easy to schedule in the sense that state-of-the-art greedy scheduling algorithms may lead to high performance, including on platforms accelerated with multiple GPUs [4].

This trend has then been followed for designing sparse direct methods. The extra challenge in designing task-based sparse direct methods comes from the indirection and the variable granularity of the tasks. The PaStiX team has proposed such an extension of the solver, capable of running on the StarPU [11] and PaRSEC [15] runtime systems on clusters of heterogeneous nodes, in the context of X. Lacoste's PhD thesis [29, 30]. In the meanwhile, the qr_mumps library developed by A. Buttari [16], which aims at solving sparse linear least squares problems, has been ported on top of those runtime systems in the context of F. Lopez's PhD thesis [6, 33].

With the need to solve ever larger sparse linear systems while maintaining numerical robustness, multiple sparse hybrid variants have been proposed for computing the preconditioner for the Schur complement. PDSLin [31], ShyLU [37] and Hips [19] first perform an exact factorization of the interior of each subdomain concurrently. PDSLin and ShyLU then compute the preconditioner with a two-fold approach. First, an approximation \(\widetilde{\mathcal{S}}_{}\) of the (global) Schur complement \(\mathcal{S}_{}\) is computed. Second, this approximate Schur complement \(\widetilde{\mathcal{S}}_{}\) is factorized to form the preconditioner for the Schur complement system, which does not need to be formed explicitly. While PDSLin has multiple options for discarding values lower than some user-defined thresholds at different steps of the computation of \(\widetilde{\mathcal{S}}_{}\), ShyLU [37] also implements a structure-based approach for discarding values, named probing, which was first proposed to approximate interfaces in DDM [18]. Instead of following such a two-fold approach, Hips [19] forms the preconditioner by computing a global ILU factorization based on the multi-level scheme formulation from [26].

These sparse hybrid solvers have also been extended to cope with hierarchical supercomputers. Indeed, to ensure numerical robustness while exploiting all the processors of a platform, an important effort has been devoted to providing two levels of parallelism for these solvers. Designed on top of the SuperLU_DIST [32] distributed memory sparse direct solver, PDSLin implements a 2-level MPI (MPI+MPI) approach with finely tuned intra- and inter-subdomain load balancing [40]. A similar MPI+MPI approach has been assessed for additive Schwarz preconditioning in a prototype version of MaPHyS [23], relying on the Mumps [9, 10] and ScaLAPACK [13] sparse direct and dense distributed memory solvers, respectively. On the contrary, expecting higher numerical robustness thanks to multi-level preconditioning, Hips associates multiple subdomains with a single process and distributes the subdomains to the processes in order to maintain load balancing [19]. Finally, especially tuned for modern multicore platforms, ShyLU implements a 2-level MPI+thread approach [37]. A similar 2-level MPI+thread design has been investigated for MaPHyS in [35]. However, none of these extensions were tailored to exploit heterogeneous architectures.

4 Design of Task-Based Sparse Hybrid Linear Solver for Distributed Memory Heterogeneous Architectures

Although very efficient for exploiting multiple modern multicore nodes, the relatively low-level design of the 2-level parallelism sparse hybrid solvers discussed above cannot exploit heterogeneous architectures.

Fig. 2. Illustration of the execution of MaPHyS on four subdomains with two different task-based paradigms on a platform composed of two nodes of eight cores and four GPUs per node.

One solution for relieving this bottleneck would consist in fully abstracting the MPI scheme of the solver in terms of a DAG of tasks where vertices represent fine-grain tasks and edges represent dependencies between them. Once the solver has been designed at such a high level of abstraction, advanced fine-grain mapping strategies can be implemented, as the burden of moving data in the system and ensuring their consistency is delegated to a runtime system. However, such a design prevents relying on SPMD paradigms. As a consequence, it requires fully rewriting the solver in terms of a DAG of tasks, as illustrated in Fig. 2a: the DAG representing the whole numerical algorithm implemented in MaPHyS is written independently from the hardware architecture (we represent it on top of the runtime system and of the MPI processes). While there has been a lot of progress in that direction, as discussed in Sect. 3, for dense methods (such as the DPLASMA [14] and Chameleon [4] task-based libraries, derived from Plasma [2] and Magma [1]) and sparse direct methods (such as the task-based version of PaStiX [29] and qrm_StarPU [5]), only limited work that we are aware of has been devoted to studying task-based Krylov methods, the third numerical pillar on which hybrid solvers are built. Indeed, a task-based version of the CG algorithm for platforms equipped with several GPUs was proposed in [7] and a fault-tolerant task-based version of this algorithm was proposed in [27], but none of them could exploit distributed memory platforms.

A second solution would consist in relying on the modular design of the hybrid solver to use appropriate libraries depending on the target architecture, leading to a collection of MPI+X, MPI+Y, ... solutions to support X, Y, ... architectures, respectively.

In this paper, we propose to combine both solutions with an MPI+task approach in order to benefit from the high-level modular design of the hybrid solver and to abstract the architecture with task-based local sparse and dense direct solvers, which delegate the orchestration of the execution of the tasks within computational nodes to a runtime system. With such an MPI+task approach, it is not only possible to support X, Y, ... architectures in a consistent and elegant fashion, but also to exploit heterogeneous {X + Y} distributed memory architectures. Figure 2b illustrates the proposed design.

4.1 Overall Design of the MPI+task Extension of MaPHyS

In this section we explain how to create an MPI+task design of the MaPHyS solver. As illustrated in Fig. 2b, the latter approach aims at abstracting the hardware architecture by relying on task-based programming and delegating the orchestration of the tasks within computational nodes to a runtime system. However, contrary to the full task-based abstraction depicted in Fig. 2a, each MPI process explicitly handles a DAG representing the numerical algorithm implemented in one MaPHyS subdomain. The MPI communications between subdomains are furthermore handled explicitly by MaPHyS (and not delegated to the task-based runtime system).

The goal of this preliminary study is to show the feasibility of the approach. To do so, we considered the baseline MPI version of MaPHyS and exploited the modular software architecture to substitute the multithreaded kernels with task-based versions of these kernels. We restricted our scope to the Symmetric Positive Definite (SPD) case.

In this paper we focus on the compute-intensive numerical steps occurring after the partitioning step. Indeed, this stage is a symbolic pre-processing step; furthermore, to the best of our knowledge, none of the partitioners have yet been implemented on top of a runtime system:

  • For the factorization of the interiors, we rely on the task-based version of PaStiX proposed in X. Lacoste's thesis [29], for which we further designed a task-based Schur complement functionality thanks to the support of the PaStiX development team.

  • For the setup of the preconditioner, we use the task-based dense Cholesky solver from the Chameleon [4] library.

  • For the solve step, the application of the preconditioner is performed by the Chameleon library, while other operations involved in the iterative solution step, such as level-one BLAS and the matrix-vector product, are executed with the multithreaded Mkl library.

All in all, we use task-based sparse (PaStiX) and dense (Chameleon) direct solvers, both of them expressed through the StarPU task-based runtime system that we now present.

4.2 The StarPU Task-Based Runtime System

In the last decade, a large variety of task-based runtime systems have been developed. The most common strategy for the parallelization of task-based algorithms consists in traversing the DAG sequentially and submitting the tasks, as discovered, to the runtime system using a non-blocking function call. The dependencies between tasks are automatically inferred by the runtime system through a data dependency analysis [8] and the actual execution of each task is then postponed to the moment when all its dependencies are satisfied. This programming model is known as a Sequential Task Flow (STF) model as it fully relies on sequential consistency for the dependency detection. This paradigm is also sometimes referred to as superscalar since it mimics the functioning of superscalar processors, where instructions are issued sequentially from a single stream but can actually be executed in a different order and, possibly, in parallel depending on their mutual dependencies. The popularity of this model encouraged the OpenMP board to include it in version 4.0 of the standard: the task construct was extended with the depend clause, which enables the OpenMP runtime to automatically detect dependencies among tasks and consequently schedule them. Note that the abstract model that supports this mechanism was widely used before its inclusion in the standard. Among others, the StarSs [12] and StarPU [11] runtime systems have certainly strongly contributed to that progress. StarPU (read *PU) has been specifically designed for abstracting the underlying architecture so that it can execute tasks on any type of hardware (CPU core, GPU, ...). As a consequence, it is convenient for exploiting heterogeneous architectures. For this reason, we decided to use it for implementing the proposed task-based extension of MaPHyS.
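The following minimal C program (not taken from MaPHyS) illustrates the STF/superscalar model with the OpenMP 4.0 depend clause mentioned above: tasks are submitted sequentially, the runtime serializes the two tasks that touch x, and the task that only touches y may run concurrently with them.

```c
#include <stdio.h>

int main(void)
{
    double x = 1.0, y = 1.0;
    #pragma omp parallel
    #pragma omp single
    {
        #pragma omp task depend(inout: x)   /* t1 */
        x *= 2.0;
        #pragma omp task depend(inout: x)   /* t2: inferred to wait for t1 */
        x += 1.0;
        #pragma omp task depend(inout: y)   /* t3: independent of t1 and t2 */
        y += 5.0;
        #pragma omp taskwait                /* wait for all three tasks */
        printf("x = %g, y = %g\n", x, y);   /* x = 3, y = 6 */
    }
    return 0;
}
```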

StarPU provides a convenient interface for implementing and parallelizing applications or algorithms that can be described as a graph of tasks. Tasks have to be explicitly submitted to the runtime system along with the data they work on and the corresponding data access mode. Through data analysis, StarPU can automatically detect the dependencies among tasks and build the corresponding DAG. Once a task is submitted, the runtime tracks its dependencies and schedules its execution as soon as these are satisfied, taking care of gathering the data on the unit where the task is actually executed.
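As a concrete illustration of this interface, the sketch below registers a vector with StarPU and submits two tasks that both access it in read-write mode; the runtime infers the dependency and executes them in order. It is only a minimal example, written against the task-insertion helpers of recent StarPU releases (names may differ slightly in the 1.1 version used in Sect. 5), and is unrelated to the actual MaPHyS kernels.

```c
#include <stdio.h>
#include <stdint.h>
#include <starpu.h>

/* CPU implementation of a task scaling a vector by a value passed in cl_arg */
static void scale_cpu(void *buffers[], void *cl_arg)
{
    double *x = (double *)STARPU_VECTOR_GET_PTR(buffers[0]);
    unsigned n = STARPU_VECTOR_GET_NX(buffers[0]);
    double alpha;
    starpu_codelet_unpack_args(cl_arg, &alpha);
    for (unsigned i = 0; i < n; i++) x[i] *= alpha;
}

static struct starpu_codelet scale_cl = {
    .cpu_funcs = { scale_cpu },
    .nbuffers  = 1,
    .modes     = { STARPU_RW },
};

int main(void)
{
    double x[4] = { 1, 2, 3, 4 };
    double two = 2.0, half = 0.5;
    starpu_data_handle_t hx;

    if (starpu_init(NULL) != 0) return 1;
    /* Register x so that StarPU manages its transfers and consistency
     * (node 0 is the main memory). */
    starpu_vector_data_register(&hx, 0, (uintptr_t)x, 4, sizeof(double));

    /* Sequential submission; both tasks access hx in RW mode, so the
     * runtime serializes them while scheduling them asynchronously. */
    starpu_task_insert(&scale_cl, STARPU_RW, hx,
                       STARPU_VALUE, &two, sizeof(two), 0);
    starpu_task_insert(&scale_cl, STARPU_RW, hx,
                       STARPU_VALUE, &half, sizeof(half), 0);

    starpu_task_wait_for_all();   /* wait until the task graph is executed */
    starpu_data_unregister(hx);   /* give the data back to the application */
    starpu_shutdown();

    printf("x[0] = %g\n", x[0]);  /* 1 * 2 * 0.5 = 1 */
    return 0;
}
```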

5 Experimental Results

In this section we present preliminary results of our task-based prototype of MaPHyS. The tests presented in this section were performed on the PlaFRIM 2 platform installed at Inria Bordeaux-Sud-Ouest, more precisely on the sirocco nodes. These nodes are composed of two dodeca-core Haswell Intel Xeon E5-2680 processors, for a total of 24 cores per node, and 128 GB of RAM. Each node is equipped with 4 Nvidia K40-M GPUs, each one having 12 GB of RAM. We consider the SPD Audi_kw matrix (size \(n =900\mathrm{K}\) and non-zeros \(nnz =392\mathrm{M}\)) to illustrate the behavior of the proposed prototype solver. Both Chameleon and PaStiX use version 1.1 of the StarPU runtime system described in Sect. 4.2. All tests were performed in double precision.

Figures 3a and b depict the traces obtained on one node, using only CPU cores or both GPUs and CPU cores, respectively. In both cases, the matrix has been decomposed into four subdomains. Each subdomain is associated with one MPI process in charge of a subset of six CPU cores (left trace, Fig. 3a), or six CPU cores and one GPU (right trace, Fig. 3b), respectively. The runtime system orchestrates the execution of the tasks on the different processing units. The traces represent the execution on one particular subdomain. In the heterogeneous case, each GPU has a CPU core dedicated to handling it (see Sect. 4.2).

Fig. 3. Multicore execution trace associated with one subdomain of the MPI+task MaPHyS prototype processing the Audi_kw matrix. Four subdomains (hence four processes) are used in total.

The resulting traces show the versatility of the approach, which composes multithreaded and task-based numerical kernels. The processing units are abstracted and the same code may be executed indifferently in the homogeneous or in the heterogeneous case. Although the implementation is still preliminary and not optimized, Table 1 shows that the resulting timings allow for accelerating all three numerical steps with the use of one GPU per subdomain. The setup of the preconditioner benefits from the highest acceleration as it mostly consists of a dense factorization accelerated with Chameleon. The factorization of the interiors has a limited (but not negligible) acceleration because the PaStiX internal kernels have not been tuned for the Nvidia K40-M GPU. The solve step is accelerated thanks to the application of the preconditioner with Chameleon. The time differences between the fastest and slowest subdomain computations for the factorization of the interiors and the preconditioner setup are related to the matrix partitioning, which balances the splitting of the adjacency graph but not the computation associated with each subgraph. There is ongoing work in the graph partitioning community to address this issue, which is out of the scope of this work.

Table 1. Minimum, average and maximum time per subdomain for the MPI+task MaPHyS prototype for the multicore case (Fig. 3a) and the heterogeneous case (Fig. 3b) processing the Audi_kw matrix. Four subdomains (hence four processes) are used in total and a dense preconditioner is applied.

6 Concluding Remarks

We have proposed an MPI+task extension of MaPHyS for exploiting distributed memory heterogeneous platforms. The modular design of MaPHyS allowed us to use the task-based PaStiX and Chameleon sparse and dense direct libraries, respectively, in order to benefit from their ability to efficiently exploit the underlying heterogeneous architecture.

Although this prototype extension of MaPHyS works properly and shows the feasibility of the proposed approach, designing a solid MPI+task version of MaPHyS would require further work. First of all, the proposed approach still follows a bulk-synchronous parallelism [38] pattern (also sometimes designated as a fork-join approach). Indeed, the calls to PaStiX and Chameleon, yet local to each subdomain, induce costly pre-processing. On the one hand, PaStiX needs to perform a reordering of the variables to limit fill-in and a symbolic factorization. These steps are sequential in the present prototype. Although there exist parallel implementations of these steps, they are known to have a very limited parallel efficiency. To overcome the subsequent synchronizations, it would therefore be necessary to overlap these symbolic pre-processing steps with other numerical operations. On the other hand, following the Plasma design, Chameleon first decomposes the dense matrix into tiles, which is also a synchronizing operation. As in Plasma, there exists an advanced interface for handling matrices already decomposed into tiles. Using this interface would certainly alleviate the bottleneck occurring within the setup of the preconditioner when calling the dense solver.

Other operations involved in the iterative solution step, such as level-one BLAS and the matrix-vector product, could be implemented with a task-based approach. In the case of a dense preconditioner, these operations could also be implemented by calling BLAS operations provided by Chameleon. However, in the present state, without using the advanced interface discussed above, synchronizations would occur multiple times per iteration. A fully task-based CG solver is presented in [7], which discusses in detail how synchronization points can be alleviated.
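To see where these per-iteration synchronization points come from, consider a plain (sequential, dense) preconditioned CG iteration written with CBLAS, as sketched below for a toy SPD system with a Jacobi preconditioner. Each dot product, as well as the residual norm used in the stopping test, produces a scalar needed by the very next operation; these are the reductions that a task-based formulation such as [7] has to reorganize. This sketch is only illustrative and is not the MaPHyS iterative kernel.

```c
#include <stdio.h>
#include <cblas.h>

int main(void)
{
    /* small SPD system A x = b, diagonal (Jacobi) preconditioner */
    double A[9] = { 4, 1, 1,   1, 3, 1,   1, 1, 3 };   /* column-major */
    double b[3] = { 1, 2, 3 };
    double x[3] = { 0, 0, 0 };
    double r[3], z[3], p[3], q[3];
    int n = 3;

    cblas_dcopy(n, b, 1, r, 1);                          /* r = b - A*0       */
    for (int i = 0; i < n; i++) z[i] = r[i] / A[i + i * n]; /* z = M^{-1} r   */
    cblas_dcopy(n, z, 1, p, 1);
    double rho = cblas_ddot(n, r, 1, z, 1);              /* reduction #1      */

    for (int it = 0; it < 50; it++) {
        cblas_dgemv(CblasColMajor, CblasNoTrans, n, n, 1.0, A, n,
                    p, 1, 0.0, q, 1);                    /* q = A p           */
        double alpha = rho / cblas_ddot(n, p, 1, q, 1);  /* reduction #2      */
        cblas_daxpy(n,  alpha, p, 1, x, 1);              /* x += alpha p      */
        cblas_daxpy(n, -alpha, q, 1, r, 1);              /* r -= alpha q      */
        if (cblas_dnrm2(n, r, 1) < 1e-12) break;         /* reduction #3      */
        for (int i = 0; i < n; i++) z[i] = r[i] / A[i + i * n];
        double rho_new = cblas_ddot(n, r, 1, z, 1);      /* reduction #1'     */
        cblas_dscal(n, rho_new / rho, p, 1);             /* p = z + beta p    */
        cblas_daxpy(n, 1.0, z, 1, p, 1);
        rho = rho_new;
    }
    printf("x = (%g, %g, %g)\n", x[0], x[1], x[2]);
    return 0;
}
```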

To completely alleviate the synchronizations between the different sequences into which MaPHyS is decomposed, it would be necessary to further overlap communications with computations. This could be performed with a clever usage of asynchronous MPI calls. This approach is relatively difficult to implement and has been applied to overlap the main stages in MaPHyS, both in the MPI+thread version and in the MPI+task prototype discussed in this paper. However, relying on this paradigm for performing fine-grain overlapping would be challenging and would certainly result in a code that is very complex to maintain. Alternatively, the MPI calls can be appended to the task flow. Doing so, the task-based runtime system can dynamically decide when to perform the actual calls to the MPI layer and interleave them with fine-grain computational tasks. Modern runtime systems such as StarPU and PaRSEC provide such an opportunity. However, even in that case, the task flow would still have to be designed according to the mapping between tasks and processes. On the contrary, the full task approach (see Fig. 2a) would allow for fully abstracting the hardware architecture and would make the mapping issues practically orthogonal to the design of the task flow.