1 Introduction

Parallel sparse linear algebra solvers are often the innermost numerical kernels in scientific and engineering applications; consequently, they are one of the most time-consuming parts. In order to cope with the hierarchical hardware design of modern large-scale supercomputers, the HPC solver community has proposed new sparse methods. One promising approach to the high-performance, scalable solution of large sparse linear systems in parallel scientific computing is to combine direct and iterative methods. To achieve high scalability, algebraic domain decomposition methods are commonly employed to split a large linear system into smaller linear systems that can be efficiently and concurrently handled by a sparse direct solver, while the solution along the interfaces is computed iteratively [20, 22, 37, 40]. Such a hybrid approach exploits the advantages of both direct and iterative methods. The iterative component requires a small amount of memory and provides a natural way for parallelization. The direct part brings favorable numerical properties; furthermore, this combination provides opportunities to exploit several levels of parallelism, as we do in this paper. In this study we consider an actual fully-featured parallel sparse hybrid (direct/iterative) linear solver, MaPHyS [3].

Starting from a baseline MPI version of the considered hybrid solver, the objective of this study is to propose a prototype extension in which each MPI process can handle heterogeneous processing units with a task-based approach, delegating the task management to a runtime system. A preliminary experimental study assesses the potential of the approach.

This paper is organized as follows. Section 2 presents the solver considered in this study and its baseline parallel design. Section 3 presents background on task-based linear algebra and sparse hybrid solvers. Section 4 presents the design of the task-based extension proposed for our sparse hybrid solver. Preliminary results are discussed in Sect. 5 while concluding remarks on this work and perspectives are discussed in Sect. 6.

2 Baseline MPI Hybrid (Direct/Iterative) Solver

We now present the sparse hybrid (direct/iterative) method (Sect. 2.1) considered in this study and its baseline parallel design (Sect. 2.2).

2.1 Method

Let \(\mathcal{A}_{} x_{} = b_{} \) be the linear problem and \(\mathcal{G}_{} = \{V, E \}\) the adjacency graph associated with \(\mathcal{A}_{}\). In this graph, each vertex is associated with a row or column of the matrix \(\mathcal{A}_{}\), and there is an edge between vertices i and j if the entry \(a_{i,j}\) is nonzero. In the sequel, to simplify the exposition and limit the notation, we deliberately identify a vertex of \(\mathcal{G}_{}\) with its index, depending on the context of the description. The governing idea behind substructuring or Schur complement methods is to split the unknowns into two categories: interior and interface vertices. We assume that the vertices of the graph \(\mathcal{G}_{}\) are partitioned into \(\mathcal{N}\) disconnected subgraphs \(\mathcal{I}_{1}\), ..., \(\mathcal{I}_{\mathcal{N}}\) separated by the global vertex separator \(\varGamma _{}\). We also decompose the vertex separator \(\varGamma _{}\) into non-disjoint subsets \(\varGamma _{i}\), where \(\varGamma _{i}\) is the set of vertices in \(\varGamma _{}\) that are connected to at least one vertex of \(\mathcal{I}_{i}\). Notice that this decomposition is not a partition, since \(\varGamma _{i} \cap \varGamma _{j} \ne \emptyset \) when the set of vertices in this intersection defines the separator of \(\mathcal{I}_{i} \) and \(\mathcal{I}_{j} \). By analogy with classical domain decomposition in a finite element framework, \(\varOmega _{i} =\mathcal{I}_{i} \cup \varGamma _{i} \) will be referred to as a subdomain with internal unknowns \(\mathcal{I}_{i}\) and interface unknowns \(\varGamma _{i}\). If we denote \(\mathcal{I}_{} = \cup \mathcal{I}_{i} \) and order the vertices in \(\mathcal{I}_{}\) first, we obtain the following block reordered linear system

$$\begin{aligned} \begin{pmatrix} \mathcal{A}_{\mathcal{I}_{} \mathcal{I}_{}} &{} \mathcal{A}_{\mathcal{I}_{} \varGamma _{}} \\ \mathcal{A}_{\varGamma _{} \mathcal{I}_{}} &{} \mathcal{A}_{\varGamma _{} \varGamma _{}} \end{pmatrix} \begin{pmatrix} x_{\mathcal{I}_{}} \\ x_{\varGamma _{}} \end{pmatrix} = \begin{pmatrix} b_{\mathcal{I}_{}} \\ b_{\varGamma _{}} \end{pmatrix} \end{aligned}$$
(1)

where \(x_{\varGamma _{}}\) contains all unknowns associated with the separator and \(x_{\mathcal{I}_{}}\) contains the unknowns associated with the interiors. Because the interior vertices are connected only to interior vertices in the same subgraph or to vertices on the interface, the matrix \(\mathcal{A}_{\mathcal{I}_{} \mathcal{I}_{}}\) has a block diagonal structure, where each diagonal block corresponds to one subgraph \(\mathcal{I}_{i}\). Eliminating \(x_{\mathcal{I}_{}}\) from the second block row of Eq. (1) leads to the reduced system

$$\begin{aligned} \mathcal{S}_{} x_{\varGamma _{}} = f \end{aligned}$$
(2)

where

$$\begin{aligned} \mathcal{S}_{} =\mathcal{A}_{\varGamma _{} \varGamma _{}}-\mathcal{A}_{\varGamma _{} \mathcal{I}_{}} \mathcal{A}_{\mathcal{I}_{} \mathcal{I}_{}} ^{-1} \mathcal{A}_{\mathcal{I}_{} \varGamma _{}} \; \text { and} \; f = b_{\varGamma _{}} -\mathcal{A}_{\varGamma _{} \mathcal{I}_{}} \mathcal{A}_{\mathcal{I}_{} \mathcal{I}_{}} ^{-1} b_{\mathcal{I}_{}}. \end{aligned}$$
(3)

The matrix \(\mathcal{S}_{}\) is referred to as the Schur complement matrix. This reformulation leads to a general strategy for solving (1). Specifically, an iterative method can be applied to solve (2). Once \(x_{\varGamma _{}}\) is known, \(x_{\mathcal{I}_{}}\) can be computed with one additional solve for the interior unknowns via

$$\begin{aligned} x_{\mathcal{I}_{}} = \mathcal{A}_{\mathcal{I}_{} \mathcal{I}_{}} ^{-1} \left( b_{\mathcal{I}_{}} - \mathcal{A}_{\mathcal{I}_{} \varGamma _{}} x_{\varGamma _{}} \right) . \end{aligned}$$
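For completeness, the reduction itself is a one-line computation (spelled out here only to make the elimination explicit): substituting this expression for \(x_{\mathcal{I}_{}}\) into the second block row of Eq. (1) yields

$$\begin{aligned} \mathcal{A}_{\varGamma _{} \mathcal{I}_{}} \mathcal{A}_{\mathcal{I}_{} \mathcal{I}_{}} ^{-1} \left( b_{\mathcal{I}_{}} - \mathcal{A}_{\mathcal{I}_{} \varGamma _{}} x_{\varGamma _{}} \right) + \mathcal{A}_{\varGamma _{} \varGamma _{}} x_{\varGamma _{}} = b_{\varGamma _{}} \quad \Longleftrightarrow \quad \left( \mathcal{A}_{\varGamma _{} \varGamma _{}} - \mathcal{A}_{\varGamma _{} \mathcal{I}_{}} \mathcal{A}_{\mathcal{I}_{} \mathcal{I}_{}} ^{-1} \mathcal{A}_{\mathcal{I}_{} \varGamma _{}} \right) x_{\varGamma _{}} = b_{\varGamma _{}} - \mathcal{A}_{\varGamma _{} \mathcal{I}_{}} \mathcal{A}_{\mathcal{I}_{} \mathcal{I}_{}} ^{-1} b_{\mathcal{I}_{}}, \end{aligned}$$

which is exactly \(\mathcal{S}_{} x_{\varGamma _{}} = f\) with \(\mathcal{S}_{}\) and f as defined in (3).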

We illustrate in Fig. 1a all these notations for a decomposition into 4 subdomains. The local interiors are disjoint and form a partition of the interior \(\mathcal{I}_{} = \sqcup \mathcal{I}_{i} \) (blue vertices in Fig. 1b). This is not necessarily the case for the boundaries. Indeed, two subdomains \(\varOmega _{i}\) and \(\varOmega _{j}\) may share part of their interface (\(\varGamma _{i} \bigcap \varGamma _{j} \ne \emptyset \)), such as \(\varOmega _{1}\) and \(\varOmega _{2}\) in Fig. 1b, which share eleven vertices. Altogether, the local boundaries form the overall interface \(\varGamma _{} = \cup \varGamma _{i} \) (red vertices in Fig. 1b), which is not a disjoint union. Because interior vertices are connected only to vertices of their subset (either in the interior or on the boundary), the matrix \(\mathcal{A}_{\mathcal{I}_{} \mathcal{I}_{}}\) associated with the interior has a block diagonal structure, as shown in Fig. 1c. Each diagonal block \(\mathcal{A}_{\mathcal{I}_{i} \mathcal{I}_{i}}\) corresponds to a local interior.

Fig. 1. Domain decomposition into four subdomains \(\varOmega _{1}\), ..., \(\varOmega _{4}\). The initial domain \(\varOmega _{}\) may be algebraically represented with the graph \(\mathcal{G}_{}\) associated with the sparsity pattern of matrix \(\mathcal{A}_{}\) (a). The local interiors \(\mathcal{I}_{1}\), ..., \(\mathcal{I}_{\mathcal{N}}\) form a partition of the interior \(\mathcal{I}_{} = \sqcup \mathcal{I}_{i} \) (blue vertices in (b)). They interact with each other through the interface \(\varGamma _{}\) (red vertices in (b)). The block reordered matrix (c) has a block diagonal structure for the variables associated with the interior \(\mathcal{A}_{\mathcal{I}_{} \mathcal{I}_{}}\). (Color figure online)
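To make the reduction in Eqs. (1)–(3) concrete, the following self-contained C program works through it on a tiny dense example using CBLAS/LAPACKE (assumed available): the interior block is factorized, the Schur complement \(\mathcal{S}_{}\) and the right-hand side f are formed explicitly, and the interface and interior unknowns are recovered. For brevity the interface system is solved directly rather than iteratively, and everything is dense, so this is only an illustration of the algebra, not of the MaPHyS implementation.

```c
#include <stdio.h>
#include <string.h>
#include <cblas.h>
#include <lapacke.h>

int main(void)
{
    /* Block reordered SPD system (column-major), nI interior and nG
     * interface unknowns:  [ AII AIG ; AGI AGG ] [ xI ; xG ] = [ bI ; bG ] */
    const int nI = 2, nG = 1;
    double AII[4] = { 4, 1, 1, 3 };
    double AIG[2] = { 1, 1 };          /* nI x nG */
    double AGI[2] = { 1, 1 };          /* nG x nI */
    double AGG[1] = { 3 };
    double bI[2]  = { 1, 2 }, bG[1] = { 3 };

    /* Factorize the interior block (done by a sparse direct solver in MaPHyS). */
    LAPACKE_dpotrf(LAPACK_COL_MAJOR, 'L', nI, AII, nI);

    /* W = AII^{-1} AIG and S = AGG - AGI * W  (Eq. (3)). */
    double W[2], S[1];
    memcpy(W, AIG, sizeof(W));
    LAPACKE_dpotrs(LAPACK_COL_MAJOR, 'L', nI, nG, AII, nI, W, nI);
    memcpy(S, AGG, sizeof(S));
    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans, nG, nG, nI,
                -1.0, AGI, nG, W, nI, 1.0, S, nG);

    /* f = bG - AGI * AII^{-1} bI  (Eq. (3)). */
    double v[2], f[1];
    memcpy(v, bI, sizeof(v));
    LAPACKE_dpotrs(LAPACK_COL_MAJOR, 'L', nI, 1, AII, nI, v, nI);
    memcpy(f, bG, sizeof(f));
    cblas_dgemv(CblasColMajor, CblasNoTrans, nG, nI, -1.0, AGI, nG,
                v, 1, 1.0, f, 1);

    /* Solve S xG = f (done iteratively in MaPHyS, directly here). */
    double xG[1];
    memcpy(xG, f, sizeof(xG));
    LAPACKE_dpotrf(LAPACK_COL_MAJOR, 'L', nG, S, nG);
    LAPACKE_dpotrs(LAPACK_COL_MAJOR, 'L', nG, 1, S, nG, xG, nG);

    /* Back solve: xI = AII^{-1} (bI - AIG xG). */
    double xI[2];
    memcpy(xI, bI, sizeof(xI));
    cblas_dgemv(CblasColMajor, CblasNoTrans, nI, nG, -1.0, AIG, nI,
                xG, 1, 1.0, xI, 1);
    LAPACKE_dpotrs(LAPACK_COL_MAJOR, 'L', nI, 1, AII, nI, xI, nI);

    printf("xI = (%g, %g), xG = %g\n", xI[0], xI[1], xG[0]);
    return 0;
}
```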

While the Schur complement system is significantly smaller and better conditioned than the original matrix \(\mathcal{A}\) [34, Lemma 3.11], it is important to consider further preconditioning when employing a Krylov method. We introduce the general form of the preconditioner considered in MaPHyS. The preconditioner presented below was originally proposed in [17] and successfully applied to large problems in real-life applications in [21, 24, 35]. To describe the main preconditioner in MaPHyS, considering the restriction operator \(\mathcal{R}_{\varGamma _{i}} \) from \(\varGamma _{} \) to \(\varGamma _{i} \), we define \(\bar{\mathcal{S}_{i}} = \mathcal{R}_{\varGamma _{i}} \mathcal{S}_{} \mathcal{R}_{\varGamma _{i}} ^T\), which corresponds to the restriction of the Schur complement to the interface \(\varGamma _{i}\). If \(\mathcal{I}_{i}\) is a fully connected subgraph of \(\mathcal{G}_{}\), the matrix \(\bar{\mathcal{S}_{i}}\) is dense.

With these notations the Additive Schwarz preconditioner reads

$$\begin{aligned} \mathcal{M}_{AS} = \sum _{i=1}^{\mathcal{N}} \mathcal{R}_{\varGamma _{i}} ^T \bar{\mathcal{S}_{i}} ^{-1} \mathcal{R}_{\varGamma _{i}}. \end{aligned}$$
(4)
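As an illustration of how Eq. (4) is applied to a vector, the following sketch (not the MaPHyS code) restricts a global interface vector to each \(\varGamma _{i}\), solves with the Cholesky factor of the local block \(\bar{\mathcal{S}_{i}}\), and accumulates the prolonged results. The index maps gamma_i and the factored blocks Sbar are hypothetical data structures, and LAPACKE is assumed for the dense solves.

```c
#include <string.h>
#include <lapacke.h>

/* z = M_AS * r, where M_AS = sum_i R_i^T Sbar_i^{-1} R_i (Eq. (4)).
 * ni[i]       : size of Gamma_i
 * gamma_i[i]  : local-to-global interface index map of subdomain i
 * Sbar[i]     : ni[i] x ni[i] block, already Cholesky-factorized ('L')
 * r, z        : vectors of size nGamma on the global interface          */
void apply_additive_schwarz(int nsub, const int *ni, int **gamma_i,
                            double **Sbar, const double *r,
                            double *z, int nGamma)
{
    memset(z, 0, nGamma * sizeof(double));          /* z = 0               */
    for (int i = 0; i < nsub; i++) {
        double zi[ni[i]];                           /* C99 VLA for brevity */
        for (int k = 0; k < ni[i]; k++)             /* restriction: R_i r  */
            zi[k] = r[gamma_i[i][k]];
        /* local solve: zi <- Sbar_i^{-1} zi using the Cholesky factor     */
        LAPACKE_dpotrs(LAPACK_COL_MAJOR, 'L', ni[i], 1, Sbar[i], ni[i],
                       zi, ni[i]);
        for (int k = 0; k < ni[i]; k++)             /* prolongation: R_i^T */
            z[gamma_i[i][k]] += zi[k];
    }
}
```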

2.2 Baseline MPI Parallelization

With all these components, the classical parallel implementation of MaPHyS can be decomposed into four main phases:

  • the partitioning step consists of partitioning the adjacency graph \(\mathcal{G}_{}\) of \(\mathcal{A}_{}\) into several subdomains and distributing the local matrices \(\mathcal{A}_{i}\) to different processes. For this we can use two state-of-the-art partitioners, Scotch [36] and METIS [28];

  • the factorization of the interiors and the computation of the local Schur complement factorizes \(\mathcal{A}_{i}\) with the PaStiX [25] or the Mumps [9] sparse direct solver and also provides the associated local Schur complement \(\mathcal{S}_{i}\), thanks to recent progress from the development teams of those sparse direct solvers;

  • the setup of the preconditioner assembles the local Schur complements \(\mathcal{S}_{i}\) into the diagonal blocks \(\bar{\mathcal{S}_{i}}\) via a few neighbour-to-neighbour communications (sketched after this list) and then factorizes the dense local \(\bar{\mathcal{S}_{i}}\) using Mkl;

  • the solve step consists of two stages: a parallel preconditioned Krylov method performed on the reduced system (Eq. 2) to compute \(x_{\varGamma _{i}}\), where all BLAS operations are provided by Mkl, followed by the back solve on the interior to compute \(x_{\mathcal{I}_{i}}\), performed by the sparse direct solver.
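The neighbour-to-neighbour exchange used in the preconditioner setup can be pictured with plain point-to-point MPI, as in the sketch below. This is only an illustration under simplifying assumptions: the neighbour lists, buffer sizes and index maps (which in MaPHyS come out of the partitioning step) are taken as given, and the actual implementation may organize the exchange differently.

```c
#include <mpi.h>

/* Each process sends the entries of its local Schur complement S_i that lie
 * on a shared interface to the corresponding neighbour and adds the received
 * contributions into its assembled block Sbar_i (stored here as a flat array
 * addressed through target_idx). */
void assemble_Sbar(int nneigh, const int *neigh_rank, const int *buf_len,
                   double **send_buf, double **recv_buf, int **target_idx,
                   double *Sbar, MPI_Comm comm)
{
    MPI_Request reqs[2 * nneigh];

    for (int p = 0; p < nneigh; p++) {
        MPI_Isend(send_buf[p], buf_len[p], MPI_DOUBLE, neigh_rank[p],
                  0, comm, &reqs[2 * p]);
        MPI_Irecv(recv_buf[p], buf_len[p], MPI_DOUBLE, neigh_rank[p],
                  0, comm, &reqs[2 * p + 1]);
    }
    MPI_Waitall(2 * nneigh, reqs, MPI_STATUSES_IGNORE);

    /* Sbar_i starts as a copy of S_i restricted to Gamma_i; accumulate the
     * contribution of each neighbour at the mapped positions. */
    for (int p = 0; p < nneigh; p++)
        for (int k = 0; k < buf_len[p]; k++)
            Sbar[target_idx[p][k]] += recv_buf[p][k];
}
```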

3 Related Work

To cope with the complexity of modern architectures, programming paradigms are being revisited. Among others, one major trend consists in writing the algorithms in terms of task graphs and delegating to a runtime system both the management of data consistency and the orchestration of the actual execution. This paradigm has been intensively studied in the context of dense linear algebra and is now a common feature of related state-of-the-art libraries such as Plasma [2], Magma [1], DPLASMA [14], Chameleon [4] and FLAME [39]. Dense linear algebra algorithms were indeed excellent candidates for pioneering in this direction. First, their computational pattern allows one to design very wide task graphs so that many computational units can execute tasks concurrently. Second, the building block operations they rely on, essentially level-three Basic Linear Algebra Subroutines (BLAS), are compute intensive, which makes it possible to split the work into relatively fine-grain tasks while fully benefiting from GPU acceleration. As a result, these algorithms are particularly easy to schedule in the sense that state-of-the-art greedy scheduling algorithms may lead to high performance, including on platforms accelerated with multiple GPUs [4].

This trend has then been followed for designing sparse direct methods. The extra challenge in designing task-based sparse direct methods comes from the indirection and the variable granularity of the tasks. The PaStiX team has proposed such an extension of the solver, capable of running on the StarPU [11] and PaRSEC [15] runtime systems on clusters of heterogeneous nodes, in the context of X. Lacoste's PhD thesis [29, 30]. In the meanwhile, the qr_mumps library developed by A. Buttari [16], which aims at solving sparse linear least squares problems, has been ported on top of those runtime systems in the context of F. Lopez's PhD thesis [6, 33].

With the need to solve ever larger sparse linear systems while maintaining numerical robustness, multiple sparse hybrid variants have been proposed for computing the preconditioner for the Schur complement. PDSLin [31], ShyLU [37] and Hips [19] first perform an exact factorization of the interior of each subdomain concurrently. PDSLin and ShyLU then compute the preconditioner with a two-fold approach. First, an approximation \(\widetilde{\mathcal{S}}_{}\) of the (global) Schur complement \(\mathcal{S}_{}\) is computed. Second, this approximate Schur complement \(\widetilde{\mathcal{S}}_{}\) is factorized to form the preconditioner for the Schur complement system, which does not need to be formed explicitly. While PDSLin has multiple options for discarding values lower than some user-defined thresholds at different steps of the computation of \(\widetilde{\mathcal{S}}_{}\), ShyLU [37] also implements a structure-based approach for discarding values, named probing, which was first proposed to approximate interfaces in DDM [18]. Instead of following such a two-fold approach, Hips [19] forms the preconditioner by computing a global ILU factorization based on the multi-level scheme formulation from [26].

These sparse hybrid solvers have also been extended to cope with hierarchical supercomputers. Indeed, to ensure numerical robustness while exploiting all the processors of a platform, an important effort has been devoted to providing two levels of parallelism for these solvers. Designed on top of the SuperLU_DIST [32] distributed memory sparse direct solver, PDSLin implements a 2-level MPI (MPI+MPI) approach with finely tuned intra- and inter-subdomain load balancing [40]. A similar MPI+MPI approach has been assessed for additive Schwarz preconditioning in a prototype version of MaPHyS [23], relying on the Mumps [9, 10] and ScaLAPACK [13] sparse direct and dense distributed memory solvers, respectively. On the contrary, expecting higher numerical robustness thanks to multi-level preconditioning, Hips associates multiple subdomains with a single process and distributes the subdomains to the processes in order to maintain load balancing [19]. Finally, especially tuned for modern multicore platforms, ShyLU implements a 2-level MPI+thread approach [37]. A similar 2-level MPI+thread design has been investigated for MaPHyS in [35]. However, none of these extensions were tailored to exploit heterogeneous architectures.

4 Design of Task-Based Sparse Hybrid Linear Solver for Distributed Memory Heterogeneous Architectures

Although very efficient for exploiting multiple modern multicore nodes, the relatively low-level design of the 2-level parallelism sparse hybrid solvers discussed above cannot exploit heterogeneous architectures.

Fig. 2. Illustration of the execution of MaPHyS on four subdomains with two different task-based paradigms on a platform composed of two nodes of eight cores and four GPUs per node.

One solution for relieving this bottleneck would consist in fully abstracting the MPI scheme of the solver in terms of a DAG of tasks where vertices represent fine-grain tasks and edges represent dependencies between them. Once the solver has been designed at such a high level of abstraction, advanced fine-grain mapping strategies can be implemented, as the burden of moving data in the system and ensuring their consistency is delegated to a runtime system. However, such a design prevents relying on SPMD paradigms. As a consequence, it requires fully rewriting the solver in terms of a DAG of tasks, as illustrated in Fig. 2a: the DAG representing the whole numerical algorithm implemented in MaPHyS is written independently from the hardware architecture (we represent it on top of the runtime system and of the MPI processes). While there has been a lot of progress in that direction, as discussed in Sect. 3, for dense methods (such as the DPLASMA [14] and Chameleon [4] task-based libraries, derived from Plasma [2] and Magma [1]) and sparse direct methods (such as the task-based version of PaStiX [29] and qrm_StarPU [5]), only limited work that we are aware of has been devoted to studying task-based Krylov methods, the third numerical pillar on which hybrid solvers are built. Indeed, a task-based version of the CG algorithm for platforms equipped with several GPUs was proposed in [7] and a fault-tolerant task-based version of this algorithm was proposed in [27], but none of them could exploit distributed memory platforms.

A second solution would consist in relying on the modular design of the hybrid solver to use appropriate libraries depending on the target architecture, leading to a collection of MPI+X, MPI+Y, ... solutions to support X, Y, ... architectures, respectively.

In this paper, we propose to combine both solutions with an MPI+task approach in order to benefit from the high-level modular design of the hybrid solver and to abstract the architecture with task-based local sparse and dense direct solvers, which delegate the orchestration of the execution of the tasks within computational nodes to a runtime system. With such an MPI+task approach, it is not only possible to support X, Y, ... architectures in a consistent and elegant fashion, but also to exploit heterogeneous {X + Y} distributed memory architectures. Figure 2b illustrates the proposed design.

4.1 Overall Design of the MPI+task Extension of MaPHyS

In this section we explain how to create an MPI+task design of the MaPHyS solver. As illustrated in Fig. 2b, the latter approach aims at abstracting the hardware architecture by relying on task-based programming and delegating the orchestration of the tasks within computational nodes to a runtime system. However, contrary to the full task-based abstraction depicted in Fig. 2a, each MPI process explicitly handles a DAG representing the numerical algorithm implemented in one MaPHyS subdomain. The MPI communications between subdomains are furthermore handled explicitly by MaPHyS (and not delegated to the task-based runtime system).

The goal of this preliminary study is to show the feasibility of the approach. To do so, we considered the baseline MPI version of MaPHyS and exploited the modular software architecture to substitute the multithreaded kernels with task-based versions of these kernels. We restricted our scope to the Symmetric Positive Definite (SPD) case.

In this paper we focus on the compute-intensive numerical steps occurring after the partitioning step. Indeed, this stage is a symbolic pre-processing step; furthermore, to the best of our knowledge, none of the partitioners have yet been implemented on top of a runtime system:

  • For the factorization of the interiors, we rely on the task-based version of PaStiX proposed in X. Lacoste's thesis [29], for which we further designed a task-based Schur complement functionality thanks to the support of the PaStiX development team.

  • For the setup of the preconditioner, we use the task-based dense Cholesky solver from the Chameleon [4] library.

  • For the solve step, the application of the preconditioner is performed by the Chameleon library, while other operations involved in the iterative solution step, such as level-one BLAS and the matrix-vector product, are executed with the multithreaded Mkl library.

All in all, we use task-based sparse (PaStiX) and dense (Chameleon) direct solvers, both of them expressed through the StarPU task-based runtime system that we now present.

4.2 The StarPU Task-Based Runtime System

In the last decade, a large variety of task-based runtime systems have been developed. The most common strategy for the parallelization of task-based algorithms consists in traversing the DAG sequentially and submitting the tasks, as discovered, to the runtime system using a non-blocking function call. The dependencies between tasks are automatically inferred by the runtime system through a data dependency analysis [8] and the actual execution of each task is then postponed to the moment when all its dependencies are satisfied. This programming model is known as a Sequential Task Flow (STF) model as it fully relies on sequential consistency for the dependency detection. This paradigm is also sometimes referred to as superscalar since it mimics the functioning of superscalar processors, where instructions are issued sequentially from a single stream but can actually be executed in a different order and, possibly, in parallel depending on their mutual dependencies. The popularity of this model encouraged the OpenMP board to include it in version 4.0 of the standard: the task construct was extended with the depend clause, which enables the OpenMP runtime to automatically detect dependencies among tasks and consequently schedule them. Note that the abstract model that supports this mechanism was widely used before its inclusion in the standard. Among others, the StarSs [12] and StarPU [11] runtime systems have certainly strongly contributed to that progress. StarPU (read *PU) has been specifically designed for abstracting the underlying architecture so that it can execute tasks on any type of hardware (CPU core, GPU, ...). As a consequence, it is convenient for exploiting heterogeneous architectures. For this reason, we decided to use it for implementing the proposed task-based extension of MaPHyS.
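The following minimal C program (not taken from MaPHyS) illustrates the STF/superscalar model with the OpenMP 4.0 depend clause mentioned above: tasks are submitted sequentially, the runtime serializes the two tasks that touch x, and the task that only touches y may run concurrently with them.

```c
#include <stdio.h>

int main(void)
{
    double x = 1.0, y = 1.0;
    #pragma omp parallel
    #pragma omp single
    {
        #pragma omp task depend(inout: x)   /* t1 */
        x *= 2.0;
        #pragma omp task depend(inout: x)   /* t2: inferred to wait for t1 */
        x += 1.0;
        #pragma omp task depend(inout: y)   /* t3: independent of t1 and t2 */
        y += 5.0;
        #pragma omp taskwait                /* wait for all three tasks */
        printf("x = %g, y = %g\n", x, y);   /* x = 3, y = 6 */
    }
    return 0;
}
```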

StarPU provides a convenient interface for implementing and parallelizing applications or algorithms that can be described as a graph of tasks. Tasks have to be explicitly submitted to the runtime system along with the data they work on and the corresponding data access mode. Through data analysis, StarPU can automatically detect the dependencies among tasks and build the corresponding DAG. Once a task is submitted, the runtime tracks its dependencies and schedules its execution as soon as these are satisfied, taking care of gathering the data on the unit where the task is actually executed.
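As a concrete illustration of this interface, the sketch below registers a vector with StarPU and submits two tasks that both access it in read-write mode; the runtime infers the dependency and executes them in order. It is only a minimal example, written against the task-insertion helpers of recent StarPU releases (names may differ slightly in the 1.1 version used in Sect. 5), and is unrelated to the actual MaPHyS kernels.

```c
#include <stdio.h>
#include <stdint.h>
#include <starpu.h>

/* CPU implementation of a task scaling a vector by a value passed in cl_arg */
static void scale_cpu(void *buffers[], void *cl_arg)
{
    double *x = (double *)STARPU_VECTOR_GET_PTR(buffers[0]);
    unsigned n = STARPU_VECTOR_GET_NX(buffers[0]);
    double alpha;
    starpu_codelet_unpack_args(cl_arg, &alpha);
    for (unsigned i = 0; i < n; i++) x[i] *= alpha;
}

static struct starpu_codelet scale_cl = {
    .cpu_funcs = { scale_cpu },
    .nbuffers  = 1,
    .modes     = { STARPU_RW },
};

int main(void)
{
    double x[4] = { 1, 2, 3, 4 };
    double two = 2.0, half = 0.5;
    starpu_data_handle_t hx;

    if (starpu_init(NULL) != 0) return 1;
    /* Register x so that StarPU manages its transfers and consistency
     * (node 0 is the main memory). */
    starpu_vector_data_register(&hx, 0, (uintptr_t)x, 4, sizeof(double));

    /* Sequential submission; both tasks access hx in RW mode, so the
     * runtime serializes them while scheduling them asynchronously. */
    starpu_task_insert(&scale_cl, STARPU_RW, hx,
                       STARPU_VALUE, &two, sizeof(two), 0);
    starpu_task_insert(&scale_cl, STARPU_RW, hx,
                       STARPU_VALUE, &half, sizeof(half), 0);

    starpu_task_wait_for_all();   /* wait until the task graph is executed */
    starpu_data_unregister(hx);   /* give the data back to the application */
    starpu_shutdown();

    printf("x[0] = %g\n", x[0]);  /* 1 * 2 * 0.5 = 1 */
    return 0;
}
```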

5 Experimental Results

In this section we present preliminary results of our task-based prototype of MaPHyS. The tests presented in this section were performed on the PlaFRIM 2 platform installed at Inria Bordeaux-Sud-Ouest, more precisely on the sirocco nodes. These nodes are composed of two dodeca-core Haswell Intel Xeon E5-2680 processors, for a total of 24 cores per node, and 128 GB of RAM. Each node is equipped with 4 Nvidia K40-M GPUs, each one having 12 GB of RAM. We consider the SPD Audi_kw matrix (size \(n =900\mathrm{K}\) and non-zeros \(nnz =392\mathrm{M}\)) to illustrate the behavior of the proposed prototype solver. Both Chameleon and PaStiX use version 1.1 of the StarPU runtime system described in Sect. 4.2. All tests were performed in double precision.

Figures 3a and b depict the traces obtained on one node, using only CPU cores or both GPUs and CPU cores, respectively. In both cases, the matrix has been decomposed into four subdomains. Each subdomain is associated with one MPI process in charge of a subset of six CPU cores (left trace, Fig. 3a), or six CPU cores and one GPU (right trace, Fig. 3b), respectively. The runtime system orchestrates the execution of the tasks on the different processing units. The traces represent the execution on one particular subdomain. In the heterogeneous case, each GPU has a CPU core dedicated to handling it (see Sect. 4.2).

Fig. 3. Multicore execution trace associated with one subdomain of the MPI+task MaPHyS prototype processing the Audi_kw matrix. Four subdomains (hence four processes) are used in total.

The resulting traces show the versatility of the approach, which composes multithreaded and task-based numerical kernels. The processing units are abstracted and the same code may be executed indifferently in the homogeneous or in the heterogeneous case. Although the implementation is still preliminary and not optimized, Table 1 shows that the resulting timings allow for accelerating all three numerical steps with the use of one GPU per subdomain. The setup of the preconditioner benefits from the highest acceleration as it mostly consists of a dense factorization accelerated with Chameleon. The factorization of the interiors has a limited (but not negligible) acceleration because the PaStiX internal kernels have not been tuned for the Nvidia K40-M GPU. The solve step is accelerated thanks to the application of the preconditioner with Chameleon. The time differences between the fastest and slowest subdomain computations for the factorization of the interiors and the preconditioner setup are related to the matrix partitioning, which balances the splitting of the adjacency graph but not the computation associated with each subgraph. There is ongoing work in the graph partitioning community to address this issue, which is out of the scope of this work.

Table 1. Minimum, average and maximum time per subdomain for the MPI+task MaPHyS prototype for the multicore case (Fig. 3a) and the heterogeneous case (Fig. 3b) processing the Audi_kw matrix. Four subdomains (hence four processes) are used in total and a dense preconditioner is applied.

6 Concluding Remarks

We have proposed an MPI+task extension of MaPHyS for exploiting distributed memory heterogeneous platforms. The modular design of MaPHyS allowed us to use the task-based PaStiX and Chameleon sparse and dense direct libraries, respectively, in order to benefit from their ability to efficiently exploit the underlying heterogeneous architecture.

Although this prototype extension of MaPHyS works properly and shows the feasibility of the proposed approach, designing a solid MPI+task version of MaPHyS would require further work. First of all, the proposed approach still follows a bulk-synchronous parallelism [38] pattern (also sometimes designated as a fork-join approach). Indeed, the calls to PaStiX and Chameleon, yet local to each subdomain, induce costly pre-processing. On the one hand, PaStiX needs to perform a reordering of the variables to limit fill-in and a symbolic factorization. These steps are sequential in the present prototype. Although there exist parallel implementations of these steps, they are known to have a very limited parallel efficiency. To overcome the subsequent synchronizations, it would therefore be necessary to overlap these symbolic pre-processing steps with other numerical operations. On the other hand, following the Plasma design, Chameleon first decomposes the dense matrix into tiles, which is also a synchronizing operation. As in Plasma, there exists an advanced interface for handling matrices already decomposed into tiles. Using this interface would certainly alleviate the bottleneck occurring within the setup of the preconditioner when calling the dense solver.

Other operations involved in the iterative solution step, such as level-one BLAS and the matrix-vector product, could be implemented with a task-based approach. In the case of a dense preconditioner, these operations could also be implemented by calling BLAS operations provided by Chameleon. However, in the present state, without using the advanced interface discussed above, synchronizations would occur multiple times per iteration. A fully task-based CG solver is presented in [7], which discusses in detail how synchronization points can be alleviated.
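To see where these per-iteration synchronization points come from, consider a plain (sequential, dense) preconditioned CG iteration written with CBLAS, as sketched below for a toy SPD system with a Jacobi preconditioner. Each dot product, as well as the residual norm used in the stopping test, produces a scalar needed by the very next operation; these are the reductions that a task-based formulation such as [7] has to reorganize. This sketch is only illustrative and is not the MaPHyS iterative kernel.

```c
#include <stdio.h>
#include <cblas.h>

int main(void)
{
    /* small SPD system A x = b, diagonal (Jacobi) preconditioner */
    double A[9] = { 4, 1, 1,   1, 3, 1,   1, 1, 3 };   /* column-major */
    double b[3] = { 1, 2, 3 };
    double x[3] = { 0, 0, 0 };
    double r[3], z[3], p[3], q[3];
    int n = 3;

    cblas_dcopy(n, b, 1, r, 1);                          /* r = b - A*0       */
    for (int i = 0; i < n; i++) z[i] = r[i] / A[i + i * n]; /* z = M^{-1} r   */
    cblas_dcopy(n, z, 1, p, 1);
    double rho = cblas_ddot(n, r, 1, z, 1);              /* reduction #1      */

    for (int it = 0; it < 50; it++) {
        cblas_dgemv(CblasColMajor, CblasNoTrans, n, n, 1.0, A, n,
                    p, 1, 0.0, q, 1);                    /* q = A p           */
        double alpha = rho / cblas_ddot(n, p, 1, q, 1);  /* reduction #2      */
        cblas_daxpy(n,  alpha, p, 1, x, 1);              /* x += alpha p      */
        cblas_daxpy(n, -alpha, q, 1, r, 1);              /* r -= alpha q      */
        if (cblas_dnrm2(n, r, 1) < 1e-12) break;         /* reduction #3      */
        for (int i = 0; i < n; i++) z[i] = r[i] / A[i + i * n];
        double rho_new = cblas_ddot(n, r, 1, z, 1);      /* reduction #1'     */
        cblas_dscal(n, rho_new / rho, p, 1);             /* p = z + beta p    */
        cblas_daxpy(n, 1.0, z, 1, p, 1);
        rho = rho_new;
    }
    printf("x = (%g, %g, %g)\n", x[0], x[1], x[2]);
    return 0;
}
```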

To completely alleviate the synchronizations between the different sequences into which MaPHyS is decomposed, it would be necessary to further overlap communications with computations. This could be performed with a clever usage of asynchronous MPI calls. This approach is relatively difficult to implement and has been applied to overlap the main stages in MaPHyS, both in the MPI+thread version and in the MPI+task prototype discussed in this paper. However, relying on this paradigm for performing fine-grain overlapping would be challenging and would certainly result in a code that is very complex to maintain. Alternatively, the MPI calls can be appended to the task flow. Doing so, the task-based runtime system can dynamically decide when to perform the actual calls to the MPI layer and interleave them with fine-grain computational tasks. Modern runtime systems such as StarPU and PaRSEC provide such an opportunity. However, even in that case, the task flow would still have to be designed according to the mapping between tasks and processes. On the contrary, the full task approach (see Fig. 2a) would allow for fully abstracting the hardware architecture and would make the mapping issues practically orthogonal to the design of the task flow.