201806 ECCOMAS Glasgow Article
Key words: Parallel CFD, SpMV, Portability, MPI + OpenMP + OpenCL, Hybrid
CPU + GPU, Heterogeneous computing
1 INTRODUCTION
Continuous enhancement in hardware technologies enables scientific computing to advance steadily towards further aims. After hitting petascale speeds in 2008, several organisations and institutions began the well-known global race for exascale high-performance computing (HPC). Since then, hardware developers have been facing two significant challenges. Firstly, the energy efficiency of exascale systems must be improved by two orders of magnitude with respect to the earliest petascale machines. Secondly, the memory bandwidth must be increased to satisfy the demands of the scientific computing community. The common FLOP-oriented architectures (i.e. with very high, and growing, FLOP-to-memory-bandwidth ratios) do not deal efficiently with most of the algorithms used in scientific computing; they barely reach 3% of their peak performance, as shown in the HPCG benchmark [1].
In consequence, massively-parallel devices of various architectures are being incorpo-
rated into the newest supercomputers. This progress leads to an increasing hybridisation
of HPC systems and makes the design of computing applications a rather complex prob-
lem. The computing operations that form the algorithms, the so-called kernels, must be
compatible with distributed- and shared-memory SIMD and MIMD parallelism and, more
importantly, with stream processing (SP), which is a more restrictive parallel paradigm.
Initially, GPU-only implementations proved to be more energy efficient than CPU-only ones [2], even though they leave the majority of CPU cores on the hybrid nodes idle. Heterogeneous implementations rapidly became popular since they can target a wide range of architectures and combine different kinds of parallelism, engaging all the computing hardware available on the node. For instance, the MPI+OpenMP+CUDA implementation in [3] provides high scalability on up to 1024 hybrid nodes. However, relying solely on the proprietary NVIDIA CUDA framework for handling GPUs leads to a loss of portability; considering the enormous complexity of porting existing codes, software efficiency and portability are of crucial importance. Therefore, fully-portable implementations such as the MPI+OpenMP+OpenCL one in [4] are preferred in this work.
In this context of accelerated innovation, it is worth designing modular applications composed of a reduced number of independent and well-defined code blocks. On the one hand, this helps to reduce the introduction of errors and facilitates debugging. On the other hand, modular applications are user-friendly and easier to port to new architectures (the fewer the kernels of an application, the easier it is to provide portability). Furthermore, if the majority of kernels represent linear algebra operations, then standard optimised libraries (e.g. ATLAS, clBLAST) or specific in-house implementations can be used and easily switched.
Nevertheless, the design of modular frameworks requires a long-term, global strategy to ensure that the pieces fit correctly. Most studies of numerical methods that mimic the properties of the underlying physical and mathematical models rely on an operator-based formulation because of its power to analyse and construct accurate discretisations [5, 6]. Such a formulation encouraged Oyarzun et al. [7] to implement an algebra-based CFD algorithm for the simulation of incompressible turbulent flows. Roughly, the approach consists of replacing traditional stencil data structures and sweeps with algebraic data structures and kernels. As a result, the algorithm of the time-integration phase relies on a reduced set of only three basic algebraic operations: the sparse matrix-vector product, the linear combination of vectors and the dot product.
Consequently, this approach combined with a multilevel MPI+OpenMP+OpenCL par-
allelisation naturally provides modularity and portability. Furthermore, in our previous
work, we generalised the concept of the framework to extend its applications beyond CFD;
we presented in [8] the HPC2 (Heterogeneous Portable Code for HPC), a fully-portable,
algebra-based framework with many potential applications in the fields of computational
physics and mathematics.
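To make this concrete, the sketch below shows, in plain C++ with illustrative data structures that are not taken from the HPC2 sources, how a simple explicit time-integration step can be written entirely in terms of these three kernels.

#include <cstddef>
#include <cmath>
#include <vector>

// Minimal CSR matrix and the three algebraic kernels the time
// integration is built from (illustrative implementations only).
struct SpMatrix {
    int n;                              // number of rows
    std::vector<int>    row_ptr, col;   // CSR pattern
    std::vector<double> val;            // non-zero coefficients
};
using Vector = std::vector<double>;

void spmv(const SpMatrix& A, const Vector& x, Vector& y) {      // y = A*x
    for (int i = 0; i < A.n; ++i) {
        double acc = 0.0;
        for (int k = A.row_ptr[i]; k < A.row_ptr[i + 1]; ++k)
            acc += A.val[k] * x[A.col[k]];
        y[i] = acc;
    }
}
void axpy(double a, const Vector& x, Vector& y) {               // y = a*x + y
    for (std::size_t i = 0; i < y.size(); ++i) y[i] += a * x[i];
}
double dot(const Vector& x, const Vector& y) {                  // x^T y
    double s = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) s += x[i] * y[i];
    return s;
}

// One explicit step u <- u + dt*(L*u); the returned norm of L*u
// (computed from the unchanged tmp vector) needs only the dot kernel.
double time_step(const SpMatrix& L, Vector& u, Vector& tmp, double dt) {
    spmv(L, u, tmp);                    // dominant kernel
    axpy(dt, tmp, u);
    return std::sqrt(dot(tmp, tmp));
}

Further operations of the time-integration phase (residual norms, Krylov iterations, etc.) reduce to different combinations of the same three calls, which is what keeps the kernel set so small and portable.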
The most time-consuming operation in our framework is the sparse matrix-vector product (SpMV), which represents up to 80% of the computational time of a simulation, as shown in [7]. It receives a great deal of attention because it is prevalent, and often essential, in many computing applications. Significant effort has been made in many works to adapt sparse matrix storage formats to different architectures and matrix properties; for instance, see [9, 10, 11, 12]. This key operation is a bottleneck in scientific computing because it is memory-bound, with a very low arithmetic intensity, and it leads to indirect memory accesses with unavoidable cache misses. Additionally, it is very challenging in parallel computing because it may involve both inter- and intra-node data exchanges as a result of the domain decomposition approach. Thus, hiding this expensive communication overhead behind the computations is critical. The benefits of overlapping communications and computations on GPUs for the SpMV are demonstrated in [13]. The heterogeneous execution of the SpMV on hybrid CPU+GPU systems is studied in [14], showing a notable gain; however, that study is restricted to a single node with a single NVIDIA GPU.
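To make the low arithmetic intensity concrete, the short estimate below computes the flop-to-byte ratio of a generic CSR SpMV under simple assumptions (double-precision values, 4-byte indices, ideal reuse of x); the exact figures depend on the storage format and on cache behaviour, so this is only an order-of-magnitude sketch.

#include <cstdio>

// Rough arithmetic-intensity estimate for y = A*x in CSR format.
// Assumptions (illustrative, not from the paper): 8-byte values,
// 4-byte indices, x and y counted once per entry, row_ptr once per row.
double csr_spmv_intensity(long rows, long nnz) {
    double flops = 2.0 * nnz;                 // one mul + one add per non-zero
    double bytes = nnz * (8.0 + 4.0)          // values + column indices
                 + rows * (4.0 + 8.0 + 8.0);  // row_ptr + y write + ideal x read
    return flops / bytes;                     // typically ~0.1 flop/byte
}

int main() {
    // Illustrative matrix with ~7 non-zeros per row.
    long rows = 10000000, nnz = 7 * rows;
    double ai = csr_spmv_intensity(rows, nnz);
    // Roofline bound: attainable GFLOP/s <= intensity * bandwidth (GB/s),
    // so ~0.13 flop/byte on a ~250 GB/s device tops out near 34 GFLOP/s.
    std::printf("arithmetic intensity: %.3f flop/byte\n", ai);
    return 0;
}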
In this work, we present the strategies, which form part of the HPC2 core, for the efficient heterogeneous execution of large-scale simulations on hybrid supercomputers. Firstly, the multilevel domain decomposition is proposed as the optimal method for distributing the workload across the HPC system. Secondly, both the multithreaded simple and double overlap execution diagrams are described. Finally, the heterogeneous performance is studied in detail for the major computing kernel, the SpMV, using a sparse matrix derived from a simulation on a hybrid unstructured mesh and up to 32 nodes of a hybrid CPU+GPU supercomputer.
scientific computing community. The sparse pattern of a matrix (i.e. the distribution of the non-zero coefficients) typically arises from the spatial discretisation of a computational domain. The discretised domain is a finite set of objects (such as mesh nodes, cells, faces or vertices) in which some pairs are related in some sense. The couplings of an element (represented by the non-zero coefficients within its row) depend on the numerical method utilised.
In our algebra-based approach, the SpMV kernel represents around 80% of the computational cost. In addition, this key operation is the only one among the three kernels of the HPC2 that requires data exchanges as a result of the domain decomposition approach; that is, the other two kernels are independent of the workload distribution. Therefore, we will only focus on the heterogeneous implementation of the SpMV from now on, considering that the implementation of the axpy and dot kernels is straightforward.
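To illustrate why the other two kernels are independent of the workload distribution, a generic distributed implementation is sketched below (with hypothetical argument names): the axpy needs no communication at all, and the dot product only needs a single global reduction.

#include <mpi.h>

// y = a*x + y over the locally-owned entries; no communication needed.
void axpy(double a, const double* x, double* y, int n_local) {
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n_local; ++i)
        y[i] += a * x[i];
}

// Global dot product: local reduction plus one MPI_Allreduce;
// independent of how the rows are distributed among processes.
double dot(const double* x, const double* y, int n_local, MPI_Comm comm) {
    double local = 0.0, global = 0.0;
    #pragma omp parallel for reduction(+ : local) schedule(static)
    for (int i = 0; i < n_local; ++i)
        local += x[i] * y[i];
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, comm);
    return global;
}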
Figure 1: Representation of a generic discretised domain (left). First-level decomposition of the domain between two MPI processes (right).
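In practice, the first-level decomposition determines which rows of the distributed sparse matrix couple only to locally-owned unknowns and which ones couple to halo unknowns owned by neighbouring MPI processes. The sketch below is a minimal, hypothetical illustration of such a classification (the names and the inner/interface split are our own illustration of the INN/IFC blocks used later; the actual HPC2 data structures may differ): inner rows can be processed while the halo update is in flight, whereas interface rows must wait for it.

#include <vector>

// Classify locally-owned CSR rows as inner or interface, assuming each
// MPI process owns a contiguous range of global column indices.
void classify_rows(int n_local, long first_owned_col, long last_owned_col,
                   const std::vector<int>& row_ptr,
                   const std::vector<long>& col_global,
                   std::vector<int>& inner_rows,
                   std::vector<int>& interface_rows) {
    for (int i = 0; i < n_local; ++i) {
        bool has_halo_coupling = false;
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; ++k)
            if (col_global[k] < first_owned_col || col_global[k] > last_owned_col) {
                has_halo_coupling = true;   // couples to another process
                break;
            }
        (has_halo_coupling ? interface_rows : inner_rows).push_back(i);
    }
}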
Figure 2: Two different strategies for the second-level decomposition. Minimising the number of cou-
plings (left), isolating the devices from the external communications (right).
(outlined in Table 1). The host blocks (white-coloured) are blocking tasks in the sense that the assigned OpenMP threads must finish them before proceeding. In contrast, the device blocks (grey-coloured) are non-blocking; that is, the assigned OpenMP thread only submits the task to an OpenCL queue and then continues with the following one. In the D2H and H2D blocks, the use of mapped, pinned-memory OpenCL intermediate buffers is necessary; such pinned buffers are needed for DMA transfers, which can be overlapped with computations.
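A minimal sketch of this mechanism is shown below (buffer names and the packing loop are illustrative, not taken from the HPC2 sources): a pinned host-side staging buffer is created with CL_MEM_ALLOC_HOST_PTR and mapped, the halo slice is packed into it, and a non-blocking clEnqueueWriteBuffer then lets the DMA transfer proceed while the submitting thread continues.

#include <CL/cl.h>

// Hypothetical H2D block: copy a halo slice to the device without
// blocking the submitting thread, through a pinned (page-locked) buffer.
// Error checks are omitted for brevity.
void h2d_halo(cl_context ctx, cl_command_queue q, cl_mem d_halo,
              const double* halo_src, size_t n, cl_event* done) {
    cl_int err;
    size_t bytes = n * sizeof(double);

    // CL_MEM_ALLOC_HOST_PTR asks the runtime for page-locked memory
    // suitable for DMA. In a real code this buffer would be created and
    // mapped once during setup and reused for every exchange.
    cl_mem pinned = clCreateBuffer(ctx, CL_MEM_ALLOC_HOST_PTR | CL_MEM_READ_ONLY,
                                   bytes, NULL, &err);
    double* host_ptr = (double*)clEnqueueMapBuffer(q, pinned, CL_TRUE, CL_MAP_WRITE,
                                                   0, bytes, 0, NULL, NULL, &err);
    for (size_t i = 0; i < n; ++i) host_ptr[i] = halo_src[i];   // pack the halo

    // Non-blocking write: the call returns immediately and the transfer
    // overlaps with host computations until `done` completes.
    clEnqueueWriteBuffer(q, d_halo, CL_FALSE, 0, bytes, host_ptr, 0, NULL, done);
}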
Table 1: Description of the operational blocks composing the execution diagrams in Figure 3.
The simple overlap diagram is shown in Figure 3 (left). In this mode, the halo update is overlapped with the INN and IFC blocks. Two threads are created in the outer OpenMP region: one for the host computations and another for managing the devices' OpenCL queues. The devices' outer thread spawns an OpenMP nested region with as many threads as devices ($th_d$). The host thread executes the host computations within an OpenMP nested region engaging the remaining threads ($th_h$). In this simple overlap method, the devices are involved in external communications. Hence, the H2D block must start right after the MPI synchronisation. Assuming that all devices perform equally, the overall computational time of the simple overlap can be estimated as

$$ t_{sov} = \max\big(t^h_{inn},\; t^d_{inn} + t_{d2h} + t_{mpi}\big) + \max\big(t^h_{ifc},\; t_{h2d} + t^d_{ifc}\big). \qquad (1) $$

Thus, since the time depends on the maximum values among the different blocks, a proper load balancing becomes very important.
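A minimal sketch of this thread layout is given below (hypothetical function names; the real HPC2 scheduling code differs): one outer OpenMP thread drives the host computations in a nested region with the remaining cores, while the other spawns one nested thread per device that only submits non-blocking OpenCL tasks.

#include <cstdio>
#include <omp.h>

// Hypothetical work functions standing in for the real blocks.
void host_inner_compute(int th_h) {
    #pragma omp parallel num_threads(th_h)
    { /* host INN (and later IFC) computations would run here */ }
}
void manage_device_queue(int device_id) {
    // In the real code this thread only enqueues non-blocking OpenCL
    // tasks (INN, D2H, H2D, IFC) on the queue of device `device_id`.
    std::printf("submitting work to device %d\n", device_id);
}

// Thread layout of the simple overlap: one outer thread drives the host
// computations, the other drives the devices through nested threads.
void simple_overlap_step(int n_devices, int n_cores) {
    omp_set_nested(1);                        // allow nested parallel regions
    #pragma omp parallel num_threads(2)
    {
        if (omp_get_thread_num() == 0) {
            host_inner_compute(n_cores - n_devices);     // th_h threads
        } else {
            #pragma omp parallel num_threads(n_devices)  // th_d threads
            manage_device_queue(omp_get_thread_num());
        }
    }
}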
The double overlap diagram is shown in Figure 3 (right). The main difference is that, in this case, the external MPI communications are performed simultaneously with the internal D2H and H2D exchanges. To do so, the second-level decomposition must have isolated the devices from the external interface, as shown in Figure 2 (right). This way, the MPI block becomes independent of D2H and, consequently, H2D becomes independent of MPI. Nevertheless, the host IFC block is bigger in this case. Hence, the double overlap is beneficial only if the host computations are faster than the internal communications. The overall computational time of the double overlap can be estimated as

$$ t_{dov} = \max\big(t^h_{inn},\; \max(t_{mpi},\; t^d_{inn} + t_{d2h} + t_{h2d})\big) + \max\big(t^h_{ifc},\; t^d_{ifc}\big). \qquad (2) $$
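As an illustration, estimates (1) and (2) could be evaluated from measured block times to choose the more profitable mode, as in the following hypothetical sketch (the structure and field names are ours, derived solely from the two equations).

#include <algorithm>

// Measured (or estimated) times of the operational blocks, in seconds.
// Note: in practice h_ifc differs between the two decompositions, since
// the host interface is larger in the double-overlap layout.
struct BlockTimes {
    double h_inn, h_ifc;   // host INN and IFC computations
    double d_inn, d_ifc;   // device INN and IFC computations
    double d2h, h2d, mpi;  // internal exchanges and MPI halo update
};

// Equation (1): simple overlap estimate.
double t_simple(const BlockTimes& t) {
    return std::max(t.h_inn, t.d_inn + t.d2h + t.mpi)
         + std::max(t.h_ifc, t.h2d + t.d_ifc);
}

// Equation (2): double overlap estimate.
double t_double(const BlockTimes& t) {
    return std::max(t.h_inn, std::max(t.mpi, t.d_inn + t.d2h + t.h2d))
         + std::max(t.h_ifc, t.d_ifc);
}

// Pick the overlap mode that minimises the estimated step time.
bool prefer_double_overlap(const BlockTimes& t) {
    return t_double(t) < t_simple(t);
}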
Besides, the synchronous execution scheme may be relevant for tests and comparisons. It is not shown in Figure 3 due to its simplicity. Essentially, the idea is to complete the halo update first and then proceed with the computations, so the communication time simply adds to the computation time.
Figure 3: Multithreaded execution diagrams for simple overlap (left), double overlap (right). The
white-coloured blocks correspond to host; the grey-coloured ones correspond to devices.
The benefits of the heterogeneous CPU+GPU execution of the SpMV are measured
on the Lomonosov-2 hybrid supercomputer. Its nodes are equipped with a 14-core Intel
E5-2697v3 CPU and an NVIDIA Tesla K40M GPU. The performance comparison for the
CPU-only, GPU-only and heterogeneous executions on a single node is shown in Figure 4.
The heterogeneous mode shows a gain of 32% compared to the GPU-only mode, which corresponds to a heterogeneous efficiency of 98% with respect to the sum of the performances of the CPU-only and GPU-only modes.
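In other words, denoting the sustained performances of the CPU-only, GPU-only and heterogeneous modes by $P_{cpu}$, $P_{gpu}$ and $P_{het}$, the two figures above are related as follows (a simple consistency check derived from the reported percentages, not additional measured data):

$$ P_{het} = 1.32\,P_{gpu}, \qquad \eta_{het} = \frac{P_{het}}{P_{cpu} + P_{gpu}} = 0.98 \;\Longrightarrow\; P_{cpu} = \frac{1.32}{0.98}\,P_{gpu} - P_{gpu} \approx 0.35\,P_{gpu}, $$

i.e. the CPU contributes roughly one third of the GPU performance in this single-node test.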
[Figure 4 plot: GFLOP/s achieved by the E5-2697v3 (CPU-only), the Tesla K40 (GPU-only) and the heterogeneous execution.]
Figure 4: Single-node performance comparison for CPU-only, GPU-only and heterogeneous modes.
The strong scalability results in Figure 5 show that the simple overlap strategy presented in this work notably improves the performance by hiding the communications. However, the scalability decays faster in the heterogeneous mode than in the GPU-only mode. This loss occurs because, in the former, the computational load per GPU is smaller and the communication load is higher than in the latter. Besides, in the heterogeneous mode, the CPU is loaded with computations, which may interfere with the MPI library routines. Therefore, the operational range in which overlapping is effective gets reduced.
Finally, the sustained performance for the different execution modes is shown in Figure 6. It can be seen that, despite its slightly weaker scalability, the heterogeneous mode outperforms the GPU-only mode. However, this advantage again decays with the number of nodes because the CPU gets increasingly involved in communications and the load per GPU decreases.
[Figure 5 plot: strong scaling speedup versus number of nodes; series: Linear Speedup, 10M Heterogeneous, 10M GPU (overlap).]
Figure 5: Strong scalability study of the SpMV for different execution modes.
Figure 6: Sustained performance of the SpMV versus the number of nodes for the different execution modes.
4 CONCLUSIONS
An algebra-based framework with a heterogeneous MPI+OpenMP+OpenCL imple-
mentation has been presented. This approach naturally provides modularity and porta-
bility; it can target a wide range of architectures and combine different kinds of parallelism,
engaging all the computing hardware available on the node. Considering the increasing
hybridisation of HPC systems, this appears to be very relevant.
The strong scalability study shows that the benefit of the heterogeneous execution of the SpMV decreases with the number of nodes. Therefore, to run heterogeneous large-scale simulations efficiently, the load per device must be large enough to guarantee adequate device performance and a sufficient contribution from the host. Heterogeneous performance studies must be carried out to determine the optimal workload distribution. To that end, our algebra-based framework appears to be well suited, since a single kernel (the SpMV) is a key representative of the overall performance.
Acknowledgments
The work has been financially supported by the Ministerio de Economía y Competitividad, Spain (ENE2014-60577-R). X. Á. is supported by a FI predoctoral contract (FI B-2017-00614). F. X. T. is supported by a Ramón y Cajal postdoctoral contract (RYC-2012-11996). This work has been carried out using computing resources of the Center for Collective Use of HPC Computing Resources at Lomonosov Moscow State University; the authors thankfully acknowledge these institutions. The authors also wish to thank Ms. E. Pino for her helpful illustrations.
REFERENCES
[1] J. Dongarra and M. A. Heroux, “Toward a new metric for ranking high perfor-
mance computing systems,” Tech. Rep. SAND2013-4744, Sandia National Laborato-
ries, 2013.
[3] C. Xu, X. Deng, L. Zhang, J. Fang, G. Wang, Y. Jiang, W. Cao, Y. Che, Y. Wang, Z. Wang, W. Liu, and X. Cheng, “Collaborating CPU and GPU for large-scale high-order CFD simulations with complex grids on the TianHe-1A supercomputer,” Journal of Computational Physics, vol. 278, pp. 275–297, Dec. 2014.
[12] W. Liu and B. Vinter, “CSR5: An Efficient Storage Format for Cross-Platform Sparse Matrix-Vector Multiplication,” in Proceedings of the 29th ACM International Conference on Supercomputing, pp. 339–350, ACM Press, Mar. 2015.
[14] W. Yang, K. Li, and K. Li, “A hybrid computing method of SpMV on CPU–GPU heterogeneous computing systems,” Journal of Parallel and Distributed Computing, vol. 104, pp. 49–60, Jun. 2017.
[15] D. Lasalle and G. Karypis, “Multi-threaded Graph Partitioning,” in 2013 IEEE 27th International Symposium on Parallel and Distributed Processing, pp. 225–236, May 2013.