ECP 2018 MAGMA Tutorial 1
Accelerating Linear Algebra with MAGMA
Stan Tomov, Mark Gates, and Azzam Haidar
Innovative Computing Laboratory
University of Tennessee, Knoxville
• Part I
Overview of dense linear algebra libraries
Design principles and fundamentals
• Part II
MAGMA Overview
Availability, routines, code, testers, methodology
• Part III
MAGMA Batched
MAGMA Sparse
Dense Linear Algebra in Applications
Dense Linear Algebra (DLA) is needed in a wide variety of science and
engineering applications:
• Linear systems: Solve Ax = b
• Computational electromagnetics, material science, applications using
boundary integral equations, airflow past wings, fluid flow around ships
and other offshore structures, and many more
• Least squares: Find x to minimize || Ax – b ||
• Computational statistics (e.g., linear least squares or ordinary least squares),
econometrics, control theory, signal processing, curve fitting, and many more
• Eigenproblems: Solve Ax = λ x
• Computational chemistry, quantum mechanics, material science, face recognition,
PCA, data mining, marketing, Google PageRank, spectral clustering, vibrational
analysis, compression, and many more
• SVD: A = U Σ V* (Au = σv and A*v = σu)
• Information retrieval, web search, signal processing, big data analytics, low rank
matrix approximation, total least squares minimization, pseudo-inverse, and many more
• Many variations depending on structure of A
• A can be symmetric, positive definite, tridiagonal, Hessenberg, banded,
sparse with dense blocks, etc.
• DLA is crucial to the development of sparse solvers
Overview of Dense Numerical Linear Algebra Libraries
netlib.org
icl.utk.edu/research
[Diagram: dense linear algebra software stack: BLAS (kernels for dense linear algebra); LAPACK and PLASMA (dense linear algebra, multicore); new software for multicore and accelerators; support from ECP: SLATE, CEED, PEEKS, xSDK]
Why use GPUs in HPC?
PERFORMANCE & ENERGY EFFICIENCY
MAGMA 2.3 LU factorization in double precision arithmetic, and energy efficiency (under ~ the same power draw)
• CPU: Intel Xeon E5-2650 v3 (Haswell), 2×10 cores @ 2.30 GHz
• K40: NVIDIA Kepler GPU, 15 MP × 192 @ 0.88 GHz
• P100: NVIDIA Pascal GPU, 56 MP × 64 @ 1.19 GHz
• V100: NVIDIA Volta GPU, 80 MP × 64 @ 1.38 GHz
[Figure: performance in GFLOP/s vs. matrix size N × N (2k to 36k) and energy efficiency in GFLOP/s per Watt for CPU, K40, P100, and V100; roughly 10× gains for the V100 over the CPU in both performance and energy efficiency]
BLAS: Basic Linear Algebra Subroutines
Why Higher Level BLAS?
• By taking advantage of the principle of locality:
• Present the user with as much memory as is available in the
cheapest technology.
• Provide access at the speed offered by the fastest technology.
• Can only do arithmetic on data at the top of the hierarchy
• Higher level BLAS lets us do this
Memory hierarchy: Registers → L1 Cache → L2 Cache → Local Memory → Remote Memory → Secondary Memory
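As a rough illustration (standard operation counts, added here for reference rather than taken from the slide): a Level 1 operation such as axpy does about 2n flops on about 3n memory references (ratio ≈ 2/3); a Level 2 operation such as gemv does about 2n² flops on about n² references (ratio ≈ 2); a Level 3 operation such as gemm does about 2n³ flops on about 4n² references (ratio ≈ n/2). Only Level 3 reuses each operand element many times once it sits in cache or registers, which is why it can run near peak.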
Level 1, 2 and 3 BLAS
NVIDIA P100, 1.19 GHz, theoretical peak double precision = 4700 Gflop/s; CUDA version 8.0
[Figure: Gflop/s vs. matrix size N (vector size N×N), N = 2k to 20k]
• dgemm (BLAS Level 3), C = C + A*B: 4503 Gflop/s
• dgemv (BLAS Level 2), y = y + A*x: 145 Gflop/s
• daxpy (BLAS Level 1), y = α*x + y: 52 Gflop/s
Level 3 is about 31× faster than Level 2.
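The three routines in the chart correspond to the three BLAS levels. A minimal CBLAS sketch of calling them from C (the sizes and zero-initialized data are illustrative only; link against any CBLAS implementation):

/* Minimal sketch of the three BLAS levels used in the chart above. */
#include <stdlib.h>
#include <cblas.h>

int main(void) {
    int n = 1000;
    double *A = calloc((size_t)n * n, sizeof(double));   /* n-by-n, column major */
    double *B = calloc((size_t)n * n, sizeof(double));
    double *C = calloc((size_t)n * n, sizeof(double));
    double *x = calloc(n, sizeof(double));
    double *y = calloc(n, sizeof(double));

    /* Level 1: y = alpha*x + y      (O(n) flops on O(n) data)     */
    cblas_daxpy(n, 2.0, x, 1, y, 1);

    /* Level 2: y = alpha*A*x + beta*y  (O(n^2) flops on O(n^2) data) */
    cblas_dgemv(CblasColMajor, CblasNoTrans, n, n, 1.0, A, n, x, 1, 1.0, y, 1);

    /* Level 3: C = alpha*A*B + beta*C  (O(n^3) flops on O(n^2) data) */
    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, A, n, B, n, 1.0, C, n);

    free(A); free(B); free(C); free(x); free(y);
    return 0;
}

Only the Level 3 call does O(n³) work on O(n²) data, which is what lets it approach the hardware peak in the chart.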
A brief history of (Dense) Linear Algebra software
• LAPACK – “Linear Algebra PACKage” - uses BLAS-3 (1989 – now)
• Ex: Obvious way to express Gaussian Elimination (GE) is adding multiples of one row to
other rows – BLAS-1
• How do we reorganize GE to use BLAS-3? (see the blocked-update sketch after this list)
• Contents of LAPACK (summary)
• Algorithms we can turn into (nearly) 100% BLAS 3
• Linear Systems: solve Ax=b for x
• Least Squares: choose x to minimize ||Ax − b||₂
• Algorithms that are only 50% BLAS 3 (so far)
• “Eigenproblems”: Find λ and x where Ax = λ x
• Singular Value Decomposition (SVD): (AᵀA)x = σ²x
• Generalized problems (e.g., Ax = λ Bx)
• Error bounds for everything
• Lots of variants depending on A’s structure (banded, A = Aᵀ, etc.)
• How much code? (Release 3.8, Nov 2017) (www.netlib.org/lapack)
• Source: 1674 routines, 490K LOC, Testing: 448K LOC
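As a hedged illustration of the reorganization asked about above (a generic right-looking blocked step, not the actual LAPACK dgetrf code): once a panel of nb columns has been factored, the triangular solve and the trailing-matrix update are pure Level 3 BLAS, and that is where nearly all of the (2/3)n³ flops end up.

/* Sketch of one blocked step of right-looking LU (pivoting not shown).
   factor_panel() is a placeholder for an unblocked BLAS-1/2 panel factorization;
   the point is that the O(n^3) work lands in cblas_dtrsm / cblas_dgemm. */
#include <stddef.h>
#include <cblas.h>

void lu_blocked_step(int n, int nb, double *A, int lda) {
    /* 1. Factor the nb-column panel A(0:n-1, 0:nb-1) with unblocked code. */
    /*    factor_panel(n, nb, A, lda);   -- placeholder                    */

    /* 2. U12 = L11^{-1} * A12  (triangular solve, Level 3 BLAS) */
    cblas_dtrsm(CblasColMajor, CblasLeft, CblasLower, CblasNoTrans, CblasUnit,
                nb, n - nb, 1.0, A, lda, A + (size_t)nb * lda, lda);

    /* 3. A22 = A22 - L21 * U12  (rank-nb update, Level 3 BLAS) */
    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                n - nb, n - nb, nb, -1.0,
                A + nb, lda,                        /* L21 */
                A + (size_t)nb * lda, lda,          /* U12 */
                1.0, A + nb + (size_t)nb * lda, lda /* A22 */);
}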
A brief history of (Dense) Linear Algebra software
• Is LAPACK parallel?
• Only if the BLAS are parallel (possible in shared memory)
LAPACK
• http://www.netlib.org/lapack/
• LAPACK (Linear Algebra Package) provides routines for
  • solving systems of simultaneous linear equations,
  • least-squares solutions of linear systems of equations,
  • eigenvalue problems,
  • and singular value problems.
• LAPACK relies on BLAS
• The associated matrix factorizations (LU, Cholesky, QR, SVD, Schur, generalized Schur) are also provided, as are related computations such as reordering of the Schur factorizations and estimating condition numbers.
• Dense and banded matrices are handled, but not general sparse matrices. In all areas, similar functionality is provided for real and complex matrices, in both single and double precision.
Side notes: LAPACK is in FORTRAN; column major; LAPACK is SEQUENTIAL; LAPACK is a REFERENCE implementation.
Parallelism in LAPACK
Overview of Dense Numerical Linear Algebra Libraries
ScaLAPACK structure
[Diagram: ScaLAPACK software stack: ScaLAPACK on top of the PBLAS (global addressing), built on LAPACK, BLACS, BLAS, and MPI (local addressing); the layers range from platform independent down to platform specific]
ScaLAPACK routine, solve AX = B
• LAPACK: dgesv(n, nrhs, A, lda, ipiv, B, ldb, info)
• ScaLAPACK: pdgesv(n, nrhs, A, ia, ja, descA, ipiv, B, ib, jb, descB, info)
• Input: global matrix point of view
• Output:
  • info (error code): = 0 no error; < 0 invalid argument; > 0 numerical error (e.g., singular)
  • L, U overwrite A
  • X overwrites B
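For reference, a minimal sketch of calling the LAPACK routine above through LAPACKE, the standard C interface (the 3×3 system is made up for illustration):

/* Solve A*X = B with LAPACK's dgesv via the LAPACKE C interface. */
#include <stdio.h>
#include <lapacke.h>

int main(void) {
    lapack_int n = 3, nrhs = 1, lda = 3, ldb = 3, ipiv[3];
    /* Column-major storage */
    double A[9] = { 4, 1, 2,   1, 3, 0,   2, 0, 5 };
    double B[3] = { 1, 2, 3 };

    lapack_int info = LAPACKE_dgesv(LAPACK_COL_MAJOR, n, nrhs, A, lda, ipiv, B, ldb);
    if (info == 0)
        printf("x = [%g, %g, %g]\n", B[0], B[1], B[2]);   /* X overwrites B */
    else if (info < 0)
        printf("argument %d had an illegal value\n", (int)-info);
    else
        printf("U(%d,%d) is exactly zero; A is singular\n", (int)info, (int)info);
    return 0;
}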
2D block-cyclic layout
• m × n matrix
• p × q process grid
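As a concrete note on the mapping (the standard 2D block-cyclic rule written out here, assuming block (0, 0) starts on process (0, 0)): the matrix is split into nb × nb blocks, and block (i, j) is owned by the process at grid position (i mod p, j mod q). A minimal sketch in C:

/* Standard 2D block-cyclic mapping (ScaLAPACK-style, zero offsets assumed):
   global element (gi, gj) -> block (gi/nb, gj/nb) -> owner ((gi/nb) % p, (gj/nb) % q). */
#include <stdio.h>

typedef struct { int pi, pj; } GridPos;

static GridPos block_cyclic_owner(int gi, int gj, int nb, int p, int q) {
    GridPos g;
    g.pi = (gi / nb) % p;   /* block row, cycled over the p process rows       */
    g.pj = (gj / nb) % q;   /* block column, cycled over the q process columns */
    return g;
}

int main(void) {
    /* Example: block size 64 on a 2 x 3 process grid */
    GridPos g = block_cyclic_owner(1000, 500, 64, 2, 3);
    printf("element (1000, 500) lives on process (%d, %d)\n", g.pi, g.pj);
    return 0;
}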
Parallelism in ScaLAPACK
• Similar to LAPACK
• Bulk-synchronous
• Most flops in the gemm update (the (2/3)n³ term)
• Can use sequential BLAS: p × q = # cores = # MPI processes, num_threads = 1
• Or multi-threaded BLAS: p × q = # nodes = # MPI processes, num_threads = # cores/node
Major Changes to Software
Software Projects
netlib.org and icl.utk.edu/research
[Diagram, built up over three slides: dense linear algebra libraries and their schedulers]
• Dense linear algebra: BLAS, LAPACK (with the CBLAS and LAPACKE interfaces), ScaLAPACK, PLASMA (multicore), MAGMA, SLATE
• Scheduling / dynamic runtime schedulers: QUARK (PLASMA's original scheduler, later replaced by OpenMP), PaRSEC (distributed memory)
PLASMA
• Dense linear algebra for multicore
• Dataflow scheduling
• Tile matrix layout
• Tile algorithms
Programming with Quark tasking
QUARK version:

#include <quark.h>

int main(int argc, char** argv) {
    Quark *quark = QUARK_New( nthreads );
    ...
    for (int m = 1; m <= 8; m++) {
        for (int n = 1; n <= 7; n++) {
            dgemm_tile_quark( quark, NULL,
                CblasColMajor, CblasNoTrans, CblasNoTrans,
                nb, nb, nb, -1.0,
                A(m, 0), nb, A(0, n), nb, 1.0, A(m, n), nb );
        }
    }
    ...
    QUARK_Delete( quark );
}

void dgemm_tile_quark( Quark *quark, Quark_Task_Flags *task_flags,
    enum CBLAS_ORDER order, enum CBLAS_TRANSPOSE transa, enum CBLAS_TRANSPOSE transb,
    int m, int n, int k, double alpha, double *A, int lda, double *B, int ldb,
    double beta, double *C, int ldc )
{
    ...
}

void dgemm_tile_task( Quark *quark )
{
    ...
}

Equivalent OpenMP task version:

#include <omp.h>

int main(int argc, char** argv) {
    #pragma omp parallel
    #pragma omp master
    {
        ...
        for (int m = 1; m <= 8; m++)
            for (int n = 1; n <= 7; n++) {
                #pragma omp task depend( in:A(m,0)[0:nb*nb] ) \
                                 depend( in:A(0,n)[0:nb*nb] ) \
                                 depend( inout:A(m,n)[0:nb*nb] )
                cblas_dgemm( CblasColMajor, CblasNoTrans, CblasNoTrans,
                    nb, nb, nb, -1.0,
                    A(m, 0), nb, A(0, n), nb, 1.0, A(m, n), nb );
            }
        ...
    }
}

[Figure: resulting task graph, with DTRMM, DLAUMM, and DGEMM tasks]
SLATE
Software for Linear Algebra Targeting Exascale
SLATE Objectives
• Coverage: ScaLAPACK and beyond
• Modern hardware: DOE CORAL (pre-Exascale) → DOE Exascale
• Can be built: serial; OpenMP multithreading; MPI message passing; GPU acceleration
• Portability: Intel Xeon (& Phi), IBM POWER, ARM, NVIDIA, AMD, …
SLATE Stack
[Diagram: SLATE software stack; OMPI-X]
SLATE Resources
• Main ECP website: https://exascaleproject.org
SLATE Working Notes
http://www.icl.utk.edu/publications/series/swans
• Designing SLATE: Software for Linear Algebra Targeting Exascale
  http://www.icl.utk.edu/publications/swan-003
• https://bitbucket.org/icl/blaspp
• https://bitbucket.org/icl/lapackpp
• Roadmap for the Development of a Linear Algebra Library for Exascale Computing: SLATE: Software for Linear Algebra Targeting Exascale
  http://www.icl.utk.edu/publications/swan-001
SLATE Matrix
• The matrix is a collection of tiles; only allocate what is needed (unneeded tiles are simply not allocated).
• While in the PLASMA library the matrix is also stored in tiles, the tiles are laid out contiguously in memory.
• In contrast, in SLATE, the tiles are individually allocated, with no correlation of their locations in the matrix to their addresses in memory.
• Accommodates: symmetric, triangular, band, …
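A minimal sketch (illustrative only, assuming a simple C representation rather than SLATE's actual C++ classes) of the idea of individually allocated tiles: a table of tile pointers in which a tile's nb × nb storage is allocated only when it is first touched.

/* Illustrative tile table: an mt x nt grid of tile pointers, all NULL at first;
   a tile is allocated on demand. Not SLATE's internal data structure. */
#include <stdlib.h>

typedef struct {
    int mt, nt, nb;      /* number of tile rows, tile columns, and the tile size */
    double **tiles;      /* mt * nt pointers; NULL means "not allocated"         */
} TileMatrix;

TileMatrix *tm_create(int mt, int nt, int nb) {
    TileMatrix *M = malloc(sizeof(TileMatrix));
    M->mt = mt;  M->nt = nt;  M->nb = nb;
    M->tiles = calloc((size_t)mt * nt, sizeof(double*));
    return M;
}

/* Return tile (i, j), allocating its nb x nb storage on first use. */
double *tm_tile(TileMatrix *M, int i, int j) {
    double **slot = &M->tiles[i + (size_t)j * M->mt];
    if (*slot == NULL)
        *slot = calloc((size_t)M->nb * M->nb, sizeof(double));
    return *slot;
}

With such a table, a triangular or band matrix only ever allocates the tiles it actually touches, which is the "only allocate what is needed" point above.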
SLATE Distributed Matrix
[Diagram: distributed matrix in SLATE; comparison with LAPACK and MAGMA layouts; C = C − A × B]
GEMM Efficiency
• C = C − A × B with small k, i.e., the DGEMM called in LU factorization
• The matrix fills out the GPU memory; the x axis shows the k dimension.
GEMM Scheduling
• Nested parallelism
• Bottom level: batch GEMM
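At that bottom level, many small independent tile GEMMs are grouped into one batched call. A minimal sketch using cuBLAS's batched interface (illustrative; SLATE dispatches through its own wrappers, and the arrays of device pointers are assumed to be set up by the caller):

/* One batched call performs C_i = C_i - A_i * B_i for `batch` nb x nb tiles.
   dA_array, dB_array, dC_array are device arrays of device pointers. */
#include <cublas_v2.h>

void batched_tile_gemm(cublasHandle_t handle, int nb, int batch,
                       const double * const *dA_array,
                       const double * const *dB_array,
                       double * const *dC_array)
{
    const double alpha = -1.0, beta = 1.0;
    cublasDgemmBatched(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                       nb, nb, nb, &alpha,
                       dA_array, nb, dB_array, nb,
                       &beta, dC_array, nb, batch);
}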
SLATE GPU Performance
Asymptotic scaling:
• 112K × 112K: 1 node, 4 GPUs
• 225K × 225K: 4 nodes, 16 GPUs
• 450K × 450K: 16 nodes, 64 GPUs
SummitDev @ OLCF:
• 3×18 = 54 nodes (IBM S822LC)
• 2×10 = 20 cores (IBM POWER8), ca. 0.5 TFLOPS (2.5%)
• 4 GPUs (NVIDIA P100), ca. 20 TFLOPS (97.5%)
• 256 GB DDR4; 4×16 = 64 GB HBM2
• NVLink 1.0, 80 GB/s (advertised)
• GCC 7.1.0, ESSL 5.5.0, CUDA 8.0.54, Spectrum MPI 10.1.0.3
SLATE GPU Trace
• Cholesky factorization
• 20 cores + 4 GPUs
• 112K × 112K matrix, tile size of 512
SLATE Timeline
• 2016: Q1 research, Q2 design, Q3 prototyping
• 2017: Q4 C++ APIs for BLAS and LAPACK, Q1 parallel BLAS, Q2 parallel norms, Q3 linear systems (LU, LLᵀ, LDLᵀ)
• 2018: Q4 least squares (CA-QR/LQ), Q1 mixed precision linear systems, Q2 matrix inversion, Q3 SVD
• 2019: Q4 EVP
Collaborators and Support
MAGMA team
http://icl.cs.utk.edu/magma
PLASMA team
http://icl.cs.utk.edu/plasma
Collaborating partners
University of Tennessee, Knoxville
Lawrence Livermore National Laboratory,
Livermore, CA
LLNL-led ECP CEED:
Center for Efficient Exascale Discretizations
University of Manchester, Manchester, UK
University of Paris-Sud, France
INRIA, France