ECP 2018 MAGMA Tutorial 1
Accelerating Linear Algebra with MAGMA
Stan Tomov, Mark Gates, and Azzam Haidar
Innovative Computing Laboratory
University of Tennessee, Knoxville
• Part I
Overview of dense linear algebra libraries
Design principles and fundamentals
• Part II
MAGMA Overview
Availability, routines, code, testers, methodology
• Part III
MAGMA Batched
MAGMA Sparse
Dense Linear Algebra in Applications
Dense Linear Algebra (DLA) is needed in a wide variety of science and
engineering applications:
• Linear systems: Solve Ax = b
• Computational electromagnetics, material science, applications using
boundary integral equations, airflow past wings, fluid flow around ships
and other offshore structures, and many more
• Least squares: Find x to minimize || Ax – b ||
• Computational statistics (e.g., linear least squares or ordinary least squares),
econometrics, control theory, signal processing, curve fitting, and many more
• Eigenproblems: Solve Ax = λ x
• Computational chemistry, quantum mechanics, material science, face recognition,
PCA, data mining, marketing, Google PageRank, spectral clustering, vibrational
analysis, compression, and many more
• SVD: A = U Σ V* (Au = σv and A*v = σu)
• Information retrieval, web search, signal processing, big data analytics, low rank
matrix approximation, total least squares minimization, pseudo-inverse, and many more
• Many variations depending on structure of A
• A can be symmetric, positive definite, tridiagonal, Hessenberg, banded,
sparse with dense blocks, etc.
• DLA is crucial to the development of sparse solvers
Overview of Dense Numerical Linear Algebra Libraries
netlib.org
icl.utk.edu/research
[Diagram: dense linear algebra software stack: BLAS (kernels for dense linear algebra); LAPACK and PLASMA (dense linear algebra, multicore); new software for multicore and accelerators; support from ECP: SLATE, CEED, PEEKS, xSDK]
Why use GPUs in HPC?
PERFORMANCE & ENERGY EFFICIENCY
MAGMA 2.3 LU factorization in double precision arithmetic, and energy efficiency (under ~ the same power draw)
• CPU: Intel Xeon E5-2650 v3 (Haswell), 2×10 cores @ 2.30 GHz
• K40: NVIDIA Kepler GPU, 15 MP × 192 @ 0.88 GHz
• P100: NVIDIA Pascal GPU, 56 MP × 64 @ 1.19 GHz
• V100: NVIDIA Volta GPU, 80 MP × 64 @ 1.38 GHz
[Figure: performance in GFLOP/s vs. matrix size N × N (2k to 36k) and energy efficiency in GFLOP/s per Watt for CPU, K40, P100, and V100; roughly 10× gains for the V100 over the CPU in both performance and energy efficiency]
BLAS: Basic Linear Algebra Subroutines
Why Higher Level BLAS?
• By taking advantage of the principle of locality:
• Present the user with as much memory as is available in the
cheapest technology.
• Provide access at the speed offered by the fastest technology.
• Can only do arithmetic on data at the top of the hierarchy
• Higher level BLAS lets us do this
Memory hierarchy: Registers → L1 Cache → L2 Cache → Local Memory → Remote Memory → Secondary Memory
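As a rough illustration (standard operation counts, added here for reference rather than taken from the slide): a Level 1 operation such as axpy does about 2n flops on about 3n memory references (ratio ≈ 2/3); a Level 2 operation such as gemv does about 2n² flops on about n² references (ratio ≈ 2); a Level 3 operation such as gemm does about 2n³ flops on about 4n² references (ratio ≈ n/2). Only Level 3 reuses each operand element many times once it sits in cache or registers, which is why it can run near peak.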
Level 1, 2 and 3 BLAS
NVIDIA P100, 1.19 GHz, theoretical peak double precision = 4700 Gflop/s; CUDA version 8.0
[Figure: Gflop/s vs. matrix size N (vector size N×N), N = 2k to 20k]
• dgemm (BLAS Level 3), C = C + A*B: 4503 Gflop/s
• dgemv (BLAS Level 2), y = y + A*x: 145 Gflop/s
• daxpy (BLAS Level 1), y = α*x + y: 52 Gflop/s
Level 3 is about 31× faster than Level 2.
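The three routines in the chart correspond to the three BLAS levels. A minimal CBLAS sketch of calling them from C (the sizes and zero-initialized data are illustrative only; link against any CBLAS implementation):

/* Minimal sketch of the three BLAS levels used in the chart above. */
#include <stdlib.h>
#include <cblas.h>

int main(void) {
    int n = 1000;
    double *A = calloc((size_t)n * n, sizeof(double));   /* n-by-n, column major */
    double *B = calloc((size_t)n * n, sizeof(double));
    double *C = calloc((size_t)n * n, sizeof(double));
    double *x = calloc(n, sizeof(double));
    double *y = calloc(n, sizeof(double));

    /* Level 1: y = alpha*x + y      (O(n) flops on O(n) data)     */
    cblas_daxpy(n, 2.0, x, 1, y, 1);

    /* Level 2: y = alpha*A*x + beta*y  (O(n^2) flops on O(n^2) data) */
    cblas_dgemv(CblasColMajor, CblasNoTrans, n, n, 1.0, A, n, x, 1, 1.0, y, 1);

    /* Level 3: C = alpha*A*B + beta*C  (O(n^3) flops on O(n^2) data) */
    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, A, n, B, n, 1.0, C, n);

    free(A); free(B); free(C); free(x); free(y);
    return 0;
}

Only the Level 3 call does O(n³) work on O(n²) data, which is what lets it approach the hardware peak in the chart.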
A brief history of (Dense) Linear Algebra software
• LAPACK – “Linear Algebra PACKage” - uses BLAS-3 (1989 – now)
• Ex: Obvious way to express Gaussian Elimination (GE) is adding multiples of one row to
other rows – BLAS-1
• How do we reorganize GE to use BLAS-3? (see the blocked-update sketch after this list)
• Contents of LAPACK (summary)
• Algorithms we can turn into (nearly) 100% BLAS 3
• Linear Systems: solve Ax=b for x
• Least Squares: choose x to minimize ||Ax − b||₂
• Algorithms that are only 50% BLAS 3 (so far)
• “Eigenproblems”: Find λ and x where Ax = λ x
• Singular Value Decomposition (SVD): (AᵀA)x = σ²x
• Generalized problems (e.g., Ax = λ Bx)
• Error bounds for everything
• Lots of variants depending on A’s structure (banded, A = Aᵀ, etc.)
• How much code? (Release 3.8, Nov 2017) (www.netlib.org/lapack)
• Source: 1674 routines, 490K LOC, Testing: 448K LOC
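As a hedged illustration of the reorganization asked about above (a generic right-looking blocked step, not the actual LAPACK dgetrf code): once a panel of nb columns has been factored, the triangular solve and the trailing-matrix update are pure Level 3 BLAS, and that is where nearly all of the (2/3)n³ flops end up.

/* Sketch of one blocked step of right-looking LU (pivoting not shown).
   factor_panel() is a placeholder for an unblocked BLAS-1/2 panel factorization;
   the point is that the O(n^3) work lands in cblas_dtrsm / cblas_dgemm. */
#include <stddef.h>
#include <cblas.h>

void lu_blocked_step(int n, int nb, double *A, int lda) {
    /* 1. Factor the nb-column panel A(0:n-1, 0:nb-1) with unblocked code. */
    /*    factor_panel(n, nb, A, lda);   -- placeholder                    */

    /* 2. U12 = L11^{-1} * A12  (triangular solve, Level 3 BLAS) */
    cblas_dtrsm(CblasColMajor, CblasLeft, CblasLower, CblasNoTrans, CblasUnit,
                nb, n - nb, 1.0, A, lda, A + (size_t)nb * lda, lda);

    /* 3. A22 = A22 - L21 * U12  (rank-nb update, Level 3 BLAS) */
    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                n - nb, n - nb, nb, -1.0,
                A + nb, lda,                        /* L21 */
                A + (size_t)nb * lda, lda,          /* U12 */
                1.0, A + nb + (size_t)nb * lda, lda /* A22 */);
}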
A brief history of (Dense) Linear Algebra software
• Is LAPACK parallel?
• Only if the BLAS are parallel (possible in shared memory)
LAPACK
• http://www.netlib.org/lapack/
• LAPACK (Linear Algebra Package) provides routines for
  • solving systems of simultaneous linear equations,
  • least-squares solutions of linear systems of equations,
  • eigenvalue problems,
  • and singular value problems.
• LAPACK relies on BLAS
• The associated matrix factorizations (LU, Cholesky, QR, SVD, Schur, generalized Schur) are also provided, as are related computations such as reordering of the Schur factorizations and estimating condition numbers.
• Dense and banded matrices are handled, but not general sparse matrices. In all areas, similar functionality is provided for real and complex matrices, in both single and double precision.
Side notes: LAPACK is in FORTRAN; column major; LAPACK is SEQUENTIAL; LAPACK is a REFERENCE implementation.
Parallelism in LAPACK
Overview of Dense Numerical Linear Algebra Libraries
ScaLAPACK structure
[Diagram: ScaLAPACK software stack: ScaLAPACK on top of the PBLAS (global addressing), built on LAPACK, BLACS, BLAS, and MPI (local addressing); the layers range from platform independent down to platform specific]
ScaLAPACK routine, solve AX = B
• LAPACK: dgesv(n, nrhs, A, lda, ipiv, B, ldb, info)
• ScaLAPACK: pdgesv(n, nrhs, A, ia, ja, descA, ipiv, B, ib, jb, descB, info)
• Input: global matrix point of view
• Output:
  • info (error code): = 0 no error; < 0 invalid argument; > 0 numerical error (e.g., singular)
  • L, U overwrite A
  • X overwrites B
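For reference, a minimal sketch of calling the LAPACK routine above through LAPACKE, the standard C interface (the 3×3 system is made up for illustration):

/* Solve A*X = B with LAPACK's dgesv via the LAPACKE C interface. */
#include <stdio.h>
#include <lapacke.h>

int main(void) {
    lapack_int n = 3, nrhs = 1, lda = 3, ldb = 3, ipiv[3];
    /* Column-major storage */
    double A[9] = { 4, 1, 2,   1, 3, 0,   2, 0, 5 };
    double B[3] = { 1, 2, 3 };

    lapack_int info = LAPACKE_dgesv(LAPACK_COL_MAJOR, n, nrhs, A, lda, ipiv, B, ldb);
    if (info == 0)
        printf("x = [%g, %g, %g]\n", B[0], B[1], B[2]);   /* X overwrites B */
    else if (info < 0)
        printf("argument %d had an illegal value\n", (int)-info);
    else
        printf("U(%d,%d) is exactly zero; A is singular\n", (int)info, (int)info);
    return 0;
}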
2D block-cyclic layout
• m × n matrix
• p × q process grid
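As a concrete note on the mapping (the standard 2D block-cyclic rule written out here, assuming block (0, 0) starts on process (0, 0)): the matrix is split into nb × nb blocks, and block (i, j) is owned by the process at grid position (i mod p, j mod q). A minimal sketch in C:

/* Standard 2D block-cyclic mapping (ScaLAPACK-style, zero offsets assumed):
   global element (gi, gj) -> block (gi/nb, gj/nb) -> owner ((gi/nb) % p, (gj/nb) % q). */
#include <stdio.h>

typedef struct { int pi, pj; } GridPos;

static GridPos block_cyclic_owner(int gi, int gj, int nb, int p, int q) {
    GridPos g;
    g.pi = (gi / nb) % p;   /* block row, cycled over the p process rows       */
    g.pj = (gj / nb) % q;   /* block column, cycled over the q process columns */
    return g;
}

int main(void) {
    /* Example: block size 64 on a 2 x 3 process grid */
    GridPos g = block_cyclic_owner(1000, 500, 64, 2, 3);
    printf("element (1000, 500) lives on process (%d, %d)\n", g.pi, g.pj);
    return 0;
}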
Parallelism in ScaLAPACK
• Similar to LAPACK
• Bulk-synchronous
• Most flops in the gemm update (the (2/3)n³ term)
• Can use sequential BLAS: p × q = # cores = # MPI processes, num_threads = 1
• Or multi-threaded BLAS: p × q = # nodes = # MPI processes, num_threads = # cores/node
Major Changes to Software
Software Projects
netlib.org and icl.utk.edu/research
[Diagram, built up over three slides: dense linear algebra libraries and their schedulers]
• Dense linear algebra: BLAS, LAPACK (with the CBLAS and LAPACKE interfaces), ScaLAPACK, PLASMA (multicore), MAGMA, SLATE
• Scheduling / dynamic runtime schedulers: QUARK (PLASMA's original scheduler, later replaced by OpenMP), PaRSEC (distributed memory)
PLASMA
• Dense linear algebra for multicore
• Dataflow scheduling
• Tile matrix layout
• Tile algorithms
Programming with Quark tasking
QUARK version:

#include <quark.h>

int main(int argc, char** argv) {
    Quark *quark = QUARK_New( nthreads );
    ...
    for (int m = 1; m <= 8; m++) {
        for (int n = 1; n <= 7; n++) {
            dgemm_tile_quark( quark, NULL,
                CblasColMajor, CblasNoTrans, CblasNoTrans,
                nb, nb, nb, -1.0,
                A(m, 0), nb, A(0, n), nb, 1.0, A(m, n), nb );
        }
    }
    ...
    QUARK_Delete( quark );
}

void dgemm_tile_quark( Quark *quark, Quark_Task_Flags *task_flags,
    enum CBLAS_ORDER order, enum CBLAS_TRANSPOSE transa, enum CBLAS_TRANSPOSE transb,
    int m, int n, int k, double alpha, double *A, int lda, double *B, int ldb,
    double beta, double *C, int ldc )
{
    ...
}

void dgemm_tile_task( Quark *quark )
{
    ...
}

Equivalent OpenMP task version:

#include <omp.h>

int main(int argc, char** argv) {
    #pragma omp parallel
    #pragma omp master
    {
        ...
        for (int m = 1; m <= 8; m++)
            for (int n = 1; n <= 7; n++) {
                #pragma omp task depend( in:A(m,0)[0:nb*nb] ) \
                                 depend( in:A(0,n)[0:nb*nb] ) \
                                 depend( inout:A(m,n)[0:nb*nb] )
                cblas_dgemm( CblasColMajor, CblasNoTrans, CblasNoTrans,
                    nb, nb, nb, -1.0,
                    A(m, 0), nb, A(0, n), nb, 1.0, A(m, n), nb );
            }
        ...
    }
}

[Figure: resulting task graph, with DTRMM, DLAUMM, and DGEMM tasks]
SLATE
Software for Linear Algebra Targeting Exascale
SLATE Objectives
• Coverage: ScaLAPACK and beyond
• Modern hardware: DOE CORAL (pre-Exascale) → DOE Exascale
• Can be built: serial; OpenMP multithreading; MPI message passing; GPU acceleration
• Portability: Intel Xeon (& Phi), IBM POWER, ARM, NVIDIA, AMD, …
SLATE Stack
[Diagram: SLATE software stack; OMPI-X]
SLATE Resources
• Main ECP website: https://exascaleproject.org
SLATE Working Notes
http://www.icl.utk.edu/publications/series/swans
• Designing SLATE: Software for Linear Algebra Targeting Exascale
  http://www.icl.utk.edu/publications/swan-003
• https://bitbucket.org/icl/blaspp
• https://bitbucket.org/icl/lapackpp
• Roadmap for the Development of a Linear Algebra Library for Exascale Computing: SLATE: Software for Linear Algebra Targeting Exascale
  http://www.icl.utk.edu/publications/swan-001
SLATE Matrix
• The matrix is a collection of tiles; only allocate what is needed (unneeded tiles are simply not allocated).
• While in the PLASMA library the matrix is also stored in tiles, the tiles are laid out contiguously in memory.
• In contrast, in SLATE, the tiles are individually allocated, with no correlation of their locations in the matrix to their addresses in memory.
• Accommodates: symmetric, triangular, band, …
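A minimal sketch (illustrative only, assuming a simple C representation rather than SLATE's actual C++ classes) of the idea of individually allocated tiles: a table of tile pointers in which a tile's nb × nb storage is allocated only when it is first touched.

/* Illustrative tile table: an mt x nt grid of tile pointers, all NULL at first;
   a tile is allocated on demand. Not SLATE's internal data structure. */
#include <stdlib.h>

typedef struct {
    int mt, nt, nb;      /* number of tile rows, tile columns, and the tile size */
    double **tiles;      /* mt * nt pointers; NULL means "not allocated"         */
} TileMatrix;

TileMatrix *tm_create(int mt, int nt, int nb) {
    TileMatrix *M = malloc(sizeof(TileMatrix));
    M->mt = mt;  M->nt = nt;  M->nb = nb;
    M->tiles = calloc((size_t)mt * nt, sizeof(double*));
    return M;
}

/* Return tile (i, j), allocating its nb x nb storage on first use. */
double *tm_tile(TileMatrix *M, int i, int j) {
    double **slot = &M->tiles[i + (size_t)j * M->mt];
    if (*slot == NULL)
        *slot = calloc((size_t)M->nb * M->nb, sizeof(double));
    return *slot;
}

With such a table, a triangular or band matrix only ever allocates the tiles it actually touches, which is the "only allocate what is needed" point above.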
SLATE Distributed Matrix
[Diagram: distributed matrix in SLATE; comparison with LAPACK and MAGMA layouts; C = C − A × B]
GEMM Efficiency
• C = C − A × B with small k, i.e., the DGEMM called in LU factorization
• The matrix fills out the GPU memory; the x axis shows the k dimension.
GEMM Scheduling
• Nested parallelism
• Bottom level: batch GEMM
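At that bottom level, many small independent tile GEMMs are grouped into one batched call. A minimal sketch using cuBLAS's batched interface (illustrative; SLATE dispatches through its own wrappers, and the arrays of device pointers are assumed to be set up by the caller):

/* One batched call performs C_i = C_i - A_i * B_i for `batch` nb x nb tiles.
   dA_array, dB_array, dC_array are device arrays of device pointers. */
#include <cublas_v2.h>

void batched_tile_gemm(cublasHandle_t handle, int nb, int batch,
                       const double * const *dA_array,
                       const double * const *dB_array,
                       double * const *dC_array)
{
    const double alpha = -1.0, beta = 1.0;
    cublasDgemmBatched(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                       nb, nb, nb, &alpha,
                       dA_array, nb, dB_array, nb,
                       &beta, dC_array, nb, batch);
}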
SLATE GPU Performance
Asymptotic scaling:
• 112K × 112K: 1 node, 4 GPUs
• 225K × 225K: 4 nodes, 16 GPUs
• 450K × 450K: 16 nodes, 64 GPUs
SummitDev @ OLCF:
• 3×18 = 54 nodes (IBM S822LC)
• 2×10 = 20 cores (IBM POWER8), ca. 0.5 TFLOPS (2.5%)
• 4 GPUs (NVIDIA P100), ca. 20 TFLOPS (97.5%)
• 256 GB DDR4; 4×16 = 64 GB HBM2
• NVLink 1.0, 80 GB/s (advertised)
• GCC 7.1.0, ESSL 5.5.0, CUDA 8.0.54, Spectrum MPI 10.1.0.3
SLATE GPU Trace
• Cholesky factorization
• 20 cores + 4 GPUs
• 112K × 112K matrix, tile size of 512
SLATE Timeline
• 2016: Q1 research, Q2 design, Q3 prototyping
• 2017: Q4 C++ APIs for BLAS and LAPACK, Q1 parallel BLAS, Q2 parallel norms, Q3 linear systems (LU, LLᵀ, LDLᵀ)
• 2018: Q4 least squares (CA-QR/LQ), Q1 mixed precision linear systems, Q2 matrix inversion, Q3 SVD
• 2019: Q4 EVP
Collaborators and Support
MAGMA team
http://icl.cs.utk.edu/magma
PLASMA team
http://icl.cs.utk.edu/plasma
Collaborating partners
University of Tennessee, Knoxville
Lawrence Livermore National Laboratory,
Livermore, CA
LLNL-led ECP CEED:
Center for Efficient Exascale Discretizations
University of Manchester, Manchester, UK
University of Paris-Sud, France
INRIA, France