Introduction To CUDA Computing
Mike Clark, NVIDIA
Developer Technology Group
Outline

Today:
- Motivation
- GPU Architecture
- Three ways to accelerate applications

Tomorrow:
- QUDA: QCD on GPUs
Why GPU Computing?
[Two charts, 2003-2010: peak GFlops/sec (single and double precision) and memory bandwidth in GBytes/sec, NVIDIA GPUs versus x86 CPUs. The Tesla 20-series (ECC off) pulls far ahead of 3 GHz Nehalem and Westmere on both metrics.]
Stunning Graphics Realism: Lush, Rich Worlds
(image © id Software)
[Demo: N-body simulation, GPU versus CPU.]
Low Latency or High Throughput?

CPU:
- Optimized for low-latency access to cached data sets
- Control logic for out-of-order and speculative execution

GPU:
- Optimized for data-parallel, throughput computation
- Architecture tolerant of memory latency
- More transistors dedicated to computation
Small Changes, Big Speed-up
Application code is split: the compute-intensive functions move to the GPU, while the rest of the sequential code continues to run on the CPU.

Representative speed-ups:

146X   Medical Imaging      U of Utah
36X    Molecular Dynamics   U of Illinois, Urbana
18X    Video Transcoding    Elemental Tech
50X    Matlab Computing     AccelerEyes
100X   Astrophysics         RIKEN
[Chart: successive NVIDIA GPU architectures Tesla, Fermi, and Kepler plotted on a steeply rising curve, roughly 2 up to 14 on the vertical axis.]
GPU Architecture – Fermi

[Die diagram: Host I/F and GigaThread engine, six DRAM interfaces, a shared L2 cache, and an array of Streaming Multiprocessors.]

Streaming Multiprocessors (SMs):
- Perform the actual computations
- Each SM has its own control units, registers, execution pipelines, and caches

ECC on/off option for Quadro and Tesla products

GPU Architecture – Fermi: Streaming Multiprocessor (SM)

[SM diagram: instruction cache, dual schedulers, register file, CUDA cores delivering 32 fp32 ops/clock, and a uniform cache.]
Kepler

[Side-by-side SM diagrams, Fermi versus Kepler. Fermi SM: instruction cache; two warp schedulers, each with two dispatch units; register file; 32 CUDA cores (each with a dispatch port, operand collector, ALU, and result queue); 16 load/store units; 4 special function units; interconnect network; 64 KB configurable shared memory / L1 cache; uniform cache. Kepler SM: instruction cache; four warp schedulers with eight dispatch units; a 65,536 x 32-bit register file; a much larger array of cores plus LD/ST and SFU units; interconnect network; 64 KB shared memory / L1 cache; uniform cache.]
3 Ways to Accelerate Applications

Applications can be accelerated via:
- Libraries
- OpenACC Directives
- Programming Languages

GPU-accelerated libraries include:
- ArrayFire Matrix Computations
- Sparse Linear Algebra for CUDA
- IMSL Library
- Building-block Algorithms for CUDA
- C++ STL Features
3 Steps to CUDA-accelerated application

    // Deallocate device vectors
    cublasFree(d_x);
    cublasFree(d_y);
    cublasShutdown();

Drop-In Acceleration (Step 2)

[The slide repeats the same cleanup fragment; a sketch of the full three-step flow follows below.]
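Only the cleanup fragment survived extraction. For orientation, here is a minimal sketch of the whole three-step flow using the legacy cuBLAS v1 API that the slide's calls come from; the vector length, the alpha value, and the cublasAlloc/cublasSaxpy calls are illustrative assumptions, not taken from the original slide.

    #include <cublas.h>  /* legacy cuBLAS v1 API, matching the slide's calls */

    void gpu_saxpy(int n, float a, float *x, float *y)
    {
        float *d_x, *d_y;

        /* Step 1: initialize cuBLAS and allocate device vectors */
        cublasInit();
        cublasAlloc(n, sizeof(float), (void**)&d_x);
        cublasAlloc(n, sizeof(float), (void**)&d_y);

        /* Step 2: copy inputs to the GPU and make the drop-in BLAS call */
        cublasSetVector(n, sizeof(float), x, 1, d_x, 1);
        cublasSetVector(n, sizeof(float), y, 1, d_y, 1);
        cublasSaxpy(n, a, d_x, 1, d_y, 1);
        cublasGetVector(n, sizeof(float), d_y, 1, y, 1);

        /* Step 3: deallocate device vectors and shut down */
        cublasFree(d_x);
        cublasFree(d_y);
        cublasShutdown();
    }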
Explore the CUDA (Libraries) Ecosystem

developer.nvidia.com/cuda-tools-ecosystem
3 Ways to Accelerate Applications

- Libraries
- OpenACC Directives
- Programming Languages
OpenACC: Open Programming Standard for Parallel Computing

Compiler directives are added to your original Fortran or C code.

"OpenACC will enable programmers to easily develop portable applications that maximize the performance and power efficiency benefits of the hybrid CPU/GPU architecture of Titan."
-- Buddy Bland, Titan Project Director, Oak Ridge National Lab

OpenACC: The Standard for GPU Directives
Fragment of the OpenACC example (end of the iteration loop):

      iter = iter + 1
      err = 0._fp_kind
    end do
    !$acc end data    ! close off data region, copy data back
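The fragment above is the tail of a Fortran data region. For a self-contained illustration of the same pattern, here is a minimal OpenACC SAXPY in C (my own sketch, not from the slides): the data clauses keep the arrays resident on the device, and the loop is offloaded.

    /* Minimal OpenACC sketch (assumption, not slide code): data clauses
       manage device residency; the loop body runs on the GPU. */
    void saxpy_acc(int n, float a, float *x, float *y)
    {
        #pragma acc data copyin(x[0:n]) copy(y[0:n])
        {
            #pragma acc parallel loop
            for (int i = 0; i < n; ++i)
                y[i] = a * x[i] + y[i];
        }   /* end of data region: y is copied back to the host here */
    }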
Directives: Easy & Powerful

- Real-Time Object Detection (Global Manufacturer of Navigation Systems)
- Valuation of Stock Portfolios using Monte Carlo (Global Technology Consulting Company)
- Interaction of Solvents and Biomolecules (University of Texas at San Antonio)

www.nvidia.com/gpudirectives
3 Ways to Accelerate Applications

- Libraries
- OpenACC Directives
- Programming Languages
Language bindings (excerpt):
- C: OpenACC, CUDA C
- C#: GPU.NET
CUDA C

Standard C Code:

    void saxpy_serial(int n, float a, float *x, float *y)
    {
        for (int i = 0; i < n; ++i)
            y[i] = a*x[i] + y[i];
    }

Parallel CUDA C Code:

    __global__
    void saxpy_parallel(int n, float a, float *x, float *y)
    {
        int i = blockIdx.x*blockDim.x + threadIdx.x;
        if (i < n) y[i] = a*x[i] + y[i];
    }
http://developer.nvidia.com/cuda-toolkit
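A __global__ kernel like saxpy_parallel is launched over a grid of thread blocks, one thread per element. A hedged host-side launch sketch (the block size of 256 is an illustrative choice, and d_x/d_y are assumed device pointers prepared with cudaMalloc/cudaMemcpy):

    int n = 1 << 20;                                  // 1M elements
    int blockSize = 256;                              // illustrative block size
    int gridSize = (n + blockSize - 1) / blockSize;   // round up to cover all i
    saxpy_parallel<<<gridSize, blockSize>>>(n, 2.0f, d_x, d_y);
    cudaDeviceSynchronize();                          // wait for completion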
CUDA C++: Develop Generic Parallel Code

CUDA C++ features:
- Templates
- Operator overloading
- Functors (function objects)
- Device-side new/delete
- More...

    template <typename T, typename Oper>
    __global__ void kernel(T *output, int n) {
      Oper op(3.7);
      output = new T[n];  // dynamic allocation
      int i = blockIdx.x*blockDim.x + threadIdx.x;
      if (i < n)
        output[i] = op(i);  // apply functor
    }
http://developer.nvidia.com/cuda-toolkit
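To make the template concrete, here is a hedged sketch of a functor and launch that could instantiate the kernel above; the functor ScaleBy and all launch parameters are illustrative assumptions, not from the slide (note the kernel constructs Oper with 3.7, so the functor needs a matching one-argument constructor):

    // Illustrative functor (assumption): scales the index by a stored factor.
    struct ScaleBy {
        float s;
        __device__ ScaleBy(float s) : s(s) {}
        __device__ float operator()(int i) const { return s * i; }
    };

    // Instantiates kernel<float, ScaleBy>; 4 blocks x 256 threads cover n = 1024.
    // As on the slide, the kernel replaces `output` via device-side new, so the
    // pointer passed in is a placeholder here.
    float *d_out = 0;
    kernel<float, ScaleBy><<<4, 256>>>(d_out, 1024);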
Rapid Parallel C++ Development
http://developer.nvidia.com/thrust or http://thrust.googlecode.com
CUDA Fortran

Program the GPU using Fortran, a key language for HPC. Simple language extensions provide:
- Kernel functions
- Thread / block IDs
- Device & data management

    module mymodule
    contains
      attributes(global) subroutine saxpy(n, a, x, y)
        real :: x(:), y(:), a
        integer :: n, i
        attributes(value) :: a, n
        i = threadIdx%x + (blockIdx%x-1)*blockDim%x
        if (i <= n) y(i) = a*x(i) + y(i)
      end subroutine saxpy
    end module mymodule
Further language bindings:
- Python: PyCUDA
- C# / .NET: GPU.NET
- Numerical analytics: Mathematica

Get Started Today

These languages are supported on all CUDA-capable GPUs.
You might already have a CUDA-capable GPU in your laptop or desktop PC!

PyCUDA (Python): http://mathema.tician.de/software/pycuda
Mathematica: http://www.wolfram.com/mathematica/new-in-8/cuda-and-opencl-support/
Six Ways to SAXPY

Programming languages for GPU computing, each illustrated with single-precision Alpha X Plus Y (SAXPY): y = a*x + y.

OpenACC (C):

    ...
    // Perform SAXPY on 1M elements
    saxpy(1<<20, 2.0, x, y);
    ...

OpenACC (Fortran):

    ...
    ! Perform SAXPY on 1M elements
    call saxpy(2**20, 2.0, x_d, y_d)
    ...

http://developer.nvidia.com/openacc or http://openacc.org
CUBLAS Library

Serial BLAS Code:

    int N = 1<<20;
    ...
    // Use your choice of BLAS library
    ...

Parallel cuBLAS Code:

    int N = 1<<20;
    cublasInit();
    cublasSetVector(N, sizeof(x[0]), x, 1, d_x, 1);
    cublasSetVector(N, sizeof(y[0]), y, 1, d_y, 1);
    ...
    cublasShutdown();
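The elided middle of the parallel version is where the actual BLAS call lands. With the legacy v1 API used here, it would plausibly read (an assumption, mirroring the earlier sketch):

    cublasSaxpy(N, 2.0f, d_x, 1, d_y, 1);   /* presumed drop-in SAXPY call, a = 2.0 */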
CUDA C:

Standard C:

    void saxpy(int n, float a, float *x, float *y)
    {
      for (int i = 0; i < n; ++i)
        y[i] = a*x[i] + y[i];
    }

Parallel CUDA C:

    __global__
    void saxpy(int n, float a, float *x, float *y)
    {
      int i = blockIdx.x*blockDim.x + threadIdx.x;
      if (i < n) y[i] = a*x[i] + y[i];
    }
http://developer.nvidia.com/cuda-toolkit
Thrust C++ Template Library

Serial C++ Code with STL and Boost:

    ...

Parallel C++ Code:

    ...
    thrust::device_vector<float> d_x = x;
    thrust::device_vector<float> d_y = y;
    ...

www.boost.org/libs/lambda    http://thrust.github.com
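Only the device_vector declarations survived extraction. A hedged reconstruction of a complete Thrust SAXPY (the saxpy_functor and the thrust::transform call are my own sketch, not verbatim slide code):

    #include <thrust/host_vector.h>
    #include <thrust/device_vector.h>
    #include <thrust/transform.h>

    // Functor applying y = a*x + y elementwise (illustrative reconstruction).
    struct saxpy_functor {
        float a;
        saxpy_functor(float a) : a(a) {}
        __host__ __device__ float operator()(float x, float y) const {
            return a * x + y;
        }
    };

    int main() {
        thrust::host_vector<float> x(1 << 20, 1.0f), y(1 << 20, 2.0f);

        thrust::device_vector<float> d_x = x;   // copy inputs to the GPU
        thrust::device_vector<float> d_y = y;

        // d_y = 2.0f * d_x + d_y, computed on the device
        thrust::transform(d_x.begin(), d_x.end(), d_y.begin(),
                          d_y.begin(), saxpy_functor(2.0f));
        return 0;
    }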
CUDA Fortran

Standard Fortran:

    module mymodule
    contains
      subroutine saxpy(n, a, x, y)
        real :: x(:), y(:), a
        integer :: n, i
        do i = 1, n
          y(i) = a*x(i) + y(i)
        enddo
      end subroutine saxpy
    end module mymodule

Parallel CUDA Fortran:

    module mymodule
    contains
      attributes(global) subroutine saxpy(n, a, x, y)
        real :: x(:), y(:), a
        integer :: n, i
        attributes(value) :: a, n
        i = threadIdx%x + (blockIdx%x-1)*blockDim%x
        if (i <= n) y(i) = a*x(i) + y(i)
      end subroutine saxpy
    end module mymodule
http://developer.nvidia.com/cuda-fortran
Python

Standard Python:

    def saxpy(a, x, y):
        return [a * xi + yi
                for xi, yi in zip(x, y)]

    cpu_result = saxpy(2.0, x, y)

Copperhead: Parallel Python:

    @cu
    def saxpy(a, x, y):
        return [a * xi + yi
                for xi, yi in zip(x, y)]

    with places.gpu0:
        gpu_result = saxpy(2.0, x, y)

    with places.openmp:
        cpu_result = saxpy(2.0, x, y)

http://numpy.scipy.org    http://copperhead.github.com
Enabling Endless Ways to SAXPY