GPU Programming
I work here
Body-fitted mesh
CFD basics
• C
• MPI
Part 2: CPUs and GPUs
Moore’s Law
“The complexity for minimum component costs has increased at a rate of roughly a factor of two per year. Certainly over the short term this rate can be expected to continue.”
Gordon Moore (Intel), 1965
Source: ftp://download.intel.com/research/silicon/Gordon_Moore_ISSCC_021003.pdf
Was Moore right?
Source: Intel
Feature size
Source: ftp://download.intel.com/research/silicon/Gordon_Moore_ISSCC_021003.pdf
Source: Intel
Clock speed
Source: http://www.tomshardware.com/2005/11/21/the_mother_of_all_cpu_charts_2005/index.html
Power – the clock speed limiter?
• 1 GHz CPU requires ≈ 25 W
• 3 GHz CPU requires ≈ 100 W (3× the clock for roughly 4× the power)
Source: http://www.hotchips.org/hc19/docs/keynote2.pdf
What to do with all these transistors?
Parallel computing
Multi-core chips are either:
– Instruction parallel (Multiple Instruction, Multiple Data) – MIMD
or
– Data parallel (Single Instruction, Multiple Data) – SIMD
Today’s commodity MIMD chips: CPUs
Intel Core 2 Quad
• 4 cores
• 2.4 GHz
• 65 nm features
• 582 million transistors
• 8 MB on-chip memory
Today’s commodity SIMD chips: GPUs
NVIDIA 8800 GTX
• 128 cores
• 1.35 GHz
• 90 nm features
• 681 million transistors
• 768 MB on-board memory
CPUs vs GPUs
Source: http://www.eng.cam.ac.uk/~gp10006/research/Brandvik_Pullan_2008a_DRAFT.pdf
CPUs vs GPUs
Transistor usage:
Source: ftp://download.nvidia.com/developer/presentations/2004/Perfect_Kitchen_Art/English_Evolution_of_GPUs.pdf
Graphics pipeline
GPUs and scientific computing
Source: http://www.ece.ucdavis.edu/~jowens/talks/intel-santaclara-070420.pdf
GPU – Programming for graphics
Courtesy, John Owens, UC Davis
Draw a quad
Source: http://www.ece.wisc.edu/~kati/fpga2008/fpga2008%20workshop%20-%2006%20NVIDIA%20-%20Kirk.pdf
NVIDIA G80 hardware implementation
Divide the 128 cores into 16 Multiprocessors (MPs)
• Each MP has:
– Registers
– Shared memory
– Read-only constant cache
– Read-only texture cache
NVIDIA’s CUDA programming model
• The G80 chip supports MANY active threads: 12,288
• Threads are lightweight:
– Little creation overhead
– “Instant” switching
– Efficiency is achieved through thousands of threads
• Threads are organised into blocks (1D, 2D or 3D)
• Blocks are further organised into a grid
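For illustration (not from the original slides): a minimal kernel showing how each thread combines blockIdx, blockDim and threadIdx to obtain its unique global index; the kernel name and launch sizes are assumptions.

__global__ void whoami(int ni, int nj, int *ids)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;   /* global column index            */
    int j = blockIdx.y*blockDim.y + threadIdx.y;   /* global row index               */
    if (i < ni && j < nj)                          /* grid may overshoot the domain  */
        ids[i + j*ni] = i + j*ni;                  /* one element per thread         */
}
/* Example launch: a 2D grid of 2D blocks covering a 320 x 112 domain     */
/* whoami<<<dim3(20, 7), dim3(16, 16)>>>(320, 112, d_ids);                */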
Kernels, grids, blocks and threads
• Organisation of threads and blocks is the key abstraction
• Software:
– Threads from one block may cooperate (see the sketch below):
• Using data in shared memory
• Through synchronising
• Hardware:
– A block runs on one MP
– The hardware is free to schedule any block on any MP
– More than one block can reside on one MP
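For illustration only (an assumed kernel, not from the slides): threads within one block cooperating through shared memory and a barrier; each thread reads an element that a different thread of the same block loaded. Assumes the array length is a multiple of BLOCK.

#define BLOCK 128

__global__ void reverse_in_block(const float *in, float *out)
{
    __shared__ float tile[BLOCK];          /* visible to all threads of this block    */
    int t = threadIdx.x;
    int g = blockIdx.x*BLOCK + t;

    tile[t] = in[g];                       /* each thread loads one element           */
    __syncthreads();                       /* barrier: wait until the tile is full    */

    out[g] = tile[BLOCK - 1 - t];          /* use data loaded by another thread       */
}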
CUDA implementation
• CUDA is implemented as a set of extensions to C
• CUDA programs:
– explicitly manage host and device memory:
• allocation
• transfers
– set up the thread blocks and grid
– launch kernels (see the host-side sketch below)
– are compiled with the CUDA nvcc compiler
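A minimal host-side sketch of that pattern (allocate, transfer, configure, launch, copy back); the kernel scale_kernel, the sizes and the file name are illustrative, not the lecture's code. Compile with nvcc, e.g. nvcc example.cu.

#include <stdlib.h>
#include <cuda_runtime.h>

__global__ void scale_kernel(int n, float s, float *x)   /* illustrative kernel */
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n) x[i] *= s;
}

int main(void)
{
    int n = 1 << 20;
    size_t bytes = n*sizeof(float);

    float *h_x = (float *)malloc(bytes);                   /* host allocation    */
    for (int i = 0; i < n; i++) h_x[i] = 1.0f;

    float *d_x;
    cudaMalloc((void **)&d_x, bytes);                      /* device allocation  */
    cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice);   /* host -> device     */

    dim3 block(256);                                       /* thread block       */
    dim3 grid((n + block.x - 1)/block.x);                  /* grid               */
    scale_kernel<<<grid, block>>>(n, 2.0f, d_x);           /* kernel launch      */

    cudaMemcpy(h_x, d_x, bytes, cudaMemcpyDeviceToHost);   /* device -> host     */

    cudaFree(d_x);
    free(h_x);
    return 0;
}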
Part 4: An example – CFD
Distribution function
f = f(c, x, t)
c is the microscopic velocity
ρ = ∫ f dc
ρu = ∫ c f dc
u is the macroscopic velocity
Boltzmann equation
The evolution of f:
∂f/∂t + u·∇f = (∂f/∂t)|collisions
Major simplification (the BGK approximation):
∂f/∂t + u·∇f = −(1/τ)(f − f^eq)
Lattice Boltzmann Method
Uniform mesh (lattice)
Restrict the microscopic velocities to a finite set cα:
ρ = Σα fα    ρu = Σα fα cα
Macroscopic flow
• With viscosity: ν = (τ − 1/2) Δx²/Δt
Solution procedure
1. Find the macroscopic properties:
ρ = Σα fα    ρu = Σα fα cα
2. Evaluate fα^eq(ρ, u)
3. Find fα* = fα − (1/τ)(fα − fα^eq)
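Rearranging step 3 gives the update in the form used in the collide code later in these slides (which suggests that rtau = 1/τ and rtau1 = 1 − 1/τ there):
fα* = (1 − 1/τ) fα + (1/τ) fα^eq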
Solution procedure
4. Stream: propagate each fα* to the neighbouring node in its cα direction
Simple prescriptions at boundary nodes
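For illustration only (the slides do not spell the rule out): one common simple prescription is bounce-back at a solid wall, where distributions that would stream into the wall are returned along the opposite direction. A sketch for a node on the bottom (j = 0) wall, assuming a D2Q9 numbering (2 = N, 4 = S, 5 = NE, 6 = NW, 7 = SW, 8 = SE) and per-direction arrays f2..f8 analogous to the f0, f1 arrays below.

i2d = I2D(ni, i, 0);     /* wall node on j = 0 */
f2[i2d] = f4[i2d];       /* N  <- S            */
f5[i2d] = f7[i2d];       /* NE <- SW           */
f6[i2d] = f8[i2d];       /* NW <- SE           */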
CPU code: main.c
/* Memory allocation */
f0 = (float *)malloc(ni*nj*sizeof(float));
...
/* Main loop */
Stream (...args...);
Apply_BCs (...args...);
Collide (...args...);
GPU code: main.cu
/* allocate memory on host */
f0 = (float *)malloc(ni*nj*sizeof(float));
/* Main loop */
Stream (...args...);
Apply_BCs (...args...);
Collide (...args...);
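The slide elides the device-side setup; a sketch of what might sit between the host allocation and the main loop, assuming pitched device arrays (the collide kernel below indexes f0_data with a pitch):

/* allocate pitched memory on the device */
float *f0_data;
size_t pitch;
cudaMallocPitch((void **)&f0_data, &pitch, ni*sizeof(float), nj);

/* copy the initial f's from host to device */
cudaMemcpy2D(f0_data, pitch, f0, ni*sizeof(float),
             ni*sizeof(float), nj, cudaMemcpyHostToDevice);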
CPU code – collide.c
for (j = 0; j < nj; j++) {
  for (i = 0; i < ni; i++) {
    i2d = I2D(ni,i,j);

    /* Flow properties */
    density = ...function of f’s...
    vel_x   = ...     “
    vel_y   = ...     “

    /* Equilibrium f’s */
    f0eq = ...function of density, vel_x, vel_y...
    f1eq = ...     “

    /* Collisions */
    f0[i2d] = rtau1 * f0[i2d] + rtau * f0eq;
    f1[i2d] = rtau1 * f1[i2d] + rtau * f1eq;
    ...
  }
}
GPU code – collide.cu – kernel wrapper
/* Launch kernel from the wrapper function */
collide_kernel<<<grid, block>>>(... args ...);
GPU code – collide.cu - kernel
/* Evaluate indices */
i = blockIdx.x*TILE_I + threadIdx.x;
j = blockIdx.y*TILE_J + threadIdx.y;
i2d = i + j*pitch/sizeof(float);
/* Read from device global memory */
f0now = f0_data[i2d];
f1now = f1_data[i2d];
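The slide stops after the global-memory reads; a sketch of how the rest of the kernel might mirror the CPU loop body (the flow-property and equilibrium expressions are elided exactly as on the CPU slide):

/* Flow properties and equilibrium f's: as in collide.c ... */

/* Collide and write back to device global memory */
f0_data[i2d] = rtau1*f0now + rtau*f0eq;
f1_data[i2d] = rtau1*f1now + rtau*f1eq;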
80 GFLOPs
35 W!
IBM Cell BE
25 x 8 GFLOPs
Chip comparison (Giles 2008)
Source: http://www.cardiff.ac.uk/arcca/services/events/NovelArchitecture/Mike-Giles.pdf
Too much choice!
• Each device has
– different hardware characteristics
– different software (C extensions)
– different developer tools
• How can we write code for all SIMD devices for all
applications?
Big picture – all devices, all problems?
Forget the big picture
Tackle the dwarves!
The View from Berkeley (7 “dwarves”)
1. Dense Linear Algebra
2. Sparse Linear Algebra
3. Spectral Methods
4. N-Body Methods
5. Structured Grids
6. Unstructured Grids
7. MapReduce
Source: http://view.eecs.berkeley.edu/wiki/Main_Page
The View from Berkeley (13 dwarves?)
1. Dense Linear Algebra
2. Sparse Linear Algebra
3. Spectral Methods
4. N-Body Methods
5. Structured Grids
6. Unstructured Grids
7. MapReduce
8. Combinational Logic
9. Graph Traversal
10. Dynamic Programming
11. Backtrack and Branch-and-Bound
12. Graphical Models
13. Finite State Machines
SBLOCK (Brandvik)
• Tackle the structured-grid (stencil operations) dwarf
• Define the kernel using a high-level Python abstraction
• Generate kernels for a range of devices (CPU, GPU, Cell) from the same definition
• Use MPI to handle multiple devices
SBLOCK kernel definition
kind = "stencil"
bpin = ["a"]
bpout = ["b"]
lookup = ((-1, 0, 0), (0, 0, 0), (1, 0, 0), (0, -1, 0),
          (0, 1, 0), (0, 0, -1), (0, 0, 1))
calc = {"lvalue": "b",
        "rvalue": """sf1*a[0][0][0] +
                     sfd6*(a[-1][0][0] + a[1][0][0] +
                           a[0][-1][0] + a[0][1][0] +
                           a[0][0][-1] + a[0][0][1])"""}
SBLOCK – CPU implementation (C)
void smooth(float sf, float *a, float *b)
{
  for (k = 0; k < nk; k++) {
    for (j = 0; j < nj; j++) {
      for (i = 0; i < ni; i++) {
        /* compute indices i000, im100, etc */
        b[i000] = sf1*a[i000] +
                  sfd6*(a[im100] + a[ip100] +
                        a[i0m10] + a[i0p10] +
                        a[i00m1] + a[i00p1]);
      }
    }
  }
}
SBLOCK – GPU implementation (CUDA)
__global__ void smooth_kernel(float sf, float *a_data, float *b_data)
{
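    /* The slide ends here. What follows is an illustrative sketch only: it assumes
       the same six-point stencil as the CPU version, an i-fastest array layout,
       that ni, nj, nk are known to the kernel (e.g. compile-time constants), and
       that sf1 = 1 - sf and sfd6 = sf/6 (an assumption about the smoothing factor).
       blockDim.x/blockDim.y play the role of TILE_I/TILE_J on the collide slide.   */
    float sf1 = 1.0f - sf, sfd6 = sf/6.0f;
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    int j = blockIdx.y*blockDim.y + threadIdx.y;

    /* one common choice: a 2D grid of 2D blocks marches through k */
    for (int k = 1; k < nk - 1; k++) {
        if (i > 0 && i < ni - 1 && j > 0 && j < nj - 1) {
            int i000  = i + ni*(j + nj*k);
            int im100 = i000 - 1,     ip100 = i000 + 1;
            int i0m10 = i000 - ni,    i0p10 = i000 + ni;
            int i00m1 = i000 - ni*nj, i00p1 = i000 + ni*nj;
            b_data[i000] = sf1*a_data[i000] +
                           sfd6*(a_data[im100] + a_data[ip100] +
                                 a_data[i0m10] + a_data[i0p10] +
                                 a_data[i00m1] + a_data[i00p1]);
        }
    }
}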
Further information
http://dx.doi.org/10.1109/JPROC.2008.917757
http://www.gpgpu.org
http://www.oerc.ox.ac.uk/research/many-core-and-reconfigurable-supercomputing