CUDA
PRACE School
Barcelona Supercomputing Center. April 18-21, 2017
Manuel Ujaldón
Full Professor @ University of Malaga (Spain)
CUDA Fellow @ NVIDIA Corporation (USA)
Agenda
4
I. Introduction
Market share: 2010-2016
6
Welcome to the GPU world
7
Commercial models: GeForce vs. Tesla
8
The characters of this story:
The CUDA family picture
9
The impressive evolution of CUDA
From 1 supercomputer in top500.org (77 TFLOPS) to 63 supercomputers in top500.org (aggregate: 80,000 TFLOPS, more than 14% of the 567 PFLOPS in the top500).
From 4,000 academic papers to 60,000 academic papers.
10
Summary of GPU evolution
12
Three reasons for feeling attracted to GPUs
Cost
Low price, thanks to a massive consumer market.
Three GPUs are sold for each CPU, and the ratio keeps growing.
Ubiquitous
Everybody already has a bunch of GPUs.
And you can purchase one almost everywhere.
Power
Ten years ago, GPUs exceeded 200 watts. Now they populate the
Green500 list. Progression in floating-point energy efficiency:

                 GFLOPS/W on float (32-bit)    GFLOPS/W on double (64-bit)
Fermi (2010)     5-6                           3
Kepler (2012)    15-17                         7
Maxwell (2014)   40                            12
13
Highlights
In processing power:
Frequency gives up the leadership: Heat and voltage set the barrier.
Instruction level parallelism (ILP), task parallelism (multi-thread)
and symmetric multiprocessing (SMP) saturate.
Solution: Exploit data parallelism on GPU, which is more scalable.
In static memory (SRAM):
Alternative: leave small caches visible to the programmer.
In dynamic memory (DRAM):
The bandwidth is incremented (OK), but so is the latency (oops).
Solution: Stacked DRAM, the way to break the memory wall by
contributing simultaneously with quantity (GB) and quality (speed).
14
GPU peak performance vs. CPU
[Charts: peak GFLOPS (fp64) and memory bandwidth, GPU vs. CPU, over the years]
16
CUDA C at a glance
Terminology:
Host: The CPU and the memory on motherboard [DDR3].
Device: The graphics card [GPU + video memory]:
GPU: Nvidia GeForce/Tesla.
Video memory: GDDR5 or 3D memory.
18
Heterogeneous Computing (2/4)
[Diagram: the CPU (host) with its cores and caches, the GPU (device), and a memory link of about 50 GB/s]
#include <iostream>
#include <algorithm>
using namespace std;

#define N 1024
#define RADIUS 3
#define BLOCK_SIZE 16

// DEVICE CODE: the stencil kernel (not captured in this copy; see section VI.2)

// HOST CODE:
int main(void) {
  int *in, *out;      // host copies of a, b, c
  int *d_in, *d_out;  // device copies of a, b, c
  int size = (N + 2*RADIUS) * sizeof(int);

  // Serial code: alloc space for device copies
  // (host allocation and initialization are not captured in this copy)
  cudaMalloc((void **)&d_in, size);
  cudaMalloc((void **)&d_out, size);

  // Copy to device
  cudaMemcpy(d_in, in, size, cudaMemcpyHostToDevice);
  cudaMemcpy(d_out, out, size, cudaMemcpyHostToDevice);

  // Parallel code: the kernel launch (not captured in this copy)

  // Serial code: copy the result back to host
  cudaMemcpy(out, d_out, size, cudaMemcpyDeviceToHost);

  // Cleanup
  free(in); free(out);
  cudaFree(d_in); cudaFree(d_out);
  return 0;
}
21
Simple Processing Flow (1/3)
PCI Bus
1. Copy input data from CPU memory to GPU memory.
22
Simple Processing Flow (2/3)
PCI Bus
2. Load the GPU program and execute it, caching data on chip for performance.
23
Simple Processing Flow (3/3)
PCI Bus
3. Copy results from GPU memory back to CPU memory.
25
Hello World! with device code (1/2)
Two new syntactic elements:
The CUDA C keyword __global__ indicates a function that runs on the device and is called from host code.
mykernel<<<1,1>>> is a CUDA kernel launch from the host code.
That's all that is required to execute a function on the GPU!

__global__ void mykernel(void)
{
  printf("Hello World!\n");
}

int main(void)
{
  mykernel<<<1,1>>>();
  return 0;
}

nvcc separates source code into host and device parts:
Device functions (like mykernel()) are processed by the NVIDIA compiler.
Host functions (like main()) are processed by the host compiler (gcc for Unix, cl.exe for Windows).
26
Hello World! with device code (2/2)
__global__ void mykernel(void)
{
}

int main(void) {
  mykernel<<<1,1>>>();
  printf("Hello World!\n");
  return 0;
}

Output:
$ nvcc hello.cu
$ a.out
Hello World!
$
GPU Computing
30
II.1. CUDA hardware model
Overview of CUDA hardware generations
[Chart: GFLOPS per watt in double precision, by generation: Tesla (CUDA, FP64), Fermi, Kepler (Dynamic Parallelism), Maxwell (Unified memory, DX12), Pascal (3D Memory, NVLink)]
[Diagram: N multiprocessors, each with M cores driven by a SIMD control unit, a register file, shared memory, and read-only constant and texture caches]
Massive parallelism:
Applied to thousands of threads.
Each multiprocessor contains:
A register file.
Shared memory.
A constant cache and a texture cache, both read-only.
34
II.2. The first generation:
Tesla (G80 and GT200)
The first generation: G80 (GeForce 8800)
GPU G80 (around 600 MHz, much lower than the frequency of its cores):
16 multiprocessors (CUDA thread blocks are mapped onto multiprocessors).
Each multiprocessor has 16 KB of shared memory.
Global memory: up to 4 GB of GDDR3, 512 bits @ 2x 1.1 GHz = 141.7 GB/s.
37
Scalability for future generations:
Alternatives for increasing performance
Raise the number of GPU multiprocessors (the basic node), that is, grow over the Z dimension. This is the path followed by the 1st generation (from 16 to 30 multiprocessors).
Raise the number of processors within a multiprocessor, which means growing over the X dimension. That is what the 2nd and 3rd generations have done (from 8 to 32 cores, and from there to 192).
Increment the size of shared memory (extending the Y dimension).
[Diagram: Multiprocessor 1..30 (scalability within the 1st generation), each with registers, Core 1..8 (scalability in the 2nd and 3rd generations), shared memory and a texture cache, on top of global memory]
38
II. 3. The second generation:
Fermi (GFxxx)
Fermi hardware compared to its predecessors
41
Arithmetic enhancements
Integer (ALUs):
Redesigned to optimize 64-bit integer
arithmetic.
Extended precision operations.
Fused instructions (“madd”):
Available for both single and double
precision data types.
Floating-point (FPUs): implement the IEEE 754-2008 standard, ahead of most CPUs.
[Diagram: a core with an FP unit and an INT unit]
42
The memory hierarchy
43
II. 4. The third generation:
Kepler (GKxxx)
Kepler GK110 Block Diagram
45
The SMX multiprocessor
Front-end: instruction scheduling and issuing in warps.
Back-end: instruction execution. 512 functional units:
- 192 for ALUs.
- 192 for single-precision FPUs.
- 64 for double-precision FPUs.
- 32 for load/store.
- 32 for SFUs (log, sqrt, ...).
[Diagram: the SMX front-end and back-end]
47
SMX Balance of Resources
48
II. 5. The fourth generation:
Maxwell (GMxxx)
Maxwell and SMM multiprocessors
(for GeForce GTX 980, 16 SMMs)
1870 Mt.
148 mm2.
50
The SMMs
51
A comparison versus Kepler
52
Some commercial models for CCC 5.2
(all @ 28 nm)
GeForce GTX 950 GTX 960 GTX 970 GTX 980 GTX 980 Ti Titan X
Release date Aug’15 Aug’15 Sep’14 Sep’14 Jun’15 Mar’15
GPU (code name) GM206-250 GM206-300 GM204-200 GM204-400 GM200-310 GM200-400
Multiprocessors 6 8 13 16 22 24
Number of cores 768 1024 1664 2048 2816 3072
Cores frequency (MHz) 1024-1188 1127-1178 1050-1178 1126-1216 1000-1075 1000-1075
DRAM bus width 128 bits 128 bits 256 bits 256 bits 384 bits 384 bits
DRAM frequency 2x 3.3 GHz 2x 3.5 GHz 2x 3.5 GHz 2x 3.5 GHz 2x 3.5 GHz 2x 3.5 GHz
DRAM bandwidth 105.6 GB/s 112 GB/s 224 GB/s 224 GB/s 336.5 GB/s 336.5 GB/s
GDDR5 memory size 2 GB 2 GB 4 GB 4 GB 6 GB 12 GB
Millions of transistors 2940 2940 5200 5200 8000 8000
Die size 228 mm2 228 mm2 398 mm2 398 mm2 601 mm2 601 mm2
Maximum TDP 90 W 120 W 145 W 165 W 250 W 250 W
Power connectors 1 x 6-pin 1 x 6-pin 2 x 6-pin 2 x 6-pin 6-pin + 8-pin 6-pin + 8-pin
Price ($ upon release) 149 199 329 549 649 999
53
Major enhancements
54
Power efficiency
55
II. 6. The fifth generation:
Pascal (GPxxx)
Today
[Diagram] The GPU and the CPU are connected through PCIe at 16 GB/s; the GPU uses GDDR5 memory at 250-350 GB/s, and the CPU uses DDR4 at 50-75 GB/s.
57
A 2015 graphics card:
Kepler/Maxwell GPU with GDDR5 memory
58
In 2017
[Diagram] The GPU and the CPU are connected through NVLink at 80 GB/s; the GPU uses memory stacked in 4 layers at 1 TB/s, and the CPU uses DDR4 at 100 GB/s (4 channels of 25.6 GB/s).
59
A 2017 graphics card:
Pascal GPU with Stacked DRAM
60
A Pascal GPU prototype
[Photo: the board measures about 14 cm x 7.8 cm]
61
First commercial model: GeForce GTX 1080.
Comparative with the previous 2 generations
GTX 680 (Kepler) GTX 980 (Maxwell) GTX 1080 (Pascal)
Year 2012 2014 2016
Transistors 3.54 B @ 28 nm. 5.2 B @ 28 nm. 7.2 B @ 16 nm.
Power consumption & die size 195 W & 294 mm2 165 W & 398 mm2 180 W & 314 mm2
Multiprocessors 8 16 40
Cores / Multiproc. 192 128 64
Cores / GPU 1536 2048 2560
Clock (without and with GPU Boost) 1006, 1058 MHz 1126, 1216 MHz 1607, 1733 MHz
Peak performance 3250 GFLOPS 4980 GFLOPS 8873 GFLOPS
Shared memory 16, 32, 48 KB 64 KB
L1 cache size 48, 32, 16 KB Integrated with texture cache
L2 cache size 512 KB 2048 KB
DRAM memory: Interface 256-bit GDDR5 256-bit GDDR5 256-bit GDDR5X
DRAM memory: Frequency 2x 3000 MHz 2x 3500 MHz 4x 2500 MHz
DRAM memory: Bandwidth 192.2 GB/s 224 GB/s 320 GB/s
62
Commercial models for Tesla P100 (Pascal)
and comparative with 2 previous generations
Tesla K40 (Kepler) Tesla M40 (Maxwell) P100 w. NV-link P100 w. PCI-e
Release date 2012 2014 2016
Transistors 7.1 B @ 28 nm. 8 B @ 28 nm. 15.3 B @ 16 nm. FinFET
# of multiprocessors 15 24 56
fp32 cores / Multiproc. 192 128 64
fp32 cores / GPU 2880 3072 3584
fp64 cores / Multiproc. 64 4 32
fp64 cores / GPU 960 (1/3 fp32) 96 (1/32 fp32) 1792 (1/2 fp32)
Clock frequency 745,810,875 MHz 948, 1114 MHz 1328, 1480 MHz 1126, 1303 MHz
Thermal Design Power 235 W 250 W 300 W 250 W
Peak performance (DP) 1680 GFLOPS 213 GFLOPS 5304 GFLOPS 4670 GFLOPS
L2 cache size 1536 KB 3072 KB 4096 KB
Memory interface 384-bit GDDR5 384-bit GDDR5 4096-bit HBM2
Memory size Up to 12 GB Up to 24 GB 16 GB
Memory bandwidth 288 GB/s 288 GB/s 720 GB/s
63
The physical layout for multiprocessors,
memory controllers and buses
64
Pascal multiprocessor
65
Nvidia’s roadmap
66
Stacked DRAM: A tale of two consortiums
67
II. 7. A summary of four generations
Scalability for the architecture:
A summary of four generations (2006-2015)
Tesla: 16 and 30 multiprocessors (128 and 240 cores).
Fermi: 16 and 7 multiprocessors (512 and 336 cores).
Kepler: 8, 14, 15 and 30 multiprocessors (1536, 2688, 2880 and 5760 cores).
Maxwell: 5 and 16 multiprocessors (640 and 2048 cores).
69
New models for 2016/17
Maxwell and Pascal
Time frame: 2014/15, 2014/15, 2016, 2016, 2016, 2017
CUDA Compute Capability: 5.0, 5.2, 5.3, 5.3, 6.0, 6.0
N (multiprocs.): 5, 16, 24, 24, 40, 56
70
III. Programming
Comparing the GPU and the CPU
72
From POSIX threads in CPU
to CUDA threads in GPU
75
Preliminary definitions
[Table: on-chip memory and software throughput limits per compute capability]
Threads / Warp: 32, 32, 32, 32, 32, 32
Blocks / Multiprocessor: 8, 8, 8, 16, 32, 32
79
Guidelines to identify
Nvidia commercial series
200: Developed during 3Q’08, until 4Q’09. Upgrades the G80
with 240 cores (GTX260 and GTX280).
400: Starts in 1Q’10. Introduces Fermi. Until 3Q’11.
500: Starts in 4Q’10 with GTX580 [GF110], and concludes the
Fermi generation in 1Q’12.
600: 2012-13. Introduces Kepler, but also includes Fermis.
700: 2013-14. Focuses on Kepler, but brings the last Fermi
models [GF108] and the first Maxwells [GM107, GM108].
800M: 1Q’14 and only for laptops, combining Fermi [GF117],
Kepler [GK104] and Maxwell [GM107, GM108].
900: Starts in 4Q’14, with a Maxwell upgrade [GM20x].
1000: Starts in 2Q'16, with the first Pascal models [GP10x].
80
GPU threads and blocks
Kepler/Maxwell’s limits: 1024 threads per block, 2048 threads per multiprocessor
Blocks are assigned to multiprocessors [limit: 16-32 concurrent blocks per multiprocessor].
Grid 0: Block 0, Block 1, Block 2, ... [Kepler/Maxwell's limit: 4G blocks per grid].
Blocks share the same multiprocessor if memory constraints are fulfilled.
Constant and texture memory are also available.
Local memory is off-chip.
[Diagram: per-multiprocessor register files (RF) and shared memory; local memory (LM) resides off-chip]
85
Now think big:
1D partitioning on a 64 million elems. array
Maximum number of threads per block: 1024.
Maximum number of blocks:
64K on Fermi.
4G on Kepler/Maxwell.
Larger sizes for data structures can only be covered with a
huge number of blocks (keeping fine-grained parallelism).
Choices:
64K blocks of 1K threads each (maximum for Fermi).
128K blocks of 512 threads each (not feasible on Fermi).
256K blocks of 256 threads each (not feasible on Fermi).
... and so on.
86
Summarizing about kernels,
blocks, threads and parallelism
Kernels are launched in grids.
Each block executes fully on a single multiprocessor (SMX/SMM). It does not migrate.
Several blocks can reside concurrently on one SMX/SMM, with control limitations. For example, in Kepler/Maxwell we have: 1024 threads per block, 2048 threads per multiprocessor, and 16 (Kepler) or 32 (Maxwell) blocks per multiprocessor.
[Diagram: a grid of blocks, e.g. Block (0,0) and Block (1,0), each with its own registers and shared memory]
87
Transparent scalability
[Diagram: the same eight blocks (Block 0 ... Block 7) scheduled on a GPU with two multiprocessors and on a GPU with four multiprocessors]
A kernel scales across any number of multiprocessors (as long as we have declared a sufficient number of blocks).
88
Partitioning data and computations
[Diagram: a 2D grid of threads, Thread (0,1) ... Thread (4,2)]
Each thread has to bound its area/volume of local computation.
89
Memory spaces
90
IV. Syntax
IV. 1. Basic elements
CUDA is C with some extra keywords.
A preliminary example
// C code on the CPU
void saxpy_serial(int n, float a, float *x, float *y)
{
  for (int i = 0; i < n; ++i)
    y[i] = a*x[i] + y[i];
}
// Invoke the SAXPY function sequentially
saxpy_serial(n, 2.0, x, y);
94
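For contrast with the serial loop above, here is a minimal sketch of how the same SAXPY operation is typically expressed in CUDA C (the kernel name, block size and the assumption that x and y already reside in GPU memory are illustrative, not taken from the slides):

__global__ void saxpy_parallel(int n, float a, float *x, float *y)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;  // one thread per element
  if (i < n)                                      // guard for the last, partial block
    y[i] = a*x[i] + y[i];
}

// Invoke the SAXPY kernel in parallel: 256 threads per block,
// enough blocks to cover the n elements
int nblocks = (n + 255) / 256;
saxpy_parallel<<<nblocks, 256>>>(n, 2.0, x, y);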
Interaction between CPU and GPU
A kernel does not start until all previous kernels are over.
Streams allow you to run kernels in parallel.
96
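As a hedged illustration of that last point (this snippet is not from the slides; kernelA, kernelB and their launch configurations are placeholders), kernels sent to different non-default streams may overlap, while kernels in the same stream are serialized:

cudaStream_t s1, s2;
cudaStreamCreate(&s1);
cudaStreamCreate(&s2);

kernelA <<< gridA, blockA, 0, s1 >>> (pars.);  // 4th launch parameter: the stream
kernelB <<< gridB, blockB, 0, s2 >>> (pars.);  // may run concurrently with kernelA

cudaStreamSynchronize(s1);                     // wait for the work queued in each stream
cudaStreamSynchronize(s2);
cudaStreamDestroy(s1);
cudaStreamDestroy(s2);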
Modifiers for the functions and
launching executions on GPU
Modifiers for the functions executed on the GPU:
__global__ void MyKernel() { } // Invoked by the CPU
__device__ float MyFunc() { } // Invoked by the GPU
Modifiers for the variables within the GPU:
__shared__ float MySharedArray[32]; // In shared mem.
__constant__ float MyConstantArray[32];
Configuration for the execution to launch kernels:
dim3 gridDim(100,50); // 5000 thread blocks
dim3 blockDim(4,8,8); // 256 threads per block
MyKernel <<< gridDim, blockDim >>> (pars.); // Launch
Note: an optional third parameter in the <<< >>> configuration specifies the number of bytes of shared memory allocated dynamically by the kernel during its execution (see the sketch below).
97
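A minimal sketch of that optional third parameter (the kernel and its sizes are illustrative, not from the slides): the value passed at launch time becomes the size of an extern __shared__ array inside the kernel.

__global__ void scaleKernel(float *data, int n)
{
  extern __shared__ float buffer[];   // storage provided by the 3rd launch parameter
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) buffer[threadIdx.x] = data[i];
  __syncthreads();
  if (i < n) data[i] = 2.0f * buffer[threadIdx.x];
}

// threads * sizeof(float) bytes of shared memory are allocated dynamically per block
int threads = 256;
int blocks = (n + threads - 1) / threads;
scaleKernel <<< blocks, threads, threads * sizeof(float) >>> (d_data, n);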
Intrinsics
100
Let’s manage video memory
103
Example 1: Solution
[C code in red, CUDA extensions in blue]
int main()
{
  int N = 16;
  int num_bytes = N*sizeof(int);
  int *d_a=0, *h_a=0; // Pointers in device (GPU) and host (CPU)
  // ... (allocation, initialization and transfers are not captured in this copy; see the sketch below) ...
  free(h_a);
  cudaFree(d_a);
}
104
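The middle of the listing did not survive in this copy. As a hedged reconstruction (not necessarily the slide's exact code), the body of such an example typically allocates on both sides, initializes the device copy, and brings it back before the cleanup shown above:

h_a = (int *)malloc(num_bytes);          // allocate on the host (CPU)
cudaMalloc((void **)&d_a, num_bytes);    // allocate on the device (GPU)

cudaMemset(d_a, 0, num_bytes);           // initialize the device copy
cudaMemcpy(h_a, d_a, num_bytes, cudaMemcpyDeviceToHost);   // copy it back to the host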
Asynchronous memory transfers
105
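The slide's content is not captured here. As a hedged sketch of the idea, asynchronous copies need page-locked (pinned) host memory and a stream, and return control to the CPU immediately (buffer names and sizes are illustrative):

float *h_buf, *d_buf;
cudaStream_t stream;
cudaStreamCreate(&stream);

cudaMallocHost((void **)&h_buf, N * sizeof(float));   // pinned host memory
cudaMalloc((void **)&d_buf, N * sizeof(float));

// Returns immediately; the copy may overlap with CPU work and with other streams
cudaMemcpyAsync(d_buf, h_buf, N * sizeof(float), cudaMemcpyHostToDevice, stream);

// ... independent CPU work here ...

cudaStreamSynchronize(stream);   // wait until the transfer has finished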
Example 2: Increment a scalar value “b”
to the N elements of an array
The C program (this file is compiled with gcc):

void increment_cpu(float *a, float b, int N)
{
  for (int idx = 0; idx<N; idx++)
    a[idx] = a[idx] + b;
}

void main()
{
  .....
  increment_cpu(a, b, N);
}

The CUDA kernel running on the GPU, followed by host code running on the CPU (this file is compiled with nvcc):

__global__ void increment_gpu(float *a, float b, int N)
{
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx < N)
    a[idx] = a[idx] + b;
}

void main()
{
  .....
  dim3 dimBlock (blocksize);
  dim3 dimGrid (ceil(N/(float)blocksize));
  increment_gpu<<<dimGrid, dimBlock>>>(a, b, N);
}
106
Example 2: Increment a scalar “b”
to the N elements of a vector
[Diagram: a single source file combining CUDA kernels with the rest of the C code, e.g.:]
void function_in_CPU( ... )
{
  ...
}
void other_funcs_CPU(int ...)
{
  ...
}
The compilation process separates both parts:
EDG: separates GPU and CPU code.
Open64: generates the PTX assembler.
Parallel Thread eXecution (PTX):
Virtual machine and ISA.
Programming model.
Resources and execution states.
[Diagram: the compilation process in Windows]
113
Determining resource usage
115
Heuristics (cont.)
118
To reach the maximum degree of parallelism,
use wisely the orange table of the tool (1)
The first row is the number of threads per block:
The limit is 1024 in Fermi and Kepler generations.
Power of two values are usually the best choices.
List of potential candidates: 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024.
We'll use 256 as a first estimate; development cycles will tune the optimal value, but usually:
Small values [2, 4, 8, 16] do not fully exploit the warp size and shared memory
banks.
Intermediate values [32, 64] compromise thread cooperation and scalability in
Kepler, Maxwell and future GPUs.
Large values [512, 1024] prevent having enough concurrent blocks on each multiprocessor (the limits on threads per block and per SMX are very close to each other). Also, the number of registers available per thread becomes too small.
119
To reach the maximum degree of parallelism,
use wisely the orange table of the tool (2)
The second row is the number of registers per thread.
We access the .cubin file to know this.
The limit for each SM is 8K (G80), 16K (GT200), 32K (Fermi), 64K (Kepler), so when consuming 10 registers/thread it is possible to execute:
On G80: 768 threads/SM, that is, 3 blocks of 256 thr [3*256*10=7680] (< 8192).
On Kepler: We reach the maximum of 2048 threads per SMX, but the use of
registers is very low (we could have used up to 29 registers per thread):
8 blocks * 256 threads/block * 10 registers/thread = 20480 regs. (< 65536 max.).
In the G80 case, using 11 registers/thread would have meant staying at 2 blocks, sacrificing 1/3 of the parallelism => it is worth cutting that register down by working more on the CUDA code of the thread.
In Kepler, we may use up to 29 registers without compromising
parallelism.
120
To reach the maximum degree of parallelism,
use wisely the orange table of the tool (3)
The third row is the shared memory spent for each block:
We will also get this from the .cubin file, though we can carry out a
manual accounting, as everything depends on where we put the
__shared__ prefix during memory declarations in our program.
Limit: 16 KB (CCC 1.x), 16/48 KB (CCC 2.x), 16/32/48 KB (3.x).
In the previous case for the G80, we won’t spend more than 5 KB
of shared memory per block, so that we can reach the maximum of 3
concurrent blocks on each multiprocessor:
3 blocks x 5 KB/block = 15 KB (< 16 KB).
With more than 5.34 KB of shared memory used per block, we sacrifice 33% of the parallelism, the same performance hit as before when we were unable to cut down to 10 registers/thread.
121
VI. Examples: VectorAdd, Stencil,
ReverseArray, MxM
Steps for building the CUDA source code
123
Coordinated efforts in parallel are required
130
VI. 2. Stencil kernels
Rationale
132
1D Stencil
[Diagram: each output element is computed from the input elements within `radius` positions on either side]
134
Sharing data between threads. Limitations
__syncthreads();
138
Summary of major concepts
applied during this example
Launch N blocks with M threads per block to execute threads
in parallel. Use:
kernel <<< N, M >>> ();
Access block index within grid and thread index within block:
blockIdx.x and threadIdx.x;
Calculate global indices where each thread has to work
depending on data partitioning. Use:
int index = blockIdx.x * blockDim.x + threadIdx.x;
Declare a variable/array in shared memory. Use:
__shared__ (as prefix to the data type).
Synchronize threads to prevent data hazards. Use:
__syncthreads();
139
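Putting those pieces together, here is a minimal sketch of the 1D stencil kernel this example builds up to, using the RADIUS and BLOCK_SIZE constants defined earlier (a standard formulation, not necessarily the slide's exact code):

__global__ void stencil_1d(int *in, int *out)
{
  __shared__ int temp[BLOCK_SIZE + 2 * RADIUS];
  int gindex = blockIdx.x * blockDim.x + threadIdx.x;
  int lindex = threadIdx.x + RADIUS;

  // Read input elements into shared memory, plus the halo on both sides
  temp[lindex] = in[gindex];
  if (threadIdx.x < RADIUS) {
    temp[lindex - RADIUS] = in[gindex - RADIUS];
    temp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE];
  }

  __syncthreads();   // make sure the whole tile has been loaded

  // Apply the stencil
  int result = 0;
  for (int offset = -RADIUS; offset <= RADIUS; offset++)
    result += temp[lindex + offset];

  out[gindex] = result;
}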
VI. 3. Reverse the order
of a vector of elements
GPU code for the ReverseArray kernel
(1) using a single block
__global__ void reverseArray(int *in, int *out) {
  int index_in = threadIdx.x;
  int index_out = blockDim.x - 1 - threadIdx.x;
  out[index_out] = in[index_in];   // copy each element to its mirrored position
}
141
GPU code for the ReverseArray kernel
(2) using multiple blocks
__global__ void reverseArray(int *in, int *out) {                  // For thread 0 within block 0:
  int in_offset = blockIdx.x * blockDim.x;                         // in_offset = 0
  int out_offset = (gridDim.x - 1 - blockIdx.x) * blockDim.x;      // out_offset = 12
  int index_in = in_offset + threadIdx.x;                          // index_in = 0
  int index_out = out_offset + (blockDim.x - 1 - threadIdx.x);     // index_out = 15
  out[index_out] = in[index_in];   // copy each element to its mirrored position
}
142
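A hedged sketch of how this multi-block version might be launched from the host (block size and device pointers are illustrative; it assumes N is a multiple of the block size):

int numThreads = 256;
int numBlocks = N / numThreads;
reverseArray <<< numBlocks, numThreads >>> (d_in, d_out);
cudaMemcpy(out, d_out, N * sizeof(int), cudaMemcpyDeviceToHost);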
A more sophisticated version
using shared memory
143
GPU code for the ReverseArray kernel
(3) using multiple blocks and shared memory
__global__ void reverseArray(int *in, int *out) {
  __shared__ int temp[BLOCK_SIZE];
  int gindex = blockIdx.x * blockDim.x + threadIdx.x;
  int lindex = threadIdx.x;
  // ... (statements (i1)-(i4) of the kernel body are not captured in this copy) ...
}

Dependency: in (i2), values written by a warp have to be read (beforehand) by another warp.
Solution: use a temp2[BLOCK_SIZE] array to store intermediate results (also in (i4)).
Improvement: (i3) is not required. Also, if you swap the indices within temp[] and temp2[] in (i2), then (i1) is not required (but (i3) becomes mandatory).
If you substitute all temp and temp2 instances by their equivalent expressions, you converge into the previous CUDA version.
Every array element is accessed only once, so using shared memory brings no improvement here!
144
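Since the numbered statements (i1)-(i4) are not visible in this copy, here is a hedged reference formulation that reverses each block inside shared memory and writes it to the mirrored block position (illustrative, not the slide's exact code):

__global__ void reverseArray(int *in, int *out) {
  __shared__ int temp[BLOCK_SIZE];
  int gindex = blockIdx.x * blockDim.x + threadIdx.x;
  int lindex = threadIdx.x;

  temp[lindex] = in[gindex];        // load one block of the input into shared memory
  __syncthreads();                  // wait until the whole tile is loaded

  int out_offset = (gridDim.x - 1 - blockIdx.x) * blockDim.x;
  out[out_offset + lindex] = temp[blockDim.x - 1 - lindex];   // write it reversed
}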
VI. 4. Matrix product
Typical CPU code written in C language
C = A * B. (P = M * N in the hands-on.)

// (function name and surrounding loop structure reconstructed around the surviving fragment)
void MxMonCPU(float* A, float* B, float* C, int N)
{
  for (int i = 0; i < N; i++)
    for (int j = 0; j < N; j++) {
      float sum = 0;
      for (int k = 0; k < N; k++) {
        float a = A[i*N + k];
        float b = B[k*N + j];
        sum += a*b;
      }
      C[i*N + j] = sum;
    }
}
146
CUDA version for the matrix product:
A draft for the parallel code
__global__ void MxMonGPU(float* A, float* B, float* C, int N)
{
  float sum = 0;
  int i, j;
  // ... (the rest of the draft is not captured in this copy: each thread computes
  //      one element of C; see the explanation and the tiled version below) ...
}
147
CUDA version for the matrix product:
Explaining parallelization
Each thread computes a single element of C.
Matrices A and B are loaded N times from video memory.
Blocks accommodate threads in groups of 1024 threads (an internal CUDA constraint in Fermi and Kepler). That way, we may use 2D blocks composed of 32x32 threads each.
[Diagram: the grid of blocks covers C (WidthB x HeightA); within each block, thread Th(x,y) computes one element C(x,y) from a row of A and a column of B]

dim3 dimBlock(BLOCKSIZE, BLOCKSIZE);
dim3 dimGrid(WidthB/BLOCKSIZE, HeightA/BLOCKSIZE);
...
MxMonGPU <<< dimGrid, dimBlock >>> (A, B, C, N);
148
CUDA version for the matrix product:
Analysis
Each thread requires 10 registers, so we can reach the
maximum amount of parallelism in Kepler:
2 blocks of 1024 threads (32x32) on each SMX. (2x1024x10 = 20480
registers, which is lower than 65536 registers available).
Problems:
Low arithmetic intensity.
Demanding on memory bandwidth, which becomes the bottleneck.
Solution:
Use shared memory on each multiprocessor.
149
Using shared memory:
Version with tiling for A and B
The 32x32 submatrix Csub computed by each thread block uses tiles of 32x32 elements of A and B, which are repeatedly staged in shared memory.
A and B are loaded only (N/32) times from global memory.
Achievements:
Less demanding on memory bandwidth.
More arithmetic intensity.
[Diagram: matrices A, B and C partitioned into 32x32 tiles; the highlighted tiles of A and B produce the submatrix Csub of C]
150
Tiling: Implementation details
151
A trick to avoid shared memory bank conflicts
Rationale:
The shared memory is structured into 16 (pre-Fermi) or 32 banks.
Threads within a block are numbered in column major order, that is,
the x dimension is the fastest varying.
When using the regular indexing scheme to shared
memory arrays: As[threadIdx.x][threadIdx.y],
threads within a half-warp will be reading from the same
column, that is, from the same bank in shared memory.
However, using As[threadIdx.y][threadIdx.x],
threads within a half-warp will be reading from the same row,
which implies reading from a different bank each.
So, tiles store/access data transposed in shared memory.
152
An example for solving conflicts
to banks in shared memory
Consecutive threads within a warp differ in the first dimension (x), but consecutive positions of memory store data of a two-dimensional matrix which differ in the second dimension: a[0][0], a[0][1], a[0][2], ...
[Diagram: blocks of 32x32 threads; each row of 32 threads forms a warp]

Data item: bank where it is stored / accessed by the warp if thread (x,y) uses a[x][y] / accessed if it uses a[y][x]:
a[0][0]: bank 0 / yes / yes
a[0][1]: bank 1 / no / yes
a[0][31]: bank 31 / no / yes
a[1][0]: bank 0 / yes / no
a[31][0]: bank 0 / yes / no
Outcome: 100% conflicts with a[x][y]; no conflicts with a[y][x].
153
Tiling: The CUDA code for the GPU kernel
__global__ void MxMonGPU(float *A, float *B, float *C, int N)
{
  float sum = 0;
  int tx, ty, i, j;
  tx = threadIdx.x; ty = threadIdx.y;
  i = blockIdx.x * blockDim.x + tx; j = blockIdx.y * blockDim.y + ty;
  __shared__ float As[32][32], Bs[32][32];
  // ... (the tiling loop is not captured in this copy; see the sketch below) ...
}

[Chart: GFLOPS vs. tile size (4x4, 8x8, 12x12, 16x16), for "tiling only" and "tiling & unrolling"; 32x32 tiles unfeasible on G80 hardware]
156
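Since the tiling loop itself is not captured above, here is a minimal sketch of the standard formulation with 32x32 tiles, using the transposed As[ty][tx]/Bs[ty][tx] layout discussed in the bank-conflict slide (illustrative, not the slide's exact code; it assumes N is a multiple of 32, with i indexing columns of C and j indexing rows, as implied by dimGrid(WidthB/BLOCKSIZE, HeightA/BLOCKSIZE)):

__global__ void MxMonGPU(float *A, float *B, float *C, int N)
{
  __shared__ float As[32][32], Bs[32][32];
  int tx = threadIdx.x, ty = threadIdx.y;
  int i = blockIdx.x * blockDim.x + tx;   // column of C
  int j = blockIdx.y * blockDim.y + ty;   // row of C
  float sum = 0;

  for (int tile = 0; tile < N/32; tile++) {
    // Each thread brings one element of the current tile of A and of B
    As[ty][tx] = A[j*N + tile*32 + tx];
    Bs[ty][tx] = B[(tile*32 + ty)*N + i];
    __syncthreads();                      // the whole tile must be loaded before use

    for (int k = 0; k < 32; k++)
      sum += As[ty][k] * Bs[k][tx];       // partial dot product within this tile
    __syncthreads();                      // do not overwrite tiles still in use
  }
  C[j*N + i] = sum;                       // one element of C per thread
}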
VII. Bibliography and tools
CUDA Zone:
The root Web for CUDA programmers
[developer.nvidia.com/cuda-zone]
158
159
160
161
162
163
CUDA books: From 2007 to 2015
166
Courses on-line (free access)
168
Talks and webinars
169
Developers
171
Future developments
173