Lecture GPUArchCUDA01
1
Manycore GPU Architectures and
Programming: Outline
• Introduction
– GPU architectures, GPGPUs, and CUDA
• GPU Execution model
• CUDA Programming model
• Working with Memory in CUDA
– Global memory, shared and constant memory
• Streams and concurrency
• CUDA instruction intrinsics and libraries
• Performance, profiling, debugging, and error handling
• Directive-based high-level programming models
– OpenACC and OpenMP
2
Computer Graphics
GPU: Graphics Processing Unit
3
Graphics Processing Unit (GPU)
Image: http://www.ntu.edu.sg/home/ehchua/programming/opengl/CG_BasicsTheory.html
4
Graphics Processing Unit (GPU)
• Enriching the user visual experience
• Delivering energy-efficient computing
• Unlocking the potential of complex apps
• Enabling deeper scientific discovery
5
What is a GPU Today?
• It is a processor optimized for 2D/3D graphics, video, visual computing, and display.
• It is a highly parallel, highly multithreaded multiprocessor optimized for visual computing.
• It provides real-time visual interaction with computed objects via graphics, images, and video.
• It serves as both a programmable graphics processor and a scalable parallel computing platform.
– Heterogeneous systems: combine a GPU with a CPU
• It is often called a manycore processor.
6
Graphics Processing Units (GPUs): Brief History

[Timeline, 1993-2010:]
• Early graphics hardware: Atari 8-bit computer text/graphics chip; IBM PC Professional Graphics Controller card; PlayStation; S3 graphics cards, single-chip 2D accelerators; hardware-accelerated 3D graphics
• 1993: NVIDIA established by Jen-Hsun Huang, Chris Malachowsky, and Curtis Priem
• 1995: NV1
• 1999-2000: GeForce 1, Quadro, GeForce 2 series
• 2001: GeForce 3 with programmable shading; GPUs with programmable shading; DirectX and OpenGL graphics APIs
• 2002-2004: GeForce FX series
• 2006: GeForce 8 series; GT80 (GeForce 8800), NVIDIA's first GPU with general-purpose processors; general-purpose computing on graphics processing units (GPGPU)
• 2008-2009: GeForce 200 series GTX260/275/280/285/295
• 2007-2010: Tesla C870, S870, C1060, S1070, C2050, …; GPU computing
• 2010: Fermi; GeForce 400 series GTX460/465/470/475/480/485
http://en.wikipedia.org/wiki/GeForce
8
GPU Architecture Revolution
• Unified Scalar Shader Architecture
Image: http://www.ntu.edu.sg/home/ehchua/programming/opengl/CG_BasicsTheory.html
An Introduction to Modern GPU Architecture, Ashu Rege, NVIDIA Director of Developer Technology
ftp://download.nvidia.com/developer/cuda/seminar/TDCI_Arch.pdf
9
GPUs with Dedicated Pipelines
-- late 1990s to early 2000s
• Graphics chips generally had a pipeline structure, with individual stages performing specialized operations and finally loading the frame buffer for display.
[Figure: pipeline from the input stage through the vertex shader and geometry shader stages to graphics memory and the frame buffer]
10
Specialized Pipeline Architecture
11
Graphics Logical Pipeline
Graphics logical pipeline. Programmable graphics shader stages are blue, and fixed-function blocks are white. Copyright © 2009 Elsevier, Inc. All rights reserved.
• Unbalanced and inefficient utilization
Unified Shader
• Optimal utilization in a unified architecture
FIGURE A.2.4 Logical pipeline mapped to physical processors. The programmable shader stages execute on the array of unified processors, and the logical graphics pipeline dataflow recirculates through the processors. Copyright © 2009 Elsevier, Inc. All rights reserved.
13
Unified Shader Architecture
FIGURE A.2.5 Basic unified GPU architecture. Example GPU with 112 streaming processor (SP) cores organized in 14 streaming multiprocessors (SMs); the cores are highly multithreaded. It has the basic Tesla architecture of an NVIDIA GeForce 8800. The processors connect with four 64-bit-wide DRAM partitions via an interconnection network. Each SM has eight SP cores, two special function units (SFUs), instruction and constant caches, a multithreaded instruction unit, and a shared memory. Copyright © 2009 Elsevier, Inc. All rights reserved.
14
Streaming Processing
• To be efficient, GPUs must have high throughput, i.e., process millions of pixels in a single frame, but individual operations may have high latency.
17
GPU Performance Gains Over CPU
http://docs.nvidia.com/cuda/cuda-c-programming-guide
18
GPU Performance Gains Over CPU
19
Parallelism in CPUs v. GPUs
• Multi-/many-core CPUs use task parallelism
– MIMD: multiple tasks map to multiple threads
• Manycore GPUs use data parallelism
– SIMD model (Single Instruction, Multiple Data)
• Terminology
– Host: The CPU and its memory (host memory)
– Device: The GPU and its memory (device memory)
21
Simple Processing Flow
PCI Bus
22
Simple Processing Flow
PCI Bus
23
Simple Processing Flow
PCI Bus
24
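These three slides animate the classic flow: (1) copy input data from CPU memory to GPU memory across the PCI bus, (2) load the GPU program and execute it, and (3) copy results from GPU memory back to CPU memory. A minimal CUDA sketch of that flow; the increment kernel and buffer names are illustrative, not from the slides:

#include <cuda_runtime.h>
#include <stdio.h>

__global__ void increment(int *data, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) data[i] += 1;                 // executes on the device
}

int main(void) {
  const int n = 1024, size = n * sizeof(int);
  int *h_data = (int *)malloc(size);       // host memory
  for (int i = 0; i < n; i++) h_data[i] = i;
  int *d_data;
  cudaMalloc((void **)&d_data, size);      // device memory

  // 1. Copy input data from CPU memory to GPU memory
  cudaMemcpy(d_data, h_data, size, cudaMemcpyHostToDevice);
  // 2. Load GPU program and execute
  increment<<<n / 256, 256>>>(d_data, n);
  // 3. Copy results from GPU memory back to CPU memory
  cudaMemcpy(h_data, d_data, size, cudaMemcpyDeviceToHost);

  printf("h_data[0] = %d\n", h_data[0]);   // prints 1
  cudaFree(d_data); free(h_data);
  return 0;
}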
Offloading Computation

#define N 1024
#define RADIUS 3
#define BLOCK_SIZE 16

// parallel fn: the 1D stencil kernel runs on the GPU
__global__ void stencil_1d(int *in, int *out) {
  __shared__ int temp[BLOCK_SIZE + 2 * RADIUS];
  int gindex = threadIdx.x + blockIdx.x * blockDim.x;
  int lindex = threadIdx.x + RADIUS;
  temp[lindex] = in[gindex];      // read input elements into shared memory
  if (threadIdx.x < RADIUS) {     // also read the halo elements
    temp[lindex - RADIUS] = in[gindex - RADIUS];
    temp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE];
  }
  __syncthreads();
  int result = 0;                 // apply the stencil
  for (int offset = -RADIUS; offset <= RADIUS; offset++)
    result += temp[lindex + offset];
  out[gindex] = result;
}

// serial code: host side
int main(void) {
  int *in, *out;      // host copies of in and out
  int *d_in, *d_out;  // device copies of in and out
  int size = (N + 2*RADIUS) * sizeof(int);
  // ... allocate, initialize, copy to device, launch stencil_1d, copy back ...
  // Cleanup
  free(in); free(out);
  cudaFree(d_in); cudaFree(d_out);
  return 0;
}
25
Programming for NVIDIA GPUs
http://docs.nvidia.com/cuda/cuda-c-programming-guide/
26
CUDA (Compute Unified Device Architecture)
Both an architecture and a programming model
• Architecture and execution model
– Introduced by NVIDIA in 2007
– Getting the highest possible execution performance requires understanding the hardware architecture
• Programming model
– Small set of extensions to C
– Enables GPUs to execute programs written in C
– Within C programs, call SIMT “kernel” routines that are executed on the GPU
• Hello world introduction today
– More in later lectures
27
CUDA Thread Hierarchy
• The <<<grid, block>>> launch configuration specifies the thread organization:
stencil_1d<<<N/BLOCK_SIZE, BLOCK_SIZE>>>(d_in + RADIUS, d_out + RADIUS);
29
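Inside a kernel, each thread locates itself within this organization through built-in variables. A minimal sketch; the index_demo kernel and buffer names are hypothetical:

#include <cuda_runtime.h>
#include <stdio.h>

#define N 1024
#define BLOCK_SIZE 16

__global__ void index_demo(int *out) {
  // blockIdx.x  : which block within the grid
  // blockDim.x  : threads per block (BLOCK_SIZE here)
  // threadIdx.x : which thread within its block
  int gindex = blockIdx.x * blockDim.x + threadIdx.x;
  out[gindex] = gindex;           // each thread writes its own global index
}

int main(void) {
  int *d_out, h_out[N];
  cudaMalloc((void **)&d_out, N * sizeof(int));
  index_demo<<<N / BLOCK_SIZE, BLOCK_SIZE>>>(d_out);  // 64 blocks x 16 threads
  cudaMemcpy(h_out, d_out, N * sizeof(int), cudaMemcpyDeviceToHost);
  printf("h_out[100] = %d\n", h_out[100]);            // prints 100
  cudaFree(d_out);
  return 0;
}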
Hello World! with Device Code

__global__ void hellokernel() {
  printf("Hello World!\n");
}

int main(void){
  int num_threads = 1;
  int num_blocks = 1;
  hellokernel<<<num_blocks,num_threads>>>();
  cudaDeviceSynchronize();
  return 0;
}

Output:
$ nvcc hello.cu
$ ./a.out
Hello World!
$

• Two new syntactic elements…
30
GPU code examples and try on Bridges
• Bridges:
– interact -gpu
– module load gcc/5.3.0 cuda/8.0 opencv/3.2.0
– cp -r ~yan/gpu_code_examples ~
– cd gpu_code_examples
– nvcc hello-1.cu -o hello-1
– ./hello-1
– nvcc hello-2.cu -o hello-2
– ./hello-2
31
Hello World! with Device Code
__global__ void hellokernel(void)
33
Hello World! with Device Code
__device__ const char *STR = "Hello World!";
const char STR_LENGTH = 12;
36
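The fragment above suggests a multi-threaded variant of the hello kernel. A plausible completion, assuming each thread prints one character of the device string; the indexing scheme is an assumption, not necessarily the slide's:

#include <cstdio>

__device__ const char *STR = "Hello World!";
const char STR_LENGTH = 12;

__global__ void hellokernel() {
  // Each thread prints one character; output order is not guaranteed
  printf("%c", STR[threadIdx.x % STR_LENGTH]);
}

int main(void) {
  int num_threads = STR_LENGTH;   // one thread per character
  int num_blocks = 1;
  hellokernel<<<num_blocks, num_threads>>>();
  cudaDeviceSynchronize();
  printf("\n");
  return 0;
}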
GPU Execution Model
• The GPU is a physically separate processor from the CPU
– Discrete vs. Integrated
• The GPU Execution Model offers different abstractions from the CPU to match the change in architecture
PCI Bus
37
The Simplest Model: Single-Threaded
• Single-threaded Execution Model
– Exclusive access to all variables
– Guaranteed in-order execution of loads and stores
– Guaranteed in-order execution of arithmetic instructions
Single-Threaded
39
CPU SPMD Multi-Threading
• Single-Program, Multiple-Data (SPMD) model
– Makes the same in-order guarantees within each thread
– Says little or nothing about inter-thread behaviour or exclusive
variable access without explicit inter-thread synchronization
SPMD
Synchronize
40
GPU Multi-Threading
• Uses the Single-Instruction, Multiple-Thread model
– Many threads execute the same instructions in lock-step
– Implicit synchronization after every instruction (think vector
parallelism)
SIMT
41
GPU Multi-Threading
• In SIMT, all threads share instructions but operate on their
own private registers, allowing threads to store thread-local
state
SIMT
42
GPU Multi-Threading
• SIMT threads can be “disabled” when they need to execute instructions different from others in their group
• Similar to vector-parallel models (SIMD)
[Figure: two threads, one with a = 4, b = 3 and the other with a = 3, b = 4, execute if (a > b) { max = a; } else { max = b; }; on each branch, the threads not taking it are disabled]
43
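Written as a kernel, the branch above shows exactly where threads in a SIMT group diverge. A minimal sketch; names are hypothetical:

__global__ void pairwise_max(const int *a, const int *b, int *out) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (a[i] > b[i]) {
    out[i] = a[i];   // threads where a[i] <= b[i] are disabled here
  } else {
    out[i] = b[i];   // threads where a[i] >  b[i] are disabled here
  }
}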
GPU Multi-Threading
• GPUs execute many groups of SIMT threads in parallel
– Each executes instructions independent of the others
SIMT Group 0
SIMT Group 1
44
Execution Model to Hardware
• How does this execution model map down to actual GPU hardware?
45
Execution Model to Hardware
• NVIDIA GPU Streaming Multiprocessors (SMs) are analogous to CPU cores
– Single computational unit
– Think of an SM as a single vector processor
– Composed of multiple CUDA “cores”, load/store units, special function units (sin, cosine, etc.)
– Each CUDA core contains integer and floating-point arithmetic logic units
46
Execution Model to Hardware
• GPUs can execute multiple SIMT groups on each SM
– For example: on NVIDIA GPUs a SIMT group is 32 threads, and each Kepler SM has 192 CUDA cores → simultaneous execution of 6 SIMT groups on an SM
• SMs can support more concurrent SIMT groups than core count would suggest
– Each thread persistently stores its own state in a private register set
– Many SIMT groups will spend time blocked on I/O, not actively computing
– Keeping blocked SIMT groups scheduled on an SM would waste cores
– Groups can be swapped in and out without worrying about losing state
47
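These sizes can be queried at run time through the CUDA runtime API. A minimal sketch; the printed values vary by device:

#include <cuda_runtime.h>
#include <stdio.h>

int main(void) {
  cudaDeviceProp prop;
  cudaGetDeviceProperties(&prop, 0);   // properties of device 0
  printf("SMs: %d\n", prop.multiProcessorCount);
  printf("Warp (SIMT group) size: %d threads\n", prop.warpSize);
  printf("Max resident threads per SM: %d\n",
         prop.maxThreadsPerMultiProcessor);
  // e.g., a Kepler SM with 192 cores and 32-thread warps can issue
  // 192 / 32 = 6 warps simultaneously
  return 0;
}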
Execution Model to Hardware
• This leads to a nested thread hierarchy on GPUs:
a single thread → a SIMT group → SIMT groups that concurrently execute together on the same SM → SIMT groups that run on the same GPU
48
GPU Memory Model
• Now that we understand how abstract threads of execution are mapped to the GPU:
– How do those threads store and retrieve data?
– What rules are there about memory consistency?
– How can we efficiently use GPU memory?
[Figure: memory hierarchy — per-thread registers and local memory within a SIMT thread group; on-chip shared memory for the SIMT thread groups on an SM; global, constant, and texture memory for the SIMT thread groups on a GPU]
49
GPU Memory Model
• There are many levels and types of GPU memory, each of which has special characteristics that make it useful
– Size
– Latency
– Bandwidth
– Readable and/or Writable
– Optimal Access Patterns
– Accessibility by threads in the same SIMT group, SM, GPU
50
GPU Memory Model
• For now, we focus on two memory types: on-chip shared memory and registers
• Each SM has a limited set of registers; each thread receives its own private registers
51
GPU Memory Model
• → Shared memory and registers are limited
– Per-SM resources, which can impact how many threads can execute on an SM
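A minimal sketch of how the two types appear in a kernel: __shared__ declares per-block on-chip shared memory, while ordinary local variables live in per-thread registers. Kernel name and tile size are illustrative:

#define TILE 256

__global__ void reverse_tile(int *data) {
  __shared__ int tile[TILE];     // on-chip shared memory, one copy per block
  int i = threadIdx.x;           // 'i' lives in a per-thread register
  tile[i] = data[i];             // stage data in shared memory
  __syncthreads();               // wait until the whole tile is loaded
  data[i] = tile[TILE - 1 - i];  // each thread reads a different element back
}

// Launch with one block of TILE threads:
// reverse_tile<<<1, TILE>>>(d_data);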
GPU Communication
• Communicating between the host and GPU adds complexity relative to homogeneous programming models
PCIe Bus
53
GPU Communication
• Data transfer from CPU to GPU over the PCI bus adds
– Conceptual complexity
– Performance overhead
54
GPU Communication
• As a result, computation-communication overlap is a common technique in GPU programming
– Asynchrony is a first-class citizen of most GPU programming frameworks
[Figure: timeline in which GPU compute kernels overlap with copies over the PCIe bus]
55
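In CUDA, this overlap is usually expressed with streams and asynchronous copies. A minimal sketch, assuming pinned host memory and a stand-in process kernel; all names are hypothetical:

#include <cuda_runtime.h>

#define CHUNK (1 << 20)
#define NSTREAMS 4

__global__ void process(float *buf, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) buf[i] *= 2.0f;     // stand-in for real work
}

int main(void) {
  float *h_buf, *d_buf;
  cudaMallocHost((void **)&h_buf, NSTREAMS * CHUNK * sizeof(float)); // pinned host memory
  cudaMalloc((void **)&d_buf, NSTREAMS * CHUNK * sizeof(float));

  cudaStream_t streams[NSTREAMS];
  for (int i = 0; i < NSTREAMS; i++) cudaStreamCreate(&streams[i]);

  // Copies queued in one stream overlap with kernels running in another
  for (int i = 0; i < NSTREAMS; i++) {
    size_t off = (size_t)i * CHUNK;
    cudaMemcpyAsync(d_buf + off, h_buf + off, CHUNK * sizeof(float),
                    cudaMemcpyHostToDevice, streams[i]);
    process<<<(CHUNK + 255) / 256, 256, 0, streams[i]>>>(d_buf + off, CHUNK);
    cudaMemcpyAsync(h_buf + off, d_buf + off, CHUNK * sizeof(float),
                    cudaMemcpyDeviceToHost, streams[i]);
  }
  cudaDeviceSynchronize();       // wait for all streams to finish

  for (int i = 0; i < NSTREAMS; i++) cudaStreamDestroy(streams[i]);
  cudaFree(d_buf); cudaFreeHost(h_buf);
  return 0;
}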
GPU Execution Model
• GPUs introduce a new conceptual model for programmers
used to CPU single- and multi-threaded programming
56
References
1. The sections on Introducing the CUDA Execution Model, Understanding the Nature of Warp Execution, and Exposing Parallelism in Chapter 3 of Professional CUDA C Programming
2. Michael Wolfe. Understanding the CUDA Data Parallel Threading Model. https://www.pgroup.com/lit/articles/insider/v2n1a5.htm
3. Will Ramey. Introduction to CUDA Platform. http://developer.download.nvidia.com/compute/developertrainingmaterials/presentations/general/Why_GPU_Computing.pptx
4. Timo Stich. Fermi Hardware & Performance Tips. http://theinf2.informatik.uni-jena.de/theinf2_multimedia/Website_downloads/NVIDIA_Fermi_Perf_Jena_2011.pdf
57