Programming Models For GPU Architecture
Outline
• Introduction
• Motivation
• Programming Model
• Memory Model
• CUDA API
• Example
• Pro & Contra
• Trend
Introduction
What is CUDA?
- Compute Unified Device Architecture.
- A powerful parallel programming model for issuing and
managing computations on the GPU without mapping them to a
graphics API.
• Runtime:
Common component
Device component
Host component
• Driver:
Driver API
Introduction
CUDA SDK
Motivation
Programming Model
CUDA: Unified Design
Advantage:
Feature 2:
• Fully general load/store to
GPU memory
• Untyped, not fixed texture types
• Pointer support
Feature 3:
• Dedicated on-chip memory
• Shared between threads for
inter-thread communication
• Explicitly managed
• As fast as registers
Programming Model
Important Concepts:
• Device: GPU, viewed as a
co-processor.
• Host: CPU
• Kernel: the data-parallel,
compute-intensive portions
of the application that run on
the device.
Programming Model
Important Concepts:
• Thread: basic execution unit
• Thread block:
A batch of threads. Threads
in a block cooperate and
efficiently share data.
Each thread and block has a unique ID.
• Grid:
A batch of thread blocks
that execute the same kernel.
Threads in different blocks of
the same grid cannot directly
communicate with each other.
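The slides do not show code for these concepts, so here is a minimal sketch (the kernel name and block size are illustrative, not from the slides) of how the built-in thread and block IDs combine into a unique global index:

```cuda
// Sketch: each thread derives a unique global index from its
// block ID, the block size, and its thread ID within the block.
__global__ void fillIndex(int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)        // guard: the grid may have more threads than n
        out[i] = i;
}

// Host-side launch: enough 256-thread blocks to cover n elements.
// fillIndex<<<(n + 255) / 256, 256>>>(d_out, n);
```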
Programming Model
Simple example (matrix addition):
CPU C program vs. CUDA program:
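The side-by-side code from this slide was lost in extraction; the following is a hedged reconstruction of the usual comparison (function names and launch configuration are illustrative):

```cuda
// CPU C version: a single loop walks every element.
void matAddCPU(int N, const float *A, const float *B, float *C)
{
    for (int i = 0; i < N * N; ++i)
        C[i] = A[i] + B[i];
}

// CUDA version: the loop disappears; each thread adds one element.
__global__ void matAddGPU(int N, const float *A, const float *B, float *C)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N * N)
        C[i] = A[i] + B[i];
}

// Host-side launch (illustrative):
// matAddGPU<<<(N * N + 255) / 256, 256>>>(N, dA, dB, dC);
```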
Programming Model
Hardware implementation:
A set of SIMD
Multiprocessors with On-
Chip shared memory
Programming Model
G80 Example:
• 16 Multiprocessors, 128 Thread Processors
• Up to 12,288 parallel threads active
• Per-block shared memory accelerates processing.
Programming Model
Streaming Multiprocessor (SM)
• Processing elements
o 8 scalar thread processors
o 32 GFLOPS peak at 1.35GHz
o 8192 32-bit registers (32KB)
o usual ops: float, int, branch...
• Hardware multithreading
o up to 8 blocks (3 active) resident at once
o up to 768 active threads in total
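As a sanity check on the G80 figures above (the block size of 256 is an assumption for illustration), the per-SM and whole-chip arithmetic works out as:

```python
# Occupancy arithmetic for one G80 SM, using the slide's numbers.
threads_per_sm = 768       # max active threads per SM
registers_per_sm = 8192    # 32-bit registers per SM
block_size = 256           # assumed threads per block

blocks_resident = threads_per_sm // block_size
regs_per_thread = registers_per_sm // threads_per_sm

print(blocks_resident)          # → 3 blocks of 256 threads resident
print(regs_per_thread)          # → 10 registers/thread at full occupancy
print(16 * threads_per_sm)      # → 12288, matching the G80 figure above
```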
Memory Model
There are 6 memory types:
• Registers
o on chip
o fast access
o per thread
o limited amount
o 32-bit
• Local Memory
o in DRAM
o slow
o non-cached
o per thread
o relatively large
• Shared Memory
o on chip
o fast access
o per block
o 16 KByte
o synchronize between threads
• Global Memory
o in DRAM
o slow
o non-cached
o per grid
o communicate between grids
• Constant Memory
o in DRAM
o cached
o per grid
o read-only
• Texture Memory
o in DRAM
o cached
o per grid
o read-only
Memory Model
• Registers
• Shared Memory
o on chip
• Local Memory
• Global Memory
• Constant Memory
• Texture Memory
o in Device Memory
Memory Model
• Global Memory
• Constant Memory
• Texture Memory
o managed by host code
o persistent across kernels
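The memory types above map directly onto CUDA source qualifiers. The kernel below is an illustrative sketch (the kernel name, block size of 256, and reduction pattern are assumptions, not from the slides); the qualifiers `__constant__`, `__shared__`, and `__syncthreads()` are the real CUDA keywords:

```cuda
__constant__ float scale;   // constant memory: cached, read-only per grid
                            // (the host sets it with cudaMemcpyToSymbol)

__global__ void sumBlock(const float *in, float *out)  // in/out: global memory
{
    __shared__ float buf[256];   // shared memory: on chip, per block
    int tid = threadIdx.x;       // tid lives in a register (per thread)

    buf[tid] = in[blockIdx.x * blockDim.x + tid] * scale;
    __syncthreads();             // wait until every thread has filled buf

    // Tree reduction within the block through shared memory.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            buf[tid] += buf[tid + stride];
        __syncthreads();
    }
    if (tid == 0)
        out[blockIdx.x] = buf[0];  // one result per block to global memory
}
```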
CUDA API
The CUDA API provides an easy path for users to write programs for
the GPU device.
It consists of:
• Device component
o Mathematical/Time/Texture Functions
o Synchronization Function
o __syncthreads()
o Type Conversion/Casting Functions
CUDA API
• Host component
o Structure
Driver API
Runtime API
o Functions
Device, Context, Memory, Module, Texture management
Execution control
Interoperability with OpenGL and Direct3D
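A typical host-side sequence through the runtime API looks like the sketch below. `cudaMalloc`, `cudaMemcpy`, and `cudaFree` are real runtime functions; the wrapper name, sizes, and the commented-out kernel are illustrative:

```cuda
#include <cuda_runtime.h>

void runOnDevice(const float *hostIn, float *hostOut, int n)
{
    float *dIn, *dOut;
    size_t bytes = n * sizeof(float);

    cudaMalloc(&dIn,  bytes);        // allocate device global memory
    cudaMalloc(&dOut, bytes);
    cudaMemcpy(dIn, hostIn, bytes, cudaMemcpyHostToDevice);

    // someKernel<<<(n + 255) / 256, 256>>>(dIn, dOut, n);  // launch

    cudaMemcpy(hostOut, dOut, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dIn);                   // release device memory
    cudaFree(dOut);
}
```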
Example
Simple example (matrix addition):
CPU C program vs. CUDA program:
Pro & Contra
CUDA offers
• massively parallel computing
• at a relatively low price
• a highly integrated solution
• personal supercomputing
• eco-friendly production
• a gentle learning curve
Pro & Contra
Problems:
• slightly lower precision
• limited support for IEEE-754
• no recursive function calls
• hard to use for irregular join/fork logic
• no concurrency between jobs
Trend