
CUDA Programming Model

Xing Zeng, Dongyue Mou


Outline

• Introduction
• Motivation
• Programming Model
• Memory Model
• CUDA API
• Example
• Pro & Contra
• Trend
Introduction

What is CUDA?
- Compute Unified Device Architecture.
- A powerful parallel programming model for issuing and
managing computations on the GPU without mapping them to a
graphics API.

• Heterogeneous - mixed serial-parallel programming


• Scalable - hierarchical thread execution model
• Accessible - minimal but expressive changes to C
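A minimal sketch of those changes (not from the original slides; the kernel
name scale and the device pointer d_data are assumptions):

    // A kernel is ordinary C plus a __global__ qualifier; each launched
    // thread runs the function body once.
    __global__ void scale(float *data, float factor)
    {
        data[threadIdx.x] *= factor;   // this thread handles one element
    }

    // Host side: launch one block of 256 threads on a device array.
    // scale<<<1, 256>>>(d_data, 2.0f);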
Introduction
Software Stack:
• Libraries:
CUFFT & CUBLAS

• Runtime:
Common component
Device component
Host component

• Driver:
Driver API
Introduction
CUDA SDK
Outline

• Introduction
• Motivation
• Programming Model
• Memory Model
• CUDA API
• Example
• Pro & Contra
• Trend
Motivation

GPU Programming Model


GPGPU Programming Model
CUDA Programming Model
Motivation
GPU Programming Model for Graphics
Motivation

GPU Programming Model


GPGPU Programming Model
CUDA Programming Model
Motivation
GPGPU Programming Model
Trick the GPU into general-purpose
computing by casting the problem as graphics:

• Turn data into images ("texture maps")


• Turn algorithms into image synthesis ("rendering passes")
Drawbacks:
• Tough learning curve
• Potentially high overhead of the graphics API
• Highly constrained memory layout & access model
• Need for many passes drives up bandwidth consumption
Motivation
GPGPU Programming to do A + B
Motivation
What's wrong with GPGPU 1
• APIs are specific to graphics
• Limited texture size and dimension
• Limited instruction set
• No thread communication
• Limited local storage
• Limited shader outputs
• No scatter
Motivation
What's wrong with GPGPU 2
Motivation

GPU Programming Model


GPGPU Programming Model
CUDA Programming Model
Outline

• Introduction
• Motivation
• Programming Model
• Memory Model
• CUDA API
• Example
• Pro & Contra
• Trend
Programming Model
CUDA: Unified Design
Advantage:

HW: fully general data-parallel architecture
• General thread launch
• Global load-store
• Parallel data cache
• Scalar architecture
• Integer and bit operations

SW: program the GPU in C


• Scalable data-parallel execution/memory model
• C with minimal yet powerful extensions
Motivation
From GPGPU to CUDA Programming Model
Programming Model
Feature 1:
• Thread not pixel
• Full Integer and Bit Instructions
• No limits on branching, looping
• 1D, 2D, 3D threadID allocation

Feature 2:
• Fully general load/store to
GPU memory
• Untyped, not limited to fixed texture types
• Pointer support

Feature 3:
• Dedicated on-chip memory
• Shared between threads for
inter-thread communication
• Explicitly managed
• As fast as registers
Programming Model
Important Concepts:
• Device: GPU, viewed as a
co-processor.
• Host: CPU
• Kernel: the data-parallel,
compute-intensive portions of
an application, run on the
device.
Programming Model
Important Concepts:
• Thread: basic execution unit
• Thread block:
A batch of threads. Threads
in a block cooperate and
efficiently share data; each
thread and block has a unique ID.
• Grid:
A batch of thread blocks
that execute the same kernel.
Threads in different blocks of
the same grid cannot directly
communicate with each other.
Programming Model
Simple example (matrix addition):
CPU C program vs. CUDA program (shown as code images in the original slides)
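The code on the slide is not recoverable from this copy; below is a hedged
reconstruction of the classic comparison (names such as MatAdd and the size
N are assumptions):

    #define N 16   // assumed matrix dimension

    // CPU C version: nested loops visit every element in turn.
    void MatAddCPU(float A[N][N], float B[N][N], float C[N][N])
    {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                C[i][j] = A[i][j] + B[i][j];
    }

    // CUDA version: the loops disappear; each thread adds one element.
    __global__ void MatAdd(float A[N][N], float B[N][N], float C[N][N])
    {
        int i = threadIdx.x;
        int j = threadIdx.y;
        C[i][j] = A[i][j] + B[i][j];
    }

    // Launched with one N x N block of threads:
    // dim3 dimBlock(N, N);
    // MatAdd<<<1, dimBlock>>>(A, B, C);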
Programming Model
Hardware implementation:

A set of SIMD multiprocessors
with on-chip shared memory
Programming Model
G80 Example:
• 16 Multiprocessors, 128 Thread Processors
• Up to 12,288 parallel threads active (16 multiprocessors × 768 threads)
• Per-block shared memory accelerates processing.
Programming Model
Streaming Multiprocessor (SM)
• Processing elements
o 8 scalar thread processors
o 32 GFLOPS peak at 1.35GHz
o 8192 32-bit registers (32KB)
o usual ops: float, int, branch...

• Hardware multithreading
o up to 8 blocks (3 active) resident at once
o up to 768 active threads in total

• 16KB on-chip shared memory
o supports thread communication
o shared amongst threads of a block
Programming Model
Execution Model:
Programming Model
Single Instruction Multiple Thread (SIMT) Execution:
• Groups of 32 threads formed
into warps
o always executing same instruction
o share instruction fetch/dispatch
o some become inactive
when code path diverges
o hardware automatically handles divergence

• Warps are the primitive unit of scheduling
o pick 1 of the 24 resident warps for each instruction slot
o all warps from all active blocks are time-sliced
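A hedged sketch of divergence in source form (not from the slides): when
threads of one warp take different branches, the hardware runs both paths
serially, masking off the inactive threads each time.

    __global__ void divergent(float *data)
    {
        // Threads 0-15 of each 32-thread warp take the "if" path,
        // threads 16-31 the "else" path; the warp executes both
        // paths one after the other.
        if (threadIdx.x % 32 < 16)
            data[threadIdx.x] *= 2.0f;
        else
            data[threadIdx.x] += 1.0f;
    }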
Outline

• Introduction
• Motivation
• Programming Model
• Memory Model
• CUDA API
• Example
• Pro & Contra
• Trend
Memory Model
There are 6 Memory Types:
Memory Model
There are 6 Memory Types:

• Registers
o on chip
o fast access
o per thread
o limited amount
o 32 bit
Memory Model
There are 6 Memory Types:

• Registers
• Local Memory
o in DRAM
o slow
o non-cached
o per thread
o relatively large
Memory Model
There are 6 Memory Types:

• Registers
• Local Memory
• Shared Memory
o on chip
o fast access
o per block
o 16 KByte
o synchronization between threads
Memory Model
There are 6 Memory Types:

• Registers
• Local Memory
• Shared Memory
• Global Memory
o in DRAM
o slow
o non-cached
o per grid
o communicate between
grids
Memory Model
There are 6 Memory Types:

• Registers
• Local Memory
• Shared Memory
• Global Memory
• Constant Memory
o in DRAM
o cached
o per grid
o read-only
Memory Model
There are 6 Memory Types:

• Registers
• Local Memory
• Shared Memory
• Global Memory
• Constant Memory
• Texture Memory
o in DRAM
o cached
o per grid
o read-only
Memory Model
• Registers
• Shared Memory
o on chip

• Local Memory
• Global Memory
• Constant Memory
• Texture Memory
o in Device Memory
Memory Model
• Global Memory
• Constant Memory
• Texture Memory
o managed by host code
o persistent across kernels
Outline

• Introduction
• Motivation
• Programming Model
• Memory Model
• CUDA API
• Example
• Pro & Contra
• Trend
CUDA API

The CUDA API provides an easy path for users to write programs for the
GPU device.

It consists of:

• A minimal set of extensions to C/C++


o type qualifiers
o call syntax
o built-in variables

• A runtime library to support the execution


o host component
o device component
o common component
CUDA API

CUDA C/C++ Extensions:


• New function type qualifiers
__host__ void HostFunc(...); //executable on host
__global__ void KernelFunc(...); //callable from host
__device__ void DeviceFunc(...); //callable from device only
o Restrictions for device code (__global__ / __device__)
 no recursive calls
 no static variables
 no function pointers
 __global__ functions are invoked asynchronously
 __global__ functions must have void return type
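A brief illustration of the qualifiers (names are assumptions, not from the
slides):

    // __device__ helper: callable only from device code.
    __device__ float square(float x)
    {
        return x * x;
    }

    // __global__ kernel: launched from the host, must return void.
    __global__ void SquareAll(float *data)
    {
        data[threadIdx.x] = square(data[threadIdx.x]);
    }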
CUDA API

CUDA C/C++ Extensions:


• New variable type qualifiers
__device__ int GlobalVar; //in global memory, lifetime of app
__constant__ int ConstVar; //in constant memory, lifetime of app
__shared__ int SharedVar; //in shared memory, lifetime of block
o Restrictions
 no extern declarations
 only file scope
 no combination with struct or union
 no initialization for __shared__
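A hedged sketch of a __constant__ variable filled from the host with the
runtime call cudaMemcpyToSymbol (names and sizes are assumptions):

    __constant__ float coeff[16];   // constant memory, read-only in kernels

    __global__ void Scale(float *data)
    {
        data[threadIdx.x] *= coeff[threadIdx.x % 16];  // cached access
    }

    // Host code, before the kernel launch:
    // float h_coeff[16] = { /* ... */ };
    // cudaMemcpyToSymbol(coeff, h_coeff, sizeof(h_coeff));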
CUDA API

CUDA C/C++ Extensions:


• New syntax to invoke the device code
KernelFunc<<< Dg, Db, Ns, S >>>(...);
o Dg: dimension of the grid
o Db: dimension of each block
o Ns: optional, bytes of shared memory for extern __shared__ variables
o S : optional, associated stream
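A hedged example of the launch syntax (kernel name and sizes are
assumptions), using Ns to size an extern __shared__ array:

    // Dynamically sized shared memory: the size comes from Ns at launch.
    extern __shared__ float buffer[];

    __global__ void StageData(float *data)
    {
        buffer[threadIdx.x] = data[blockIdx.x * blockDim.x + threadIdx.x];
        // ... operate on buffer ...
    }

    // Host: 4 blocks (Dg) of 64 threads (Db), 64 floats of shared memory (Ns):
    // StageData<<<4, 64, 64 * sizeof(float)>>>(d_data);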

• New built-in variables for indexing the threads


o gridDim: dimension of the whole grid
o blockIdx: index of the current block
o blockDim: dimension of each block in the grid
o threadIdx: index of the current thread
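For the common 1D case, these variables combine into a unique global thread
index (a standard idiom, not shown on the slide):

    // Unique index of this thread across the whole grid:
    int idx = blockIdx.x * blockDim.x + threadIdx.x;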
CUDA API

CUDA Runtime Library:


• Common component
o Vector/Texture Types
o Mathematical/Time Functions

• Device component
o Mathematical/Time/Texture Functions
o Synchronization Function
 __syncthreads()
o Type Conversion/Casting Functions
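A hedged sketch of the canonical __syncthreads() pattern (kernel name and
sizes are assumptions): stage data in shared memory, synchronize, then read
values written by other threads of the block.

    __global__ void Reverse64(float *data)
    {
        __shared__ float tile[64];         // per-block shared memory

        tile[threadIdx.x] = data[threadIdx.x];
        __syncthreads();                   // wait for all writes to tile

        // Safe now: read an element written by another thread.
        data[threadIdx.x] = tile[63 - threadIdx.x];
    }

    // Launched as Reverse64<<<1, 64>>>(d_data);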
CUDA API

CUDA Runtime Library:


• Host component

o Structure
 Driver API
 Runtime API

o Functions
 Device, Context, Memory, Module, Texture management
 Execution control
 Interoperability with OpenGL and Direct3D
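A hedged host-side sketch of the memory-management and execution-control
calls (sizes, names, and the kernel itself are assumptions):

    #include <stdlib.h>

    __global__ void AddOne(float *data)
    {
        data[blockIdx.x * blockDim.x + threadIdx.x] += 1.0f;
    }

    int main(void)
    {
        int n = 256;
        size_t bytes = n * sizeof(float);
        float *h_data = (float*)malloc(bytes);   // host buffer
        float *d_data;                           // device buffer
        // ... fill h_data ...

        cudaMalloc((void**)&d_data, bytes);                        // memory management
        cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice); // host -> device

        AddOne<<<n / 64, 64>>>(d_data);                            // execution control

        cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost); // device -> host
        cudaFree(d_data);
        free(h_data);
        return 0;
    }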
CUDA API

The CUDA source file uses .cu as its
extension. It contains both host and
device source code.

The CUDA compiler driver nvcc can
compile it and generate CPU binary
plus PTX code.
(PTX: Parallel Thread Execution, a
device-independent VM code)

PTX code may be further translated
for a specific GPU architecture.
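For illustration (file names are assumptions), typical nvcc invocations:

    # Compile host + device code into one executable:
    nvcc -o matadd matadd.cu

    # Stop after generating device-independent PTX:
    nvcc -ptx matadd.cu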
Outline

• Introduction
• Motivation
• Programming Model
• Memory Model
• CUDA API
• Example
• Pro & Contra
• Trend
Programming Model
Simple example (matrix addition):
CPU C program vs. CUDA program (see the sketch in the Programming Model section)
Outline

• Introduction
• Motivation
• Programming Model
• Memory Model
• CUDA API
• Example
• Pro & Contra
• Trend
Pro & Contra

CUDA offers
• massively parallel computing
• at a relatively low price
• a highly integrated solution
• personal supercomputing
• eco-friendly production
• an easy learning curve
Pro & Contra

Problems ...
• relatively low precision
• limited support for IEEE-754
• no recursive function calls
• hard to use for irregular fork/join logic
• no concurrency between jobs
Outline

• Introduction
• Motivation
• Programming Model
• Memory Model
• CUDA API
• Example
• Pro & Contra
• Trend
Trend

• More cores on-chip
• Better support for floating point
• More flexible configuration & control/data flow
• Lower price
• Support for higher-level programming languages
Questions?
