
CS 295: Modern Systems

GPU Computing Introduction

Sang-Woo Jun
Spring 2019
Graphics Processing – Some History
1990s: Real-time 3D rendering for video games was becoming common
o Doom, Quake, Descent, … (Nostalgia!)
3D graphics processing is immensely computation-intensive

[Figures: texture mapping and shading examples]
Warren Moore, “Textures and Samplers in Metal,” Metal by Example, 2014
Gray Olsen, “CSE 470 Assignment 3 Part 2 - Gourad/Phong Shading,” grayolsen.com, 2018
Graphics Processing – Some History
Before 3D accelerators (GPUs) were common
o CPUs had to do all graphics computation, while maintaining framerate!
o Many tricks were played
Doom (1993): “Affine texture mapping”
• Linearly maps textures to screen location, disregarding depth
• Doom levels did not have slanted walls or ramps, to hide this
Graphics Processing – Some History
Before 3D accelerators (GPUs) were common
o CPUs had to do all graphics computation, while maintaining framerate!
o Many tricks were played
Quake III Arena (1999): “Fast inverse square root”
• Bit-level “magic” to cheaply approximate an inverse square root
Introduction of 3D Accelerator Cards
Much of 3D processing consists of short algorithms repeated over a lot of data
o pixels, polygons, textures, …
Dedicated accelerators with simple, massively parallel computation

[Figure: A Diamond Monster 3D card, using the Voodoo chipset (1997) – photo: Konstantin Lanzet, Wikipedia]
[Figure: NVIDIA Volta-based GV100 architecture (2018) – many, many cores, not a lot of cache/control]
Peak Performance vs. CPU
                Throughput                Power        Throughput/Power
Intel Skylake   128 SP GFLOPS / 4 cores   100+ Watts   ~1 GFLOPS/Watt
NVIDIA V100     15 TFLOPS                 200+ Watts   ~75 GFLOPS/Watt

System Architecture Snapshot with a GPU (2019)
[Diagram: CPU, host memory, GPU with its own memory, I/O hub, NVMe, and network interface, connected as below]
o GPU memory (GDDR5, HBM2, …): GDDR5 at 100s of GB/s, 10s of GB; HBM2 at ~1 TB/s, 10s of GB
o Host memory (DDR4, …): DDR4 2666 MHz at 128 GB/s, 100s of GB
o CPU-to-CPU interconnect: 12.8 GB/s (QPI), 20.8 GB/s (UPI)
o 16-lane PCIe Gen3 at 16 GB/s to the GPU and to the I/O Hub (IOH), which hosts NVMe storage and the network interface
Lots of moving parts!
High-Performance Graphics Memory
Modern GPUs now even employ 3D-stacked memory via a silicon interposer
o Very wide bus, very high bandwidth
o e.g., HBM2 in Volta

Graphics Card Hub, “GDDR5 vs GDDR5X vs HBM vs HBM2 vs GDDR6 Memory Comparison,” 2019
Massively Parallel Architecture for Massively Parallel Workloads!
NVIDIA CUDA (Compute Unified Device Architecture) – 2007
o A way to run custom programs on the massively parallel architecture!
OpenCL specification released – 2008
Both platforms expose synchronous execution of a massive number of threads
[Diagram: a CPU thread copies data over PCIe to the GPU, a massive number of GPU threads execute, and results are copied back over PCIe]
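A minimal host-side sketch of the flow in that diagram, using a made-up kernel named scale (names and sizes are illustrative, not from the slides): allocate GPU memory, copy inputs over PCIe, launch a massive number of GPU threads, then copy the results back.

#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

__global__ void scale(float *data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;                      // each GPU thread handles one element
}

int main() {
    const int N = 1 << 20;
    size_t bytes = N * sizeof(float);

    float *h = (float *)malloc(bytes);                 // host (CPU) memory
    for (int i = 0; i < N; i++) h[i] = 1.0f;

    float *d;
    cudaMalloc(&d, bytes);                             // GPU memory
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);   // copy over PCIe, host to GPU

    scale<<<(N + 255) / 256, 256>>>(d, 2.0f, N);       // spawn roughly a million GPU threads

    cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);   // copy results back over PCIe
    printf("h[0] = %f\n", h[0]);                       // prints 2.0

    cudaFree(d);
    free(h);
    return 0;
}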
CUDA Execution Abstraction
Block: Multi-dimensional array of threads
o 1D, 2D, or 3D
o Threads in a block can synchronize among themselves
o Threads in a block can access shared memory
o CUDA (Thread, Block) ~= OpenCL (Work item, Work group)
Grid: Multi-dimensional array of blocks
o 1D or 2D
o Blocks in a grid can run in parallel, or sequentially
Kernel execution is issued in grid units (see the sketch after this list)
Limited recursion (depth limit of 24 as of now)
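A small sketch of these abstractions (the tileKernel name and the 16x16 block size are illustrative, not from the slides): a 2D grid of 2D blocks, where threads within a block share memory and synchronize among themselves.

// Hypothetical kernel: each 16x16 block cooperates through shared memory.
__global__ void tileKernel(const float *in, float *out, int width, int height) {
    __shared__ float tile[16][16];                   // shared memory, visible to one block only

    int x = blockIdx.x * blockDim.x + threadIdx.x;   // position within the grid
    int y = blockIdx.y * blockDim.y + threadIdx.y;

    if (x < width && y < height)
        tile[threadIdx.y][threadIdx.x] = in[y * width + x];
    __syncthreads();                                 // threads in a block can synchronize

    if (x < width && y < height)
        out[y * width + x] = 2.0f * tile[threadIdx.y][threadIdx.x];
}

// Kernel execution is issued in grid units:
//   dim3 block(16, 16);                                  // 2D block of threads
//   dim3 grid((width + 15) / 16, (height + 15) / 16);    // 2D grid of blocks
//   tileKernel<<<grid, block>>>(d_in, d_out, width, height);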
Simple CUDA Example
[Diagram: a kernel launch is an asynchronous call from the CPU side to the GPU side]
[Diagram: NVCC splits C/C++ + CUDA code – host code goes to the C/C++ host compiler, device code to the device compiler, and the results are combined into CPU+GPU software]
Simple CUDA Example
Function qualifiers (one function can carry more than one):
o __global__: runs on the GPU, called from the host (or from the GPU) – only a void return type is allowed
o __device__: runs on the GPU, called from the GPU
o __host__: runs on the host, called from the host
Launching with 1 block and N threads per block spawns N instances of VecAdd on the GPU
o threadIdx answers “which of the N threads am I?” (see also: blockIdx)
The launch is asynchronous; the host should wait for the kernel to finish before using the results
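The code on the slide did not survive extraction; below is a reconstruction along the lines of the standard CUDA vector-addition example that these annotations describe (the unified-memory allocation is my simplification).

#include <cuda_runtime.h>

// __global__: runs on the GPU, callable from the host; only a void return type is allowed.
__global__ void VecAdd(const float *A, const float *B, float *C) {
    int i = threadIdx.x;          // which of the N threads am I? (see also blockIdx)
    C[i] = A[i] + B[i];
}

// __host__ __device__: one function compiled for both sides.
__host__ __device__ float square(float x) { return x * x; }

int main() {
    const int N = 256;
    float *A, *B, *C;
    cudaMallocManaged(&A, N * sizeof(float));   // unified memory keeps the sketch short
    cudaMallocManaged(&B, N * sizeof(float));
    cudaMallocManaged(&C, N * sizeof(float));
    for (int i = 0; i < N; i++) { A[i] = (float)i; B[i] = square((float)i); }

    // 1 block, N threads per block: N instances of VecAdd spawned on the GPU.
    VecAdd<<<1, N>>>(A, B, C);    // asynchronous call; returns immediately
    cudaDeviceSynchronize();      // host waits for the kernel to finish before reading C

    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}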
More Complex Example: Picture Blurring
Slides from NVIDIA/UIUC Accelerated Computing Teaching Kit
Another end-to-end example: https://devblogs.nvidia.com/even-easier-introduction-cuda/
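The teaching-kit slides themselves are not reproduced here; as a stand-in, this is a generic box-blur kernel sketch in the same spirit (the blurKernel name and the BLUR_SIZE radius are illustrative).

#define BLUR_SIZE 1   // blur radius, giving a 3x3 window

// Each thread computes one output pixel as the average of its neighborhood.
__global__ void blurKernel(const unsigned char *in, unsigned char *out, int w, int h) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (col >= w || row >= h) return;

    int sum = 0, count = 0;
    for (int dy = -BLUR_SIZE; dy <= BLUR_SIZE; dy++) {
        for (int dx = -BLUR_SIZE; dx <= BLUR_SIZE; dx++) {
            int r = row + dy, c = col + dx;
            if (r >= 0 && r < h && c >= 0 && c < w) {   // clamp at the image border
                sum += in[r * w + c];
                count++;
            }
        }
    }
    out[row * w + col] = (unsigned char)(sum / count);
}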

Great! Now we know how to use GPUs – Bye?


Matrix Multiplication Performance Engineering
No faster than CPU
[Figure: results from an NVIDIA P100 – Coleman et al., “Efficient CUDA,” 2017]
Architecture knowledge is needed (again)
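For concreteness, here is a sketch of the kind of naive kernel that typically lands in that regime (not the code measured above; names and the square-matrix assumption are mine). Every thread streams its row and column from global memory with essentially no data reuse, so the kernel is memory-bound.

// Naive matrix multiply: C = A * B for N x N matrices, one thread per output element.
__global__ void matMulNaive(const float *A, const float *B, float *C, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= N || col >= N) return;

    float acc = 0.0f;
    for (int k = 0; k < N; k++)
        acc += A[row * N + k] * B[k * N + col];   // 2N global-memory reads per output element
    C[row * N + col] = acc;
}

The usual fix is to block (tile) the computation through shared memory, which is exactly where the architectural details below start to matter.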
NVIDIA Volta-based GV100 Architecture (2018)

A single Streaming Multiprocessor (SM) has 64 INT32 cores and 64 FP32 cores (+8 Tensor cores…)
GV100 has 84 SMs
Volta Execution Architecture
64 INT32 Cores, 64 FP32 Cores, 4 Tensor Cores, ray-tracing cores…
o Specialization to make use of chip space…?
Not much on-chip memory per thread
o 96 KB Shared memory
o 1024 Registers per FP32 core
Hard limit on compute management
o 32 blocks AND 2048 threads AND 1024 threads/block
o e.g., 2 blocks with 1024 threads, or 4 blocks with 512 threads
o Enough registers/shared memory for all threads must be available (all context is resident during execution)
More threads than cores – Threads interleaved to hide memory latency
Resource Balancing Details
How many threads in a block?
Too small: 4x4 window == 16 threads
o 128 blocks to fill 2048 threads/SM
o SM only supports 32 blocks -> only 512 threads used
• SM has only 64 cores… does it matter? Sometimes!
Too large: 32x48 window == 1536 threads
o Threads do not fit in a block!
Too large: 1024 threads using more than 64 registers
Limitations vary across platforms (Fermi, Pascal, Volta, …)
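One practical way to navigate these per-architecture limits is the CUDA occupancy API, which accounts for the kernel's register and shared-memory usage on the current device; a small sketch (myKernel is a placeholder):

#include <cuda_runtime.h>
#include <cstdio>

__global__ void myKernel(float *data) {
    data[blockIdx.x * blockDim.x + threadIdx.x] *= 2.0f;   // placeholder body
}

int main() {
    int minGridSize = 0, blockSize = 0;
    // Ask the runtime which block size maximizes occupancy for this kernel.
    cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, myKernel, 0, 0);

    int blocksPerSM = 0;
    // How many blocks of that size can be resident on one SM at once?
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, myKernel, blockSize, 0);

    printf("suggested block size: %d, resident blocks per SM: %d\n", blockSize, blocksPerSM);
    return 0;
}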
Warp Scheduling Unit
Threads in a block are executed in 32-thread “warp” units
o Not part of language specs, just architecture specifics
o A warp is SIMD – Same PC, same instructions executed on every core
What happens when there is a conditional statement?
o Predication, or control divergence
o More on this later!
Warps have been 32-threads so far, but may change in the future
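A small sketch of divergent versus warp-uniform branches (the threshold and names are made up):

__global__ void divergenceDemo(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    // Divergent: threads within one 32-thread warp may take different paths,
    // so the warp executes both paths serially with the inactive threads masked off.
    if (in[i] > 0.5f)
        out[i] = 2.0f * in[i];
    else
        out[i] = 0.0f;

    // Warp-uniform: the condition depends only on the warp index, so every
    // thread in a given warp takes the same path and there is no divergence.
    if ((i / 32) % 2 == 0)
        out[i] += 1.0f;
}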
Memory Architecture Caveats
Shared memory peculiarities
o Small amount (e.g., 96 KB/SM for Volta) shared across all threads
o Organized into banks to distribute access
o Bank conflicts can drastically lower performance
Relatively slow global memory
o Blocking and caching become important (again)
o If not for performance, for power consumption…

[Figure: an 8-way bank conflict results in 1/8 of the shared memory bandwidth]
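A common way to sidestep bank conflicts is to pad shared-memory arrays; the transpose-style sketch below is my own illustration (it assumes a square matrix whose width is a multiple of 32). Column reads from an unpadded 32x32 tile all land in the same bank; one extra element per row shifts each row to a different bank.

#define TILE 32

__global__ void transposeTile(const float *in, float *out, int width) {
    // tile[TILE][TILE] would make every column read a 32-way bank conflict;
    // the +1 padding shifts each row by one bank and removes the conflict.
    __shared__ float tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * width + x];     // row-wise write: conflict-free

    __syncthreads();

    int tx = blockIdx.y * TILE + threadIdx.x;               // transposed block coordinates
    int ty = blockIdx.x * TILE + threadIdx.y;
    out[ty * width + tx] = tile[threadIdx.x][threadIdx.y];  // column-wise read: padded, conflict-free
}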
