
CS 295: Modern Systems

GPU Computing Introduction

Sang-Woo Jun
Spring 2019
Graphics Processing – Some History
1990s: Real-time 3D rendering for video games was becoming common
o Doom, Quake, Descent, … (Nostalgia!)
3D graphics processing is immensely computation-intensive

[Figures: texture mapping and shading examples]
Warren Moore, “Textures and Samplers in Metal,” Metal by Example, 2014
Gray Olsen, “CSE 470 Assignment 3 Part 2 - Gourad/Phong Shading,” grayolsen.com, 2018
Graphics Processing – Some History
Before 3D accelerators (GPUs) were common
o CPUs had to do all graphics computation, while maintaining framerate!
o Many tricks were played
Doom (1993): “Affine texture mapping”
• Linearly maps textures to screen location, disregarding depth
• Doom levels did not have slanted walls or ramps, to hide this
Graphics Processing – Some History
Before 3D accelerators (GPUs) were common
o CPUs had to do all graphics computation, while maintaining framerate!
o Many tricks were played
Quake III Arena (1999): “Fast inverse square root”
• Bit-level “magic” to cheaply approximate an inverse square root
Introduction of 3D Accelerator Cards
Much of 3D processing consists of short algorithms repeated over a lot of data
o pixels, polygons, textures, …
Dedicated accelerators with simple, massively parallel computation

[Figure: A Diamond Monster 3D card, using the Voodoo chipset (1997) – photo: Konstantin Lanzet, Wikipedia]
[Figure: NVIDIA Volta-based GV100 architecture (2018) – many, many cores, not a lot of cache/control]
Peak Performance vs. CPU
                Throughput                Power        Throughput/Power
Intel Skylake   128 SP GFLOPS / 4 cores   100+ Watts   ~1 GFLOPS/Watt
NVIDIA V100     15 TFLOPS                 200+ Watts   ~75 GFLOPS/Watt

System Architecture Snapshot with a GPU (2019)
[Diagram: CPU, host memory, GPU with its own memory, I/O hub, NVMe, and network interface, connected as below]
o GPU memory (GDDR5, HBM2, …): GDDR5 at 100s of GB/s, 10s of GB; HBM2 at ~1 TB/s, 10s of GB
o Host memory (DDR4, …): DDR4 2666 MHz at 128 GB/s, 100s of GB
o CPU-to-CPU interconnect: 12.8 GB/s (QPI), 20.8 GB/s (UPI)
o 16-lane PCIe Gen3 at 16 GB/s to the GPU and to the I/O Hub (IOH), which hosts NVMe storage and the network interface
Lots of moving parts!
High-Performance Graphics Memory
Modern GPUs now even employ 3D-stacked memory via a silicon interposer
o Very wide bus, very high bandwidth
o e.g., HBM2 in Volta

Graphics Card Hub, “GDDR5 vs GDDR5X vs HBM vs HBM2 vs GDDR6 Memory Comparison,” 2019
Massively Parallel Architecture for Massively Parallel Workloads!
NVIDIA CUDA (Compute Unified Device Architecture) – 2007
o A way to run custom programs on the massively parallel architecture!
OpenCL specification released – 2008
Both platforms expose synchronous execution of a massive number of threads
[Diagram: a CPU thread copies data over PCIe to the GPU, a massive number of GPU threads execute, and results are copied back over PCIe]
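A minimal host-side sketch of the flow in that diagram, using a made-up kernel named scale (names and sizes are illustrative, not from the slides): allocate GPU memory, copy inputs over PCIe, launch a massive number of GPU threads, then copy the results back.

#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

__global__ void scale(float *data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;                      // each GPU thread handles one element
}

int main() {
    const int N = 1 << 20;
    size_t bytes = N * sizeof(float);

    float *h = (float *)malloc(bytes);                 // host (CPU) memory
    for (int i = 0; i < N; i++) h[i] = 1.0f;

    float *d;
    cudaMalloc(&d, bytes);                             // GPU memory
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);   // copy over PCIe, host to GPU

    scale<<<(N + 255) / 256, 256>>>(d, 2.0f, N);       // spawn roughly a million GPU threads

    cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);   // copy results back over PCIe
    printf("h[0] = %f\n", h[0]);                       // prints 2.0

    cudaFree(d);
    free(h);
    return 0;
}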
CUDA Execution Abstraction
Block: Multi-dimensional array of threads
o 1D, 2D, or 3D
o Threads in a block can synchronize among themselves
o Threads in a block can access shared memory
o CUDA (Thread, Block) ~= OpenCL (Work item, Work group)
Grid: Multi-dimensional array of blocks
o 1D or 2D
o Blocks in a grid can run in parallel, or sequentially
Kernel execution is issued in grid units (see the sketch after this list)
Limited recursion (depth limit of 24 as of now)
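A small sketch of these abstractions (the tileKernel name and the 16x16 block size are illustrative, not from the slides): a 2D grid of 2D blocks, where threads within a block share memory and synchronize among themselves.

// Hypothetical kernel: each 16x16 block cooperates through shared memory.
__global__ void tileKernel(const float *in, float *out, int width, int height) {
    __shared__ float tile[16][16];                   // shared memory, visible to one block only

    int x = blockIdx.x * blockDim.x + threadIdx.x;   // position within the grid
    int y = blockIdx.y * blockDim.y + threadIdx.y;

    if (x < width && y < height)
        tile[threadIdx.y][threadIdx.x] = in[y * width + x];
    __syncthreads();                                 // threads in a block can synchronize

    if (x < width && y < height)
        out[y * width + x] = 2.0f * tile[threadIdx.y][threadIdx.x];
}

// Kernel execution is issued in grid units:
//   dim3 block(16, 16);                                  // 2D block of threads
//   dim3 grid((width + 15) / 16, (height + 15) / 16);    // 2D grid of blocks
//   tileKernel<<<grid, block>>>(d_in, d_out, width, height);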
Simple CUDA Example
[Diagram: a kernel launch is an asynchronous call from the CPU side to the GPU side]
[Diagram: NVCC splits C/C++ + CUDA code – host code goes to the C/C++ host compiler, device code to the device compiler, and the results are combined into CPU+GPU software]
Simple CUDA Example
Function qualifiers (one function can carry more than one):
o __global__: runs on the GPU, called from the host (or from the GPU) – only a void return type is allowed
o __device__: runs on the GPU, called from the GPU
o __host__: runs on the host, called from the host
Launching with 1 block and N threads per block spawns N instances of VecAdd on the GPU
o threadIdx answers “which of the N threads am I?” (see also: blockIdx)
The launch is asynchronous; the host should wait for the kernel to finish before using the results
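The code on the slide did not survive extraction; below is a reconstruction along the lines of the standard CUDA vector-addition example that these annotations describe (the unified-memory allocation is my simplification).

#include <cuda_runtime.h>

// __global__: runs on the GPU, callable from the host; only a void return type is allowed.
__global__ void VecAdd(const float *A, const float *B, float *C) {
    int i = threadIdx.x;          // which of the N threads am I? (see also blockIdx)
    C[i] = A[i] + B[i];
}

// __host__ __device__: one function compiled for both sides.
__host__ __device__ float square(float x) { return x * x; }

int main() {
    const int N = 256;
    float *A, *B, *C;
    cudaMallocManaged(&A, N * sizeof(float));   // unified memory keeps the sketch short
    cudaMallocManaged(&B, N * sizeof(float));
    cudaMallocManaged(&C, N * sizeof(float));
    for (int i = 0; i < N; i++) { A[i] = (float)i; B[i] = square((float)i); }

    // 1 block, N threads per block: N instances of VecAdd spawned on the GPU.
    VecAdd<<<1, N>>>(A, B, C);    // asynchronous call; returns immediately
    cudaDeviceSynchronize();      // host waits for the kernel to finish before reading C

    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}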
More Complex Example: Picture Blurring
Slides from NVIDIA/UIUC Accelerated Computing Teaching Kit
Another end-to-end example: https://devblogs.nvidia.com/even-easier-introduction-cuda/
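The teaching-kit slides themselves are not reproduced here; as a stand-in, this is a generic box-blur kernel sketch in the same spirit (the blurKernel name and the BLUR_SIZE radius are illustrative).

#define BLUR_SIZE 1   // blur radius, giving a 3x3 window

// Each thread computes one output pixel as the average of its neighborhood.
__global__ void blurKernel(const unsigned char *in, unsigned char *out, int w, int h) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (col >= w || row >= h) return;

    int sum = 0, count = 0;
    for (int dy = -BLUR_SIZE; dy <= BLUR_SIZE; dy++) {
        for (int dx = -BLUR_SIZE; dx <= BLUR_SIZE; dx++) {
            int r = row + dy, c = col + dx;
            if (r >= 0 && r < h && c >= 0 && c < w) {   // clamp at the image border
                sum += in[r * w + c];
                count++;
            }
        }
    }
    out[row * w + col] = (unsigned char)(sum / count);
}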

Great! Now we know how to use GPUs – Bye?


Matrix Multiplication Performance Engineering
No faster than CPU
[Figure: results from an NVIDIA P100 – Coleman et al., “Efficient CUDA,” 2017]
Architecture knowledge is needed (again)
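For concreteness, here is a sketch of the kind of naive kernel that typically lands in that regime (not the code measured above; names and the square-matrix assumption are mine). Every thread streams its row and column from global memory with essentially no data reuse, so the kernel is memory-bound.

// Naive matrix multiply: C = A * B for N x N matrices, one thread per output element.
__global__ void matMulNaive(const float *A, const float *B, float *C, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= N || col >= N) return;

    float acc = 0.0f;
    for (int k = 0; k < N; k++)
        acc += A[row * N + k] * B[k * N + col];   // 2N global-memory reads per output element
    C[row * N + col] = acc;
}

The usual fix is to block (tile) the computation through shared memory, which is exactly where the architectural details below start to matter.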
NVIDIA Volta-based GV100 Architecture (2018)

A single Streaming Multiprocessor (SM) has 64 INT32 cores and 64 FP32 cores (+8 Tensor cores…)
GV100 has 84 SMs
Volta Execution Architecture
64 INT32 Cores, 64 FP32 Cores, 4 Tensor Cores, ray-tracing cores…
o Specialization to make use of chip space…?
Not much on-chip memory per thread
o 96 KB Shared memory
o 1024 Registers per FP32 core
Hard limit on compute management
o 32 blocks AND 2048 threads AND 1024 threads/block
o e.g., 2 blocks with 1024 threads, or 4 blocks with 512 threads
o Enough registers/shared memory for all threads must be available (all context is resident during execution)
More threads than cores – Threads interleaved to hide memory latency
Resource Balancing Details
How many threads in a block?
Too small: 4x4 window == 16 threads
o 128 blocks to fill 2048 threads/SM
o SM only supports 32 blocks -> only 512 threads used
• SM has only 64 cores… does it matter? Sometimes!
Too large: 32x48 window == 1536 threads
o Threads do not fit in a block!
Too large: 1024 threads using more than 64 registers
Limitations vary across platforms (Fermi, Pascal, Volta, …)
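One practical way to navigate these per-architecture limits is the CUDA occupancy API, which accounts for the kernel's register and shared-memory usage on the current device; a small sketch (myKernel is a placeholder):

#include <cuda_runtime.h>
#include <cstdio>

__global__ void myKernel(float *data) {
    data[blockIdx.x * blockDim.x + threadIdx.x] *= 2.0f;   // placeholder body
}

int main() {
    int minGridSize = 0, blockSize = 0;
    // Ask the runtime which block size maximizes occupancy for this kernel.
    cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, myKernel, 0, 0);

    int blocksPerSM = 0;
    // How many blocks of that size can be resident on one SM at once?
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, myKernel, blockSize, 0);

    printf("suggested block size: %d, resident blocks per SM: %d\n", blockSize, blocksPerSM);
    return 0;
}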
Warp Scheduling Unit
Threads in a block are executed in 32-thread “warp” units
o Not part of language specs, just architecture specifics
o A warp is SIMD – Same PC, same instructions executed on every core
What happens when there is a conditional statement?
o Predication, or control divergence
o More on this later!
Warps have been 32-threads so far, but may change in the future
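A small sketch of divergent versus warp-uniform branches (the threshold and names are made up):

__global__ void divergenceDemo(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    // Divergent: threads within one 32-thread warp may take different paths,
    // so the warp executes both paths serially with the inactive threads masked off.
    if (in[i] > 0.5f)
        out[i] = 2.0f * in[i];
    else
        out[i] = 0.0f;

    // Warp-uniform: the condition depends only on the warp index, so every
    // thread in a given warp takes the same path and there is no divergence.
    if ((i / 32) % 2 == 0)
        out[i] += 1.0f;
}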
Memory Architecture Caveats
Shared memory peculiarities
o Small amount (e.g., 96 KB/SM for Volta) shared across all threads
o Organized into banks to distribute access
o Bank conflicts can drastically lower performance
Relatively slow global memory
o Blocking and caching become important (again)
o If not for performance, for power consumption…

[Figure: an 8-way bank conflict results in 1/8 of the shared memory bandwidth]
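A common way to sidestep bank conflicts is to pad shared-memory arrays; the transpose-style sketch below is my own illustration (it assumes a square matrix whose width is a multiple of 32). Column reads from an unpadded 32x32 tile all land in the same bank; one extra element per row shifts each row to a different bank.

#define TILE 32

__global__ void transposeTile(const float *in, float *out, int width) {
    // tile[TILE][TILE] would make every column read a 32-way bank conflict;
    // the +1 padding shifts each row by one bank and removes the conflict.
    __shared__ float tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * width + x];     // row-wise write: conflict-free

    __syncthreads();

    int tx = blockIdx.y * TILE + threadIdx.x;               // transposed block coordinates
    int ty = blockIdx.x * TILE + threadIdx.y;
    out[ty * width + tx] = tile[threadIdx.x][threadIdx.y];  // column-wise read: padded, conflict-free
}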
