
Universidad Nacional Mayor de San Marcos: Arquitectura de Computadoras Mg. Juan Carlos Gonzales Suarez 2019 - I


UNIVERSIDAD NACIONAL MAYOR DE SAN MARCOS
Decana de América
FACULTAD DE INGENIERÍA DE SISTEMAS E INFORMÁTICA

ARQUITECTURA DE COMPUTADORAS
Mg. JUAN CARLOS GONZALES SUAREZ

Graphics Processor

Arquitectura de Computadoras
Mg. Juan Carlos Gonzales Suárez
System Architecture Snapshot With a GPU (2019)
[Diagram: CPU connected to host memory, an I/O hub, and a GPU]
o Host memory (DDR4 2666 MHz): 128 GB/s, 100s of GB
o GPU memory: GDDR5 at 100s of GB/s, or HBM2 at ~1 TB/s; 10s of GB
o CPU to I/O Hub (IOH): NVMe storage and network interface
o CPU interconnect: QPI at 12.8 GB/s, UPI at 20.8 GB/s
o CPU to GPU: 16-lane PCIe Gen3 at 16 GB/s
High-Performance Graphics Memory
Modern GPUs even employ 3D-stacked memory attached via a silicon interposer:
o Very wide bus
o Very high bandwidth
o e.g., HBM2 in Volta

Graphics Card Hub, "GDDR5 vs GDDR5X vs HBM vs HBM2 vs GDDR6 Memory Comparison," 2019
Massively Parallel Architecture For Massively Parallel Workloads!
• NVIDIA CUDA (Compute Unified Device Architecture) – 2007
– A way to run custom programs on the massively parallel architecture!
• OpenCL specification released – 2008
• Both platforms expose the execution of a massive number of threads
GPU Threads
[Diagram: the CPU copies data to the GPU over PCIe, GPU threads process it, and results are copied back over PCIe]
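The copies in the diagram map directly onto the CUDA runtime API. A minimal sketch (the buffer size and names are illustrative, not from the slides):

```cuda
#include <cstdlib>

int main() {
    size_t bytes = 1 << 20;                    // 1 MiB, illustrative
    float* h_buf = (float*)malloc(bytes);      // host (CPU) memory
    float* d_buf;
    cudaMalloc(&d_buf, bytes);                 // device (GPU) memory

    // Host -> device copy over PCIe before the kernel runs...
    cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);
    // ...GPU threads would process d_buf here...
    // ...then device -> host copy over PCIe to retrieve results.
    cudaMemcpy(h_buf, d_buf, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_buf);
    free(h_buf);
    return 0;
}
```

On the 16-lane PCIe Gen3 link from the earlier snapshot (16 GB/s), these transfers are often the bottleneck, which is why data is kept resident on the GPU across kernels whenever possible.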
ATI Radeon 5000 Series Architecture

Radeon SIMD Engine

• 16 Stream Cores (SC)


• Local Data Share

VLIW Stream Core (SC)

Local Data Share (LDS)

NVIDIA GPU

GPU Architecture
NVIDIA Fermi, 512 Processing Elements (PEs)

GPU: A Multithreaded Coprocessor
• SP: scalar processor ('CUDA core'); executes one thread
• SM: streaming multiprocessor; 32 SPs (or 16, 48, or more)
• Fast local 'shared memory' (shared between the SPs): 16 KiB (or 64 KiB)
[Diagram: an SM with a 4×4 grid of SPs and its shared memory, above the global memory (on device)]
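The per-SM shared memory acts as a software-managed cache. A hedged sketch of a block-level sum using it (the 256-thread block size and the names are illustrative):

```cuda
// Each block of 256 threads sums its slice of `in` in the SM's fast
// on-chip shared memory; thread 0 writes the block's result to `out`.
__global__ void blockSum(const float* in, float* out) {
    __shared__ float buf[256];              // shared between the SPs of one SM
    int tid = threadIdx.x;
    buf[tid] = in[blockIdx.x * blockDim.x + tid];
    __syncthreads();                        // wait until every thread has loaded

    // Tree reduction in shared memory: 256 -> 128 -> ... -> 1.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) buf[tid] += buf[tid + stride];
        __syncthreads();
    }
    if (tid == 0) out[blockIdx.x] = buf[0];
}
```

Each load from global memory happens once; the repeated accesses of the reduction all hit the fast shared memory instead.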
• GPU: SMs
o 30 SMs in GT200
o 14 SMs in Fermi
For example, GTX 480: 14 SMs × 32 cores = 448 cores in one GPU
• GDDR memory: global memory (on device), 512 MiB - 6 GiB
[Diagram: an SM with a 4×4 grid of SPs and its shared memory, above the global memory (on device)]
How to Program GPUs
• Parallelization
– Decomposition into threads
• Memory
– Shared memory, global memory
Important Notes to Keep in Mind
• Avoid divergent branches
– The threads of a single SM should execute the same code
– Code that makes long, unpredictable jumps will run slowly
• Threads should be as independent as possible
– Synchronization and communication can be done efficiently only within a single processor
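The divergence warning can be made concrete. In this hedged sketch (kernel names and the arithmetic are illustrative), the first kernel sends even and odd threads of the same warp down different paths, so the hardware executes both paths one after the other; the second branches uniformly per block, so no warp diverges:

```cuda
// Divergent: even and odd threads of the same warp take different
// branches, so both branches are executed serially by the warp.
__global__ void divergent(float* x) {
    int i = threadIdx.x;
    if (i % 2 == 0) x[i] = x[i] * 2.0f;
    else            x[i] = x[i] + 1.0f;
}

// Uniform: the condition depends only on blockIdx, so every thread of
// a warp (indeed of the whole block) takes the same path.
__global__ void uniform(float* x) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (blockIdx.x % 2 == 0) x[i] = x[i] * 2.0f;
    else                     x[i] = x[i] + 1.0f;
}
```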

How to Program GPUs
• Parallelization
– Decomposition into threads
• Memory
– Shared memory, global memory
• Enormous processing power
– Avoid divergence
• Per-thread communication
– Synchronization, no interdependencies

CUDA Execution Abstraction

• Block: Multi-dimensional array of threads


– 1D, 2D, or 3D
– Threads in a block can synchronize among themselves
– Threads in a block can access shared memory
– CUDA (Thread, Block) ~= OpenCL (Work item, Work group)
• Grid: Multi-dimensional array of blocks
– 1D or 2D
– Blocks in a grid can run in parallel, or sequentially
• Kernel execution issued in grid units
• Limited recursion (depth limit of 24 as of now)
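A launch over a 2D grid of 2D blocks might look like the following sketch (the image size, kernel name, and the constant written are illustrative):

```cuda
// Fill a w x h image with a constant, one thread per pixel.
__global__ void shade(float* img, int w, int h) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < w && y < h) img[y * w + x] = 0.5f;   // guard the ragged edge
}

int main() {
    int w = 1024, h = 768;
    float* d_img;
    cudaMalloc(&d_img, (size_t)w * h * sizeof(float));

    dim3 block(16, 16);                          // 2D block: 256 threads
    dim3 grid((w + block.x - 1) / block.x,       // 2D grid covering the image
              (h + block.y - 1) / block.y);
    shade<<<grid, block>>>(d_img, w, h);         // execution issued in grid units
    cudaDeviceSynchronize();                     // the launch itself is asynchronous

    cudaFree(d_img);
    return 0;
}
```

Rounding the grid dimensions up and bounds-checking inside the kernel handles image sizes that are not multiples of the block size.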
Simple CUDA Example
[Diagram: the CPU side makes an asynchronous call; the GPU side executes the kernel]
Compilation flow: host C/C++ code plus CUDA code is fed to the NVCC compiler, which drives a host compiler and a device compiler to produce combined CPU+GPU software.
Simple CUDA Example
• Launch configuration: 1 block, N threads per block → N instances of VecAdd spawned in the GPU
• Function qualifiers:
– __global__: in GPU, called from host/GPU
– __device__: in GPU, called from GPU
– __host__: in host, called from host
– One function can be both __host__ and __device__
• Only void return allowed for kernels
• The host should wait for the kernel to finish
• threadIdx answers "which of N threads am I?" See also: blockIdx
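The VecAdd code itself lived in the slide image; a reconstruction consistent with the annotations above (N and the variable names are illustrative):

```cuda
#include <cstdio>
#define N 256

// __global__: runs in the GPU, called from the host; must return void.
__global__ void VecAdd(const float* a, const float* b, float* c) {
    int i = threadIdx.x;               // which of the N threads am I?
    c[i] = a[i] + b[i];
}

int main() {
    float ha[N], hb[N], hc[N];
    for (int i = 0; i < N; ++i) { ha[i] = (float)i; hb[i] = 2.0f * i; }

    float *da, *db, *dc;
    cudaMalloc(&da, sizeof ha);
    cudaMalloc(&db, sizeof hb);
    cudaMalloc(&dc, sizeof hc);
    cudaMemcpy(da, ha, sizeof ha, cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb, sizeof hb, cudaMemcpyHostToDevice);

    VecAdd<<<1, N>>>(da, db, dc);      // 1 block, N threads: N instances spawned
    cudaDeviceSynchronize();           // host waits for the asynchronous launch

    cudaMemcpy(hc, dc, sizeof hc, cudaMemcpyDeviceToHost);
    printf("hc[3] = %.1f\n", hc[3]);   // 3 + 6 = 9
    cudaFree(da); cudaFree(db); cudaFree(dc);
    return 0;
}
```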
More Complex Example:
Picture Blurring
• Slides from NVIDIA/UIUC Accelerated
Computing Teaching Kit
• Another end-to-end example
https://devblogs.nvidia.com/even-easier-introduction-cuda/

• Great! Now we know how to use GPUs – Bye?


Application Examples

Thank You

Juan Carlos Gonzales Suarez


jgonzaless@unmsm.edu.pe
