GPU Architecture

GPU programming allows highly parallel general-purpose computations to be run on GPU accelerators. There are several programming frameworks for GPU programming, including CUDA, OpenCL, and OpenACC. CUDA provides direct access to the GPU's instruction set for executing compute kernels. OpenCL is an open standard for parallel programming across heterogeneous platforms including CPUs and GPUs. OpenACC uses directives to help port codes to heterogeneous HPC hardware.

GPU and Programming

• Questions you should ask yourself before starting to code or optimize:
• Will my code run faster on the GPU?
• Is my existing code running as fast as it should?
• Is performance limited by computation or by memory bandwidth?
• Pencil-and-paper calculations can (often) answer such questions; a sketch follows.
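• For example, a quick arithmetic-intensity estimate tells you whether a kernel will be limited by computation or by memory bandwidth. A minimal C++ sketch follows, assuming illustrative peak numbers (the 10 TFLOP/s and 900 GB/s figures are placeholders, not measurements of any particular GPU):

#include <cstdio>

// Pencil-and-paper roofline estimate: compare a kernel's arithmetic
// intensity (FLOPs per byte moved) against the machine balance
// (peak FLOP/s divided by peak memory bandwidth).
int main() {
    // Hypothetical GPU peaks -- substitute your device's datasheet values.
    const double peak_gflops = 10000.0;  // 10 TFLOP/s (placeholder)
    const double peak_gbps   = 900.0;    // 900 GB/s memory bandwidth (placeholder)
    const double machine_balance = peak_gflops / peak_gbps;  // FLOPs per byte

    // Example kernel: y[i] = a * x[i] + y[i] (AXPY).
    // Per element: 2 FLOPs; 2 reads + 1 write of 8-byte doubles = 24 bytes.
    const double flops_per_elem = 2.0;
    const double bytes_per_elem = 24.0;
    const double intensity = flops_per_elem / bytes_per_elem;

    printf("machine balance:  %.1f FLOPs/byte\n", machine_balance);
    printf("kernel intensity: %.3f FLOPs/byte\n", intensity);
    if (intensity < machine_balance)
        printf("kernel is memory-bandwidth-bound\n");
    else
        printf("kernel is compute-bound\n");
    return 0;
}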
GPU vs CPU
• A Central Processing Unit (CPU) is a latency-optimized general
purpose processor that is designed to handle a wide range of distinct
tasks sequentially, while a Graphics Processing Unit (GPU) is a
throughput-optimized specialized processor designed for high-end
parallel computing.
What is a GPU?
• A Graphics Processing Unit (GPU) is a specialized processor whose job
is to rapidly manipulate memory and to accelerate a number of specific
tasks that require a high degree of parallelism.
GPU Architecture
• Because the GPU uses thousands of lightweight cores whose instruction
sets are optimized for multi-dimensional matrix arithmetic and floating-
point calculations, it is extremely fast at linear algebra and similar
tasks that require a high degree of parallelism.
• As a rule of thumb, if your algorithm accepts vectorized data, the job
is probably well-suited for GPU computing.
Task
• How do the limitations of GPUs and CPUs compare?
• What is GPU computing, and how is it applied today?
The basic architecture of a GPU
• A GPU uses many lightweight processing cores, leverages data
parallelism, and has high memory throughput. While the specific
components vary by model, fundamentally most modern GPUs use a
single instruction, multiple data (SIMD) stream architecture.
What is Flynn’s Taxonomy?
• Flynn’s Taxonomy is a categorization of computer architectures by
Stanford University’s Michael J. Flynn. The basic idea behind Flynn’s
Taxonomy is simple: computations consist of two streams (a data
stream and an instruction stream) that can be processed in sequence
(one stream at a time) or in parallel (multiple streams at once). Two
stream types with two possible processing methods lead to the four
categories in Flynn’s Taxonomy. Let’s take a look at each.
Single Instruction Single Data (SISD)
• A SISD stream architecture is one where a single instruction stream
(e.g. a program) executes on one data stream. This architecture is
used in older computers with a single-core processor, as well as in
many simple compute devices.
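• A minimal sketch of SISD-style execution (the array and its contents are illustrative): one instruction stream walks one data stream on a single core:

#include <cstdio>

// SISD: a single instruction stream processes a single data stream,
// one element at a time, on one core.
int main() {
    const int n = 8;
    float x[n] = {1, 2, 3, 4, 5, 6, 7, 8};
    float sum = 0.0f;
    for (int i = 0; i < n; ++i)  // one instruction stream, sequential
        sum += x[i];             // one data element per step
    printf("sum = %.1f\n", sum);
    return 0;
}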
Single Instruction Multiple Data (SIMD)
• A SIMD stream architecture has a single control processor and
instruction memory, so only one instruction can be run at any given
point in time. That single instruction is copied and run across each
core at the same time. This is possible because each processor has its
own dedicated memory, which allows for parallelism at the data level
(a.k.a. “data parallelism”).
• The fundamental advantage of SIMD is that data parallelism allows it
to execute computations quickly (multiple processors doing the same
thing) and efficiently (only one instruction unit).
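• To make this concrete, here is a minimal CUDA kernel sketch of data parallelism: every thread executes the same instruction sequence, each on its own array element (the name vecAdd and its signature are illustrative; a full host-side launch of this kernel appears in the CUDA section below):

// SIMD-style data parallelism: one kernel (instruction stream),
// many threads, each operating on a different data element.
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // this thread's element
    if (i < n) c[i] = a[i] + b[i];  // same instruction, different data
}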
Multiple Instruction Single Data (MISD)
• MISD stream architecture is effectively the reverse of SIMD
architecture. With MISD multiple instructions are performed on the
same data stream. The use cases for MISD are very limited today.
Most practical applications are better addressed by one of the other
architectures.
Multiple Instruction Multiple Data (MIMD)
• MIMD stream architecture offers parallelism for both data and
instruction streams. With MIMD, multiple processors execute
instruction streams independently against different data streams.
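• A host-side C++ sketch of MIMD-style execution, assuming two CPU threads (the functions and data are illustrative): each thread runs a different instruction stream on a different data stream:

#include <algorithm>
#include <cstdio>
#include <numeric>
#include <thread>
#include <vector>

// MIMD: two independent instruction streams (different code) execute
// concurrently on different data streams.
int main() {
    std::vector<int> a(1000), b(1000);
    std::iota(a.begin(), a.end(), 1);   // a = 1..1000
    std::iota(b.begin(), b.end(), 5);   // b = 5..1004

    long sum = 0;
    int maxv = 0;
    std::thread t1([&] { sum = std::accumulate(a.begin(), a.end(), 0L); });
    std::thread t2([&] { maxv = *std::max_element(b.begin(), b.end()); });
    t1.join();
    t2.join();
    printf("sum(a) = %ld, max(b) = %d\n", sum, maxv);
    return 0;
}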
• What makes SIMD best for GPUs?
• What about SIMT?
CUDA compute hierarchy
• The processing resources in CUDA are designed to help optimize
performance for GPU use cases. Three of the fundamental
components of the hierarchy are threads, thread blocks, and kernel
grids.
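• A minimal sketch of how the three levels fit together (the kernel name and launch sizes are illustrative): each thread locates itself using its index within its block and its block's index within the kernel grid:

#include <cstdio>
#include <cuda_runtime.h>

// Each thread computes a unique global index from the hierarchy:
// threadIdx (position in block), blockIdx (position in grid),
// blockDim (threads per block).
__global__ void whoAmI() {
    int global_id = blockIdx.x * blockDim.x + threadIdx.x;
    printf("block %d, thread %d -> global id %d\n",
           blockIdx.x, threadIdx.x, global_id);
}

int main() {
    whoAmI<<<2, 4>>>();       // kernel grid: 2 blocks of 4 threads each
    cudaDeviceSynchronize();  // wait for device printf to complete
    return 0;
}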
CUDA memory hierarchy
• Like compute resources, memory allocation follows a specific
hierarchy in CUDA. While the CUDA toolchain handles much of the
memory placement automatically, CUDA developers can and do program
to optimize memory usage directly. Here are the key levels of the
CUDA memory hierarchy, illustrated by the sketch after this list.
• Registers
• Read-only memory
• Cache/shared memory
• L2 Cache
• Global memory
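• As an illustration of the register, shared-memory, and global-memory levels, here is a minimal block-wide sum sketch (the kernel name and block size are illustrative):

#include <cstdio>
#include <cuda_runtime.h>

#define BLOCK 128

// Sums BLOCK values: each thread loads from global memory into fast
// on-chip shared memory; per-thread scalars live in registers.
__global__ void blockSum(const float *in, float *out) {
    __shared__ float buf[BLOCK];  // shared memory (one copy per block)
    int tid = threadIdx.x;        // held in a register
    buf[tid] = in[tid];           // global -> shared
    __syncthreads();
    for (int s = BLOCK / 2; s > 0; s >>= 1) {  // tree reduction in shared memory
        if (tid < s) buf[tid] += buf[tid + s];
        __syncthreads();
    }
    if (tid == 0) *out = buf[0];  // result back to global memory
}

int main() {
    float h_in[BLOCK], h_out = 0.0f;
    for (int i = 0; i < BLOCK; ++i) h_in[i] = 1.0f;  // sum should be 128
    float *d_in, *d_out;
    cudaMalloc(&d_in, BLOCK * sizeof(float));
    cudaMalloc(&d_out, sizeof(float));
    cudaMemcpy(d_in, h_in, BLOCK * sizeof(float), cudaMemcpyHostToDevice);
    blockSum<<<1, BLOCK>>>(d_in, d_out);
    cudaMemcpy(&h_out, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    printf("block sum = %.1f\n", h_out);
    cudaFree(d_in); cudaFree(d_out);
    return 0;
}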
• Task: Briefly explain the history of Nvidia GPU architectures.
What is GPU Programming?
• GPU Programming is a method of running highly parallel general-
purpose computations on GPU accelerators.

• While past GPUs were designed exclusively for computer graphics,
today they are used extensively for general-purpose computing
(GPGPU computing) as well. In addition to graphics rendering, GPU-
driven parallel computing is used today for scientific modelling,
machine learning, and other highly parallelizable jobs.
GPU Programming APIs
• A GPU understands computational problems in terms of graphical
primitives. Today there are several programming frameworks available
that handle these primitives for you under the hood, so you can
focus on higher-level computing concepts.
CUDA
• Compute Unified Device Architecture (CUDA) is a parallel computing
platform and application programming interface (API) created by
Nvidia in 2006. It gives direct access to the GPU’s virtual instruction
set for the execution of compute kernels.
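• A minimal end-to-end CUDA sketch (array sizes and names are illustrative): allocate device memory, copy inputs over, launch the vecAdd compute kernel from the SIMD section, and copy the result back:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *h_a = (float*)malloc(bytes), *h_b = (float*)malloc(bytes),
          *h_c = (float*)malloc(bytes);
    for (int i = 0; i < n; ++i) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

    float *d_a, *d_b, *d_c;  // device (global) memory
    cudaMalloc(&d_a, bytes); cudaMalloc(&d_b, bytes); cudaMalloc(&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    int block = 256, grid = (n + block - 1) / block;
    vecAdd<<<grid, block>>>(d_a, d_b, d_c, n);  // launch compute kernel
    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);

    printf("c[0] = %.1f (expect 3.0)\n", h_c[0]);
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}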
OpenCL
• While CUDA is a proprietary framework, OpenCL is an open standard
for parallel programming across heterogeneous platforms created by
the Khronos Group. OpenCL works with central processing units
(CPU), graphics processing units (GPU), digital signal processors, field-
programmable gate arrays (FPGA) and other processors or hardware
accelerators.
OpenACC
• OpenACC is a user-driven directive-based parallel programming
standard designed for scientists and engineers interested in porting
their codes to a wide variety of heterogeneous high-performance
computing (HPC) hardware platforms. The standard is designed for
the users by the users.
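• To illustrate the directive-based approach, here is a minimal sketch: an ordinary C++ loop annotated with an OpenACC pragma, which an OpenACC-capable compiler can offload to an accelerator (a compiler without OpenACC support simply ignores the pragma):

#include <cstdio>

// The pragma asks the compiler to parallelize the loop on an
// accelerator; the data clauses describe host<->device movement.
int main() {
    const int n = 1 << 20;
    static float x[1 << 20], y[1 << 20];
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    #pragma acc parallel loop copyin(x) copy(y)
    for (int i = 0; i < n; ++i)
        y[i] = 2.0f * x[i] + y[i];  // simple AXPY-style update

    printf("y[0] = %.1f (expect 4.0)\n", y[0]);
    return 0;
}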
