Accelerating CUDA Graph Algorithms at Maximum Warp
By: Sungpack Hong, Sang Kyun Kim, Tayo Oguntebi, Kunle Olukotun
Presenter: Thang M. Le
Authors
Sungpack Hong
Ph.D. graduate from Stanford; currently a principal member of technical staff at Oracle
Tayo Oguntebi
Kunle Olukotun: Professor of Electrical Engineering & Computer Science at Stanford; Director of the Pervasive Parallelism Laboratory
Agenda
What Is the Problem?
Why Does the Problem Exist?
Warp-Centric Programming Method
Other Techniques
Experimental Results
Study of Architectural Effects
Q&A
The Parallel Random Access Machine (PRAM) abstraction is often used to investigate the theoretical parallel performance of graph algorithms.
The PRAM approximation is quite accurate in supercomputer domains such as the Cray XMT.
PRAM-based algorithms fail to perform well on GPU architectures due to workload imbalance among threads.
The CUDA thread model exhibits certain discrepancies with the underlying GPU architecture: threads are programmed as if independent, but the hardware executes them in lock-step warps.
Requests targeting the same address are merged.
Accesses with spatial locality are maximally coalesced.
All other memory requests are serialized.
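A minimal sketch (not from the paper) contrasting an access pattern the hardware can coalesce with one it cannot; the kernel and array names are illustrative assumptions.

// Hypothetical illustration of how the access pattern determines coalescing.
__global__ void coalesced_copy(const int* in, int* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Adjacent threads touch adjacent addresses: the warp's 32 loads
    // are merged into a few wide memory transactions.
    if (i < n) out[i] = in[i];
}

__global__ void strided_copy(const int* in, int* out, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Adjacent threads touch addresses 'stride' elements apart: the same
    // 32 loads now map to many separate transactions and are serialized.
    if (i * stride < n) out[i * stride] = in[i * stride];
}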
SIMT relaxes the SIMD constraint by allowing threads in a warp to follow different execution paths (path divergence).
Path divergence provides more flexibility, but at the cost of performance: diverged paths are serialized, leaving the hardware underutilized.
A thread that processes a high-degree node will iterate the neighbor loop (line 23 of the paper's baseline kernel) many more times than the other threads in its warp, stalling them; a sketch of such a baseline kernel is given below.
Resulting problems: path divergence, non-coalesced memory accesses.
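A minimal sketch of a thread-per-node BFS level-expansion kernel in the spirit of the paper's baseline; the names and the CSR layout are assumptions, not the paper's exact listing. The per-thread loop over a node's neighbors is where both problems arise: its trip count depends on the node's degree (divergence), and neighbor indices are scattered (non-coalesced loads).

// Baseline: one thread per node of the graph (CSR arrays assumed).
// levels[v] == -1 means v has not been visited yet.
__global__ void bfs_baseline(const int* row_ptr, const int* col_idx,
                             int* levels, bool* changed,
                             int num_nodes, int cur_level) {
    int v = blockIdx.x * blockDim.x + threadIdx.x;
    if (v >= num_nodes || levels[v] != cur_level) return;
    // Trip count = degree of v: a high-degree node stalls its whole warp.
    for (int e = row_ptr[v]; e < row_ptr[v + 1]; e++) {
        int nbr = col_idx[e];              // scattered, non-coalesced load
        if (levels[nbr] == -1) {
            levels[nbr] = cur_level + 1;   // benign race: every writer stores the same value
            *changed = true;
        }
    }
}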
SISD: all threads in a warp execute on the same data; degree of parallelism (per SM) = O(# concurrent warps)
SIMT: each thread executes on different data; degree of parallelism (per SM) = O(# threads in a warp x # concurrent warps)
By default, all threads in a warp execute in the SISD phase. When appropriate, switch to SIMT to exploit data parallelism.
Since methodA is executed redundantly by every thread in the warp on the same input, its logic must be deterministic to guarantee correctness in the SISD phase; a warp-centric kernel skeleton is sketched below.
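A minimal sketch, under assumed names, of the warp-centric structure: an SISD phase in which all lanes redundantly execute the same deterministic work (the role the slides give to methodA), followed by a SIMT phase in which lanes split the node's neighbor list by lane ID. This is an illustration of the idea, not the paper's exact listing.

#define WARP_SZ 32

// Warp-centric BFS level expansion: one warp per frontier node.
__global__ void bfs_warp_centric(const int* row_ptr, const int* col_idx,
                                 int* levels, bool* changed,
                                 int num_nodes, int cur_level) {
    int lane    = threadIdx.x % WARP_SZ;
    int warp_id = (blockIdx.x * blockDim.x + threadIdx.x) / WARP_SZ;
    if (warp_id >= num_nodes) return;
    int v = warp_id;

    // SISD phase: every lane redundantly executes the same deterministic
    // logic on the same input (the "methodA" role), so no divergence occurs.
    if (levels[v] != cur_level) return;
    int begin = row_ptr[v];
    int end   = row_ptr[v + 1];

    // SIMT phase: lanes split the neighbor list by lane ID, so consecutive
    // lanes read consecutive entries of col_idx (coalesced) and the loop
    // trip count is degree/32 instead of degree.
    for (int e = begin + lane; e < end; e += WARP_SZ) {
        int nbr = col_idx[e];
        if (levels[nbr] == -1) {
            levels[nbr] = cur_level + 1;
            *changed = true;
        }
    }
}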
Advantages:
No path divergence
Increased memory coalescing
Makes it possible to exploit per-block shared memory
Mapping:
Coarse-grained mapping: each chunk of work is mapped to one warp
Number of chunks determines the number of warps launched (one warp per chunk)
Chunk size and warp size are independent
Chunk size is limited by the size of shared memory
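A sketch of how a chunk of node IDs might be staged into per-warp shared memory before processing; the chunk size, array names, and staging pattern are assumptions for illustration.

#define WARP_SZ         32
#define WARPS_PER_BLOCK 4    // assumes a block size of 128 threads
#define CHUNK_SZ        64   // assumed; bounded by shared memory per block

// Each warp stages its chunk of node IDs into shared memory with coalesced
// loads, then walks the chunk one node at a time (illustrative skeleton).
__global__ void process_chunks(const int* work_list, int work_len) {
    __shared__ int chunk[WARPS_PER_BLOCK][CHUNK_SZ];
    int lane       = threadIdx.x % WARP_SZ;
    int warp_local = threadIdx.x / WARP_SZ;   // warp index within the block
    int warp_id    = (blockIdx.x * blockDim.x + threadIdx.x) / WARP_SZ;

    int base = warp_id * CHUNK_SZ;            // this warp's chunk of the work list
    for (int i = lane; i < CHUNK_SZ && base + i < work_len; i += WARP_SZ)
        chunk[warp_local][i] = work_list[base + i];
    __syncwarp();

    for (int i = 0; i < CHUNK_SZ && base + i < work_len; i++) {
        int v = chunk[warp_local][i];
        (void)v; // ... process node v with the SISD/SIMT pattern sketched above ...
    }
}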
Disadvantages:
If the application's natural SIMT width is small, the underlying hardware will be under-utilized (# threads assigned to distinct data < # physical cores)
Disadvantages:
The ratio of the SIMT phase duration to the SISD phase duration imposes an Amdahl's Law limit on performance
Amdahl's Law:
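Writing s for the fraction of execution spent in the SISD phase and W for the warp width that accelerates the SIMT phase, the familiar bound is (notation assumed, not taken from the slides):

\[ \text{Speedup} \;=\; \frac{1}{\,s + \dfrac{1 - s}{W}\,} \;\le\; \frac{1}{s} \]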
Improvement:
Virtual warps: divide each physical warp into K virtual warps to increase parallelism within the SISD phase
Improves ALU utilization by O(K)
Drawback: may re-introduce path divergence among the K virtual warps sharing a physical warp; this is the trade-off between path divergence and ALU utilization
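A sketch of the virtual-warp indexing, assuming a virtual warp size of 8 (so K = 4 virtual warps per physical warp); the names are illustrative.

#define VWARP_SZ 8   // assumed virtual warp size; K = 32 / 8 = 4

// Same structure as the warp-centric kernel, but a "warp" is now VWARP_SZ
// lanes, so K frontier nodes share one physical warp.
__global__ void bfs_virtual_warp(const int* row_ptr, const int* col_idx,
                                 int* levels, bool* changed,
                                 int num_nodes, int cur_level) {
    int vlane    = threadIdx.x % VWARP_SZ;
    int vwarp_id = (blockIdx.x * blockDim.x + threadIdx.x) / VWARP_SZ;
    if (vwarp_id >= num_nodes) return;
    int v = vwarp_id;

    // The SISD work is now replicated only VWARP_SZ times per node...
    if (levels[v] != cur_level) return;
    int begin = row_ptr[v];
    int end   = row_ptr[v + 1];

    // ...but the K virtual warps inside one physical warp may follow
    // different paths here, re-introducing some divergence.
    for (int e = begin + vlane; e < end; e += VWARP_SZ) {
        int nbr = col_idx[e];
        if (levels[nbr] == -1) {
            levels[nbr] = cur_level + 1;
            *changed = true;
        }
    }
}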
Other Techniques
Deferring Outliers:
Define a degree threshold
Defer processing of any node whose degree exceeds the threshold
Process the deferred outliers in a separate kernel
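A sketch of the deferral step, assuming a global outlier queue built with atomicAdd; the queue layout and names are illustrative, not the paper's code.

// Threads process low-degree nodes immediately; high-degree "outlier" nodes
// are appended to a global queue and handled later by a separate kernel.
__global__ void bfs_defer_outliers(const int* row_ptr, const int* col_idx,
                                   int* levels, bool* changed,
                                   int* outlier_queue, int* outlier_count,
                                   int num_nodes, int cur_level, int threshold) {
    int v = blockIdx.x * blockDim.x + threadIdx.x;
    if (v >= num_nodes || levels[v] != cur_level) return;

    int degree = row_ptr[v + 1] - row_ptr[v];
    if (degree > threshold) {
        // Outlier: enqueue for a later kernel that expands it with a full warp or block.
        int slot = atomicAdd(outlier_count, 1);
        outlier_queue[slot] = v;
        return;
    }
    for (int e = row_ptr[v]; e < row_ptr[v + 1]; e++) {
        int nbr = col_idx[e];
        if (levels[nbr] == -1) { levels[nbr] = cur_level + 1; *changed = true; }
    }
}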
The virtual warp-centric method does not prevent work imbalance among warps in a block
Dynamic workload distribution: each warp fetches a chunk of work from a shared work queue (a fetch-loop sketch follows below)
Trade-off between static and dynamic work distribution:
Static distribution suffers from work imbalance
Dynamic distribution imposes run-time overhead
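A sketch of a warp fetching chunks from a shared work queue, assuming a global counter advanced with atomicAdd by lane 0 and broadcast to the warp with __shfl_sync; the names and chunk size are illustrative assumptions.

#define WARP_SZ  32
#define CHUNK_SZ 64   // assumed chunk size

// Each warp repeatedly grabs the next chunk of the work list until it is empty.
__global__ void process_dynamic(const int* work_list, int work_len,
                                int* next_chunk /* global counter, initialized to 0 */) {
    int lane = threadIdx.x % WARP_SZ;

    while (true) {
        int base = 0;
        if (lane == 0)
            base = atomicAdd(next_chunk, CHUNK_SZ);      // lane 0 claims a chunk
        base = __shfl_sync(0xffffffff, base, 0);         // broadcast to the whole warp
        if (base >= work_len) return;                    // queue exhausted

        int end = min(base + CHUNK_SZ, work_len);
        for (int i = base; i < end; i++) {
            int v = work_list[i];
            (void)v; // ... warp-centric processing of node v ...
        }
    }
}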
Experimental Results
Input Graphs:
RMAT: scale-free graph that follows a power-law degree distribution, like many real-world graphs; average vertex degree is 12
RANDOM: uniformly distributed graph created by randomly connecting m pairs of nodes out of n total nodes; average vertex degree is 12
LiveJournal: real-world graph with a very irregular structure
Patent: relatively regular real-world graph with a smaller average degree
Study of Architectural Effects
Enables massively parallel execution
Uses a large number of warps to hide memory latency
Uses GDDR3 memory, which has higher bandwidth and lower latency than FB-DIMM-based CPU main memory
Conclusion:
"A supercomputer is a device for turning compute-bound problems into I/O-bound problems." (Ken Batcher)