Intro To Matlab GPU Programming

PDF of presentation given by UMN CEMS IT staff Kai Mollerud in summer of 2013, giving basic principles of GPU computing and examples of MATLAB optimization for GPU platforms.

MATLAB Optimization, Parallelism, and GPU Computing

Kai Mollerud
CEMS IT office
What I'll Cover
Basics: what parallel computing is, and why GPUs are so good at it.
When a GPU is better than a CPU.
What you'll need, and how to use the GPU.
Development process
Learning to write fast, non-GPU programs
Turning non-GPU programs into GPU programs
Parallelism
This is the key idea behind all high-performance computing,
especially GPU computing.
Parallelism can be difficult to fully understand, because
people don't often do things in parallel.
Here is an image of some real-world parallel problem solving:
Analogue Parallelism
How is This Parallelism?
The chalk holder is performing what's called a SIMD operation:
single instruction, multiple data.
Each piece of data (chalk) must be of the same type to fit in the
array, but they can have different values (color, length).
Likewise, a computer can perform the same operation on each element
in an array simultaneously.
So, why GPU computing?
GPU vs. CPU
A modern CPU has between 2 and 16 processing cores.
CPUs are designed to handle a wide array of tasks, often
performing several heterogeneous operations at once.
A modern GPU, on the other hand, can have up to 2048
stream processors.
A GPU's usual job is to decide what color each pixel on your
monitor should be. A 1080p monitor has 2,073,600 pixels, each of
which can change color ~60 times a second.
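The scale of that workload is easy to check. A quick sketch, assuming the standard 1920 x 1080 resolution of a 1080p display and the ~60 Hz refresh rate quoted above (the variable names are only for illustration):

```matlab
% Pixel workload of a 1080p display refreshed about 60 times a second.
width      = 1920;
height     = 1080;
refresh_hz = 60;                            % approximate refresh rate

pixels = width * height;                    % 2,073,600 pixels
updates_per_second = pixels * refresh_hz;   % 124,416,000 pixel colors per second

fprintf('%d pixels, %d updates per second\n', pixels, updates_per_second);
```

Each of those updates is independent of the others, which is why the hardware is built around many simple processors rather than a few fast ones.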
Parallel Problems
Not all problems are well suited to parallel computation.
There are 3 levels of parallelism, determined by how much
the operations involved depend on each other:
fine-grained, coarse-grained, and embarrassingly parallel.
Put simply, GPU computing is best suited to embarrassingly parallel
problems, and is sometimes usable for problems with coarse-grained
parallelism.
The technical reasoning here revolves around memory performance;
ask me later if you would like a more detailed explanation.
When to use GPU computing
Just because a problem is parallel doesn't mean GPU
computing is the right choice.
CPUs can also do multiple operations at once, and each CPU core runs
much faster than a GPU core.
Where GPUs really shine is on problems that are parallel and
have very large amounts of data to process.
Whether a problem will really benefit from GPU computing
isn't always obvious until you have actually written the program.
Luckily, MATLAB makes it easy to write a program for the CPU first, then
adapt it to the GPU to see if it's worth it.
The Development Process
Step 1) Write a program
Step 2) Make the program fast
Step 3) Adapt the program to use the GPU
Step 1) Write a program
When you start writing a program,
performance is not important.
Try to focus on good organization of your
program; make it easy to read and modify.
Keeping things organized will make the next 2
steps much easier.
Personally, I start by writing comments to
describe each block of code.

Example Code #1
first_draft.m
1. Populates an array with some floating point values
2. Calculates the mean value of the array
3. Performs an operation on each element
4. Repeats 1-3 1000 times
This obviously isn't a useful calculation, but it is
computationally similar to some programs I have
seen researchers using.
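The code of first_draft.m is not reproduced in this text, so the following is only a hypothetical sketch of a first draft along those lines. The array size, the per-element operation (double the elements above the mean, halve the rest), and the results variable are all assumptions made for illustration.

```matlab
% Hypothetical first draft: readable, not fast.
n_elements = 1000;   % assumed array size
n_repeats  = 1000;   % "repeat 1-3 1000 times"

results = zeros(n_repeats, 1);   % one summary value per repeat (assumed)

for r = 1:n_repeats
    % 1. Populate an array with some floating point values
    data = zeros(1, n_elements);
    for k = 1:n_elements
        data(k) = rand();
    end

    % 2. Calculate the mean value of the array
    avg = mean(data);

    % 3. Perform an operation on each element (assumed operation:
    %    double values above the mean, halve the rest)
    for k = 1:n_elements
        if data(k) > avg
            data(k) = data(k) * 2;
        else
            data(k) = data(k) / 2;
        end
    end

    results(r) = sum(data);
end
```

The comments name each step first and the code fills them in, which is the organization habit described in Step 1.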
Step 2) Make it fast
This is not a simple subject. Computers are
complex, and making a program run quickly
means understanding how the computer runs the
program.
An inefficient program won't get better just
because you run it on the GPU.
Rather than tell you every trick I know for
speeding up programs, I'll show you how to
experiment and learn.
I'll also show you a few tricks.
Optimization tools
Code profiler
Programs run a bit slower in the profiler
You can save the output of the profiler as an HTML file to look at later;
this is useful when measuring performance changes.
Control your runtime
You will need to run your code again and again
Scale down the simulation detail, comment out plotting functions, etc.
If it's part of a larger program, find a way to isolate it from the rest.
tic + toc
The code profiler does this for you, but sometimes you just want one
number to look at, and these are easy to use.
Use a fast computer.
If your group runs simulations, you should think about getting a
dedicated computer to run them on.
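As a rough illustration of the two measuring tools above, here is a minimal sketch that times a small calculation with tic/toc and then profiles the same work. The workload and the problem size n are placeholders, not taken from the talk.

```matlab
% Time a scaled-down run with tic/toc.
n = 1e5;                  % assumed problem size, kept small for quick reruns

tic;
data  = rand(1, n);
total = sum(data .^ 2);
elapsed = toc;            % seconds since the matching tic
fprintf('Run took %.4f seconds\n', elapsed);

% Profile the same work to see where the time goes.
profile on;
data  = rand(1, n);
total = sum(data .^ 2);
profile off;
p = profile('info');      % struct holding the collected timings
% profsave(p, 'profile_results');   % uncomment to write the report as HTML
```

Saving one report before a change and one after makes it easier to tell whether the change actually helped.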
Optimization techniques
Avoid nesting loops if at all possible
Use for loops instead of while loops
Not necessarily faster, but cleaner and easier to parallelize
Avoid conditionals
Use the find() function
If you use an if/else, put the most common case first.
Consider using a switch statement
Avoid calling functions inside loops.
Think about MEX functions for very big calculations
MEX lets you call C programs from MATLAB
C is a lot faster than MATLAB
Don't use the mean() function; it's slow. Use sum()/numel()
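Two of these techniques are easy to show side by side. The sketch below uses made-up data and a made-up threshold: it zeroes every element above the threshold, first with a loop and a conditional, then with find(), and computes the mean without calling mean().

```matlab
% Loop + conditional versus find(), on the same data.
data = [0.2 0.9 0.5 0.7 0.1];
threshold = 0.5;

% Version 1: test every element inside a loop.
looped = data;
for k = 1:numel(looped)
    if looped(k) > threshold
        looped(k) = 0;
    end
end

% Version 2: find() returns the indices that pass the test,
% and one vector assignment updates them all.
vectorized = data;
idx = find(vectorized > threshold);   % idx is [2 4]
vectorized(idx) = 0;

% Mean without calling mean().
avg = sum(data) / numel(data);
```

Both versions produce the same array; the second has no loop and no if statement, which is also the shape a GPU handles well.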

Example code #2
second_draft.m
About 92% faster than #1
Uses find() to avoid conditionals
Eliminates the nested loops by using vector
operations
Replaces the mean() function with
sum()/numel()
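The code of second_draft.m is not reproduced here either. As a hypothetical sketch of the three changes listed, assume the task is to fill an array with random values, take its mean, double the elements above the mean and halve the rest, and repeat 1000 times. The sizes and the per-element operation are assumptions.

```matlab
% Hypothetical second draft: same steps, no inner loops.
n_elements = 1000;   % assumed array size
n_repeats  = 1000;
results = zeros(n_repeats, 1);

for r = 1:n_repeats
    data = rand(1, n_elements);          % populate in one call
    avg  = sum(data) / numel(data);      % mean via sum()/numel()

    above = find(data > avg);            % indices of elements above the mean
    data  = data / 2;                    % halve everything...
    data(above) = data(above) * 4;       % ...then a net x2 for those above

    results(r) = sum(data);
end
```

The only loop left is the outer repeat; every per-element step is now a single vector operation.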
Step 3) Using the GPU
MATLAB uses vectors for everything. GPUs are
built for vector operations.
This makes the conversion really easy.
To do GPU computing in MATLAB you will need:
Parallel Computing Toolbox (the university has this
licensed)
An NVIDIA graphics card with compute capability
version 1.3 or higher.
Entry cost of about $150 for a decent card
GPU functions
Performing a calculation on the GPU involves
2-3 steps.
Put the data you need into GPU memory
Call a GPU-enabled function on that data
Move the results from GPU memory to CPU
memory.
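The three steps can be sketched in a few lines. This assumes the Parallel Computing Toolbox and a supported GPU are available; the data and the calculation are placeholders.

```matlab
% Requires the Parallel Computing Toolbox and a supported NVIDIA GPU.
x = rand(1, 1e6);        % ordinary array in CPU memory

g = gpuArray(x);         % 1. put the data into GPU memory
g = sqrt(g) .* 2;        % 2. call GPU-enabled functions on that data
y = gather(g);           % 3. move the result back to CPU memory
```

If the data is created on the GPU in the first place, or the result never needs to leave it, one of the transfer steps drops out, which is where "2-3 steps" comes from.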

Putting data on the GPU
MATLAB's Parallel Computing Toolbox
provides the gpuArray data type
Any gpuArray variable is stored in GPU memory
gpuArray supports most data types, and behaves
more or less the same as a normal array
Any operation on a gpuArray variable will return a
gpuArray variable.
Putting data on the GPU
You can create gpuArrays in 2 ways
Copy a variable from CPU memory to GPU
memory
Create a variable directly on the GPU
Copying a variable to the GPU
Here a is an ordinary array and b is the result of gpuArray(a).
a and b are independent; subsequent
operations on one do not affect the other
a must be nonsparse, and must be of type
single, double, int/uint 8/16/32/64, or logical
i.e. no custom data types
b has a 108 byte placeholder in CPU memory,
and uses 1600 bytes of GPU memory
Transferring takes time; don't do it inside a
loop
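The slide's own code is missing from this text. A minimal sketch of the copy, assuming a is a 10-by-20 array of doubles (200 x 8 bytes = 1600 bytes, matching the figure above), might look like this:

```matlab
% Requires the Parallel Computing Toolbox and a supported NVIDIA GPU.
a = rand(10, 20);        % assumed shape: 200 doubles = 1600 bytes
b = gpuArray(a);         % independent copy of a, stored in GPU memory

b = b + 1;               % changes b only; a is untouched

whos a b                 % compare what MATLAB reports for each variable
```

The byte count that whos shows for b may differ between MATLAB releases, so treat the 108 byte figure as specific to the version used in the talk.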
Creating data on a GPU directly
You can use: ones, zeros, inf, nan, true, false,
eye, colon, rand, randi, randn, linspace,
logspace
This avoids the time cost of transferring from
CPU memory to GPU memory.
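A short sketch of creating arrays directly in GPU memory follows. It uses the 'gpuArray' option accepted by functions such as zeros and rand in current releases; releases from around the time of this talk may need the static-method form instead, for example gpuArray.zeros(n, 1). The size n is arbitrary.

```matlab
% Requires the Parallel Computing Toolbox and a supported NVIDIA GPU.
n = 1000;

z = zeros(n, 1, 'gpuArray');       % n-by-1 zeros, created on the GPU
r = rand(n, 1, 'gpuArray');        % uniform random values, created on the GPU
t = gpuArray.linspace(0, 1, n);    % 1-by-n evenly spaced points on the GPU
```

None of these arrays ever exists in CPU memory, so there is no transfer to pay for.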
GPU computing functions
MATLAB has overloaded many functions to
execute on the GPU when you call them with
a gpuArray as an argument.
A few important ones: trig functions, log, find,
max, plot (& related)
Full list online:
http://www.mathworks.com/help/distcomp/using-gpuarray.html (some added in 2013b not listed)
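To illustrate the overloading, the sketch below calls a few ordinary functions on a gpuArray and checks the type of the result. The input data is arbitrary.

```matlab
% Requires the Parallel Computing Toolbox and a supported NVIDIA GPU.
g = gpuArray(linspace(0, pi, 1000));   % data copied to the GPU

s   = sin(g);               % runs on the GPU because g is a gpuArray
m   = max(s);               % the result is still a gpuArray
idx = find(s > 0.5);        % find() is overloaded too

disp(class(m));             % shows the type of the result
```

No GPU-specific function names appear after the first line; the same calls would run on the CPU if g were an ordinary array.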
Example code #3
third_draft.m
Almost identical to #2
Turns the array into a gpuArray so the operations
are run on the GPU
Actually a bit slower than #2
That is, slower when using the same parameters.
More on this shortly.
Bringing GPU data back
The gather() function takes in a gpuArray and
copies it to CPU memory.
Again, this takes time. Try to leave data on
the GPU as long as you can and transfer all of
it back at once.
I can go into detail about GPU vs CPU memory
behavior later if there's time/interest; otherwise
ask me / email me.
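A minimal sketch of that advice, with an arbitrary calculation and an assumed number of steps: all the intermediate results stay on the GPU, and gather() is called once at the end.

```matlab
% Requires the Parallel Computing Toolbox and a supported NVIDIA GPU.
g = rand(1, 1e5, 'gpuArray');
n_steps = 10;                % assumed number of processing steps

for k = 1:n_steps
    g = g .* 0.5 + 0.1;      % every intermediate result stays on the GPU
end

result = gather(g);          % one transfer, after the loop has finished
```

Calling gather() inside the loop would give the same answer but pay the transfer cost on every iteration.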
Using the GPU in your code
Knowing how to use the GPU is half the battle;
the rest is knowing when.
There's a simple way to learn this: take some
code, change something to a gpuArray, and
see how the runtime changes.
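One way to run that experiment is sketched below: the same calculation is timed on the CPU and on the GPU for a few array sizes. The sizes and the calculation are assumptions, and the timing covers only the computation, not the transfer to the GPU.

```matlab
% Requires the Parallel Computing Toolbox and a supported NVIDIA GPU.
sizes = [1e3 1e5 1e7];       % assumed range of array sizes to try
dev = gpuDevice;             % handle to the current GPU

for n = sizes
    x = rand(1, n);

    tic;
    y_cpu = sum(sin(x) .^ 2);
    t_cpu = toc;

    g = gpuArray(x);
    tic;
    y_gpu = sum(sin(g) .^ 2);
    wait(dev);               % GPU calls can return early; wait for them to finish
    t_gpu = toc;

    fprintf('n = %8d   CPU %.5f s   GPU %.5f s\n', n, t_cpu, t_gpu);
end
```

Expect the first GPU call to include some one-time startup cost, so it is worth running the script more than once before drawing conclusions.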
When to use the GPU
GPUs are good for:
Big arrays/vectors
Doing simple tasks many times
They're bad for:
Conditional logic
Manipulating a few specific array elements
Quantitative example
I wrote 3 programs to do the same task. The task exhibits
coarse-grained parallelism, and has a deterministic run-time.
Naive.m is a simple, non-parallel implementation. It isn't exceptionally
bad, but no effort has been made to make it run efficiently.
CPU.m is a CPU-only, parallel implementation that is essentially as fast
as it can be.
GPU.m is very similar to CPU.m, but uses GPU operations wherever
possible.
I recorded performance metrics from these 3 programs across
a range of inputs, increasing the size of the input data each
time.
Testing details
The tests were run on a Dell OptiPlex 990
Intel i5-2400, 4 cores @ 3.1 GHz (3.3 GHz with Turbo Boost)
4 GB 1333 MHz RAM
NVIDIA GeForce GTX 650 Ti
1 GB GDDR5 memory @ 5400 MHz
768 stream processors @ 941 MHz
Windows 7 Enterprise, 64-bit
The numbers I gathered are unique to this
computer. Your results will vary, but should
follow similar trends.
[Figure: runtime vs. array size]
[Figure: elements per second]
Coding for the GPU
Try not to move data between CPU and GPU very
often
Replace conditional logic with set theory (vector ops
and find() instead of loops and if statements)
Try to isolate variables.
Storing values in an array to look at later can replace
random accesses to those values while calculating
them
Be clever.
You may need to change your entire approach to a
problem to get the most out of GPU computing
Questions?
