PDF of presentation given by UMN CEMS IT staff Kai Mollerud in summer of 2013, giving basic principles of GPU computing and examples of MATLAB optimization for GPU platforms.
Kai Mollerud
CEMS IT office

What I'll Cover
- Basics: what parallel computing is, and why GPUs are so good at it
- When a GPU is better than a CPU
- What you'll need and how to use the GPU
- Development process
  - Learning to write fast, non-GPU programs
  - Turning non-GPU programs into GPU programs

Parallelism
This is the key idea behind all high-performance computing, especially GPU computing. Parallelism can be difficult to fully understand, because people don't often do things in parallel. Here is an image of some real-world parallel problem solving:

Analogue Parallelism
[Image: a chalk holder gripping several pieces of chalk]

How is This Parallelism?
The chalk holder is performing what's called a SIMD operation: single instruction, multiple data. Each piece of data (chalk) must be of the same type to fit in the array, but the pieces can have different values (color, length). Likewise, a computer can perform the same operation on each element in an array simultaneously.

So, why GPU computing?

GPU vs. CPU
- A modern CPU has between 2 and 16 processing cores. CPUs are designed to handle a wide array of tasks, often performing several heterogeneous operations at once.
- A modern GPU, on the other hand, can have up to 2048 stream processors. A GPU's usual job is to decide what color each pixel on your monitor is; a 1080p monitor has 2,073,600 pixels that can change color ~60 times a second.

Parallel Problems
Not all problems are well suited to parallel computation. There are 3 levels of parallelism, determined by how much the operations involved depend on each other:
- Fine-grained
- Coarse-grained
- Embarrassingly parallel

Put simply, GPU computing is best suited to embarrassingly parallel problems, and is sometimes usable for problems with coarse-grained parallelism. The technical reasoning here revolves around memory performance; ask me later if you would like a more detailed explanation.

When to use GPU computing
Just because a problem is parallel doesn't mean GPU computing is the right choice. CPUs can also do multiple operations at once, and each CPU core runs much faster than a GPU core. Where GPUs really shine is on problems that are parallel and have very large amounts of data to process.

Whether a problem will really benefit from GPU computing isn't always obvious until you have actually written the program. Luckily, MATLAB makes it easy to write a program for the CPU first, then adapt it to the GPU to see if it's worth it.

The Development Process
Step 1) Write a program
Step 2) Make the program fast
Step 3) Adapt the program to use the GPU

Step 1) Write a program
When you start writing a program, performance is not important. Focus on good organization: make the program easy to read and modify. Keeping things organized will make the next 2 steps much easier. Personally, I start by writing comments to describe each block of code.
Example Code #1
first_draft.m
1. Populates an array with some floating point values
2. Calculates the mean value of the array
3. Performs an operation on each element
4. Repeats 1-3 1000 times

This obviously isn't a useful calculation, but it is computationally similar to some programs I have seen researchers using.

Step 2) Make it fast
This is not a simple subject. Computers are complex, and making a program run quickly means understanding how the computer runs the program. An inefficient program won't get better just because you run it on the GPU. Rather than tell you every trick I know for speeding up programs, I'll show you how to experiment and learn. I'll also show you a few tricks.

Optimization tools
- Code profiler
  - Programs run a bit slower in the profiler.
  - You can save the output of the profiler as an HTML file to look at later; this is useful when measuring performance changes.
- Control your runtime
  - You will need to run your code again and again.
  - Scale down the simulation detail, comment out plotting functions, etc.
  - If it's part of a larger program, find a way to isolate it from the rest.
- tic + toc
  - The code profiler does this for you, but sometimes you just want one number to look at, and these are easy to use.
- Use a fast computer
  - If your group runs simulations, you should think about getting a dedicated computer to run them on.

Optimization techniques
- Avoid nesting loops if at all possible.
- Use for loops instead of while loops.
  - Not necessarily faster, but cleaner and easier to parallelize
- Avoid conditionals.
  - Use the find() function.
  - If you use an if/else, put the most common case first.
  - Consider using a switch statement.
- Avoid calling functions inside loops.
- Think about MEX functions for very big calculations.
  - They let you use C programs from MATLAB.
  - C is a lot faster than MATLAB.
- Don't use the mean() function; it's slow. Use sum()/numel().
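Several of these techniques can be tried on a small stand-in problem. The sketch below is not the first_draft.m or second_draft.m from the talk, which are not reproduced in these notes; the array size, the "double every element above the mean" rule, and all variable names are assumptions made for illustration. It times a loop-and-if version against a find()-based vector version using tic and toc.

```matlab
% Hypothetical stand-in workload, not the code from the talk.
n = 1e6;                 % assumed array size
rng(0);                  % fixed seed so the run is repeatable
data = rand(1, n);

% Version A: loop with a conditional, threshold from mean()
tic;
outA = data;
m = mean(data);
for k = 1:n
    if data(k) > m
        outA(k) = data(k) * 2;   % assumed per-element operation
    end
end
tA = toc;

% Version B: vector operations, find() instead of the if,
% and sum()/numel() instead of mean()
tic;
outB = data;
m2 = sum(data) / numel(data);
idx = find(data > m2);           % indices of the elements to change
outB(idx) = data(idx) * 2;
tB = toc;

fprintf('loop: %.4f s, vectorized: %.4f s\n', tA, tB);
```

How large the gap is depends on the MATLAB release and the machine, so it is worth running this a few times before drawing conclusions. Logical indexing, as in outB(data > m2), would select the same elements without the call to find().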
Example code #2
second_draft.m
- About 92% faster than #1
- Uses find() to avoid conditionals
- Eliminates the nested loops by using vector operations
- Replaces the mean() function with sum()/numel()

Step 3) Using the GPU
MATLAB uses vectors for everything, and GPUs are built for vector operations. This makes the conversion really easy.

To do GPU computing in MATLAB you will need:
- Parallel Computing Toolbox (the university has this licensed)
- An NVIDIA graphics card with compute capability 1.3 or higher (entry cost of about $150 for a decent card)

GPU functions
Performing a calculation on the GPU involves 2-3 steps:
1. Put the data you need into GPU memory.
2. Call a GPU-enabled function on that data.
3. Move the results from GPU memory to CPU memory.
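A minimal sketch of those three steps follows. It assumes the Parallel Computing Toolbox and a supported NVIDIA GPU are available; the input data and the sin-squared calculation are placeholders chosen for illustration, not taken from the talk.

```matlab
% Assumes the Parallel Computing Toolbox and a supported NVIDIA GPU.
x = linspace(0, 2*pi, 1e6);   % ordinary array in CPU memory (assumed example data)

xGpu = gpuArray(x);           % 1) copy the data into GPU memory
yGpu = sin(xGpu) .^ 2;        % 2) overloaded functions run on the GPU
y    = gather(yGpu);          % 3) copy the result back to CPU memory

% Compare against the same calculation done entirely on the CPU.
yCpu = sin(x) .^ 2;
fprintf('max difference: %g\n', max(abs(y - yCpu)));
```

The CPU and GPU results may differ in the last few bits, so a small tolerance is a safer comparison than exact equality.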
Putting data on the GPU
- MATLAB's Parallel Computing Toolbox provides the gpuArray data type.
- Any gpuArray variable is stored in GPU memory.
- gpuArray supports most data types, and behaves more or less the same as a normal array.
- Any operation on a gpuArray variable will return a gpuArray variable.

You can create gpuArrays in 2 ways:
- Copy a variable from CPU memory to GPU memory
- Create a variable directly on the GPU

Copying a variable to the GPU
Here a is the variable in CPU memory and b is its gpuArray copy.
- a and b are independent; subsequent operations on one do not affect the other.
- a must be nonsparse, and must be of type single, double, int/uint 8/16/32/64, or logical (i.e. no custom data types).
- b has a 108-byte placeholder in CPU memory, and uses 1600 bytes of GPU memory.
- Transferring takes time; don't do it inside a loop.

Creating data on a GPU directly
- You can use: ones, zeros, inf, nan, true, false, eye, colon, rand, randi, randn, linspace, logspace.
- This avoids the time cost of transferring from CPU memory to GPU memory.

GPU computing functions
MATLAB has overloaded many functions to execute on the GPU when you call them with a gpuArray as an argument. A few important ones: trig functions, log, find, max, plot (and related).
Full list online: http://www.mathworks.com/help/distcomp/using-gpuarray.html (some added in 2013b are not listed)

Example code #3
third_draft.m
- Almost identical to #2
- Turns the array into a gpuArray so the operations are run on the GPU
- Actually a bit slower than #2
  - That is, slower when using the same parameters. More on this shortly.

Bringing GPU data back
The gather() function takes in a gpuArray and copies it to CPU memory. Again, this takes time; try to leave data on the GPU as long as you can and transfer all of it back at once. I can go into detail about GPU vs. CPU memory behavior later if there's time or interest; otherwise ask me or email me.
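The points about creating gpuArrays and delaying the transfer back can be put together in one short sketch. It assumes the Parallel Computing Toolbox and a supported NVIDIA GPU, and it uses the 'gpuArray' option of rand and zeros found in recent MATLAB releases; older releases may require a different syntax. The sizes, the loop, and the variable names other than a and b are illustrative assumptions.

```matlab
% Assumes the Parallel Computing Toolbox and a supported NVIDIA GPU.
a = ones(1, 200);            % 200 doubles = 1600 bytes in CPU memory
b = gpuArray(a);             % copy: b lives in GPU memory
b = b * 3;                   % changes b only; a is untouched

% Create data directly on the GPU, with no CPU-to-GPU transfer.
g = rand(1, 1e6, 'gpuArray');   % assumed size

% Keep intermediate results on the GPU across iterations.
total = zeros(1, 1, 'gpuArray');
for k = 1:10
    g = sqrt(g);             % still a gpuArray after each operation
    total = total + sum(g);  % the running sum also stays on the GPU
end

result = gather(total);      % single transfer back at the end
fprintf('result = %g\n', result);
```

Calling gather() inside the loop would give the same number but would pay the transfer cost on every iteration.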
Using the GPU in your code
Knowing how to use the GPU is half the battle; the rest is knowing when. There's a simple way to learn this: take some code, change something to a gpuArray, and see how the runtime changes.

When to use the GPU
GPUs are good for:
- Big arrays/vectors
- Doing simple tasks many times
They're bad for:
- Conditional logic
- Manipulating a few specific array elements

Quantitative example
I wrote 3 programs to do the same task. The task exhibits coarse-grained parallelism and has a deterministic run-time.
- Naive.m is a simple, non-parallel implementation. It isn't exceptionally bad, but no effort has been made to make it run efficiently.
- CPU.m is a CPU-only, parallel implementation that is essentially as fast as it can be.
- GPU.m is very similar to CPU.m, but uses GPU operations wherever possible.
I recorded performance metrics from these 3 programs across a range of inputs, increasing the size of the input data each time.

Testing details
The tests were run on a Dell OptiPlex 990:
- Intel i5-2400, 4 cores @ 3.1 GHz (3.3 GHz with Turbo Boost)
- 4 GB 1333 MHz RAM
- NVIDIA GeForce GTX 650 Ti
  - 1 GB GDDR5 memory @ 5400 MHz
  - 768 CUDA cores @ 941 MHz
- Windows 7 64-bit Enterprise
The numbers I gathered are unique to this computer. Your results will vary, but should follow similar trends.

Runtime vs. array size
[Plot]

Elements per second
[Plot]

Coding for the GPU
- Try not to move data between CPU and GPU very often.
- Replace conditional logic with set theory (loops and if statements vs. vector ops and find()).
- Try to isolate variables. Storing values in an array to look at later can replace random accesses to those values while calculating them.
- Be clever. You may need to change your entire approach to a problem to get the most out of GPU computing.

Questions?
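The experiment suggested under "Using the GPU in your code" (change something to a gpuArray and watch the runtime) can be sketched as below. This is not Naive.m, CPU.m, or GPU.m from the talk; the workload f, the list of sizes, and the variable names are assumptions for illustration. It assumes the Parallel Computing Toolbox and a supported NVIDIA GPU, and uses timeit for the CPU and gputimeit for the GPU.

```matlab
% Assumes the Parallel Computing Toolbox and a supported NVIDIA GPU.
sizes = [1e3 1e4 1e5 1e6 1e7];          % assumed input sizes
f = @(v) sum(sin(v) .^ 2 + cos(v));     % hypothetical element-wise workload

tCpu = zeros(size(sizes));
tGpu = zeros(size(sizes));
for k = 1:numel(sizes)
    x  = rand(1, sizes(k));
    xg = gpuArray(x);                   % transfer is outside the timed region
    tCpu(k) = timeit(@() f(x));         % same function, CPU input
    tGpu(k) = gputimeit(@() f(xg));     % same function, gpuArray input
end

for k = 1:numel(sizes)
    fprintf('%8d elements: CPU %.2e s, GPU %.2e s\n', ...
            sizes(k), tCpu(k), tGpu(k));
end
```

Because the copy to the GPU happens before timing starts, these numbers leave out transfer cost; moving the gpuArray() and gather() calls inside the timed function would include it.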