Image Rotation Using CUDA
DECLARATION BY STUDENT
I hereby declare that the project entitled Parallel Programming using CUDA, carried out by me during the Summer Term of the academic year 2011-2012 for the Practical Training/Educational Tour as per the B.Tech (I.T.) degree curriculum, is my original work and has been completed successfully according to my guide's directions and NITK's specifications.
ACKNOWLEDGEMENT
I take this opportunity to express my sincere gratitude to my mentor Mr. Vivek Na and Prof. Tim Poston, who sincerely helped and supported me during my project. I am thankful to them for devoting their precious time, out of their busy schedules, to discussions on the project; without their kind support and help the project could not have been completed successfully. I would also like to express my gratitude to the Sir Ashutosh Mukherji Professor of the National Institute of Advanced Studies for his kind co-operation and encouragement, which helped me in the completion of this project. My thanks and appreciation also go to my colleagues in developing the project and to the people who willingly helped me with their abilities.
ABSTRACT
Problem Statement: Developing parallel code for fast interpolation for image rotation, required in the numerics of a novel 3-degree-of-freedom optical sensor, and for equalization of pressure across a grid of cells, which will play a part in a deformable-sets scheme that will model foams, the folding of the brain, and other phenomena. Aim: Developing an algorithm for image rotation on the CPU and then deploying the algorithm to work on GPU parallel threads using CUDA. Implementation details: an NVIDIA graphics card with a 240-core GPU installed on a PC, the SDK toolkit installed to enable CUDA for parallel programming, and an algorithm developed to run on the parallel architecture.
CONTENTS
1. Introduction
   1.1 About Parallel Programming
   1.2 GPU
   1.3 GPU Computing
   1.4 About CUDA
   1.5 About Project
2. Working Details of CUDA
   2.1 The CUDA Architecture
   2.2 Advantages of CUDA
   2.3 CUDA Programming Model
   2.4 Installing the CUDA Development Tools
   2.5 Purpose of NVCC
   2.6 Writing C/C++ Code for CUDA
   2.7 Kernels
   2.8 Thread Hierarchy
   2.9 Image Rotation
   2.10 Bilinear Interpolation
   2.11 Implementation
   2.12 Results
   2.13 Limitations
   2.14 Future usages of CUDA architecture for image processing
3. Conclusion
4. References
1. Introduction
1.2 GPU
A graphics processing unit or GPU (also occasionally called a visual processing unit or VPU) is a specialized processor that offloads 3D or 2D graphics rendering from the microprocessor. It is used in embedded systems, mobile phones, personal computers, workstations, and game consoles. Modern GPUs are very efficient at manipulating computer graphics, and their highly parallel structure makes them more effective than general-purpose CPUs for a range of complex algorithms. In a personal computer, a GPU can be present on a video card, or it can be on the motherboard. More than 90% of new desktop and notebook computers have integrated GPUs, which are usually far less powerful than those on a dedicated video card. GPUs can be of the following types:
1. Dedicated video cards: these have their own dedicated memory.
2. Integrated graphics processors: these share a portion of system RAM.
3. Hybrid: these share RAM while having their own cache memory.
1.4 About CUDA
Compute Unified Device Architecture (CUDA) is a software platform for massively parallel high-performance computing on NVIDIA's powerful GPUs. It is NVIDIA's parallel computing architecture, and it enables dramatic increases in computing performance by harnessing the power of the GPU. Computing here is evolving from "central processing" on the CPU to "co-processing" on the CPU and GPU. To enable this new computing paradigm, NVIDIA invented the CUDA parallel computing architecture. CUDA's programming model differs significantly from single-threaded CPU code: in a single-threaded model, the CPU fetches a single instruction stream that operates serially on the data, while in CUDA multiple instructions are processed simultaneously.
1.5 About Project
The aim was to develop parallel code for fast interpolation for image rotation, which occurs in the numerics of a novel 3-degree-of-freedom optical sensor, and for equalization of pressure across a grid of cells, which will play a part in a deformable-sets scheme that will model foams, the folding of the brain, and other phenomena. The algorithms for image rotation and pressure balance developed for serial execution can easily be implemented in CUDA.
CUDA includes C/C++ software development tools, function libraries, and a hardware abstraction mechanism that hides the GPU hardware from developers. Although CUDA requires programmers to write special code for parallel processing, it doesn't require them to explicitly manage threads in the conventional sense, which greatly simplifies the programming model. CUDA development tools work alongside a conventional C/C++ compiler, so programmers can mix GPU code with general-purpose code for the host CPU.
2.3 CUDA Programming Model
Parallel code (a kernel) is launched and executed on a device by many threads. Threads are grouped into thread blocks. Parallel code is written for a single thread.
Each thread is free to execute a unique code path, using the built-in thread and block ID variables.
Test your installation by compiling and running one of the sample programs in the CUDA software to validate that the hardware and software are running correctly and communicating with each other.
2.7 Kernels
C for CUDA extends C by allowing the programmer to define C functions, called kernels, that, when called, are executed N times in parallel by N different CUDA threads, as opposed to only once like regular C functions. A kernel is defined using the __global__ declaration specifier, and the number of CUDA threads for each call is specified using a new <<<...>>> execution-configuration syntax. Each thread that executes a kernel is given a unique thread ID that is accessible within the kernel through the built-in threadIdx variable.
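As a minimal sketch of this (adapted from the vector-addition example in NVIDIA's CUDA Programming Guide [2]; the array size N and variable names are illustrative):

```cuda
#include <cstdio>

#define N 256

// Kernel: executed by N threads in parallel, each adding one pair of elements.
__global__ void VecAdd(float* A, float* B, float* C)
{
    int i = threadIdx.x;           // unique thread ID within the block
    C[i] = A[i] + B[i];
}

int main()
{
    float hA[N], hB[N], hC[N];
    for (int i = 0; i < N; i++) { hA[i] = i; hB[i] = 2.0f * i; }

    float *dA, *dB, *dC;
    cudaMalloc(&dA, N * sizeof(float));
    cudaMalloc(&dB, N * sizeof(float));
    cudaMalloc(&dC, N * sizeof(float));
    cudaMemcpy(dA, hA, N * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, N * sizeof(float), cudaMemcpyHostToDevice);

    VecAdd<<<1, N>>>(dA, dB, dC);  // launch one block of N threads

    cudaMemcpy(hC, dC, N * sizeof(float), cudaMemcpyDeviceToHost);
    printf("%f\n", hC[10]);        // hA[10] + hB[10]
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return 0;
}
```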
A kernel launch creates a grid of thread blocks. Threads within a block cooperate via shared memory and can synchronize; threads in different blocks cannot cooperate.
Each thread has access to:
1.) threadIdx.x - the thread ID within the block
2.) blockIdx.x - the block ID within the grid
3.) blockDim.x - the number of threads per block
2.9 Image Rotation
Image rotation is a common digital image processing operation. A usual method is a geometric transformation which rotates the image by an angle a about its center. For example [1], the original pixel p[x, y] rotates to p'[x', y'].
If the rotation angle a is a multiple of pi/2 (a = n*pi/2, n an integer), the rotated pixel coordinates are integers too. However, other rotations generally produce non-integer rotated coordinates; in this case interpolation can be used to estimate the rotated pixel value and reduce distortion. A rotation matrix is a matrix that is used to perform a rotation in Euclidean space. For example, the matrix

R(a) = [ cos a   -sin a ]
       [ sin a    cos a ]

rotates points in the plane counter-clockwise through an angle a about the origin.
2.10 Bilinear Interpolation
Bilinear Interpolation is a resampling method that uses the distance-weighted average of the four nearest pixel values to estimate a new pixel value. The key idea is to perform linear interpolation first in one direction, and then again in the other direction. Although each step is linear in the sampled values and in the position, the interpolation as a whole is not linear but rather quadratic in the sample location (details below).
Figure 2: The four red dots show the data points and the green dot is the point at which we want to interpolate.
Suppose that we want to find the value of the unknown function f at the point P = (x, y). It is assumed that we know the value of f at the four points Q11 = (x1, y1), Q12 = (x1, y2), Q21 = (x2, y1), and Q22 = (x2, y2). We first interpolate in the x-direction:

f(R1) ≈ ((x2 - x)/(x2 - x1)) f(Q11) + ((x - x1)/(x2 - x1)) f(Q21), where R1 = (x, y1),
f(R2) ≈ ((x2 - x)/(x2 - x1)) f(Q12) + ((x - x1)/(x2 - x1)) f(Q22), where R2 = (x, y2).

We then interpolate in the y-direction:

f(P) ≈ ((y2 - y)/(y2 - y1)) f(R1) + ((y - y1)/(y2 - y1)) f(R2).
2.11 Implementation
To rotate an image by an angle Q: V(X, Y) denotes the initial pixel value of the image at co-ordinates (X, Y). The co-ordinates after rotation, where the value of that pixel must be stored, are:

X' = X*cos(Q) - Y*sin(Q);
Y' = X*sin(Q) + Y*cos(Q);

This is implemented in CUDA as follows. The index of a thread and its thread ID relate to each other in a straightforward way for a two-dimensional block of size (Dx, Dy):

// Kernel definition: each thread scatters its source pixel to the
// rotated position, rounded to the nearest integer co-ordinates.
__global__ void Rotate(float Source[N][N], float Destination[N][N])
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i < N && j < N) {
        int x = (int)roundf(i * cosf(Q) - j * sinf(Q));
        int y = (int)roundf(i * sinf(Q) + j * cosf(Q));
        if (x >= 0 && x < N && y >= 0 && y < N)
            Destination[x][y] = Source[i][j];
    }
}

Note that the rotated co-ordinates are generally not integers, so they must be rounded (or, better, interpolated) before being used as array indices, and bounds-checked, since a rotated co-ordinate can fall outside the image.
2.12 Results
Given a square input matrix of integers representing a grayscale image, the program generates a corresponding output matrix of the same dimension which contains the image rotated by a given angle theta. How? Take every co-ordinate (x, y) in the destination matrix and rotate it (about the exact center of the matrix) by -theta, getting (nx, ny), which will almost always be non-integer. Look up the four values in the source matrix that are nearest to the position (nx, ny). Interpolate those values to get the image intensity at the point (nx, ny) and place that value (after rounding) in the destination matrix at (x, y). Implementing this algorithm in CUDA gives very good results; it is slower than serial code when the image is small, but otherwise produces optimal results.
2.13 Limitations
Copying between host and device memory may incur a performance hit due to system bus bandwidth and latency (this can be partly alleviated with asynchronous memory transfers, handled by the GPU's DMA engine)
Therefore, with small images as input, the results are not optimal compared to serial execution of the rotation. Large images, however, perform better with CUDA, as the work is done in parallel. The maximum number of threads per block is 512, so concurrency within a block is limited.
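The asynchronous-transfer mitigation mentioned above can be sketched as follows (a hedged fragment, assuming page-locked host memory allocated with cudaMallocHost; buffer names and sizes are illustrative):

```cuda
// Overlap a host-to-device copy with other CPU work using a stream.
size_t size = 1024 * sizeof(float);
cudaStream_t stream;
cudaStreamCreate(&stream);

float *hSrc, *dSrc;
cudaMallocHost(&hSrc, size);   // page-locked host memory, required for
cudaMalloc(&dSrc, size);       // truly asynchronous copies

cudaMemcpyAsync(dSrc, hSrc, size, cudaMemcpyHostToDevice, stream);
// A kernel launched in the same stream runs after the copy completes,
// while the CPU remains free to do other work in the meantime, e.g.:
// Rotate<<<grid, block, 0, stream>>>(/* kernel arguments */);
cudaStreamSynchronize(stream); // wait for the copy (and kernel) to finish

cudaFreeHost(hSrc);
cudaFree(dSrc);
cudaStreamDestroy(stream);
```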
2.14 Future usages of CUDA architecture for image processing
1. Accelerated rendering of 3D graphics
2. Accelerated interconversion of video file formats
3. Accelerated encryption, decryption and compression
4. Medical analysis simulations, for example virtual reality based on CT and MRI scan images
5. Physical simulations, in particular in fluid dynamics
3. Conclusion
In this implementation I have demonstrated three features of the algorithm that help it achieve such high efficiency: straightforward parallelism with sequential memory-access patterns; data reuse that keeps the arithmetic units busy; and fully pipelined arithmetic, including complex operations such as rotation of co-ordinates, which is much faster clock-for-clock on a GeForce 8800 GTX GPU than on a CPU. The result is an algorithm that runs more than 50 times as fast as a highly tuned serial implementation, or 250 times faster than our portable C implementation. At this performance level, 3D simulations of large numbers of pixels can be run interactively, efficiently and effectively.
4. References
[1] NVIDIA CUDA Best Practices Guide, Version 2.3 (NVIDIA_CUDA_BestPracticesGuide_2.3.pdf)
[2] NVIDIA CUDA Programming Guide, Version 2.3 (NVIDIA_CUDA_Programming_Guide_2.3.pdf)
[3] CUDA Getting Started Guide for Windows, Version 2.2 (CUDA_Getting_Started_2.2_Windows.pdf)
[4] CUDA Reference Manual, Version 2.3 (CUDA_Reference_Manual_2.3.pdf)
[5] The CUDA Compiler Driver NVCC, Version 2.0 (nvcc_2.0.pdf)
[6] NVIDIA Corporation. 2007. NVIDIA CUDA Compute Unified Device Architecture Programming Guide, Version 0.8.1
[7] http://nbodylab.interconnect.com/docs/P3.1.6_revised.pdf