Numerical Methods Implementation On CUDA
by
Ankur Sharma (2007UCP132) Nihar Amin (2007UCP161) Praveen Khokher (2007UCP157) Shehjad Khan (2007UCP113)
May 2011
Contents
Acknowledgements
Certificate

1 Overview Of CUDA Programming Model
  1.1 Introduction
  1.2 Thread Level Hierarchy
  1.3 Memory Level Hierarchy

2 Implementation Of Matrix Multiplication Algorithm On CUDA
  2.1 Introduction
  2.2 Matrix multiplication proves to be advantageous in the implementation of the following logics
  2.3 Sequential matrix multiplication
  2.4 Parallel matrix multiplication on CUDA
      2.4.1 Implementation
  2.5 Kernel Specifications
  2.6 Salient Features
  2.7 Limitations
  2.8 Observations
  2.9 Conclusions

3 Implementation Of Prefix Sum Algorithm On CUDA
  3.1 Introduction
  3.2 Sequential Prefix-Sum Algorithm
  3.3 Parallel Prefix-Sum On CUDA
      3.3.1 Implementation
  3.4 Kernel Specifications
  3.5 Salient Features
  3.6 Limitations
  3.7 Observations
  3.8 Conclusions

4 Implementation Of Bitonic Sort Algorithm On CUDA
  4.1 Introduction
  4.2 Parallel Bitonic-Sort On CUDA
  4.3 Salient Features
  4.4 Limitations
  4.5 Observations
  4.6 Conclusions

5 Implementation Of Odd-Even Transposition Sort On CUDA
  5.1 Introduction
  5.2 The odd-even transposition sort is advantageous as it can
  5.3 Sequential Odd-Even Transposition Sort
  5.4 Parallel Odd-Even Transposition Sort
      5.4.1 Implementation
  5.5 Kernel Specification
  5.6 Salient Features
  5.7 Limitations
  5.8 Observations
  5.9 Conclusions

6 Implementation Of Parallel Quicksort By Regular Sampling Algorithm On CUDA
  6.1 Introduction
  6.2 Sequential Quicksort
  6.3 Parallel Quicksort Using Regular Sampling
      6.3.1 Implementation
  6.4 Kernel Specifications
  6.5 Salient Features
  6.6 Limitations
  6.7 Observations
  6.8 Conclusions

7 Implementation Of Matrix Transpose Algorithm On CUDA
  7.1 Introduction
  7.2 Matrix transpose proves to be advantageous in the implementation of the following logics
  7.3 Sequential matrix transpose
  7.4 Parallel matrix transpose
      7.4.1 Implementation
  7.5 Kernel Specifications
  7.6 Salient Features
  7.7 Limitations
  7.8 Observations
  7.9 Conclusions

8 Implementation Of Parallel Sum Algorithm On CUDA
  8.1 Introduction
  8.2 Parallel sum proves to be advantageous in the implementation of the following logics
  8.3 Sequential Parallel-Sum Algorithm
  8.4 Parallel Sum
      8.4.1 Implementation
  8.5 Kernel Specification
  8.6 Salient Features
  8.7 Limitations
  8.8 Observations
  8.9 Conclusions

9 Calculation Of Variance and Standard Deviations on CUDA
  9.1 Introduction
  9.2 Finding variance and deviation proves to be advantageous when
  9.3 Sequential Calculation of Variance and SD
  9.4 Parallel Calculation of Variance and SD
      9.4.1 Implementation
  9.5 Kernel Specification
  9.6 Limitations
  9.7 Observations
  9.8 Conclusions

10 Data of Algorithms
List of Figures
1.1 Thread Level Hierarchy
1.2 Memory Level Hierarchy
2.1 Thread Level Hierarchy
2.2 Execution time vs input size
2.3 SpeedUp vs input size
2.4 SpeedUp vs input size
3.1 Prefix-sum algorithm
3.2 Prefix-sum algorithm
3.3 Prefix-sum algorithm
4.1 Sample Bitonic Sorting
4.2 Kernel Used in Bitonic Sorting
4.3 Execution time vs input size
4.4 Slope of speedUp vs input size
4.5 SpeedUp vs input size
5.1 Execution time vs input size
5.2 SpeedUp vs input size
6.1 Sequential Quicksort algorithm
6.2 Execution time vs input size
6.3 SpeedUp vs input size
6.4 SpeedUp vs input size
7.1 Execution time vs input size
7.2 SpeedUp vs input size
7.3 SpeedUp vs input size
8.1 Execution time vs input size
8.2 SpeedUp vs input size
8.3 SpeedUp vs input size
9.1 Execution time vs input size
9.2 SpeedUp vs input size
9.3 SpeedUp vs input size
List of Tables
10.1 Matrix Multiplication (time in 10^-6 s)
10.2 Bitonic Sort Algorithm (time in 10^-6 s)
10.3 Prefix Sum (time in 10^-6 s)
10.4 Odd-Even Transposition Sort (time in 10^-6 s)
10.5 Quicksort (time in 10^-6 s)
10.6 Matrix-transpose (time in 10^-6 s)
10.7 Summation Algorithm (time in 10^-6 s)
10.8 Variance and SD (time in 10^-6 s)
Acknowledgements
We wish to express our gratitude to all the people involved in the successful completion of our Final Year Major Project, especially to our project mentor Dr. Vijay Laxmi for her guidance and critical reviews. Our sincere thanks to Dr. M.S. Gaur, who was very generous in devoting his precious time, sharing his knowledge with us, and helping us out in every possible manner. We are also thankful to all of our team members, working with whom was a great experience. And finally, our deep gratitude to our family members for their unflinching emotional support during the whole period.
Ankur Sharma Nihar Amin Praveen Khokher Shehjad Khan May 2011
Certificate
This is to certify that the work contained in this report entitled Numerical Methods Implementation On CUDA by Ankur Sharma (2007UCP132), Nihar Amin (2007UCP161), Praveen Khokher (2007UCP157) and Shehjad Khan (2007UCP113) has been carried out under my supervision and this work has not been submitted elsewhere for a degree.
May, 2011
Dr. Vijay Laxmi Department of Computer Engineering, Malaviya National Institute of Technology, Jaipur.
ABSTRACT
Parallel computing is the process of dividing large problems into smaller ones and executing them concurrently. This implies that many computations are carried out simultaneously. The main objective of devising parallel algorithms is to check whether they give faster responses than their sequential versions. The basic aim of this project is the implementation of numerical methods for heavy calculations on the CUDA architecture and their comparison with the time taken for the same calculations performed sequentially on the CPU. First, the CUDA architecture and the way computations are mapped onto threads and blocks is understood. Algorithms that can be implemented in parallel are recognized, their sequential CPU codes are written, and then their parallel implementation on the CUDA architecture is done. Sets of data are used to study the time taken by both implementations and inferences are made. These are primarily based on the complexities of the sequential algorithms and their method of implementation on CUDA. Some parallel algorithms give sufficient speed-up and some are slower than the sequential versions. The reasons and conclusions are inferred, and the optimizations that can be done are mentioned.
1 Overview Of CUDA Programming Model

1.1 Introduction

Compute Unified Device Architecture (CUDA) is an application programming interface to graphics processors. It is basically a parallel computing architecture developed by Nvidia. The architecture emphasizes running many threads slowly in parallel rather than running a particular thread very fast. CUDA-specific computations are performed on the GPU (graphics processing unit). The architecture favours applications which are compute intensive rather than memory intensive, and it is a scalable programming model. Programmers generally use C for CUDA to write the code that executes on the GPU. The levels of abstraction in CUDA which are visible to the programmer are:

1. Thread level hierarchy
2. Memory level hierarchy
3. Barrier synchronization

The basic advantage of using CUDA is to run the parallel fraction of a large code efficiently and quickly. It basically follows the approach of dividing a large set of input data into blocks and executing the different blocks in parallel. The main features to look out for in parallel processing of blocks are efficient communication of data between different blocks and between the threads of the same block, and synchronization.
CUDA executes the sequential part of the code on the CPU, while the parallel portion is executed on the GPU. The GPU code is compiled by the Open64 compiler, which produces parallel thread execution (PTX) files to run on the GPU. Qualifiers are used to distinguish between the variables and functions of the CPU code and the GPU code. CUDA operates on a single instruction multiple data (SIMD) architecture, but a thread can diverge from this on the basis of conditional operators, blockId and threadId.
1.2 Thread Level Hierarchy
The thread level abstraction in CUDA can be viewed as a grid of blocks containing threads. Each thread possesses a unique ID associated with it. A block can contain up to a maximum of 512 threads on the Quadro FX 1700 GPU architecture. A thread can have its unique ID in the x, y and z dimensions, i.e. threadIdx.x, threadIdx.y and threadIdx.z. Similarly, a collection of blocks is called a grid, and a grid can contain blocks in all three dimensions. The threads within a block can communicate with each other using the shared memory visible per block and can synchronize their execution using the built-in __syncthreads() function. Synchronization between different blocks launched by the kernel cannot be done using the __syncthreads() function. Different blocks communicate with each other using the device memory or the global memory. When a kernel is launched, a grid of thread blocks gets created on the device, with each thread block containing many threads. Both fine-grained and coarse-grained data parallelism can be implemented in CUDA: the threads provide fine-grained parallelism, while the blocks provide coarse-grained parallelism.
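A minimal sketch of how this indexing is used in practice is shown below; the kernel name, data and launch configuration are illustrative assumptions, not taken from the project:

    #include <cuda_runtime.h>

    // Each thread derives a unique global index from its block and thread IDs
    // and processes one element of the array.
    __global__ void scale(float *data, float factor, int n)
    {
        int id = blockIdx.x * blockDim.x + threadIdx.x;   // unique global thread ID
        if (id < n)
            data[id] *= factor;
    }

    // Host-side launch: N elements split into 1-D blocks of up to 512 threads.
    // scale<<<(N + 511) / 512, 512>>>(d_data, 2.0f, N);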
1.3 Memory Level Hierarchy
Figure 1.2: Memory Level Hierarchy

There are four different types of memory shown above: registers, shared, global and constant (not including the texture memory). The global memory can be accessed by every thread, by different blocks and by the CPU. The registers are specific to each thread and are the fastest type of memory. The shared memory is visible to a particular block, and thus the threads of a block can access the shared memory. Constant memory is faster than global memory but slower than registers and shared memory; however, it can only be written to in host code. Device code can read constant memory but cannot write to it. The sizes of global and constant memory can scale to GBs, but the size of shared memory is very limited (usually up to 16 KB). The memory allocation and deallocation of the global memory is done by the host. Functions like cudaMemcpy() and cudaMalloc() are used for the allocation and movement of data to or from the device. Identifiers like cudaMemcpyDeviceToHost are used to guide the direction of data transfer. The memory transfer functions
can be synchronous as well as asynchronous. Synchronous means the CPU can resume its execution only after the entire data has been transferred to the GPU.
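As a hedged illustration of these calls (the function and variable names below are assumptions for the example, not the project's code), a typical host-side sequence looks like this:

    #include <cuda_runtime.h>

    void process_on_gpu(const float *h_in, float *h_out, int n)
    {
        float *d_buf;
        size_t bytes = n * sizeof(float);

        cudaMalloc((void **)&d_buf, bytes);                      // allocate global memory on the device
        cudaMemcpy(d_buf, h_in, bytes, cudaMemcpyHostToDevice);  // host -> device transfer
        /* ... launch kernel(s) that operate on d_buf ... */
        cudaMemcpy(h_out, d_buf, bytes, cudaMemcpyDeviceToHost); // device -> host transfer
        cudaFree(d_buf);                                         // deallocate device memory
    }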
2 Implementation Of Matrix Multiplication Algorithm On CUDA

2.1 Introduction

Matrix multiplication has inherent parallelism in it, and thus by using a parallel architecture we can compute the work in less time, i.e. achieve speed-up. We multiply two matrices of sizes M x N and N x O and get a resulting matrix of dimension M x O. It is a necessary condition that the number of columns of the 1st matrix is equal to the number of rows of the 2nd matrix, otherwise multiplication is not possible.
Figure 2.1: Thread Level Hierarchy

INPUT: Two matrices, say A and B, with dimensions M x N and N x O.
OUTPUT: Final matrix with dimension M x O.
2.2 Matrix multiplication proves to be advantageous in the implementation of the following logics:
1. Graph theory
2. Probability theory and statistics
3. Symmetries and transformations of physics
4. MATLAB
2.3 Sequential matrix multiplication:
Suppose we have to multiply two matrices A and B and store the final result in matrix C. Each element of C is found by accumulating sum = sum + mat1[i][k]*mat2[k][j] over k and then setting mat3[i][j] = sum. Here r1 is the number of rows of the first matrix, c1 the number of columns of the first matrix and c2 the number of columns of the second matrix:

    for (i = 0; i < r1; i = i + 1) {
        for (j = 0; j < c2; j = j + 1) {
            sum = 0;
            for (k = 0; k < c1; k++)
                sum = sum + mat1[i][k] * mat2[k][j];
            mat3[i][j] = sum;
        }
    }
2.4 Parallel matrix multiplication on CUDA:
As matrix multiplication has many independent computations, we can expect to get some speed-up using a parallel architecture like CUDA.
2.4.1 Implementation:
We launch the same number of threads as the number of elements in the resultant matrix. Each thread simultaneously calculates the corresponding index of the resultant matrix. Our blocks are of 2-D nature and have dimension N x O (here we have taken input values such that N and O are equal). Both dimensions of the 2-D grid are equal to sqrt(total number of blocks launched). Indexing of each element is done using threadIdx.x, threadIdx.y, blockIdx.x and blockIdx.y:

    dim3 threads(My_block, My_block);
    float grid_D = sqrt(My_block);
    dim3 grid(grid_D, grid_D);
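A minimal global-memory kernel consistent with this description is sketched below (one thread per element of the M x O result); the body is an assumption for illustration and not necessarily the project's exact kernel:

    __global__ void matrixMul_globalmemory(const float *A, const float *B,
                                           float *C, int M, int N, int O)
    {
        int row = blockIdx.y * blockDim.y + threadIdx.y;  // row of the result element
        int col = blockIdx.x * blockDim.x + threadIdx.x;  // column of the result element
        if (row < M && col < O) {
            float sum = 0.0f;
            for (int k = 0; k < N; ++k)
                sum += A[row * N + k] * B[k * O + col];   // dot product of row and column
            C[row * O + col] = sum;
        }
    }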
2.5 Kernel Specifications:

__global__ void matrixMul_globalmemory - 9 registers, 28+16 bytes of smem, 4 bytes of cmem[1].
2.6 Salient Features:
1. We have implemented it on global memory, as our threads are independent of each other and we face no synchronisation problem.
2. The motivation for using global memory was to run our code for matrices with large dimensions.
3. The code is generalised to run on a very large number of values.
4. Both the times t1 (without considering memory copy overhead) and t2 (considering memory transfer overhead) are calculated.
2.7 Limitations:
1. For large arrays (>512 values), the input size was limited to multiples of 512.
2. GPU occupancy of 67% was achieved, as calculated by the GPU Occupancy Calculator.
2.8 Observations:
1. Immediate speedUp for N > 32, due to the n^3 complexity of the sequential algorithm.
2. Sequential time is almost linearly proportional to the size of the resultant matrix.
3. The initial speedUp grows with N.
4. With the increase in the size of the input, the time taken by the sequential code increases almost linearly, whereas the time taken by the kernel to execute remains almost constant, but the overall performance of the parallel code is degraded by the time accounted for by the memory copy overhead between host and device.
2.9 Conclusions:
1. As the sequential algorithm is of order n^3, for large dimensions we got a decent speed-up.
2. The parallel approach is very favourable when the sequential complexity is higher.
3. Even better speedUps can be achieved with memory optimization techniques.
3 Implementation Of Prefix Sum Algorithm On CUDA

3.1 Introduction

Prefix sum, also known as the partial sums of a series, is in programming terms the fold of the addition operation. The prefix sum is considered to be one of the simplest and most useful building blocks of parallel algorithms. The prefix sum can be calculated for very large sets of input data and is generally described as below:

For a set of N values {a1, a2, a3, a4, ..., an}, the prefix sum is calculated as {a1, (a1+a2), (a1+a2+a3), ..., (a1+...+an)}.

For example: a[8] = {1,3,4,2,6,3,7,1}; prefix-sum = {1,4,8,10,16,19,26,27}.

Prefix sum proves to be advantageous in the implementation of the following logics:

1. In the implementation of radix sort and quicksort.
2. Performing lexical analysis and searching for regular expressions.
3. In evaluating polynomials, solving recurrences and adding multiprecision numbers.
4. It can be very helpful in performing string matching algorithms.
3.2 Sequential Prefix-Sum Algorithm:
The sequential prefix-sum algorithm is a very simple method to calculate the prefix sum of a given input array of numbers, just by looping through the size of the array and adding the current value to that of the previous index. The logic is demonstrated below:

    for (i = 1; i < size; i = i + 1)
        a[i] = a[i] + a[i-1];
This code performs exactly N-1 adds for an array of size N and is thus a very simple implementation.
3.3 Parallel Prefix-Sum On CUDA:
The prefix-sum algorithm can be performed very efficiently using the parallel architecture. We just need to divide the input array into blocks of proper dimension and launch the kernel.
3.3.1 Implementation
For an input array of size N (which can be very large), a single-dimension grid is created with (N/512) blocks. If the input size is N < 512, then a grid with one block containing N threads is launched by the kernel function.
Each of the blocks is provided with a shared array of size 512 and its set of shared variables. All the values of the input array, which are stored in global memory, are mapped to a specific thread ID that depends on the block: ID = blockIdx.x * blockDim.x + threadIdx.x. Thus, the respective elements are copied from the global memory to the shared memory of each block. The prefix sums of the values in each block are generated in parallel and stored in a global array according to the respective block index.
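A sketch of the block-level part of this scheme is given below: a Hillis-Steele style scan in shared memory, assuming a block size of 512. The propagation of the per-block sums to later blocks through a global array, described above, is omitted, and the kernel name is an illustrative assumption:

    #define BLOCK 512

    __global__ void prefix_sum_block(const int *in, int *out, int n)
    {
        __shared__ int s[BLOCK];
        int id = blockIdx.x * blockDim.x + threadIdx.x;

        s[threadIdx.x] = (id < n) ? in[id] : 0;           // copy global -> shared
        __syncthreads();

        for (int offset = 1; offset < blockDim.x; offset *= 2) {
            int v = (threadIdx.x >= offset) ? s[threadIdx.x - offset] : 0;
            __syncthreads();
            s[threadIdx.x] += v;                          // add the value 'offset' positions back
            __syncthreads();
        }

        if (id < n)
            out[id] = s[threadIdx.x];                     // block-local prefix sums
    }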
3.4 Kernel Specifications:

1. __global__ Sum_prefix() - 6 registers, 4120+16 bytes of smem, 4 bytes of cmem[1].
3.5 Salient Features:
1. The use of shared memory to perform consecutive reads, which reduces the time that would have been spent performing the same reads and writes using global memory.
2. Performing proper synchronization between threads operating in parallel inside a block.
3. It was difficult to perform synchronization between different blocks, so the sums of previous blocks were propagated to the consecutive blocks using a global array.
4. The code is generalised to run on a very large number of values.
5. Both the times t1 (without considering memory copy overhead) and t2 (considering memory transfer overhead) are calculated.
3.6 Limitations:
1. For large arrays (>512 values), the input size was limited to multiples of 512.
2. GPU occupancy of 67% was achieved, as calculated by the GPU Occupancy Calculator.
3.7 Observations:
1. For very small input sizes, the sequential prefix sum appears to be much faster than the parallel code.
2. With the increase in the size of the input, the time taken by the sequential code increases almost linearly, whereas the time taken by the kernel to execute remains almost constant.
3. Very large speedups with respect to kernel execution times are achieved, which demonstrates the efficiency of running the parallel code on CUDA, but the memory overhead for large values limits the overall speed-up.
3.8 Conclusions:
1. Using efficient memory optimization techniques, the memory transfer overhead between the host and the device can be reduced.
2. With much better kernel optimization the speedUp can be increased.
4 Implementation Of Bitonic Sort Algorithm On CUDA

4.1 Introduction

Bitonic sort is a fast method to sort a large number of values. It basically contains two types of operations, shown by a down arrow (also by the (+) operation, just a symbolic representation) and an up arrow (also by the (-) operation). In a (+) operation both values are compared, and after comparison the larger value should be at the higher index (for this purpose swapping might be required). In a (-) operation both values are compared and the larger value should be at the lower index (again, swapping may or may not be required).

INPUT: Array of N elements, say A.
OUTPUT: Sorted array of A, say sort(A), such that a(i) <= a(j) for all i, j in 0 to n-1 with i < j.
Bitonic sort proves to be advantageous in the implementation of the following logics:

1. In any application which requires sorted input, for example the binary search algorithm.
2. In forming directories and managing large data.
4.2 Parallel Bitonic-Sort On CUDA:
The parallel bitonic sort can be performed very efficiently using the parallel CUDA architecture. For N elements, we can divide the problem into log2(N) stages, and each stage can further be divided into a number of sub-stages. For stage i the number of sub-stages in it is equal to i; i.e. if we have 8 elements then the total number of stages is 3: the 1st stage has 1 sub-stage, the 2nd stage has 2 sub-stages and the 3rd stage has 3 sub-stages. Each sub-stage has to do N/2 independent computations, so we can launch N/2 threads for these N/2 computations. But the sub-stages are not independent from each other, and thus we have to ensure proper synchronization between threads, otherwise we will get incorrect results. As in our CUDA architecture we can have at most 512 threads in a block, for inputs larger than 512 values we would have to launch multiple blocks. Since that requires inter-block synchronization, which we tried but could not implement, we have computed results only up to 512 values. We also have to find whether a thread has to perform a (+) or a (-) operation. For this purpose we have used a flag variable in our kernel:

    flag = (int)(id / power(i)) % 2;

If flag has the value 0 then we have to perform the (+) operation, otherwise the (-) operation. Threads in the block are of 1-D nature and are indexed using threadIdx.x:

    id = threadIdx.x;
For synchronisation of threads of the same block we have used the standard library function __syncthreads().
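The sketch below shows a standard single-block bitonic sort in this style (one thread per element, at most 512 values, with the direction of each compare-exchange derived from the element index and the current stage). It is a common formulation given as an illustration, not necessarily the project's exact kernel:

    __global__ void bitonic_sort(int *a, int n)   // n must be a power of two, n <= 512
    {
        extern __shared__ int s[];
        int tid = threadIdx.x;
        s[tid] = a[tid];
        __syncthreads();

        for (int k = 2; k <= n; k <<= 1) {            // stage
            for (int j = k >> 1; j > 0; j >>= 1) {    // sub-stage
                int partner = tid ^ j;
                if (partner > tid) {
                    int ascending = ((tid & k) == 0);          // (+) or (-) operation
                    if ((s[tid] > s[partner]) == ascending) {  // pair out of order: swap
                        int t = s[tid]; s[tid] = s[partner]; s[partner] = t;
                    }
                }
                __syncthreads();   // finish one sub-stage before starting the next
            }
        }
        a[tid] = s[tid];
    }

    // launch: bitonic_sort<<<1, n, n * sizeof(int)>>>(d_a, n);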
4.3 Salient Features:
1. Different sub-stages at the same stage level are not independent.
2. In the last stage we only have to perform (+) operations.
4.4 Limitations:
1. We have assumed that the number of input values must be a power of 2, e.g. 4, 8, 16, 32, 64, 128, 256, 512.
2. As we have only used a 1-D block, we can take at most 512 values for the sorting.
3. GPU occupancy of 67% was achieved, as calculated by the GPU Occupancy Calculator.
4.5 Observations:
1. SpeedUp is gained for N > 256.
2. For the sequential version there is a nearly linear increase in time with increasing N.
3. There is a very sharp increase in speedUp after N = 256.
4.6 Conclusions:
1. The speedup decreases significantly due to memory overhead.
2. Much higher speedUps can be achieved with multiple blocks.
5 Implementation Of Odd-Even Transposition Sort On CUDA

5.1 Introduction

The odd-even transposition sort network for n input data consists of n comparing stages. In each stage, either all inputs at odd index positions or all inputs at even index positions are compared with their next element. Odd stages are followed by even stages, and only after the completion of an odd stage can an even stage start, and vice versa. It is similar to bubble sort except for the fact that odd-even transposition sort compares disjoint pairs by using alternating odd and even index values during the different phases of the sort.
5.2 The odd-even transposition sort is advantageous as it can:
1. Be used for sorting on 2-D processor arrays.
2. Be implemented in parallel, achieving speed-ups of more than 2.0 even on a marginally small number of elements.
5.3 Sequential Odd-Even Transposition Sort:
The algorithm is simple to implement and is analogous to bubble sort. In the first phase of the odd-even exchange, control jumps to all the even indices and compares each with its neighbouring element. In the second phase control jumps to the odd indices and compares their neighbouring elements. These pairs of phases continue till the array is sorted; thus, there are exactly half as many pairs of phases as there are elements in the array to be sorted. The looping logic is as follows:

    for (i = 0; i < n/2; i = i + 1) {
        for (j = 0; j + 1 < n; j = j + 2)
            if (A[j] > A[j+1]) {
                int T = A[j]; A[j] = A[j+1]; A[j+1] = T;
            }
        for (j = 1; j + 1 < n; j = j + 2)
            if (A[j] > A[j+1]) {
                int T = A[j]; A[j] = A[j+1]; A[j+1] = T;
            }
    }
5.4 Parallel Odd-Even Transposition Sort:
The odd-even transposition sort on the CUDA architecture is implemented on a single block with a maximum size of 512 elements. Each thread processes one element, and hence even threads process even-indexed elements and odd threads process odd-indexed elements.
5.4.1 Implementation
For an input size of N, a block with N threads is created and each thread processes one element. The kernel creates a shared memory portion for the block and copies the array into it. All the values of the input array, which are stored in global memory, are mapped to a specific thread ID that depends on the block: ID = blockIdx.x * blockDim.x + threadIdx.x. Thus, the respective elements are copied from the global memory to the shared memory of the block. The kernel then sorts the array in combinations of odd-even phases, and the result is copied back to the host memory. The kernel function can be examined as follows.
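A sketch of this single-block kernel is given below (shared memory, one thread per element, alternating even and odd phases separated by __syncthreads()); names are illustrative, not the project's exact code:

    __global__ void odd_even_sort(int *a, int n)   // n <= 512, one thread per element
    {
        extern __shared__ int s[];
        int tid = threadIdx.x;
        s[tid] = a[tid];
        __syncthreads();

        for (int phase = 0; phase < n; ++phase) {
            // even phase compares pairs (0,1),(2,3),...; odd phase compares (1,2),(3,4),...
            int i = 2 * tid + (phase & 1);
            if (i + 1 < n && s[i] > s[i + 1]) {
                int t = s[i]; s[i] = s[i + 1]; s[i + 1] = t;
            }
            __syncthreads();   // a phase must finish completely before the next one starts
        }
        a[tid] = s[tid];
    }

    // launch: odd_even_sort<<<1, n, n * sizeof(int)>>>(d_a, n);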
5.5 Kernel Specification:

__global__ Sort() - 8 registers, 2068+16 bytes of smem, 4 bytes of cmem[1].
5.6 Salient Features:
1. The use of shared memory to perform consecutive reads, which reduces the time that would have been spent performing the same reads and writes using global memory.
2. Performing proper synchronization between threads operating in parallel inside a block.
3. It was difficult to perform synchronization between different blocks, so the sums of previous blocks were propagated to the consecutive blocks using a global array.
4. Both the times t1 (without considering memory copy overhead) and t2 (considering memory transfer overhead) are calculated.
5. Synchronization is done so as to ensure that during parallel execution of threads the even phase always follows the odd phase.
5.7 Limitations:
1. The maximum size of the array can be 512, limited by the maximum number of threads in a block.
2. GPU occupancy of 67% was achieved, as calculated by the GPU Occupancy Calculator.
5.8 Observations:
1. Steep increase in speedUp as N increases.
2. Due to N being limited to 512, memory overhead time is less than calculation time; therefore there is less effect of memory overhead on the performance graph.
5.9 Conclusions:
1. The parallel implementation gains recognizable speedUp.
2. Due to N being limited to 512, memory overhead time is less than calculation time; therefore there is less effect of memory overhead on the performance graph.
6 Implementation Of Parallel Quicksort By Regular Sampling Algorithm On CUDA

6.1 Introduction

Quicksort (also known as partition-exchange sort) is a very well known sorting algorithm developed by C. A. R. Hoare. It is a comparison sort and, in efficient implementations, is not a stable sort. Quicksort tends to make excellent use of the memory hierarchy, taking advantage of virtual memory and available caches. It is very well suited to modern computer architectures, as it uses no temporary memory and thus is an in-place sort.
6.2 Sequential Quicksort:
The sequential implementation of the quicksort algorithm follows a divide-and-conquer approach to sort a large input array of values. The procedure involves:

1. Selecting one of the numbers (any random number may be selected) from the input as the pivot element.
2. Locating the index (position) of the pivot in the input array and then dividing the array into sub-arrays: the lower sub-array contains elements with values smaller than the pivot, and the upper sub-array contains elements with values higher than that of the pivot element.
3. Applying step one recursively on both the lower and upper arrays.
4. Finally, a sorted list of values is obtained (sorted here in ascending order).
ILLUSTRATION OF QUICKSORT
Figure 6.1: Sequential Quicksort algorithm

Quicksort is known to be the fastest pivot-comparison-based sorting algorithm in the average case, and it has some natural concurrency (the lower and upper lists can be sorted concurrently).
6.3 Parallel Quicksort Using Regular Sampling:
Parallel quicksort using regular sampling can be applied to very large sets of data. It basically involves segmenting the unsorted list into blocks; the unsorted list is evenly distributed among the blocks. There are in all four phases involved:

1. Individual sorting of values in each segment, with each block selecting the data items at local indices 0, n/p^2, 2n/p^2, ..., (p-1)n/p^2 as a regular sample of its locally sorted block.
2. All the selected pivots are then sorted again, (p-1) pivots are selected, and these are broadcast to every block.
3. Each block then partitions its sorted sub-array into p disjoint partitions.
4. Each block (i) keeps its i-th partition and sends the j-th partition to block (j), for all j != i; each block then merges its p partitions into a single global array.
6.3.1 Implementation:
1. The input unsorted list is divided into (size/512) blocks, and the unsorted partitions are then copied from the global array to the shared array of each block on the GPU.
2. Sorting of the segmented list stored in the shared array is performed by every block independently of the others (a sketch of this phase is given after this list).
3. Local pivots are selected and copied to a global array, indexed according to the blockId.
4. The list of pivots is then sorted again, p-1 pivots are selected, and these are broadcast to every block.
5. The local sorted arrays are partitioned according to the pivots, and the partitions are then merged into a global array accordingly.
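A hedged sketch of the first phase only: each block copies its 512-element segment into shared memory, sorts it with a single thread (as noted in the limitations, the per-block sort was single-threaded), and writes p-1 regularly spaced samples to a global sample array. The kernel name, the insertion sort and the exact sample spacing are assumptions for illustration:

    #define SEG 512

    __global__ void psrs_phase1(int *data, int *samples, int p)
    {
        __shared__ int s[SEG];
        int base = blockIdx.x * SEG;
        s[threadIdx.x] = data[base + threadIdx.x];     // copy the block's segment to shared memory
        __syncthreads();

        if (threadIdx.x == 0) {
            // simple single-threaded insertion sort of the segment
            for (int i = 1; i < SEG; ++i) {
                int key = s[i], j = i - 1;
                while (j >= 0 && s[j] > key) { s[j + 1] = s[j]; --j; }
                s[j + 1] = key;
            }
            // pick p-1 regularly spaced samples from the sorted segment
            for (int k = 1; k < p; ++k)
                samples[blockIdx.x * (p - 1) + (k - 1)] = s[k * SEG / p];
        }
        __syncthreads();

        data[base + threadIdx.x] = s[threadIdx.x];     // write back the sorted segment
    }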
6.4 Kernel Specifications:

1. kernel1 - 6 registers, 6810+16 bytes smem, 4 bytes cmem
2. kernel2 - 8 registers, 24+16 bytes smem, 4 bytes cmem
3. kernel3 - 7 registers, 2084+16 bytes smem, 8 bytes cmem
6.5 Salient features:
1. The use of shared memory to perform consecutive reads, which reduces the time that would have been spent performing the same reads and writes using global memory.
2. The code is generalised to run on a very large number of values.
3. Better load balance.
4. Repeated communications of the same value are avoided.
5. Use of three kernel functions to increase the extent of parallelization while continuously using shared memory.
6.6 Limitations:
1. The input size is limited to multiples of 512.
2. The sorting of the segmented array performed at block level is implemented using a single thread, thus affecting the overall efficiency and reducing parallelism.
3. There is constant use of global memory for broadcasting the pivots and globally sorting them.
4. GPU occupancy of 67% was achieved, as calculated by the GPU Occupancy Calculator.
6.7 Observations:
1. The sequential code is highly efficient and recursive.
2. The use of three kernels drastically increases the execution time.
6.8 Conclusions:
7 Implementation Of Matrix Transpose Algorithm On CUDA

7.1 Introduction

Matrix transpose is an operation in which we exchange the rows with their corresponding columns, i.e. the values in the 1st row become the values of the 1st column. In this implementation the transpose is computed only for a square matrix, i.e. both dimensions of the matrix are the same. The matrix transpose can be calculated for very large sets of input data and is generally described as below:

INPUT: Matrix A having N x N dimensions.
OUTPUT: Matrix transpose(A) having the same dimensions. The 1st row of A must match the 1st column of transpose(A), and so on.

Example:
        1 2 3
    A = 4 5 6
        7 8 9

                   1 4 7
    transpose(A) = 2 5 8
                   3 6 9
7.2 Matrix transpose proves to be advantageous in the implementation of the following logics:
7.3 Sequential matrix transpose:
The logic for the sequential version is pretty straightforward: the rows and columns are exchanged, so we basically swap the two indices, i.e. transposeA[j][i] = A[i][j]. The program is indexed to follow this logic:

    for (i = 0; i < r1; i = i + 1) {
        for (j = 0; j < c1; j = j + 1) {
            transposeA[j][i] = A[i][j];
        }
    }
Here r1 is the number of rows of matrix A and c1 the number of columns, and both must be equal as it is a square matrix.
7.4 Parallel matrix transpose:
As matrix A and transpose(A) are different arrays, we can launch as many threads as there are elements, and we do not even have to synchronize them.
7.4.1 Implementation:
For an input matrix of size N x N (N can be very large), a 2-D grid is created. If N*N < 512, then a grid with one block containing N*N threads is launched by the kernel function. If N*N > 512, then the number of blocks launched is (N*N)/256, with 2-D blocks each of dimension 16 x 16. Indexing of each element is done using threadIdx.x, threadIdx.y, blockIdx.x and blockIdx.y:

    int row = blockIdx.y * block_D + threadIdx.y;
    int col = blockIdx.x * block_D + threadIdx.x;
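A minimal global-memory kernel matching this indexing is sketched below (one thread per element; the kernel name and launch snippet are illustrative assumptions):

    __global__ void transpose(const float *in, float *out, int n)
    {
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        if (row < n && col < n)
            out[col * n + row] = in[row * n + col];    // element (row, col) goes to (col, row)
    }

    // launch with 16 x 16 blocks, as described above:
    // dim3 block(16, 16);
    // dim3 grid((n + 15) / 16, (n + 15) / 16);
    // transpose<<<grid, block>>>(d_in, d_out, n);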
7.5 Kernel specifications:

__global__ void matrixMul_globalmemory - 9 registers, 28+16 bytes of smem, 4 bytes of cmem[1].
7.6 Salient features:
1. We have implemented it on global memory, as our threads are independent of each other and we face no synchronisation problem.
2. The motivation for using global memory was to run our code for matrices with large dimensions.
3. The code is generalised to run on a very large number of values.
4. Both the times t1 (without considering memory copy overhead) and t2 (considering memory transfer overhead) are calculated.
7.7 Limitations:
1. For large arrays (>512 values), the input size was limited to multiples of 512.
2. GPU occupancy of 67% was achieved, as calculated by the GPU Occupancy Calculator.
7.8 Observations:
1. The overall performance is determined by the ratio of calculation time to memory overhead.
7.9 Conclusions:
1. SpeedUp in calculations (CPU vs GPU) is easily achieved.
2. Better memory optimizations can gain significant speedUp.
8 Implementation Of Parallel Sum Algorithm On CUDA

8.1 Introduction

Parallel sum is the program to find the sum of all the elements present in an array. The parallel sum can be calculated for very large sets of input data and is generally described as below:

INPUT: A set of N values [a1, a2, a3, ..., an-1, an].
OUTPUT: The final sum of the array, say SUM = a1 + a2 + a3 + ... + an-1 + an.

For example: a[8] = {1,3,4,2,6,3,7,1}; SUM = 1+3+4+2+6+3+7+1 = 27.
8.2 Parallel sum proves to be advantageous in the implementation of the following logics:
1. In the implementation of finding the mean of a set of values.
2. In the implementation of finding the variance.
8.3 Sequential Parallel-Sum Algorithm:
The sequential version of the parallel-sum algorithm is a very simple method to calculate the total sum of a given input array of numbers, just by looping through the size of the array and adding the current value to the variable sum. The logic is demonstrated below:

    SUM = 0;
    for (i = 0; i < size; i = i + 1)
        SUM = a[i] + SUM;
This code performs exactly N adds for an array of size N and is thus a very simple implementation.
8.4 Parallel Sum:
The parallel sum can be computed very efficiently using the parallel architecture. We assume the size of the input array to be a power of two, i.e. 2, 4, 8, 16, 32, ..., 1024, ..., 8192 and so on.
8.4.1 Implementation:
For an input array of size N (which can be very large), a single-dimension grid is created with (N/512) blocks. If the input size is N < 512, then a grid with one block containing N threads is launched by the kernel function.
Basically, in the kernel function each thread executes its code by adding two elements and storing the sum at the lower of the two indices. For example, if we have an input array A = {1,2,3,4,5,6,7,8}, then in the first run, for 8 values, we create 4 threads. The first thread, i.e. the thread with threadIdx = 0, adds the values (a[0] = a[0] + a[1] = 1 + 2 = 3) and stores the result at the lower index, i.e. 0; similarly, the second thread (threadIdx = 1) adds the values (a[2] = a[2] + a[3] = 3 + 4 = 7), the third thread (threadIdx = 2) adds the values (a[4] = a[4] + a[5] = 5 + 6 = 11), and the fourth thread (threadIdx = 3) adds the values (a[6] = a[6] + a[7] = 7 + 8 = 15). The number of values has now reduced from 8 to 4, and in the next run we require only 2 threads instead of 4; this is controlled using the thread IDs
of the participating threads. Condition: if ((int)(threadIdx) - power(j) >= 0), where j denotes the run number, i.e. for the 1st run it is equal to 0, for the second run it is equal to 1, and so on. As we observe, the number of values is halved each time; thus to compute the sum of N values we need log2(N) runs. Each of the blocks is provided with a shared array of size 512 and its set of shared variables. All the values of the input array, which are stored in global memory, are mapped to a specific thread ID that depends on the block: ID = blockIdx.x * blockDim.x + threadIdx.x. Proper synchronisation must be ensured between the different runs of threads; we have used the standard CUDA library function __syncthreads().
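A sketch of the per-block tree reduction is given below. It uses the common contiguous-stride formulation (the kernel described above stores each pairwise sum at the lower of the two indices instead, but the idea is the same), and the final summation of the per-block partial sums is omitted. Names are illustrative assumptions:

    #define BLOCK 512

    __global__ void block_sum(const int *in, int *block_sums, int n)
    {
        __shared__ int s[BLOCK];
        int id = blockIdx.x * blockDim.x + threadIdx.x;
        s[threadIdx.x] = (id < n) ? in[id] : 0;        // copy global -> shared
        __syncthreads();

        // halve the number of active threads in every run
        for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
            if (threadIdx.x < stride)
                s[threadIdx.x] += s[threadIdx.x + stride];
            __syncthreads();   // all adds of one run finish before the next run starts
        }
        if (threadIdx.x == 0)
            block_sums[blockIdx.x] = s[0];             // partial sum of this block
    }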
8.5 Kernel Specification:
8.6 Salient Features:
1. The use of shared memory to perform consecutive reads, which reduces the time that would have been spent performing the same reads and writes using global memory.
2. Performing proper synchronization between threads operating in parallel inside a block.
3. The code is generalised to run on a very large number of values.
4. Both the times t1 (without considering memory copy overhead) and t2 (considering memory transfer overhead) are calculated.
8.7 Limitations:
1. We have assumed that the number of input values must be a power of 2.
2. We can run it for large values until the limit on the maximum number of blocks is reached, i.e. we can have at most 65536 blocks, so we can compute the parallel sum of an array having 65536*512 = 33554432 elements.
3. GPU occupancy of 67% was achieved, as calculated by the GPU Occupancy Calculator.
8.8 Observations:
1. For very small input sizes, the sequential sum appears to be much faster than the parallel code.
2. Good speedups with respect to kernel execution times are achieved, which demonstrates the efficiency of running the parallel code on CUDA.
8.9 Conclusions:
(a) The use of shared memory requires extreme synchronization logic.
(b) Bank conflicts are very common due to unrestricted access of shared memory.
9 Calculation Of Variance and Standard Deviations on CUDA

9.1 Introduction

The mean of a data set is simply the arithmetic average of the values in the set, obtained by summing the values and dividing by the number of values. The mean is a measure of the center of the distribution. The variance is used as a measure of how far a set of numbers is spread out from each other; it gives a measure of how near or far the numbers lie from their mean. The variance of a data set is the arithmetic average of the squared differences between the values and the mean. Standard deviation gives a measure of how much variation or dispersion there is from the mean; mathematically it is the square root of the variance. The variance and the standard deviation are both measures of the spread of the distribution about the mean.
9.2 Finding variance and deviation proves to be advantageous when:
(a) The spread of the data around the mean is to be found.
(b) Large data is to be analyzed on the basis of the extent of the spread in the data.
(c) For example, the margin of error in polling data is determined by calculating the standard deviation in the results if the polling is to be done multiple times.
9.3 Sequential Calculation of Variance and SD:
The sum is easily calculated by adding each element of the N-sized array, and the mean is found by dividing this sum by N:

    for (i = 0; i < n; i = i + 1) {
        sum = sum + A[i];
    }
    avrg = sum / n;

The variance is then calculated using the deviation from this mean value, as per the formula stated above. The looping would be:

    for (i = 0; i < n; i++) {
        sum1 += (A[i] - avrg) * (A[i] - avrg);
    }
    var = sum1 / n;
    SD = sqrt(var);

The SD is the standard deviation, which is the square root of the variance.
9.4 Parallel Calculation of Variance and SD:
The process of finding the sum in parallel on CUDA is a complex one due to synchronization problems. The sum is calculated using the kernel described in Chapter 3. The average is obtained by dividing this sum by N, and it is used by the 2nd kernel for the calculation of variance and SD.
9.4.1 Implementation:
For an input array of size N (which can be very large), a single-dimension grid is created with (N/512) blocks. If the input size is N < 512, then a grid with one block containing N threads is launched by the kernel function. Each of the blocks is provided with a shared array of size 512 and its set of shared variables. All the values of the input array, which are stored in global memory, are mapped to a specific thread ID that depends on the block: ID = blockIdx.x * blockDim.x + threadIdx.x. Thus, the respective elements are copied from the global memory to the shared memory of each block. The average calculated by kernel 1 is passed on to kernel 2, and the variance contribution of each block is calculated and stored in an array. Its summation gives the variance of the data, and the square root of the variance gives the SD.
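A hedged sketch of the second kernel is given below: given the mean produced by the summation kernel, each block accumulates the squared deviations of its 512 elements in shared memory and writes a per-block partial sum; these partial sums are then added, divided by N to obtain the variance, and the square root gives the SD. The kernel name and signature are assumptions for illustration:

    #define BLOCK 512

    __global__ void block_sq_dev(const float *in, float *block_out, float mean, int n)
    {
        __shared__ float s[BLOCK];
        int id = blockIdx.x * blockDim.x + threadIdx.x;
        float d = (id < n) ? (in[id] - mean) : 0.0f;   // deviation from the mean
        s[threadIdx.x] = d * d;
        __syncthreads();

        for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
            if (threadIdx.x < stride)
                s[threadIdx.x] += s[threadIdx.x + stride];
            __syncthreads();
        }
        if (threadIdx.x == 0)
            block_out[blockIdx.x] = s[0];              // sum of squared deviations for this block
    }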
9.5 Kernel Specification:

(a) __global__ void sum() - 5 registers, 2076+16 bytes of smem, 8 bytes of cmem[1].
9.6 Limitations:
(a) For large arrays (>512 values), the input size was limited to multiples of 512.
(b) GPU occupancy of 67% was achieved, as calculated by the GPU Occupancy Calculator.
9.7 Observations:
(a) For very small input sizes, the sequential code appears to be much faster than the parallel code.
(b) With the increase in the size of the input, the time taken by the sequential code increases almost linearly, whereas the time taken by the kernel to execute remains almost constant, but the overall performance of the parallel code is degraded by the time accounted for by the memory copy overhead between host and device.
(c) Very large speedups with respect to kernel execution times are achieved, which demonstrates the efficiency of running the parallel code on CUDA, but the memory overhead for large values limits the overall speed-up.
9.8 Conclusions:
(a) Finding the mean, variance and SD sequentially is of O(n); hence there is no speed-up achieved, as the kernel for finding the sum has synchronization problems to be met.
(b) Memory optimization techniques can be used to control synchronization of shared memory, and speed-up may be achieved, but this is not guaranteed.
10 Data of Algorithms

The Nvidia Quadro FX 1700 GPGPU we used has the following specifications:

    CUDA Parallel Processor Cores : 32
    Memory Size                   : 512 MB
    Memory Interface              : 128-bit
    Graphics Memory Bandwidth     : 12.8 GB/sec
The graphics card used for our experiment (Quadro FX 1700) is of compute capability 1.1. This version does not support double-precision floating point. Also, the mathematical functions used are not fully accurate. This leads to a mild loss of accuracy in the final results.
Table 10.1: Matrix Multiplication (time in 10^-6 s)

Input   SeqEx-time   PEx-time1   PEx-time2   Speed-up1   Speed-up2
4       1            43          67          21          2
8       554          924         1009        0.17        0.15
16      2360         2118        2250        0.60        0.55
32      9414         8160        8405        1.11        1.05
64      37784        32041       32486       1.18        1.01
128     131292       133058      133952      0.99        0.98
256     538807       526462      528415      1.02        1.02
512     2378744      2118810     2122760     1.12        1.12
1024    11560038     8538991     8547882     1.35        1.35
2048    52087845     34331100    34357273    1.52        1.52

Table 10.2: Bitonic Sort Algorithm (time in 10^-6 s)

Input   SeqEx-time   PEx-time1   PEx-time2   Speed-up1   Speed-up2
4       1            51          190         0.02        0.01
8       2            61          200         0.03        0.01
16      6            77          226         0.08        0.03
32      13           94          243         0.14        0.05
64      33           120         280         0.28        0.12
128     77           147         297         0.52        0.26
256     179          182         332         0.98        0.54
512     423          251         402         1.69        1.05

Table 10.3: Prefix Sum (time in 10^-6 s)

Input    SeqEx-time   PEx-time1   PEx-time2   Speed-up1   Speed-up2
16       1            76          97          0.01        0.01
32       1            68          93          0.01        0.01
64       2            75          98          0.03        0.02
128      3            81          105         0.04        0.03
256      4            98          123         0.04        0.03
512      7            146         172         0.05        0.04
1024     14           151         179         0.09        0.08
2048     28           151         179         0.19        0.16
4096     54           250         301         0.22        0.18
8192     108          467         553         0.23        0.20
16384    215          934         1075        0.23        0.2
32768    430          1958        2266        0.22        0.19
65536    858          4503        5087        0.19        0.17
262144   2956         33192       35403       0.09        0.08
524288   5922         107130      111562      0.06        0.05
Table 10.4: Odd-Even Transposition Sort (time in 10^-6 s), PEx-time1 column only:

PEx-time1: 67  73  73  83  105  67  67  67
Table 10.5: Quicksort (time in 10^-6 s)

Input    SeqEx-time   PEx-time1   PEx-time2   Speed-up1   Speed-up2
4        1            63          87          0.02        0.01
16       2            177         201         0.01        0.01
32       3            549         574         0.01        0.01
64       7            1023        1030        0.01        0.01
256      32           2000        2018        0.02        0.02
512      68           2608        2646        0.03        0.03
1024     144          4608        4698        0.03        0.03
2048     290          7500        7568        0.04        0.04
8192     1252         17062       17124       0.07        0.07
32768    5392         29865       29936       0.18        0.18
131072   23079        92452       92498       0.25        0.25
Table 10.6: Matrix-transpose (time in 10^-6 s), speed-up columns only:

Speed-up1: 0.02  0.02  0.07  0.17  0.57  0.86  1.11  1.35  1.36  1.85  0.98
Speed-up2: 0.01  0.01  0.04  0.10  0.29  0.43  0.44  0.55  0.72  0.91  0.82
Table 10.7: Summation Algorithm (time in 10^-6 s)

Input      SeqEx-time   PEx-time1   Speed-up1   Speed-up2
16         1            80          0.02        0.01
64         1            79          0.02        0.01
256        2            98          0.03        0.02
512        3            113         0.03        0.03
1024       6            117         0.07        0.05
4096       22           190         0.16        0.12
8192       41           312         0.18        0.13
16384      83           572         0.20        0.15
32768      168          1102        0.22        0.15
262144     1155         8045        0.20        0.14
1048576    4650         30856       0.20        0.15
4194304    18459        118429      0.20        0.16
16777216   73597        470787      0.20        0.16
Table 10.8: Variance and SD (time in 10^-6 s)

Input      SeqEx-time   PEx-time1   PEx-time2   Speed-up1   Speed-up2
512        7            272         294         0.03        0.02
1024       13           273         297         0.05        0.04
2048       25           281         311         0.09        0.08
4096       49           380         418         0.13        0.12
8192       98           582         639         0.17        0.15
16384      166          997         1090        0.17        0.15
32768      384          1767        1976        0.22        0.19
65536      767          3415        3813        0.22        0.20
131072     1323         6666        7427        0.20        0.18
262144     2639         13148       14643       0.20        0.18
1048576    10651        51131       55262       0.21        0.19
4194304    42947        200280      214462      0.21        0.20
16777216   171841       799691      854827      0.21        0.20