Parallel Computing Lab 4
EXPERIMENT NO. 4
Lab Title: Introduction to Parallel Programming with CUDA C: Exploring 2D Operations in CUDA C
LAB ASSESSMENT:
• Ability to Conduct Experiment
• Data Presentation
• Experiment Results
• Conclusion
Objective:
Implement and analyze various 2D array/matrix operations in CUDA C.
Introduction:
The aim of this lab was to explore parallel computing techniques by implementing basic 2D
matrix operations using CUDA C. These operations include matrix addition, matrix
multiplication, matrix transposition, and scalar multiplication. CUDA (Compute Unified
Device Architecture) provides a platform for parallel computing on NVIDIA GPUs, allowing
developers to write code that exploits data-level parallelism for large datasets, such as 2D
matrices.
Experiment Setup:
• Software: CUDA toolkit, NVIDIA CUDA Compiler (NVCC), C/C++ for code
implementation.
• Hardware: A machine with an NVIDIA GPU compatible with CUDA.
Each operation was implemented on a 16x16 matrix, using a block size of 16x16 threads.
This configuration allowed one thread to compute one element of the matrix.
Matrix Addition:
The task is to add two 2D matrices element-wise. Each thread computes the sum for one element of
the resulting matrix.
Key steps:
1. Matrices A and B are initialized on the host.
2. Memory is allocated on the device, and data is transferred from the host to the
device.
3. A CUDA kernel is launched, where each thread adds corresponding elements of
matrices A and B.
4. The result is copied back from the device to the host.
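A minimal, self-contained sketch of this flow is given below. The matrix size N, the kernel name matAdd, and the initialization values are illustrative assumptions, not taken from the original lab code:

#include <cuda_runtime.h>
#include <stdio.h>

#define N 16  // matrix dimension (assumed, matching the 16x16 setup above)

// Each thread adds one pair of corresponding elements.
__global__ void matAdd(const float *A, const float *B, float *C) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < N && col < N)
        C[row * N + col] = A[row * N + col] + B[row * N + col];
}

int main(void) {
    size_t bytes = N * N * sizeof(float);
    float hA[N * N], hB[N * N], hC[N * N];
    for (int i = 0; i < N * N; i++) { hA[i] = 1.0f; hB[i] = 2.0f; }  // sample data

    float *dA, *dB, *dC;
    cudaMalloc((void **)&dA, bytes);
    cudaMalloc((void **)&dB, bytes);
    cudaMalloc((void **)&dC, bytes);
    cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);

    dim3 block(16, 16);  // one 16x16 block covers the whole matrix
    dim3 grid(1, 1);
    matAdd<<<grid, block>>>(dA, dB, dC);
    cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);

    printf("C[0][0] = %.1f\n", hC[0]);  // expect 3.0
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return 0;
}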
Matrix Multiplication:
Matrix multiplication involves computing the dot product of the rows of the first matrix with
the columns of the second matrix.
Key steps:
1. Each thread computes the value of one element in the resulting matrix.
2. For each thread, the dot product of one row of matrix A and one column of matrix B
is computed and assigned to the corresponding element of the result matrix C.
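A kernel sketch for this computation, under the same assumptions as above (square n x n matrices in row-major order; the name matMul is illustrative):

// Each thread computes one element C[row][col] as the dot product
// of row `row` of A and column `col` of B.
__global__ void matMul(const float *A, const float *B, float *C, int n) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < n && col < n) {
        float sum = 0.0f;
        for (int k = 0; k < n; k++)
            sum += A[row * n + k] * B[k * n + col];
        C[row * n + col] = sum;
    }
}

The host-side setup (allocation, copies, launch configuration) mirrors the addition example above.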
Matrix Transposition:
Matrix transposition involves switching the rows and columns of a matrix. In this case, each
thread transposes one element of the matrix.
Key steps:
1. A CUDA kernel is launched in which each thread swaps the row and column indices
of its element to transpose the matrix.
2. Every element A[i][j] is assigned to B[j][i].
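A corresponding kernel sketch (the name matTranspose is assumed):

// Each thread writes its element to the swapped position: B[j][i] = A[i][j].
__global__ void matTranspose(const float *A, float *B, int n) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;  // j
    int row = blockIdx.y * blockDim.y + threadIdx.y;  // i
    if (row < n && col < n)
        B[col * n + row] = A[row * n + col];
}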
Scalar Multiplication:
Scalar multiplication involves multiplying each element of a matrix by a constant scalar
value.
Key steps:
1. Each thread multiplies one element of matrix A by a scalar k.
2. The result is stored in matrix C.
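A kernel sketch for this operation (the name scalarMul is assumed; k is passed as a kernel argument):

// Each thread scales one element of A by k and stores it in C.
__global__ void scalarMul(const float *A, float *C, float k, int n) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < n && col < n)
        C[row * n + col] = k * A[row * n + col];
}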
Performance Considerations:
For all of the operations:
1. Thread Management: The grid and block dimensions were chosen to optimize the
number of threads per block, ensuring efficient parallelism.
2. Memory Transfer: Efficient transfer of data between host and device is crucial.
Using pinned memory or memory pools may further optimize this (see the sketch after this list).
3. Thread Synchronization: No explicit synchronization is required in these operations
since each thread works independently on separate elements of the matrix.
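To illustrate the memory-transfer point, the sketch below swaps a pageable host buffer for pinned (page-locked) memory via cudaMallocHost. This is an assumption-level example, not part of the original lab code:

#include <cuda_runtime.h>

// Pinned (page-locked) host memory usually transfers to/from the
// device faster than pageable memory and enables asynchronous copies.
void copyWithPinnedMemory(float *dA, size_t bytes) {
    float *hA = NULL;
    cudaMallocHost((void **)&hA, bytes);  // page-locked allocation
    // ... fill hA with input data ...
    cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
    cudaFreeHost(hA);                     // must pair with cudaMallocHost
}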
Lab Tasks:
Code and Output:
Task 2: