Lab 1 Parallel

Uploaded by

omarobeidd03

Omar Obeid

Parallel Lab 1

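#include <stdio.h>
#include <stdlib.h>
#include <math.h>          // fabs, used in validateResults
#include <time.h>
#include <cuda_runtime.h>  // CUDA runtime API (cudaMalloc, cudaMemcpy, events)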

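// CUDA kernel for matrix-vector multiplication.
// Each thread computes one element of result: the dot product of one
// row of the row-major M x N matrix with the vector.
__global__ void matrixVectorMulKernel(float *matrix, float *vector, float *result, int M, int N) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < M) {  // guard: the last block may contain threads past the final row
        float sum = 0.0f;
        for (int col = 0; col < N; col++) {
            sum += matrix[row * N + col] * vector[col];
        }
        result[row] = sum;
    }
}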

// Function to perform matrix-vector multiplication on CPU for validation
void matrixVectorMulCPU(float *matrix, float *vector, float *result, int M, int N) {
    for (int row = 0; row < M; row++) {
        float sum = 0.0f;
        for (int col = 0; col < N; col++) {
            sum += matrix[row * N + col] * vector[col];
        }
        result[row] = sum;
    }
}

// Helper function to initialize matrix and vector with random values
// in the range 0.0 to 9.9 (steps of 0.1)
void initializeMatrixAndVector(float *matrix, float *vector, int M, int N) {
    srand(time(NULL));  // seed from the current time, so inputs differ between runs
    for (int i = 0; i < M * N; i++) {
        matrix[i] = (float)(rand() % 100) / 10.0f;
    }
    for (int i = 0; i < N; i++) {
        vector[i] = (float)(rand() % 100) / 10.0f;
    }
}

// Function to validate GPU results by comparing with CPU results
// (absolute tolerance of 1e-4 per element)
bool validateResults(float *cpuResult, float *gpuResult, int M) {
    for (int i = 0; i < M; i++) {
        if (fabs(cpuResult[i] - gpuResult[i]) > 1e-4) {
            return false;
        }
    }
    return true;
}

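int main() {
    int M, N;
    printf("Enter matrix dimensions M (rows) and N (columns): ");
    // Reject malformed or non-positive input before it is used in size computations
    if (scanf("%d %d", &M, &N) != 2 || M <= 0 || N <= 0) {
        fprintf(stderr, "Error: M and N must be positive integers.\n");
        return 1;
    }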

    // Allocate memory for matrix and vectors on host (CPU)
    float *h_matrix = (float*)malloc(M * N * sizeof(float));
    float *h_vector = (float*)malloc(N * sizeof(float));
    float *h_result_cpu = (float*)malloc(M * sizeof(float));
    float *h_result_gpu = (float*)malloc(M * sizeof(float));
    if (!h_matrix || !h_vector || !h_result_cpu || !h_result_gpu) {
        fprintf(stderr, "Error: host memory allocation failed.\n");
        return 1;
    }

    // Initialize matrix and vector with random values
    initializeMatrixAndVector(h_matrix, h_vector, M, N);
    printf("Matrix (%dx%d) and Vector (size %d) successfully generated.\n", M, N, N);

    // Allocate memory on device (GPU)
    float *d_matrix, *d_vector, *d_result;
    cudaMalloc(&d_matrix, M * N * sizeof(float));
    cudaMalloc(&d_vector, N * sizeof(float));
    cudaMalloc(&d_result, M * sizeof(float));

    // Transfer matrix and vector from host to device
    cudaMemcpy(d_matrix, h_matrix, M * N * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_vector, h_vector, N * sizeof(float), cudaMemcpyHostToDevice);

    // Define thread and block configuration:
    // round up so that gridSize * blockSize >= M
    int blockSize = 256;
    int gridSize = (M + blockSize - 1) / blockSize;

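    // Measure GPU computation time (kernel only; memory transfers are not included)
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);

    // Launch kernel on GPU
    matrixVectorMulKernel<<<gridSize, blockSize>>>(d_matrix, d_vector, d_result, M, N);

    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float gpuTime = 0;
    cudaEventElapsedTime(&gpuTime, start, stop);

    // A kernel launch does not return a status, so query it explicitly
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        fprintf(stderr, "Kernel launch failed: %s\n", cudaGetErrorString(err));
        return 1;
    }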

    // Copy result from device back to host
    cudaMemcpy(h_result_gpu, d_result, M * sizeof(float), cudaMemcpyDeviceToHost);

    // Perform matrix-vector multiplication on CPU for validation
    clock_t cpuStart = clock();
    matrixVectorMulCPU(h_matrix, h_vector, h_result_cpu, M, N);
    clock_t cpuStop = clock();
    float cpuTime = 1000.0 * (cpuStop - cpuStart) / CLOCKS_PER_SEC;  // ticks to milliseconds

    // Validate GPU results against CPU results
    if (validateResults(h_result_cpu, h_result_gpu, M)) {
        printf("Validation: GPU and CPU results match!\n");
    } else {
        printf("Validation failed: Results do not match.\n");
    }

    // Print computation times
    printf("GPU computation time: %.2f milliseconds\n", gpuTime);
    printf("CPU computation time: %.2f milliseconds\n", cpuTime);

    // Free allocated memory and release the timing events
    free(h_matrix);
    free(h_vector);
    free(h_result_cpu);
    free(h_result_gpu);
    cudaFree(d_matrix);
    cudaFree(d_vector);
    cudaFree(d_result);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);

    return 0;
}

Explanation of Code

1. CUDA Kernel (matrixVectorMulKernel):
   o Each GPU thread computes one element of the result vector: the dot product
     of one matrix row with the input vector. Threads whose row index is not
     below M do nothing.
2. CPU Function for Validation (matrixVectorMulCPU):
   o Performs the same matrix-vector multiplication on the CPU with two nested
     loops (rows, then columns). Its output is the reference for checking the
     GPU result.
3. Data Initialization (initializeMatrixAndVector):
   o Fills the matrix and vector with pseudo-random values between 0.0 and 9.9,
     seeded from the current time, so each run uses different inputs.
4. Memory Management:
   o Allocates memory on both the CPU (host) and the GPU (device) for the matrix,
     the input vector, and the result vector.
   o Copies the matrix and input vector from host to device before the kernel
     runs, and copies the result vector back to the host afterwards.
5. Kernel Launch:
   o Uses 256 threads per block and enough blocks to cover all M rows (M divided
     by 256, rounded up). Each thread computes one element of the output vector.
6. Validation (validateResults):
   o Compares each element of the CPU and GPU results and reports a mismatch if
     any pair differs by more than 1e-4 in absolute value.
7. Timing:
   o Measures the GPU kernel time with CUDA events and the CPU time with clock(),
     both reported in milliseconds, to compare performance.
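
The validation in item 6 uses a fixed absolute tolerance. Each output element is a sum of N products, so its magnitude grows with N, and single-precision rounding differences between the CPU loop and the GPU kernel can then exceed a fixed threshold even when both results are correct. One possible alternative, shown here only as a sketch, scales the tolerance with the size of the values being compared. The names validateResultsRelative, absValue, absTol, and relTol are assumptions made for this illustration and are not part of the original lab code.

```c
#include <assert.h>
#include <stdbool.h>

// Absolute value without pulling in math.h (fabsf would do the same job).
static float absValue(float x) {
    return x < 0.0f ? -x : x;
}

// Hypothetical variant of validateResults: a pair of values is accepted when
//   |cpu - gpu| <= absTol + relTol * max(|cpu|, |gpu|)
bool validateResultsRelative(const float *cpuResult, const float *gpuResult,
                             int M, float absTol, float relTol) {
    assert(absTol >= 0.0f && relTol >= 0.0f);
    for (int i = 0; i < M; i++) {
        float a = absValue(cpuResult[i]);
        float b = absValue(gpuResult[i]);
        float scale = a > b ? a : b;  // larger magnitude of the pair
        float diff = absValue(cpuResult[i] - gpuResult[i]);
        if (diff > absTol + relTol * scale) {
            return false;  // element i differs by more than the allowed error
        }
    }
    return true;
}
```

With relTol set to 0 this reduces to the absolute check used in the lab. A nonzero relTol lets large sums differ by a proportionally larger amount. Suitable values depend on N and on the input range and would need to be chosen experimentally.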

Sample Output
Enter matrix dimensions M (rows) and N (columns): 1000 500
Matrix (1000x500) and Vector (size 500) successfully generated.
Validation: GPU and CPU results match!
GPU computation time: X.XX milliseconds
CPU computation time: Y.YY milliseconds
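
Discussion
The GPU kernel is generally faster than the CPU loop because the M row dot products are computed
in parallel, and the advantage grows as the matrix dimensions increase. For small matrices, kernel
launch overhead can dominate. The reported GPU time covers only the kernel: the host-to-device and
device-to-host copies fall outside the timed events, so the end-to-end GPU cost is higher than the
printed figure. The grid and block configuration assigns one thread to each element of the output
vector, so the available parallelism is M threads.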
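
To make the launch configuration concrete, the following host-only sketch reproduces the ceiling division used for gridSize in main and counts how many launched threads fall outside the row < M guard. The helper names gridSizeFor and idleThreads are hypothetical and exist only for this illustration.

```c
#include <assert.h>

// Number of blocks needed so that gridSize * blockSize >= M
// (the same ceiling division as in main).
int gridSizeFor(int M, int blockSize) {
    assert(M > 0 && blockSize > 0);
    return (M + blockSize - 1) / blockSize;
}

// Threads launched beyond the last row. The kernel's `row < M` guard
// makes them return without reading or writing memory.
int idleThreads(int M, int blockSize) {
    return gridSizeFor(M, blockSize) * blockSize - M;
}
```

For the sample input M = 1000 with blockSize = 256, gridSizeFor returns 4, so 1024 threads are launched and 24 of them do no work.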
