2023-CSC14120-Lecture01-CUDAIntroduction
CUDA C/C++: an extension of C/C++ that allows us to write a single program running on both the CPU (the sequential parts) and the GPU (the massively parallel parts)
#include <iostream>
#include <algorithm>

#define N 1024
#define RADIUS 3
#define BLOCK_SIZE 16

// Helper to initialize an array (uses std::fill_n from <algorithm>)
void fill_ints(int *x, int n) { std::fill_n(x, n, 1); }

// Device = GPU: the kernel below runs in parallel on the device
__global__ void stencil_1d(int *in, int *out) {
    // ... each thread computes one output element from its neighborhood ...
    // Store the result
    out[gindex] = result;
}

int main(void) {
    int *in, *out;      // host copies of in, out
    int *d_in, *d_out;  // device copies of in, out
    int size = (N + 2*RADIUS) * sizeof(int);

    // Serial code on host: allocate and initialize host memory, allocate device memory
    in  = (int *)malloc(size); fill_ints(in,  N + 2*RADIUS);
    out = (int *)malloc(size); fill_ints(out, N + 2*RADIUS);
    cudaMalloc((void **)&d_in,  size);
    cudaMalloc((void **)&d_out, size);

    // Copy to device
    cudaMemcpy(d_in,  in,  size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_out, out, size, cudaMemcpyHostToDevice);

    // Parallel code: launch stencil_1d() kernel on GPU (the device runs its threads in parallel)
    stencil_1d<<<N/BLOCK_SIZE, BLOCK_SIZE>>>(d_in + RADIUS, d_out + RADIUS);

    // Serial code on host: copy result back to host
    cudaMemcpy(out, d_out, size, cudaMemcpyDeviceToHost);

    // Cleanup
    free(in); free(out);
    cudaFree(d_in); cudaFree(d_out);
    return 0;
}
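The kernel body itself is elided above. One possible body, a minimal sketch without the usual shared-memory optimization (assuming each thread simply sums the 2*RADIUS+1 input elements around its own position), could be:

// Hypothetical simplified body for stencil_1d (no shared-memory tiling)
__global__ void stencil_1d(int *in, int *out) {
    int gindex = threadIdx.x + blockIdx.x * blockDim.x;

    // Sum the 2*RADIUS+1 input elements centered on this thread's element
    int result = 0;
    for (int offset = -RADIUS; offset <= RADIUS; offset++)
        result += in[gindex + offset];

    // Store the result
    out[gindex] = result;
}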
int main(int argc, char **argv)
{
int n; // Vector size
float *in1, *in2; // Input vectors
float *out; // Output vector
// Host allocates memory on the device
float *d_in1, *d_in2, *d_out;
cudaMalloc(&d_in1, n * sizeof(float));
cudaMalloc(&d_in2, n * sizeof(float));
cudaMalloc(&d_out, n * sizeof(float));
[Figure: mapping of thread indices to data indices; with n = 700, some launched threads are redundant]
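With block size 256 (as used later in this lecture) and n = 700, the grid is rounded up to 3 blocks, i.e. 768 threads, so 68 threads have no data element; the kernel must guard against these redundant threads. A minimal sketch of what addVecOnDevice could look like (the actual kernel is not shown in this excerpt):

// Illustrative sketch of addVecOnDevice; signature matches the launch used later
__global__ void addVecOnDevice(float *in1, float *in2, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x; // global thread index = data index

    // Guard: threads with i >= n are redundant and must not access memory
    if (i < n)
        out[i] = in1[i] + in2[i];
}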
More on CUDA Function Declarations

                                  Callable from   Execute on   Execute by
__device__ float DeviceFunc()     device          device       caller device thread
__global__ void  KernelFunc()     host            device       new grid of device threads
__host__   float HostFunc()       host            host         caller host thread
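As a quick illustration of the three qualifiers (the function names here are made up for the example):

// __device__: runs on the device, callable only from device code
__device__ float square(float x) { return x * x; }

// __global__: a kernel; runs on the device, launched from the host with <<<...>>>
__global__ void squareAll(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] = square(data[i]); // calling a __device__ function from a kernel
}

// __host__: an ordinary CPU function (also the default when no qualifier is given)
__host__ float squareOnHost(float x) { return x * x; }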
Kernel function execution is asynchronous w.r.t. the host by default
After the host calls a kernel function to be executed on the device, the host is free to do other work without waiting for the kernel to complete
...
// Host invokes kernel function to add vectors on device
dim3 blockSize(256);
dim3 gridSize((n - 1) / blockSize.x + 1);
addVecOnDevice<<<gridSize, blockSize>>>(d_in1, d_in2, d_out, n);
To time the kernel correctly, the host must therefore wait for the device to finish before stopping the clock:
...
// Host invokes kernel function to add vectors on device
dim3 blockSize(256);
dim3 gridSize((n - 1) / blockSize.x + 1);
double start = seconds(); // seconds is my function to get the current time
addVecOnDevice<<<gridSize, blockSize>>>(d_in1, d_in2, d_out, n);
cudaDeviceSynchronize(); // Host waits here until device completes its work
double time = seconds() - start; // now includes the full kernel execution time
…
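An alternative not shown on this slide is to time the kernel with CUDA events instead of a host clock; a sketch using the standard cudaEvent_* API (the event variable names are mine):

// Timing the same launch with CUDA events
cudaEvent_t startEv, stopEv;
cudaEventCreate(&startEv);
cudaEventCreate(&stopEv);

cudaEventRecord(startEv);
addVecOnDevice<<<gridSize, blockSize>>>(d_in1, d_in2, d_out, n);
cudaEventRecord(stopEv);

cudaEventSynchronize(stopEv);               // wait until the stop event completes
float ms = 0;
cudaEventElapsedTime(&ms, startEv, stopEv); // elapsed kernel time in milliseconds

cudaEventDestroy(startEv);
cudaEventDestroy(stopEv);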
Error checking
when calling CUDA API functions
• It is possible for an error to occur while the CUDA program still runs normally and just gives a wrong result
• → we don't know where to look for the bug
• → to locate bugs, we should always check for errors when calling CUDA API functions
• For convenience, we can define a macro that checks for errors and wrap it around CUDA API function calls
#define CHECK(call) \
{ \
cudaError_t err = call; \
if (err != cudaSuccess) \
{ \
printf("%s in %s at line %d!\n", cudaGetErrorString(err), __FILE__, __LINE__); \
exit(EXIT_FAILURE); \
} \
}
// Host allocates memory on the device
float *d_in1, *d_in2, *d_out;
CHECK(cudaMalloc(&d_in1, n * sizeof(float)));
CHECK(cudaMalloc(&d_in2, n * sizeof(float)));
CHECK(cudaMalloc(&d_out, n * sizeof(float)));
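A kernel launch does not return a cudaError_t, so CHECK cannot wrap it directly; one common pattern (an addition here, not from the slide) is to query the last error after launching:

addVecOnDevice<<<gridSize, blockSize>>>(d_in1, d_in2, d_out, n);
CHECK(cudaGetLastError());      // catches invalid launch configurations
CHECK(cudaDeviceSynchronize()); // catches errors raised while the kernel runs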
Experiment: host vs device
• Generate input vectors with random values in [0, 1]
• Compare the running time of the host (addVecOnHost function, sketched below) and the device (addVecOnDevice function, block size 512) for different vector sizes
• GPU: GeForce GTX 560 Ti (compute capability 2.1)
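The host baseline is presumably just a sequential loop; a minimal sketch of what addVecOnHost might look like (the actual function is not shown in this excerpt):

// Assumed shape of the sequential baseline used in the comparison
void addVecOnHost(float *in1, float *in2, float *out, int n) {
    for (int i = 0; i < n; i++)
        out[i] = in1[i] + in2[i];
}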
Results:

Vec size    Host time (ms)    Device time (ms)    Host time / Device time
64               0.001             0.040               0.024
256              0.002             0.018               0.118
1024             0.006             0.017               0.347
4096             0.030             0.017               1.775
16384            0.127             0.017               7.403
65536            0.516             0.055               9.409
262144           1.028             0.197               5.220
1048576          3.773             0.277              13.619
4194304         13.870             0.617              22.479
16777216        55.177             1.993              27.683
THE END