

Name:
Student ID:
Date: 12.01.2023

CENG443 Final Exam - Fall 2023
(90 minutes)

Q.1 (20 points) A kernel performs 36 floating-point operations and 7 32-bit word global memory accesses per thread. For each of the following device properties, indicate whether this kernel is compute- or memory-bound. Explain your answer.

a) Peak FLOPS= 200 GFLOPS, Peak Memory Bandwidth= 100 GB/s

Per thread: 36 FLOPs and 7 * 4 = 28 bytes of global memory traffic.

Compute can sustain 200 / 36 = ~5.6 billion threads per second.
Memory can sustain 100 / 28 = ~3.6 billion threads per second.

Memory serves fewer threads per second than compute can process, so the kernel is memory-bound.

b) Peak FLOPS= 300 GFLOPS, Peak Memory Bandwidth= 250 GB/s

Compute: 300 / 36 = ~8.3 billion threads per second.
Memory: 250 / 28 = ~8.9 billion threads per second.

Memory now serves more threads per second than compute, so the kernel is compute-bound.
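
As a sanity check, the comparison can be scripted. Below is a minimal C sketch (not part of the exam) of the arithmetic used in both parts; the peak numbers and per-thread counts are taken from the question:

#include <stdio.h>

/* Classify a kernel as compute- or memory-bound by comparing how many
   threads per second each resource can sustain (the approach used in
   the answers above). GFLOPS and GB/s share the 10^9 factor, so it
   cancels in the comparison. */
static void classify(double peak_gflops, double peak_gbps,
                     double flops_per_thread, double bytes_per_thread) {
    double by_compute = peak_gflops / flops_per_thread; /* x10^9 threads/s */
    double by_memory  = peak_gbps   / bytes_per_thread; /* x10^9 threads/s */
    printf("compute %.1f vs memory %.1f (x10^9 threads/s) -> %s-bound\n",
           by_compute, by_memory,
           by_memory < by_compute ? "memory" : "compute");
}

int main(void) {
    classify(200.0, 100.0, 36.0, 28.0); /* a) prints memory-bound  */
    classify(300.0, 250.0, 36.0, 28.0); /* b) prints compute-bound */
    return 0;
}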
Q.2 (30 Points) Assume we have a GPU device that can overlap kernel executions in different streams, and consider the following code with kernel functions k1, k2, k3, k4 and a CPU function f1, where each function execution takes a similar amount of time.

int n_streams = 4;
cudaStream_t streams[4];
// memory allocations and initializations
for (int i = 0; i < n_streams; i++) {
    k1<<<grid, block, 0, streams[i]>>>();
    k2<<<grid, block, 0, streams[i]>>>();
    k3<<<grid, block>>>();                // launched into the default stream
    k4<<<grid, block, 0, streams[i]>>>();
    f1();                                 // runs on the CPU
}

Show an example execution order of the functions in the different streams and on the CPU. Make sure your order includes at least one overlapped kernel execution. Present your execution order using a layout like the following (the ordering given here is just an example):

Time

default stream   k1  k2  k3
streams[0]       k1  k2  k3
streams[1]       k1  k2  k3
streams[2]       ..  ..  ..
streams[3]       ..  ..  ..
CPU              ..  ..  ..

Show how your execution order changes if f1 is a kernel function launched as follows:

f1<<<grid, block>>>();

Any valid sequence is accepted, as long as it respects the ordering constraints: within streams[i], k1 -> k2 -> k4 run in launch order; k3 is launched into the (legacy) default stream, which synchronizes with all other streams, so k3 starts only after all previously launched kernels have finished, and k4 starts only after k3. Because kernel launches are asynchronous, the CPU executes f1 while the GPU is busy, and k4 of iteration i can overlap with k1/k2 of iteration i+1 (they are in different non-default streams with no default-stream launch between them). One valid order:

Time

default stream           k3                  k3
streams[0]      k1  k2       k4
streams[1]                   k1  k2               k4
CPU             f1           f1

Here k4 in streams[0] overlaps with k1 in streams[1].

If f1 is launched as a kernel, f1<<<grid, block>>>() also runs in the default stream and becomes a second implicit barrier in every iteration: it waits for k4, and k1 of the next iteration waits for it. The overlap above disappears and the kernels execute fully serialized:

Time

default stream           k3      f1
streams[0]      k1  k2       k4
streams[1]                           k1  k2  ...
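
For reference, a minimal compilable harness around the question's launch pattern (a sketch only: the kernel bodies and f1 are empty placeholders, and the grid/block sizes are arbitrary assumptions not given in the exam):

#include <cuda_runtime.h>

__global__ void k1() {}
__global__ void k2() {}
__global__ void k3() {}
__global__ void k4() {}
void f1() { /* CPU work */ }

int main() {
    const int n_streams = 4;
    cudaStream_t streams[n_streams];
    for (int i = 0; i < n_streams; i++)
        cudaStreamCreate(&streams[i]);   // non-default, blocking streams

    dim3 grid(1), block(32);             // assumed sizes, not from the exam
    for (int i = 0; i < n_streams; i++) {
        k1<<<grid, block, 0, streams[i]>>>();
        k2<<<grid, block, 0, streams[i]>>>();
        k3<<<grid, block>>>();           // legacy default stream: implicit barrier
        k4<<<grid, block, 0, streams[i]>>>();
        f1();                            // overlaps with GPU work (launches are async)
    }
    cudaDeviceSynchronize();

    for (int i = 0; i < n_streams; i++)
        cudaStreamDestroy(streams[i]);
    return 0;
}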


Q.3 (20 Points) Consider the vector addition kernel below and assume that the size of A, B, and C is 20,000
elements each.

__global__
void vecAddKernel(float* A, float* B, float* C, int n) {
    int i = threadIdx.x + 2 * blockDim.x * blockIdx.x;
    if (i < n) C[i] = A[i] + B[i];
    i += blockDim.x;
    if (i < n) C[i] = A[i] + B[i];
}

int vectAdd(float* A, float* B, float* C, int n) {
    ...
    int size = n * sizeof(float);
    cudaMalloc((void **) &A_d, size);
    cudaMalloc((void **) &B_d, size);
    cudaMalloc((void **) &C_d, size);
    cudaMemcpy(A_d, A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(B_d, B, size, cudaMemcpyHostToDevice);

    vecAddKernel<<<ceil(n / (2 * 1024.0)), 1024>>>(A_d, B_d, C_d, n);

    cudaMemcpy(C, C_d, size, cudaMemcpyDeviceToHost);
    ...
}

a) How many thread blocks will be generated?

10. Each block covers 2 * 1024 = 2048 elements, so the launch creates ceil(20000 / 2048) = ceil(9.77) = 10 blocks.

b) How many warps are there in each block?

32. Each block has 1024 threads and a warp has 32 threads: 1024 / 32 = 32 warps.

c) How many threads will be created in the grid?

10240. 10 blocks * 1024 threads = 10240 threads.

d) Is there any control divergence during the execution of the kernel? If so, identify the block number that
causes the control divergence.

Block 9 is the only block where the condition i < n evaluates differently across threads: its first access covers i = 18432 ... 19455 (all pass), while its second access covers i = 19456 ... 20479, so only threads with threadIdx.x < 544 execute the addition. Note, however, that the cut-off 544 = 17 * 32 falls exactly on a warp boundary (warps 0-16 all take the branch, warps 17-31 all skip it), so under the warp-level definition of control divergence no warp actually diverges; this happens because 20000 is a multiple of the warp size.
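
A small host-side C sketch (not part of the exam) that reproduces the arithmetic behind a)-d), including the warp-alignment check for the divergence boundary:

#include <math.h>
#include <stdio.h>

int main(void) {
    const int n = 20000, block = 1024, warp = 32;
    int grid = (int)ceil(n / (2 * 1024.0));      /* a) 10 blocks          */
    printf("blocks: %d\n", grid);
    printf("warps per block: %d\n", block / warp); /* b) 32               */
    printf("threads: %d\n", grid * block);       /* c) 10240              */

    /* d) second access of the last block: i = 2*block*(grid-1) + block + tid,
       so the i < n check passes only for tid < n - base. */
    int base = 2 * block * (grid - 1) + block;   /* 19456                 */
    int cutoff = n - base;                       /* 544 threads pass      */
    printf("cut-off tid: %d (warp-aligned: %s)\n",
           cutoff, cutoff % warp == 0 ? "yes, no warp diverges" : "no");
    return 0;
}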


Q.4 (30 Points) Which kernel is better in terms of memory coalescing if we replace KERNEL in the main function with a) doubleKernelStride, b) doubleKernel? Explain your answer with example memory-access scenarios.

__global__
void doubleKernelStride(int *a, int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;
    for (int i = idx; i < N; i += stride)
        a[i] *= 2;
}

__global__
void doubleKernel(int *a, int N)
{
    int idx = (threadIdx.x + blockDim.x * blockIdx.x) * 2;
    a[idx] *= 2;
    int jdx = (threadIdx.x + blockDim.x * blockIdx.x) * 2 + 1;
    a[jdx] *= 2;
}

int main()
{
    int N = 16384; // 2^14
    int *a;

    ...

    KERNEL<<<32, 256>>>(a, N);

    ...
}

doubleKernelStride. With 32 * 256 = 8192 threads and N = 16384, the grid-stride kernel makes two passes, and in each pass the 32 threads of a warp access 32 consecutive ints (e.g., warp 0 touches a[0..31], then a[8192..8223]): one fully used 128-byte transaction per access. In doubleKernel, thread t accesses a[2*t] and a[2*t+1], so a warp's first access touches a[0], a[2], ..., a[62]: 32 ints spread over 256 bytes, requiring two 128-byte transactions with only half of each used. The grid-stride version therefore coalesces better.
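
A host-side sketch (not from the exam) that prints the element indices warp 0 touches on its first access in each kernel, to make the coalescing difference concrete:

#include <stdio.h>

int main(void) {
    /* First access of warp 0 (lanes 0..31) in each kernel. */
    printf("doubleKernelStride, 1st pass: ");
    for (int lane = 0; lane < 32; lane++)
        printf("%d ", lane);        /* a[0..31]: one 128 B segment, fully used */

    printf("\ndoubleKernel, a[idx] access:  ");
    for (int lane = 0; lane < 32; lane++)
        printf("%d ", lane * 2);    /* a[0,2,...,62]: two segments, half used  */

    printf("\n");
    return 0;
}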
