Student ID:
Q.1 (20 points) A kernel performs 36 floating-point operations and seven 32-bit global memory accesses
per thread. For each of the following device properties, indicate whether this kernel is compute-bound or
memory-bound. Explain your answer.
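36 FLOPs per thread
7 * 4 = 28 bytes per thread
200/36 ≈ 5.6
100/28 ≈ 3.6 (lower than 5.6, so memory limits throughput: memory-bound)
300/36 ≈ 8.3
250/28 ≈ 8.9 (higher than 8.3, so compute limits throughput: compute-bound)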
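The working above can be sketched as a small host-side check. This is only an illustration: it assumes the figures 200 and 300 are peak throughputs in GFLOP/s and the figures 100 and 250 are peak bandwidths in GB/s (the device properties themselves are not reproduced here), and `classifyKernel` is a hypothetical helper name.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: compares how many threads per second each resource
// can sustain. peak_gflops (GFLOP/s) and peak_gbps (GB/s) are ASSUMED
// device properties, not values confirmed by the question text.
std::string classifyKernel(double flops_per_thread, double bytes_per_thread,
                           double peak_gflops, double peak_gbps) {
    assert(flops_per_thread > 0 && bytes_per_thread > 0);
    // Giga-threads per second the arithmetic units could process.
    double compute_rate = peak_gflops / flops_per_thread;
    // Giga-threads per second the memory system could feed.
    double memory_rate = peak_gbps / bytes_per_thread;
    // The resource with the lower sustainable rate is the bottleneck.
    return (memory_rate < compute_rate) ? "memory-bound" : "compute-bound";
}
```

Under these assumptions, `classifyKernel(36, 28, 200, 100)` compares 5.6 against 3.6, and `classifyKernel(36, 28, 300, 250)` compares 8.3 against 8.9.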
Q.2 (30 points) Assume a GPU device that can overlap kernel executions in different streams. The
following code uses the kernel functions k1, k2, k3, k4 and the CPU function f1; each function takes
about the same time to execute.
int n_streams = 4;
cudaStream_t streams[4];
// memory allocations and initializations
for (int i = 0; i < n_streams; i++) {
    k1<<<grid, block, 0, streams[i]>>>();
    k2<<<grid, block, 0, streams[i]>>>();
    k3<<<grid, block>>>();
    k4<<<grid, block, 0, streams[i]>>>();
}
Show an example execution order of the functions across the streams and the CPU. Make sure your order
includes at least one overlapped kernel execution. Present your execution order in the following format
(the ordering shown here is only an example):
default stream  k1  k2  k3
streams[0]      k1  k2  k3
streams[1]      k1  k2  k3
streams[2]      ..  ..  ..
streams[3]      ..  ..  ..
CPU             ..  ..  ..
Show how your execution order changes if f1 is instead a kernel function launched as follows:
f1<<<grid, block>>>();
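One way to reason about such orderings is a simplified host-side model of stream scheduling. The sketch below makes several assumptions: the default stream has legacy (blocking) semantics, every kernel takes one time unit, launches are asynchronous with no overhead, and the device always has room to run kernels concurrently. `startTimes` is a hypothetical helper, and the value -1 is used here to denote the default stream.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Simplified model, NOT a description of real hardware scheduling.
// launches[k] is the stream of the k-th launch in issue order; -1 means
// the legacy default stream. Returns the start time of each launch,
// assuming unit-duration kernels.
std::vector<int> startTimes(const std::vector<int>& launches, int numStreams) {
    assert(numStreams > 0);
    std::vector<int> lastEnd(numStreams, 0); // end time of last kernel per stream
    int defaultEnd = 0;                      // end time of last default-stream kernel
    int allEnd = 0;                          // end time of all work issued so far
    std::vector<int> starts;
    for (int s : launches) {
        assert(s >= -1 && s < numStreams);
        int start;
        if (s < 0) {
            // Legacy default stream: waits for all previously issued work.
            start = allEnd;
            defaultEnd = start + 1;
        } else {
            // Other streams: in-order within the stream, and after any
            // earlier default-stream kernel.
            start = std::max(lastEnd[s], defaultEnd);
            lastEnd[s] = start + 1;
        }
        allEnd = std::max(allEnd, start + 1);
        starts.push_back(start);
    }
    return starts;
}
```

In this model, a kernel launched without a stream argument is represented by an extra -1 entry at the point where it is issued, which forces everything issued later to wait for it.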
__global__ void vecAddKernel(float* A, float* B, float* C, int n) {
    int i = threadIdx.x + 2 * blockDim.x * blockIdx.x;
    if (i < n) C[i] = A[i] + B[i];
    i += blockDim.x;
    if (i < n) C[i] = A[i] + B[i];
}
d) Is there any control divergence during the execution of the kernel? If so, identify the block number that
causes the control divergence.
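A host-side emulation of the index arithmetic in vecAddKernel can help locate divergent warps. The sketch below is illustrative only: `divergentBlocks` is a hypothetical helper, the warp size of 32 is an assumption, and the values of `n`, `blockDim` and `gridDim` must be supplied by the reader. A warp diverges when some of its threads pass an `i < n` test and others fail it. Because `i` increases with `threadIdx.x`, comparing the first and last lane of each warp is enough.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Returns the block indices in which at least one warp has threads that
// disagree on either of the two `i < n` checks in vecAddKernel.
// warpSize = 32 is an assumption about the device.
std::vector<int> divergentBlocks(int n, int blockDim, int gridDim,
                                 int warpSize = 32) {
    assert(n >= 0 && blockDim > 0 && gridDim > 0 && warpSize > 0);
    std::vector<int> result;
    for (int b = 0; b < gridDim; ++b) {
        bool diverges = false;
        for (int w = 0; w < blockDim && !diverges; w += warpSize) {
            int lastLane = std::min(w + warpSize, blockDim) - 1;
            // pass 0 is the first check, pass 1 the check after i += blockDim.x
            for (int pass = 0; pass < 2; ++pass) {
                int base = 2 * blockDim * b + pass * blockDim;
                bool firstIn = (base + w) < n;
                bool lastIn = (base + lastLane) < n;
                if (firstIn != lastIn) diverges = true;
            }
        }
        if (diverges) result.push_back(b);
    }
    return result;
}
```

For example, with the illustrative values `n = 1000`, `blockDim = 256` and `gridDim = 2`, only block 1 is reported, because its warp covering elements 992 to 1023 straddles `n`.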
__global__ void doubleKernelStride(int *a, int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;
    for (int i = idx; i < N; i += stride)
        a[i] *= 2;
}
__global__ void doubleKernel(int *a, int N)
{
    int idx = (threadIdx.x + blockDim.x * blockIdx.x) * 2;
    a[idx] *= 2;
    int jdx = (threadIdx.x + blockDim.x * blockIdx.x) * 2 + 1;
    a[jdx] *= 2;
}
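The two kernels map threads to elements differently, and a host-side emulation makes the difference easy to check. The sketch below is a rough illustration, not part of the original program: `touchesStride`, `touchesPair` and `allOnce` are hypothetical helpers, and the launch configurations are chosen by the caller. It counts how often each element would be doubled, and how many accesses doubleKernel would make past the end of the array (it has no bounds check).

```cpp
#include <cassert>
#include <vector>

// Emulates doubleKernelStride: counts how many times each element is visited.
std::vector<int> touchesStride(int N, int gridDim, int blockDim) {
    assert(N >= 0 && gridDim > 0 && blockDim > 0);
    std::vector<int> touched(N, 0);
    int stride = gridDim * blockDim;
    for (int b = 0; b < gridDim; ++b)
        for (int t = 0; t < blockDim; ++t)
            for (int i = b * blockDim + t; i < N; i += stride)
                ++touched[i];
    return touched;
}

// Emulates doubleKernel: each thread handles two adjacent elements.
// outOfBounds counts accesses the kernel would make at indices >= N.
std::vector<int> touchesPair(int N, int gridDim, int blockDim, int& outOfBounds) {
    assert(N >= 0 && gridDim > 0 && blockDim > 0);
    std::vector<int> touched(N, 0);
    for (int b = 0; b < gridDim; ++b)
        for (int t = 0; t < blockDim; ++t) {
            int idx = (t + blockDim * b) * 2;
            for (int k : {idx, idx + 1}) {
                if (k < N) ++touched[k];
                else ++outOfBounds;
            }
        }
    return touched;
}

// True when every element was visited exactly once.
bool allOnce(const std::vector<int>& touched) {
    for (int c : touched)
        if (c != 1) return false;
    return true;
}
```

With N = 16384, the grid-stride version covers every element once for any grid size, while the paired version does so only when exactly N/2 threads are launched; launching more threads produces out-of-bounds accesses.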
int main()
int N = 16384; // 2^14
int *a;