BCS3413 Principle & Applications of Parallel Programming Quiz 2: GPGPU CUDA
1. If we want to allocate an array of v integer elements in CUDA device global memory, what would
be an appropriate expression for the second argument of the cudaMalloc() call?
(A) n
(B) v
(C) n * sizeof(int)
(D) v * sizeof(int)
Answer: (D)
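For illustration, a minimal host-side sketch of the allocation; the pointer name d_A and the example value of v are assumptions for this sketch:

    int v = 1000;      // example element count
    int *d_A = NULL;   // device pointer (name assumed)
    // The second argument is the allocation size in bytes: v elements of sizeof(int) each.
    cudaMalloc((void **)&d_A, v * sizeof(int));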
3. If we want to copy 3000 bytes of data from host array h_A (h_A is a pointer to element 0 of the
source array) to device array d_A (d_A is a pointer to element 0 of the destination array), what
would be an appropriate API call for this in CUDA?
(A) cudaMemcpy(3000, h_A, d_A, cudaMemcpyHostToDevice);
(B) cudaMemcpy(h_A, d_A, 3000, cudaMemcpyDeviceToHost);
(C) cudaMemcpy(d_A, h_A, 3000, cudaMemcpyHostToDevice);
(D) cudaMemcpy(3000, d_A, h_A, cudaMemcpyHostToDevice);
Answer: (C)
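A minimal sketch of the copy, assuming h_A and d_A have already been allocated with at least 3000 bytes each:

    // Argument order: destination, source, byte count, direction.
    cudaMemcpy(d_A, h_A, 3000, cudaMemcpyHostToDevice);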
4. How would one declare a variable err that can appropriately receive the returned value of a CUDA API call?
(A) int err;
(B) cudaError err;
(C) cudaError_t err;
(D) cudaSuccess_t err;
Answer: (C)
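A minimal error-checking sketch; the cudaMalloc() call is just one example of a CUDA API call whose return value can be captured (fprintf assumes <stdio.h> is included):

    cudaError_t err = cudaMalloc((void **)&d_A, v * sizeof(int));
    if (err != cudaSuccess) {
        // cudaGetErrorString() converts the error code into a readable message.
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
    }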
5. If we need to use each thread to calculate one output element of a vector addition, what would
be the expression for mapping the thread/block indices to the data index?
(A) i=threadIdx.x + threadIdx.y;
(B) i=blockIdx.x + threadIdx.x;
(C) i=blockIdx.x*blockDim.x + threadIdx.x;
(D) i=blockIdx.x * threadIdx.x;
Answer: (C)
Explanation: This is the case we covered in Lecture 2.3.
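A minimal vector-addition kernel using this mapping; the names vecAdd, A, B, C, and n are assumptions for the sketch:

    __global__ void vecAdd(const float *A, const float *B, float *C, int n) {
        // Global thread index: all previous blocks contribute blockIdx.x*blockDim.x threads.
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)                  // boundary check for the last block
            C[i] = A[i] + B[i];
    }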
6. We want to use each thread to calculate two (adjacent) output elements of a vector addition.
Assume that variable i should be the index for the first element to be processed by a thread. What
would be the expression for mapping the thread/block indices to the data index of the first element?
(A) i=blockIdx.x*blockDim.x + threadIdx.x + 2;
(B) i=blockIdx.x*threadIdx.x*2;
(C) i=(blockIdx.x*blockDim.x + threadIdx.x)*2;
(D) i=blockIdx.x*blockDim.x*2 + threadIdx.x;
Answer: (C)
Explanation: Every thread covers two adjacent output elements, so the starting data index is
simply twice the global thread index. Another way to look at it: all previous blocks together
cover (blockIdx.x*blockDim.x)*2 elements, and within the block each thread covers 2 elements, so
the beginning position for a thread within the block is 2*threadIdx.x. Adding the two gives
(blockIdx.x*blockDim.x + threadIdx.x)*2.
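A minimal kernel sketch for this adjacent-elements scheme (names assumed as in the previous sketch):

    __global__ void vecAddAdjacent(const float *A, const float *B, float *C, int n) {
        // First of the two adjacent elements handled by this thread.
        int i = (blockIdx.x * blockDim.x + threadIdx.x) * 2;
        if (i < n)     C[i]     = A[i]     + B[i];
        if (i + 1 < n) C[i + 1] = A[i + 1] + B[i + 1];
    }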
7. We want to use each thread to calculate two output elements of a vector addition. Each thread
block processes 2*blockDim.x consecutive elements that form two sections. All threads in each
block will first process a section, each processing one element. They will then all move to the next
section, again each processing one element. Assume that variable i should be the index for the
first element to be processed by a thread. What would be the expression for mapping the
thread/block indices to the data index of the first element?
(A) i=blockIdx.x*blockDim.x + threadIdx.x + 2;
(B) i=blockIdx.x*threadIdx.x*2;
(C) i=(blockIdx.x*blockDim.x + threadIdx.x)*2;
(D) i=blockIdx.x*blockDim.x*2 + threadIdx.x;
Answer: (D)
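Here all previous blocks together cover blockIdx.x*blockDim.x*2 elements, and within the first section each thread starts at offset threadIdx.x; its second element sits one full section (blockDim.x) later. A minimal kernel sketch (names assumed as above):

    __global__ void vecAddSections(const float *A, const float *B, float *C, int n) {
        // Element in the first section of this block's 2*blockDim.x-element range.
        int i = blockIdx.x * blockDim.x * 2 + threadIdx.x;
        if (i < n)
            C[i] = A[i] + B[i];
        // The same thread's element in the second section is blockDim.x positions away.
        if (i + blockDim.x < n)
            C[i + blockDim.x] = A[i + blockDim.x] + B[i + blockDim.x];
    }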
8. For a vector addition, assume that the vector length is 8000, each thread calculates one output
element, and the thread block size is 1024 threads. The programmer configures the kernel launch
to have a minimal number of thread blocks to cover all output elements. How many threads will
be in the grid?
(A) 8000
(B) 8196
(C) 8192
(D) 8200
Answer: (C)
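The minimal block count comes from a ceiling division, as in this launch sketch (vecAdd and the device arrays d_A, d_B, d_C are the assumed names from the earlier sketches):

    int n = 8000, blockSize = 1024;
    int numBlocks = (n + blockSize - 1) / blockSize;  // ceil(8000/1024) = 8 blocks
    // 8 blocks * 1024 threads/block = 8192 threads; 192 of them fail the i < n check.
    vecAdd<<<numBlocks, blockSize>>>(d_A, d_B, d_C, n);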
9. The following table shows CUDA function declarations. State where each function is executed and where it is callable from.

Declaration                  Executed on the:    Only callable from the:
__host__ float FuncC()       Host                Host
__global__ void FuncB()      Device              Host
__device__ float FuncA()     Device              Device
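A minimal sketch illustrating the three qualifiers; the function bodies are placeholders:

    __device__ float FuncA() { return 1.0f; }   // executes on the device, callable only from device code
    __global__ void FuncB() { FuncA(); }        // executes on the device, launched from the host
    __host__ float FuncC() { return 2.0f; }     // executes on the host, callable only from host code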