
BCS3413 Principle & Applications of Parallel Programming

Quiz 2: GPGPU CUDA

1. If we want to allocate an array of v integer elements in CUDA device global memory, what would
be an appropriate expression for the second argument of the cudaMalloc() call?
(A) n
(B) v
(C) n * sizeof(int)
(D) v * sizeof(int)

Answer: (D)
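cudaMalloc() takes the allocation size in bytes as its second argument, so v integer elements need v * sizeof(int) bytes. A minimal sketch (the pointer name d_v is an assumption for illustration):

    int *d_v;
    // size in bytes: v elements, each sizeof(int) bytes
    cudaMalloc((void **)&d_v, v * sizeof(int));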

2. If we want to allocate an array of n floating-point elements and have a floating-point pointer
variable d_A to point to the allocated memory, what would be an appropriate expression for the
first argument of the cudaMalloc() call?
(A) n
(B) (void *) d_A
(C) *d_A
(D) (void **) &d_A

Answer: (D)

Explanation: &d_A is a pointer to a pointer of float. To convert it to the generic pointer type
required by cudaMalloc(), cast it with (void **) to a generic double-level pointer.
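A minimal sketch of the full call, using the names n and d_A from the question:

    float *d_A;
    // cudaMalloc() writes the device address into d_A itself, so it takes
    // the address of the pointer, cast to the generic void ** it expects
    cudaMalloc((void **)&d_A, n * sizeof(float));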

3. If we want to copy 3000 bytes of data from host array h_A (h_A is a pointer to element 0 of the
source array) to device array d_A (d_A is a pointer to element 0 of the destination array), what
would be an appropriate API call for this in CUDA?
(A) cudaMemcpy(3000, h_A, d_A, cudaMemcpyHostToDevice);
(B) cudaMemcpy(h_A, d_A, 3000, cudaMemcpyDeviceToHost);
(C) cudaMemcpy(d_A, h_A, 3000, cudaMemcpyHostToDevice);
(D) cudaMemcpy(3000, d_A, h_A, cudaMemcpyHostToDevice);

Answer: (C)

Explanation: See Lecture 2.2 slides.
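For reference, a sketch of the copy, assuming h_A and d_A have already been allocated as described in the question:

    // argument order: destination, source, size in bytes, direction
    cudaMemcpy(d_A, h_A, 3000, cudaMemcpyHostToDevice);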

4. How would one declare a variable err that can appropriately receive the returned value of a
CUDA API call?
(A) int err;
(B) cudaError err;
(C) cudaError_t err;
(D) cudaSuccess_t err;

Answer: (C)

Explanation: See Lecture 2.2 slides.
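A common error-checking pattern built around this declaration (a sketch, assuming <stdio.h> and <stdlib.h> are included and that d_A and size are defined as before):

    cudaError_t err = cudaMalloc((void **)&d_A, size);
    if (err != cudaSuccess) {
        // cudaGetErrorString() converts the error code to a readable message
        printf("%s in %s at line %d\n", cudaGetErrorString(err), __FILE__, __LINE__);
        exit(EXIT_FAILURE);
    }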

5. If we need to use each thread to calculate one output element of a vector addition, what would
be the expression for mapping the thread/block indices to the data index?
(A) i=threadIdx.x + threadIdx.y;
(B) i=blockIdx.x + threadIdx.x;
(C) i=blockIdx.x*blockDim.x + threadIdx.x;
(D) i=blockIdx.x * threadIdx.x;

Answer: (C)
Explanation: This is the case we covered in Lecture 2.3.
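A minimal vector-addition kernel using this mapping (the kernel name and bounds check are assumptions, not part of the quiz):

    __global__ void vecAdd(float *A, float *B, float *C, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // one output element per thread
        if (i < n)  // guard threads that fall past the end of the vector
            C[i] = A[i] + B[i];
    }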

6. We want to use each thread to calculate two (adjacent) output elements of a vector addition.
Assume that variable i should be the index for the first element to be processed by a thread. What
would be the expression for mapping the thread/block indices to data index of the first element?
(A) i=blockIdx.x*blockDim.x + threadIdx.x +2;
(B) i=blockIdx.x*threadIdx.x*2
(C) i=(blockIdx.x*blockDim.x + threadIdx.x)*2
(D) i=blockIdx.x*blockDim.x*2 + threadIdx.x

Answer: (C)

Explanation: Every thread covers two adjacent output elements, so the starting data index is
simply twice the global thread index. Another way to look at it: all previous blocks cover
blockIdx.x*blockDim.x*2 elements, and within the block each thread covers 2 elements, so the
beginning position for a thread is threadIdx.x*2 beyond that.
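A sketch of the adjacent-pair variant (hypothetical kernel name; each thread writes elements i and i+1):

    __global__ void vecAddPairs(float *A, float *B, float *C, int n) {
        int i = (blockIdx.x * blockDim.x + threadIdx.x) * 2;  // first of two adjacent elements
        if (i < n)     C[i]     = A[i]     + B[i];
        if (i + 1 < n) C[i + 1] = A[i + 1] + B[i + 1];
    }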

7. We want to use each thread to calculate two output elements of a vector addition. Each thread
block processes 2*blockDim.x consecutive elements that form two sections. All threads in each
block will first process a section, each processing one element. They will then all move to the next
section, again each processing one element. Assume that variable i should be the index for the
first element to be processed by a thread. What would be the expression for mapping the
thread/block indices to data index of the first element?
(A) i=blockIdx.x*blockDim.x + threadIdx.x +2;
(B) i=blockIdx.x*threadIdx.x*2
(C) i=(blockIdx.x*blockDim.x + threadIdx.x)*2
(D) i=blockIdx.x*blockDim.x*2 + threadIdx.x

Answer: (D)

Explanation: All previous blocks together cover blockIdx.x*blockDim.x*2 elements. The beginning
elements of the threads are consecutive in this case, so just add threadIdx.x to that offset.
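A sketch of the two-section variant (hypothetical kernel name; the thread's second element sits one block-width away, in the next section):

    __global__ void vecAddSections(float *A, float *B, float *C, int n) {
        int i = blockIdx.x * blockDim.x * 2 + threadIdx.x;  // element in the first section
        int j = i + blockDim.x;                             // element in the second section
        if (i < n) C[i] = A[i] + B[i];
        if (j < n) C[j] = A[j] + B[j];
    }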

8. For a vector addition, assume that the vector length is 8000, each thread calculates one output
element, and the thread block size is 1024 threads. The programmer configures the kernel launch
to have a minimal number of thread blocks to cover all output elements. How many threads will
be in the grid?
(A) 8000
(B) 8196
(C) 8192
(D) 8200

Answer: (C)

Explanation: ceil(8000/1024)*1024 = 8 * 1024 = 8192. Another way to look at it: the minimal
multiple of 1024 that covers 8000 is 1024*8 = 8192.
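The corresponding host-side launch (a sketch; the kernel name vecAdd follows the sketch after question 5, and d_A, d_B, d_C are assumed device arrays):

    int n = 8000;
    // (8000 + 1023) / 1024 = 8 blocks of 1024 threads = 8192 threads in the grid
    vecAdd<<<(n + 1024 - 1) / 1024, 1024>>>(d_A, d_B, d_C, n);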

9. The following table shows CUDA function declarations; state where each function is executed
and from where it is callable.

__host__ float FuncC()      Executed on: Host      Only callable from: Host
__global__ void FuncB()     Executed on: Device    Only callable from: Host
__device__ float FuncA()    Executed on: Device    Only callable from: Device