A GPGPU Compiler for Memory Optimization and Parallelism Management

Yi Yang, Dept. of ECE, North Carolina State University, yyang14@ncsu.edu
Ping Xiang, School of EECS, Univ. of Central Florida, xp@knights.ucf.edu
Jingfei Kong, School of EECS, Univ. of Central Florida, jfkong@cs.ucf.edu
Huiyang Zhou, Dept. of ECE, North Carolina State University, hzhou@ncsu.edu
Figure 13. Performance improvement of our optimized kernels over CUBLAS 2.2 implementations on GTX280.
Among the algorithms in Table 1, six are implemented in the CUDA CUBLAS library. In the next experiment, we compare our optimized kernels with the highly tuned CUBLAS v2.2 on GTX 280. Figure 13 shows the performance comparison of the algorithms with different input sizes. From Figure 13, we can see that the kernels optimized by our compiler achieve consistently better performance than CUBLAS 2.2 for transpose matrix vector multiplication (tmv), matrix vector multiplication (mv), vector vector multiplication (vv), and the matrix equation solver (strsm) for different input sizes. For matrix multiplication (mm) and reduction (rd), the performance of our optimized code is very close to CUBLAS 2.2 (within a 2% difference). On average (based on the geometric mean), our performance improvement over CUBLAS varies from 26% to 33% for different input sizes.

To study the effect of data vectorization, we chose the reduction (rd) algorithm, since rd is the only algorithm in our study that has a corresponding version for complex numbers (CublasScasum) in CUBLAS. We changed the naïve kernel of rd to process complex numbers by using two float-type variables to read the real (A[2*idx]) and imaginary (A[2*idx+1]) parts of a complex number instead of a single float2 variable. Then, we optimized this naïve kernel with and without the data vectorization step. For different input sizes, we compared the performance of the two optimized kernels (labeled 'optimized_wo_vec' and 'optimized', respectively) and the results are shown in Figure 14.
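As a rough illustration of the two access patterns (a minimal sketch, not the compiler-generated code; the kernel names, the partial-sum array "partial", and the guard on "n" are assumptions), the un-vectorized kernel issues two separate float loads per complex element, whereas the vectorized kernel issues a single float2 load:

// Un-vectorized: two separate float loads per complex element.
// Neighboring threads access A with a stride of two floats, so the loads are not coalesced.
__global__ void reduce_complex_novec(const float* A, float* partial, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        float re = A[2 * idx];       // real part
        float im = A[2 * idx + 1];   // imaginary part
        partial[idx] = fabsf(re) + fabsf(im);  // per-element work feeding the reduction tree
    }
}

// Vectorized: a single float2 load per complex element.
// Neighboring threads access consecutive float2 elements, so the loads are coalesced
// and the data go directly into registers.
__global__ void reduce_complex_vec(const float2* A, float* partial, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        float2 c = A[idx];
        partial[idx] = fabsf(c.x) + fabsf(c.y);
    }
}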
[Figure 14 chart: 'Reduction on GTX280'; y-axis: Performance (GFLOPS); series: 'optimized_wo_vec', 'optimized', 'cublas'.]
Figure 14. The effect of data vectorization on reduction with complex number inputs.

… Section 2. Another reason is the side effect of memory coalescing. Without data vectorization, the compiler recognizes that the array accesses to both the real and imaginary parts (A[2*idx] and A[2*idx+1]) are not coalesced, so it uses shared memory as temporary storage to generate coalesced memory accesses, as discussed in Section 3.3. In comparison, the access in the kernel after data vectorization, A[idx], is coalesced. As a result, the data are loaded directly into registers for computation. Although the compiler uses shared memory to improve memory reuse in both the vectorized and un-vectorized versions, there are more shared-memory accesses in the un-vectorized kernel ('optimized_wo_vec') due to the code transformation for coalescing. These extra shared-memory accesses contribute to the performance difference between the 'optimized_wo_vec' and 'optimized' kernels.
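To make the shared-memory staging of Section 3.3 concrete, the following is a hedged sketch for the un-vectorized layout (the tile size TILE, the kernel name, and the partial-sum array are assumptions, not the compiler's actual output): the thread block first fills a contiguous tile of floats with coalesced loads, and each thread then reads its real/imaginary pair from shared memory instead of global memory.

#define TILE 256  // threads per block (assumed)

// Shared-memory staging: consecutive threads load consecutive floats (coalesced),
// then each thread reads its own (real, imag) pair from shared memory.
__global__ void reduce_complex_staged(const float* A, float* partial, int n)
{
    __shared__ float tile[2 * TILE];
    int base = blockIdx.x * blockDim.x;   // first complex element handled by this block
    int idx  = base + threadIdx.x;

    // Coalesced loads: thread t reads floats (2*base + t) and (2*base + TILE + t).
    if (2 * base + threadIdx.x < 2 * n)
        tile[threadIdx.x] = A[2 * base + threadIdx.x];
    if (2 * base + TILE + threadIdx.x < 2 * n)
        tile[TILE + threadIdx.x] = A[2 * base + TILE + threadIdx.x];
    __syncthreads();

    // The strided reads now hit fast shared memory instead of global memory.
    if (idx < n) {
        float re = tile[2 * threadIdx.x];
        float im = tile[2 * threadIdx.x + 1];
        partial[idx] = fabsf(re) + fabsf(im);
    }
}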
Among all the kernels, transpose (tp) and matrix-vector multiplication (mv) exhibit the partition camping problem. Ruetsch and Micikevicius [12] proposed diagonal block reordering to address the issue for transpose, and their implementation is included in the latest CUDA SDK. In Figure 15, we compare the performance of our optimized kernel (labeled 'optimized') with theirs (labeled 'SDK new'), and we also include the previous CUDA SDK version for reference (labeled 'SDK prev'). Since tp does not have any floating-point operations, the effective bandwidth is used. From Figure 15, it can be seen that although our compiler uses the same approach to eliminate partition camping, the remaining optimizations applied by our compiler result in better performance than the version in the latest SDK.

[Figure 15 chart: 'Matrix transpose on GTX280'; y-axis: effective bandwidth; series: 'SDK prev', 'SDK new', 'optimized'.]
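The diagonal block reordering referred to above can be sketched as follows (a simplified version in the spirit of the SDK kernel, not the compiler's or the SDK's exact code; the tile size and the square-grid assumption are simplifications): block indices are remapped along diagonals so that blocks scheduled at the same time spread their accesses across different memory partitions.

#define TILE_DIM 16  // tile size; kernel launched with a (TILE_DIM x TILE_DIM) thread block

// Diagonal block reordering for matrix transpose: remap (blockIdx.x, blockIdx.y)
// so that concurrently running blocks touch different memory partitions.
__global__ void transpose_diagonal(float* out, const float* in, int width, int height)
{
    // Diagonal remapping (square grid assumed for brevity).
    int blockIdx_y = blockIdx.x;
    int blockIdx_x = (blockIdx.x + blockIdx.y) % gridDim.x;

    __shared__ float tile[TILE_DIM][TILE_DIM + 1];  // +1 padding avoids bank conflicts

    // Coalesced read of one input tile into shared memory.
    int xIn = blockIdx_x * TILE_DIM + threadIdx.x;
    int yIn = blockIdx_y * TILE_DIM + threadIdx.y;
    if (xIn < width && yIn < height)
        tile[threadIdx.y][threadIdx.x] = in[yIn * width + xIn];
    __syncthreads();

    // Coalesced write of the transposed tile.
    int xOut = blockIdx_y * TILE_DIM + threadIdx.x;
    int yOut = blockIdx_x * TILE_DIM + threadIdx.y;
    if (xOut < height && yOut < width)
        out[yOut * height + xOut] = tile[threadIdx.x][threadIdx.y];
}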
In mv, the thread blocks are in one dimension. Therefore, diagonal block reordering cannot be applied. Our compiler uses the …

[Figure chart: 'Matrix vector multiplication on GTX 280'; y-axis: GFLOPS; series: 'Naïve', 'Opti_PC', 'Optimized', 'CUBLAS2.2'; x-axis (matrix size): 2kx2k, 2kx4k, 2kx8k, 2kx16k, 2kx32k, 2kx64k, 4kx4k, 3kx3k.]