Abstract
The choice of thread-block size and shape is one of the most important user decisions when a parallel problem is written for any CUDA architecture. The reason is that thread-block geometry has a significant impact on the global performance of the program. Unfortunately, programmers do not have enough information about the subtle interactions between this choice of parameters and the underlying hardware.
This paper presents uBench, a complete suite of micro-benchmarks, in order to explore the impact on performance of (1) the thread-block geometry choice criteria, and (2) the GPU hardware resources and configurations. Each micro-benchmark has been designed to be as simple as possible to focus on a single effect derived from the hardware and thread-block parameter choice.
As an example of the capabilities of this benchmark suite, this paper shows an experimental evaluation and comparison of Fermi and Kepler architectures. Our study reveals that, in spite of the new hardware details introduced by Kepler, the principles underlying the block geometry selection criteria are similar for both architectures.
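To make "thread-block size and shape" concrete, the following minimal sketch lists the two-dimensional shapes available for one fixed block size. It is an illustration only and not part of uBench: the helper names `block_shapes` and `warps_per_block` are hypothetical, and the constants are the warp size (32) and the per-block thread limit (1024) that CUDA documents for Fermi and Kepler devices.

```python
# Sketch: enumerate the 2D shapes of a thread block with a fixed size.
# Assumptions: the constants are the documented CUDA limits for Fermi and
# Kepler; the function names are hypothetical and not taken from uBench.

WARP_SIZE = 32
MAX_THREADS_PER_BLOCK = 1024


def block_shapes(threads_per_block):
    """Return every (rows, columns) pair whose product is threads_per_block."""
    if not 1 <= threads_per_block <= MAX_THREADS_PER_BLOCK:
        raise ValueError("block size outside the 1..1024 range")
    return [(rows, threads_per_block // rows)
            for rows in range(1, threads_per_block + 1)
            if threads_per_block % rows == 0]


def warps_per_block(threads_per_block):
    """Number of warps needed to hold the block (ceiling division)."""
    return (threads_per_block + WARP_SIZE - 1) // WARP_SIZE


if __name__ == "__main__":
    # All listed shapes hold 256 threads, so the size is constant and
    # only the geometry changes from one line to the next.
    for rows, cols in block_shapes(256):
        print(f"{rows:4d} x {cols:<4d} -> {warps_per_block(rows * cols)} warps")
```

Every shape in the list holds the same number of threads and warps, so the size is fixed and only the geometry varies. That is the kind of parameter whose performance effect the suite is designed to isolate.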
Acknowledgements
This research is partly supported by the Ministerio de Industria, Spain (CENIT OCEANLIDER), MINECO (Spain) and the European Union FEDER (MOGECOPP project TIN2011-25639, CAPAP-H network TIN2010-12011-E and TIN2011-15734-E), Junta de Castilla y León (VA172A12-2), and the HPC-EUROPA2 project (project number: 228398) with the support of the European Commission—Capacities Area—Research Infrastructures Initiative.
Cite this article
Torres, Y., Gonzalez-Escribano, A. & Llanos, D.R. uBench: exposing the impact of CUDA block geometry in terms of performance. J Supercomput 65, 1150–1163 (2013). https://doi.org/10.1007/s11227-013-0921-z
DOI: https://doi.org/10.1007/s11227-013-0921-z