Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
Skip to main content

uBench: exposing the impact of CUDA block geometry in terms of performance

  • Published:
The Journal of Supercomputing Aims and scope Submit manuscript

Abstract

The choice of thread-block size and shape is one of the most important user decisions when a parallel problem is written for any CUDA architecture. The reason is that thread-block geometry has a significant impact on the global performance of the program. Unfortunately, the programmer has not enough information about the subtle interactions between this choice of parameters and the underlying hardware.

This paper presents uBench, a complete suite of micro-benchmarks, in order to explore the impact on performance of (1) the thread-block geometry choice criteria, and (2) the GPU hardware resources and configurations. Each micro-benchmark has been designed to be as simple as possible to focus on a single effect derived from the hardware and thread-block parameter choice.

As an example of the capabilities of this benchmark suite, this paper shows an experimental evaluation and comparison of Fermi and Kepler architectures. Our study reveals that, in spite of the new hardware details introduced by Kepler, the principles underlying the block geometry selection criteria are similar for both architectures.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Subscribe and save

Springer+ Basic
$34.99 /Month
  • Get 10 units per month
  • Download Article/Chapter or eBook
  • 1 Unit = 1 Article or 1 Chapter
  • Cancel anytime
Subscribe now

Buy Now

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Fig. 1

Similar content being viewed by others

References

  1. Torres Y, Gonzalez-Escribano A, Llanos DR (2012) Using Fermi architecture knowledge to speed up CUDA and OpenCL programs. In: Proc. ISPA’2012, Leganes, Madrid, Spain, 2012

    Google Scholar 

  2. NVIDIA (2010) NVIDIA CUDA programming guide 3.0 Fermi

  3. NVIDIA (2012) NVIDIA CUDA programming guide 4.2: Kepler

  4. Kirk DB, Hwu WW (2010) Programming massively parallel processors: a hands-on approach, February 2010. Morgan Kaufmann, San Mateo

    Google Scholar 

  5. Ryoo S, Rodrigues CI, Baghsorkhi SS, Stone SS, Kirk DB, Hwu WW (2008) Optimization principles and application performance evaluation of a multithreaded GPU using CUDA. In: Proc. PPoPP’08, Salt Lake City, UT, USA, pp 73–82

    Google Scholar 

  6. Xiang Cui CZ, Chen Y, Mei H (2010) Auto-tuning dense matrix multiplication for GPGPU with cache. In: Proc. ICPADS’2010, Shanghai, China, December 2010, pp 237–242

    Google Scholar 

  7. Torres Y, Gonzalez-Escribano A, Llanos DR (2011) Understanding the impact of CUDA tuning techniques for Fermi. In: Intl. conf. on high performance computing and simulation, HPCS 2011, pp 631–639

    Chapter  Google Scholar 

  8. Wong H, Papadopoulou M-M, Sadooghi-Alvandi M, Moshovos A (2010) Demystifying GPU microarchitecture through microbenchmarking. In: Proc. ISPASS’2010, March 2010, pp 235–246

    Google Scholar 

  9. Zhang Y, Owens J (2011) A quantitative performance analysis model for gpu architectures. In: Proc. HPCA’2011, February 2011, pp 382–393

    Google Scholar 

  10. NVIDIA (2012) NVIDIA’s Next Generation CUDA Compute Architecture: Kepler GK110. Last visit: June 2012. http://www.nvidia.com/content/PDF/kepler/NVIDIA-Kepler-GK110-Architecture-Whitepaper.pdf

  11. Greg Ruetsch PM (2010) NVIDIA optimizing matrix transpose in CUDA, June 2010. Last visit: December 2, 2010. http://developer.download.nvidia.com/compute/cuda/3_0/sdk/website/CUDA/website/C/src/transposeNew/doc/MatrixTranspose.pdf

  12. Aji AM, Daga M, Feng W-c (2011) Bounding the effect of partition camping in GPU kernels. In: Proc. 8th ACM int. conf. on computing frontiers, ser. CF’11. ACM, New York, pp 27:1–27:10 (online). Available: http://doi.acm.org/10.1145/2016604.2016637

    Google Scholar 

  13. Cannon LE (1969) A cellular computer to implement the Kalman filter algorithm. Ph.D. dissertation, Montana State University, 1969 (online). Available: http://portal.acm.org/citation.cfm?coll=GUIDE/&dl=GUIDE/&id=905686

  14. Torres Y, Gonzalez-Escribano A, Llanos DR (2012) uBench: performance impact of CUDA block geometry. Dept. Informatica, Universidad de Valladolid, Tech. Rep. IT-DI-2012-0001, December 2012. http://www.infor.uva.es/investigacion/publicaciones.html

Download references

Acknowledgements

This research is partly supported by the Ministerio de Industria, Spain (CENIT OCEANLIDER), MINECO (Spain) and the European Union FEDER (MOGECOPP project TIN2011-25639, CAPAP-H network TIN2010-12011-E and TIN2011-15734-E), Junta de Castilla y León (VA172A12-2), and the HPC-EUROPA2 project (project number: 228398) with the support of the European Commission—Capacities Area—Research Infrastructures Initiative.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Arturo Gonzalez-Escribano.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Torres, Y., Gonzalez-Escribano, A. & Llanos, D.R. uBench: exposing the impact of CUDA block geometry in terms of performance. J Supercomput 65, 1150–1163 (2013). https://doi.org/10.1007/s11227-013-0921-z

Download citation

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s11227-013-0921-z

Keywords