research-article
Open access

Scratchpad Sharing in GPUs

Published: 26 May 2017

Abstract

General-Purpose Graphics Processing Unit (GPGPU) applications exploit on-chip scratchpad memory available in the Graphics Processing Units (GPUs) to improve performance. The amount of thread level parallelism (TLP) present in the GPU is limited by the number of resident threads, which in turn depends on the availability of scratchpad memory in its streaming multiprocessor (SM). Since the scratchpad memory is allocated at thread block granularity, part of the memory may remain unutilized. In this article, we propose architectural and compiler optimizations to improve the scratchpad memory utilization. Our approach, called Scratchpad Sharing, addresses scratchpad under-utilization by launching additional thread blocks in each SM. These thread blocks use unutilized scratchpad memory and also share scratchpad memory with other resident blocks. To improve the performance of scratchpad sharing, we propose Owner Warp First (OWF) scheduling that schedules warps from the additional thread blocks effectively. The performance of this approach, however, is limited by the availability of the part of scratchpad memory that is shared among thread blocks.
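The under-utilization described above can be illustrated with a small arithmetic sketch. The function names and the 16 KB / 7 KB figures are illustrative assumptions, not values from the article, and the model ignores every other residency limit (registers, thread count, maximum blocks per SM). It only shows how block-granularity allocation leaves scratchpad unused, and how much scratchpad one additional block would then have to share with a resident block.

```python
def scratchpad_occupancy(sm_scratchpad_bytes, block_scratchpad_bytes):
    """Return (resident_blocks, unused_bytes) when scratchpad is
    allocated at thread-block granularity (hypothetical helper)."""
    if block_scratchpad_bytes <= 0:
        raise ValueError("block_scratchpad_bytes must be positive")
    # Only whole blocks fit; the remainder cannot host another full block.
    resident_blocks = sm_scratchpad_bytes // block_scratchpad_bytes
    unused_bytes = sm_scratchpad_bytes - resident_blocks * block_scratchpad_bytes
    return resident_blocks, unused_bytes


def bytes_to_share(sm_scratchpad_bytes, block_scratchpad_bytes):
    """Bytes that one additional block would need to share with a resident
    block after taking all of the unused scratchpad (simplified model)."""
    _, unused_bytes = scratchpad_occupancy(sm_scratchpad_bytes,
                                           block_scratchpad_bytes)
    return block_scratchpad_bytes - unused_bytes


# Assumed figures: 16 KB of scratchpad per SM, 7 KB requested per block.
blocks, unused = scratchpad_occupancy(16 * 1024, 7 * 1024)
shared = bytes_to_share(16 * 1024, 7 * 1024)
print(blocks, unused, shared)
```

Under these assumed figures, two blocks are resident and 2 KB stays idle; an additional block could use that 2 KB but would need to share the remaining 5 KB with a resident block.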
We propose compiler optimizations to improve the availability of shared scratchpad memory. We describe an allocation scheme that places scratchpad variables so that the shared scratchpad is accessed for only a short duration. We introduce a new hardware instruction, relssp, that releases the shared scratchpad memory when executed. Finally, we describe an analysis for optimal placement of relssp instructions, such that shared scratchpad memory is released as early as possible, but only after its last use, along every execution path.
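The placement goal (release as early as possible, but only after the last use on every path) can be sketched as a backward data-flow problem over a control flow graph. The code below is a simplified, block-level illustration and not the analysis from the article: the function name, the dictionary encoding of the CFG, and the placement labels are all assumptions made for this example. It marks each block from which a shared-scratchpad access may still be reached, then places a release wherever that property changes from true to false.

```python
def relssp_placements(cfg, entry, shared_access):
    """Hypothetical sketch of release placement.

    cfg: dict mapping each basic block to a list of its successors
         (every successor must also be a key of cfg).
    entry: the entry block.
    shared_access: set of blocks that access shared scratchpad.
    Returns a sorted list of ("entry", b), ("end", b) or ("edge", b, s).
    """
    # may_in[b]: some path starting at the top of b still accesses
    # shared scratchpad. Computed by iterating to a fixed point.
    may_in = {b: False for b in cfg}
    changed = True
    while changed:
        changed = False
        for b in cfg:
            new = b in shared_access or any(may_in[s] for s in cfg[b])
            if new != may_in[b]:
                may_in[b] = new
                changed = True
    # may_out[b]: an access may still happen after b finishes.
    may_out = {b: any(may_in[s] for s in cfg[b]) for b in cfg}

    placements = []
    if not may_in[entry]:
        # Shared scratchpad is never accessed: release immediately.
        placements.append(("entry", entry))
    for b in cfg:
        if may_in[b] and not may_out[b]:
            # The last possible access is inside b: release at its end.
            placements.append(("end", b))
        for s in cfg[b]:
            if may_out[b] and not may_in[s]:
                # This branch leaves the region that still needs the memory.
                placements.append(("edge", b, s))
    return sorted(placements)


# Diamond-shaped CFG in which only block B touches shared scratchpad.
diamond = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
print(relssp_placements(diamond, "A", {"B"}))
```

In the diamond example the sketch releases at the end of B on one path and on the A-to-C edge on the other, so each path releases exactly once and never before an access. A real implementation would work at instruction granularity and would have to account for warps of the same thread block taking different paths.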
We implemented the hardware changes required for scratchpad sharing and the relssp instruction in the GPGPU-Sim simulator, and implemented the compiler optimizations in the Ocelot framework. We evaluated the effectiveness of our approach on 19 kernels from three benchmark suites: CUDA-SDK, GPGPU-Sim, and Rodinia. The kernels that under-utilize scratchpad memory show an average improvement of 19% and a maximum improvement of 92.17% in the number of instructions executed per cycle compared to the baseline approach, without affecting the performance of kernels that are not limited by scratchpad memory.

Information

Published In

ACM Transactions on Architecture and Code Optimization, Volume 14, Issue 2
June 2017
259 pages
ISSN:1544-3566
EISSN:1544-3973
DOI:10.1145/3086564
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 26 May 2017
Accepted: 01 March 2017
Revised: 01 February 2017
Received: 01 July 2016
Published in TACO Volume 14, Issue 2

Author Tags

  1. Scratchpad sharing
  2. code motion
  3. control flow graph
  4. thread level parallelism

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

  • TCS Ph.D. fellowship
