
Thread Batching for High-performance Energy-efficient GPU Memory Design

Published: 16 December 2019

Abstract

Massive multi-threading in GPUs imposes tremendous pressure on memory subsystems. Because thread-level parallelism in GPUs grows rapidly while peak memory bandwidth improves only slowly, memory has become a bottleneck for GPU performance and energy efficiency. In this article, we propose an integrated architectural scheme to optimize memory accesses and thereby boost the performance and energy efficiency of GPUs. First, we propose thread batch enabled memory partitioning (TEMP) to improve GPU memory access parallelism. In particular, TEMP groups multiple thread blocks that share the same set of pages into a thread batch and applies a page coloring mechanism to bind each streaming multiprocessor (SM) to dedicated memory banks. TEMP then dispatches each thread batch to an SM to ensure highly parallel memory-access streaming from the different thread blocks. Second, a thread batch-aware scheduling (TBAS) scheme is introduced to improve GPU memory access locality and to reduce contention on memory controllers and interconnection networks. Experimental results show that the integration of TEMP and TBAS can achieve up to 10.3% performance improvement and 11.3% DRAM energy reduction across diverse GPU applications. We also evaluate the performance interference of mixed CPU+GPU workloads running on a heterogeneous system that employs the proposed schemes. Our results show that a simple solution can effectively ensure the efficient execution of both GPU and CPU applications.
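
As a concrete illustration of the thread-batching and page-coloring steps described above, the following minimal Python sketch groups thread blocks that access the same set of pages into a thread batch and dispatches each batch to the SM whose page color owns those pages. The data layout (block_pages), the modulo coloring rule, and the majority-color dispatch policy are simplifying assumptions made here for exposition; they are not the paper's actual TEMP implementation.

    from collections import defaultdict

    NUM_SMS = 4  # assumed number of streaming multiprocessors (illustrative only)

    def page_color(page_id):
        """Assumed page-coloring rule: map a physical page to an SM color by modulo."""
        return page_id % NUM_SMS

    def batch_thread_blocks(block_pages):
        """Group thread blocks that share the same page set into one thread batch."""
        batches = defaultdict(list)  # frozenset of pages -> list of block ids
        for block_id, pages in block_pages.items():
            batches[frozenset(pages)].append(block_id)
        return batches

    def dispatch(batches):
        """Send each batch to the SM that owns the dominant color of its pages."""
        schedule = defaultdict(list)  # SM id -> list of thread batches
        for pages, blocks in batches.items():
            colors = [page_color(p) for p in pages]
            schedule[max(set(colors), key=colors.count)].append(blocks)
        return dict(schedule)

    # Toy example: blocks 0 and 1 share pages {0, 4}; block 2 only touches page 1.
    block_pages = {0: {0, 4}, 1: {0, 4}, 2: {1}}
    print(dispatch(batch_thread_blocks(block_pages)))
    # e.g., {0: [[0, 1]], 1: [[2]]} -- shared-page blocks form one batch on one SM

In this toy run, blocks 0 and 1 become a single batch bound to SM 0 (pages 0 and 4 both map to color 0), which mirrors the intent of TEMP: requests from co-scheduled thread blocks stream into the banks dedicated to their SM rather than scattering across the memory system.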


Cited By

  • (2022) POTDP: Research GPU Performance Optimization Method based on Thread Dynamic Programming. 2022 IEEE 4th International Conference on Power, Intelligent Computing and Systems (ICPICS), 490-495. https://doi.org/10.1109/ICPICS55264.2022.9873685. Online publication date: 29-Jul-2022.
  • (2020) Blind Image Quality Assessment by Natural Scene Statistics and Perceptual Characteristics. ACM Transactions on Multimedia Computing, Communications, and Applications 16, 3, 1-91. https://doi.org/10.1145/3414837. Online publication date: 25-Aug-2020.

Information

Published In

ACM Journal on Emerging Technologies in Computing Systems  Volume 15, Issue 4
Special Issue on HALO for Energy-Constrained On-Chip Machine Learning, Part 2 and Regular Papers
October 2019
226 pages
ISSN:1550-4832
EISSN:1550-4840
DOI:10.1145/3365594
  • Editor: Ramesh Karri
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 16 December 2019
Accepted: 01 May 2019
Revised: 01 April 2019
Received: 01 January 2019
Published in JETC Volume 15, Issue 4


Author Tags

  1. GPU
  2. memory partitioning
  3. thread batch
  4. warp scheduler

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

  • NRC Associate Fellowship Award
  • U.S. National Science Foundation
  • U.S. Department of Energy (DOE)


