
Thread Batching for High-performance Energy-efficient GPU Memory Design

Published: 16 December 2019

Abstract

Massive multi-threading in GPUs imposes tremendous pressure on memory subsystems. Because thread-level parallelism in GPUs grows rapidly while peak memory bandwidth improves only slowly, memory has become a bottleneck for GPU performance and energy efficiency. In this article, we propose an integrated architectural scheme to optimize memory accesses and thereby boost the performance and energy efficiency of GPUs. First, we propose thread batch enabled memory partitioning (TEMP) to improve GPU memory access parallelism. In particular, TEMP groups multiple thread blocks that share the same set of pages into a thread batch and applies a page coloring mechanism to bind each streaming multiprocessor (SM) to dedicated memory banks. TEMP then dispatches each thread batch to an SM to ensure highly parallel memory-access streaming from the different thread blocks. Second, a thread batch-aware scheduling (TBAS) scheme is introduced to improve GPU memory access locality and to reduce contention on memory controllers and interconnection networks. Experimental results show that the integration of TEMP and TBAS can achieve up to 10.3% performance improvement and 11.3% DRAM energy reduction across diverse GPU applications. We also evaluate the performance interference of mixed CPU+GPU workloads running on a heterogeneous system that employs the proposed schemes. Our results show that a simple solution can effectively ensure the efficient execution of both GPU and CPU applications.
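
As a concrete illustration of the thread-batching and page-coloring steps described above, the following minimal Python sketch groups thread blocks that access the same set of pages into a thread batch and dispatches each batch to the SM whose page color owns those pages. The data layout (block_pages), the modulo coloring rule, and the majority-color dispatch policy are simplifying assumptions made here for exposition; they are not the paper's actual TEMP implementation.

    from collections import defaultdict

    NUM_SMS = 4  # assumed number of streaming multiprocessors (illustrative only)

    def page_color(page_id):
        """Assumed page-coloring rule: map a physical page to an SM color by modulo."""
        return page_id % NUM_SMS

    def batch_thread_blocks(block_pages):
        """Group thread blocks that share the same page set into one thread batch."""
        batches = defaultdict(list)  # frozenset of pages -> list of block ids
        for block_id, pages in block_pages.items():
            batches[frozenset(pages)].append(block_id)
        return batches

    def dispatch(batches):
        """Send each batch to the SM that owns the dominant color of its pages."""
        schedule = defaultdict(list)  # SM id -> list of thread batches
        for pages, blocks in batches.items():
            colors = [page_color(p) for p in pages]
            schedule[max(set(colors), key=colors.count)].append(blocks)
        return dict(schedule)

    # Toy example: blocks 0 and 1 share pages {0, 4}; block 2 only touches page 1.
    block_pages = {0: {0, 4}, 1: {0, 4}, 2: {1}}
    print(dispatch(batch_thread_blocks(block_pages)))
    # e.g., {0: [[0, 1]], 1: [[2]]} -- shared-page blocks form one batch on one SM

In this toy run, blocks 0 and 1 become a single batch bound to SM 0 (pages 0 and 4 both map to color 0), which mirrors the intent of TEMP: requests from co-scheduled thread blocks stream into the banks dedicated to their SM rather than scattering across the memory system.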


Cited By

  • (2022) POTDP: Research GPU Performance Optimization Method based on Thread Dynamic Programming. 2022 IEEE 4th International Conference on Power, Intelligent Computing and Systems (ICPICS), 490-495. https://doi.org/10.1109/ICPICS55264.2022.9873685. Online publication date: 29-Jul-2022.
  • (2020) Blind Image Quality Assessment by Natural Scene Statistics and Perceptual Characteristics. ACM Transactions on Multimedia Computing, Communications, and Applications 16, 3, 1-91. https://doi.org/10.1145/3414837. Online publication date: 25-Aug-2020.

Information

Published In

ACM Journal on Emerging Technologies in Computing Systems  Volume 15, Issue 4
Special Issue on HALO for Energy-Constrained On-Chip Machine Learning, Part 2 and Regular Papers
October 2019
226 pages
ISSN:1550-4832
EISSN:1550-4840
DOI:10.1145/3365594
  • Editor: Ramesh Karri
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 16 December 2019
Accepted: 01 May 2019
Revised: 01 April 2019
Received: 01 January 2019
Published in JETC Volume 15, Issue 4


Author Tags

  1. GPU
  2. memory partitioning
  3. thread batch
  4. warp scheduler

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

  • NRC Associate Fellowship Award
  • U.S. National Science Foundation
  • U.S. Department of Energy (DOE)


