OWL: cooperative thread array aware scheduling techniques for improving GPGPU performance

Published: 16 March 2013

Abstract

Emerging GPGPU architectures, along with programming models like CUDA and OpenCL, offer a cost-effective platform for many applications by providing high thread-level parallelism at lower energy budgets. Unfortunately, for many general-purpose applications, the available hardware resources of a GPGPU are not efficiently utilized, resulting in lost opportunities to improve performance. A major cause of this is the inability of current warp scheduling policies to tolerate long memory latencies.
In this paper, we identify that the scheduling decisions made by such policies are agnostic to thread-block, or cooperative thread array (CTA), behavior, and are therefore inefficient. We present a coordinated CTA-aware scheduling policy that uses four schemes to minimize the impact of long memory latencies. The first two schemes, CTA-aware two-level warp scheduling and locality-aware warp scheduling, enhance per-core performance by effectively reducing cache contention and improving latency-hiding capability. The third scheme, bank-level parallelism-aware warp scheduling, improves overall GPGPU performance by enhancing DRAM bank-level parallelism. The fourth scheme employs opportunistic memory-side prefetching to further improve performance by taking advantage of open DRAM rows. Evaluations on a 28-core GPGPU platform with highly memory-intensive applications indicate that our proposed mechanism provides a 33% average performance improvement over the commonly employed round-robin warp scheduling policy.
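
To make the first scheme concrete, here is a minimal, illustrative sketch of CTA-aware two-level warp scheduling, not the paper's hardware implementation: the Warp class, the CTA_GROUP_SIZE parameter, and the oldest-first tie-breaking rule are assumptions made for this example. It demonstrates the two levels of the policy: only a small group of CTAs may issue at a time (so the remaining CTAs reach their long-latency loads later, leaving warps available to hide those latencies), and warps of the same CTA are kept together to preserve cache locality.

```python
# Illustrative sketch only; simplified single-core model, not the paper's design.
from collections import deque

CTA_GROUP_SIZE = 2  # number of CTAs allowed to issue concurrently (assumed)

class Warp:
    """A hardware warp; cta_id ties it to its cooperative thread array."""
    def __init__(self, warp_id, cta_id, stalled=False):
        self.warp_id = warp_id
        self.cta_id = cta_id
        self.stalled = stalled  # e.g., waiting on a long-latency load

def select_warp(warps, cta_order):
    """Two-level selection: (1) restrict issue to a small active group of
    CTAs; (2) pick among ready warps inside that group. If every warp of
    the active group is stalled, rotate to the next CTA group so its
    memory requests are issued early enough to be overlapped."""
    by_cta = {}
    for w in warps:
        by_cta.setdefault(w.cta_id, []).append(w)

    for _ in range(len(cta_order)):
        active = list(cta_order)[:CTA_GROUP_SIZE]
        ready = [w for c in active for w in by_cta.get(c, []) if not w.stalled]
        if ready:
            return min(ready, key=lambda w: w.warp_id)  # oldest ready warp
        cta_order.rotate(-1)  # whole group stalled: switch CTA groups
    return None  # all warps stalled; the core idles this cycle

# Usage: 4 CTAs x 2 warps each; CTA 0's warps are stalled on memory,
# so issue falls through to CTA 1 within the active group.
warps = [Warp(i, i // 2, stalled=(i < 2)) for i in range(8)]
order = deque(sorted({w.cta_id for w in warps}))
print(select_warp(warps, order).warp_id)  # -> 2 (first ready warp of CTA 1)
```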
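The memory-side prefetching scheme can be sketched in the same spirit. The model below rests on simplifying assumptions (one open row per bank, a fixed number of cache lines per row; the Bank and MemoryController classes are hypothetical), not the paper's design. It captures the "opportunistic" property the abstract describes: prefetch candidates are generated only for lines of a row that a demand request has already opened, and they are issued only when the bank is idle and the row is still open, so every prefetch is a cheap row-buffer hit that never delays a demand request.

```python
# Illustrative sketch only; simplified DRAM controller model, not the paper's design.
LINES_PER_ROW = 32  # cache lines per DRAM row (assumed)

class Bank:
    def __init__(self):
        self.open_row = None  # row currently held in the row buffer
        self.busy = False     # serving a demand request this cycle

class MemoryController:
    def __init__(self, num_banks=8):
        self.banks = [Bank() for _ in range(num_banks)]
        self.candidates = []  # (bank_id, row, line) prefetch candidates

    def service_demand(self, bank_id, row, line):
        """Serve a demand miss; remember which row it left open and
        enqueue the row's remaining lines as prefetch candidates."""
        self.banks[bank_id].open_row = row
        self.candidates.extend(
            (bank_id, row, other)
            for other in range(LINES_PER_ROW) if other != line)

    def opportunistic_prefetch(self):
        """Issue at most one prefetch, and only if its bank is idle and
        its row is still open (a guaranteed row-buffer hit). Candidates
        whose row has since been closed are silently dropped."""
        still_valid = []
        issued = None
        for bank_id, row, line in self.candidates:
            bank = self.banks[bank_id]
            if bank.open_row != row:
                continue  # row closed since enqueue: drop the candidate
            if issued is None and not bank.busy:
                issued = (bank_id, row, line)
            else:
                still_valid.append((bank_id, row, line))
        self.candidates = still_valid
        return issued

# Usage: a demand access opens row 7 in bank 0; idle cycles then drain
# the same row's other lines as free row-buffer hits.
mc = MemoryController()
mc.service_demand(bank_id=0, row=7, line=3)
print(mc.opportunistic_prefetch())  # -> (0, 7, 0)
```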

Published In

ACM SIGARCH Computer Architecture News, Volume 41, Issue 1
ASPLOS '13
March 2013
540 pages
ISSN: 0163-5964
DOI: 10.1145/2490301

ASPLOS '13: Proceedings of the Eighteenth International Conference on Architectural Support for Programming Languages and Operating Systems
March 2013
574 pages
ISBN: 9781450318709
DOI: 10.1145/2451116
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 16 March 2013
Published in SIGARCH Volume 41, Issue 1

Author Tags

  1. GPGPUs
  2. latency tolerance
  3. prefetching
  4. scheduling

Qualifiers

  • Research-article

Cited By

  • SAWS. Proceedings of the 48th International Symposium on Microarchitecture, pp. 383-394 (Dec. 2015). DOI: 10.1145/2830772.2830822
  • Architecting the Last-Level Cache for GPUs using STT-RAM Technology. ACM Transactions on Design Automation of Electronic Systems 20(4), pp. 1-24 (Sep. 2015). DOI: 10.1145/2764905
  • Foreseer: Knowledge-Driven Acceleration of Memory-Bound Matrix Multiplications for Large Language Model Inference. Proceedings of the 17th ACM International Systems and Storage Conference, pp. 53-67 (Sep. 2024). DOI: 10.1145/3688351.3689153
  • Memento: An Adaptive, Compiler-Assisted Register File Cache for GPUs. 2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA), pp. 978-990 (Jun. 2024). DOI: 10.1109/ISCA59077.2024.00075
  • GhOST: A GPU Out-of-Order Scheduling Technique for Stall Reduction. 2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA), pp. 1-16 (Jun. 2024). DOI: 10.1109/ISCA59077.2024.00011
  • TensorCache: Reconstructing Memory Architecture With SRAM-Based In-Cache Computing for Efficient Tensor Computations in GPGPUs. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 31(12), pp. 2030-2043 (Dec. 2023). DOI: 10.1109/TVLSI.2023.3326741
  • LAS: Locality-Aware Scheduling for GEMM-Accelerated Convolutions in GPUs. IEEE Transactions on Parallel and Distributed Systems 34(5), pp. 1479-1494 (May 2023). DOI: 10.1109/TPDS.2023.3247808
  • TREFU: An Online Error Detecting and Correcting Fault Tolerant GPGPU Architecture. 2023 IEEE 29th International Symposium on On-Line Testing and Robust System Design (IOLTS), pp. 1-7 (Jul. 2023). DOI: 10.1109/IOLTS59296.2023.10224865
  • LWSDP: Locality-Aware Warp Scheduling and Dynamic Data Prefetching Co-design in the Per-SM Private Cache of GPGPUs. 2023 IEEE 29th International Conference on Parallel and Distributed Systems (ICPADS), pp. 1230-1237 (Dec. 2023). DOI: 10.1109/ICPADS60453.2023.00178
  • Re-Cache: Mitigating cache contention by exploiting locality characteristics with reconfigurable memory hierarchy for GPGPUs. Microelectronics Journal 138, article 105825 (Aug. 2023). DOI: 10.1016/j.mejo.2023.105825
