DOI: 10.1145/2749469.2750418
Research article

CAWA: coordinated warp scheduling and cache prioritization for critical warp acceleration of GPGPU workloads

Published: 13 June 2015

Abstract

The ubiquity of graphics processing unit (GPU) architectures has made them efficient alternatives to chip multiprocessors for parallel workloads. GPUs achieve superior performance through massive multi-threading and fast context switching, which hide pipeline stalls and memory access latency. However, recent characterization results have shown that general-purpose GPU (GPGPU) applications commonly encounter long stall latencies that cannot be easily hidden even with a large number of concurrent threads/warps. The result is a significant execution-time disparity between parallel warps that hurts overall GPU performance -- the warp criticality problem.
To tackle the warp criticality problem, we propose a coordinated solution, criticality-aware warp acceleration (CAWA), that efficiently manages compute and memory resources to accelerate critical warp execution. Specifically, we design (1) an instruction-based and stall-based criticality predictor to identify the critical warp in a thread block, (2) a criticality-aware warp scheduler that preferentially allocates more time resources to the critical warp, and (3) a criticality-aware cache reuse predictor that assists critical warp acceleration by retaining latency-critical and useful cache blocks in the L1 data cache. CAWA aims to eliminate this execution-time disparity and thereby improve resource utilization for GPGPU workloads. Our evaluation shows that the proposed coordinated scheduler and cache prioritization scheme improves GPGPU workload performance by 23%, whereas the state-of-the-art GTO and 2-level schedulers improve performance by 16% and -2%, respectively.
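To illustrate the flavor of the three mechanisms the abstract describes, the following is a minimal, hypothetical sketch of (a) a greedy criticality-aware scheduler that always issues from the warp predicted to finish last, and (b) a cache-victim selection that protects the critical warp's reusable blocks. The scoring formula, the weights `w_inst`/`w_stall`, and all class and function names are illustrative assumptions; the paper's actual predictor and replacement logic are not specified in the abstract.

```python
# Hypothetical sketch of criticality-aware warp scheduling and cache
# prioritization in the spirit of CAWA. Weights and structures are
# assumptions, not the paper's actual design.
from dataclasses import dataclass

@dataclass
class Warp:
    wid: int
    remaining_insts: int   # instructions left to execute (proxy for work left)
    stall_cycles: int = 0  # cycles this warp has spent stalled so far

    def criticality(self, w_inst: float = 1.0, w_stall: float = 2.0) -> float:
        # A warp with more remaining work and more accumulated stalls is
        # predicted to finish last, i.e., to be critical. The linear
        # combination and its weights are illustrative choices.
        return w_inst * self.remaining_insts + w_stall * self.stall_cycles

def pick_next_warp(ready: list) -> Warp:
    # Criticality-aware scheduling: issue from the most critical ready warp
    # so it receives more time resources (greedy prioritization).
    return max(ready, key=lambda w: w.criticality())

@dataclass
class CacheBlock:
    tag: int
    owner_wid: int
    reuse_predicted: bool  # predictor expects this block to be reused

def pick_victim(set_blocks: list, critical_wid: int) -> CacheBlock:
    # Criticality-aware replacement sketch: prefer evicting blocks that are
    # predicted dead (reuse_predicted == False); among those, prefer blocks
    # not owned by the critical warp, so its latency-critical data stays in L1.
    def evict_score(b: CacheBlock):
        # Tuples of booleans sort False-first, so the minimum is a
        # predicted-dead block belonging to a non-critical warp.
        return (b.reuse_predicted, b.owner_wid == critical_wid)
    return min(set_blocks, key=evict_score)
```

Here warp 1, despite having fewer remaining instructions, would be selected as critical because its accumulated stalls dominate under the assumed weights; the victim selection would then spare warp 1's reusable L1 blocks during eviction.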


Cited By

  • (2024) "Memento: An Adaptive, Compiler-Assisted Register File Cache for GPUs," ISCA 2024, pp. 978-990. DOI: 10.1109/ISCA59077.2024.00075
  • (2024) "WASP: Exploiting GPU Pipeline Parallelism with Hardware-Accelerated Automatic Warp Specialization," HPCA 2024, pp. 1-16. DOI: 10.1109/HPCA57654.2024.00086
  • (2023) "Boustrophedonic Frames: Quasi-Optimal L2 Caching for Textures in GPUs," PACT 2023, pp. 124-136. DOI: 10.1109/PACT58117.2023.00019
  • (2023) "Mitigating GPU Core Partitioning Performance Effects," HPCA 2023, pp. 530-542. DOI: 10.1109/HPCA56546.2023.10070957
  • (2023) "GPU Architecture," Handbook of Computer Architecture. DOI: 10.1007/978-981-15-6401-7_66-2
  • (2023) "GPU Architecture," Handbook of Computer Architecture. DOI: 10.1007/978-981-15-6401-7_66-1
  • (2022) "Adaptive Contention Management for Fine-Grained Synchronization on Commodity GPUs," ACM Transactions on Architecture and Code Optimization 19(4), pp. 1-21. DOI: 10.1145/3547301
  • (2022) "Cache-locality Based Adaptive Warp Scheduling for Neural Network Acceleration on GPGPUs," SOCC 2022, pp. 1-6. DOI: 10.1109/SOCC56010.2022.9908120
  • (2022) "DTexL: Decoupled Raster Pipeline for Texture Locality," MICRO 2022, pp. 213-227. DOI: 10.1109/MICRO56248.2022.00028
  • (2022) "ACWS: Adaptive Cache-state Aware Warp Scheduling Based on Cache Feature Analysis," ICFTIC 2022, pp. 599-603. DOI: 10.1109/ICFTIC57696.2022.10075135


Published In

ISCA '15: Proceedings of the 42nd Annual International Symposium on Computer Architecture
June 2015, 768 pages
ISBN: 9781450334020
DOI: 10.1145/2749469

Publisher

Association for Computing Machinery, New York, NY, United States


Acceptance Rates

Overall acceptance rate: 543 of 3,203 submissions, 17%



