DOI: 10.1145/2897937.2897966

A model-driven approach to warp/thread-block level GPU cache bypassing

Published: 05 June 2016

Abstract

The large volume of memory requests issued by massive numbers of threads can easily cause cache contention and cache-miss-related resource congestion on GPUs. This paper proposes a simple yet effective performance model that estimates the impact of cache contention and resource congestion as a function of the number of warps/thread blocks (TBs) that bypass the cache. Building on this model, we design a hardware-based dynamic warp/thread-block-level GPU cache bypassing scheme, which achieves a 1.68x speedup on average over the baseline on a set of memory-intensive benchmarks. Compared to prior work, our scheme achieves an average performance improvement of 21.6% over SWL-best [29] and 11.9% over CBWT-best [4].
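The bypassing decision in the paper is made in hardware, but the model-driven selection idea can be illustrated in a few lines: evaluate an estimated-performance function for each candidate number of bypassing warps and pick the best one. The following Python sketch uses a deliberately toy cost model; the capacity-based hit-rate estimate, `footprint_per_warp`, and the latency constants are all hypothetical illustrations, not the paper's actual model:

```python
def estimated_throughput(n_bypass, n_warps, footprint_per_warp,
                         cache_size, hit_latency=1, miss_latency=200):
    """Toy cost model: warps that use the cache share its capacity;
    bypassing warps always pay the miss latency but relieve contention."""
    n_cached = n_warps - n_bypass
    if n_cached > 0:
        # Fraction of each cached warp's footprint that fits in the cache.
        hit_rate = min(1.0, cache_size / (n_cached * footprint_per_warp))
    else:
        hit_rate = 0.0
    avg_cached_latency = hit_rate * hit_latency + (1 - hit_rate) * miss_latency
    # Throughput proxy: requests per cycle, summed over both warp groups.
    return n_cached / avg_cached_latency + n_bypass / miss_latency

def best_bypass_count(n_warps, footprint_per_warp, cache_size):
    """Pick the bypass count that maximizes the modeled throughput."""
    return max(range(n_warps + 1),
               key=lambda n: estimated_throughput(n, n_warps,
                                                  footprint_per_warp,
                                                  cache_size))
```

In the paper's scheme this search would be driven at run time by hardware-observed behavior; the sketch above is a static illustration of the selection step only.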

References

[1] AMD GCN Architecture white paper, 2012.
[2] A. Bakhoda et al. Analyzing CUDA workloads using a detailed GPU simulator. In Proceedings of ISPASS, 2009.
[3] M. Burtscher et al. A quantitative study of irregular programs on GPUs. In Proceedings of IISWC, 2012.
[4] X. Chen et al. Adaptive cache management for energy-efficient GPU computing. In Proceedings of MICRO, 2014.
[5] S. Che et al. Rodinia: A benchmark suite for heterogeneous computing. In Proceedings of IISWC, 2009.
[6] C. K. Chow. Determination of cache capacity and its matching storage hierarchy. IEEE Transactions on Computers, 1976.
[7] N. Duong et al. Improving cache management policies using dynamic reuse distances. In Proceedings of MICRO, 2012.
[8] J. Gaur et al. Bypass and insertion algorithms for exclusive last-level caches. In Proceedings of ISCA, 2011.
[9] A. González et al. Eliminating cache conflict misses through XOR-based placement functions. In Proceedings of ICS, 1997.
[10] S. Grauer-Gray et al. Auto-tuning a high-level language targeted to GPU codes. In Proceedings of InPar, 2012.
[11] Z. Guz et al. Many-core vs. many-thread machines: Stay away from the valley. IEEE Computer Architecture Letters, 2009.
[12] A. Hartstein et al. Cache miss behavior: is it √2? In Proceedings of Computing Frontiers, 2006.
[13] S. Hong et al. An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness. In Proceedings of ISCA, 2009.
[14] J. Huang et al. GPUMech: GPU performance modeling technique based on interval analysis. In Proceedings of MICRO, 2014.
[15] W. Jia et al. MRPB: Memory request prioritization for massively parallel processors. In Proceedings of HPCA, 2014.
[16] O. Kayıran et al. Neither more nor less: Optimizing thread-level parallelism for GPGPUs. In Proceedings of PACT, 2013.
[17] M. Kharbutli et al. Using prime numbers for cache indexing to eliminate conflict misses. IEE Proceedings - Software, 2004.
[18] A. Li et al. Adaptive and transparent cache bypassing for GPUs. In Proceedings of SC, 2015.
[19] C. Li et al. Locality-driven dynamic GPU cache bypassing. In Proceedings of ICS, 2015.
[20] C. Li et al. Understanding the tradeoffs between software-managed vs. hardware-managed caches in GPUs. In Proceedings of ISPASS, 2014.
[21] J. Leng et al. GPUWattch: Enabling energy optimizations in GPGPUs. In Proceedings of ISCA, 2013.
[22] D. Li et al. Priority-based cache allocation in throughput processors. In Proceedings of HPCA, 2015.
[23] V. Narasiman et al. Improving GPU performance via large warps and two-level warp scheduling. In Proceedings of MICRO, 2011.
[24] NVIDIA Kepler GK110 white paper, 2012.
[25] NVIDIA. CUDA C/C++ SDK code samples, 2011.
[26] NVIDIA's CUDA compute architecture: Fermi, 2009.
[27] NVIDIA Parallel Thread Execution ISA Version 4.2.
[28] M. Qureshi et al. Utility-based cache partitioning: A low-overhead, high-performance, runtime mechanism to partition shared caches. In Proceedings of MICRO, 2006.
[29] T. Rogers et al. Cache-conscious wavefront scheduling. In Proceedings of MICRO, 2012.
[30] I. Singh et al. Cache coherence for GPU architectures. In Proceedings of HPCA, 2013.
[31] Y. Tian et al. Adaptive GPU cache bypassing. In Proceedings of the 8th Workshop on General Purpose Processing using GPUs, 2015.
[32] S. Wilton et al. CACTI: An enhanced cache access and cycle time model. IEEE Journal of Solid-State Circuits, 1996.
[33] X. Xie et al. Coordinated static and dynamic cache bypassing for GPUs. In Proceedings of HPCA, 2015.
[34] Y. Zhang et al. A quantitative performance analysis model for GPU architectures. In Proceedings of HPCA, 2011.



Published In

DAC '16: Proceedings of the 53rd Annual Design Automation Conference
June 2016, 1048 pages
ISBN: 9781450342360
DOI: 10.1145/2897937
© 2016 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the United States Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

Publisher

Association for Computing Machinery, New York, NY, United States



    Qualifiers

    • Research-article

    Conference

    DAC '16

    Acceptance Rates

    Overall Acceptance Rate 1,770 of 5,499 submissions, 32%

Article Metrics

• Downloads (last 12 months): 90
• Downloads (last 6 weeks): 8

Reflects downloads up to 23 Dec 2024

Cited By
• (2023) A Survey of Memory-Centric Energy Efficient Computer Architecture. IEEE Transactions on Parallel and Distributed Systems, 34(10):2657-2670, Oct 2023. DOI: 10.1109/TPDS.2023.3297595
• (2023) Boustrophedonic Frames: Quasi-Optimal L2 Caching for Textures in GPUs. Proceedings of the 32nd International Conference on Parallel Architectures and Compilation Techniques, 124-136, 21 Oct 2023. DOI: 10.1109/PACT58117.2023.00019
• (2022) DTexL: Decoupled Raster Pipeline for Texture Locality. 55th IEEE/ACM International Symposium on Microarchitecture (MICRO), 213-227, Oct 2022. DOI: 10.1109/MICRO56248.2022.00028
• (2021) Locality-Based Cache Management and Warp Scheduling for Reducing Cache Contention in GPU. Micromachines, 12(10):1262, 17 Oct 2021. DOI: 10.3390/mi12101262
• (2021) MIPSGPU: Minimizing Pipeline Stalls for GPUs With Non-Blocking Execution. IEEE Transactions on Computers, 70(11):1804-1816, 1 Nov 2021. DOI: 10.1109/TC.2020.3026043
• (2020) MDM: The GPU Memory Divergence Model. 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 1009-1021, Oct 2020. DOI: 10.1109/MICRO50266.2020.00085
• (2019) MAC. Proceedings of the 48th International Conference on Parallel Processing, 1-10, 5 Aug 2019. DOI: 10.1145/3337821.3337867
• (2019) Modeling Emerging Memory-Divergent GPU Applications. IEEE Computer Architecture Letters, 18(2):95-98, 1 Jul 2019. DOI: 10.1109/LCA.2019.2923618
• (2018) Efficient Cache Performance Modeling in GPUs Using Reuse Distance Analysis. ACM Transactions on Architecture and Code Optimization, 15(4):1-24, 19 Dec 2018. DOI: 10.1145/3291051
• (2018) Memory Coalescing for Hybrid Memory Cube. Proceedings of the 47th International Conference on Parallel Processing, 1-10, 13 Aug 2018. DOI: 10.1145/3225058.3225062
