DOI: 10.1145/2751205.2751237 (ICS '15 Conference Proceedings)
Research Article · Public Access

Locality-Driven Dynamic GPU Cache Bypassing

Published: 08 June 2015

Abstract

This paper presents novel cache optimizations for massively parallel, throughput-oriented architectures such as GPUs. L1 data caches (L1 D-caches) are critical resources for providing high-bandwidth, low-latency data access. However, the large number of simultaneous requests from single-instruction multiple-thread (SIMT) cores makes the limited capacity of L1 D-caches a performance and energy bottleneck, especially for memory-intensive applications. We observe that for many applications, the memory access streams reaching the L1 D-cache contain a significant number of requests with low reuse, which greatly reduce cache efficacy. Existing GPU cache management schemes are either conditional/reactive solutions or hit-rate-based designs developed specifically for CPU last-level caches, and both can limit overall performance.
To overcome these challenges, we propose an efficient locality monitoring mechanism that dynamically filters the access stream at cache insertion time, so that only data with high reuse and short reuse distances are stored in the L1 D-cache. Specifically, we present a design that integrates locality filtering, based on the reuse characteristics of GPU workloads, into the decoupled tag store of the existing L1 D-cache through simple and cost-effective hardware extensions. Results show that our proposed design dramatically reduces cache contention and achieves up to 56.8% (average 30.3%) performance improvement over the baseline architecture for a range of highly optimized, cache-unfriendly applications, with minor area overhead and better energy efficiency. Our design also significantly outperforms state-of-the-art CPU and GPU bypassing schemes (especially for irregular applications), without generating extra contention at the L2 and DRAM levels.
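The core idea, admitting a line into the L1 only when a monitor has observed a short reuse distance for it, can be sketched as a toy software model. This is an illustrative sketch only, not the paper's hardware design; the class name, the reuse-window threshold, and the LRU bookkeeping are all assumptions made for the example.

```python
# Illustrative sketch (hypothetical, not the paper's actual mechanism):
# a monitor remembers when each line was last touched; on a miss, the line
# is inserted into the cache only if its observed reuse distance is short,
# otherwise the request bypasses the L1 entirely.
from collections import OrderedDict

class BypassingCache:
    def __init__(self, capacity, reuse_window):
        self.capacity = capacity          # number of lines the cache holds
        self.reuse_window = reuse_window  # max reuse distance worth caching
        self.cache = OrderedDict()        # LRU-ordered resident lines
        self.last_seen = {}               # monitor: line -> last access time
        self.time = 0
        self.hits = self.misses = self.bypasses = 0

    def access(self, line):
        self.time += 1
        if line in self.cache:
            self.cache.move_to_end(line)  # LRU update on hit
            self.hits += 1
        else:
            self.misses += 1
            prev = self.last_seen.get(line)
            # Insert only lines whose observed reuse distance is short;
            # first-touch or long-distance lines bypass the L1.
            if prev is not None and self.time - prev <= self.reuse_window:
                if len(self.cache) >= self.capacity:
                    self.cache.popitem(last=False)  # evict LRU line
                self.cache[line] = True
            else:
                self.bypasses += 1
        self.last_seen[line] = self.time

# A streaming pattern interleaved with a small hot set: the filter keeps
# the high-reuse hot lines resident while single-use stream lines bypass.
c = BypassingCache(capacity=4, reuse_window=16)
hot = [0, 1, 2, 3]
for i in range(100, 200):
    c.access(i)            # stream line, touched exactly once
    c.access(hot[i % 4])   # hot line, touched every 4th iteration
print(c.hits, c.bypasses)  # prints: 92 104
```

In this toy run the cache ends up holding only the four hot lines: every stream line has no recorded previous access, so it bypasses instead of evicting useful data, which mirrors the contention-reduction effect the abstract describes.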




Published In

ICS '15: Proceedings of the 29th ACM on International Conference on Supercomputing
June 2015
446 pages
ISBN:9781450335591
DOI:10.1145/2751205
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. cache bypassing
  2. gpu architecture optimization
  3. locality

Qualifiers

  • Research-article

Conference

ICS'15: 2015 International Conference on Supercomputing
June 8-11, 2015
Newport Beach, California, USA

Acceptance Rates

ICS '15 Paper Acceptance Rate: 40 of 160 submissions, 25%
Overall Acceptance Rate: 629 of 2,180 submissions, 29%


Cited By

  • (2023) RBGC: Repurpose the Buffer of Fixed Graphics Pipeline to Enhance GPU Cache. In Proceedings of the Great Lakes Symposium on VLSI 2023, pp. 173-177. DOI: 10.1145/3583781.3590305
  • (2023) DrGPUM: Guiding Memory Optimization for GPU-Accelerated Applications. In Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3, pp. 164-178. DOI: 10.1145/3582016.3582044
  • (2023) COLAB. In Proceedings of the 28th Asia and South Pacific Design Automation Conference, pp. 314-319. DOI: 10.1145/3566097.3567838
  • (2023) Boustrophedonic Frames: Quasi-Optimal L2 Caching for Textures in GPUs. In 2023 32nd International Conference on Parallel Architectures and Compilation Techniques (PACT), pp. 124-136. DOI: 10.1109/PACT58117.2023.00019
  • (2023) Analyzing Data Locality on GPU Caches Using Static Profiling of Workloads. IEEE Access, vol. 11, pp. 95939-95947. DOI: 10.1109/ACCESS.2023.3307315
  • (2023) Re-Cache: Mitigating Cache Contention by Exploiting Locality Characteristics with Reconfigurable Memory Hierarchy for GPGPUs. Microelectronics Journal, vol. 138, article 105825. DOI: 10.1016/j.mejo.2023.105825
  • (2023) GPU Thread Throttling for Page-Level Thrashing Reduction via Static Analysis. The Journal of Supercomputing, 80(7), pp. 9829-9847. DOI: 10.1007/s11227-023-05787-y
  • (2022) OSM: Off-Chip Shared Memory for GPUs. IEEE Transactions on Parallel and Distributed Systems, 33(12), pp. 3415-3429. DOI: 10.1109/TPDS.2022.3154315
  • (2022) Comparison of Different Adaptable Cache Bypassing Approaches. In 2022 XII Brazilian Symposium on Computing Systems Engineering (SBESC), pp. 1-8. DOI: 10.1109/SBESC56799.2022.9965178
  • (2022) Morpheus: Extending the Last Level Cache Capacity in GPU Systems Using Idle GPU Core Resources. In 2022 55th IEEE/ACM International Symposium on Microarchitecture (MICRO), pp. 228-244. DOI: 10.1109/MICRO56248.2022.00029
