DOI: 10.1145/2751205.2751237 (ICS '15 Conference Proceedings)
Research Article · Public Access

Locality-Driven Dynamic GPU Cache Bypassing

Published: 08 June 2015

Abstract

This paper presents novel cache optimizations for massively parallel, throughput-oriented architectures such as GPUs. L1 data caches (L1 D-caches) are critical resources for providing high-bandwidth, low-latency data access. However, the large number of simultaneous requests from single-instruction multiple-thread (SIMT) cores makes the limited capacity of L1 D-caches a performance and energy bottleneck, especially for memory-intensive applications. We observe that for many applications, the memory access streams reaching the L1 D-cache contain a significant number of requests with low reuse, which greatly reduce cache efficacy. Existing GPU cache management schemes are either conditional/reactive solutions or hit-rate-based designs developed specifically for CPU last-level caches, and both can limit overall performance.
To overcome these challenges, we propose an efficient locality monitoring mechanism that dynamically filters the access stream at cache insertion time, so that only data with high reuse and short reuse distances are stored in the L1 D-cache. Specifically, we present a design that integrates locality filtering, based on the reuse characteristics of GPU workloads, into the decoupled tag store of the existing L1 D-cache through simple and cost-effective hardware extensions. Results show that our proposed design dramatically reduces cache contention and achieves up to 56.8% (average 30.3%) performance improvement over the baseline architecture for a range of highly optimized, cache-unfriendly applications, with minor area overhead and better energy efficiency. Our design also significantly outperforms state-of-the-art CPU and GPU bypassing schemes (especially for irregular applications), without generating extra contention at the L2 and DRAM levels.
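The core idea, admitting a line into the L1 only when a monitor has observed a short reuse distance for it, can be sketched as a toy software model. This is an illustrative sketch only, not the paper's hardware design; the class name, the reuse-window threshold, and the LRU bookkeeping are all assumptions made for the example.

```python
# Illustrative sketch (hypothetical, not the paper's actual mechanism):
# a monitor remembers when each line was last touched; on a miss, the line
# is inserted into the cache only if its observed reuse distance is short,
# otherwise the request bypasses the L1 entirely.
from collections import OrderedDict

class BypassingCache:
    def __init__(self, capacity, reuse_window):
        self.capacity = capacity          # number of lines the cache holds
        self.reuse_window = reuse_window  # max reuse distance worth caching
        self.cache = OrderedDict()        # LRU-ordered resident lines
        self.last_seen = {}               # monitor: line -> last access time
        self.time = 0
        self.hits = self.misses = self.bypasses = 0

    def access(self, line):
        self.time += 1
        if line in self.cache:
            self.cache.move_to_end(line)  # LRU update on hit
            self.hits += 1
        else:
            self.misses += 1
            prev = self.last_seen.get(line)
            # Insert only lines whose observed reuse distance is short;
            # first-touch or long-distance lines bypass the L1.
            if prev is not None and self.time - prev <= self.reuse_window:
                if len(self.cache) >= self.capacity:
                    self.cache.popitem(last=False)  # evict LRU line
                self.cache[line] = True
            else:
                self.bypasses += 1
        self.last_seen[line] = self.time

# A streaming pattern interleaved with a small hot set: the filter keeps
# the high-reuse hot lines resident while single-use stream lines bypass.
c = BypassingCache(capacity=4, reuse_window=16)
hot = [0, 1, 2, 3]
for i in range(100, 200):
    c.access(i)            # stream line, touched exactly once
    c.access(hot[i % 4])   # hot line, touched every 4th iteration
print(c.hits, c.bypasses)  # prints: 92 104
```

In this toy run the cache ends up holding only the four hot lines: every stream line has no recorded previous access, so it bypasses instead of evicting useful data, which mirrors the contention-reduction effect the abstract describes.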




Published In

ICS '15: Proceedings of the 29th ACM on International Conference on Supercomputing
June 2015
446 pages
ISBN:9781450335591
DOI:10.1145/2751205
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. cache bypassing
  2. gpu architecture optimization
  3. locality

Qualifiers

  • Research-article

Conference

ICS'15: 2015 International Conference on Supercomputing
June 8-11, 2015
Newport Beach, California, USA

Acceptance Rates

ICS '15 Paper Acceptance Rate: 40 of 160 submissions, 25%
Overall Acceptance Rate: 629 of 2,180 submissions, 29%


Cited By

  • (2023) RBGC: Repurpose the Buffer of Fixed Graphics Pipeline to Enhance GPU Cache. In Proceedings of the Great Lakes Symposium on VLSI 2023, pp. 173-177. DOI: 10.1145/3583781.3590305
  • (2023) DrGPUM: Guiding Memory Optimization for GPU-Accelerated Applications. In Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3, pp. 164-178. DOI: 10.1145/3582016.3582044
  • (2023) COLAB. In Proceedings of the 28th Asia and South Pacific Design Automation Conference, pp. 314-319. DOI: 10.1145/3566097.3567838
  • (2023) Boustrophedonic Frames: Quasi-Optimal L2 Caching for Textures in GPUs. In 2023 32nd International Conference on Parallel Architectures and Compilation Techniques (PACT), pp. 124-136. DOI: 10.1109/PACT58117.2023.00019
  • (2023) Analyzing Data Locality on GPU Caches Using Static Profiling of Workloads. IEEE Access, vol. 11, pp. 95939-95947. DOI: 10.1109/ACCESS.2023.3307315
  • (2023) Re-Cache: Mitigating Cache Contention by Exploiting Locality Characteristics with Reconfigurable Memory Hierarchy for GPGPUs. Microelectronics Journal, vol. 138, article 105825. DOI: 10.1016/j.mejo.2023.105825
  • (2023) GPU Thread Throttling for Page-Level Thrashing Reduction via Static Analysis. The Journal of Supercomputing, 80(7), pp. 9829-9847. DOI: 10.1007/s11227-023-05787-y
  • (2022) OSM: Off-Chip Shared Memory for GPUs. IEEE Transactions on Parallel and Distributed Systems, 33(12), pp. 3415-3429. DOI: 10.1109/TPDS.2022.3154315
  • (2022) Comparison of Different Adaptable Cache Bypassing Approaches. In 2022 XII Brazilian Symposium on Computing Systems Engineering (SBESC), pp. 1-8. DOI: 10.1109/SBESC56799.2022.9965178
  • (2022) Morpheus: Extending the Last Level Cache Capacity in GPU Systems Using Idle GPU Core Resources. In 2022 55th IEEE/ACM International Symposium on Microarchitecture (MICRO), pp. 228-244. DOI: 10.1109/MICRO56248.2022.00029
