
Adaptive and transparent cache bypassing for GPUs

Published: 15 November 2015
DOI: 10.1145/2807591.2807606

Abstract

In the last decade, GPUs have been widely adopted for general-purpose applications. To capture on-chip locality for these applications, modern GPUs integrate a multi-level cache hierarchy, in an attempt to reduce the amount and latency of the massive and sometimes irregular memory accesses. However, inferior performance is frequently attained because the huge number of concurrent threads causes serious congestion in the caches. In this paper, we propose a novel compile-time framework for adaptive and transparent cache bypassing on GPUs. It uses a simple yet effective approach to control the bypass degree so that it matches the size of an application's runtime footprint. We validate the design on seven GPU platforms, covering all existing GPU generations, using 16 applications from widely used GPU benchmarks. Experiments show that our design significantly mitigates the negative impact of small cache sizes and improves overall performance. We analyze the performance across the different platforms and applications, and we propose optimization guidelines for using GPU caches efficiently.
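The abstract only summarizes the mechanism, so here is a minimal hand-written sketch of the kind of per-load bypassing the framework automates. NVIDIA PTX exposes cache operators on global loads (ld.global.ca caches in L1, ld.global.cg bypasses L1 and caches in L2 only), and the paper's compiler selects between them so that only a controlled fraction of threads competes for the small L1. The BYPASS_THRESHOLD constant below is a hypothetical, manually fixed bypass degree; the actual framework chooses this degree adaptively and transparently at compile time, and the code it generates is not shown on this page.

```cuda
#include <cuda_runtime.h>

// Load through L1 (PTX cache operator .ca: cache at all levels).
__device__ __forceinline__ float load_cached(const float* p) {
    float v;
    asm volatile("ld.global.ca.f32 %0, [%1];" : "=f"(v) : "l"(p));
    return v;
}

// Bypass L1 (PTX cache operator .cg: cache at L2 and below only).
__device__ __forceinline__ float load_bypass(const float* p) {
    float v;
    asm volatile("ld.global.cg.f32 %0, [%1];" : "=f"(v) : "l"(p));
    return v;
}

// Hypothetical bypass degree: warps with index below the threshold keep
// using L1, the rest bypass it, so the combined footprint of the
// cache-using warps can fit in the small L1.
#define BYPASS_THRESHOLD 8  // warps per block that still use L1 (assumed)

__global__ void scale(const float* __restrict__ in, float* __restrict__ out,
                      int n, float alpha) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int warp_id = threadIdx.x / warpSize;
    if (i < n) {
        float v = (warp_id < BYPASS_THRESHOLD) ? load_cached(&in[i])
                                               : load_bypass(&in[i]);
        out[i] = alpha * v;
    }
}

int main() {
    const int n = 1 << 20;
    float *in, *out;
    cudaMalloc(&in, n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));
    // 512 threads = 16 warps per block: with the threshold above, half the
    // warps use L1 and half bypass it (a 50% bypass degree).
    scale<<<(n + 511) / 512, 512>>>(in, out, n, 2.0f);
    cudaDeviceSynchronize();
    cudaFree(in);
    cudaFree(out);
    return 0;
}
```

Keying the decision on the warp index rather than on individual threads keeps all lanes of a warp on the same path, so the bypass introduces no extra divergence; raising or lowering the threshold trades L1 hit rate against congestion, which is the knob the paper tunes to the application's runtime footprint.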




Published In

SC '15: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
November 2015
985 pages
ISBN:9781450337236
DOI:10.1145/2807591
General Chair: Jackie Kern
Program Chair: Jeffrey S. Vetter
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.


Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. GPUs
  2. cache bypassing
  3. thread throttling

Qualifiers

  • Research-article

Conference

SC15

Acceptance Rates

SC '15 paper acceptance rate: 79 of 358 submissions (22%)
Overall acceptance rate: 1,516 of 6,373 submissions (24%)


Article Metrics

  • Downloads (last 12 months): 14
  • Downloads (last 6 weeks): 1
Reflects downloads up to 13 Sep 2024


Cited By

  • (2023) Boustrophedonic Frames: Quasi-Optimal L2 Caching for Textures in GPUs. In: PACT 2023, pp. 124-136. DOI: 10.1109/PACT58117.2023.00019
  • (2023) GPU thread throttling for page-level thrashing reduction via static analysis. The Journal of Supercomputing 80(7), pp. 9829-9847. DOI: 10.1007/s11227-023-05787-y
  • (2022) Comparison of Different Adaptable Cache Bypassing Approaches. In: SBESC 2022, pp. 1-8. DOI: 10.1109/SBESC56799.2022.9965178
  • (2022) Morpheus: Extending the Last Level Cache Capacity in GPU Systems Using Idle GPU Core Resources. In: MICRO 2022, pp. 228-244. DOI: 10.1109/MICRO56248.2022.00029
  • (2022) DTexL: Decoupled Raster Pipeline for Texture Locality. In: MICRO 2022, pp. 213-227. DOI: 10.1109/MICRO56248.2022.00028
  • (2022) Aggressive GPU cache bypassing with monolithic 3D-based NoC. The Journal of Supercomputing 79(5), pp. 5421-5442. DOI: 10.1007/s11227-022-04878-6
  • (2022) A Quantitative Study of Locality in GPU Caches for Memory-Divergent Workloads. International Journal of Parallel Programming 50(2), pp. 189-216. DOI: 10.1007/s10766-022-00729-2
  • (2021) TLB-pilot: Mitigating TLB Contention Attack on GPUs with Microarchitecture-Aware Scheduling. ACM Transactions on Architecture and Code Optimization 19(1), pp. 1-23. DOI: 10.1145/3491218
  • (2021) Locality-Aware CTA Scheduling for Gaming Applications. ACM Transactions on Architecture and Code Optimization 19(1), pp. 1-26. DOI: 10.1145/3477497
  • (2021) Cohmeleon: Learning-Based Orchestration of Accelerator Coherence in Heterogeneous SoCs. In: MICRO-54, pp. 350-365. DOI: 10.1145/3466752.3480065
