
Adaptive and transparent cache bypassing for GPUs

Published: 15 November 2015
DOI: 10.1145/2807591.2807606

Abstract

In the last decade, GPUs have been widely adopted for general-purpose applications. To capture on-chip locality for these applications, modern GPUs integrate a multi-level cache hierarchy, in an attempt to reduce the amount and latency of the massive and sometimes irregular memory accesses. However, inferior performance is frequently attained because the huge number of concurrent threads causes serious congestion in the caches. In this paper, we propose a novel compile-time framework for adaptive and transparent cache bypassing on GPUs. It uses a simple yet effective approach to control the bypass degree so that it matches the size of an application's runtime footprint. We validate the design on seven GPU platforms, covering all existing GPU generations, using 16 applications from widely used GPU benchmarks. Experiments show that our design significantly mitigates the negative impact of small cache sizes and improves overall performance. We analyze the performance across the different platforms and applications, and we propose optimization guidelines for using GPU caches efficiently.
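The abstract only summarizes the mechanism, so here is a minimal hand-written sketch of the kind of per-load bypassing the framework automates. NVIDIA PTX exposes cache operators on global loads (ld.global.ca caches in L1, ld.global.cg bypasses L1 and caches in L2 only), and the paper's compiler selects between them so that only a controlled fraction of threads competes for the small L1. The BYPASS_THRESHOLD constant below is a hypothetical, manually fixed bypass degree; the actual framework chooses this degree adaptively and transparently at compile time, and the code it generates is not shown on this page.

```cuda
#include <cuda_runtime.h>

// Load through L1 (PTX cache operator .ca: cache at all levels).
__device__ __forceinline__ float load_cached(const float* p) {
    float v;
    asm volatile("ld.global.ca.f32 %0, [%1];" : "=f"(v) : "l"(p));
    return v;
}

// Bypass L1 (PTX cache operator .cg: cache at L2 and below only).
__device__ __forceinline__ float load_bypass(const float* p) {
    float v;
    asm volatile("ld.global.cg.f32 %0, [%1];" : "=f"(v) : "l"(p));
    return v;
}

// Hypothetical bypass degree: warps with index below the threshold keep
// using L1, the rest bypass it, so the combined footprint of the
// cache-using warps can fit in the small L1.
#define BYPASS_THRESHOLD 8  // warps per block that still use L1 (assumed)

__global__ void scale(const float* __restrict__ in, float* __restrict__ out,
                      int n, float alpha) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int warp_id = threadIdx.x / warpSize;
    if (i < n) {
        float v = (warp_id < BYPASS_THRESHOLD) ? load_cached(&in[i])
                                               : load_bypass(&in[i]);
        out[i] = alpha * v;
    }
}

int main() {
    const int n = 1 << 20;
    float *in, *out;
    cudaMalloc(&in, n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));
    // 512 threads = 16 warps per block: with the threshold above, half the
    // warps use L1 and half bypass it (a 50% bypass degree).
    scale<<<(n + 511) / 512, 512>>>(in, out, n, 2.0f);
    cudaDeviceSynchronize();
    cudaFree(in);
    cudaFree(out);
    return 0;
}
```

Keying the decision on the warp index rather than on individual threads keeps all lanes of a warp on the same path, so the bypass introduces no extra divergence; raising or lowering the threshold trades L1 hit rate against congestion, which is the knob the paper tunes to the application's runtime footprint.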




Published In

SC '15: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
November 2015
985 pages
ISBN:9781450337236
DOI:10.1145/2807591
General Chair: Jackie Kern
Program Chair: Jeffrey S. Vetter
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.


Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. GPUs
  2. cache bypassing
  3. thread throttling

Qualifiers

  • Research-article

Conference

SC15

Acceptance Rates

SC '15 paper acceptance rate: 79 of 358 submissions (22%)
Overall acceptance rate: 1,516 of 6,373 submissions (24%)


Article Metrics

  • Downloads (last 12 months): 14
  • Downloads (last 6 weeks): 1
Reflects downloads up to 13 Sep 2024


Cited By

  • (2023) Boustrophedonic Frames: Quasi-Optimal L2 Caching for Textures in GPUs. In: PACT 2023, pp. 124-136. DOI: 10.1109/PACT58117.2023.00019
  • (2023) GPU thread throttling for page-level thrashing reduction via static analysis. The Journal of Supercomputing 80(7), pp. 9829-9847. DOI: 10.1007/s11227-023-05787-y
  • (2022) Comparison of Different Adaptable Cache Bypassing Approaches. In: SBESC 2022, pp. 1-8. DOI: 10.1109/SBESC56799.2022.9965178
  • (2022) Morpheus: Extending the Last Level Cache Capacity in GPU Systems Using Idle GPU Core Resources. In: MICRO 2022, pp. 228-244. DOI: 10.1109/MICRO56248.2022.00029
  • (2022) DTexL: Decoupled Raster Pipeline for Texture Locality. In: MICRO 2022, pp. 213-227. DOI: 10.1109/MICRO56248.2022.00028
  • (2022) Aggressive GPU cache bypassing with monolithic 3D-based NoC. The Journal of Supercomputing 79(5), pp. 5421-5442. DOI: 10.1007/s11227-022-04878-6
  • (2022) A Quantitative Study of Locality in GPU Caches for Memory-Divergent Workloads. International Journal of Parallel Programming 50(2), pp. 189-216. DOI: 10.1007/s10766-022-00729-2
  • (2021) TLB-pilot: Mitigating TLB Contention Attack on GPUs with Microarchitecture-Aware Scheduling. ACM Transactions on Architecture and Code Optimization 19(1), pp. 1-23. DOI: 10.1145/3491218
  • (2021) Locality-Aware CTA Scheduling for Gaming Applications. ACM Transactions on Architecture and Code Optimization 19(1), pp. 1-26. DOI: 10.1145/3477497
  • (2021) Cohmeleon: Learning-Based Orchestration of Accelerator Coherence in Heterogeneous SoCs. In: MICRO-54, pp. 350-365. DOI: 10.1145/3466752.3480065
