DOI: 10.1145/2613908.2613909

Adaptive Cache Bypass and Insertion for Many-core Accelerators

Published: 15 June 2014

Abstract

Many-core accelerators, e.g. GPUs, are widely used to accelerate general-purpose compute kernels. With the SIMT execution model, GPUs can hide memory latency through massive multithreading for many regular applications. To support more applications with irregular memory access patterns, a cache hierarchy has been introduced into GPU architectures to capture input-data sharing and mitigate the effect of irregular accesses. However, GPU caches suffer from poor efficiency due to severe contention, which makes it difficult to adopt heuristic management policies, and also limits system performance and energy efficiency.
We propose an adaptive cache management policy designed specifically for many-core accelerators. The tag array of the L2 cache is enhanced with extra bits to track memory access history, and the locality information thus captured is provided to the L1 cache as a heuristic to guide its run-time bypass and insertion decisions. By preventing un-reused data from polluting the cache and by alleviating contention, cache efficiency is significantly improved. As a result, system performance improves by 31% on average for cache-sensitive benchmarks, compared to the baseline GPU architecture.
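The abstract describes L2 tag bits that record access history and feed the L1 cache a locality hint for bypass/insertion decisions. The paper's exact mechanism is not reproduced in this record, so the following is only a minimal illustrative sketch under assumed details: a per-line saturating reuse counter in the L2 tags, a hypothetical `BYPASS_THRESHOLD`, and a toy fully associative L1. All class and parameter names are invented for illustration.

```python
class L2TagEntry:
    """Extra bits attached to an L2 tag to record access history."""
    def __init__(self):
        self.reuse = 0  # small saturating reuse counter (assumed 2-bit)


class AdaptiveL1:
    BYPASS_THRESHOLD = 1  # hypothetical tuning knob, not from the paper

    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = {}    # tag -> resident flag (toy fully associative L1)
        self.l2_tags = {}  # tag -> L2TagEntry (history kept at the L2 level)

    def access(self, tag):
        if tag in self.lines:
            return "L1 hit"
        # On an L1 miss, consult the L2 history bits as a locality hint.
        entry = self.l2_tags.setdefault(tag, L2TagEntry())
        decision = "insert" if entry.reuse >= self.BYPASS_THRESHOLD else "bypass"
        entry.reuse = min(entry.reuse + 1, 3)  # record this access in L2 history
        if decision == "insert":
            if len(self.lines) >= self.capacity:
                self.lines.pop(next(iter(self.lines)))  # evict oldest entry
            self.lines[tag] = True
        return f"L1 miss ({decision})"
```

Under this sketch, the first touch of a line bypasses L1 (no history yet), while lines that the L2 history has already seen get inserted, so streaming data never pollutes L1.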



Published In
MES '14: Proceedings of International Workshop on Manycore Embedded Systems
June 2014
67 pages
ISBN:9781450328227
DOI:10.1145/2613908
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

In-Cooperation

  • University of Turku

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. GPGPU
  2. bypass
  3. cache management
  4. insertion

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

MES '14

Acceptance Rates

Overall Acceptance Rate 5 of 21 submissions, 24%

Article Metrics

  • Downloads (last 12 months): 8
  • Downloads (last 6 weeks): 1
Reflects downloads up to 16 Oct 2024.


Cited By

  • (2023) "COBRRA: COntention-aware cache Bypass with Request-Response Arbitration," ACM Transactions on Embedded Computing Systems, 23(1), 1-30. DOI: 10.1145/3632748
  • (2023) "WSMP: a warp scheduling strategy based on MFQ and PPF," The Journal of Supercomputing, 79(11), 12317-12340. DOI: 10.1007/s11227-023-05127-0
  • (2021) "Locality-Based Cache Management and Warp Scheduling for Reducing Cache Contention in GPU," Micromachines, 12(10), 1262. DOI: 10.3390/mi12101262
  • (2019) "A survey of architectural approaches for improving GPGPU performance, programmability and heterogeneity," Journal of Parallel and Distributed Computing. DOI: 10.1016/j.jpdc.2018.11.012
  • (2019) "A Study on L1 Data Cache Bypassing Methods for High-Performance GPUs," 210-219. DOI: 10.1007/978-981-13-5907-1_22
  • (2018) "MASK," ACM SIGPLAN Notices, 53(2), 503-518. DOI: 10.1145/3296957.3173169
  • (2018) "Heavy-traffic Delay Optimality in Pull-based Load Balancing Systems," Proceedings of the ACM on Measurement and Analysis of Computing Systems, 2(3), 1-33. DOI: 10.1145/3287323
  • (2018) "Model Agnostic Time Series Analysis via Matrix Estimation," Proceedings of the ACM on Measurement and Analysis of Computing Systems, 2(3), 1-39. DOI: 10.1145/3287319
  • (2018) "Quantifying Data Locality in Dynamic Parallelism in GPUs," Proceedings of the ACM on Measurement and Analysis of Computing Systems, 2(3), 1-24. DOI: 10.1145/3287318
  • (2018) "I Can't Be Myself," Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 2(3), 1-40. DOI: 10.1145/3264900
