DOI: 10.1145/2613908.2613909

Adaptive Cache Bypass and Insertion for Many-core Accelerators

Published: 15 June 2014

Abstract

Many-core accelerators, e.g. GPUs, are widely used to accelerate general-purpose compute kernels. With the SIMT execution model, GPUs can hide memory latency through massive multithreading for many regular applications. To support more applications with irregular memory access patterns, a cache hierarchy has been introduced into GPU architectures to capture input-data sharing and mitigate the effect of irregular accesses. However, GPU caches suffer from poor efficiency due to severe contention, which makes it difficult to adopt heuristic management policies, and also limits system performance and energy efficiency.
We propose an adaptive cache management policy designed specifically for many-core accelerators. The tag array of the L2 cache is enhanced with extra bits to track memory access history, and the locality information thus captured is provided to the L1 cache as a heuristic to guide its run-time bypass and insertion decisions. By preventing un-reused data from polluting the cache and by alleviating contention, cache efficiency is significantly improved. As a result, system performance improves by 31% on average for cache-sensitive benchmarks, compared to the baseline GPU architecture.
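The abstract describes L2 tag bits that record access history and feed the L1 cache a locality hint for bypass/insertion decisions. The paper's exact mechanism is not reproduced in this record, so the following is only a minimal illustrative sketch under assumed details: a per-line saturating reuse counter in the L2 tags, a hypothetical `BYPASS_THRESHOLD`, and a toy fully associative L1. All class and parameter names are invented for illustration.

```python
class L2TagEntry:
    """Extra bits attached to an L2 tag to record access history."""
    def __init__(self):
        self.reuse = 0  # small saturating reuse counter (assumed 2-bit)


class AdaptiveL1:
    BYPASS_THRESHOLD = 1  # hypothetical tuning knob, not from the paper

    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = {}    # tag -> resident flag (toy fully associative L1)
        self.l2_tags = {}  # tag -> L2TagEntry (history kept at the L2 level)

    def access(self, tag):
        if tag in self.lines:
            return "L1 hit"
        # On an L1 miss, consult the L2 history bits as a locality hint.
        entry = self.l2_tags.setdefault(tag, L2TagEntry())
        decision = "insert" if entry.reuse >= self.BYPASS_THRESHOLD else "bypass"
        entry.reuse = min(entry.reuse + 1, 3)  # record this access in L2 history
        if decision == "insert":
            if len(self.lines) >= self.capacity:
                self.lines.pop(next(iter(self.lines)))  # evict oldest entry
            self.lines[tag] = True
        return f"L1 miss ({decision})"
```

Under this sketch, the first touch of a line bypasses L1 (no history yet), while lines that the L2 history has already seen get inserted, so streaming data never pollutes L1.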



Published In
MES '14: Proceedings of International Workshop on Manycore Embedded Systems
June 2014
67 pages
ISBN:9781450328227
DOI:10.1145/2613908
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

In-Cooperation

  • University of Turku

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. GPGPU
  2. bypass
  3. cache management
  4. insertion

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

MES '14

Acceptance Rates

Overall Acceptance Rate 5 of 21 submissions, 24%

Article Metrics

  • Downloads (last 12 months): 8
  • Downloads (last 6 weeks): 1
Reflects downloads up to 16 Oct 2024.


Cited By

  • (2023) "COBRRA: COntention-aware cache Bypass with Request-Response Arbitration," ACM Transactions on Embedded Computing Systems, 23(1), 1-30. DOI: 10.1145/3632748
  • (2023) "WSMP: a warp scheduling strategy based on MFQ and PPF," The Journal of Supercomputing, 79(11), 12317-12340. DOI: 10.1007/s11227-023-05127-0
  • (2021) "Locality-Based Cache Management and Warp Scheduling for Reducing Cache Contention in GPU," Micromachines, 12(10), 1262. DOI: 10.3390/mi12101262
  • (2019) "A survey of architectural approaches for improving GPGPU performance, programmability and heterogeneity," Journal of Parallel and Distributed Computing. DOI: 10.1016/j.jpdc.2018.11.012
  • (2019) "A Study on L1 Data Cache Bypassing Methods for High-Performance GPUs," 210-219. DOI: 10.1007/978-981-13-5907-1_22
  • (2018) "MASK," ACM SIGPLAN Notices, 53(2), 503-518. DOI: 10.1145/3296957.3173169
  • (2018) "Heavy-traffic Delay Optimality in Pull-based Load Balancing Systems," Proceedings of the ACM on Measurement and Analysis of Computing Systems, 2(3), 1-33. DOI: 10.1145/3287323
  • (2018) "Model Agnostic Time Series Analysis via Matrix Estimation," Proceedings of the ACM on Measurement and Analysis of Computing Systems, 2(3), 1-39. DOI: 10.1145/3287319
  • (2018) "Quantifying Data Locality in Dynamic Parallelism in GPUs," Proceedings of the ACM on Measurement and Analysis of Computing Systems, 2(3), 1-24. DOI: 10.1145/3287318
  • (2018) "I Can't Be Myself," Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 2(3), 1-40. DOI: 10.1145/3264900
