
ATA-Cache: Contention Mitigation for GPU Shared L1 Cache With Aggregated Tag Array

Published: 01 May 2024

Abstract

Sharing the L1 cache among multiple GPU cores is a promising way to fully exploit the locality of GPU applications, but the shared L1 cache architecture still suffers from severe resource contention. We present a GPU shared L1 cache architecture with an aggregated tag array that minimizes L1 cache contention while taking full advantage of inter-core locality. The key idea is to decouple the tag arrays of the individual L1 caches from their data arrays and aggregate them, so that a cache request can be compared against all tag arrays in parallel to detect data replicated in other cores' caches. A GPU core accesses another core's L1 cache only when such a replica exists, filtering out the unnecessary remote accesses that cause heavy resource contention. We also develop a two-level thread-block scheduling policy adapted to the shared L1 cache architecture to maximize the available locality. Experimental results show that GPU performance improves by 14.5% on average for applications with high inter-core locality.
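To make the probe-and-filter mechanism concrete, the following is a minimal C++ sketch of an aggregated tag array. It is an illustration under assumed parameters (4 cores, 64 sets, 4 ways, 128-byte lines), not the paper's actual design: a request first probes the aggregated tags of all cores' L1 caches, the owning core's data array is accessed only when a replica is found, and requests that hit no tag array are forwarded straight to L2.

```cpp
#include <cstdint>
#include <iostream>
#include <optional>
#include <vector>

// Hypothetical parameters for illustration only.
constexpr int kNumCores = 4;   // GPU cores sharing their L1 caches
constexpr int kNumSets  = 64;  // sets per L1 cache
constexpr int kWays     = 4;   // associativity
constexpr int kLineBits = 7;   // 128-byte cache lines

struct TagEntry {
    bool     valid = false;
    uint64_t tag   = 0;
};

// Aggregated tag array: the tag arrays of all per-core L1 caches are
// decoupled from their data arrays and placed side by side, so a single
// request can be compared against every core's tags.
class AggregatedTagArray {
public:
    AggregatedTagArray()
        : tags_(kNumCores, std::vector<TagEntry>(kNumSets * kWays)) {}

    // Probe all cores' tag arrays. Returns the id of a core whose L1
    // holds the line, if any; only that core's data array then needs to
    // be accessed, so requests that would miss everywhere never consume
    // remote data-array or interconnect bandwidth.
    std::optional<int> probe(uint64_t addr) const {
        uint64_t line = addr >> kLineBits;
        uint64_t set  = line % kNumSets;
        uint64_t tag  = line / kNumSets;
        for (int core = 0; core < kNumCores; ++core)
            for (int way = 0; way < kWays; ++way) {
                const TagEntry& e = tags_[core][set * kWays + way];
                if (e.valid && e.tag == tag) return core;
            }
        return std::nullopt;  // miss in every L1 -> forward to L2
    }

    // Called when `core` fills a line into its own L1 (simple way-0 fill;
    // a real design would keep replacement state alongside each tag).
    void fill(int core, uint64_t addr) {
        uint64_t line = addr >> kLineBits;
        uint64_t set  = line % kNumSets;
        tags_[core][set * kWays] = {true, line / kNumSets};
    }

private:
    std::vector<std::vector<TagEntry>> tags_;  // [core][set * ways + way]
};

int main() {
    AggregatedTagArray ata;
    ata.fill(/*core=*/2, /*addr=*/0xABCD00);
    if (auto owner = ata.probe(0xABCD00))
        std::cout << "replica found in core " << *owner << "'s L1\n";
    if (!ata.probe(0x123400))
        std::cout << "no replica: request is filtered to L2\n";
    return 0;
}
```

In hardware the per-core tag comparisons would happen in the same cycle; the sequential loop above only models the logical outcome of that parallel probe.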


Cited By

  • GPU Performance Optimization via Intergroup Cache Cooperation, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 43, no. 11, pp. 4142–4153, Nov. 2024, doi: 10.1109/TCAD.2024.3443707.

Published In

IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Volume 43, Issue 5, May 2024, 305 pages

Publisher

IEEE Press


Qualifiers

  • Research-article

