
ATA-Cache: Contention Mitigation for GPU Shared L1 Cache With Aggregated Tag Array

Published: 01 May 2024

Abstract

Sharing the L1 cache among multiple GPU cores is a promising way to fully exploit the locality of GPU applications, but the shared L1 cache architecture still suffers from severe resource contention. We present a GPU shared L1 cache architecture with an aggregated tag array that minimizes L1 cache contention while taking full advantage of inter-core locality. The key idea is to decouple the tag arrays of the individual L1 caches from their data arrays and aggregate them, so that a cache request can be compared against all tag arrays in parallel to detect data replicated in other cores' caches. A GPU core accesses another core's L1 cache only when such a replica exists, filtering out the unnecessary remote accesses that cause heavy resource contention. We also develop a two-level thread-block scheduling policy adapted to the shared L1 cache architecture to maximize the available locality. Experimental results show that GPU performance improves by 14.5% on average for applications with high inter-core locality.
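To make the probe-and-filter mechanism concrete, the following is a minimal C++ sketch of an aggregated tag array. It is an illustration under assumed parameters (4 cores, 64 sets, 4 ways, 128-byte lines), not the paper's actual design: a request first probes the aggregated tags of all cores' L1 caches, the owning core's data array is accessed only when a replica is found, and requests that hit no tag array are forwarded straight to L2.

```cpp
#include <cstdint>
#include <iostream>
#include <optional>
#include <vector>

// Hypothetical parameters for illustration only.
constexpr int kNumCores = 4;   // GPU cores sharing their L1 caches
constexpr int kNumSets  = 64;  // sets per L1 cache
constexpr int kWays     = 4;   // associativity
constexpr int kLineBits = 7;   // 128-byte cache lines

struct TagEntry {
    bool     valid = false;
    uint64_t tag   = 0;
};

// Aggregated tag array: the tag arrays of all per-core L1 caches are
// decoupled from their data arrays and placed side by side, so a single
// request can be compared against every core's tags.
class AggregatedTagArray {
public:
    AggregatedTagArray()
        : tags_(kNumCores, std::vector<TagEntry>(kNumSets * kWays)) {}

    // Probe all cores' tag arrays. Returns the id of a core whose L1
    // holds the line, if any; only that core's data array then needs to
    // be accessed, so requests that would miss everywhere never consume
    // remote data-array or interconnect bandwidth.
    std::optional<int> probe(uint64_t addr) const {
        uint64_t line = addr >> kLineBits;
        uint64_t set  = line % kNumSets;
        uint64_t tag  = line / kNumSets;
        for (int core = 0; core < kNumCores; ++core)
            for (int way = 0; way < kWays; ++way) {
                const TagEntry& e = tags_[core][set * kWays + way];
                if (e.valid && e.tag == tag) return core;
            }
        return std::nullopt;  // miss in every L1 -> forward to L2
    }

    // Called when `core` fills a line into its own L1 (simple way-0 fill;
    // a real design would keep replacement state alongside each tag).
    void fill(int core, uint64_t addr) {
        uint64_t line = addr >> kLineBits;
        uint64_t set  = line % kNumSets;
        tags_[core][set * kWays] = {true, line / kNumSets};
    }

private:
    std::vector<std::vector<TagEntry>> tags_;  // [core][set * ways + way]
};

int main() {
    AggregatedTagArray ata;
    ata.fill(/*core=*/2, /*addr=*/0xABCD00);
    if (auto owner = ata.probe(0xABCD00))
        std::cout << "replica found in core " << *owner << "'s L1\n";
    if (!ata.probe(0x123400))
        std::cout << "no replica: request is filtered to L2\n";
    return 0;
}
```

In hardware the per-core tag comparisons would happen in the same cycle; the sequential loop above only models the logical outcome of that parallel probe.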


Cited By

  • GPU Performance Optimization via Intergroup Cache Cooperation, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 43, no. 11, pp. 4142–4153, Nov. 2024, doi: 10.1109/TCAD.2024.3443707.

Published In

IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Volume 43, Issue 5, May 2024, 305 pages

Publisher

IEEE Press


Qualifiers

  • Research-article

