SRAM- and STT-RAM-based hybrid, shared last-level cache for on-chip CPU–GPU heterogeneous architectures

Gao, Lan; Wang, Rui; Xu, Yunlong; Yang, Hailong; Luan, Zhongzhi; Qian, Depei; Zhang, Han; Cai, Jihong

doi:10.1007/s11227-018-2389-3

Published: 19 May 2018

Volume 74, pages 3388–3414 (2018)

The Journal of Supercomputing

Lan Gao ORCID: orcid.org/0000-0001-5637-9417¹,
Rui Wang¹,
Yunlong Xu²,
Hailong Yang¹,
Zhongzhi Luan¹,
Depei Qian¹,
Han Zhang³ &
Jihong Cai³


Abstract

The shared last-level cache (LLC) in on-chip CPU–GPU heterogeneous architectures is critical to overall system performance, because CPU and GPU applications usually exhibit very different cache access characteristics. As a result, when co-running with CPU applications, GPU applications can easily occupy the majority of the LLC, severely starving the CPU applications. This poses significant challenges for the design and management of the shared LLC in CPU–GPU heterogeneous architectures. To improve overall system performance, we consider integrating conventional SRAM with an emerging memory technology, STT-RAM, to enlarge the shared LLC. Furthermore, we propose comprehensive management policies to reduce contention between the CPU and GPU units. Experimental results show that, compared with a conventional SRAM-only LLC design, our proposal improves the performance of CPU workloads by 17% without hurting GPU workloads and reduces LLC energy consumption by 30% on average.



Acknowledgements

This work was supported in part by the National Key R&D Program of China (Nos. 2017YFB0203201 and 2017YFC0820100) and by NSFC (Nos. 61732002 and 61202425). We thank all reviewers for their valuable comments and advice on improving this paper.

Author information

Authors and Affiliations

¹ School of Computer Science and Engineering, Beihang University, Beijing, China
Lan Gao, Rui Wang, Hailong Yang, Zhongzhi Luan & Depei Qian
² Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
Yunlong Xu
³ Beijing Simulation Center, Beijing, China
Han Zhang & Jihong Cai

Corresponding author

Correspondence to Rui Wang.

About this article

Cite this article

Gao, L., Wang, R., Xu, Y. et al. SRAM- and STT-RAM-based hybrid, shared last-level cache for on-chip CPU–GPU heterogeneous architectures. J Supercomput 74, 3388–3414 (2018). https://doi.org/10.1007/s11227-018-2389-3

Issue Date: July 2018
DOI: https://doi.org/10.1007/s11227-018-2389-3
