Abstract
The shared last-level cache (LLC) in on-chip CPU–GPU heterogeneous architectures is critical to overall system performance. Because CPU and GPU applications usually exhibit very different cache access characteristics, GPU applications co-running with CPU applications can easily occupy the majority of the LLC and severely starve the CPU applications. This poses significant challenges for the design and management of the shared LLC in CPU–GPU heterogeneous architectures. To improve overall system performance, we consider integrating conventional SRAM with an emerging memory technology, STT-RAM, to enlarge the shared LLC. Furthermore, we propose comprehensive management policies to reduce contention between the CPU and GPU units. Experimental results show that, compared with a conventional SRAM-only LLC design, our proposal improves the performance of CPU workloads by 17% without hurting GPU workloads and reduces LLC energy consumption by 30% on average.
Acknowledgements
This work was supported in part by the National Key R&D Program of China (Nos. 2017YFB0203201 and 2017YFC0820100) and by NSFC (Nos. 61732002 and 61202425). We thank all reviewers for their valuable comments and advice on improving this paper.
Cite this article
Gao, L., Wang, R., Xu, Y. et al. SRAM- and STT-RAM-based hybrid, shared last-level cache for on-chip CPU–GPU heterogeneous architectures. J Supercomput 74, 3388–3414 (2018). https://doi.org/10.1007/s11227-018-2389-3
DOI: https://doi.org/10.1007/s11227-018-2389-3