SRAM- and STT-RAM-based hybrid, shared last-level cache for on-chip CPU–GPU heterogeneous architectures

The Journal of Supercomputing

Abstract

The shared last-level cache (LLC) in on-chip CPU–GPU heterogeneous architectures is critical to overall system performance because CPU and GPU applications typically exhibit very different cache access characteristics. When co-running with CPU applications, GPU applications can easily occupy the majority of the LLC, severely starving the CPU applications. This poses significant challenges for the design and management of the shared LLC in CPU–GPU heterogeneous architectures. To improve overall system performance, we consider integrating conventional SRAM with a newer memory technology, STT-RAM, to enlarge the shared LLC. Furthermore, we propose comprehensive management policies to reduce contention between the CPU and GPU units. Experimental results show that, compared with a conventional SRAM-only LLC design, our proposal improves the performance of CPU workloads by 17% without hurting GPU workloads and reduces LLC energy consumption by 30% on average.
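The contention described in the abstract can be illustrated with a toy model. The sketch below is not the paper's design or its management policy: it assumes a fully associative LRU cache, a CPU that cycles over a small working set, and a GPU that streams through fresh lines at a higher rate. The function name `simulate` and all of its parameters are hypothetical and chosen only for illustration.

```python
from collections import OrderedDict


def simulate(capacity, cpu_working_set, gpu_per_cpu, steps):
    """Toy shared LRU cache (an assumption, not the paper's LLC model).

    Returns (cpu_hit_rate, gpu_share_of_cache_at_end).
    """
    cache = OrderedDict()  # keys are (owner, address), ordered LRU -> MRU
    cpu_hits = 0
    gpu_addr = 0

    def access(key):
        hit = key in cache
        if hit:
            cache.move_to_end(key)          # refresh recency on a hit
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)   # evict the least recently used line
            cache[key] = True
        return hit

    for step in range(steps):
        # CPU: cyclic reuse over a small working set.
        if access(("cpu", step % cpu_working_set)):
            cpu_hits += 1
        # GPU: streaming accesses that never reuse a line.
        for _ in range(gpu_per_cpu):
            access(("gpu", gpu_addr))
            gpu_addr += 1

    gpu_lines = sum(1 for owner, _ in cache if owner == "gpu")
    return cpu_hits / steps, gpu_lines / capacity


if __name__ == "__main__":
    alone = simulate(capacity=64, cpu_working_set=16, gpu_per_cpu=0, steps=1000)
    shared = simulate(capacity=64, cpu_working_set=16, gpu_per_cpu=8, steps=1000)
    print("CPU alone :", alone)
    print("CPU + GPU :", shared)
```

In this toy setting the CPU working set fits in the cache when the CPU runs alone, but once the streaming GPU traffic is added, every CPU line is evicted before it is reused and GPU lines hold most of the capacity. Real LLCs are set-associative and real workloads are far less regular, so the numbers are only qualitative.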

Acknowledgements

This work was supported in part by the National Key R&D Program of China (Nos. 2017YFB0203201 and 2017YFC0820100) and by NSFC (Nos. 61732002 and 61202425). We thank all reviewers for their valuable comments and advice on improving this paper.

Author information

Corresponding author

Correspondence to Rui Wang.

About this article

Cite this article

Gao, L., Wang, R., Xu, Y. et al. SRAM- and STT-RAM-based hybrid, shared last-level cache for on-chip CPU–GPU heterogeneous architectures. J Supercomput 74, 3388–3414 (2018). https://doi.org/10.1007/s11227-018-2389-3

  • DOI: https://doi.org/10.1007/s11227-018-2389-3
