Abstract
As we approach the exascale era in supercomputing, designing a balanced computer system with a powerful computing ability and low power requirements has becoming increasingly important. The graphics processing unit (GPU) is an accelerator used widely in most of recent supercomputers. It adopts a large number of threads to hide a long latency with a high energy efficiency. In contrast to their powerful computing ability, GPUs have only a few megabytes of fast on-chip memory storage per streaming multiprocessor (SM). The GPU cache is inefficient due to a mismatch between the throughput-oriented execution model and cache hierarchy design. At the same time, current GPUs fail to handle burst-mode long-access latency due to GPU’s poor warp scheduling method. Thus, benefits of GPU’s high computing ability are reduced dramatically by the poor cache management and warp scheduling methods, which limit the system performance and energy efficiency. In this paper, we put forward a coordinated warp scheduling and locality-protected (CWLP) cache allocation scheme to make full use of data locality and hide latency. We first present a locality-protected cache allocation method based on the instruction program counter (LPC) to promote cache performance. Specifically, we use a PC-based locality detector to collect the reuse information of each cache line and employ a prioritised cache allocation unit (PCAU) which coordinates the data reuse information with the time-stamp information to evict the lines with the least reuse possibility. Moreover, the locality information is used by the warp scheduler to create an intelligent warp reordering scheme to capture locality and hide latency. Simulation results show that CWLP provides a speedup up to 19.8% and an average improvement of 8.8% over the baseline methods.
Similar content being viewed by others
References
Bakhoda A, Yuan G, Fung W, et al., 2009. Analyzing CUDA workloads using a detailed GPU simulator. ISPASS IEEE Int Symp on Performance Analysis of Systems and Software, p.163–174. https://doi.org/10.1109/ISPASS.2009.4919648
Che S, Boyer M, Meng J, et al., 2009. Rodinia: a benchmark suite for heterogeneous computing. IISWC IEEE Int Symp on Workload Characterization, p.44–54. https://doi.org/10.1109/IISWC.2009.5306797
Chen J, Tao X, Yang Z, et al., 2013. Guided region-based GPU scheduling: utilizing multi-thread parallelism to hide memory latency. IEEE 27th Int Symp on Parallel & Distributed Processing, p.441–451. https://doi.org/10.1109/IPDPS.2013.95
Chen X, Chang L, Rodrigues C, et al., 2014. Adaptive cache management for energy-efficient GPU computing. Proc 47th Annual IEEE/ACM Int Symp on Microarchitecture, p.343–355. https://doi.org/10.1109/MICRO.2014.11
Dally W, Labonte F, Das A, et al., 2003. Merrimac: supercomputing with streams. Proc ACM/IEEE Conf on Supercomputing, Article 35. https://doi.org/10.1145/1048935.1050187
Drew Y, 2008. A closer look at GPUs. Commun ACM, 51(10):50–57. https://doi.org/10.1145/1400181.1400197
Fang W, He B, Luo Q, et al., 2011. Mars: accelerating mapreduce with graphics processors. IEEE Trans Parall Distr Syst, 22(4):608–620. https://doi.org/10.1109/TPDS.2010.158
Gebhart M, Johnson D, Tarjan D, et al., 2011. Energyefficient mechanisms for managing thread context in throughput processors. Proc 38th Annual Int Symp Computer Architecture, p.235–246. https://doi.org/10.1145/2000064.2000093
Gupta S, Xiang P, Zhou H, 2013. Analyzing locality of memory references in GPU architectures. Proc ACM SIGPLAN Workshop on Memory Systems Performance and Correctness, Article 12. https://doi.org/10.1145/2492408.2492423
Harris M, 2014. Maxwell: the Most Advanced CUDA GPU Ever Made. https://devblogs.nvidia.com/parallelforall/maxwell-most-advanced-cuda-gpu-ever-made
Jia W, Shaw K, Martonosi M, 2014. MRPB: memory request prioritization for massively parallel processors. IEEE 20th Int Symp on High Performance Computer Architecture, p.272–283. https://doi.org/10.1109/HPCA.2014.6835938
Jog A, Kayiran O, Nachiappan C, et al., 2013. OWL: cooperative thread array aware scheduling techniques for improving GPGPU performance. ACM SIGARCH Comput Arch News, 41(1):395–406. https://doi.org/10.1145/2490301.2451158
Lee M, Song S, Moon J, et al., 2014. Improving GPGPU resource utilization through alternative thread block scheduling. IEEE 20th Int Symp on High Performance Computer Architecture, p.260–271. https://doi.org/10.1109/HPCA.2014.6835937
Lee S, Arunkumar A, Wu C, 2015. CAWA: coordinated warp scheduling and cache prioritization for critical warp acceleration of GPGPU workloads. Proc 42nd Annual Int Symp on Computer Architecture, p.515–527. https://doi.org/10.1145/2872887.2750418
Narasiman V, Shebanow M, Lee CJ, et al., 2011. Improving GPU performance via large warps and two-level warp scheduling. Proc 44th Annual IEEE/ACM Int Symp on Microarchitecture, p.308–317. https://doi.org/10.1145/2155620.2155656
Nugteren C, van den Braak G, Corporaal H, et al., 2014. A detailed GPU cache model based on reuse distance theory. IEEE 20th Int Symp on High Performance Computer Architecture, p.37–48. https://doi.org/10.1109/HPCA.2014.6835955
NVIDIA, 2009. NVIDIA’s next generation CUDA compute architecture: FERMI. v1.1. http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_ Compute_Architecture_Whitepaper.pdf NVIDIA, 2015.
NVIDIA CUDA C Programming Guide v7.5. http://developer.nvidia.com/nvidia-gpu-computingdocumentation
Rhu M, Sullivan M, Leng J, et al., 2013. A locality-aware memory hierarchy for energy-efficient GPU architectures. Proc 46th Annual IEEE/ACM Int Symp on Microarchitecture, p.86–98. https://doi.org/10.1145/2540708.2540717
Rogers T, O’Connor M, Aamodt T, 2012. Cache-conscious wavefront scheduling. Proc 45th Annual IEEE/ACM Int Symp on Microarchitecture, p.72–83. https://doi.org/10.1109/MICRO.2012.16
Rogers T, O’Connor M, Aamodt T, 2013. Divergence-aware warp scheduling. Proc 46th Annual IEEE/ACM Int Symp on Microarchitecture, p.99–110. https://doi.org/10.1145/2540708.2540718
Sethia A, Jamshidi D, Mahlke S, 2015. Mascar: speeding up GPU warps by reducing memory pitstops. IEEE 21st Int Symp on High Performance Computer Architecture, p.174–185. https://doi.org/10.1109/HPCA.2015.7056031
Xie X, Liang Y, Sun G, et al., 2013. An efficient compiler framework for cache bypassing on GPUs. IEEE/ACM Int Conf on Computer-Aided Design, p.516–523. https://doi.org/10.1109/ICCAD.2013.6691165
Xie X, Liang Y, Wang Y, et al., 2015. Coordinated static and dynamic cache bypassing for GPUs. IEEE 21st Int Symp on High Performance Computer Architecture, p.76–88. https://doi.org/10.1109/HPCA.2015.7056023
Xie X, Liang Y, Li X, et al., 2017. Enabling coordinated register allocation and thread-level parallelism optimization for GPUs. IEEE/ACM Int Symp on Microarchitecture, p.395–406. https://doi.org/10.1109/TC.2017.2776272
Zhang Y, Xing Z, Zhou L, et al., 2017. Locality protected dynamic cache allocation scheme on GPUs. IEEE Trustcom/BigDataSE/ISPA, p.1524–1530. https://doi.org/10.1109/TrustCom.2016.0237
Zheng Z, 2014. Research on Key Technologies for Cache Power and Performance Optimization on Many-Core Heterogeneous Architecture. PhD Thesis, National University of Defense Technology, Changsha, China (in Chinese).
Author information
Authors and Affiliations
Corresponding author
Additional information
Project supported by the National Natural Science Foundation of China (No. 61170083) and the Specialized Research Fund for the Doctoral Program of Higher Education, China (No. 20114307110001)
Rights and permissions
About this article
Cite this article
Zhang, Y., Xing, Zc., Liu, C. et al. CWLP: coordinated warp scheduling and locality-protected cache allocation on GPUs. Frontiers Inf Technol Electronic Eng 19, 206–220 (2018). https://doi.org/10.1631/FITEE.1700059
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1631/FITEE.1700059