
CWLP: coordinated warp scheduling and locality-protected cache allocation on GPUs


Abstract

As we approach the exascale era in supercomputing, designing a balanced computer system with powerful computing ability and low power requirements has become increasingly important. The graphics processing unit (GPU) is an accelerator widely used in recent supercomputers. It runs a large number of threads concurrently to hide long latencies with high energy efficiency. In contrast to their powerful computing ability, GPUs have only a few megabytes of fast on-chip memory per streaming multiprocessor (SM). The GPU cache is inefficient because of a mismatch between the throughput-oriented execution model and the cache hierarchy design. At the same time, current GPUs fail to handle burst-mode long-latency accesses because of their poor warp scheduling methods. Thus, the benefits of the GPU's high computing ability are dramatically reduced by poor cache management and warp scheduling, which limit system performance and energy efficiency. In this paper, we put forward a coordinated warp scheduling and locality-protected (CWLP) cache allocation scheme to make full use of data locality and hide latency. We first present a locality-protected cache allocation method based on the instruction program counter (LPC) to improve cache performance. Specifically, we use a PC-based locality detector to collect the reuse information of each cache line, and employ a prioritised cache allocation unit (PCAU) that coordinates the data reuse information with the time-stamp information to evict the lines with the least reuse possibility. Moreover, the locality information is used by the warp scheduler to create an intelligent warp reordering scheme that captures locality and hides latency. Simulation results show that CWLP provides a speedup of up to 19.8% and an average improvement of 8.8% over the baseline methods.
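To make the mechanism concrete, the following is a minimal behavioural sketch, in simulator-style Python, of the two ideas the abstract describes: a PC-indexed locality detector feeding a prioritised eviction rule (the PCAU), and a warp reordering pass driven by the same reuse information. The structure sizes, field names, and the exact priority function are illustrative assumptions, not the paper's actual hardware design.

    # Behavioural sketch (assumptions, not the authors' hardware): a
    # PC-indexed locality detector plus a prioritised allocation unit that
    # combines per-line reuse counts with time stamps to pick a victim.
    from dataclasses import dataclass

    @dataclass
    class CacheLine:
        tag: int = -1
        fill_pc: int = 0      # PC of the load that filled this line
        reuse: int = 0        # hits observed since the line was filled
        last_access: int = 0  # time stamp of the most recent access
        valid: bool = False

    class LPCSet:
        """One set of a set-associative L1D with locality-protected allocation."""
        def __init__(self, ways=4):
            self.lines = [CacheLine() for _ in range(ways)]
            self.pc_reuse = {}  # locality detector: fill PC -> reuse score
            self.clock = 0

        def access(self, tag, pc):
            """Return True on a hit; on a miss, allocate via the PCAU rule."""
            self.clock += 1
            for line in self.lines:
                if line.valid and line.tag == tag:
                    line.reuse += 1
                    line.last_access = self.clock
                    # Train the detector: data filled by this PC gets reused.
                    self.pc_reuse[line.fill_pc] = self.pc_reuse.get(line.fill_pc, 0) + 1
                    return True
            self._allocate(tag, pc)
            return False

        def _allocate(self, tag, pc):
            # Prefer an invalid way; otherwise evict the valid line with the
            # lowest predicted reuse, breaking ties by oldest time stamp.
            victim = min(self.lines,
                         key=lambda l: (l.valid,
                                        self.pc_reuse.get(l.fill_pc, 0) + l.reuse,
                                        l.last_access))
            victim.tag, victim.fill_pc = tag, pc
            victim.reuse, victim.last_access, victim.valid = 0, self.clock, True

    # Illustrative warp reordering driven by the same detector: warps whose
    # next instruction has shown high data reuse issue first, so locality is
    # captured while low-reuse (streaming) warps overlap their misses.
    def reorder_warps(ready, pc_reuse):
        # ready: list of (warp_id, next_pc) pairs
        return sorted(ready, key=lambda w: -pc_reuse.get(w[1], 0))

A real design would bound the detector table and age its counters; the point of the sketch is only the coordination, i.e., one reuse predictor serving both the replacement decision and the scheduling order.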



Author information


Corresponding author

Correspondence to Yang Zhang.

Additional information

Project supported by the National Natural Science Foundation of China (No. 61170083) and the Specialized Research Fund for the Doctoral Program of Higher Education, China (No. 20114307110001)


About this article


Cite this article

Zhang, Y., Xing, Zc., Liu, C. et al. CWLP: coordinated warp scheduling and locality-protected cache allocation on GPUs. Frontiers Inf Technol Electronic Eng 19, 206–220 (2018). https://doi.org/10.1631/FITEE.1700059

