Warp-Consolidation: A Novel Execution Model for GPUs

Published: 12 June 2018

Abstract

With the unprecedented growth of compute capability and memory bandwidth on modern GPUs, parallel communication and synchronization have become a major concern for continued performance scaling. This is especially the case for emerging big-data applications. Instead of relying on a few heavily loaded CTAs that may expose opportunities for intra-CTA data reuse, current technology and design trends suggest the performance potential of allocating many lightweight CTAs that process individual tasks more independently, as the overheads of synchronization, communication, and cooperation may greatly outweigh the benefits of exploiting limited data reuse in heavily loaded CTAs. This paper follows this trend and proposes a novel execution model for modern GPUs that hides the CTA execution hierarchy of the classic GPU execution model while exposing the originally hidden warp-level execution. Specifically, it relies on individual warps to undertake the original CTAs' tasks. The key observation is that significant performance gains can be achieved by replacing traditional inter-warp communication (e.g., via shared memory), cooperation (e.g., via bar primitives), and synchronization (e.g., via CTA barriers) with more efficient intra-warp communication (e.g., via register shuffling), cooperation (e.g., via warp voting), and synchronization (naturally lockstep execution) across the SIMD lanes within a warp. We analyze the pros and cons of this design and propose solutions to counter potential negative effects. Experimental results on a diverse group of thirty-two representative applications show that our proposed Warp-Consolidation execution model achieves average speedups of 1.7x, 2.3x, 1.5x and 1.2x (up to 6.3x, 31x, 6.4x and 3.8x) on NVIDIA Kepler (Tesla-K80), Maxwell (Tesla-M40), Pascal (Tesla-P100) and Volta (Tesla-V100) GPUs, respectively, demonstrating its applicability and portability.
Our approach can be directly employed either to transform legacy code or to write new algorithms on modern commodity GPUs.
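To illustrate the kind of transformation the abstract describes, the sketch below contrasts a classic CTA-level reduction, which cooperates through shared memory and CTA barriers, with an intra-warp reduction that exchanges partial sums purely through register shuffling. The kernel names, sizes, and host driver are illustrative assumptions, not taken from the paper; `__shfl_down_sync` requires CUDA 9+ and compute capability 3.0 or later.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Classic CTA-level reduction: threads cooperate through shared memory
// and must synchronize with __syncthreads() at every step.
// Assumes blockDim.x == 256.
__global__ void cta_reduce(const float *in, float *out) {
    __shared__ float buf[256];
    int t = threadIdx.x;
    buf[t] = in[blockIdx.x * blockDim.x + t];
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (t < s) buf[t] += buf[t + s];
        __syncthreads();  // inter-warp synchronization on every iteration
    }
    if (t == 0) out[blockIdx.x] = buf[0];
}

// Intra-warp reduction in the spirit of Warp-Consolidation: the 32 lanes
// of one warp exchange partial sums through registers with
// __shfl_down_sync, so no shared memory and no explicit barrier is
// needed (lanes within a warp naturally execute in lockstep).
__global__ void warp_reduce(const float *in, float *out) {
    int lane = threadIdx.x & 31;
    float v = in[blockIdx.x * 32 + threadIdx.x];
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_down_sync(0xffffffff, v, offset);  // register shuffle
    if (lane == 0) out[blockIdx.x] = v;
}

int main() {
    const int N = 256;
    float h[N], expected = 0.0f;
    for (int i = 0; i < N; ++i) { h[i] = 1.0f; expected += h[i]; }

    float *d_in, *d_out;
    cudaMalloc(&d_in, N * sizeof(float));
    cudaMalloc(&d_out, (N / 32) * sizeof(float));
    cudaMemcpy(d_in, h, N * sizeof(float), cudaMemcpyHostToDevice);

    // One warp per "task": 8 warps, each reducing its own 32 elements.
    warp_reduce<<<N / 32, 32>>>(d_in, d_out);

    float partial[N / 32];
    cudaMemcpy(partial, d_out, sizeof(partial), cudaMemcpyDeviceToHost);
    float total = 0.0f;
    for (float p : partial) total += p;
    printf("warp-level sum = %.0f (expected %.0f)\n", total, expected);

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```

The point of the contrast is that the warp-level variant eliminates both the shared-memory traffic and every `__syncthreads()` call of the CTA version, which is the class of overhead the paper's execution model targets.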




Published In

ICS '18: Proceedings of the 2018 International Conference on Supercomputing
June 2018
407 pages
ISBN:9781450357838
DOI:10.1145/3205289
Publication rights licensed to ACM. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

Publisher

Association for Computing Machinery

New York, NY, United States


Acceptance Rates

Overall Acceptance Rate 629 of 2,180 submissions, 29%
