DOI: 10.1145/3203217.3203243

Taming irregular applications via advanced dynamic parallelism on GPUs

Published: 08 May 2018

Abstract

On recent GPU architectures, dynamic parallelism, which enables kernels to be launched from the GPU without CPU involvement, offers a way to improve the performance of irregular applications: child kernels are generated dynamically to reduce workload imbalance and improve GPU utilization. In practice, however, dynamic parallelism often fails to improve performance because of high kernel launch overhead and low child kernel occupancy. Consequently, most existing studies focus on mitigating the kernel launch overhead. As that overhead has decreased through algorithmic redesigns and hardware architectural innovations, the organization of subtasks into child kernels has become the new performance bottleneck.
We present an in-depth characterization of existing software approaches to dynamic parallelism optimization on the latest GPUs. We observe that current subtask-aggregation approaches, which use a "one-size-fits-all" method that treats all subtasks equally, can under-utilize resources and degrade overall performance, because different subtasks require different configurations for optimal performance. To address this problem, we leverage statistical and machine-learning techniques and propose a performance-modeling and task-scheduling tool that can (1) analyze the performance characteristics of subtasks to identify the critical performance factors, (2) predict the performance of new subtasks, and (3) generate the optimal aggregation strategy for new subtasks. Experimental results show that our approach, with the optimal subtask-aggregation strategy, achieves up to a 1.8-fold speedup over the existing task-aggregation approach for dynamic parallelism.
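To illustrate the idea of size-aware aggregation, here is a hypothetical Python sketch (not the paper's actual tool): a stand-in for a learned performance model picks an aggregation granularity per subtask size class, instead of one batch size for all subtasks as in the "one-size-fits-all" approach. The thresholds and batch sizes are invented for illustration only.

```python
# Hypothetical sketch of per-size-class subtask aggregation. The "model"
# below is a hand-written stand-in for the paper's learned predictor; the
# thresholds and granularities are illustrative assumptions, not measured.

def predict_best_batch(size):
    """Stand-in performance model: map a subtask size to an aggregation
    granularity (number of subtasks per child kernel)."""
    if size < 32:        # tiny subtasks: aggregate many to amortize launch cost
        return 64
    elif size < 256:     # medium subtasks: moderate aggregation
        return 8
    return 1             # large subtasks: launch individually for occupancy

def aggregate(subtask_sizes):
    """Group subtasks into child-kernel batches, one granularity per class."""
    buckets = {}
    for s in subtask_sizes:
        buckets.setdefault(predict_best_batch(s), []).append(s)
    batches = []
    for cap, tasks in buckets.items():
        for i in range(0, len(tasks), cap):
            batches.append(tasks[i:i + cap])
    return batches

sizes = [8] * 100 + [100] * 10 + [1000] * 3
print(len(aggregate(sizes)))  # 2 tiny + 2 medium + 3 large = 7 batches
```

A uniform policy would instead apply one granularity to all 113 subtasks, either over-aggregating the large subtasks or leaving the tiny ones to pay per-launch overhead; the per-class split is what the abstract's point (3) generalizes with a learned model.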




Published In

CF '18: Proceedings of the 15th ACM International Conference on Computing Frontiers
May 2018
401 pages
ISBN:9781450357616
DOI:10.1145/3203217


Publisher

Association for Computing Machinery

New York, NY, United States



Author Tags

  1. GPU
  2. dynamic parallelism
  3. irregular applications
  4. performance modeling

Qualifiers

  • Research-article

Conference

CF '18: Computing Frontiers Conference
May 8 - 10, 2018
Ischia, Italy

Acceptance Rates

Overall acceptance rate: 273 of 785 submissions (35%)


Cited By

  • (2022) A Compiler Framework for Optimizing Dynamic Parallelism on GPUs. IEEE/ACM International Symposium on Code Generation and Optimization (CGO), pp. 1-13. DOI: 10.1109/CGO53902.2022.9741284
  • (2021) NestGPU: Nested Query Processing on GPU. IEEE 37th International Conference on Data Engineering (ICDE), pp. 1008-1019. DOI: 10.1109/ICDE51399.2021.00092
  • (2019) Extracting SIMD Parallelism from Recursive Task-Parallel Programs. ACM Transactions on Parallel Computing 6(4), pp. 1-37. DOI: 10.1145/3365663
  • (2019) SEP-graph. Proceedings of the 24th Symposium on Principles and Practice of Parallel Programming, pp. 38-52. DOI: 10.1145/3293883.3295733
  • (2019) Tasking in Accelerators: Performance Evaluation. 20th International Conference on Parallel and Distributed Computing, Applications and Technologies (PDCAT), pp. 127-132. DOI: 10.1109/PDCAT46702.2019.00034
  • (2019) Dynamic Block Size Adjustment and Workload Balancing Strategy Based on CPU-GPU Heterogeneous Platform. IEEE ISPA/BDCloud/SocialCom/SustainCom, pp. 999-1006. DOI: 10.1109/ISPA-BDCloud-SustainCom-SocialCom48970.2019.00144
  • (2019) In-Memory Join Algorithms on GPUs for Large-Data. IEEE HPCC/SmartCity/DSS, pp. 1060-1067. DOI: 10.1109/HPCC/SmartCity/DSS.2019.00151
