DOI: 10.1145/3203217.3203243

Taming irregular applications via advanced dynamic parallelism on GPUs

Published: 08 May 2018

Abstract

On recent GPU architectures, dynamic parallelism, which enables kernels to be launched from the GPU without CPU involvement, offers a way to improve the performance of irregular applications: child kernels are generated dynamically to reduce workload imbalance and improve GPU utilization. In practice, however, dynamic parallelism often fails to improve performance because of high kernel launch overhead and low child kernel occupancy. Consequently, most existing studies focus on mitigating the kernel launch overhead. As that overhead has decreased through algorithmic redesigns and hardware architectural innovations, the organization of subtasks into child kernels has become the new performance bottleneck.
We present an in-depth characterization of existing software approaches to dynamic parallelism optimization on the latest GPUs. We observe that current subtask-aggregation approaches, which use a "one-size-fits-all" method that treats all subtasks equally, can under-utilize resources and degrade overall performance, because different subtasks require different configurations for optimal performance. To address this problem, we leverage statistical and machine-learning techniques and propose a performance-modeling and task-scheduling tool that can (1) analyze the performance characteristics of subtasks to identify the critical performance factors, (2) predict the performance of new subtasks, and (3) generate the optimal aggregation strategy for new subtasks. Experimental results show that our approach, with the optimal subtask-aggregation strategy, achieves up to a 1.8-fold speedup over the existing task-aggregation approach for dynamic parallelism.
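To illustrate the idea of size-aware aggregation, here is a hypothetical Python sketch (not the paper's actual tool): a stand-in for a learned performance model picks an aggregation granularity per subtask size class, instead of one batch size for all subtasks as in the "one-size-fits-all" approach. The thresholds and batch sizes are invented for illustration only.

```python
# Hypothetical sketch of per-size-class subtask aggregation. The "model"
# below is a hand-written stand-in for the paper's learned predictor; the
# thresholds and granularities are illustrative assumptions, not measured.

def predict_best_batch(size):
    """Stand-in performance model: map a subtask size to an aggregation
    granularity (number of subtasks per child kernel)."""
    if size < 32:        # tiny subtasks: aggregate many to amortize launch cost
        return 64
    elif size < 256:     # medium subtasks: moderate aggregation
        return 8
    return 1             # large subtasks: launch individually for occupancy

def aggregate(subtask_sizes):
    """Group subtasks into child-kernel batches, one granularity per class."""
    buckets = {}
    for s in subtask_sizes:
        buckets.setdefault(predict_best_batch(s), []).append(s)
    batches = []
    for cap, tasks in buckets.items():
        for i in range(0, len(tasks), cap):
            batches.append(tasks[i:i + cap])
    return batches

sizes = [8] * 100 + [100] * 10 + [1000] * 3
print(len(aggregate(sizes)))  # 2 tiny + 2 medium + 3 large = 7 batches
```

A uniform policy would instead apply one granularity to all 113 subtasks, either over-aggregating the large subtasks or leaving the tiny ones to pay per-launch overhead; the per-class split is what the abstract's point (3) generalizes with a learned model.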




Published In

CF '18: Proceedings of the 15th ACM International Conference on Computing Frontiers
May 2018
401 pages
ISBN:9781450357616
DOI:10.1145/3203217


Publisher

Association for Computing Machinery

New York, NY, United States



Author Tags

  1. GPU
  2. dynamic parallelism
  3. irregular applications
  4. performance modeling

Qualifiers

  • Research-article

Conference

CF '18: Computing Frontiers Conference
May 8 - 10, 2018
Ischia, Italy

Acceptance Rates

Overall acceptance rate: 273 of 785 submissions (35%)


Cited By

  • (2022) A Compiler Framework for Optimizing Dynamic Parallelism on GPUs. IEEE/ACM International Symposium on Code Generation and Optimization (CGO), pp. 1-13. DOI: 10.1109/CGO53902.2022.9741284
  • (2021) NestGPU: Nested Query Processing on GPU. IEEE 37th International Conference on Data Engineering (ICDE), pp. 1008-1019. DOI: 10.1109/ICDE51399.2021.00092
  • (2019) Extracting SIMD Parallelism from Recursive Task-Parallel Programs. ACM Transactions on Parallel Computing 6(4), pp. 1-37. DOI: 10.1145/3365663
  • (2019) SEP-graph. Proceedings of the 24th Symposium on Principles and Practice of Parallel Programming, pp. 38-52. DOI: 10.1145/3293883.3295733
  • (2019) Tasking in Accelerators: Performance Evaluation. 20th International Conference on Parallel and Distributed Computing, Applications and Technologies (PDCAT), pp. 127-132. DOI: 10.1109/PDCAT46702.2019.00034
  • (2019) Dynamic Block Size Adjustment and Workload Balancing Strategy Based on CPU-GPU Heterogeneous Platform. IEEE ISPA/BDCloud/SocialCom/SustainCom, pp. 999-1006. DOI: 10.1109/ISPA-BDCloud-SustainCom-SocialCom48970.2019.00144
  • (2019) In-Memory Join Algorithms on GPUs for Large-Data. IEEE HPCC/SmartCity/DSS, pp. 1060-1067. DOI: 10.1109/HPCC/SmartCity/DSS.2019.00151
