research-article

A control-structure splitting optimization for GPGPU

Authors:

Snaider Carrillo,

Jakob Siegel,

Xiaoming Li

CF '09: Proceedings of the 6th ACM conference on Computing frontiers

Pages 147 - 150

https://doi.org/10.1145/1531743.1531766

Published: 18 May 2009

Abstract

Control statements in a GPU program, such as loops and branches, pose serious challenges for the efficient use of GPU resources because they serialize threads and consequently reduce the occupancy of the GPU, that is, the number of threads running concurrently. Unlike traditional vector processing units inside a general-purpose processor, the GPU cannot leave control statements to the CPU because fine-grained statement scheduling between the GPU and the CPU is impossible. We need an effective method to handle control statements "just in place" on the GPU.

In this paper, we propose novel techniques to transform control statements so that they can be executed efficiently on GPUs. Our techniques deliberately increase code redundancy, which might be deemed a "de-optimization" on a CPU, to improve the occupancy of a program on the GPU and thereby improve performance. We focus on how common programming structures such as loops and branches decrease the occupancy of single kernels and how to counter that effect. We demonstrate our optimizations on a synthetic benchmark and on a complex parallel algorithm, the Lattice Boltzmann Method (LBM). Our results show that these techniques are very effective and can lead to an increase in occupancy and a drastic improvement in performance compared to the non-split versions of the programs.
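As a rough illustration of why splitting a control structure can help, the following Python sketch uses a toy cost model; it is not the authors' implementation. It assumes that a warp of 32 threads pays for every branch path that at least one of its threads takes, and it compares one kernel containing a branch against two branch-free kernels, each run only over the work items that follow its path. The function names, the per-path costs, and the host-side partitioning step are all hypothetical: the abstract does not say how the split is performed, and the model ignores the cost of partitioning and of the extra kernel launch.

```python
WARP_SIZE = 32  # threads per warp on NVIDIA GPUs


def warps(items, warp_size=WARP_SIZE):
    """Group consecutive work items into warps."""
    return [items[i:i + warp_size] for i in range(0, len(items), warp_size)]


def kernel_cost(items, takes_path_a, cost_a, cost_b):
    """Toy SIMT cost: each warp pays for every path taken by any of its threads."""
    total = 0
    for warp in warps(items):
        paths = {bool(takes_path_a(x)) for x in warp}
        if True in paths:
            total += cost_a   # the warp executes path A
        if False in paths:
            total += cost_b   # the warp also (serially) executes path B
    return total


def split_cost(items, takes_path_a, cost_a, cost_b):
    """Hypothetical split: one branch-free kernel per path, over partitioned items."""
    items_a = [x for x in items if takes_path_a(x)]
    items_b = [x for x in items if not takes_path_a(x)]
    return (kernel_cost(items_a, lambda _: True, cost_a, cost_b)
            + kernel_cost(items_b, lambda _: False, cost_a, cost_b))


def is_even(x):
    return x % 2 == 0


# Worst case for the unsplit kernel: the paths alternate inside every warp.
items = list(range(256))
unsplit = kernel_cost(items, is_even, cost_a=10, cost_b=30)
split = split_cost(items, is_even, cost_a=10, cost_b=30)
print(unsplit, split)
```

In this model the unsplit kernel makes all eight warps execute both paths, while each split kernel runs four warps over a single path. The split version duplicates any code shared by the two paths, which is the kind of redundancy the abstract describes as a "de-optimization" on a CPU.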

Published In

CF '09: Proceedings of the 6th ACM conference on Computing frontiers

May 2009

238 pages

ISBN:9781605584133

DOI:10.1145/1531743

General Chairs: Gearold Johnson (Colorado State University, USA), Carsten Trinitis (TU München, Germany)

Program Chairs: Georgi N. Gaydadjiev (TU Delft, The Netherlands), Alex Veidenbaum (University of California, USA)

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

CF '09

CF '09: Computing Frontiers Conference

May 18 - 20, 2009

Ischia, Italy

Acceptance Rates

CF '09 paper acceptance rate: 26 of 113 submissions (23%)

Overall acceptance rate: 273 of 785 submissions (35%)
