research-article

GOPipe: A Granularity-Oblivious Programming Framework for Pipelined Stencil Executions on GPU

Authors:

Youngmin Yi

PACT '20: Proceedings of the ACM International Conference on Parallel Architectures and Compilation Techniques

Pages 43 - 54

https://doi.org/10.1145/3410463.3414656

Published: 30 September 2020

Abstract

Recent studies have shown promising performance benefits when multiple stages of a pipelined stencil application are mapped to different parts of a GPU and run concurrently. An important factor in the computing efficiency of such pipelines is the granularity of a task. In previous programming frameworks that support true pipelined computation on GPU, programmers have to choose this granularity at application development time. Because of the many difficulties involved, their choices are often far from optimal, resulting in inferior performance and poor performance portability.

This paper presents GOPipe, a granularity-oblivious programming framework for efficient pipelined stencil executions on GPU. With GOPipe, programmers no longer need to specify the task granularity. GOPipe finds an appropriate granularity automatically and dynamically schedules tasks of that granularity for efficiency while observing all inter-task and inter-stage data dependencies. In our experiments on six real-life applications under various scenarios, GOPipe outperforms the state-of-the-art system by 1.39× on average while offering much better programming productivity.
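To make the role of task granularity concrete, the following is a minimal sketch and not GOPipe's API or algorithm. It assumes a hypothetical two-stage 1D pipeline in which each stage is a 3-point average, a task computes one tile of `g` elements, and a stage-2 task becomes ready once the stage-1 tiles covering its input range (including a one-element halo) are done. All names (`stencil`, `tiles`, `deps`, `run_pipeline`) are illustrative, and the scheduler is a sequential stand-in for concurrent execution on a GPU.

```python
# Illustrative sketch only: NOT GOPipe's API or scheduling algorithm.
# Two-stage 1D stencil pipeline; a task computes one tile of g elements.

def stencil(src, i):
    """3-point average with clamped borders."""
    n = len(src)
    return (src[max(i - 1, 0)] + src[i] + src[min(i + 1, n - 1)]) / 3.0

def tiles(n, g):
    """Split [0, n) into half-open tiles of at most g elements."""
    return [(lo, min(lo + g, n)) for lo in range(0, n, g)]

def deps(tile, n, g):
    """Stage-1 tile indices read by a stage-2 tile (tile plus 1-element halo)."""
    lo, hi = tile
    first = max(lo - 1, 0) // g
    last = min(hi, n - 1) // g
    return set(range(first, last + 1))

def run_pipeline(data, g):
    """Run both stages tile by tile, starting stage 2 as early as dependencies allow."""
    n = len(data)
    ts = tiles(n, g)
    mid = [None] * n           # stage-1 output
    out = [None] * n           # stage-2 output
    done1 = set()              # finished stage-1 tiles
    pending2 = list(range(len(ts)))
    order = []                 # (stage, tile index) in execution order
    next1 = 0
    while pending2:
        ready = [t for t in pending2 if deps(ts[t], n, g) <= done1]
        if ready:
            # Prefer a ready stage-2 task so the two stages overlap.
            t = ready[0]
            lo, hi = ts[t]
            for i in range(lo, hi):
                out[i] = stencil(mid, i)
            pending2.remove(t)
            order.append((2, t))
        else:
            # Otherwise advance stage 1 by one tile.
            lo, hi = ts[next1]
            for i in range(lo, hi):
                mid[i] = stencil(data, i)
            done1.add(next1)
            order.append((1, next1))
            next1 += 1
    return out, order

def run_monolithic(data):
    """Reference: finish stage 1 completely, then run stage 2."""
    mid = [stencil(data, i) for i in range(len(data))]
    return [stencil(mid, i) for i in range(len(mid))]

if __name__ == "__main__":
    data = [float(i * i % 7) for i in range(16)]
    for g in (1, 4, 16):
        out, order = run_pipeline(data, g)
        print(g, len(order), out == run_monolithic(data))
```

In this toy model, a small `g` lets stage 2 start early but creates many tasks to schedule, while a large `g` reduces the task count but delays stage 2 until most of stage 1 has finished. The result is the same for every `g`; only the schedule changes, which is why the choice of granularity is a performance question rather than a correctness one.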

Published In

PACT '20: Proceedings of the ACM International Conference on Parallel Architectures and Compilation Techniques

September 2020

505 pages

ISBN: 9781450380751

DOI: 10.1145/3410463

General Chair: Vivek Sarkar, Georgia Institute of Technology

Program Chair: Hyesoon Kim, Georgia Institute of Technology

Copyright © 2020 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGARCH: ACM Special Interest Group on Computer Architecture

Publisher

Association for Computing Machinery

New York, NY, United States

Funding Sources

Beijing Academy of Artificial Intelligence
National Research Foundation of Korea
Beijing Natural Science Foundation
National Key R&D Program of China

Conference

PACT '20

Sponsor:

SIGARCH

PACT '20: International Conference on Parallel Architectures and Compilation Techniques

October 3 - 7, 2020

GA, Virtual Event, USA

Acceptance Rates

Overall Acceptance Rate 121 of 471 submissions, 26%
