DOI: 10.1145/3410463.3414656
research-article

GOPipe: A Granularity-Oblivious Programming Framework for Pipelined Stencil Executions on GPU

Published: 30 September 2020

Abstract

Recent studies have shown promising performance benefits when multiple stages of a pipelined stencil application are mapped to different parts of a GPU to run concurrently. An important factor in the computing efficiency of such pipelines is task granularity. In previous programming frameworks that support true pipelined computation on GPUs, programmers must choose the granularity at application development time. Because the choice is difficult, programmers' decisions are often far from optimal, resulting in inferior performance and poor performance portability.
This paper presents GOPipe, a granularity-oblivious programming framework for efficient pipelined stencil executions on GPU. With GOPipe, programmers no longer need to specify the appropriate task granularity. GOPipe finds it automatically and dynamically schedules tasks of that granularity for efficiency while observing all inter-task and inter-stage data dependencies. In our experiments on six real-life applications and various scenarios, GOPipe outperforms the state-of-the-art system by 1.39X on average, with much better programming productivity.
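
To make the notion of task granularity concrete, the following is a minimal CPU-side sketch in Python and not GOPipe's API or implementation. It assumes a 1D array, a 3-point averaging stencil with clamped borders, and a hypothetical `run_pipeline` helper; the names `stencil3`, `run_pipeline`, `RADIUS`, and `tile` are all illustrative. A task is one tile of one stage, and a task in a later stage runs only after the earlier-stage tiles that cover its input range plus the halo have finished. Changing `tile` changes the number of tasks and how early later stages can start, but not the result.

```python
from collections import deque

RADIUS = 1  # assumed stencil radius: each output reads one neighbor per side


def stencil3(src, i):
    """3-point average with clamped borders (illustrative stencil)."""
    n = len(src)
    return (src[max(i - 1, 0)] + src[i] + src[min(i + 1, n - 1)]) / 3.0


def run_pipeline(data, tile, stages=2):
    """Run `stages` stencil stages over `data`, split into tiles of size `tile`.

    Returns the final buffer and the order in which (stage, tile) tasks ran.
    """
    n = len(data)
    num_tiles = (n + tile - 1) // tile
    # buffers[s] is the input of stage s; buffers[s + 1] is its output.
    buffers = [list(data)] + [[0.0] * n for _ in range(stages)]
    done = set()
    # Interleave stages so later-stage tasks are attempted as early as possible.
    pending = deque((s, t) for t in range(num_tiles) for s in range(stages))
    executed = []
    while pending:
        s, t = pending.popleft()
        lo, hi = t * tile, min((t + 1) * tile, n)
        if s > 0:
            # Inter-stage dependency: producer tiles covering [lo - R, hi + R).
            first = max(lo - RADIUS, 0) // tile
            last = (min(hi + RADIUS, n) - 1) // tile
            if any((s - 1, d) not in done for d in range(first, last + 1)):
                pending.append((s, t))  # not ready yet; retry later
                continue
        for i in range(lo, hi):
            buffers[s + 1][i] = stencil3(buffers[s], i)
        done.add((s, t))
        executed.append((s, t))
    return buffers[stages], executed


if __name__ == "__main__":
    data = [float(i * i % 7) for i in range(16)]
    for tile in (1, 4, 16):
        out, order = run_pipeline(data, tile)
        print(f"tile={tile:2d} tasks={len(order):2d} first stage-1 task at step "
              f"{next(k for k, (s, _) in enumerate(order) if s == 1)}")
```

With a tile as large as the whole array, the second stage cannot start until the first stage has finished entirely; with small tiles it can start after only a few first-stage tasks, at the cost of many more tasks to schedule. That trade-off is the granularity choice the abstract refers to.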

Published In

PACT '20: Proceedings of the ACM International Conference on Parallel Architectures and Compilation Techniques
September 2020
505 pages
ISBN: 9781450380751
DOI: 10.1145/3410463
General Chair: Vivek Sarkar
Program Chair: Hyesoon Kim

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. gpu
  2. optimizations
  3. programming framework

Qualifiers

  • Research-article

Funding Sources

  • Beijing Academy of Artificial Intelligence
  • National Research Foundation of Korea
  • Beijing Natural Science Foundation
  • National Key R&D Program of China

Conference

PACT '20

Acceptance Rates

Overall Acceptance Rate 121 of 471 submissions, 26%

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months): 27
  • Downloads (Last 6 weeks): 2
Reflects downloads up to 02 Feb 2025

Citations

Cited By

  • (2024) A Memory-Efficient Hybrid Parallel Framework for Deep Neural Network Training. IEEE Transactions on Parallel and Distributed Systems 35:4 (577-591). DOI: 10.1109/TPDS.2023.3343570. Online publication date: Apr-2024
  • (2024) Adaptive Auto-Tuning Framework for Global Exploration of Stencil Optimization on GPUs. IEEE Transactions on Parallel and Distributed Systems 35:1 (20-33). DOI: 10.1109/TPDS.2023.3325630. Online publication date: Jan-2024
  • (2023) RECom: A Compiler Approach to Accelerating Recommendation Model Inference with Massive Embedding Columns. Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 4 (268-286). DOI: 10.1145/3623278.3624761. Online publication date: 25-Mar-2023
  • (2023) EPSILOD: efficient parallel skeleton for generic iterative stencil computations in distributed GPUs. The Journal of Supercomputing 79:9 (9409-9442). DOI: 10.1007/s11227-022-05040-y. Online publication date: 14-Jan-2023
  • (2023) Hybridhadoop: CPU-GPU hybrid scheduling in hadoop. Cluster Computing 27:3 (3875-3892). DOI: 10.1007/s10586-023-04178-5. Online publication date: 21-Nov-2023
  • (2022) Toward accelerated stencil computation by adapting tensor core unit on GPU. Proceedings of the 36th ACM International Conference on Supercomputing (1-12). DOI: 10.1145/3524059.3532392. Online publication date: 28-Jun-2022
  • (2021) Hippie: A Data-Paralleled Pipeline Approach to Improve Memory-Efficiency and Scalability for Large DNN Training. Proceedings of the 50th International Conference on Parallel Processing (1-10). DOI: 10.1145/3472456.3472497. Online publication date: 9-Aug-2021
  • (2021) cuZ-Checker: A GPU-Based Ultra-Fast Assessment System for Lossy Compressions. 2021 IEEE International Conference on Cluster Computing (CLUSTER) (307-319). DOI: 10.1109/Cluster48925.2021.00065. Online publication date: Sep-2021
  • (2021) csTuner: Scalable Auto-tuning Framework for Complex Stencil Computation on GPUs. 2021 IEEE International Conference on Cluster Computing (CLUSTER) (192-203). DOI: 10.1109/Cluster48925.2021.00037. Online publication date: Sep-2021
