
Juggler: a dependence-aware task-based execution framework for GPUs

Published: 10 February 2018

Abstract

Scientific applications with single instruction, multiple data (SIMD) computations show considerable performance improvements when run on today's graphics processing units (GPUs). However, data dependences across thread blocks may significantly limit the speedup by requiring global synchronization across the streaming multiprocessors (SMs) inside the GPU. To run applications with inter-block data dependences efficiently, we need fine-grained task-based execution models that treat the SMs inside a GPU as stand-alone parallel processing units. Such a scheme enables faster execution by utilizing all internal computation elements inside the GPU and eliminating unnecessary waits at device-wide global barriers.
In this paper, we propose Juggler, a task-based execution scheme for GPU workloads with data dependences. The Juggler framework takes applications embedding OpenMP 4.5 tasks as input and executes them on the GPU via an efficient in-device runtime, eliminating the need for kernel-wide global synchronization. Juggler requires little or no modification to the source code, and once launched, the runtime runs entirely on the GPU without relying on the host at any point during execution. We have evaluated Juggler on an NVIDIA Tesla P100 GPU and obtained up to 31% performance improvement over a global-barrier-based implementation, with minimal runtime overhead.
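To make the idea of tasks with data dependences concrete, the following is a minimal sketch of a wavefront computation written with standard OpenMP 4.5 `task` and `depend` clauses. It is ordinary host-side OpenMP, not Juggler's in-device runtime, and the exact input form Juggler accepts is not described in this abstract. The names `grid`, `MAX_N`, and `wavefront_tasks` are hypothetical and chosen only for illustration. Each cell depends on its upper and left neighbours, so a cell can start as soon as those two cells finish, with no barrier between anti-diagonals.

```c
#include <assert.h>

#define MAX_N 16                /* hypothetical upper bound on the grid size */

long grid[MAX_N][MAX_N];        /* shared by all tasks */

/* Illustrative only: fills an n x n grid where every border cell is 1 and
 * every interior cell is the sum of its upper and left neighbours.
 * Each cell is one OpenMP task; the depend clauses encode the
 * cell-to-cell data dependences, so no global barrier separates
 * successive anti-diagonals. Returns the bottom-right cell. */
long wavefront_tasks(int n)
{
    assert(n >= 1 && n <= MAX_N);

    #pragma omp parallel
    #pragma omp single
    {
        /* One thread creates the tasks; any thread in the team may run them. */
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                if (i == 0 || j == 0) {
                    /* Border cells have no predecessors. */
                    #pragma omp task depend(out: grid[i][j])
                    grid[i][j] = 1;
                } else {
                    /* Runs only after the upper and left cells are written. */
                    #pragma omp task depend(in: grid[i - 1][j], grid[i][j - 1]) depend(out: grid[i][j])
                    grid[i][j] = grid[i - 1][j] + grid[i][j - 1];
                }
            }
        }
    }   /* Implicit barrier: all tasks have completed here. */

    return grid[n - 1][n - 1];
}
```

Calling `wavefront_tasks(5)` fills a 5 x 5 grid. With GCC or Clang, compiling with `-fopenmp` enables the tasks; without that flag the pragmas are ignored and the loops run serially, producing the same values. A barrier-based version would instead process one anti-diagonal at a time and synchronize all workers after each one, which is the kind of wait the abstract describes as unnecessary.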

Supplementary Material

Artifacts Available (juggler-master-9536993d76c2dc0639ebedd3cf4db8e0440734f2.zip)
This file is a snapshot of the Juggler repository located at the following address:
Please refer to the repository for the most up-to-date version of the project.

Published In

ACM SIGPLAN Notices, Volume 53, Issue 1
PPoPP '18
January 2018
426 pages
ISSN: 0362-1340
EISSN: 1558-1160
DOI: 10.1145/3200691

PPoPP '18: Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
February 2018
442 pages
ISBN: 9781450349826
DOI: 10.1145/3178487

Publication rights licensed to ACM. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. GP-GPU programming
  2. data dependence
  3. OpenMP 4.5
  4. task-based execution

Qualifiers

  • Research-article

Funding Sources

  • U.S. Department of Energy
  • NSF
