A paradigm shift in GP-GPU computing: task based execution of applications with dynamic data dependencies

Published: 23 June 2014

Abstract

General-purpose computing on GPUs has become increasingly popular over the last decade. Scientific applications with SIMD computation characteristics show considerable performance improvements when run on these massively parallel architectures. However, data dependencies across thread blocks significantly limit the degree of achievable parallelism by requiring global synchronization across the streaming multiprocessors (SMs) inside the GPU.
In order to run applications with inter-block data dependencies efficiently, we need fine-grained `task-based execution models' that treat the SMs inside the GPU as stand-alone parallel processing units. Such a scheme enables efficient execution by utilizing all computation elements inside the GPU and eliminating unnecessary waits during global barriers.
In this paper, we propose a new, dynamic, `all-in-GPU' task-execution framework for executing both regular and irregular data-dependent applications on GPUs. Our runtime eliminates the need for global synchronization and minimizes inter-SM communication through distributed queues. In preliminary experiments run on a Tesla C2050 GPU, we obtained up to 62% more speedup compared to a centralized-queue approach, and the overhead of the system was measured to be as low as 5%.
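The distributed-queue idea in the abstract can be illustrated with a small CUDA sketch. This is a hypothetical illustration, not the authors' implementation: one persistent thread block runs per SM, each block drains its own task queue using atomics and block-local synchronization only, so no grid-wide barrier is ever needed between tasks. The `Task`, `Queue`, and `worker` names are invented for this sketch.

```cuda
#include <cuda_runtime.h>

struct Task { int data; };                  // placeholder payload

struct Queue {
    Task* tasks;                            // this block's slot array
    int   head;                             // next index to claim
    int   tail;                             // one past the last enqueued task
};

// One persistent block per SM; each block drains its own distributed queue.
__global__ void worker(Queue* queues, int* results)
{
    __shared__ int idx;                     // task index claimed by this block
    Queue* q = &queues[blockIdx.x];
    while (true) {
        if (threadIdx.x == 0)
            idx = atomicAdd(&q->head, 1);   // claim the next task slot
        __syncthreads();                    // block-local sync only, never grid-wide
        if (idx >= q->tail) break;          // this block's queue is drained
        Task t = q->tasks[idx];
        // ... all threads of the block cooperate on task t ...
        if (threadIdx.x == 0)
            results[idx] = t.data * 2;      // stand-in for real work
        __syncthreads();                    // keep idx stable until all threads are done
    }
}
```

A host would launch roughly one block per SM, e.g. `worker<<<numSMs, 256>>>(d_queues, d_results);`. Keeping `head` private to each block is what removes the contention that a single centralized queue would concentrate on one atomic counter.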



Published In

DIDC '14: Proceedings of the sixth international workshop on Data intensive distributed computing
June 2014
62 pages
ISBN:9781450329132
DOI:10.1145/2608020
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. data dependency
  2. gp-gpu computing
  3. task scheduling

Qualifiers

  • Research-article

Conference

HPDC'14
Acceptance Rates

DIDC '14 paper acceptance rate: 7 of 12 submissions (58%).
Overall acceptance rate: 7 of 12 submissions (58%).


Cited By

  • (2024) X-TED: Massive Parallelization of Tree Edit Distance. Proceedings of the VLDB Endowment, 17(7):1683-1696. DOI: 10.14778/3654621.3654634
  • (2017) Wireframe. Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture, pages 600-611. DOI: 10.1145/3123939.3123976
  • (2017) MERCATOR: A GPGPU Framework for Irregular Streaming Applications. 2017 International Conference on High Performance Computing & Simulation (HPCS), pages 727-736. DOI: 10.1109/HPCS.2017.111
  • (2016) Automatically Exploiting Implicit Pipeline Parallelism from Multiple Dependent Kernels for GPUs. Proceedings of the 2016 International Conference on Parallel Architectures and Compilation, pages 341-352. DOI: 10.1145/2967938.2967952