A paradigm shift in GP-GPU computing: task based execution of applications with dynamic data dependencies

Published: 23 June 2014

Abstract

General-purpose computing on GPUs has become increasingly popular over the last decade. Scientific applications with SIMD computation characteristics show considerable performance improvements when run on these massively parallel architectures. However, data dependencies across thread blocks significantly limit the degree of achievable parallelism by requiring global synchronization across the streaming multiprocessors (SMs) inside the GPU.
In order to run applications with inter-block data dependencies efficiently, we need fine-grained `task-based execution models' that treat the SMs inside the GPU as stand-alone parallel processing units. Such a scheme enables efficient execution by utilizing all computation elements inside the GPU and eliminating unnecessary waits during global barriers.
In this paper, we propose a new, dynamic, `all-in-GPU' task-execution framework for executing both regular and irregular data-dependent applications on GPUs. Our runtime eliminates the need for global synchronization and minimizes inter-SM communication through distributed queues. In preliminary experiments run on a Tesla C2050 GPU, we obtained up to 62% more speedup compared to a centralized-queue approach, and the overhead of the system was measured to be as low as 5%.
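The distributed-queue idea in the abstract can be illustrated with a small CUDA sketch. This is a hypothetical illustration, not the authors' implementation: one persistent thread block runs per SM, each block drains its own task queue using atomics and block-local synchronization only, so no grid-wide barrier is ever needed between tasks. The `Task`, `Queue`, and `worker` names are invented for this sketch.

```cuda
#include <cuda_runtime.h>

struct Task { int data; };                  // placeholder payload

struct Queue {
    Task* tasks;                            // this block's slot array
    int   head;                             // next index to claim
    int   tail;                             // one past the last enqueued task
};

// One persistent block per SM; each block drains its own distributed queue.
__global__ void worker(Queue* queues, int* results)
{
    __shared__ int idx;                     // task index claimed by this block
    Queue* q = &queues[blockIdx.x];
    while (true) {
        if (threadIdx.x == 0)
            idx = atomicAdd(&q->head, 1);   // claim the next task slot
        __syncthreads();                    // block-local sync only, never grid-wide
        if (idx >= q->tail) break;          // this block's queue is drained
        Task t = q->tasks[idx];
        // ... all threads of the block cooperate on task t ...
        if (threadIdx.x == 0)
            results[idx] = t.data * 2;      // stand-in for real work
        __syncthreads();                    // keep idx stable until all threads are done
    }
}
```

A host would launch roughly one block per SM, e.g. `worker<<<numSMs, 256>>>(d_queues, d_results);`. Keeping `head` private to each block is what removes the contention that a single centralized queue would concentrate on one atomic counter.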



Published In

DIDC '14: Proceedings of the sixth international workshop on Data intensive distributed computing
June 2014
62 pages
ISBN:9781450329132
DOI:10.1145/2608020
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. data dependency
  2. gp-gpu computing
  3. task scheduling

Qualifiers

  • Research-article

Conference

HPDC'14
Acceptance Rates

DIDC '14 paper acceptance rate: 7 of 12 submissions (58%).
Overall acceptance rate: 7 of 12 submissions (58%).


Cited By

  • (2024) X-TED: Massive Parallelization of Tree Edit Distance. Proceedings of the VLDB Endowment, 17(7):1683-1696. DOI: 10.14778/3654621.3654634
  • (2017) Wireframe. Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture, pages 600-611. DOI: 10.1145/3123939.3123976
  • (2017) MERCATOR: A GPGPU Framework for Irregular Streaming Applications. 2017 International Conference on High Performance Computing & Simulation (HPCS), pages 727-736. DOI: 10.1109/HPCS.2017.111
  • (2016) Automatically Exploiting Implicit Pipeline Parallelism from Multiple Dependent Kernels for GPUs. Proceedings of the 2016 International Conference on Parallel Architectures and Compilation, pages 341-352. DOI: 10.1145/2967938.2967952