research-article

Scalable framework for mapping streaming applications onto multi-GPU systems

Authors:

Huynh Phung Huynh,

Andrei Hagiescu,

Rick Siow Mong Goh

ACM SIGPLAN Notices, Volume 47, Issue 8

Pages 1 - 10

https://doi.org/10.1145/2370036.2145818

Published: 25 February 2012

Abstract

Graphics processing units (GPUs) leverage a large array of parallel processing cores to boost the performance of a specific streaming computation pattern frequently found in graphics applications. Unfortunately, while many other general-purpose applications exhibit the required streaming behavior, they also have unfavorable data layouts and poor computation-to-communication ratios that penalize any straightforward execution on the GPU. In this paper, we describe an efficient and scalable code generation framework that maps general-purpose streaming applications onto a multi-GPU system. The framework spans the entire core and memory hierarchy exposed by the multi-GPU system. Several key features ensure the scalability required by complex streaming applications. First, we propose an efficient stream graph partitioning algorithm that partitions a complex application to achieve the best performance under a given shared memory constraint. Next, the resulting partitions are mapped to multiple GPUs using an efficient architecture-driven strategy that balances the workload while accounting for communication overhead. Finally, the partitions are executed on the multi-GPU system in a highly effective pipelined fashion. The framework has been implemented as a back-end of the StreamIt programming language compiler. Our comprehensive experiments show its scalability and a significant performance speedup compared with a previous state-of-the-art solution.
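To make the first two steps more concrete, here is a minimal sketch in Python. It is not the algorithm from the paper, whose details are not given in the abstract. It assumes a purely linear stream graph, a greedy grouping of consecutive filters under a shared memory budget, and a longest-processing-time-first assignment of partitions to GPUs that ignores communication overhead. The names `Filter`, `partition_pipeline`, `map_to_gpus`, and all numbers are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Filter:
    """One node of a linear stream graph (hypothetical cost model)."""
    name: str
    work: int        # estimated work per steady-state iteration
    smem_bytes: int  # shared memory needed for the filter's buffers


def partition_pipeline(filters: List[Filter], smem_limit: int) -> List[List[Filter]]:
    """Greedily group consecutive filters so each group fits in smem_limit."""
    partitions: List[List[Filter]] = []
    current: List[Filter] = []
    used = 0
    for f in filters:
        if f.smem_bytes > smem_limit:
            raise ValueError(f"filter {f.name} alone exceeds the shared memory limit")
        # Close the current partition when adding f would exceed the budget.
        if current and used + f.smem_bytes > smem_limit:
            partitions.append(current)
            current, used = [], 0
        current.append(f)
        used += f.smem_bytes
    if current:
        partitions.append(current)
    return partitions


def map_to_gpus(partitions: List[List[Filter]], num_gpus: int) -> Tuple[List[List[int]], List[int]]:
    """Assign partitions to GPUs, heaviest first, always to the least-loaded GPU.

    Assumption: communication cost between GPUs is ignored here, whereas the
    framework described above takes it into account.
    """
    work = [sum(f.work for f in p) for p in partitions]
    loads = [0] * num_gpus
    assignment: List[List[int]] = [[] for _ in range(num_gpus)]
    for i in sorted(range(len(partitions)), key=lambda i: -work[i]):
        g = min(range(num_gpus), key=lambda g: loads[g])
        assignment[g].append(i)
        loads[g] += work[i]
    return assignment, loads


if __name__ == "__main__":
    pipeline = [
        Filter("src", 2, 4096),
        Filter("fir", 8, 8192),
        Filter("fft", 16, 12288),
        Filter("mix", 4, 4096),
        Filter("sink", 2, 2048),
    ]
    parts = partition_pipeline(pipeline, smem_limit=16384)
    assignment, loads = map_to_gpus(parts, num_gpus=2)
    print([[f.name for f in p] for p in parts])
    print(assignment, loads)
```

A real stream graph also contains split-join and feedback structures, and a realistic mapper would weigh the data transferred between partitions placed on different GPUs, so this sketch only conveys the shape of the problem.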


Index Terms

Scalable framework for mapping streaming applications onto multi-GPU systems
1. Software and its engineering
  1. Software notations and tools
    1. Compilers
      1. Source code generation

Published In

ACM SIGPLAN Notices Volume 47, Issue 8

PPOPP '12

August 2012

334 pages

ISSN:0362-1340

EISSN:1558-1160

DOI:10.1145/2370036

PPoPP '12: Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming
February 2012
352 pages
ISBN:9781450311601
DOI:10.1145/2145816
General Chair: J. Ramanujam, Louisiana State University, USA
Program Chair: P. Sadayappan, The Ohio State University, USA

Copyright © 2012 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Bibliometrics

Total Citations: 41
Total Downloads: 872
Downloads (Last 12 months): 18
Downloads (Last 6 weeks): 1

Reflects downloads up to 04 Oct 2024

Cited By

Thoman P, Tischler F, Salzmann P, Fahringer T (2022). The Celerity High-level API: C++20 for Accelerator Clusters. International Journal of Parallel Programming 50:3-4 (341-359). DOI: 10.1007/s10766-022-00731-8. Online publication date: 1-Aug-2022
https://dl.acm.org/doi/10.1007/s10766-022-00731-8
Jaroš M, Říha L, Strakoš P, Špeťko M (2021). GPU Accelerated Path Tracing of Massive Scenes. ACM Transactions on Graphics 40:2 (1-17). DOI: 10.1145/3447807. Online publication date: 27-Apr-2021
https://dl.acm.org/doi/10.1145/3447807
Tanaka M, Taura K, Hanawa T, Torisawa K (2021). Automatic Graph Partitioning for Very Large-scale Deep Learning. 2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS) (1004-1013). DOI: 10.1109/IPDPS49936.2021.00109. Online publication date: May-2021
https://doi.org/10.1109/IPDPS49936.2021.00109
Cedersjö G, Janneck J (2019). Tÿcho. ACM Transactions on Embedded Computing Systems 18:6 (1-25). DOI: 10.1145/3362692. Online publication date: 14-Dec-2019
https://dl.acm.org/doi/10.1145/3362692
Jin Z, Finkel H (2018). Performance-oriented Optimizations for OpenCL Streaming Kernels on the FPGA. Proceedings of the International Workshop on OpenCL (1-8). DOI: 10.1145/3204919.3204920. Online publication date: 14-May-2018
https://dl.acm.org/doi/10.1145/3204919.3204920
Ferrão P, Marques H, Paulino H (2018). Stream Processing on Hybrid CPU/Intel® Xeon Phi™ Systems. Euro-Par 2018: Parallel Processing (796-810). DOI: 10.1007/978-3-319-96983-1_56. Online publication date: 1-Aug-2018
https://doi.org/10.1007/978-3-319-96983-1_56
Belviranli M, Khorasani F, Bhuyan L, Gupta R (2016). CuMAS. Proceedings of the 2016 International Conference on Supercomputing (1-12). DOI: 10.1145/2925426.2926271. Online publication date: 1-Jun-2016
https://dl.acm.org/doi/10.1145/2925426.2926271
Nguyen D, Lee J, Franke B, Wu Y, Rastello F (2016). Communication-aware mapping of stream graphs for multi-GPU platforms. Proceedings of the 2016 International Symposium on Code Generation and Optimization (94-104). DOI: 10.1145/2854038.2854055. Online publication date: 29-Feb-2016
https://dl.acm.org/doi/10.1145/2854038.2854055
Wang G, Wada K, Yamagiwa S (2016). Performance Evaluation of Parallelizing Algorithm Using Spanning Tree for Stream-Based Computing. 2016 Fourth International Symposium on Computing and Networking (CANDAR) (497-503). DOI: 10.1109/CANDAR.2016.0092. Online publication date: Nov-2016
https://doi.org/10.1109/CANDAR.2016.0092
Ma L, Chamberlain R, Agrawal K (2014). Performance modeling for highly-threaded many-core GPUs. 2014 IEEE 25th International Conference on Application-Specific Systems, Architectures and Processors (84-91). DOI: 10.1109/ASAP.2014.6868641. Online publication date: Jun-2014
https://doi.org/10.1109/ASAP.2014.6868641
