research-article

Scalable framework for mapping streaming applications onto multi-GPU systems

Published: 25 February 2012

Abstract

Graphics processing units leverage a large array of parallel processing cores to boost the performance of a specific streaming computation pattern frequently found in graphics applications. Unfortunately, while many other general-purpose applications do exhibit the required streaming behavior, they also have unfavorable data layouts and poor computation-to-communication ratios that penalize any straightforward execution on the GPU. In this paper, we describe an efficient and scalable code generation framework that can map general-purpose streaming applications onto a multi-GPU system. This framework spans the entire core and memory hierarchy exposed by the multi-GPU system. Several key features in our framework ensure the scalability required by complex streaming applications. First, we propose an efficient stream graph partitioning algorithm that partitions the complex application to achieve the best performance under a given shared memory constraint. Next, the resulting partitions are mapped to multiple GPUs using an efficient architecture-driven strategy. The mapping balances the workload while considering the communication overhead. Finally, the partitions are executed on the multi-GPU system in a highly effective pipelined fashion. The framework has been implemented as a back-end of the StreamIt programming language compiler. Our comprehensive experiments show its scalability and a significant performance speedup compared with a previous state-of-the-art solution.
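
The first two steps named in the abstract (partitioning under a shared memory constraint, then balancing partitions across GPUs) can be illustrated with a deliberately simplified sketch. This is not the paper's algorithm: it assumes a linear chain of filters rather than a general stream graph, uses a greedy grouping instead of a performance-driven partitioner, and ignores communication overhead when mapping. The names `Filter`, `partition_chain`, and `map_to_gpus`, as well as the cost and buffer figures, are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Filter:
    """Hypothetical stream filter: a name, a work estimate, and a buffer size."""
    name: str
    work: int          # assumed compute cost per steady-state iteration
    buffer_bytes: int  # assumed shared memory needed by the filter's buffers


def partition_chain(filters, shared_mem_limit):
    """Greedily group consecutive filters so each group fits in shared memory."""
    partitions, current, used = [], [], 0
    for f in filters:
        if f.buffer_bytes > shared_mem_limit:
            raise ValueError(f"{f.name} alone exceeds the shared memory limit")
        # Close the current partition if adding this filter would overflow it.
        if current and used + f.buffer_bytes > shared_mem_limit:
            partitions.append(current)
            current, used = [], 0
        current.append(f)
        used += f.buffer_bytes
    if current:
        partitions.append(current)
    return partitions


def map_to_gpus(partitions, num_gpus):
    """Place partitions heaviest-first, each on the currently least-loaded GPU.

    Communication cost between partitions is ignored in this sketch.
    """
    loads = [0] * num_gpus
    assignment = {}
    costs = [sum(f.work for f in part) for part in partitions]
    for i in sorted(range(len(partitions)), key=lambda i: -costs[i]):
        gpu = min(range(num_gpus), key=lambda g: loads[g])
        assignment[i] = gpu
        loads[gpu] += costs[i]
    return assignment, loads


# Usage with made-up numbers: four filters, 16 KiB of shared memory, two GPUs.
chain = [
    Filter("source", 2, 4096),
    Filter("fir", 8, 8192),
    Filter("fft", 10, 12288),
    Filter("sink", 1, 2048),
]
parts = partition_chain(chain, shared_mem_limit=16384)
assignment, loads = map_to_gpus(parts, num_gpus=2)
```

A real mapper would also weigh the data transferred between partitions placed on different GPUs, which is the communication overhead the abstract refers to; the greedy balance here only captures the workload side of that trade-off.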

Cited By

  • (2022) The Celerity High-level API: C++20 for Accelerator Clusters. International Journal of Parallel Programming 50:3-4 (341-359). DOI: 10.1007/s10766-022-00731-8. Online publication date: 1-Aug-2022
  • (2021) GPU Accelerated Path Tracing of Massive Scenes. ACM Transactions on Graphics 40:2 (1-17). DOI: 10.1145/3447807. Online publication date: 27-Apr-2021
  • (2021) Automatic Graph Partitioning for Very Large-scale Deep Learning. 2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS) (1004-1013). DOI: 10.1109/IPDPS49936.2021.00109. Online publication date: May-2021

    Published In

    ACM SIGPLAN Notices  Volume 47, Issue 8
    PPOPP '12
    August 2012
    334 pages
    ISSN:0362-1340
    EISSN:1558-1160
    DOI:10.1145/2370036
      PPoPP '12: Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming
      February 2012
      352 pages
      ISBN:9781450311601
      DOI:10.1145/2145816
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 25 February 2012
    Published in SIGPLAN Volume 47, Issue 8

    Author Tags

    1. multi-GPU
    2. scalable
    3. streaming
    4. streamit

    Qualifiers

    • Research-article

    Cited By

    • (2019) Tÿcho. ACM Transactions on Embedded Computing Systems 18:6 (1-25). DOI: 10.1145/3362692. Online publication date: 14-Dec-2019
    • (2018) Performance-oriented Optimizations for OpenCL Streaming Kernels on the FPGA. Proceedings of the International Workshop on OpenCL (1-8). DOI: 10.1145/3204919.3204920. Online publication date: 14-May-2018
    • (2018) Stream Processing on Hybrid CPU/Intel® Xeon Phi™ Systems. Euro-Par 2018: Parallel Processing (796-810). DOI: 10.1007/978-3-319-96983-1_56. Online publication date: 1-Aug-2018
    • (2016) CuMAS. Proceedings of the 2016 International Conference on Supercomputing (1-12). DOI: 10.1145/2925426.2926271. Online publication date: 1-Jun-2016
    • (2016) Communication-aware mapping of stream graphs for multi-GPU platforms. Proceedings of the 2016 International Symposium on Code Generation and Optimization (94-104). DOI: 10.1145/2854038.2854055. Online publication date: 29-Feb-2016
    • (2016) Performance Evaluation of Parallelizing Algorithm Using Spanning Tree for Stream-Based Computing. 2016 Fourth International Symposium on Computing and Networking (CANDAR) (497-503). DOI: 10.1109/CANDAR.2016.0092. Online publication date: Nov-2016
    • (2014) Performance modeling for highly-threaded many-core GPUs. 2014 IEEE 25th International Conference on Application-Specific Systems, Architectures and Processors (84-91). DOI: 10.1109/ASAP.2014.6868641. Online publication date: Jun-2014