DOI: 10.1145/1950365.1950409

Sponge: portable stream programming on graphics engines

Published: 05 March 2011

Abstract

Graphics processing units (GPUs) provide a low cost platform for accelerating high performance computations. The introduction of new programming languages, such as CUDA and OpenCL, makes GPU programming attractive to a wide variety of programmers. However, programming GPUs is still a cumbersome task for two primary reasons: tedious performance optimizations and lack of portability. First, optimizing an algorithm for a specific GPU is a time-consuming task that requires a thorough understanding of both the algorithm and the underlying hardware. Unoptimized CUDA programs typically only achieve a small fraction of the peak GPU performance. Second, GPU code lacks efficient portability as code written for one GPU can be inefficient when executed on another. Moving code from one GPU to another while maintaining the desired performance is a non-trivial task often requiring significant modifications to account for the hardware differences. In this work, we propose Sponge, a compilation framework for GPUs using synchronous data flow streaming languages. Sponge is capable of performing a wide variety of optimizations to generate efficient code for graphics engines. Sponge alleviates the problems associated with current GPU programming methods by providing portability across different generations of GPUs and CPUs, and a better abstraction of the hardware details, such as the memory hierarchy and threading model. Using streaming, we provide a write-once software paradigm and rely on the compiler to automatically create optimized CUDA code for a wide variety of GPU targets. Sponge's compiler optimizations improve the performance of the baseline CUDA implementations by an average of 3.2x.
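
To make the write-once streaming model concrete, the following is a minimal sketch (not taken from the paper) of the kind of data-parallel CUDA kernel a stream compiler in the spirit of Sponge might emit for a single stateless filter that pops one element and pushes one element per firing. The filter name, gain parameter, and launch configuration below are illustrative assumptions, not the paper's actual generated code.

    // Hypothetical sketch: data-parallel CUDA code for a stateless stream
    // filter (a simple gain stage). Each thread consumes one input element
    // and produces one output element.
    #include <cstdio>
    #include <cstdlib>
    #include <cuda_runtime.h>

    __global__ void gain_filter(const float *in, float *out, float gain, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // one stream element per thread
        if (i < n)
            out[i] = gain * in[i];                      // the filter's work function
    }

    int main() {
        const int n = 1 << 20;
        const size_t bytes = n * sizeof(float);

        float *h_in = (float *)malloc(bytes), *h_out = (float *)malloc(bytes);
        for (int i = 0; i < n; ++i) h_in[i] = (float)i;

        float *d_in, *d_out;
        cudaMalloc((void **)&d_in, bytes);
        cudaMalloc((void **)&d_out, bytes);
        cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);

        // The block size and grid shape are the kind of target-specific
        // parameters a portability-oriented compiler would pick per GPU.
        int threads = 256;
        int blocks = (n + threads - 1) / threads;
        gain_filter<<<blocks, threads>>>(d_in, d_out, 2.0f, n);
        cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);

        printf("out[42] = %f\n", h_out[42]);
        cudaFree(d_in); cudaFree(d_out); free(h_in); free(h_out);
        return 0;
    }

In the streaming source, the programmer would write only the filter's work function once; decisions such as the thread-to-element mapping, buffer placement in the memory hierarchy, and the launch configuration are what an optimizing stream compiler adjusts when retargeting the same program to a different GPU.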




    Published In

    ASPLOS XVI: Proceedings of the Sixteenth International Conference on Architectural Support for Programming Languages and Operating Systems, March 2011, 432 pages. ISBN 9781450302661. DOI: 10.1145/1950365
    Also published in:
    • ACM SIGARCH Computer Architecture News, Volume 39, Issue 1 (ASPLOS '11), March 2011, 407 pages. ISSN 0163-5964. DOI: 10.1145/1961295
    • ACM SIGPLAN Notices, Volume 46, Issue 3 (ASPLOS '11), March 2011, 407 pages. ISSN 0362-1340, EISSN 1558-1160. DOI: 10.1145/1961296
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery, New York, NY, United States



    Author Tags

    1. compiler
    2. gpu
    3. optimization
    4. portability
    5. streaming

    Qualifiers

    • Research-article

    Conference

    ASPLOS'11

    Acceptance Rates

    Overall Acceptance Rate 535 of 2,713 submissions, 20%

    Article Metrics

    • Downloads (Last 12 months): 9
    • Downloads (Last 6 weeks): 0
    Reflects downloads up to 01 Sep 2024

    Cited By

    • (2022) High-Level Stream and Data Parallelism in C++ for GPUs. Proceedings of the XXVI Brazilian Symposium on Programming Languages, pages 41-49. DOI: 10.1145/3561320.3561327. Online publication date: 6-Oct-2022.
    • (2019) Incremental flattening for nested data parallelism. Proceedings of the 24th Symposium on Principles and Practice of Parallel Programming, pages 53-67. DOI: 10.1145/3293883.3295707. Online publication date: 16-Feb-2019.
    • (2017) Futhark: purely functional GPU-programming with nested parallelism and in-place array updates. ACM SIGPLAN Notices, 52(6):556-571. DOI: 10.1145/3140587.3062354. Online publication date: 14-Jun-2017.
    • (2017) Accelerating Mobile Audio Sensing Algorithms through On-Chip GPU Offloading. Proceedings of the 15th Annual International Conference on Mobile Systems, Applications, and Services, pages 306-318. DOI: 10.1145/3081333.3081358. Online publication date: 16-Jun-2017.
    • (2017) Futhark: purely functional GPU-programming with nested parallelism and in-place array updates. Proceedings of the 38th ACM SIGPLAN Conference on Programming Language Design and Implementation, pages 556-571. DOI: 10.1145/3062341.3062354. Online publication date: 14-Jun-2017.
    • (2017) SafeGPU: Contract- and library-based GPGPU for object-oriented languages. Computer Languages, Systems & Structures, 48:68-88. DOI: 10.1016/j.cl.2016.08.002. Online publication date: Jun-2017.
    • (2016) Zorua. The 49th Annual IEEE/ACM International Symposium on Microarchitecture, pages 1-14. DOI: 10.5555/3195638.3195656. Online publication date: 15-Oct-2016.
    • (2016) A Design Framework for Mapping Vectorized Synchronous Dataflow Graphs onto CPU-GPU Platforms. Proceedings of the 19th International Workshop on Software and Compilers for Embedded Systems, pages 20-29. DOI: 10.1145/2906363.2906374. Online publication date: 23-May-2016.
    • (2016) gpucc: an open-source GPGPU compiler. Proceedings of the 2016 International Symposium on Code Generation and Optimization, pages 105-116. DOI: 10.1145/2854038.2854041. Online publication date: 29-Feb-2016.
    • (2016) Zorua: A holistic approach to resource virtualization in GPUs. 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pages 1-14. DOI: 10.1109/MICRO.2016.7783718. Online publication date: Oct-2016.