research-article

Synergistic execution of stream programs on multicores with accelerators

Authors:

Abhishek Udupa,

R. Govindarajan,

Matthew J. Thazhuthaveetil

ACM SIGPLAN Notices, Volume 44, Issue 7

Pages 99 - 108

https://doi.org/10.1145/1543136.1542466

Published: 19 June 2009

Abstract

The StreamIt programming model has been proposed to exploit parallelism in streaming applications on general purpose multicore architectures. The StreamIt graphs describe task, data and pipeline parallelism which can be exploited on accelerators such as Graphics Processing Units (GPUs) or CellBE which support abundant parallelism in hardware.

In this paper, we describe a novel method to orchestrate the execution of a StreamIt program on a multicore platform equipped with an accelerator. The proposed approach uses profiling to identify the relative benefits of executing a task on the superscalar CPU cores and on the accelerator. We formulate the problem of partitioning the work between the CPU cores and the GPU, taking into account the latencies of data transfers and the buffer layout transformations that the partitioning requires, as an integrated Integer Linear Program (ILP), which can then be solved by an ILP solver. We also propose an efficient heuristic algorithm for partitioning the work between the CPU and the GPU; its solutions are within 9.05% of the optimal solution on average across the benchmark suite. The partitioned tasks are then software pipelined to execute on the multiple CPU cores and the Streaming Multiprocessors (SMs) of the GPU. The software pipelining algorithm orchestrates the execution between the CPU cores and the GPU by emitting the code for the CPU and the GPU, and the code for the required data transfers. Our experiments on a platform with 8 CPU cores and a GeForce 8800 GTS 512 GPU show a geometric mean speedup of 6.84X, with a maximum of 51.96X, over a single-threaded CPU execution across the StreamIt benchmarks. This is an 18.9% improvement over a partitioning strategy that maps onto the CPU only the filters that cannot be executed on the GPU (the filters with state that is persistent across firings).
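To make the partitioning problem concrete, the following is a minimal sketch of a toy version of it. It is not the paper's ILP formulation or its heuristic: the cost model (load of the busier device plus a flat charge per edge that crosses devices), the exhaustive search, the function name `partition`, and all filter names and timings are assumptions chosen for illustration. The only idea taken from the abstract is that profiled CPU and GPU times, data transfer costs, and the CPU-only constraint on stateful filters all enter the decision.

```python
from itertools import product


def partition(filters, edges, transfer_cost):
    """Exhaustively assign each filter to 'cpu' or 'gpu' (toy model).

    filters: dict name -> (cpu_time, gpu_time, stateful), hypothetical profile data
    edges: list of (producer, consumer) pairs in the stream graph
    transfer_cost: flat time charged for every edge that crosses devices
    Returns (best_cost, assignment).
    """
    names = sorted(filters)
    best = None
    for choice in product(("cpu", "gpu"), repeat=len(names)):
        assign = dict(zip(names, choice))
        # Assumption: stateful filters are pinned to the CPU.
        if any(filters[n][2] and assign[n] == "gpu" for n in names):
            continue
        cpu = sum(filters[n][0] for n in names if assign[n] == "cpu")
        gpu = sum(filters[n][1] for n in names if assign[n] == "gpu")
        crossing = sum(assign[a] != assign[b] for a, b in edges)
        # Assumed objective: busier device plus total transfer time.
        cost = max(cpu, gpu) + transfer_cost * crossing
        if best is None or cost < best[0]:
            best = (cost, assign)
    return best


# Hypothetical four-filter pipeline: (cpu_time, gpu_time, stateful).
example_filters = {
    "src": (2, 5, True),
    "fir": (10, 1, False),
    "fft": (8, 1, False),
    "sink": (1, 4, True),
}
example_edges = [("src", "fir"), ("fir", "fft"), ("fft", "sink")]

cost, assign = partition(example_filters, example_edges, transfer_cost=1)
print(cost, assign)
```

Exhaustive enumeration only works for a handful of filters, since the number of assignments doubles with each filter added; that growth is one reason a problem of this shape is typically handed to an ILP solver or approximated with a heuristic.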

Published In

ACM SIGPLAN Notices Volume 44, Issue 7

LCTES '09

July 2009

176 pages

ISSN:0362-1340

EISSN:1558-1160

DOI:10.1145/1543136

LCTES '09: Proceedings of the 2009 ACM SIGPLAN/SIGBED conference on Languages, compilers, and tools for embedded systems
June 2009
188 pages
ISBN:9781605583563
DOI:10.1145/1542452
General Chair: Christoph Kirsch (University of Salzburg, Austria)
Program Chair: Mahmut Kandemir (Pennsylvania State University, USA)

Copyright © 2009 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Bibliometrics

Total Citations: 16
Total Downloads: 487
Downloads (Last 12 months): 3
Downloads (Last 6 weeks): 0

Reflects downloads up to 01 Sep 2024

Cited By

Mittal S, Vetter J (2015). A Survey of CPU-GPU Heterogeneous Computing Techniques. ACM Computing Surveys 47:4 (1-35). DOI: 10.1145/2788396. Online publication date: 21-Jul-2015
https://dl.acm.org/doi/10.1145/2788396
Cong J, Huang M, Liu B, Zhang P, Zou Y, Rosenstiel W, Macii E (2012). Combining module selection and replication for throughput-driven streaming programs. Proceedings of the Conference on Design, Automation and Test in Europe (1018-1023). DOI: 10.5555/2492708.2492962. Online publication date: 12-Mar-2012
https://dl.acm.org/doi/10.5555/2492708.2492962
Shan J, Casu M, Cortadella J, Lavagno L, Lazarescu M (2019). Exact and Heuristic Allocation of Multi-kernel Applications to Multi-FPGA Platforms. Proceedings of the 56th Annual Design Automation Conference 2019 (1-6). DOI: 10.1145/3316781.3317821. Online publication date: 2-Jun-2019
https://dl.acm.org/doi/10.1145/3316781.3317821
Farhad S, Nayeem M, Rahman M, Rahman M (2016). Mapping stream programs onto multicore platforms by local search and genetic algorithm. Computer Languages, Systems and Structures 46:C (182-205). DOI: 10.1016/j.cl.2016.08.007. Online publication date: 1-Nov-2016
https://dl.acm.org/doi/10.1016/j.cl.2016.08.007
Kumar Thakur R, Srikant Y, Corporaal H, Stuijk S (2015). Efficient Compilation of Stream Programs for Heterogeneous Architectures. Proceedings of the 18th International Workshop on Software and Compilers for Embedded Systems (38-47). DOI: 10.1145/2764967.2764968. Online publication date: 1-Jun-2015
https://dl.acm.org/doi/10.1145/2764967.2764968
Cong J, Huang M, Zhang P, Betz V, Constantinides G (2014). Combining computation and communication optimizations in system synthesis for streaming applications. Proceedings of the 2014 ACM/SIGDA international symposium on Field-programmable gate arrays (213-222). DOI: 10.1145/2554688.2554771. Online publication date: 26-Feb-2014
https://dl.acm.org/doi/10.1145/2554688.2554771
Farhad S, Ko Y, Burgstaller B, Scholz B (2012). Profile-guided deployment of stream programs on multicores. ACM SIGPLAN Notices 47:5 (79-88). DOI: 10.1145/2345141.2248430. Online publication date: 12-Jun-2012
https://dl.acm.org/doi/10.1145/2345141.2248430
Farhad S, Ko Y, Burgstaller B, Scholz B, Wilhelm R, Falk H, Yi W (2012). Profile-guided deployment of stream programs on multicores. Proceedings of the 13th ACM SIGPLAN/SIGBED International Conference on Languages, Compilers, Tools and Theory for Embedded Systems (79-88). DOI: 10.1145/2248418.2248430. Online publication date: 12-Jun-2012
https://dl.acm.org/doi/10.1145/2248418.2248430
Cong J, Muhuan Huang, Bin Liu, Peng Zhang, Yi Zou (2012). Combining module selection and replication for throughput-driven streaming programs. 2012 Design, Automation & Test in Europe Conference & Exhibition (DATE) (1018-1023). DOI: 10.1109/DATE.2012.6176645. Online publication date: Mar-2012
https://doi.org/10.1109/DATE.2012.6176645
Prasad A, Anantpur J, Govindarajan R, Hall M, Padua D (2011). Automatic compilation of MATLAB programs for synergistic execution on heterogeneous processors. Proceedings of the 32nd ACM SIGPLAN Conference on Programming Language Design and Implementation (152-163). DOI: 10.1145/1993498.1993517. Online publication date: 4-Jun-2011
https://dl.acm.org/doi/10.1145/1993498.1993517
