research-article

Unrolling and retiming of stream applications onto embedded multicore processors

Authors:

Karam S. Chatha

DAC '12: Proceedings of the 49th Annual Design Automation Conference

Pages 1272 - 1277

https://doi.org/10.1145/2228360.2228598

Published: 03 June 2012

Abstract

In recent years, stream applications have become prevalent in many embedded domains. Stream applications differ from traditional sequential programs in their well-defined independent actors, explicit data communication, and stable code/data access patterns. To achieve high performance and low power, scratch pad memory (SPM) has been introduced in today's embedded multicore processors. Programming SPM-based architectures is both challenging and time-consuming. In this paper we address the problem of automatically compiling stream applications onto SPM-based embedded multicore processors through unrolling and retiming. In our technique, code overlay and data overlay are implemented to overcome the limited SPM capacity. Smart double buffering and code prefetching are introduced to amortize memory access delays. We evaluated the efficiency of our technique by compiling several stream applications onto the IBM Cell processor and comparing their performance with that of existing approaches.
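As a rough illustration of why double buffering can amortize memory access delays, the following sketch compares a serial fetch-then-compute schedule with a double-buffered one. It is a generic textbook timing model, not the paper's "smart double buffering" scheme; the function names, the fixed per-block costs `t_dma` and `t_compute`, and the assumption that a transfer can fully overlap with computation are all hypothetical simplifications.

```python
def serial_time(n_blocks, t_dma, t_compute):
    """Fetch each block into the SPM, then compute on it; nothing overlaps."""
    return n_blocks * (t_dma + t_compute)


def double_buffered_time(n_blocks, t_dma, t_compute):
    """Fetch block i+1 into a second buffer while computing on block i."""
    if n_blocks == 0:
        return 0
    # The first fetch is exposed. Each following step costs the slower of
    # (fetch next block, compute current block). The last compute is exposed.
    return t_dma + (n_blocks - 1) * max(t_dma, t_compute) + t_compute


# Hypothetical costs: 8 blocks, 3 time units per transfer, 5 per compute.
print(serial_time(8, 3, 5), double_buffered_time(8, 3, 5))  # 64 43
```

Under these assumptions the transfer cost is hidden whenever `t_compute >= t_dma`, at the price of reserving two buffers in the limited SPM.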

Cited By

Jeong D, Kim J, Oldja M, Ha S (2021). Parallel Scheduling of Multiple SDF Graphs Onto Heterogeneous Processors. IEEE Access 9, 20493-20507. Online publication date: 2021.
https://doi.org/10.1109/ACCESS.2021.3054725
Belkebir D, Zga A (2019). Mapping and scheduling techniques in NoC: A survey of the state of the art. 2019 International Conference on Networking and Advanced Systems (ICNAS), 1-6. Online publication date: Jun-2019.
https://doi.org/10.1109/ICNAS.2019.8807815
Jalaja S, Vijaya Prakash A (2018). Partition Based Product Term Retiming for Reliable Low Power Logic Structure. Advances in Information and Communication Networks, 178-189. Online publication date: 6-Dec-2018.
https://doi.org/10.1007/978-3-030-03402-3_13

Index Terms

Unrolling and retiming of stream applications onto embedded multicore processors
1. Software and its engineering
  1. Software notations and tools
    1. Compilers

Published In

DAC '12: Proceedings of the 49th Annual Design Automation Conference

June 2012

1357 pages

ISBN:9781450311991

DOI:10.1145/2228360

General Chair: Patrick Groeneveld (Magma Design Automation, Inc., San Jose, CA)

Program Chairs: Donatella Sciuto (Politecnico di Milano, Milano, Italy), Soha Hassoun (Tufts Univ., Medford, MA)

Copyright © 2012 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

EDAC: Electronic Design Automation Consortium
SIGDA: ACM Special Interest Group on Design Automation
IEEE-CEDA

In-Cooperation

SIGBED: ACM Special Interest Group on Embedded Systems

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Funding Sources

Division of Computing and Communication Foundations

Conference

DAC '12

Sponsor:

EDAC
SIGDA

DAC '12: The 49th Annual Design Automation Conference 2012

June 3 - 7, 2012

San Francisco, California

Acceptance Rates

Overall Acceptance Rate 1,770 of 5,499 submissions, 32%

Cited By

Zhu X, Geilen M, Basten T, Stuijk S (2016). Multiconstraint Static Scheduling of Synchronous Dataflow Graphs Via Retiming and Unfolding. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 35:6, 905-918. Online publication date: Jun-2016.
https://doi.org/10.1109/TCAD.2015.2495167
Jiang G, Li Z, Wang F, Wei S, Napoli C, Salapura V, Franke H, Hou R (2015). Scheduling stream programs with improving arithmetic unit usage on NoC-based VLIW multi-core architectures. Proceedings of the 12th ACM International Conference on Computing Frontiers, 1-8. Online publication date: 6-May-2015.
https://dl.acm.org/doi/10.1145/2742854.2742872
Jahn J, Pagani S, Kobbe S, Chen J, Henkel J (2015). Runtime Resource Allocation for Software Pipelines. ACM Transactions on Parallel Computing 2:1, 1-23. Online publication date: 21-May-2015.
https://dl.acm.org/doi/10.1145/2742347
Singh A, Shafique M, Kumar A, Henkel J (2013). Mapping on multi/many-core systems. Proceedings of the 50th Annual Design Automation Conference, 1-10. Online publication date: 29-May-2013.
https://dl.acm.org/doi/10.1145/2463209.2488734
