research-article

Scheduling stream programs with improving arithmetic unit usage on NoC-based VLIW multi-core architectures

Authors:

Shaojun Wei

CF '15: Proceedings of the 12th ACM International Conference on Computing Frontiers

Article No.: 18, Pages 1 - 8

https://doi.org/10.1145/2742854.2742872

Published: 06 May 2015

Abstract

The stream programming model has received considerable interest because it naturally exposes task, data, and pipeline parallelism. Much research has concentrated on scheduling stream programs on multi-core systems. However, little of it considers arithmetic unit utilization, which is a vital factor in determining the performance of multi-core systems. This paper focuses on scheduling stream programs on NoC-based VLIW multi-core architectures, aiming to improve performance by increasing arithmetic unit utilization. The proposed scheme has three phases. First, the stream program is replicated into multiple threads to provide enough parallel kernels. Second, parallel kernels are grouped, and the operators of each kernel group are scheduled together for high arithmetic unit utilization. Third, a hierarchical integer linear programming (ILP)-based methodology maps kernel groups onto cores to optimize the maximum workload. A set of benchmarks is used for evaluation. Experimental results show that, compared with two other existing scheduling schemes, the proposed scheme can significantly improve the performance of the multi-core processor.
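To make the objective of the third phase concrete, the following is a minimal sketch of a min-max workload mapping. It is not the paper's hierarchical ILP formulation: it substitutes an exhaustive search that is only practical for very small instances, and it ignores NoC communication cost. The function name `map_groups_to_cores` and the workload values are hypothetical, chosen purely for illustration.

```python
from itertools import product


def map_groups_to_cores(workloads, num_cores):
    """Assign each kernel group to one core so that the largest
    per-core workload is as small as possible.

    Exhaustive search over all assignments (illustrative stand-in
    for an ILP solver; assumes num_cores >= 1).
    """
    best_assignment, best_max = None, float("inf")
    # Each assignment is a tuple: assignment[g] is the core of group g.
    for assignment in product(range(num_cores), repeat=len(workloads)):
        loads = [0] * num_cores
        for group, core in enumerate(assignment):
            loads[core] += workloads[group]
        if max(loads) < best_max:
            best_assignment, best_max = assignment, max(loads)
    return best_assignment, best_max


# Hypothetical per-group workloads (e.g. cycles per iteration).
workloads = [7, 5, 4, 3, 3, 2]
assignment, max_load = map_groups_to_cores(workloads, 3)
print(assignment, max_load)
```

The search space grows as `num_cores ** len(workloads)`, which is why a realistic instance would be handed to an ILP solver with one binary variable per (group, core) pair and a variable bounding every core's load from above.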

Cited By

Fan Z, Li W, Liu T, Tang S, Wang Z, An X, Ye X, Fan D (2022). A Loop Optimization Method for Dataflow Architecture. 2022 IEEE 24th Int Conf on High Performance Computing & Communications; 8th Int Conf on Data Science & Systems; 20th Int Conf on Smart City; 8th Int Conf on Dependability in Sensor, Cloud & Big Data Systems & Application (HPCC/DSS/SmartCity/DependSys), pages 202-211. Online publication date: Dec-2022.
https://doi.org/10.1109/HPCC-DSS-SmartCity-DependSys57074.2022.00059
Guo Q, Sartor A, Brandon A, Beck A, Zhou X, Wong S, Fanucci L, Teich J (2016). Run-time phase prediction for a reconfigurable VLIW processor. Proceedings of the 2016 Conference on Design, Automation & Test in Europe, pages 1634-1639. Online publication date: 14-Mar-2016.
https://dl.acm.org/doi/10.5555/2971808.2972188

Index Terms

1. Software and its engineering
  1. Software organization and properties
    1. Contextual software domains
      1. Operating systems
        Process management
        Scheduling
2. Theory of computation
  1. Models of computation
    1. Concurrency
      1. Parallel computing models

Information & Contributors

Information

Published In

CF '15: Proceedings of the 12th ACM International Conference on Computing Frontiers

May 2015

413 pages

ISBN:9781450333580

DOI:10.1145/2742854

General Chairs:
Claudia Di Napoli, Istituto di Calcolo e Reti ad Alte Prestazioni, CNR, ITALY
Valentina Salapura, IBM T. J. Watson Research Center

Program Chairs:
Hubertus Franke, IBM T. J. Watson Research Center
Rui Hou, Institute for Computing Technology, Chinese Academy of Sciences, PRC

Copyright © 2015 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGMICRO: ACM Special Interest Group on Microarchitectural Research and Processing

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

CF'15

Sponsor:

SIGMICRO

CF'15: Computing Frontiers Conference

May 18 - 21, 2015

Ischia, Italy

Acceptance Rates

CF '15 Paper Acceptance Rate 33 of 96 submissions, 34%;

Overall Acceptance Rate 273 of 785 submissions, 35%

Bibliometrics & Citations

Total Citations: 2
Total Downloads: 95
Downloads (Last 12 months): 5
Downloads (Last 6 weeks): 0

Reflects downloads up to 22 Sep 2024
