A C2RTL Framework Supporting Partition, Parallelization, and FIFO Sizing for Streaming Applications

Authors:

Xiaobo Sharon Hu,

Huazhong Yang

ACM Transactions on Design Automation of Electronic Systems (TODAES), Volume 21, Issue 2

Article No.: 19, Pages 1 - 32

https://doi.org/10.1145/2797135

Published: 28 January 2016

Abstract

Developing circuits for streaming applications written in C (or its variants) can benefit greatly from C-to-RTL (C2RTL) synthesis. Yet most existing C2RTL tools lack system-level options for trading off design constraints such as delay and area. This article introduces a systematic way to accomplish C2RTL synthesis for streaming applications containing thousands of lines of code in C (or its variants), a scale that presents serious challenges for existing C2RTL tools. Specifically, the proposed approach simultaneously determines the number of pipeline stages and the number of times each functional block is duplicated in each pipeline stage. A mixed integer linear programming-based formulation is given for obtaining the optimal solution, and a heuristic algorithm is developed for large-scale problems. To accommodate differences in data rates between adjacent hardware modules, first-in-first-out (FIFO) buffers are indispensable, but their overhead is not negligible. A parallelism-aware FIFO sizing method is therefore introduced to determine the optimal FIFO sizes. Experimental results on seven real-world applications demonstrate that the algorithms in the synthesis flow make effective design trade-offs and find superior solutions in a short time compared with existing approaches. Furthermore, the algorithms achieve optimal results in most cases with subsecond running time.
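The trade-off between pipeline partitioning and block duplication described in the abstract can be illustrated with a small toy model. The sketch below is not the article's MILP formulation or its heuristic. It assumes a linear chain of blocks, contiguous stages whose blocks run sequentially, whole-stage replication, and a fixed area cost per inter-stage FIFO. The function name `plan_pipeline` and all of its parameters are hypothetical and chosen only for illustration.

```python
from math import ceil


def plan_pipeline(latency, area, target_interval, fifo_area):
    """Toy model (an assumption, not the article's method): find the
    minimum-area split of a block chain into replicated pipeline stages.

    latency[i]       cycles block i needs per data item
    area[i]          area of one copy of block i
    target_interval  required cycles between consecutive input items
    fifo_area        assumed fixed area of the FIFO between two stages
    """
    n = len(latency)
    # best[i] = (total_area, plan) for the first i blocks, where plan is a
    # list of (first_block, last_block, copies) tuples.
    best = [(0, [])] + [None] * n
    for end in range(1, n + 1):
        for start in range(end):
            stage_latency = sum(latency[start:end])
            stage_area = sum(area[start:end])
            # Copies needed so that the stage, taken as a whole, accepts
            # one item every target_interval cycles.
            copies = ceil(stage_latency / target_interval)
            prev_area, prev_plan = best[start]
            # Every stage after the first adds one FIFO on its input side.
            cost = prev_area + copies * stage_area
            if start > 0:
                cost += fifo_area
            if best[end] is None or cost < best[end][0]:
                best[end] = (cost, prev_plan + [(start, end - 1, copies)])
    return best[n]


if __name__ == "__main__":
    # Tight throughput target: splitting into more stages avoids
    # replicating cheap blocks together with expensive ones.
    print(plan_pipeline([4, 2, 6], [10, 5, 20], target_interval=4, fifo_area=3))
    # Relaxed target: one unreplicated stage avoids all FIFO overhead.
    print(plan_pipeline([4, 2, 6], [10, 5, 20], target_interval=12, fifo_area=3))
```

Even in this simplified setting, the cheapest plan changes with the throughput target, which is why the stage count and the duplication factors need to be chosen together rather than one after the other.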

Cited By

S. Liu, F. Lau, and B. Schafer. 2020. Predictive Compositional Method to Design and Reoptimize Complex Behavioral Dataflows. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 39, 10 (2020), 2615--2627. Online publication date: October 2020.
https://doi.org/10.1109/TCAD.2020.2966447

Index Terms

A C2RTL Framework Supporting Partition, Parallelization, and FIFO Sizing for Streaming Applications
1. Computer systems organization
  1. Embedded and cyber-physical systems
    1. Embedded systems
2. Hardware
  1. Electronic design automation

Published In

ACM Transactions on Design Automation of Electronic Systems Volume 21, Issue 2

January 2016

422 pages

ISSN: 1084-4309

EISSN: 1557-7309

DOI: 10.1145/2888405

Editor:
Naehyuck Chang
Korea Advanced Institute of Science and Technology, Korea

Copyright © 2016 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Journal Family

ACM Journals for the Design of Smart and Connected Systems

Publication History

Published: 28 January 2016

Accepted: 01 June 2015

Revised: 01 March 2015

Received: 01 November 2014

Published in TODAES Volume 21, Issue 2

Qualifiers

Research-article
Research
Refereed

Funding Sources

High-Tech Research and Development (863) Program
Huawei Shannon Lab and the Importation and Development of High-Caliber Talents Project of Beijing Municipal Institutions
