A C2RTL Framework Supporting Partition, Parallelization, and FIFO Sizing for Streaming Applications

Published: 28 January 2016

Abstract

Developing circuits for streaming applications written in C (or its variants) can benefit greatly from C-to-RTL (C2RTL) synthesis. Yet most existing C2RTL tools lack system-level options to trade off design constraints such as delay and area. This article introduces a systematic way to accomplish C2RTL synthesis for streaming applications containing thousands of lines of C (or C-variant) code. Synthesizing circuits for such large applications presents serious challenges for existing C2RTL tools. Specifically, the proposed approach simultaneously determines the number of pipeline stages and the number of times that each functional block is duplicated in each pipeline stage. A mixed integer linear programming-based formulation is given for obtaining the optimal solution. Furthermore, a heuristic algorithm is developed for large-scale problems. To accommodate the differences in data rates between adjacent hardware modules, first-in-first-out (FIFO) buffers are indispensable, but their overheads are nonnegligible. A parallelism-aware FIFO sizing method is also introduced to determine the optimal sizes of FIFOs. Experimental results on seven real-world applications demonstrate that the algorithms in the synthesis flow can make effective design trade-offs and find superior solutions in a short time compared with existing approaches. Furthermore, the algorithms achieve optimal results in most cases with subsecond running time.
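
The trade-off described in the abstract can be illustrated with a toy sketch. This is not the article's MILP formulation or its heuristic; it is an exhaustive search over a small chain of blocks under an assumed cost model. The function name `best_partition`, its parameters, and the cost model (a stage's delay and area are the sums over its blocks, a stage is duplicated just enough to meet a target interval, and every stage boundary costs one FIFO of fixed area) are all hypothetical choices made for illustration.

```python
from itertools import product
from math import ceil


def best_partition(delays, areas, target_interval, fifo_area):
    """Exhaustively search contiguous partitions of a block chain.

    Assumed (hypothetical) cost model:
      * a stage's delay and area are the sums over its blocks;
      * a stage is duplicated ceil(delay / target_interval) times so
        that it accepts one item every target_interval cycles;
      * each boundary between stages costs one FIFO of fifo_area.

    Returns (total_area, stages), where stages is a list of
    (first_block, last_block, copies) tuples.
    """
    n = len(delays)
    best = None
    # cuts[i] == 1 means a stage boundary (and a FIFO) after block i.
    for cuts in product((0, 1), repeat=n - 1):
        stages, total, start = [], 0, 0
        for end in range(n):
            if end == n - 1 or cuts[end]:
                delay = sum(delays[start:end + 1])
                area = sum(areas[start:end + 1])
                copies = ceil(delay / target_interval)
                stages.append((start, end, copies))
                total += copies * area
                start = end + 1
        total += fifo_area * sum(cuts)
        if best is None or total < best[0]:
            best = (total, stages)
    return best


# Three blocks with made-up delays (cycles) and areas (arbitrary units).
result = best_partition([4, 2, 6], [10, 5, 20], target_interval=4, fifo_area=8)
print(result)  # (68, [(0, 0, 1), (1, 2, 2)])
```

With these made-up numbers the cheapest design keeps block 0 alone and merges blocks 1 and 2 into one stage duplicated twice, because the FIFO saved by merging outweighs the extra duplicated area. The search visits 2^(n-1) partitions, so it only works for very short chains, which is consistent with the abstract's motivation for pairing an exact formulation with a heuristic for large-scale problems.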

Cited By

  • (2020) Predictive Compositional Method to Design and Reoptimize Complex Behavioral Dataflows. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 39:10 (2615-2627). DOI: 10.1109/TCAD.2020.2966447. Online publication date: Oct-2020

Published In

ACM Transactions on Design Automation of Electronic Systems, Volume 21, Issue 2
January 2016
422 pages
ISSN:1084-4309
EISSN:1557-7309
DOI:10.1145/2888405
Editor: Naehyuck Chang
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 28 January 2016
Accepted: 01 June 2015
Revised: 01 March 2015
Received: 01 November 2014
Published in TODAES Volume 21, Issue 2

Author Tags

  1. FIFO sizing
  2. System-level design and optimization
  3. parallelization
  4. partition
  5. streaming applications

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

  • High-Tech Research and Development (863) Program
  • Huawei Shannon Lab and the Importation and Development of High-Caliber Talents Project of Beijing Municipal Institutions
