Speedups in embedded systems with a high-performance coprocessor datapath

Authors:

Michalis D. Galanis,

Gregory Dimitroulakos,

Spyros Tragoudas,

Costas E. Goutis

ACM Transactions on Design Automation of Electronic Systems (TODAES), Volume 12, Issue 3

Article No.: 35, Pages 1 - 22

https://doi.org/10.1145/1255456.1255472

Published: 22 May 2008

Abstract

This article presents the speedups achieved in a generic single-chip microprocessor system by employing a high-performance datapath. The datapath acts as a coprocessor that accelerates computationally intensive kernel sections, thereby increasing overall performance. We have previously introduced the datapath, which is composed of Flexible Computational Components (FCCs). These components can realize any two-level template of primitive operations. We present an automated method for synthesizing the coprocessor from a high-level software description, together with its integration into a design flow for executing applications on the system. To evaluate the effectiveness of our coprocessor approach, we perform an analytical study with respect to the type of custom datapath and the microprocessor architecture. The overall speedups of several real-life applications, relative to software-only execution on the microprocessor, are estimated using the design flow. These speedups range from 1.75 to 5.84, with an average of 3.04, while the circuit-area overhead is small. The design flow accelerates the applications close to the theoretical speedup bounds. A comparison with another high-performance datapath shows that the proposed coprocessor achieves area-time products that are smaller by 23% on average for the generated datapaths. Additionally, the FCC coprocessor outperforms software-programmable DSP cores in accelerating kernels.
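
The abstract does not state how the theoretical speedup bounds are derived. As a rough illustration only, the sketch below assumes an Amdahl-style model in which a fraction of the software execution time is spent in kernels and only that fraction is accelerated. The function names, the parameters, and the numbers in the usage note are hypothetical and are not taken from the article; communication overhead between the microprocessor and the coprocessor is ignored.

```python
def overall_speedup(kernel_fraction: float, kernel_speedup: float) -> float:
    """Amdahl-style estimate of whole-application speedup.

    kernel_fraction: share of software execution time spent in kernels (0..1).
    kernel_speedup:  how much faster the kernels run on the coprocessor.
    Both parameters are illustrative assumptions, not values from the article.
    """
    if not 0.0 <= kernel_fraction <= 1.0:
        raise ValueError("kernel_fraction must be in [0, 1]")
    if kernel_speedup <= 0.0:
        raise ValueError("kernel_speedup must be positive")
    # Non-kernel code runs unchanged; kernel time shrinks by kernel_speedup.
    return 1.0 / ((1.0 - kernel_fraction) + kernel_fraction / kernel_speedup)


def speedup_bound(kernel_fraction: float) -> float:
    """Upper bound on speedup if kernels took zero time on the coprocessor."""
    if not 0.0 <= kernel_fraction < 1.0:
        raise ValueError("kernel_fraction must be in [0, 1)")
    return 1.0 / (1.0 - kernel_fraction)
```

For example, with a hypothetical application that spends 75% of its time in kernels, `overall_speedup(0.75, 3.0)` gives 2.0 while `speedup_bound(0.75)` gives 4.0. Under this model, the gap between the two shrinks as the kernel speedup grows, which is one way to read the claim that the design flow comes close to the bounds.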

Cited By

Desnos, K. and Palumbo, F. 2018. Dataflow Modeling for Reconfigurable Signal Processing Systems. In Handbook of Signal Processing Systems. 787--824. Online publication date: 14-Oct-2018. https://doi.org/10.1007/978-3-319-91734-4_22

Kim, Y. and Mahapatra, R. 2010. Dynamic Context Compression for Low-Power CGRA. In Design of Low-Power Coarse-Grained Reconfigurable Architectures. 119--140. Online publication date: 9-Dec-2010. https://doi.org/10.1201/b10471-13

Kim, Y. and Mahapatra, R. 2009. Hierarchical reconfigurable computing arrays for efficient CGRA-based embedded systems. In Proceedings of the 46th Annual Design Automation Conference. 826--831. Online publication date: 26-Jul-2009. https://dl.acm.org/doi/10.1145/1629911.1630123

Published In

ACM Transactions on Design Automation of Electronic Systems Volume 12, Issue 3

August 2007

427 pages

ISSN:1084-4309

EISSN:1557-7309

DOI:10.1145/1255456

Copyright © 2008 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Journal Family

ACM Journals for the Design of Smart and Connected Systems

Publication History

Published: 22 May 2008

Accepted: 01 December 2006

Revised: 01 October 2006

Received: 01 April 2006

Published in TODAES Volume 12, Issue 3

Qualifiers

Research-article
Research
Refereed

Bibliometrics

Total Citations: 3

Total Downloads: 478

Downloads (Last 12 months): 3

Downloads (Last 6 weeks): 1

Reflects downloads up to 27 Jan 2025
