
Energy savings and speedups from partitioning critical software loops to hardware in embedded systems

Authors:

Shawn Nematbakhsh

ACM Transactions on Embedded Computing Systems (TECS), Volume 3, Issue 1

Pages 218 - 232

https://doi.org/10.1145/972627.972637

Published: 01 February 2004

Abstract

We present results of extensive hardware/software partitioning experiments on numerous benchmarks. We describe our loop-oriented partitioning methodology for moving critical code from software to hardware. Our benchmarks included programs from PowerStone, MediaBench, and NetBench. Our experiments included estimated results for partitioning using an 8051 8-bit microcontroller or a 32-bit MIPS microprocessor for the software, and using on-chip configurable logic or custom application-specific integrated circuit hardware for the hardware. Additional experiments involved actual measurements taken from several physical implementations of hardware/software partitionings on real single-chip microprocessor/configurable-logic devices. We also estimated results assuming voltage-scalable processors. We provide performance, energy, and size data for all of the experiments. We found that the benchmarks spent an average of 80% of their execution time in only 3% of their code, amounting to only about 200 bytes of critical code. For various experiments, we found that moving critical code to hardware resulted in average speedups of 3 to 5 and average energy savings of 35% to 70%, with average hardware requirements of only 5000 to 10,000 gates. To our knowledge, these experiments represent the most comprehensive hardware/software partitioning study published to date.
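The relationship between the 80% figure and the reported speedups can be illustrated with Amdahl's law. The sketch below is not taken from the paper: the function name and the per-loop speedup values are hypothetical, and only the 80% execution-time fraction comes from the abstract. It shows that if the critical loops account for 80% of execution time, the overall speedup cannot exceed 5 no matter how fast the hardware runs those loops.

```python
def overall_speedup(loop_fraction: float, loop_speedup: float) -> float:
    """Amdahl's law: overall speedup when only the critical loops are accelerated.

    loop_fraction: share of original execution time spent in the critical loops.
    loop_speedup:  factor by which the hardware speeds up those loops (assumed).
    """
    if not 0.0 <= loop_fraction <= 1.0:
        raise ValueError("loop_fraction must be in [0, 1]")
    if loop_speedup <= 0.0:
        raise ValueError("loop_speedup must be positive")
    # Time left in software plus the shortened time of the accelerated loops.
    return 1.0 / ((1.0 - loop_fraction) + loop_fraction / loop_speedup)


LOOP_FRACTION = 0.80  # average reported in the abstract

# Hypothetical hardware speedups of the loops themselves (illustration only).
for k in (2, 10, 100):
    print(f"loop speedup {k:>3}x -> overall {overall_speedup(LOOP_FRACTION, k):.2f}x")

# Limit as the loop speedup grows without bound: 1 / (1 - loop_fraction).
upper_bound = 1.0 / (1.0 - LOOP_FRACTION)
print(f"upper bound: {upper_bound:.2f}x")
```

Under this simple model, a loop fraction of 0.80 caps the overall speedup at 5, which sits at the top of the reported average range of 3 to 5. The paper's averages come from per-benchmark estimates and measurements, so this is only a plausibility check, not a reconstruction of its method.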

Index Terms

Energy savings and speedups from partitioning critical software loops to hardware in embedded systems
1. Computer systems organization
  1. Embedded and cyber-physical systems
  2. Real-time systems

Published In

ACM Transactions on Embedded Computing Systems Volume 3, Issue 1

February 2004

232 pages

ISSN:1539-9087

EISSN:1558-3465

DOI:10.1145/972627

Copyright © 2004 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Journal Family

ACM Journals for the Design of Smart and Connected Systems

Bibliometrics & Citations

Article Metrics

Total Citations: 82
Total Downloads: 1,343
Downloads (last 12 months): 6
Downloads (last 6 weeks): 1

Reflects downloads up to 13 Nov 2024

Citations

Cited By

Chen H. (2023). Optimization Methods of Multi-Core Embedded System. Highlights in Science, Engineering and Technology 71, 153-162. Online publication date: 28-Nov-2023.
https://doi.org/10.54097/hset.v71i.12686

Özeloğlu A., Gürbüz İ., San İ. (2021). Deep reinforcement learning-based autonomous parking design with neural network compute accelerators. Concurrency and Computation: Practice and Experience 34:9. Online publication date: 2-Nov-2021.
https://doi.org/10.1002/cpe.6670

Guo Z., Zhang X., Zhao B. (2019). A Memory-Reinforced Tabu Search Algorithm With Critical Path Awareness for HW/SW Partitioning on Reconfigurable MPSoCs. IEEE Access 7, 112448-112458. Online publication date: 2019.
https://doi.org/10.1109/ACCESS.2019.2934390

Agostini L., Tavares T. (2019). Designing and Developing Architectures to Tangible User Interfaces: A “Softwareless” Approach. HCI International 2019 - Posters, 469-475. Online publication date: 6-Jul-2019.
https://doi.org/10.1007/978-3-030-23528-4_64

Khurshid B., Naaz R. (2017). Efficient Realization of Fixed-Point Binary and Ternary Adders on FPGAs. Journal of Circuits, Systems and Computers 26:04, 1750053. Online publication date: Apr-2017.
https://doi.org/10.1142/S0218126617500530

Govil N., Shrestha R., Roy Chowdhury S. (2017). PGMA. Microprocessors & Microsystems 54:C, 83-96. Online publication date: 1-Oct-2017.
https://dl.acm.org/doi/10.1016/j.micpro.2017.09.002

Khurshid B. (2017). LUT based realization of fixed-point multipliers targeting state-of-art FPGAs. Design Automation for Embedded Systems 21:2, 89-115. Online publication date: 1-Jun-2017.
https://dl.acm.org/doi/10.1007/s10617-017-9184-x

Khurshid B., Mir R. (2017). An Efficient FIR Filter Structure Based on Technology-Optimized Multiply-Adder Unit Targeting LUT-Based FPGAs. Circuits, Systems, and Signal Processing 36:2, 600-639. Online publication date: 1-Feb-2017.
https://dl.acm.org/doi/10.1007/s00034-016-0312-9

Mao F., Chen Y., Zhang W., Li H., He B. (2016). Library-Based Placement and Routing in FPGAs with Support of Partial Reconfiguration. ACM Transactions on Design Automation of Electronic Systems 21:4, 1-26. Online publication date: 18-May-2016.
https://dl.acm.org/doi/10.1145/2901295

Khurshid B., Mir R. (2015). Power efficient implementation of bit-parallel unrolled CORDIC structures for FPGA platforms. 2015 International Conference on VLSI Systems, Architecture, Technology and Applications (VLSI-SATA), 1-6. Online publication date: Jan-2015.
https://doi.org/10.1109/VLSI-SATA.2015.7050466
