Warp Processors

Authors:

Frank Vahid

ACM Transactions on Design Automation of Electronic Systems (TODAES), Volume 11, Issue 3

Pages 659 - 681

https://doi.org/10.1145/1142980.1142986

Published: 07 June 2004

Abstract

We describe a new processing architecture, known as a warp processor, that uses a field-programmable gate array (FPGA) to improve the speed and energy consumption of a software binary executing on a microprocessor. Unlike previous approaches that also improve software using an FPGA but rely on a special compiler, a warp processor achieves these improvements completely transparently and operates from a standard binary. A warp processor dynamically detects the binary's critical regions, reimplements those regions as a custom hardware circuit in the FPGA, and replaces the software region with a call to the new hardware implementation of that region. While not all benchmarks can be improved using warp processing, many can, and the improvements are dramatically better than those achievable by more traditional architecture improvements. The hardest part of warp processing is dynamically reimplementing code regions on an FPGA, which requires partitioning, decompilation, synthesis, placement, and routing tools that all execute with minimal computation time and data memory so as to coexist on chip with the main processor. We describe the results of developing our warp processor. We developed a custom FPGA fabric specifically designed to enable lean place-and-route tools, and we developed extremely fast and efficient versions of partitioning, decompilation, synthesis, technology mapping, placement, and routing. Warp processors achieve overall application speedups of 6.3X with energy savings of 66% across a set of embedded benchmark applications. We further show that our tools use acceptably small amounts of computation and memory, far less than traditional tools. Our work illustrates the feasibility and potential of warp processing, and we can foresee the possibility of warp processing becoming a feature in a variety of computing domains, including desktop, server, and embedded applications.
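
The detection step described in the abstract can be illustrated with a minimal sketch. It assumes, purely for illustration, that critical regions are loops and that they can be found by counting taken backward branches in an execution trace. The trace format, the function name `find_critical_loops`, and the toy addresses are hypothetical and are not taken from the paper, whose on-chip profiler is implemented in hardware rather than in software.

```python
from collections import Counter
from typing import Iterable, List, Tuple

# Hypothetical trace record: (branch_address, target_address) per taken branch.
Branch = Tuple[int, int]


def find_critical_loops(trace: Iterable[Branch], top_n: int = 1) -> List[Tuple[Branch, int]]:
    """Return the top_n most frequently taken backward branches.

    A taken branch whose target is at or below its own address closes a loop,
    so its count approximates how many times that loop body iterated.
    """
    counts: Counter = Counter()
    for branch_addr, target_addr in trace:
        if target_addr <= branch_addr:  # backward branch: a loop back-edge
            counts[(branch_addr, target_addr)] += 1
    return counts.most_common(top_n)


# Toy trace (assumed data): an inner loop iterates 1000 times, an outer loop
# 10 times, and one forward branch is ignored by the profiler.
trace = [(0x58, 0x40)] * 1000 + [(0x80, 0x20)] * 10 + [(0x10, 0x90)]
hot = find_critical_loops(trace)
```

Here `hot` holds the back-edge of the inner loop together with its iteration count, which is the kind of region a warp processor would then hand to its on-chip partitioning and synthesis tools. Ranking by raw back-edge count is a simplification; a real selection step would also weigh how well a region maps to hardware.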

Index Terms

Warp Processors
1. Computer systems organization

Published In

ACM Transactions on Design Automation of Electronic Systems Volume 11, Issue 3

July 2006

262 pages

ISSN:1084-4309

EISSN:1557-7309

DOI:10.1145/1142980

Copyright © 2004 ACM.

Publisher

Association for Computing Machinery

New York, NY, United States
