Warp Processors

Published: 07 June 2004

Abstract

We describe a new processing architecture, known as a warp processor, that uses a field-programmable gate array (FPGA) to improve the speed and energy consumption of a software binary executing on a microprocessor. Unlike previous approaches that also improve software using an FPGA but rely on a special compiler, a warp processor achieves these improvements completely transparently and operates from a standard binary. A warp processor dynamically detects the binary's critical regions, reimplements those regions as a custom hardware circuit in the FPGA, and replaces the software region with a call to the new hardware implementation of that region. While not all benchmarks can be improved using warp processing, many can, and the improvements are dramatically better than those achievable by more traditional architecture improvements. The hardest part of warp processing is dynamically reimplementing code regions on an FPGA, which requires partitioning, decompilation, synthesis, placement, and routing tools that all execute with minimal computation time and data memory so as to coexist on chip with the main processor. We describe the results of developing our warp processor. We developed a custom FPGA fabric specifically designed to enable lean place-and-route tools, and we developed extremely fast and efficient versions of partitioning, decompilation, synthesis, technology mapping, placement, and routing. Warp processors achieve overall application speedups of 6.3X with energy savings of 66% across a set of embedded benchmark applications. We further show that our tools use acceptably small amounts of computation and memory, far less than traditional tools require. Our work illustrates the feasibility and potential of warp processing, and we foresee the possibility of warp processing becoming a feature in a variety of computing domains, including desktop, server, and embedded applications.
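The first step described above, detecting a binary's critical regions, can be illustrated with a small sketch. It is not the authors' profiler. It assumes a hypothetical trace of taken branches given as `(branch_pc, target_pc)` pairs and uses a common heuristic: a taken branch to a lower or equal address closes one loop iteration, so the most frequent backward branches mark candidate regions. The names `hot_loops` and `estimated_speedup` are invented for illustration. The second function applies Amdahl's law to estimate the overall gain from accelerating one region; it is a rough model, not the paper's evaluation method.

```python
from collections import Counter
from typing import Iterable, List, Tuple

# Hypothetical trace format (an assumption, not from the paper):
# one (branch_pc, target_pc) pair per taken branch.
Branch = Tuple[int, int]
Loop = Tuple[int, int]  # (loop_start_pc, loop_end_pc)


def hot_loops(trace: Iterable[Branch], top: int = 1) -> List[Tuple[Loop, int]]:
    """Return the `top` most frequently iterated loops as ((start, end), count)."""
    counts: Counter = Counter()
    for branch_pc, target_pc in trace:
        # A taken branch to a lower-or-equal address closes a loop iteration.
        if target_pc <= branch_pc:
            counts[(target_pc, branch_pc)] += 1
    return counts.most_common(top)


def estimated_speedup(fraction: float, kernel_speedup: float) -> float:
    """Amdahl's law: overall speedup when `fraction` of the run time
    is accelerated by a factor of `kernel_speedup`."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    if kernel_speedup <= 0.0:
        raise ValueError("kernel_speedup must be positive")
    return 1.0 / ((1.0 - fraction) + fraction / kernel_speedup)
```

For example, calling `hot_loops` on a trace in which one backward branch dominates returns that loop's address range first, and `estimated_speedup(0.9, 10.0)` shows why a region must account for most of the execution time before moving it to hardware pays off.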

Published In

ACM Transactions on Design Automation of Electronic Systems, Volume 11, Issue 3
July 2006
262 pages
ISSN: 1084-4309
EISSN: 1557-7309
DOI: 10.1145/1142980

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 07 June 2004
Published in TODAES Volume 11, Issue 3

Author Tags

  1. FPGA
  2. Warp processors
  3. configurable logic
  4. dynamic optimization
  5. hardware/software codesign
  6. hardware/software partitioning
  7. just-in-time (JIT) compilation

Qualifiers

  • Article
