DOI: 10.1145/1046192.1046207
Article

An FPGA-based VLIW processor with custom hardware execution

Published: 20 February 2005 Publication History

Abstract

The capability and heterogeneity of new FPGA (Field Programmable Gate Array) devices continue to increase with each new line of devices, and efficiently programming these devices is increasingly difficult. Nevertheless, FPGAs continue to be used for algorithms traditionally targeted to embedded DSP microprocessors, such as signal and image processing applications. This paper presents an architecture that combines VLIW (Very Long Instruction Word) processing with the capability to introduce application-specific customized instructions and complex hardware functions. To support this architecture, a compilation and design automation flow is described for programs written in C. Several design tradeoffs for the architecture were examined, including the number of VLIW functional units and the register file size. The architecture was implemented on an Altera Stratix II FPGA. The Stratix II device was selected because it offers a large number of high-speed DSP (digital signal processing) blocks that execute multiply-accumulate operations. We show that our combined VLIW with hardware functions exhibits as much as 230X speedup, and 63X on average, for computational kernels for a set of benchmarks. This allows for an overall speedup of 30X, and 12X on average, for signal processing benchmarks from MediaBench.
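
The gap between the kernel speedups and the overall speedups is what Amdahl's law predicts: only the fraction of run time spent inside the accelerated kernels benefits from the hardware functions. The sketch below is illustrative and is not taken from the paper. The function names are hypothetical, and pairing the average kernel speedup (63X) with the average overall speedup (12X) is an assumption made only for the arithmetic, since averages across benchmarks need not combine this way.

```python
def overall_speedup(kernel_fraction: float, kernel_speedup: float) -> float:
    """Amdahl's law: whole-program speedup when only a fraction is accelerated."""
    if not 0.0 <= kernel_fraction <= 1.0:
        raise ValueError("kernel_fraction must be in [0, 1]")
    if kernel_speedup <= 0.0:
        raise ValueError("kernel_speedup must be positive")
    return 1.0 / ((1.0 - kernel_fraction) + kernel_fraction / kernel_speedup)


def implied_kernel_fraction(overall: float, kernel_speedup: float) -> float:
    """Invert Amdahl's law: run-time fraction the kernels must occupy."""
    if overall < 1.0 or kernel_speedup <= 1.0 or overall > kernel_speedup:
        raise ValueError("need 1 <= overall <= kernel_speedup and kernel_speedup > 1")
    return (1.0 - 1.0 / overall) / (1.0 - 1.0 / kernel_speedup)


if __name__ == "__main__":
    # Assumption: treat the two reported averages as if they described one program.
    fraction = implied_kernel_fraction(overall=12.0, kernel_speedup=63.0)
    print(f"implied kernel fraction: {fraction:.3f}")
    print(f"round trip: {overall_speedup(fraction, 63.0):.1f}X")
```

Under that assumption, roughly 93% of the original run time would have to lie in the accelerated kernels, which is why the remaining software portion bounds the overall gain.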

Reviews

Vassilios A. Chouliaras

This is a very exciting piece of research in the general area of configurable, extensible processors and the software/hardware interface. The authors propose a hybrid architecture, consisting of a parameterized very long instruction word (VLIW) core augmented with custom hardware execution units, as a very potent programmable execution engine. In addition, they have developed the software infrastructure to allow for automatic optimization of C-based applications.

In the introductory section, the authors identify large-capacity field-programmable gate arrays (FPGAs) with substantial compute/memory resources as becoming commonplace. They correctly point out that the efficient mapping of applications on such devices is no longer a trivial exercise, with a typical use being software kernels allocated on the FPGA fabric, and the irregular (control) part of the application running on an embedded processor. This segregation has indeed been identified by the major FPGA vendors, which utilize embedded processors on their devices to accommodate both regular and irregular codes. The authors provide a good discussion of past and present behavioral synthesis solutions, and correctly identify such solutions as appropriate for combinational code, not for control-dominated applications. In addition, they provide a very good overview of the literature, both from academia and from industry, on configurable (static) and reconfigurable (dynamic) systems for software acceleration.

To address large, irregular code pieces in a semi-automatic manner, the authors propose a parametric platform to efficiently exploit all parallelism. The platform is a four-wide VLIW-based processor that is binary-compatible with the Altera NIOS II instruction set architecture (ISA). In addition, it supports extending that ISA with custom hardware resources to achieve superlinear speedups. The software infrastructure is based on the well-known Trimaran VLIW research infrastructure.

The authors use an interesting technique to extract computational kernels (hardware functions), which are implemented directly as hardware blocks. These blocks make use of the abundant MAC units in typical high-performance FPGA devices, such as the Altera Stratix family. The authors discuss their hardware architecture, which is based on a four-wide VLIW with an eight-read-port, four-write-port (8R/4W) 32x32-bit register file, shared among the VLIW processing elements (PEs) and the custom hardware units. They also correctly identify the register file as the performance-limiting resource in an FPGA implementation, and provide substantial microarchitecture performance data.

In the remaining sections, the authors discuss zero-overhead hardware/software switching, the hardware functions, and the software tool chain. They performed design, validation, and FPGA implementation, and achieved 167 megahertz (MHz) on an Altera Stratix, which is an impressive clock speed for a programmable device. Finally, they report on application speedups for both their standalone VLIW engine and their four-wide VLIW, augmented with hardware functions. Results range from nine percent to 230 times for kernel acceleration, which is indeed impressive.

Overall, this is a thorough account of the proposed field of research; the authors did their best to disclose as much information as possible in the context of a conference paper. I was very much impressed with the technical ability of all those involved. This is a solid paper on embedded central processing unit (CPU) architecture.

Online Computing Reviews Service
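
The 8R/4W register file mentioned in the review is consistent with a four-wide machine in which every issue slot reads two source operands and writes one result per cycle. The following sketch illustrates that arithmetic and a simple way of filling four-wide instruction bundles. It is not the authors' scheduler or the Trimaran algorithm: `Op`, `register_file_ports`, and `pack_bundles` are hypothetical names, and the two-reads/one-write operation shape and single-cycle latency are assumptions.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Op:
    """One register-to-register operation (hypothetical model)."""
    dest: str
    srcs: Tuple[str, ...]


def register_file_ports(issue_width: int, reads_per_op: int = 2,
                        writes_per_op: int = 1) -> Tuple[int, int]:
    """Ports needed so every issue slot can access the register file each cycle."""
    return issue_width * reads_per_op, issue_width * writes_per_op


def pack_bundles(ops: List[Op], issue_width: int = 4) -> List[List[Op]]:
    """Greedy in-order packing into bundles of at most issue_width operations.

    An operation starts a new bundle when the current one is full, or when it
    reads or writes a register already written in the current bundle.
    """
    bundles: List[List[Op]] = []
    current: List[Op] = []
    written = set()
    for op in ops:
        conflict = op.dest in written or any(s in written for s in op.srcs)
        if current and (len(current) == issue_width or conflict):
            bundles.append(current)
            current, written = [], set()
        current.append(op)
        written.add(op.dest)
    if current:
        bundles.append(current)
    return bundles


if __name__ == "__main__":
    print(register_file_ports(4))
    program = [
        Op("r1", ("r8", "r9")),
        Op("r2", ("r8", "r10")),
        Op("r3", ("r1", "r2")),   # needs r1 and r2, so it cannot share their bundle
        Op("r4", ("r3", "r9")),   # needs r3
    ]
    for cycle, bundle in enumerate(pack_bundles(program)):
        print(cycle, [op.dest for op in bundle])
```

Because port count grows linearly with issue width, widening the machine makes the shared register file more expensive, which fits the review's remark that the register file is the performance-limiting resource.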

Published In

FPGA '05: Proceedings of the 2005 ACM/SIGDA 13th international symposium on Field-programmable gate arrays
February 2005
288 pages
ISBN:1595930299
DOI:10.1145/1046192

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. NIOS
  2. VLIW
  3. compiler
  4. kernels
  5. parallelism
  6. synthesis

Qualifiers

  • Article

Conference

FPGA05
Acceptance Rates

Overall Acceptance Rate 125 of 627 submissions, 20%

Article Metrics

  • Downloads (Last 12 months): 16
  • Downloads (Last 6 weeks): 0
Reflects downloads up to 11 Feb 2025

Cited By

  • (2024) BRISKI: A RISC-V barrel processor approach for higher throughput with less resource tax. 2024 IEEE 17th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC) (532-539). DOI: 10.1109/MCSoC64144.2024.00092. Online publication date: 16-Dec-2024
  • (2023) RETRACTED: Express Data Processing on FPGA: Network Interface Cards for Streamlined Software Inspection for Packet Processing. Applied System Innovation 6:1 (9). DOI: 10.3390/asi6010009. Online publication date: 9-Jan-2023
  • (2022) hXDP. Communications of the ACM 65:8 (92-100). DOI: 10.1145/3543668. Online publication date: 21-Jul-2022
  • (2020) hXDP. Proceedings of the 14th USENIX Conference on Operating Systems Design and Implementation (973-990). DOI: 10.5555/3488766.3488821. Online publication date: 4-Nov-2020
  • (2020) Towards Composing Optimized Bi-Directional Multi-Ported Memories for Next-Generation FPGAs. IEEE Access 8 (91531-91545). DOI: 10.1109/ACCESS.2020.2994882. Online publication date: 2020
  • (2019) Time-Multiplexed FPGA Overlay Architectures. ACM Transactions on Design Automation of Electronic Systems 24:5 (1-19). DOI: 10.1145/3339861. Online publication date: 23-Jul-2019
  • (2017) Reducing the Performance Gap between Soft Scalar CPUs and Custom Hardware with TILT. ACM Transactions on Reconfigurable Technology and Systems 10:3 (1-23). DOI: 10.1145/3079757. Online publication date: 27-Jun-2017
  • (2017) Microarchitectural Comparison of the MXP and Octavo Soft-Processor FPGA Overlays. ACM Transactions on Reconfigurable Technology and Systems 10:3 (1-25). DOI: 10.1145/3053679. Online publication date: 27-May-2017
  • (2017) An efficient FPGA-based memory architecture for compute-intensive applications on embedded devices. 2017 IEEE Pacific Rim Conference on Communications, Computers and Signal Processing (PACRIM) (1-8). DOI: 10.1109/PACRIM.2017.8121901. Online publication date: Aug-2017
  • (2017) Adaptable VLIW processor: The reconfigurable technology approach. 2017 International Conference on Advanced Technologies for Communications (ATC) (120-125). DOI: 10.1109/ATC.2017.8167600. Online publication date: Oct-2017
