Article

An FPGA-based VLIW processor with custom hardware execution

Authors:

John Foster

FPGA '05: Proceedings of the 2005 ACM/SIGDA 13th international symposium on Field-programmable gate arrays

Pages 107 - 117

https://doi.org/10.1145/1046192.1046207

Published: 20 February 2005

Abstract

The capability and heterogeneity of new FPGA (Field Programmable Gate Array) devices continue to increase with each new line of devices, and efficiently programming these devices is becoming more difficult. Nevertheless, FPGAs continue to be used for algorithms traditionally targeted to embedded DSP microprocessors, such as signal and image processing applications.

This paper presents an architecture that combines VLIW (Very Long Instruction Word) processing with the capability to introduce application-specific customized instructions and complex hardware functions. To support this architecture, a compilation and design automation flow is described for programs written in C.

Several design tradeoffs for the architecture were examined, including the number of VLIW functional units and the register file size. The architecture was implemented on an Altera Stratix II FPGA. The Stratix II device was selected because it offers a large number of high-speed DSP (digital signal processing) blocks that execute multiply-accumulate operations.

We show that our combined VLIW with hardware functions exhibits as much as 230X speedup, and 63X on average, for computational kernels for a set of benchmarks. This allows for an overall speedup of as much as 30X, and 12X on average, for signal processing benchmarks from MediaBench.
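The gap between kernel speedup and overall speedup reported above is what Amdahl's law would predict when only part of a program's run time is spent in the accelerated kernels. The following is a minimal sketch of that relationship, not a calculation taken from the paper: the function names are hypothetical, and the kernel time fractions it derives are illustrative only, since the abstract does not state them and the maximum and average figures may come from different benchmarks.

```python
def overall_speedup(kernel_fraction: float, kernel_speedup: float) -> float:
    """Amdahl's law: whole-program speedup when a fraction of the original
    run time (kernel_fraction) is accelerated by kernel_speedup."""
    if not 0.0 <= kernel_fraction <= 1.0:
        raise ValueError("kernel_fraction must be in [0, 1]")
    if kernel_speedup <= 0.0:
        raise ValueError("kernel_speedup must be positive")
    return 1.0 / ((1.0 - kernel_fraction) + kernel_fraction / kernel_speedup)


def required_kernel_fraction(overall: float, kernel_speedup: float) -> float:
    """Invert Amdahl's law: the fraction of original run time that must be
    in the kernel to reach `overall` speedup at the given kernel speedup."""
    return (1.0 - 1.0 / overall) / (1.0 - 1.0 / kernel_speedup)


if __name__ == "__main__":
    # Speedup pairs quoted in the abstract: (kernel, overall).
    # Pairing them this way is an assumption made for illustration.
    for kernel, overall in [(230.0, 30.0), (63.0, 12.0)]:
        fraction = required_kernel_fraction(overall, kernel)
        print(f"kernel {kernel:.0f}X, overall {overall:.0f}X "
              f"-> implied kernel time fraction {fraction:.3f}")
```

Under these assumptions, the implied fractions show that even a very large kernel speedup yields a much smaller overall speedup unless almost all of the original run time is spent in the kernel.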

Reviews

Reviewer: Vassilios A. Chouliaras

This is a very exciting piece of research in the general area of configurable, extensible processors and the software/hardware interface. The authors propose a hybrid architecture, consisting of a parameterized very long instruction word (VLIW) core augmented with custom hardware execution units, as a very potent programmable execution engine. In addition, they have developed the software infrastructure to allow for automatic optimization of C-based applications.

In the introductory section, the authors identify large-capacity field-programmable gate arrays (FPGAs) with substantial compute/memory resources as becoming commonplace. They correctly point out that the efficient mapping of applications on such devices is no longer a trivial exercise, with a typical use being software kernels allocated on the FPGA fabric, and the irregular (control) part of the application running on an embedded processor. This segregation has indeed been identified by the major FPGA vendors, which utilize embedded processors on their devices to accommodate both regular and irregular codes. The authors provide a good discussion of past and present behavioral synthesis solutions, and correctly identify such solutions as appropriate for combinational code, not for control-dominated applications. In addition, they provide a very good overview of the literature, both from academia and from industry, on configurable (static) and reconfigurable (dynamic) systems for software acceleration.

To address large, irregular code pieces in a semi-automatic manner, the authors propose a parametric platform to efficiently exploit all parallelism. The platform is a four-wide VLIW-based processor that is binary-compatible with the Altera NIOS II instruction set architecture (ISA). In addition, it supports extending that ISA with custom hardware resources to achieve superlinear speedups. The software infrastructure is based on the well-known Trimaran VLIW research infrastructure.

The authors use an interesting technique to extract computational kernels (hardware functions), which are implemented directly as hardware blocks. These blocks make use of the abundant MAC units in typical high-performance FPGA devices, such as the Altera Stratix family. The authors discuss their hardware architecture, which is based on a four-wide VLIW with an eight-read-port, four-write-port (8R/4W) 32x32-bit register file, shared among the VLIW processing elements (PEs) and the custom hardware units. They also correctly identify the register file as the performance-limiting resource in an FPGA implementation, and provide substantial microarchitecture performance data.

In the remaining sections, the authors discuss zero-overhead hardware/software switching, the hardware functions, and the software tool chain. They performed design, validation, and FPGA implementation, and achieved 167 megahertz (MHz) on an Altera Stratix II, which is an impressive clock speed for a programmable device. Finally, they report on application speedups for both their standalone VLIW engine and their four-wide VLIW, augmented with hardware functions. Results range from nine percent to 230 times for kernel acceleration, which is indeed impressive.

Overall, this is a thorough account of the proposed field of research; the authors did their best to disclose as much information as possible in the context of a conference paper. I was very much impressed with the technical ability of all those involved. This is a solid paper on embedded central processing unit (CPU) architecture.

Online Computing Reviews Service
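To illustrate why a four-wide VLIW pairs naturally with an 8R/4W register file, the sketch below packs three-address operations into bundles of at most four and reports the worst-case register ports each bundle needs. It is a toy in-order packer written for this illustration, not the paper's Trimaran-based scheduler; the `Op` tuple layout, the function names, and the assumption that every operation reads two registers and writes one are all hypothetical.

```python
from typing import List, Set, Tuple

# Hypothetical three-address operation: (dest, src1, src2).
Op = Tuple[str, str, str]


def pack_bundles(ops: List[Op], width: int = 4) -> List[List[Op]]:
    """Greedy in-order packing. A new bundle starts when the current one is
    full, or when the next op reads or overwrites a register written
    earlier in the same bundle (ops in a bundle issue in the same cycle)."""
    bundles: List[List[Op]] = []
    current: List[Op] = []
    written: Set[str] = set()
    for dest, src1, src2 in ops:
        conflict = src1 in written or src2 in written or dest in written
        if current and (len(current) == width or conflict):
            bundles.append(current)
            current, written = [], set()
        current.append((dest, src1, src2))
        written.add(dest)
    if current:
        bundles.append(current)
    return bundles


def port_usage(bundle: List[Op]) -> Tuple[int, int]:
    """Worst-case (read ports, write ports) for one bundle, assuming each
    op reads two registers and writes one."""
    return 2 * len(bundle), len(bundle)


example_ops: List[Op] = [
    ("r3", "r1", "r2"),
    ("r4", "r1", "r2"),
    ("r5", "r3", "r4"),  # depends on the two ops above
    ("r6", "r1", "r1"),
    ("r7", "r2", "r2"),
    ("r8", "r1", "r2"),
    ("r9", "r2", "r1"),
]

if __name__ == "__main__":
    for i, bundle in enumerate(pack_bundles(example_ops)):
        reads, writes = port_usage(bundle)
        print(f"bundle {i}: {len(bundle)} ops, {reads}R/{writes}W")
```

A full bundle of four such operations needs eight reads and four writes in one cycle, which is where the 8R/4W figure comes from under this model; sharing those ports with custom hardware units, as the review describes, adds further pressure on the same register file.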

Published In

FPGA '05: Proceedings of the 2005 ACM/SIGDA 13th international symposium on Field-programmable gate arrays

February 2005

288 pages

ISBN:1595930299

DOI:10.1145/1046192

General Chair: Herman Schmit (Tabula)

Program Chair: Steve Wilton (University of British Columbia)

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Article

Conference

FPGA05: ACM/SIGDA International Symposium on Field Programmable Gate Arrays 2005

February 20 - 22, 2005

Monterey, California, USA

Acceptance Rates

Overall Acceptance Rate 125 of 627 submissions, 20%

Bibliometrics

Total Citations: 65

Total Downloads: 1,687

Downloads (Last 12 months): 16

Downloads (Last 6 weeks): 0

Reflects downloads up to 11 Feb 2025

Cited By

Abdelhamid R, Koch D (2024). BRISKI: A RISC-V barrel processor approach for higher throughput with less resource tax. 2024 IEEE 17th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC), pp. 532-539. Online publication date: 16-Dec-2024.
https://doi.org/10.1109/MCSoC64144.2024.00092

Pradeep S, Sharma Y, Verma C, Sreeram G, Hanumantha Rao P (2023). RETRACTED: Express Data Processing on FPGA: Network Interface Cards for Streamlined Software Inspection for Packet Processing. Applied System Innovation 6:1 (9). Online publication date: 9-Jan-2023.
https://doi.org/10.3390/asi6010009

Brunella M, Belocchi G, Bonola M, Pontarelli S, Siracusano G, Bianchi G, Cammarano A, Palumbo A, Petrucci L, Bifulco R (2022). hXDP. Communications of the ACM 65:8 (92-100). Online publication date: 21-Jul-2022.
https://dl.acm.org/doi/10.1145/3543668

Brunella M, Belocchi G, Bonola M, Pontarelli S, Siracusano G, Bianchi G, Cammarano A, Palumbo A, Petrucci L, Bifulco R, Lu S, Howell J (2020). hXDP. Proceedings of the 14th USENIX Conference on Operating Systems Design and Implementation, pp. 973-990. Online publication date: 4-Nov-2020.
https://dl.acm.org/doi/10.5555/3488766.3488821

Shahrouzi S, Alkamil A, Perera D (2020). Towards Composing Optimized Bi-Directional Multi-Ported Memories for Next-Generation FPGAs. IEEE Access 8 (91531-91545). Online publication date: 2020.
https://doi.org/10.1109/ACCESS.2020.2994882

Li X, Maskell D (2019). Time-Multiplexed FPGA Overlay Architectures. ACM Transactions on Design Automation of Electronic Systems 24:5 (1-19). Online publication date: 23-Jul-2019.
https://dl.acm.org/doi/10.1145/3339861

Tili I, Ovtcharov K, Steffan J (2017). Reducing the Performance Gap between Soft Scalar CPUs and Custom Hardware with TILT. ACM Transactions on Reconfigurable Technology and Systems 10:3 (1-23). Online publication date: 27-Jun-2017.
https://dl.acm.org/doi/10.1145/3079757

Laforest C, Anderson J (2017). Microarchitectural Comparison of the MXP and Octavo Soft-Processor FPGA Overlays. ACM Transactions on Reconfigurable Technology and Systems 10:3 (1-25). Online publication date: 27-May-2017.
https://dl.acm.org/doi/10.1145/3053679

Shahrouzi S, Perera D (2017). An efficient FPGA-based memory architecture for compute-intensive applications on embedded devices. 2017 IEEE Pacific Rim Conference on Communications, Computers and Signal Processing (PACRIM), pp. 1-8. Online publication date: Aug-2017.
https://doi.org/10.1109/PACRIM.2017.8121901

Pham-Quoc C, Kieu-Do-Nguyen B, Dinh-Duc A (2017). Adaptable VLIW processor: The reconfigurable technology approach. 2017 International Conference on Advanced Technologies for Communications (ATC), pp. 120-125. Online publication date: Oct-2017.
https://doi.org/10.1109/ATC.2017.8167600
