Artificial Intelligence (AI) applications have become ubiquitous across the computing world, spanning from large-scale data centers to mobile and Internet of Things (IoT) devices. The growing prevalence of AI applications has led to an increasing demand for efficient hardware platforms to accelerate compute-intensive workloads. Among the various platforms, Field-programmable Gate Arrays (FPGAs) are a natural choice for implementing AI models, since FPGAs integrate computing logic and memory resources within a single device, enabling them to handle various algorithms effectively. Additionally, the programmable nature of FPGAs aligns well with the rapidly evolving nature of AI applications, allowing designers to implement and evaluate diverse algorithms swiftly. This flexibility makes FPGAs an excellent alternative to Application-specific Integrated Circuits (ASICs) and Graphics Processing Units (GPUs).
However, as AI models continue to grow in size and complexity, computational challenges arise, particularly due to the inherent limitations of FPGAs in computing capacity, memory resources, and bandwidth. To make FPGAs effective platforms for AI applications, it is crucial to address concerns related to accelerator architecture, programming models, software-hardware co-optimization, design space exploration, and other design perspectives, so as to determine the most efficient approaches to implementing AI applications on FPGAs. This special issue explores research topics encompassing Machine Learning (ML) acceleration on FPGAs and includes seven interesting articles delving into various optimization directions to overcome the aforementioned challenges.
Among the seven articles selected, the first four focus on the development of innovative accelerator architectures targeting both widely used traditional and newly emerging Deep Neural Networks (DNNs). The first work, “High-performance Reconfigurable DNN Accelerator on a Bandwidth-limited Embedded System,” by X. Hu et al., aims to overcome the memory bandwidth limitation of embedded systems when accelerating DNNs in real-world machine vision applications. The article proposes a Reconfigurable Tiny Neural Network Accelerator (ReTINNA), which features novel data-mapping methods and an adaptive layer-wise tiling strategy to increase data reuse and reduce the control complexity of data transmission.
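To make the idea of layer-wise tiling concrete, the following minimal Python sketch shows how a tile size might be chosen so that the input, weight, and output tiles of a convolutional layer fit within a fixed on-chip buffer budget. It is not ReTINNA's actual strategy; the function name, parameters, and greedy halving policy are assumptions made purely for illustration.

```python
def pick_tile(layer_h, layer_w, in_ch, out_ch, k, buf_bytes, dtype_bytes=1):
    """Illustrative layer-wise tiling: shrink the output tile until the input
    tile, weight tile, and output tile all fit in the on-chip buffer budget.

    Assumes a stride-1 convolution and that the full weight set stays on chip;
    practical schemes (such as ReTINNA's) also tile along the channel
    dimensions and adapt the policy per layer.
    """
    th, tw = layer_h, layer_w
    while th > 1 or tw > 1:
        in_tile = (th + k - 1) * (tw + k - 1) * in_ch * dtype_bytes  # halo included
        w_tile = k * k * in_ch * out_ch * dtype_bytes
        out_tile = th * tw * out_ch * dtype_bytes
        if in_tile + w_tile + out_tile <= buf_bytes:
            return th, tw
        # Halve the larger dimension first to keep tiles roughly square.
        if th >= tw:
            th = max(1, th // 2)
        else:
            tw = max(1, tw // 2)
    return th, tw
```

Larger tiles amortize off-chip transfers, which is precisely the data-reuse benefit the article targets.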
Similarly, the second work, “FD-CNN: A Frequency-domain FPGA Acceleration Scheme for CNN-based Image Processing Applications,” by X. Wang et al., addresses the resource limitations of CNN-based image processing on the popular Zynq platforms. The study proposes a partial decoding scheme and develops a corresponding frequency-domain FPGA accelerator that takes partially decoded frequency data, rather than fully decoded RGB data, as input. This partial decoding scheme removes the expensive Inverse Discrete Cosine Transform (IDCT) operations and hence greatly simplifies the image decoder, enabling more input channels for parallel processing.
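As a rough illustration of the partial decoding idea (not FD-CNN's actual implementation), the sketch below shows how per-block DCT coefficients, taken from a JPEG-style decoder just before the IDCT step, can be rearranged into a many-channel tensor for a CNN; the function name and tensor layout are assumptions for illustration only.

```python
import numpy as np

def blocks_to_freq_channels(dct_blocks):
    """Rearrange 8x8 DCT coefficient blocks into a frequency-channel tensor.

    dct_blocks: array of shape (H//8, W//8, 8, 8) holding the per-block DCT
    coefficients produced by a JPEG-style decoder *before* the IDCT step.
    Returns an array of shape (H//8, W//8, 64): 64 frequency channels at 1/8
    spatial resolution that can feed the accelerator's first layer directly,
    so the costly IDCT and full RGB reconstruction are skipped entirely.
    """
    h, w, _, _ = dct_blocks.shape
    return dct_blocks.reshape(h, w, 64)

# Example: a 224x224 luma plane yields a 28x28x64 frequency-domain input.
freq_input = blocks_to_freq_channels(np.random.randn(28, 28, 8, 8))
assert freq_input.shape == (28, 28, 64)
```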
The next two articles focus on newly developed operations and neural network models. In the article “An Intermediate-centric Dataflow for Transposed Convolution Acceleration on FPGA,” Z. Ma et al. target the acceleration of transposed convolution, which has become increasingly prevalent in CNNs. Its accelerator design still faces many challenges; for example, the backward-stencil computation required by transposed convolutional layers greatly constrains accelerator performance. To tackle this problem, the article proposes to decompose the transposed convolution into several stages and pipeline them efficiently. As a result, high computational parallelism and efficient data reuse are achieved, with significant throughput improvement.
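The backward-stencil behavior mentioned above can be seen in a naive scatter-style formulation of transposed convolution, sketched below in plain NumPy. Each input pixel scatters a weighted kernel into an overlapping output window, so neighboring computations accumulate into the same output locations; this is the pattern the article's staged, pipelined dataflow is designed to avoid. The sketch is a reference formulation only, not the proposed accelerator dataflow.

```python
import numpy as np

def transposed_conv2d_scatter(x, w, stride):
    """Naive scatter-style transposed convolution (single channel, no padding).

    x: input feature map of shape (H, W); w: kernel of shape (K, K).
    Every input element scatters w into a KxK output window, and windows of
    neighboring inputs overlap, so partial sums must be accumulated in place.
    This overlapping accumulation is the backward-stencil pattern that makes
    straightforward pipelining on an FPGA difficult.
    """
    H, W = x.shape
    K, _ = w.shape
    out = np.zeros(((H - 1) * stride + K, (W - 1) * stride + K))
    for i in range(H):
        for j in range(W):
            out[i * stride:i * stride + K, j * stride:j * stride + K] += x[i, j] * w
    return out

y = transposed_conv2d_scatter(np.ones((4, 4)), np.ones((3, 3)), stride=2)
assert y.shape == (9, 9)
```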
The fourth article, “Accelerating Attention Mechanism on FPGAs Based on Efficient Reconfigurable Systolic Array,” by W. Ye et al., aims to speed up transformer models, which have attracted tremendous interest in natural language processing, machine translation, and other fields. The performance bottleneck of transformer models lies in their attention mechanism, the key building block, which is expensive due to intensive matrix computation and complicated data flow. To effectively utilize the Digital Signal Processing (DSP) blocks on FPGAs for accelerating multi-head attention, the work introduces a double-MAC scheme into a DSP-based Systolic Array (SA) and supports dynamic reconfiguration of the SA to adapt to input data of different bit widths. A two-level scheduler is also proposed to facilitate dynamic SA reconfiguration and workload balancing.
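The double-MAC idea packs two narrow multiplications that share one operand into a single wide multiplier, so one DSP slice produces two products per cycle. The Python sketch below demonstrates the arithmetic for unsigned 8-bit operands only; real DSP48-based packing handles signed values with a correction step, and the exact packing used in the article may differ.

```python
def double_mac_unsigned(a, w1, w2, shift=18):
    """Pack two unsigned 8-bit multiplies sharing activation `a` into one
    wide multiplication, mimicking operand packing on a wide DSP multiplier.

    Requires a * w2 < 2**shift so the two partial products do not overlap.
    """
    packed = (w1 << shift) + w2           # one wide operand carries both weights
    product = a * packed                  # single hardware multiplication
    p2 = product & ((1 << shift) - 1)     # recovers a * w2 from the low bits
    p1 = product >> shift                 # recovers a * w1 from the high bits
    return p1, p2

assert double_mac_unsigned(200, 17, 250) == (200 * 17, 200 * 250)
```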
High-level Synthesis (HLS) has been widely used for DNN implementation on FPGAs to shorten development time and to offer better maintainability and more flexibility in design exploration. However, there is continued interest in exploring the trade-offs between HLS- and Register-transfer Level (RTL)-based DNN accelerator designs. The fifth article, “On the RTL Implementation of FINN Matrix Vector Unit,” by S. Alam et al., presents an alternative backend library for an existing HLS-based DNN accelerator generation framework, FINN, and investigates the pros and cons of an RTL-based implementation versus the original HLS variant across a spectrum of design dimensions.
With the rapid growth of neural network scale, another challenge is that the increasing number of parameters and tighter constraints gradually complicate the design space of accelerators, demanding design space exploration methods with greater capacity and efficiency. In the sixth article, “ACDSE: A Design Space Exploration Method for CNN Accelerator based on Adaptive Compression Mechanism,” K. Feng et al. develop a novel design space exploration method named ACDSE to optimize the design process of CNN accelerators more efficiently. Based on reinforcement learning, ACDSE implements an adaptive compression mechanism that combines multi-granularity search for parameter dimension reduction, parameter value-range compression, and a dynamic filter to prune low-value design points.
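As a toy illustration of search-space compression in design space exploration (not the ACDSE algorithm, which is reinforcement-learning based), the sketch below repeatedly samples design points and then shrinks each parameter's range around the best point found so far; the function names and the random-search policy are assumptions for illustration only.

```python
import random

def compress_range(lo, hi, best, ratio=0.5):
    """Shrink a parameter's search range around the best value found so far."""
    span = max(1, int((hi - lo) * ratio))
    return max(lo, best - span // 2), min(hi, best + span // 2)

def explore(cost, ranges, rounds=5, samples_per_round=64):
    """Toy random-search DSE with iterative range compression."""
    best_pt, best_cost = None, float("inf")
    for _ in range(rounds):
        for _ in range(samples_per_round):
            pt = tuple(random.randint(lo, hi) for lo, hi in ranges)
            c = cost(pt)
            if c < best_cost:
                best_pt, best_cost = pt, c
        # Compress every parameter's range toward the current best design point.
        ranges = [compress_range(lo, hi, b) for (lo, hi), b in zip(ranges, best_pt)]
    return best_pt, best_cost

# Example: minimize a toy latency model over two tiling parameters.
best, _ = explore(lambda p: abs(p[0] - 37) + abs(p[1] - 11), [(1, 128), (1, 64)])
```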
The last challenge this special issue tackles is data movement efficiency. As the model and dataset sizes of deep learning scale up, data movement has become a key factor in neural network training. The above articles generally adopt a compute-centric architecture, in which data are moved from the drive or remote memory to the host, which may limit overall computational efficiency. The seventh article, “TH-iSSD: Design and Implementation of a Generic and Reconfigurable Near-Data Processing Framework,” by J. Shu, proposes a promising alternative: a near-data processing framework called TH-iSSD. By adopting an efficient device-level data switch among sensing, computation, and storage hardware components, TH-iSSD eliminates redundant data movement in the data path. Control-path support is also developed to form a reconfigurable and practical near-data processing framework.
In closing, we thank all the authors for their dedication at all stages of the review process and for their significant contributions to this special issue. The guest editors are wholly grateful to the reviewers for their timely and constructive reviews of the submitted manuscripts; we invited reviewers with expertise in the related fields to provide high-quality reviews of the articles. The guest editors sincerely acknowledge the support of our Editor-in-Chief, Tulika Mitra, and the editorial team of ACM Transactions on Embedded Computing Systems (TECS). We hope this special issue will make interesting reading for researchers worldwide and inspire further research and development of AI acceleration on FPGAs in embedded systems.
Yun (Eric) Liang
Peking University
Wei Zhang
The Hong Kong University of Science and Technology
Stephen Neuendorffer
Advanced Micro Devices Inc.
Wayne Luk
Imperial College London
Guest Editors