Artificial Intelligence (AI) applications have become ubiquitous across the computing world, spanning from large-scale data centers to mobile and Internet of Things (IoT) devices. The growing prevalence of AI applications has led to an increasing demand for efficient hardware platforms to accelerate compute-intensive workloads. Among the various platforms, Field-programmable Gate Arrays (FPGAs) are a natural choice for implementing AI models, since FPGAs integrate computing logic and memory resources within a single device, enabling them to handle various algorithms effectively. Additionally, the programmable nature of FPGAs aligns well with the rapidly evolving nature of AI applications, allowing designers to implement and evaluate diverse algorithms swiftly. This flexibility makes FPGAs an excellent alternative to Application-specific Integrated Circuits (ASICs) and Graphics Processing Units (GPUs).
However, as AI models continue to grow in size and complexity, computational challenges arise, particularly due to the inherent limitations of FPGAs in computing capacity, memory resources, and bandwidth. To make FPGAs effective platforms for AI applications, it is crucial to address concerns related to accelerator architecture, programming models, software-hardware co-optimization, design space exploration, and other design perspectives, so as to determine the most efficient approaches to implementing AI applications on FPGAs. This special issue explores research topics encompassing Machine Learning (ML) acceleration on FPGAs and includes seven interesting articles delving into various optimization directions to overcome the aforementioned challenges.
Among the seven articles selected, the first four focus on the development of innovative accelerator architectures targeting both widely used traditional and newly emerging Deep Neural Networks (DNNs). The first work, “High-performance Reconfigurable DNN Accelerator on a Bandwidth-limited Embedded System,” by X. Hu et al., aims to overcome the memory bandwidth limitation of embedded systems when accelerating DNNs in real-world machine vision applications. The article proposes a Reconfigurable Tiny Neural Network Accelerator (ReTINNA), which features novel data-mapping methods and an adaptive layer-wise tiling strategy to increase data reuse and reduce the control complexity of data transmission.
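To make the idea of layer-wise tiling concrete, the following minimal Python sketch shows how a tile size might be chosen so that the input, weight, and output tiles of a convolutional layer fit within a fixed on-chip buffer budget. It is not ReTINNA's actual strategy; the function name, parameters, and greedy halving policy are assumptions made purely for illustration.

```python
def pick_tile(layer_h, layer_w, in_ch, out_ch, k, buf_bytes, dtype_bytes=1):
    """Illustrative layer-wise tiling: shrink the output tile until the input
    tile, weight tile, and output tile all fit in the on-chip buffer budget.

    Assumes a stride-1 convolution and that the full weight set stays on chip;
    practical schemes (such as ReTINNA's) also tile along the channel
    dimensions and adapt the policy per layer.
    """
    th, tw = layer_h, layer_w
    while th > 1 or tw > 1:
        in_tile = (th + k - 1) * (tw + k - 1) * in_ch * dtype_bytes  # halo included
        w_tile = k * k * in_ch * out_ch * dtype_bytes
        out_tile = th * tw * out_ch * dtype_bytes
        if in_tile + w_tile + out_tile <= buf_bytes:
            return th, tw
        # Halve the larger dimension first to keep tiles roughly square.
        if th >= tw:
            th = max(1, th // 2)
        else:
            tw = max(1, tw // 2)
    return th, tw
```

Larger tiles amortize off-chip transfers, which is precisely the data-reuse benefit the article targets.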
Similarly, the second work, “FD-CNN: A Frequency-domain FPGA Acceleration Scheme for CNN-based Image Processing Applications,” by X. Wang et al., addresses the resource limitations of CNN-based image processing on the popular Zynq platforms. The study proposes a partial decoding scheme and develops a corresponding frequency-domain FPGA accelerator that takes partially decoded frequency data, rather than fully decoded RGB data, as input. This partial decoding scheme removes the expensive Inverse Discrete Cosine Transform (IDCT) operations and hence greatly simplifies the image decoder, enabling more input channels for parallel processing.
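As a rough illustration of the partial decoding idea (not FD-CNN's actual implementation), the sketch below shows how per-block DCT coefficients, taken from a JPEG-style decoder just before the IDCT step, can be rearranged into a many-channel tensor for a CNN; the function name and tensor layout are assumptions for illustration only.

```python
import numpy as np

def blocks_to_freq_channels(dct_blocks):
    """Rearrange 8x8 DCT coefficient blocks into a frequency-channel tensor.

    dct_blocks: array of shape (H//8, W//8, 8, 8) holding the per-block DCT
    coefficients produced by a JPEG-style decoder *before* the IDCT step.
    Returns an array of shape (H//8, W//8, 64): 64 frequency channels at 1/8
    spatial resolution that can feed the accelerator's first layer directly,
    so the costly IDCT and full RGB reconstruction are skipped entirely.
    """
    h, w, _, _ = dct_blocks.shape
    return dct_blocks.reshape(h, w, 64)

# Example: a 224x224 luma plane yields a 28x28x64 frequency-domain input.
freq_input = blocks_to_freq_channels(np.random.randn(28, 28, 8, 8))
assert freq_input.shape == (28, 28, 64)
```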
The next two articles focus on newly developed operations and neural network models. In the article “An Intermediate-centric Dataflow for Transposed Convolution Acceleration on FPGA,” Z. Ma et al. target the acceleration of transposed convolution, which has become increasingly prevalent in CNNs. Its accelerator design still faces many challenges; for example, the backward-stencil computation required by transposed convolutional layers greatly constrains accelerator performance. To tackle this problem, the article proposes to decompose the transposed convolution into several stages and pipeline them efficiently. As a result, high computational parallelism and efficient data reuse are achieved, with significant throughput improvement.
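The backward-stencil behavior mentioned above can be seen in a naive scatter-style formulation of transposed convolution, sketched below in plain NumPy. Each input pixel scatters a weighted kernel into an overlapping output window, so neighboring computations accumulate into the same output locations; this is the pattern the article's staged, pipelined dataflow is designed to avoid. The sketch is a reference formulation only, not the proposed accelerator dataflow.

```python
import numpy as np

def transposed_conv2d_scatter(x, w, stride):
    """Naive scatter-style transposed convolution (single channel, no padding).

    x: input feature map of shape (H, W); w: kernel of shape (K, K).
    Every input element scatters w into a KxK output window, and windows of
    neighboring inputs overlap, so partial sums must be accumulated in place.
    This overlapping accumulation is the backward-stencil pattern that makes
    straightforward pipelining on an FPGA difficult.
    """
    H, W = x.shape
    K, _ = w.shape
    out = np.zeros(((H - 1) * stride + K, (W - 1) * stride + K))
    for i in range(H):
        for j in range(W):
            out[i * stride:i * stride + K, j * stride:j * stride + K] += x[i, j] * w
    return out

y = transposed_conv2d_scatter(np.ones((4, 4)), np.ones((3, 3)), stride=2)
assert y.shape == (9, 9)
```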
The fourth article, “Accelerating Attention Mechanism on FPGAs Based on Efficient Reconfigurable Systolic Array,” by W. Ye et al., aims to speed up transformer models, which have attracted tremendous interest in natural language processing, machine translation, and other fields. The performance bottleneck of transformer models lies in their attention mechanism, the key building block, which is expensive due to intensive matrix computation and complicated data flow. To effectively utilize the Digital Signal Processing (DSP) blocks on FPGAs for accelerating multi-head attention, the work introduces a double-MAC scheme into a DSP-based Systolic Array (SA) and supports dynamic reconfiguration of the SA to adapt to input data of different bit widths. A two-level scheduler is also proposed to facilitate dynamic SA reconfiguration and workload balancing.
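The double-MAC idea packs two narrow multiplications that share one operand into a single wide multiplier, so one DSP slice produces two products per cycle. The Python sketch below demonstrates the arithmetic for unsigned 8-bit operands only; real DSP48-based packing handles signed values with a correction step, and the exact packing used in the article may differ.

```python
def double_mac_unsigned(a, w1, w2, shift=18):
    """Pack two unsigned 8-bit multiplies sharing activation `a` into one
    wide multiplication, mimicking operand packing on a wide DSP multiplier.

    Requires a * w2 < 2**shift so the two partial products do not overlap.
    """
    packed = (w1 << shift) + w2           # one wide operand carries both weights
    product = a * packed                  # single hardware multiplication
    p2 = product & ((1 << shift) - 1)     # recovers a * w2 from the low bits
    p1 = product >> shift                 # recovers a * w1 from the high bits
    return p1, p2

assert double_mac_unsigned(200, 17, 250) == (200 * 17, 200 * 250)
```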
High-level Synthesis (HLS) has been widely used for DNN implementation on FPGAs to shorten development time and to offer better maintainability and more flexibility in design exploration. However, there is continued interest in exploring the trade-offs between HLS- and Register-transfer Level (RTL)-based DNN accelerator designs. The fifth article, “On the RTL Implementation of FINN Matrix Vector Unit,” by S. Alam et al., presents an alternative backend library for an existing HLS-based DNN accelerator generation framework, FINN, and investigates the pros and cons of an RTL-based implementation versus the original HLS variant across a spectrum of design dimensions.
With the rapid growth of neural network scale, another challenge is that the increasing number of parameters and tighter constraints gradually complicate the design space of accelerators, demanding design space exploration methods with greater capacity and efficiency. In the sixth article, “ACDSE: A Design Space Exploration Method for CNN Accelerator based on Adaptive Compression Mechanism,” K. Feng et al. develop a novel design space exploration method named ACDSE to optimize the design process of CNN accelerators more efficiently. Based on reinforcement learning, ACDSE implements an adaptive compression mechanism that combines multi-granularity search for parameter dimension reduction, parameter value-range compression, and a dynamic filter to prune low-value design points.
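As a toy illustration of search-space compression in design space exploration (not the ACDSE algorithm, which is reinforcement-learning based), the sketch below repeatedly samples design points and then shrinks each parameter's range around the best point found so far; the function names and the random-search policy are assumptions for illustration only.

```python
import random

def compress_range(lo, hi, best, ratio=0.5):
    """Shrink a parameter's search range around the best value found so far."""
    span = max(1, int((hi - lo) * ratio))
    return max(lo, best - span // 2), min(hi, best + span // 2)

def explore(cost, ranges, rounds=5, samples_per_round=64):
    """Toy random-search DSE with iterative range compression."""
    best_pt, best_cost = None, float("inf")
    for _ in range(rounds):
        for _ in range(samples_per_round):
            pt = tuple(random.randint(lo, hi) for lo, hi in ranges)
            c = cost(pt)
            if c < best_cost:
                best_pt, best_cost = pt, c
        # Compress every parameter's range toward the current best design point.
        ranges = [compress_range(lo, hi, b) for (lo, hi), b in zip(ranges, best_pt)]
    return best_pt, best_cost

# Example: minimize a toy latency model over two tiling parameters.
best, _ = explore(lambda p: abs(p[0] - 37) + abs(p[1] - 11), [(1, 128), (1, 64)])
```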
The last challenge this special issue tackles is data movement efficiency. As the model and dataset sizes of deep learning scale up, data movement has become a key factor in neural network training. The above articles generally adopt a compute-centric architecture, in which data are moved from the drive or remote memory to the host, which may limit overall computational efficiency. The seventh article, “TH-iSSD: Design and Implementation of a Generic and Reconfigurable Near-Data Processing Framework,” by J. Shu, proposes a promising alternative: a near-data processing framework called TH-iSSD. By adopting an efficient device-level data switch among sensing, computation, and storage hardware components, TH-iSSD eliminates redundant data movement in the data path. Control-path support is also developed to form a reconfigurable and practical near-data processing framework.
In closing, we thank all the authors for their dedication at all stages of the review process and for their significant contributions to this special issue. The guest editors are wholly grateful to the reviewers for their timely and constructive reviews of the submitted manuscripts; we invited reviewers with expertise in the related fields to provide high-quality reviews of the articles. The guest editors sincerely acknowledge the support of our Editor-in-Chief, Tulika Mitra, and the editorial team of ACM Transactions on Embedded Computing Systems (TECS). We hope this special issue will make interesting reading for researchers worldwide and inspire further research and development of AI acceleration on FPGAs in embedded systems.
Yun (Eric) Liang
Peking University
Wei Zhang
The Hong Kong University of Science and Technology
Stephen Neuendorffer
Advanced Micro Devices Inc.
Wayne Luk
Imperial College London
Guest Editors