Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
11institutetext: University of Sassari, Piazza Università 21, 07100 Sassari, Italy
11email: {fmanca2, fratto}@uniss.it
22institutetext: University of Cagliari, via Marengo 2, 09123 Cagliari, Italy
22email: francesca.palumbo@unica.it

ONNX-to-Hardware Design Flow for Adaptive Neural-Network Inference on FPGAs

Federico Manca 11    Francesco Ratto 11 0000-0001-5756-5879    Francesca Palumbo 22 0000-0002-6155-1979
Abstract

Neural Networks provide a solid and reliable way of executing different types of applications, ranging from speech recognition to medical diagnosis, speeding up onerous and long workloads. The challenges involved in their implementation at the edge include providing diversity, flexibility, and sustainability. That implies, for instance, supporting evolving applications and algorithms energy-efficiently. Using hardware (hw) or software accelerators can deliver fast and efficient computation of the NNs, while flexibility can be exploited to support long-term adaptivity. Nonetheless, handcrafting a NN for a specific device, despite the possibility of leading to an optimal solution, takes time and experience, and that’s why frameworks for hw accelerators are being developed. This work, starting from a preliminary semi-integrated ONNX-to-hardware toolchain [21], focuses on enabling Approximate Computing (AC) leveraging the distinctive ability of the original toolchain to favor adaptivity. The goal is to allow lightweight adaptable NN inference on FPGAs at the edge.

Keywords:
Cyber-Physical Systems, Convolutional Neural Networks, Approximate Computing, FPGAs
AC
Approximate Computing
AES
Advanced Encryption Standard
API
Application Programming Interface
ASIC
Application Specific Integrated Circuit
BNN
Binary Neural Network
CGR
Coarse-Grain Reconfigurable
CPG
Co-Processor Generator
CPS
Cyber-Physical System
CPU
Central Processing Unit
CNN
Convolutional Neural Network
DAG
Directed Acyclic Graph
DNN
Deep Neural Network
DPN
Dataflow Process Network
DSE
Design Space Exploration
DSP
Digital Signal Processing
FFT
Fast Fourier Transform
FIFO
First-In First-Out queue
FPGA
Field Programmable Gate Array
GPU
Graphics Processing Unit
HEVC
High Efficiency Video Coding
HLS
High Level Synthesis
hw
hardware
HWPU
HW Processing Unit
IP
Intellectual Property
LUT
Look-Up Table
MDC
Multi-Dataflow Composer
MDG
Multi-Dataflow Generator
MLP
Multi Layer Perceptron
MCDMA
Multi-channel DMA
MoA
Model of Architecture
MoC
Model of Computation
MPSoC
Multi-Processor System on a Chip (SoC)
NN
Neural Network
OS
Operating System
PC
Platform Composer
PE
Processing Element
PiSDF
Parameterized and Interfaced Synchronous DataFlow
PDF
Parameterized DataFlow
PL
Programmable Logic
PS
Processing System
PMC
Performance Monitoring Counter
PSDF
Parameterized Synchronous DataFlow
QSoC
Quantized SoC
QAT
Quantization-aware Training
QNN
Quantized Neural Network
RTL
Register Transfer Level
SG
Scatter-Gather
SDF
Synchronous DataFlow
SoC
System on a Chip
SMMU
System Memory Management Unit
TIL
Template Interface Layer
UAV
Unmanned Aerial Vehicle
UGV
Unmanned Ground Vehicle
\AddToShipoutPictureFG

*

This is the authors’ version of the work. It is posted here for your personal use. Not for redistribution.

\AddToShipoutPictureFG*

The definitive Version of Record will appear in

\AddToShipoutPictureFG*

Proceedings of the 24th International Conference, SAMOS 2024, Samos, Greece, June 29 - July 4, 2024.

\AddToShipoutPictureFG*

Please refer to the reviewed and published Version of Record for citing this work.

1 Introduction

Cyber-Physical System (CPS) integrate “computation with physical processes whose behavior is defined by both the computational (digital and other forms) and the physical parts of the system”111https://csrc.nist.gov/glossary/term/cyber physical systems. These systems are characterized by significant information exchange with the environment and dynamic, reactive behaviors in response to environmental changes. In modern systems, whether CPS or not, NN-assisted decision-making can be directly deployed at the edge on small embedded platforms. This approach reduces latency, energy consumption, and often ensures higher privacy levels [23]. Nonetheless, executing AI models on resource-constrained edge devices presents several challenges, including limited computing and memory capacities. Balancing model accuracy and execution efficiency exposes a crucial design trade-off.

In response to these challenges, Field Programmable Gate Arrays devices emerge as a valuable choice for NN inference at the edge  [11]. They can guarantee hw acceleration, execution flexibility, and energy efficiency thanks to the possibility of tailoring the hw architecture to the specific application. Despite existing solutions, there remains a lack of full support for advanced features, particularly the adaptivity naturally supported over these kind of platforms. Computing adaptivity empowers CPS to thrive in complex, ever-changing environments.

This paper aims to take steps towards filling that gap. The goal is featuring adaptivity targeting Convolutional Neural Network (CNN) models as applications and edge FPGAs as computing platforms. CNN models have proven to be positively affected by the application of AC methodologies [3]. State of the art approaches [1, 8] apply it in a data-oriented manner, as discussed in Sect. 2.1. In this paper, the combination of data-oriented and computation-oriented strategies targeting runtime adaptivity, by means of reconfiguration, is presented. Different execution profiles are operated at runtime by an adaptive inference engine. This latter is developed with the proposed design flow. In summary, the contributions of this work are:

  • A novel design flow that enables the inference of Quantized ONNX models on FPGAs (Sect. 3) featuring, for the first time to the best of our knowledge, both data-approximation and computation-approximation.

  • The analysis of the effect of data-approximation in a mixed-precision tiny CNN model for MNIST classification (Sect. 4.2 and 4.3).

  • The assessment of the benefits of computation-approximation through the deployment of an adaptive CNN inference engine for the data-approximated models (Sect. 4.4).

2 Related Work

To execute NN at the edge, three main types of architectures can be found in literature [27]: the Single Computational Engine architecture, based on a single computation engine, typically in the form of a systolic array of processing elements or a matrix multiplication unit, that executes the CNN layers sequentially [10]; Vector Processor architecture, with instructions specific for accelerating operations related to convolutions [7]; the Streaming architecture consists of one distinct hw block for each layer of the target CNN, where each block is optimized separately [1, 8], as depicted in Fig. 1. In this study, we adopted the latter for two main reasons:

  • a distinct hw processing element for each layer of the CNN model allows for higher customization, thus favoring adaptivity;

  • the streaming architecture is the most natural implementation of a dataflow-based application, such CNNs, thus easing the design with High Level Synthesis (HLS).

Refer to caption

Figure 1: Representation on a simple CNN and its mapping to a streaming archietecture.

2.1 Streaming Architectures

In our previous work [21], a toolchain for porting CNNs on FPGAs was proposed. The resulting hw is a streaming architecture that uses on-chip memory, guaranteeing low-latency and low-energy computing. Solutions that exploit a similar streaming architecture are FINN [8], a framework from AMD Research Labs; HLS4ML [1], an open-source software designed to facilitate the deployment of machine learning models on FPGAs, targeting low-latency and low-power edge applications. FINN enables building scalable and fast NN, with a focus on the support of Quantized Neural Network (QNN) inference on FPGAs. A given model, trained through Brevitas [19], is compiled by the FINN compiler, producing a synthesizable C++ description of a heterogeneous streaming architecture. All QNN parameters are kept stored in the on-chip memory, which greatly reduces the power consumed and simplifies the design. The computing engines communicate via the on-chip data stream. Avoiding the “one-size-fits-all”, an ad-hoc topology is built for the network. The resulting accelerator is deployed on the target board using the AMD Pynq framework. The main operation of the HLS4ML library is to translate the model of the network into an HLS Project. The focus in [16] was centered on reducing the computational complexity and resource usage on a fully connected network for MNIST dataset classification: the data is fed to a multi-layer perceptron with an input layer of 784 nodes, three hidden layers with 128 nodes each, and an output layer with 10 nodes. The work exploits the potential of Pruning and Quantization-aware Training (QAT) to reduce the model size with limited impact on its accuracy.

This work positioning:

To the best of our knowledge, neither FINN nor HLS4ML, despite targeting FPGA-based streaming architecture and supporting AC features, have ever proposed an adaptive solution. These existing frameworks primarily focus on data-oriented approximation. However, there remains an untapped opportunity for computation-oriented approaches, which can be achieved through reconfigurable systems design [15]. Such computation-oriented strategies could naturally be harnessed by runtime management infrastructures aiming at self-adaptive behaviors [17], which are typical of CPSs.

2.2 Approximate Computing

The AC paradigms is founded on “the idea that computer systems can let applications trade-off accuracy for efficiency”. Indeed, AC has been established as a design paradigm for energy-efficient circuits. It exploits the inherent ability of a large number of applications to produce results of acceptable quality despite being run on systems that “intentionally exposes incorrectness to the application layer in return for conserving some resource”222http://approximate.computer/approxbib/. This trade-off ultimately balances computation accuracy with efficiency. According to textbook definitions [2], AC provides three degrees of freedom by acting on data, hardware, and computation. Approximating data means processing either less up-to-date data (temporal decimation), less input data (spatial decimation), less accurate data (word-length optimization) or corrupted data. Hardware approximation leverages inexact operators or voltage scaling. While computation approximation corresponds to models modifications to expose different implementations, aiming to enable different execution profiles, over the same substrate.

\Ac

ac is particularly relevant in applications like NNs that have demonstrated remarkable resilience to errors [14]. Within this specific application domain, NN approximation can be broadly categorized into three main approaches: Computation Reduction, Approximate Arithmetic Units, and Precision Scaling [3]. The Computation Reduction approximation category aims at systematically avoiding certain computation at the hw level, thereby significantly reducing the overall workload. An example of this is pruning: biases, weights, or entire neurons can be evicted to lighten the workload [24]. By employing Approximate Units that replace more accurate units, such as the Multiply-and-Accumulate (MAC) unit, energy consumption and latency in NN accelerators can be improved [4]. The most used Precision Scaling practice is quantization: quantized hw implementations feature reduced bit-width dataflow and arithmetic units attaining substaintial energy, latency, and bandwidth gains compared to 32-bit floating-point implementations. Instead of executing all the required mathematical operations with ordinary 32-bit floating point, quantization allows the exploitation of lighter operations by mapping real numbers to integers within a specified range [12].

This work:

NN Precision scaling is exploited by implementing quantization to feature data approximation. We combine different data-approximate profiles to enable computation approximation and to deliver adaptivity. Our proposed flow utilizes Vitis HLS, which provides an arbitrary precision data types library, that goes beyond the standard C++. This library also supports customizable fixed-point data types333https://jiafulow.github.io/blog/2020/08/02/hls-arbitrary-precision-data-types, easing the data precision control among layers. Additionally, we introduce another tool called MDC, explained further below, to enable adaptivity and computation approximation.

3 Proposed design flow

The utilization of a CNN model involves two distinct phases: training and inference. The training phase aims at setting the model parameters to execute a given classification task. This phase tipycally occurs on powerful platforms, often in the cloud. The inference phase executes the trained model to perform the classification task. It is usually performed on a different platform, in our case an FPGA edge device. These two phases can be decoupled by adopting an intermediate representation to exchange the model between the training and the inference framework. The de facto standard for this purpose is the ONNX format.

The proposed design flow automates the design and deployment of an FPGA processor for the inference of a given Quantized CNN model. The model must be provided in the QONNX format [20], which extends the ONNX 444https://onnx.ai/ format by allowing the specification of layers with arbitrary-precision data types. The adopted tools are described in Sect. 3.1 and their integration and usage in Sect. 3.2.

3.1 Tools

Various commercial and open-source academic tools are utilized throughout the design flow:

  • the ONNXParser555https://gitlab.com/aloha.eu/onnxparser, a Python application, is designed to parse the ONNX models and create the code for a target device. The tool consists of a Reader and multiple Writers, each tailored for different target platforms supported within the ALOHA framework [13]. For this work, we developed a Writer targeting HLS.

  • The Vitis HLS tool666https://www.AMD.com/support/documentation-navigation/design-hubs/dh0012-vivado-high-level-synthesis-hub.html synthesizes a C or C++ function into RTL code for implementation on AMD FPGAs. The resulting hw can be optimized and customized through the insertion of directives in the code.

  • The Multi-Dataflow Composer (MDC) tool777https://mdc-suite.github.io/ is an open-source tool that can offer Coarse-Grained reconfigurability support for hw acceleration [22]. It takes as input the applications specified as dataflow, together with the library of the HDL files of the actors. These dataflows are then combined, and the resulting multi-dataflow topology is filled with the actors taken from the HDL library.

3.2 Design Flow

The proposed flow, as depicted in Fig. 2, starts from the QONNX representation of the NN and produces a streaming architecture that executes the input model. The QONNX file acts as a bridge between the training and the inference frameworks. Two distinct paths are present in the design flow: the actor related path and the network related path. They can be carried out once, to obtain a non-adaptive data-approximate solution, or multiple times, to derive a computation-approximate adaptive engine of data-approximate solutions.

Refer to caption
Refer to caption
Figure 2: On the left, the ONNX-to-Hardware design flow for the generation of adaptive neural-network inference engines on FPGAs. The training library could be any library able to export to QONNX. On the right, the streaming-based template architecture for a convolutional layer.

The QONNX file serves as input to the extended ONNX Parser, which is capable of processing the additional quantization layers included in the QONNX format. Initially, the Reader reads the QONNX file and produces an intermediate format with a list of objects describing the layers’ hyperparameters (e.g. kernel size, data precision, etc.) and connections within the QONNX model. Subsequently, the HLS Writer creates the target-dependent files.

The ONNX Parser extracts the network topology from the QONNX and the data precision in each layer. This information is used by the Front End of the MDC tool to derive the datapath of the accelerator. When designing an adaptive engine, multiple data-approximate profiles of the same CNN model are processed. The tool automates the merging process by sharing layers of different profiles that use the same data precision.

The HLS Writer produces the C++ files that implement the layers, and the TCL scripts to automate the synthesis by Vitis HLS. The C++ description of the layers is based on a template architecture: for the convolutional layers, the core of a CNN, the template is composed of a Line Buffer actor that stores the input stream to provide data reuse; the Convolutional actor, whose function is to execute the actual computation; and the Weight and Bias actors that store the kernel parameters needed for the convolution. The resulting template, depicted on the right side in Fig. 2, ensures streaming dataflow between layers, eliminating the need to store full tensors. Each actor is developed to be customizable with the hyperparameter, e.g. input and kernel size, extracted from the QONNX model. The HDL library produced by Vitis HLS and the reconfigurable datapath (Multi-datflow) serve as input to the MDC Backend to generate the HDL description of the inference engine.

4 Evaluation

The proposed design flow was evaluated on a tiny CNN model trained for MNIST classification. The model comprises two convolutional blocks and a final fully connected layer. Each block consists of a convolutional layer with a 3x3 kernel, 64 filters, and ReLU activation, followed by a batch normalization and a maxpooling layer. The inference engines that execute this model have been designed with the proposed flow targeting the FPGA available on an AMD KRIA board.

First, we describe how the quantized models have been trained in Sect. 4.1. Then, to evaluate the proposed flow, we carried out an initial exploration, described in Sect. 4.2, on data approximated designs. This exploration is meaningful to assess the impact of quantization on both accuracy and inference performance. Then, in Sect. 4.3, different execution profiles are selected and, then, merged to generate an adaptive inference engine, as described in 4.4.

4.1 Quantization-aware training

The model previously described has been designed and trained using QKeras [6]. QKeras is an extension of the Keras framework that offers several features, including the ability to specify a custom fixed-point precision for each layer of a NN and perform QAT, , which has demonstrated significant advantages over post-training quantization in terms of resulting model accuracy [9]

Through its APIs, QKeras allows to specify the number of bits used to represent the activations and the weights of the NN model. The activations are the outputs of an NN layer, while the weights are the trainable parameters of the kernel used either for convolution, in convolutional layers, or matrix multiplication in fully connected layers. In the exploration of Sect. 4.2 we varied both these bit-widths, thus implementing a mixed-precision quantization strategy.

For the QAT, we selected an optimizer that implements the Adam algorithm and Categorical Crossentropy as the loss function to be minimized during regression. The trained CNN model have been exported to QONNX format and implemented with the proposed design flow.

It is worth underlying that other frameworks, e.g. Brevitas [19], offer QAT and export to QONNX, so that the proposed design flow is interoperable with any QONNX-compliant framework.

4.2 Data approximation analysis

In this section, we report the results on the analysis using mixed-precision quantization. A string identifies a profile as Ax-Wy, where x represents the number of bits used to represent activations and y the number of bits used for weights. For each mixed-precision configuration, a non-adaptive inference engine has been realized with the proposed design flow . We report accuracy, latency for a classification, resource utilization, and power consumption for each engine in Table 1.

Table 1: Results of the analysis with data mixed-precision approximation. In the string Ax-Wy, x represents the number of bits for the activations, and y is the number of bits used for the weights.
Datatype Accuracy Latency LUT BRAM Power
[%] [us] [%] [%] [mW]
A16-W8 98.9 329 12 18 160
A16-W4 95.3 329 7 18 134
A8-W8 98.8 329 11 17 142
A8-W4 95.3 329 6 17 132
A4-W4 95.8 329 6 17 141

It can be noticed that the execution latency for an image classification remains constant independently of the data precision. This behavior can be explained considering how the HLS compilation flow works: the HLS compiler schedules the operations depending on data dependencies and user directives. After that, the operations are bound to the physical resources. Therefor, larger bit precision increases computing resource utilization rather than slowing down the system. Indeed, we can see that adopting a reduced bit precision for activations and weights leads to a reduction in Look-Up Tables and bram utilization.

The two metrics where we see an exploitable trade-off at runtime are model accuracy and power consumption. The model’s accuracy decreases with reduced bit precisions. From a baseline 99.8% which can be obtained with floating point operations, not feasible to be ported to an FPGA, the quantized model A16-W8 achieves a classification accuracy of 98.9%. This accuracy drops down to 95.3% with A8-W4. We can notice that with 4-bit precision in the weights, the final accuracy is around 95%. Small variations are due to the intrinsic randomness of the training process rather than to the activations’ precision.

On the other hand, this drop in accuracy is compensated by reduced dynamic power consumption. A general trend shows that power consumption decreases with reduced precision. A graphical description of the resulting execution profiles that consider only accuracy and power consumption is reported in Fig. 3. The variability in the power consumption, which is not directly proportional to the data precision, shows the advantage of having a fast design flow that goes from the high-level description in QONNX to the FPGA implementation. This allows us to consider the joint effects of the resource utilization, which is affected by the FPGA backend and the HLS compiler, and switching activity, which depends on the actual values of weights and the data being processed.

4.3 Execution profiles selection

From Fig. 3, we observe that the non-adaptive inference engines obtained with the initial exploration offer valuable trade-offs. However, these engines lack common layers necessary to achieve some degree of resource sharing. To address this limitation, we started from the A8-W8 profile and trained an additional profile that further exploits mixed precision. This new profile generally uses the same precision as A8-W8, but in the inner convolutional layer, where instead it uses the A4-W4 one. The resulting non-adaptive engine (named Mixed) performance is reported with a green dot in Fig. 3. This demonstrates the additional level of data approximation that can be achieved with the proposed methodology.

Finally, the Mixed and A8-W8 profiles are good candidates for merging using the proposed methodology. This will allow us to design an adaptive inference engine that enables computation approximation, as shown in the following section.

Refer to caption
Figure 3: Accuracy VS power chart of the obtained profiles. In green the Mixed design. The yellow arrows point to the two configurations selected for adaptivity.

4.4 Adaptive inference engine

In the previous designs we partially used the functionality of the proposed methodology, resulting in non-adaptive inference engines. To achieve adaptivity we need to design a computation approximate inference engine that allows selecting different profiles at runtime. For this purpose, we leverage the merging capabilities of MDC. As anticipated, Mixed and A8-W8 profiles are selected as entry points, since they share the same layers, but the inner convolutional one. The characteristics of the resulting adaptive inference engine are summarized on top of Fig. 4. The resulting inference engine has a limited overhead with respect to the non-adaptive ones. The switch among profiles can guarantee a 5% power saving with a 1.5% accuracy drop. Given the low accuracy penalty, we can suppose that in a real CPS application, the inference engine would run most of the time in the Profile 1 and switch to the more accurate only under critical circumstances, when higher accuracy is necessary. This further motivates the proposed methodology that is going to be adopted as part of a recently started EU project[18].

Indeed, a CPS is meant to react and dynamically adjust to mutable constraints and system conditions. This can be achieved, as shown on the left-hand side in Fig. 4, by an infrastructure composed of two main parts: the Adaptive Inference Engine and the Profile Manager. The former is responsible of implementing the adaptive solution that, in this case, can alternatively execute one of the two profiles. The latter, following the self-adaptive management approach presented in [17], monitors the energy status and the given constraints and decides which is the most suitable profile. The profile selected at runtime must capable of meeting the accuracy requirements while minimizing power dissipation. As an example, if the remaining battery budget is lower than a pre-defined threshold the Profile Manager might select a less energy consuming profile, if the user/application defined constraints are still met or if they can be negotiated. On the right-hand side in Fig. 4, the potentials of the implemented adaptive engine are presented. Even considering this preliminary implementation, it is shown how the adaptive engine (in blue) extends the battery duration, and in turn increases the number of executable classifications, with respect to the non-adaptive (in orange) counterpart, which is running at full performance.

Refer to caption

Figure 4: On top the resource utilization of the adaptive engine and its perfomance metrics under different profiles. On the left side the architecture of a complete adaptable systems that exploits the proposed adaptive inference engine. On the right side, a comparison of the resulting battery duration (supposing a 10Ah energy budget) and number of classifications executable by the adaptive engine and a non-adaptive one supporting the higher accuracy profile only.

5 Conclusion

CPS integrate computation with physical processes, characterized by information exchange with the environment and dynamic behaviors. FPGAs offer hw acceleration, flexibility, and energy efficiency, but challenges persist in achieving full adaptivity for dealing with complex environments.

The utilization of a CNN model involves two distinct phases: training and inference. These phases can be decoupled using an intermediate representation like the ONNX format. The proposed design flow automates FPGA inference for quantized CNN models, specified in the QONNX format, which allows data approximation through arbitrary-precision data types. At the same time, the flow also features adaptivity, implementing computation approximation.

A data approximation analysis on a tiny CNN model for MNIST classification has been carried out to select valuable execution profiles. These latter have been automatically combined in a runtime adaptive inference engine, which is capable of adapting its accuracy and power consumption at runtime by switching among the selected profiles.

Future work will aim at validating the proposed approach on more complex CNN models and datasets, allowing for quantitative state of the art comparison, besides the already provided qualitative discussion.

This work is supported by MYRTUS that is funded by the European Union (GA No. 101135183). Views and opinions expressed are however those of the author(s) only and do not necessarily reflect those of the European Union. Neither the European Union nor the granting authority can be held responsible for them.

References

  • [1] Aarrestad, Thea, et al. "Fast convolutional neural networks on FPGAs with hls4ml." Machine Learning: Science and Technology 2.4 (2021).
  • [2] Agrawal, Ankur et al. "Approximate computing: Challenges and opportunities." ICRC Conference 2016.
  • [3] Armeniakos, Giorgos, et al. "Hardware approximate techniques for deep neural network accelerators: A survey." ACM CSUR 55.4: 1-36 (2022).
  • [4] Bhardwaj, Kartikeya, et al. "Power-and area-efficient approximate wallace tree multiplier for error-resilient systems." ISQED symposium 2014.
  • [5] Canis, Andrew, et al. "LegUp: An open-source high-level synthesis tool for FPGA-based processor/accelerator systems." ACM TECS 13.2 (2013): 1-27.
  • [6] Coelho, Claudionor N., et al. "Automatic heterogeneous quantization of deep neural networks for low-latency inference on the edge for particle detectors." Nature Machine Intelligence 3.8:675-686 (2021).
  • [7] Farabet C., et al. "An FPGA-based processor for convolutional networks." FPL Conference 2009.
  • [8] Fraser, Nicholas J., et al. "Scaling binarized neural networks on reconfigurable logic." PARMA-DITAM Workshop 2017.
  • [9] Gholami, Amir, et al. "A survey of quantization methods for efficient neural network inference." Low-Power Computer Vision Book (2021).
  • [10] Guan, Yijin, et al. "FP-DNN: An automated framework for mapping deep neural networks onto FPGAs with RTL-HLS hybrid templates." FCCM Symposium 2017.
  • [11] Guo, Kaiyuan, et al. "[DL] A survey of FPGA-based neural network inference accelerators." ACM TRETS 12.1: 1-26 (2019).
  • [12] Jungwook, Choi, et al. "Accurate and Efficient 2-bit Quantized Neural Networks." MLSys Proceedings 2019.
  • [13] Meloni, Paolo et al. "Optimization and deployment of CNNs at the edge: the ALOHA experience." CF Conference 2019.
  • [14] Mittal, Sparsh. "A survey of techniques for approximate computing." ACM CSUR 48.4: 1-33 (2016).
  • [15] Nezan, Jean-Francois et al. Multi-purpose systems: A novel dataflow-based generation and mapping strategy." ISCAS Symposium 2012.
  • [16] Ngadiuba, Jennifer, et al. "Compressing deep neural networks on FPGAs to binary and ternary precision with hls4ml." Machine Learning: Science and Technology 2.1 (2020).
  • [17] Palumbo, Francesca et al. "Hardware/Software Self-adaptation in CPS: The CERBERO Project Approach." SAMOS Conference 2019.
  • [18] Palumbo, Francesca et al. "MYRTUS: Multi-layer 360° dYnamic orchestration and interopeRable design environmenT for compute-continUum Systems." CF Conference 2024.
  • [19] Pappalardo, Alessandro, "Xilinx/brevitas". Zenodo (2023), https://doi.org/10.5281/zenodo.3333552
  • [20] Pappalardo, Alessandro, et al. "Qonnx: Representing arbitrary-precision quantized neural networks." AccML Workshop 2022.
  • [21] Ratto, Francesco, et al. "An Automated Design Flow for Adaptive Neural Network Hardware Accelerators." Journal of Signal Processing Systems (2023): 1-23.
  • [22] Sau, Carlo, et al. "The Multi-Dataflow Composer tool: An open-source tool suite for optimized coarse-grain reconfigurable hardware accelerators and platform design." MICPRO Journal 80 (2021).
  • [23] Shafique, Muhammad, et al. "TinyML: Current progress, research challenges, and future roadmap." DAC Conference 2021.
  • [24] Song Han, et al. "EIE: Efficient Inference Engine on Compressed Deep Neural Network." ACM SIGARCH Computer Architecture News 44.3 (2016).
  • [25] Summers, Sioni, et al. "Using MaxCompiler for the high level synthesis of trigger algorithms." Journal of Instrumentation 12.02 (2017).
  • [26] Umuroglu, Yaman, et al. "Finn: A framework for fast, scalable binarized neural network inference." FPGA Symposium 2017.
  • [27] Venieris, Stylianos, et al. "Toolflows for mapping convolutional neural networks on FPGAs: A survey and future directions." ACM CSUR 51.3: 1-39 (2018).