Gradient-based Automatic Per-Weight Mixed Precision Quantization for Neural Networks On-Chip

Chang Sun∗1,2, Thea K. Årrestad1, Vladimir Loncar3,4, Jennifer Ngadiuba5, Maria Spiropulu2
1ETH Zürich, Zürich, Switzerland
2California Institute of Technology, Pasadena, CA, USA
3Massachusetts Institute of Technology, Cambridge, MA, USA
4Institute of Physics Belgrade, Serbia
5Fermi National Accelerator Laboratory, Batavia, IL, USA
Email: {chang.sun, thea.aarrestad, vladimir.loncar, jennifer.ngadiuba, maria.spiropulu}@cern.ch
*: Corresponding author

Model size and inference speed at deployment time are major challenges in many deep learning applications. A promising strategy to overcome these challenges is quantization. However, a straightforward uniform quantization to very low precision can result in significant accuracy loss. Mixed-precision quantization, based on the idea that certain parts of the network can accommodate lower precision than other parts without compromising performance, offers a potential solution. In this work, we present High Granularity Quantization (HGQ), an innovative quantization-aware training method designed to automatically fine-tune the per-weight and per-activation precision of ultra-low latency and low power neural networks that are to be deployed on FPGAs. We demonstrate that HGQ can outperform existing methods by a substantial margin, achieving resource reduction by up to a factor of 20 and latency improvement by a factor of 5 while preserving accuracy.

I Introduction

Edge computing has significantly increased the importance of real-time deep neural network (DNN) inference on specialized hardware [1]. The typical latency threshold for real-time inference is $\mathcal{O}(1)$ ms [2, 3, 4]. Nevertheless, certain domains require sub-microsecond inference times. At the CERN Large Hadron Collider (LHC), detectors generate tens of terabytes of data every second from collisions occurring every 25 nanoseconds. This data throughput is managed by a real-time selection system, the trigger. This system determines the fate of each collision event - whether it should be preserved for analysis or discarded - with a decision-making latency ceiling of a few microseconds [5, 6]. The trigger’s precision is vital to retain only interesting events, thereby managing the bandwidth effectively and reducing the event rate significantly. The system consists of $\mathcal{O}(1000)$ field programmable gate arrays (FPGAs) mounted on custom boards. Several algorithms run in parallel on each FPGA. As a result, resources are scarce and the memory footprint of each algorithm should be minimal. In anticipation of the LHC’s upgrade to the High Luminosity-LHC (HL-LHC) [7], which will increase the collision rate by a factor of $2\sim 3$ compared to the current one [5, 6], machine learning techniques are being explored to enhance the speed and accuracy of the computational tasks in the hardware trigger.

However, integrating demanding models without compromising performance is a hurdle when resource consumption and latency are strictly limited. Efforts in recent years have focused on algorithmic efficiency, with strategies ranging from the design of compact networks to weight pruning and quantization [8, 9]. Quantization converts model parameters into lower-precision formats, causing some loss in performance. Although post-training quantization is computationally cheaper to perform, it generally incurs a significant loss in performance compared to the full-precision baseline. To mitigate this degradation, quantization-aware training has been proposed, which adheres to a fixed numerical precision throughout training.

To satisfy the latency requirements, neural networks on FPGAs for LHC physics experiments are usually fully unrolled and pipelined - each arithmetic operation is performed by its own component in the circuit without overlapping, maximizing throughput and minimizing latency. To exploit this property, recent studies [10, 11] have suggested that applying varying levels of quantization to different layers could further optimize accuracy against computational costs.

In this paper, we introduce the high-granularity quantization (HGQ) method, which allows models to be trained quantization-aware at arbitrary granularity. In contrast to the quantization-aware training (QAT) library QKeras, where weights and activations are processed in layer-wise blocks for quantization, HGQ enables weights and activations within one layer to have different bitwidths. For a fully unrolled implementation, we can allow every weight and activation to have its own unique bitwidth. We illustrate the key difference between the HGQ method and the conventional block-wise quantization methods in Fig. I. Optimizing quantization parameters at higher granularity allows HGQ to find a better trade-off between model accuracy and resource consumption. Furthermore, by optimizing these individual bitwidths alongside the network using gradient descent, the need to train the network multiple times to search for a favorable quantization bitwidth for each block of the network is also eliminated.

Figure I: Overview of the HGQ method, showing activations (circles) and weights (lines) with thickness indicating bitwidth. Connections are dropped when weight or activation values are constantly zero. Top left: baseline network with high precision throughout. Top right: network quantized layer-wise, e.g., using QKeras. Bottom right: network both quantized layer-wise and pruned. Bottom left: network quantized using HGQ, applying more detailed quantization and assigning high bitwidths only where needed, on a per-weight and activation basis. This approach reduces resource use by maximally utilizing FPGA’s heterogeneous computation.

When multiplication operations in neural networks primarily involve low-bitwidth operands implemented with look-up tables (LUTs), HGQ can substantially reduce on-chip resource consumption by eliminating unnecessary computations without compromising performance. Depending on the specific task, we demonstrate that HGQ has the potential to outperform AutoQKeras and achieve resource reduction by up to a factor of 20, and latency improvement by a factor of 5, while preserving accuracy.

A functional HGQ framework has been developed using Tensorflow and Keras, and we have open-sourced it as free software. The Vivado/Vitis FPGA back-end is supported through integration with hls4ml. The library guarantees an exact correspondence between the software and firmware models, provided that no numeric overflow occurs and intermediate values are representable by float32.

The work presented in this paper makes the following contributions:

  • We present a new algorithm for obtaining surrogate gradients of parameter bitwidths, from both the loss function and the estimated model resource consumption, enabling full gradient-based optimization of bitwidths;

  • We enable heterogeneous quantization of a specific model at arbitrary granularity up to per-parameter level, aiming to minimize hardware resource usage while preserving high accuracy. This approach naturally includes sparse pruning of network parameters by setting their bitwidth to zero, further reducing resource-cost;

  • We have made the method available online as an easy-to-use library called HGQ (https://github.com/calad0i/HGQ), where simple drop-in replacement of Tensorflow Keras layers makes it straightforward for users to transform Keras models into their equivalent deep heterogeneously quantized versions, which are trained quantization-aware;

  • We have added support for quantized HGQ models in the hls4ml library, which converts these pre-trained quantized models into highly-parallel FPGA firmware for ultra low-latency inference;

  • Using HGQ in combination with hls4ml ensures exact bit-level accuracy between the HGQ software model and the corresponding firmware model, making the library safe and easy to use for non-experts;

  • We propose a new metric called Effective Bit Operations (EBOPs) for a more accurate estimation of on-chip resource consumption;

  • We demonstrate a resource reduction of up to 95% and a 5-fold improvement in latency, all while maintaining accuracy compared to other state-of-the-art methods.

II Related work

Network compression has been shown to be an effective way to reduce the computational cost of neural networks on FPGAs. Quantization is a widely adopted method for compressing deep neural networks (DNNs) for implementing them on hardware devices such as FPGAs or ASICs. Previous studies have utilized low precision quantization, such as binary or ternary, across networks to enhance throughput and reduce latency. Binary quantization restricts weights to $\alpha\times\{-1,1\}$, and ternary to $\alpha\times\{-1,0,1\}$, with $\alpha$ as a scaling factor. Key examples include DoReFa Net [12], ABC-net [13], Binaryconnect [14], XNOR-net [15], TWN [16], TTQ [17], and [18]. These methods achieve high compression but at the cost of reduced performance compared to standard floating-point networks. Using binary network principles, several studies have moved to multi-bit network designs that represent numbers through binary bases and values, highlighted in works like [19, 20, 13, 21, 22]. Mix&Match [23], in particular, uses power-of-two bases for better hardware compatibility.

Many studies have investigated heterogeneous quantization with layer-specific precision to lessen the performance loss due to quantization. In particular, HAQ [24] utilizes reinforcement learning to find the best bitwidth configuration. HAWQ, HAWQ-V2, PyHessian, and Q-BERT [25, 26, 27, 28] focus on optimizing bitwidths through Hessian-aware techniques. DNAS [29] and AutoQKeras [10] optimize bitwidths and network architecture simultaneously, with DNAS using stochastic sampling from a super network and AutoQKeras employing gradient-free methods like Gaussian Process, Hyperband, and stochastic search for hyperparameter optimization. Similarly, Meta-ML [30] applies iterative optimization to various hyperparameters, including bitwidths, weight pruning, and model architectures.

Some works, like RVQuant [31], BitsandBytes [32], and SpQR [33], have investigated heterogeneous quantization down to the sub-layer level, offloading outlier weights to higher precision formats primarily for model compression for large models rather than significant performance gains on FPGAs. AutoQ [34] utilizes reinforcement learning to optimize bitwidths for kernel weights and activations. A study more aligned with ours is the recent FILM-QNN [35], which optimizes weight and activation precision in a manner conducive to hardware efficiency. It categorizes convolution layer filters into groups of low and high precision, assigning them based on anticipated quantization loss for each filter.

Pruning is another technique used to compress neural networks, enhancing their speed during hardware inference. This method involves removing weights that have minimal impact on the overall accuracy of the network. This concept was first introduced in [36], and was applied to neural networks in [37]. Pruning can be categorized as structured, involving the removal of weights in specific blocks (as in [38, 39, 40]), or unstructured, targeting individual weights (as in [41, 42, 43, 44, 45, 40]). In this work, we consider pruning as a form of quantization where pruned weights are effectively quantized to zero bits.

The QKeras [10] framework, like ours, aims to train and optimize neural networks for deployment on FPGAs. QKeras is developed on top of Tensorflow Keras [46] and leverages hls4ml [47] for FPGA deployment. It specializes in training and optimizing neural networks, allowing for the use of arbitrary precision fixed-point numbers for both weights and activations. AutoQKeras, a feature within QKeras, enables automatic tuning of quantization settings for each layer using a gradient-free approach. This can lead to significant compression, including the use of binary or ternary networks [11]. Brevitas [48] serves as the PyTorch [49] equivalent of QKeras, commonly paired with the FINN and FINN-R frameworks from AMD Research [50, 51] for deploying on AMD FPGAs.

III High Granularity Quantization

In this paper, we introduce High Granularity Quantization (HGQ). This is a novel quantization approach that allows different precision levels within a single layer, down to the level of individual parameters, offering the unique capability for each parameter in a network to have its own bitwidth. We begin this section by outlining the fundamentals of quantization and Quantization-Aware Training (QAT). Subsequently, we introduce an innovative gradient-based technique for auto-tuning the quantization bitwidth during training. A comprehensive explanation of the HGQ method and its algorithm follows. This approach is designed to improve the accuracy-resource/latency balance compared to previously studied block-wise heterogeneous quantization methods in neural networks.

III-A Quantization

Quantization is a map, henceforth referred to as $\mathrm{f}^{q}$, that transforms a real number into one of a finite set of discrete values, mapping from the set of real numbers $\mathbb{R}$ to a discrete subset $\mathbb{Q}\equiv\left\{q_{i}|q_{i+1}>q_{i}\right\}\subset\mathbb{R}$. For hardware efficiency, we ensure that quantized weights and activations are represented as fixed-point numbers, a common practice in hardware for numerical representation. A fixed-point number is essentially an integer scaled by a predefined factor, typically a power of two. It is characterized by its bitwidth (total number of bits) and the number of bits allocated for the integer portion. The inclusion of the sign bit in the integer part, for signed numbers, varies by convention. In this context, we adhere to the convention used in Xilinx® Vivado®/Vitis® HLS, which includes the sign bit in the integer part if present. For a fixed-point number with $b\in\mathbb{N}_{+}$ bits, of which $i\in\mathbb{Z}$ bits are dedicated to the integer part, we define $f$ as the number of fractional bits, calculated by $f\equiv b-i$. For signed numbers, the representable range is $[-2^{i-1},2^{i-1}-2^{-f}]$, with a step size of $2^{-f}$. For unsigned numbers, the range is $[0,2^{i}-2^{-f}]$, sharing the same step size.
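As a small illustration of the ranges stated above, the following Python sketch computes the lowest value, highest value, and step size of a fixed-point format. The function name `fixed_point_range` and its `signed` flag are hypothetical helpers introduced here for illustration only; they are not part of HGQ or hls4ml.

```python
from fractions import Fraction


def fixed_point_range(b, i, signed=True):
    """Return (lowest, highest, step) of a fixed-point format.

    b: total bitwidth; i: integer bits (sign bit included when signed).
    The number of fractional bits is f = b - i.
    Fractions are used so that the arithmetic is exact.
    """
    f = b - i
    step = Fraction(2) ** -f
    if signed:
        lowest = -(Fraction(2) ** (i - 1))
        highest = Fraction(2) ** (i - 1) - step
    else:
        lowest = Fraction(0)
        highest = Fraction(2) ** i - step
    return lowest, highest, step


# Signed format with 8 bits in total and 3 integer bits: f = 5 fractional bits.
low, high, step = fixed_point_range(8, 3)
print(low, high, step)  # -4 127/32 1/32
```

In both the signed and the unsigned case the range contains exactly $2^{b}$ representable values, which is a quick sanity check on the formulas.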

One way of quantizing a real number into a fixed-point format, fixed<b,i>, can be expressed by a rounding function as follows:

$$\begin{split}\mathrm{f}^{q}(x)=&\left(\left(\left[x\cdot 2^{f}\right]+2^{b-1}\ \mathrm{mod}\ 2^{i}\right)-2^{b-1}\right)\cdot 2^{-f}\\ =&\begin{cases}\left[x\cdot 2^{f}\right]\cdot 2^{-f},&\text{if }x\in[-2^{i-1},2^{i-1}-2^{-f}]\\ \mathrm{overflow}&\text{otherwise}\end{cases},\end{split}\tag{1}$$

where $[x]\equiv\lfloor x+\epsilon\rfloor$ with $\epsilon\in[0,1)$ and $f\equiv b-i$. Note that setting $\epsilon=1/2$ applies conventional rounding to the nearest integer. In this context, “overflow” implies that a value exceeds the representable limits of the fixed-point format, causing a cyclical wrap to the opposite end of the range. Although a quantization function could be designed to adjust values outside the permissible range to the closest valid value (for instance, by clipping them to the range limits), this approach is intentionally avoided in our work to avoid resource overhead. By judiciously selecting the quantization range, we ensure that overflow does not occur.

For an unsigned fixed-point number, denoted as ufixed<b,i>, the quantization function is described below, using the same terminology:

\begin{align}
\mathrm{f}^{q}(x)=&\left(\left[x\cdot 2^{f}\right]\ \mathrm{mod}\ 2^{i}\right)\cdot 2^{-f}\tag{2}\\
=&\begin{cases}\left[x\cdot 2^{f}\right]\cdot 2^{-f},&\text{if }x\in[0,2^{i}-2^{-f}]\\ \mathrm{overflow}&\text{otherwise}\end{cases},\tag{3}
\end{align}


In our approach, we only track the number of fractional bits of the fixed-point number during training. Before deploying the network for hardware synthesis (e.g., into HLS projects), we calculate the required number of integer bits to avoid overflow. This task is trivial for weights, as they are only constants after training. For intermediate accumulator and activation values, we employ a calibration dataset to gauge the extreme values (both maximum and minimum) that they might assume. This process involves running the dataset through the network and logging the extreme quantized values, $v^{q}_{\mathrm{max}}$ and $v^{q}_{\mathrm{min}}$. Given the fixed-point number’s range of $[-2^{i-1},2^{i-1}-2^{-f}]$, we can determine the necessary integer bitwidth, $i$, using:

$$i=\max(\lfloor\log_{2}|v^{q}_{\mathrm{max}}|\rfloor+1,\lceil\log_{2}|v^{q}_{\mathrm{min}}|\rceil).\tag{4}$$

By ensuring the calibration dataset accurately reflects the input data distribution the network will encounter in deployment, we can guarantee that overflow will not occur. For extra safety, one may also add margin to the computed range to account for potential outliers in the input data. This method eliminates the need to consider the representational range during the training phase. Therefore, the quantization function during training can be expressed as:

$$\mathrm{f}^{q}(x)=\left[x\cdot 2^{f}\right]\cdot 2^{-f}=\lfloor x\cdot 2^{f}+\epsilon\rfloor\cdot 2^{-f}.\tag{5}$$

Without loss of generality, we assume $\epsilon=1/2$ for the rest of this section. This choice does not affect any of the results or conclusions drawn in this work.
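The training-time quantizer of Eq. (5) with $\epsilon=1/2$ amounts to rounding to the nearest multiple of $2^{-f}$, with no range check. The Python sketch below illustrates this on scalars; the function name `quantize` is a hypothetical helper used here for illustration and is not taken from the HGQ library.

```python
import math


def quantize(x, f):
    """Round x to the nearest multiple of 2**-f, i.e. [x * 2**f] * 2**-f
    with epsilon = 1/2. No overflow handling: the integer part is
    assumed to be wide enough, as during training."""
    scale = 2.0 ** f
    return math.floor(x * scale + 0.5) / scale


print(quantize(0.3, 2))   # 0.25   (grid spacing 1/4)
print(quantize(0.3, 4))   # 0.3125 (grid spacing 1/16)
print(quantize(5.0, -2))  # 4.0    (negative f gives a grid spacing of 4)
```

Each additional fractional bit halves the grid spacing, so the absolute rounding error is bounded by $2^{-f-1}$.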

III-B Quantization-Aware Training

Quantization-aware training (QAT) trains neural networks by applying quantization directly during the training pass. Previous work, e.g., [10], demonstrates that QAT significantly reduces the accuracy loss typically caused by quantization. In this work, we adopt the QAT method utilized in [10] as the foundational technique for our HGQ method. Specifically, we employ the straight-through estimator (STE) [52] for weight and activation quantization, which quantizes the values during the forward pass while acting as an identity for computing the gradients in the backward pass. This strategy maintains a good balance between effective quantization and overhead during training.

III-C FPGA resource consumption

A common metric for estimating on-chip resource usage in FPGAs is Bit Operations (BOPs) [53]. BOPs quantify resource usage on the FPGA by counting the number of bit operations performed during the network’s inference. For two numbers with bitwidths $b_{i}$ and $b_{j}$, the number of BOPs is $b_{i}\cdot b_{j}$ for a multiplication operation and $\max(b_{i},b_{j})+1$ for an addition operation. However, BOPs fall short of accurately reflecting resource consumption in many cases for a fully unrolled neural network on an FPGA.

This discrepancy arises from the fact that multiplication operations are usually between a fixed constant and a variable. In particular, for an unrolled implementation on hardware:

  1. Declaring a constant in fixed-point format of $b$ bits does not necessarily mean that all $b$ bits are used. For instance, a weight of 0.5 in an 8-bit fixed-point format only uses 1 bit instead of 8 bits, and counting it as 8 in BOPs leads to an overestimation of resource usage.

  2. BOPs tend to double count an accumulation operation that directly follows a multiplication between a constant and a variable: such a multiplication can be implemented as a series of additions of shifted values of the variable. The operation count for a single multiplication involving bitwidths $b_{i}$ and $b_{j}$ thus becomes either $b_{i}\cdot(b_{j}-1)$ or $(b_{i}-1)\cdot b_{j}$. In scenarios involving multiplication-accumulation, the bit operations are approximated as $b_{i}\cdot b_{j}$.
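The second point can be made concrete with a shift-and-add sketch. The Python function below multiplies an integer variable by a non-negative integer constant using only shifts and additions, and reports how many additions were needed. The function name `shift_add_multiply` is a hypothetical helper, and the sketch models integer arithmetic only; it is not how hls4ml or an HLS compiler necessarily implements the operation.

```python
def shift_add_multiply(x, c):
    """Multiply integer x by a non-negative integer constant c using only
    shifts and additions. Returns (product, number_of_additions)."""
    if c < 0:
        raise ValueError("this sketch only handles non-negative constants")
    # One shifted copy of x for every non-zero bit of the constant.
    terms = [x << k for k in range(c.bit_length()) if (c >> k) & 1]
    product = 0
    for term in terms:
        product += term
    # Summing n terms takes n - 1 additions (none for zero or one term).
    n_additions = max(len(terms) - 1, 0)
    return product, n_additions


print(shift_add_multiply(13, 0b1011))  # (143, 2): three non-zero bits, two additions
print(shift_add_multiply(13, 0b1000))  # (104, 0): a pure shift, no addition
```

A constant with $b_{j}$ bits has at most $b_{j}$ non-zero bits and therefore needs at most $b_{j}-1$ additions, each on operands of roughly $b_{i}$ bits, which is consistent with the $b_{i}\cdot(b_{j}-1)$ count quoted above. Constants with few non-zero bits need fewer additions, which also illustrates the first point.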

To address this discrepancy and offer a more precise estimation of on-chip resource usage, we propose a novel metric, Effective Bit Operations (EBOPs).

To address the first issue, the bitwidth of a constant used for EBOPs is not the declared bitwidth, but the number of bits that are enclosed by non-zero values. For instance, a weight represented as 001xx1000 will be counted as 4 bits instead of 8 bits. This approach ensures that resource usage is not overestimated because of the declared bitwidth.

To address the second issue, EBOPs quantifies only the cumulative BOPs conducted during multiplicative processes in a network. Let $\mathcal{M}=\left\{\{i,j\}_{n}\right\}$ be the set of all multiplication operations between operands with bitwidths $b_{i}$ and $b_{j}$. The total number of EBOPs can then be expressed as:

$$\mathrm{EBOPs}=\sum_{\{i,j\}\in\mathcal{M}}b_{i}\cdot b_{j}.\tag{6}$$

Here, bit operations in accumulation processes are intentionally omitted, under the assumption that they are implicitly included within the EBOPs count, which avoids the second issue mentioned above.
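A minimal Python sketch of the EBOPs count of Eq. (6) for a fully unrolled dense layer is given below. It assumes that each weight is available as the integer mantissa of its fixed-point value (the value multiplied by $2^{f}$) and that the sign is not counted towards the effective bitwidth; both are assumptions of this sketch. The function names `effective_bits` and `ebops_dense` are hypothetical and do not refer to the HGQ API.

```python
def effective_bits(n):
    """Number of bits enclosed by the highest and lowest non-zero bit of
    the integer mantissa n. Zero occupies no bits, so a pruned weight
    contributes nothing. The sign is ignored (assumption of this sketch)."""
    n = abs(n)
    if n == 0:
        return 0
    trailing_zeros = (n & -n).bit_length() - 1
    return n.bit_length() - trailing_zeros


def ebops_dense(weight_mantissas, activation_bits):
    """EBOPs of a fully unrolled dense layer.

    weight_mantissas[i][j]: integer mantissa of the weight connecting
    input i to output j. activation_bits[i]: bitwidth of input i.
    Every weight corresponds to one constant-variable multiplication."""
    total = 0
    for i, row in enumerate(weight_mantissas):
        for w in row:
            total += activation_bits[i] * effective_bits(w)
    return total


print(effective_bits(0b001011000))  # 4: only the span "1011" counts
print(effective_bits(64))           # 1: e.g. 0.5 stored with 7 fractional bits
print(ebops_dense([[64, 0], [0b1011000, 3]], [6, 4]))  # 6*1 + 6*0 + 4*4 + 4*2 = 30
```

In this toy layer the zero weight adds nothing to the total, which is how pruning enters the metric.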

Experimental findings validate EBOPs as a reliable estimator for on-chip resource consumption, closely mirroring a linear combination of LUT and DSP usage. Detailed results are discussed in Sec. V. To get an accurate resource estimate from EBOPs, one should include only operations that will be executed in parallel. For instance, different inputs fed to the same multiplier through a FIFO buffer should be counted only once (e.g., the implementation of convolutions in hls4ml in general). Additionally, this estimate does not include overhead from non-multiplicative processes (e.g., buffers used in hls4ml’s io_stream implementation). Note, however, that such overheads can be estimated separately by other means and added to the final result.

III-D Gradient-based optimization of bitwidths

To obtain a fully-unrolled quantized neural network with minimum resource- or area-usage on-chip, we want each weight and activation bitwidth to be individually optimized. However, in this way, the number of bitwidth parameters could exceed the number of trainable parameters in the original network. The only feasible approach to managing such a vast parameter space is through gradient-based optimization. Nonetheless, direct optimization of these discrete bitwidth values via gradients is not possible due to the absence of a direct gradient path from the loss function to the bitwidths. Therefore, we address two main issues: a) make the discrete bitwidths optimizable with a gradient; and b) estimate a surrogate gradient for these bitwidths.

III-D1 Optimize discrete bitwidths with gradient

The first issue can be straightforwardly addressed by treating the discrete bitwidths similarly to the discrete weights in a quantized network. In particular, we apply the straight-through estimator (STE) to real-valued bitwidths as is done for the weights, and we follow the STE implementation used in QKeras:

$$\mathrm{ste}(x)=x+\mathrm{sg}([x]-x),\tag{7}$$

where $\mathrm{sg}:\mathbb{R}\rightarrow\mathbb{R}$ is an identity function that detaches the gradient from the enclosed expression. In this way, the bitwidths can be optimized if they have gradients attached. Continuous values for the bitwidths are stored, and they are only rounded to integers as needed during forward passes. During backward passes, the rounding operations default to the identity.
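The behaviour described above can be sketched in plain Python without an automatic differentiation framework, by writing the forward and backward passes by hand. The toy loss, the learning rate, and the function names (`ste_forward`, `ste_backward`, `train_bitwidth`) are all assumptions made for this illustration; in an actual framework the same effect is obtained by writing Eq. (7) with a stop-gradient operation.

```python
import math


def round_half_up(x):
    """[x] with epsilon = 1/2."""
    return math.floor(x + 0.5)


def ste_forward(x):
    """Forward pass of Eq. (7): numerically, x + sg([x] - x) equals [x]."""
    return round_half_up(x)


def ste_backward(upstream_grad):
    """Backward pass: the sg(...) term carries no gradient, so the
    derivative of ste(x) with respect to x is 1."""
    return upstream_grad


def train_bitwidth(b, target, lr=0.1, steps=20):
    """Toy problem: minimise (ste(b) - target)**2 by gradient descent on
    the stored real-valued bitwidth b."""
    for _ in range(steps):
        q = ste_forward(b)              # integer bitwidth used in the forward pass
        grad_q = 2.0 * (q - target)     # d loss / d q
        b -= lr * ste_backward(grad_q)  # gradient passes straight through to b
    return b


b_final = train_bitwidth(6.3, target=3)
print(round_half_up(b_final))  # 3
```

The stored value keeps moving in small real-valued steps even while the rounded value stays constant, which is what allows a discrete bitwidth to change eventually under gradient descent.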

III-D2 Surrogate gradient for bitwidths

To address the second issue, we first consider some parameter $x$ (e.g., weight or activation) in the network and its corresponding quantizer $\mathrm{f}^{q}(\cdot)$. If we require that the quantized number has at most $f$ fractional bits, its associated quantization error $\delta_{f}$ can be expressed as follows with $\epsilon=1/2$:

$$\delta_{f}\equiv x-\mathrm{f}^{q}(x)=x-\left[x\cdot 2^{f}\right]\cdot 2^{-f}.\tag{8}$$

During training, we assume $x$ to be a random variable following a certain smooth distribution $\mathbb{D}_{x}$. We further assume that the variance of $\mathbb{D}_{x}$ is significantly larger than the quantization error $\delta_{f}$ in such a way that one can approximate the quantization error as uniformly distributed:

$$\delta_{f}\sim\mathrm{Uniform}(-2^{-f-1},2^{-f-1}).\tag{9}$$
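The uniform approximation of Eq. (9) can be checked empirically. The Python sketch below draws samples from a Gaussian whose standard deviation is much larger than the step size $2^{-f}$, quantizes them with $\epsilon=1/2$, and summarizes the resulting errors. The Gaussian distribution, the sample size, the seed, and the function names are assumptions chosen for this illustration only.

```python
import math
import random


def quantization_error(x, f):
    """delta_f = x - [x * 2**f] * 2**-f with epsilon = 1/2, as in Eq. (8)."""
    scale = 2.0 ** f
    return x - math.floor(x * scale + 0.5) / scale


def error_statistics(f, n_samples=20000, sigma=1.0, seed=0):
    """Sample x ~ Normal(0, sigma) and return (max |delta|, mean |delta|,
    mean delta) over the samples."""
    rng = random.Random(seed)
    errors = [quantization_error(rng.gauss(0.0, sigma), f)
              for _ in range(n_samples)]
    max_abs = max(abs(e) for e in errors)
    mean_abs = sum(abs(e) for e in errors) / n_samples
    mean = sum(errors) / n_samples
    return max_abs, mean_abs, mean


f = 3
max_abs, mean_abs, mean = error_statistics(f)
print(max_abs, 2.0 ** (-f - 1))   # the largest error stays within 2**(-f-1)
print(mean_abs, 2.0 ** (-f - 2))  # the mean |error| is close to 2**(-f-2)
print(mean)                       # the mean error is close to zero
```

For a uniform distribution on $(-2^{-f-1},2^{-f-1})$ the mean absolute value is $2^{-f-2}$, so agreement of the sampled mean with this value supports the approximation when the spread of $x$ is large compared to the step size.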

Let the loss of the network be $\mathcal{L}$, and express the gradient of $\mathcal{L}$ with respect to $f$ as

$$\frac{\partial\mathcal{L}}{\partial f} = \frac{\partial\mathcal{L}}{\partial\delta}\cdot\frac{\partial\delta}{\partial f}. \tag{10}$$

In this expression, the first term $\frac{\partial\mathcal{L}}{\partial\delta}$ can be obtained trivially with backpropagation. The second term $\frac{\partial\delta}{\partial f}$ is not well-defined, as $f$ can only take integer values for a properly defined quantizer and thus has no gradient. To address this issue, we propose a surrogate gradient method that assigns a gradient to $f$ only on integer values.

We now express the loss as a function of the weights $\bm{\theta}$ and all the quantization errors $\bm{\delta}$, $\mathcal{L}(\bm{\theta},\bm{\delta})$. To obtain the surrogate gradient of $f$, we assume that the loss function is sensitive to the magnitude of the quantization error, but not the sign: $\mathcal{L}(\bm{\theta},|\bm{\delta}|)$.

For a parameter $x\sim\mathbb{D}_x$ to be quantized with $f\in\mathbb{Z}$ fractional bits, the corresponding absolute quantization error is $|\delta_f| \equiv |x - \mathrm{f}^q_f(x)| \sim \mathrm{Uniform}(0, 2^{-f-1})$. By increasing $f$ by one, we obtain the absolute quantization error $|\delta_{f+1}|$ as a function of $f$ and $|\delta_f|$:

$$|\delta_{f+1}| = \begin{cases} |\delta_f| & |\delta_f| \leq 2^{-f-2} \\ 2^{-f-1} - |\delta_f| & |\delta_f| > 2^{-f-2} \end{cases}. \tag{11}$$
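The piecewise relation of Eq. (11) can be checked numerically. The following is a minimal sketch, assuming the rounding operator $[\cdot]$ is $\lfloor\cdot+1/2\rfloor$ (consistent with $\epsilon=1/2$); the helper names `quantize`, `abs_error` and `predicted_next_error` are illustrative and not part of any library.

```python
import math
import random


def quantize(x: float, f: int) -> float:
    """Round x to f fractional bits, assuming [y] = floor(y + 1/2)."""
    return math.floor(x * 2.0**f + 0.5) * 2.0**-f


def abs_error(x: float, f: int) -> float:
    """Absolute quantization error |delta_f| = |x - f^q(x)|."""
    return abs(x - quantize(x, f))


def predicted_next_error(err_f: float, f: int) -> float:
    """|delta_{f+1}| predicted from |delta_f| by the piecewise relation of Eq. (11)."""
    if err_f <= 2.0 ** (-f - 2):
        return err_f
    return 2.0 ** (-f - 1) - err_f


random.seed(0)
f = 3
samples = [random.gauss(0.0, 1.0) for _ in range(10_000)]

# Largest disagreement between the directly computed |delta_{f+1}| and Eq. (11).
max_mismatch = max(
    abs(abs_error(x, f + 1) - predicted_next_error(abs_error(x, f), f))
    for x in samples
)
# Largest observed |delta_f|; it should not exceed the bound 2^(-f-1).
max_error = max(abs_error(x, f) for x in samples)
print(max_mismatch, max_error)
```

Under these assumptions the mismatch stays at the level of floating-point rounding, and the observed error never exceeds $2^{-f-1}$.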

We can then obtain the gradient of $|\delta_f|$ with respect to $f$ using the finite difference approximation:

$$\frac{\partial|\delta_f|}{\partial f} \leftarrow |\delta_{f+1}| - |\delta_f|. \tag{12}$$

However, as the absolute quantization error is bounded by the geometric sequence $2^{-f-1}$, using a linear difference for approximation is suboptimal. Instead, we use the following heuristic expression to approximate the gradient, which recovers Eq. (12) in the limit $|\delta_{f+1}|\rightarrow|\delta_f|$:

$$\frac{\partial|\delta_f|}{\partial f} \leftarrow |\delta_f|\cdot\log\frac{|\delta_{f+1}|}{|\delta_f|}. \tag{13}$$

Expressing the ratio of $|\delta_{f+1}|$ and $|\delta_f|$ as a function of $|\delta_f|$, we have

$$\frac{|\delta_{f+1}|}{|\delta_f|} = \begin{cases} 1 & |\delta_f| \leq 2^{-f-2} \\ \frac{2^{-f-1}}{|\delta_f|} - 1 & |\delta_f| > 2^{-f-2} \end{cases}. \tag{14}$$

One may get a gradient surrogate by combining Eq. (13) and Eq. (14). However, using the local relation between $|\delta_{f+1}|$ and $|\delta_f|$ expressed in Eq. (14) could lead to a loss landscape for $f$ with extensive high-frequency components, which is hard to optimize. To mitigate this issue and smooth out the loss landscape, we take the expectation of the logarithmic factor of Eq. (13) over $|\delta_f|$:

$$\mathbb{E}_{|\delta_f|}\left[\log\frac{|\delta_{f+1}|}{|\delta_f|}\right] = -\log 2. \tag{15}$$
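The value $-\log 2$ in Eq. (15) can be reproduced by integrating the logarithm of the ratio in Eq. (14) over a uniformly distributed $|\delta_f|$. The sketch below is only a numerical sanity check: it rescales the error to $u = |\delta_f| / 2^{-f-1} \in (0, 1)$, so that the result does not depend on $f$, and uses a plain midpoint rule. The function names are illustrative.

```python
import math


def log_ratio(u: float) -> float:
    """log(|delta_{f+1}| / |delta_f|) from Eq. (14), with u = |delta_f| / 2^(-f-1)."""
    if u <= 0.5:
        return 0.0  # the ratio is 1 in the first branch
    return math.log(1.0 / u - 1.0)  # second branch: 2^(-f-1) / |delta_f| - 1


def expected_log_ratio(n: int = 200_000) -> float:
    """Midpoint-rule estimate of the expectation for u ~ Uniform(0, 1)."""
    h = 1.0 / n
    return sum(log_ratio((k + 0.5) * h) for k in range(n)) * h


estimate = expected_log_ratio()
print(estimate, -math.log(2.0))
```

The integrand diverges logarithmically as $u \rightarrow 1$, but the singularity is integrable, so the midpoint estimate converges to $-\log 2$.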

By substituting Eq. (15) into Eq. (13) and multiplying both sides by $\mathrm{sign}(\delta_f)$, we obtain the surrogate gradient for $f$:

$$\frac{\partial\delta_f}{\partial f} \leftarrow -\log 2\cdot\delta_f. \tag{16}$$

Hence, the forward pass of the quantizer, with respect to one input value $x$ and its fractional bitwidth $f$, can be expressed as in Algorithm 1, and the backward pass is the auto-differentiation of the forward pass with the stop-gradient operations.

Data: $x$: the input value; $f$: the fractional bitwidth
Result: $x^q$: the quantized value of $x$ with fractional bitwidth $f$
$f \leftarrow \mathrm{ste}(f)$;
$x^q \leftarrow \mathrm{sg}([x\cdot 2^f]\cdot 2^{-f})$;
$\delta \leftarrow \mathrm{sg}(x - x^q)$;   // Standard STE-based quantization
$\delta \leftarrow \mathrm{sg}(\delta + \ln 2\cdot f\cdot\delta) - \ln 2\cdot f\cdot\delta$;   // Attach gradients of $f$ to $\delta$
$x^q \leftarrow x - \delta$;   // Attach gradients of $f$ and $x$ to $x^q$
return $x^q$
Algorithm 1 Quantizer forward pass
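The effect of the stop-gradient operations in Algorithm 1 can be made explicit without an automatic differentiation framework. The following is a minimal sketch in plain Python that returns the forward value together with the two partial derivatives that auto-differentiation would produce. It assumes that $[\cdot]$ and $\mathrm{ste}(\cdot)$ both round to the nearest integer via $\lfloor\cdot+1/2\rfloor$; the names `QuantizerOutput` and `quantizer_forward` are hypothetical and do not belong to the HGQ library.

```python
import math
from typing import NamedTuple


class QuantizerOutput(NamedTuple):
    xq: float      # quantized value x^q
    dxq_dx: float  # straight-through gradient of x^q with respect to x
    dxq_df: float  # surrogate gradient of x^q with respect to f


def quantizer_forward(x: float, f: float) -> QuantizerOutput:
    # f <- ste(f): rounded in the forward pass, identity in the backward pass.
    f_int = math.floor(f + 0.5)
    # x^q <- sg([x * 2^f] * 2^-f), assuming [y] = floor(y + 1/2).
    xq = math.floor(x * 2.0**f_int + 0.5) * 2.0**-f_int
    # delta <- sg(x - x^q): a constant as far as gradients are concerned.
    delta = x - xq
    # delta <- sg(delta + ln2*f*delta) - ln2*f*delta keeps the value of delta
    # but yields d(delta)/df = -ln2 * delta, i.e. the surrogate of Eq. (16).
    ddelta_df = -math.log(2.0) * delta
    # x^q <- x - delta: same value, d(x^q)/dx = 1, d(x^q)/df = -d(delta)/df.
    return QuantizerOutput(xq=x - delta, dxq_dx=1.0, dxq_df=-ddelta_df)


out = quantizer_forward(x=0.3, f=2.2)
print(out)
```

In this sketch a positive quantization error gives a positive `dxq_df`, so the sign of the update to $f$ is decided by the upstream gradient of the loss with respect to $x^q$.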

Quantization often results in higher loss values, causing the gradients from the loss function to propagate to the bitwidths and increase them. To address this, we introduce a regularization term to prevent the bitwidths from growing too large. We use EBOPs as this regularization term, incorporating it into the loss function with a regularization coefficient $\beta$ to balance accuracy against on-chip resource usage. Moreover, for network values not involved in multiplicative operations (such as last-layer outputs or inputs to non-linear activations), we apply an L1 regularization with a coefficient $\gamma$ to the bitwidths, preventing them from expanding unnecessarily. The final loss function is given by

$$\mathcal{L} = \mathcal{L}_{\mathrm{base}} + \beta\cdot\mathrm{EBOPs} + \gamma\cdot\mathrm{L1}_{\mathrm{norm}}, \tag{17}$$

with gradients attached to the bitwidths.

As all additional gradients introduced in this section are directly added to the bitwidths, the loss landscape of the network’s weights remains unperturbed compared to that of networks with static quantization parameters. Consequently, it eliminates the alterations to the loss landscape which are introduced by regularization-based quantization methods, like [54, 55].

III-D3 Gradient for bitwidths with multiple parameters

In experiments, we noticed that when the parameter group size increases while $\beta$ is kept fixed, the corresponding bitwidth is more likely to collapse to zero, causing training to break down. We hypothesize that this effect is due to the non-uniformity of the gradients contributed by the parameters in the group. To mitigate this effect, we normalize the gradient on $f_i$ by $1/\sqrt{||g_i||}$, based on empirical observations. Here, $||g_i||$ denotes the number of elements in $g_i$. The quantizer’s forward pass with respect to a parameter group $g_i$ can be described as in Algorithm 2 during training and Eq. (5) during inference. The backward pass is derived from the forward pass automatically.

Data: $g_i$: the $i$-th parameter group; $f_i$: the number of fractional bits for the $i$-th parameter group
Result: $(g^q)_i$: the quantized parameters of the $i$-th parameter group
if $g_i$ is a group of weights then
       $f_i \leftarrow \mathrm{ste}(f_i)$;
       $f_i \leftarrow f_i$;
end if
$(g^q)_i \leftarrow$ empty list;
$N \leftarrow ||g_i||$;
forall $v_j$ in $g_i$ do
       $v^q \leftarrow \mathrm{sg}([v_j\cdot 2^{f_i}]\cdot 2^{-f_i})$;
       $\delta \leftarrow \mathrm{sg}(v_j - v^q)$;
       $\delta \leftarrow \mathrm{sg}(\delta + \frac{\ln 2\cdot f_i\cdot\delta}{\sqrt{N}}) - \frac{\ln 2\cdot f_i\cdot\delta}{\sqrt{N}}$;
       $v^q_j \leftarrow v_j - \delta$;
       append $v^q_j$ to $(g^q)_i$;
end forall
return $(g^q)_i$
Algorithm 2 Quantizer forward pass for a parameter group
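To illustrate how the $1/\sqrt{N}$ normalization enters the gradient of a shared bitwidth, the sketch below applies the loop of Algorithm 2 to a small group and accumulates the per-element surrogate gradients with the chain rule. It is written in plain Python with explicit derivatives instead of stop-gradient operations, assumes $[\cdot]=\lfloor\cdot+1/2\rfloor$ and an already rounded integer $f_i$, and uses hypothetical names and hypothetical upstream gradients.

```python
import math
from typing import List, Tuple


def group_quantizer_forward(
    group: List[float], f_i: int
) -> Tuple[List[float], List[float]]:
    """Quantize every value of a group with one shared bitwidth f_i.

    Returns the quantized values and, per element, the surrogate
    d(v^q_j)/d(f_i) = ln2 * delta_j / sqrt(N).
    """
    n = len(group)
    scale = 2.0**f_i
    quantized, dvq_df = [], []
    for v in group:
        vq = math.floor(v * scale + 0.5) / scale  # assumes [y] = floor(y + 1/2)
        delta = v - vq
        quantized.append(vq)
        dvq_df.append(math.log(2.0) * delta / math.sqrt(n))
    return quantized, dvq_df


def bitwidth_gradient(upstream: List[float], dvq_df: List[float]) -> float:
    """Chain rule: dL/df_i = sum_j dL/dv^q_j * dv^q_j/df_i."""
    return sum(g * d for g, d in zip(upstream, dvq_df))


values = [0.30, -0.70, 0.55, 0.10]
quantized, dvq_df = group_quantizer_forward(values, f_i=1)
# Hypothetical upstream gradients dL/dv^q_j, all set to one for illustration.
grad_f = bitwidth_gradient([1.0, 1.0, 1.0, 1.0], dvq_df)
print(quantized, grad_f)
```

With $N = 4$ the summed contribution of the group is scaled by $1/2$, which is the damping the normalization is meant to provide as groups grow.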

III-D4 Connection to Pruning

From Eq. (5), it is observable that the quantized value is zero whenever $-\epsilon\cdot 2^{-f} \leq x < (1-\epsilon)\cdot 2^{-f}$. Given that $f$ can take both positive and negative values, a sufficiently small $f$ with $\epsilon>0$ will cause certain parameters in the network to always be zero. This is equivalent to pruning the parameter. Assigning a distinct bitwidth to each parameter group in the network through HGQ thus automatically prunes the network during training in an unstructured manner, taking both loss and resource consumption into account.
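The pruning interval can be illustrated with a few concrete bitwidths. The sketch below assumes that the quantizer of Eq. (5) has the form $\lfloor x\cdot 2^f+\epsilon\rfloor\cdot 2^{-f}$, which is consistent with the interval quoted above but is not restated in this section; `quantize` and `is_pruned` are illustrative names.

```python
import math


def quantize(x: float, f: int, eps: float = 0.5) -> float:
    """Fixed-point rounding with f fractional bits, assuming [y] = floor(y + eps)."""
    return math.floor(x * 2.0**f + eps) * 2.0**-f


def is_pruned(x: float, f: int, eps: float = 0.5) -> bool:
    """True if x lies in the interval that is mapped to zero."""
    return -eps * 2.0**-f <= x < (1.0 - eps) * 2.0**-f


weight = 0.3
for f in (2, 1, 0, -1):
    # Lowering f widens the zero interval until the weight falls inside it.
    print(f, quantize(weight, f), is_pruned(weight, f))
```

In this example the weight survives with one or two fractional bits and is mapped to zero once $f \leq 0$, which is the sense in which lowering a bitwidth prunes the parameter.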

Listing 1: HGQ model example.
from tensorflow.keras.layers import Input
from tensorflow.keras.models import Model
from HGQ import HQuantize, HDense
# r: the resource regularization factor, to be set by the user
inp = Input((16,))
out = HQuantize(name='inp_q')(inp)
out = HDense(64, activation='relu', bops_reg_factor=r)(out)
out = HDense(32, activation='relu', bops_reg_factor=r)(out)
out = HDense(32, activation='relu', bops_reg_factor=r)(out)
out = HDense(5, activation='linear', bops_reg_factor=r)(out)
hgq_model = Model(inp, out)

Listing 2: Keras model example.
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model
inp = Input((16,))
out = Dense(64, activation='relu')(inp)
out = Dense(32, activation='relu')(out)
out = Dense(32, activation='relu')(out)
out = Dense(5, activation='linear')(out)
keras_model = Model(inp, out)

IV The High Granularity Quantization Framework

The HGQ algorithm is available as a user-friendly Python library, similar to QKeras [10]. It functions as a quantization API built on top of Keras [56], utilizing hls4ml [47] for deployment. Additionally, the framework facilitates automatic conversion of a tensorflow.keras model into an hls4ml model, ensuring bit-accuracy on a user-defined dataset without requiring manual intervention.

Following the methodology of QKeras, HGQ encompasses most of the layers available in hls4ml, and adheres to the design principles of Keras. HGQ is engineered to carry out automatic quantization on all compatible layers according to the EBOPs regularization factor, $\beta$. This approach eliminates the necessity for users to fine-tune quantization parameters for individual modules or undergo multiple training cycles to identify the best quantization scheme.

HGQ provides drop-in replacements for the most commonly used Keras layers, making it straightforward to transition from a standard Keras model to an HGQ model with minimal adjustments. For instance, converting the Keras model in Listing 2 into the HGQ model in Listing 1 primarily involves substituting existing layers with their HGQ alternatives and adding one layer to quantize the inputs. Within the HGQ framework, there are two categories of layers: Heterogeneous (H-) layers, which accept an additional parameter, beta, to manage the layer’s resource usage regularization based on EBOPs, and Passive (P-) layers, which serve to relay metadata without performing quantization. The H- layers also allow for layer-specific kernel and pre-activation quantizer configurations (kq_config and paq_config, respectively) for more precise control over quantization behaviors.

HGQ simplifies the process by auto-learning optimal quantization parameters during training, thus mostly freeing the user from manually specifying bit widths and scaling factors for each layer. Instead, users are primarily concerned with setting the EBOPs’ regularization factor, although manual parameter adjustment is still an option.

Beyond quantization-aware training, HGQ introduces a convenient intermediate layer, the proxy model, for transitioning a trained Keras model to hls4ml. This feature accommodates both HGQ and QKeras models, automating the creation of hls4ml configurations for precise conversions. Furthermore, the proxy model facilitates bit-accurate emulation of the compiled hls4ml model, aiding in debugging and validating the hls4ml model’s performance throughout development. The emulation remains bit-accurate even in cases of overflow within the constraints of the available bitwidth, up to the accuracy permitted by the float32 precision used in the emulation.

IV-A Resource Consumption Surrogate

We consider five types of major resources on a Xilinx FPGA chip: LUTs, DSPs, FFs, BRAMs, and URAMs. When fitting an unrolled neural network for ultra-low-latency applications, like the L1 triggers at the LHC, the limiting resources are usually either LUTs or DSPs. Empirically, operations consisting of larger bitwidths are more likely to consume DSPs, while operations with smaller bitwidths are more likely to consume LUTs. During our experiments, we observed that the EBOPs roughly predict a linear combination of the LUT and DSP consumption, namely, EBOPs $\approx$ LUTs + $55\times$ DSPs for models synthesized with io_type=io_parallel in hls4ml, where intermediate values are directly wired between layers/modules.

In Fig. II, we demonstrate the relationship between EBOPs and the actual on-chip resource consumption, represented by post place-and-route LUTs + $55\times$ DSPs, synthesized with Xilinx Vivado 2020.1/Vitis 2023.2 for models shown in this work. Although the relationship is not exact, we can still make a reasonable estimation of resource usage based on EBOPs, even during training. This suggests that treating one DSP as equivalent to approximately 55 LUTs could be a practical approximation for comparing resource usage across different models, although this may not hold universally. It is important to note that EBOPs primarily account for operations involving a multiplication-accumulation of one constant and one variable. Therefore, if operations other than these significantly contribute to the consumption of on-chip resources, the EBOPs-based estimation might not be reliable. For instance, in the SVHN classifier model synthesized with io_type=io_stream, the resource usage by FIFO buffers is not factored in, leading to an underestimation of total resource consumption as predicted by EBOPs.

Figure II: The relationship between EBOPs and resource consumption, represented by LUTs + $55\times$ DSPs. The EBOPs roughly predict a linear combination of the LUT and DSP consumption for models synthesized with io_type=io_parallel. Models shown in this figure are from the three tasks described in Section V. The relationship is not exact, but indicates that one DSP is roughly equivalent to 55 LUTs when comparing the resource consumption of different models.
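As a worked example of the combined resource figure, the sketch below evaluates LUTs + $55\times$ DSPs for a few of the jet tagging models, using the LUT and DSP counts reported in Table I. The factor of 55 is the empirical value quoted above; the function name `lut_equivalent` is illustrative.

```python
# (LUT, DSP) counts copied from Table I.
TABLE_I = {
    "BF": (48_321, 1_826),
    "Q6": (39_782, 124),
    "HGQ-1": (6_236, 34),
    "HGQ-3": (1_540, 5),
    "HGQ-6": (256, 0),
}

DSP_TO_LUT = 55  # empirical exchange rate quoted in the text


def lut_equivalent(lut: int, dsp: int, dsp_to_lut: int = DSP_TO_LUT) -> int:
    """Combined resource figure LUT + 55 * DSP."""
    return lut + dsp_to_lut * dsp


combined = {name: lut_equivalent(lut, dsp) for name, (lut, dsp) in TABLE_I.items()}
for name, value in combined.items():
    print(f"{name:6s} {value:>8,d}")
```

This single number is what allows models with very different LUT/DSP splits to be placed on one resource axis.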

V Results

To evaluate the performance of the HGQ framework, we train and evaluate models on a classification task, a computer vision task, and a regression task: jet tagging at the LHC [10], SVHN digit classification [57], and muon tracking [58], respectively.

To demonstrate the trade-off between accuracy (or resolution) and resource usage, we methodically adjusted the $\beta$ factor for each task during training to map out various optimal points on the accuracy (resolution) vs. resource consumption Pareto front. This process involved starting with a notably low $\beta$ value and incrementally raising it throughout training, capturing all models that lie on the Pareto front defined by validation accuracy (or resolution) and estimated resource consumption via EBOPs. Meanwhile, we kept the $\gamma$ value fixed at 2e-6 for all experiments to avert the risk of bitwidths diverging in some layers. After training, we re-evaluated the models on the test set, reporting accuracy or resolution based on C synthesis and resource consumption after place-and-route.
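The selection of Pareto-optimal checkpoints described above can be sketched as follows. This is a generic non-dominated filter, not the framework’s own implementation; the `Checkpoint` type and the numbers in `history` are hypothetical and serve only to show the mechanics.

```python
from typing import List, NamedTuple


class Checkpoint(NamedTuple):
    name: str
    accuracy: float  # validation accuracy, higher is better
    ebops: float     # estimated resource usage, lower is better


def pareto_front(checkpoints: List[Checkpoint]) -> List[Checkpoint]:
    """Keep the checkpoints that no other checkpoint dominates."""
    front = []
    for c in checkpoints:
        dominated = any(
            o.accuracy >= c.accuracy
            and o.ebops <= c.ebops
            and (o.accuracy > c.accuracy or o.ebops < c.ebops)
            for o in checkpoints
        )
        if not dominated:
            front.append(c)
    return sorted(front, key=lambda c: c.ebops)


# Hypothetical checkpoints collected while beta is ramped up.
history = [
    Checkpoint("a", 0.764, 9000.0),
    Checkpoint("b", 0.760, 9500.0),  # dominated by "a"
    Checkpoint("c", 0.750, 2000.0),
    Checkpoint("d", 0.748, 2500.0),  # dominated by "c"
    Checkpoint("e", 0.710, 300.0),
]
front = pareto_front(history)
print([c.name for c in front])
```

The quadratic scan is adequate for the number of checkpoints kept from one training run; an incremental update after each epoch would follow the same dominance test.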

V-A Jet Classification at the LHC

We conducted a comparison of the accuracy, latency, and on-chip resource utilization of models trained with HGQ against various quantized models from earlier research.

We use the dataset from [59]. This dataset is for classifying jets, a kind of particle shower produced by high-energy particles at the LHC experiments, into five classes based on their originating particles: quark (q), gluon (g), W boson, Z boson, and top (t) jets. The inputs for each jet are 16 scalar values representing physics-motivated high-level features. The model architecture employed is based on the full precision baseline model described in the original work [10], which is a 4-layer fully connected neural network. The exact model architecture is shown in Fig. VI in extended data.

The results are summarized in Table I. In the table, the following models are cited from [10]: BF, BP, BH, Q6, QE, and QB. In that work, various techniques such as Quantization-Aware Training (QAT), pruning, and automated parameter optimization using a Gaussian Process (for the QE and QB models) were used to achieve low resource consumption. LogicNets JSC-M and JSC-L are cited from [60], where the networks are designed to use on-chip LUTs efficiently. BP-DSP-RF=2 is cited from [38], where the network is implemented in QKeras with a reuse factor of two, and pruned in a DSP-aware fashion to reduce the resource consumption. MetaML-$\alpha_{\mathrm{q}}=1\%$ and MetaML-$\alpha_{\mathrm{q}}=4\%$ are cited from [30]. These two networks went through iterative architecture search, quantization, and pruning for a better accuracy-resource trade-off. SymbolNet is cited from [61], which leverages a gradient-based method for optimal symbolic expression searching. It also uses an adaptive dynamic pruning scheme to reduce on-chip resource consumption while maintaining accuracy.

As shown in Fig. III and Tab. I, the HGQ models outperform the baseline models by a significant margin both in terms of model accuracy and resource usage. Depending on the working point, HGQ models reduce the resource consumption by 50% to 95% while maintaining the same accuracy. When working with a lower accuracy requirement, HGQ can also achieve resource consumption similar to that of an optimized symbolic regressor.

The HGQ-trained models, HGQ-1 through HGQ-6, are taken from the same training run, in which $\beta$ was ramped up during training. These models were initialized with 2 fractional bits for the activations and 2 bits in total for the weights. Throughout the training process, which spanned roughly 300,000 epochs, $\beta$ was gradually increased from $10^{-6}$ to $10^{-4}$. Due to the models’ compact size, this entire training process could be completed in just a few hours on a standard consumer-grade GPU.

We also studied the performance of the models trained with fixed, non-zero $\beta$ values. In Fig. III and Tab. I, these correspond to HGQ-c1, trained with a fixed $\beta$ of 2.1e-6, and HGQ-c2, using a $\beta$ of 1.2e-5. Each is trained for 5,000 epochs (lasting a few minutes on a consumer-grade GPU). Models trained with either a constant or incrementally increasing $\beta$ value are capable of achieving a comparable balance between accuracy and resource consumption. From this, we conclude that the method of progressively ramping up $\beta$ throughout the training process is effective in generating a collection of models that represent an optimal compromise between accuracy and resource efficiency, situating them favorably on the accuracy-resource Pareto frontier.

TABLE I: Resource consumption and latency of the jet tagging models. Resources reported for HGQ models are after place & route. In this task, HGQ models outperform the baseline models by a large margin in both accuracy and resource consumption. The LGQ model is not high-granularity quantized, but only uses gradient-based bitwidth optimization.
| Model | Accuracy (%) | Latency (cc) | DSP (%) | LUT (%) | FF (%) | II (cc) |
|---|---|---|---|---|---|---|
| BF [10] | 74.4 | 9 (45 ns) | 56.0 (1,826) | 4.09 (48,321) | 0.8 (20,132) | |
| BP [10] | 74.8 | 14 (70 ns) | 7.7 (526) | 1.49 (17,577) | 0.4 (10,548) | |
| BH [10] | 73.2 | 14 (70 ns) | 1.3 (88) | 1.34 (15,802) | 0.3 (8,108) | |
| Q6 [10] | 74.8 | 11 (55 ns) | 1.8 (124) | 3.36 (39,782) | 0.3 (8,128) | |
| QE [10] | 72.3 | 11 (55 ns) | 1.0 (66) | 0.77 (9,149) | 0.1 (1,781) | |
| QB [10] | 71.9 | 14 (70 ns) | 1.0 (69) | 0.95 (11,193) | 0.1 (1,771) | |
| LogicNets JSC-M [60] | 70.6 | N/A | 0 (0) | 1.22 (14,428) | 0.02 (440) | |
| LogicNets JSC-L [60] | 71.8 | 5 (13 ns) | 0 (0) | 3.21 (37,931) | 0.03 (810) | |
| BP-DSP-RF=2 [38] | 76.3 | 21 (105 ns) | 2.6 (175) | 0.47 (5,504) | 0.13 (3,036) | 2 |
| MetaML-$\alpha_{\mathrm{q}}=1\%$ [30] | 75.6 | 9 (45 ns) | 0.7 (50) | 0.57 (6,698) | N/A | 1 |
| MetaML-$\alpha_{\mathrm{q}}=4\%$ [30] | 72.8 | 8 (40 ns) | 0.2 (23) | 0.57 (7,224) | N/A | 1 |
| SymbolNet [61] | 71. | 2 (10 ns) | <0.1 (3) | 0.01 (177) | <0.01 (109) | 1 |
| HGQ-1 | 76.4 | 6 (30 ns) | 0.50 (34) | 0.53 (6,236) | 0.05 (1,253) | 1 |
| HGQ-2 | 75.9 | 4 (20 ns) | 0.09 (6) | 0.27 (3,162) | 0.02 (550) | 1 |
| HGQ-3 | 75.0 | 4 (20 ns) | 0.07 (5) | 0.13 (1,540) | 0.02 (370) | 1 |
| HGQ-4 | 73.9 | 3 (15 ns) | 0.00 (0) | 0.05 (565) | 0.01 (140) | 1 |
| HGQ-5 | 72.5 | 2 (10 ns) | 0.00 (0) | 0.04 (468) | 0.01 (131) | 1 |
| HGQ-6 | 71.0 | 2 (10 ns) | 0.00 (0) | 0.02 (256) | 0.00 (66) | 1 |
| HGQ-c1 | 76.3 | 8 (40 ns) | 0.26 (18) | 0.50 (5,899) | 0.09 (2,072) | 1 |
| HGQ-c2 | 74.2 | 3 (15 ns) | 0.00 (0) | 0.06 (678) | 0.01 (172) | 1 |
Figure III: Accuracy versus resource consumption of the jet tagging models. Note that models with different DSP and LUT usage could land on the same point on this plot due to the linear combination of DSPs and LUTs.

V-B SVHN Classifier

We also benchmark HGQ on a computer vision task and compare it to previous state-of-the-art work [57, 38] on the SVHN dataset [62]. The SVHN dataset consists of $32\times 32$ RGB images of house numbers taken from Google Street View. The task is to classify the digit in the center of the image into one of ten classes. The architecture of the model is a LeNet-like [63] convolution-dense network directly taken from [57]. The exact model architecture is shown in Fig. VII in extended data.

These results are summarized in Tab. II and Fig. IV. In the table, the following models are taken from [57]: AQP, AQ, QP 7-bit, Q 7-bit, and BP 14-bit. All models are quantized, and pruning to a sparsity of 50% is applied to AQP, QP and BP. AQP and AQ are heterogeneously quantized models, where the quantization configuration is obtained using a Gaussian Process hyperparameter optimization, and are trained quantization-aware with QKeras. QP 7-bit, Q 7-bit, and BP 14-bit are homogeneously quantized models trained quantization-aware with QKeras. BP-DSP-RF=3 is cited from [38], where the network is implemented in QKeras with a reuse factor of three, and pruned in a DSP-aware fashion to reduce the resource consumption.

The HGQ-trained models, HGQ-1 through HGQ-6, are taken from a single training run during which the $\beta$ value was gradually increased. We initialize the models with 6 fractional bits for the activations and 6 bits in total for the weights. The $\beta$ value was systematically increased from $10^{-7}$ to $10^{-4}$ over approximately 12,000 epochs. Completing this training process required about 10 hours on a standard consumer-grade GPU.

This model is too large to fit on-chip when fully unrolled. Instead, we use io_stream in hls4ml to partition the convolutional layers into smaller blocks. The resource consumption is estimated as the sum of the resource consumption of each block. For this reason, intra-layer heterogeneous activation quantization cannot be utilized, and only inter-layer heterogeneous weight quantization is performed. Nevertheless, HGQ still outperforms both baselines by a considerable margin of up to 30% in resource savings while maintaining similar accuracy and latency.

TABLE II: Resource usage and latency of the convolutional SVHN classifier models. Reported resource usage for HGQ models is after place & route. In this task, the HGQ-0.4 and HGQ-1.5 models outperform the baseline AQ and AQP models by a large margin in accuracy while also using fewer resources.
| Model | Accuracy (%) | Latency (cc) | DSP (%) | LUT (%) | FF (%) | BRAM (%) | II (cc) |
|---|---|---|---|---|---|---|---|
| BF 14-bit [57] | 87 | 1,035 (5.18 μs) | 93.23 (6,377) | 19.36 (228,823) | 3.40 (80,278) | 3.08 (66.5) | 1,030 |
| BP 14-bit [57] | 93 | 1,035 (5.18 μs) | 48.85 (3,341) | 12.27 (145,089) | 2.77 (65,482) | 3.08 (66.5) | 1,030 |
| Q 7-bit [57] | 94 | 1,034 (5.17 μs) | 2.56 (175) | 12.77 (150,981) | 1.51 (35,628) | 3.10 (67.0) | 1,029 |
| QP 7-bit [57] | 94 | 1,035 (5.18 μs) | 2.54 (174) | 9.40 (111,152) | 1.38 (32,554) | 3.10 (67.0) | 1,030 |
| AQ [57] | 88 | 1,059 (5.30 μs) | 1.05 (72) | 4.06 (48,027) | 0.64 (15,242) | 1.48 (32.5) | 1,029 |
| AQP [57] | 88 | 1,059 (5.30 μs) | 1.02 (70) | 3.28 (38,795) | 0.63 (14,802) | 1.39 (30.5) | 1,029 |
| BP-DSP-RF=3 [38] | 92 | ? (43.58 μs) | 17.76 (1,215) | 5.01 (59,279) | 1.97 (46,584) | 35.88 (1,550) | 35.88 |
| HGQ-1 | 93.9 | 1,050 (5.25 μs) | 0.85 (58) | 5.87 (69,407) | 1.18 (27,853) | 1.48 (32.0) | 1,029 |
| HGQ-2 | 93.1 | 1,061 (5.31 μs) | 0.44 (30) | 4.00 (47,314) | 0.87 (20,582) | 1.30 (28.0) | 1,029 |
| HGQ-3 | 91.9 | 1,058 (5.29 μs) | 0.22 (15) | 3.39 (40,032) | 0.76 (18,087) | 1.09 (23.5) | 1,029 |
| HGQ-4 | 90.9 | 1,059 (5.30 μs) | 0.19 (13) | 2.91 (34,435) | 0.73 (17,261) | 1.04 (22.5) | 1,029 |
| HGQ-5 | 89.9 | 1,056 (5.28 μs) | 0.15 (10) | 2.60 (30,766) | 0.64 (15,205) | 0.97 (21.0) | 1,029 |
| HGQ-6 | 88.8 | 1,056 (5.28 μs) | 0.09 (6) | 2.37 (27,982) | 0.62 (14,736) | 0.97 (21.0) | 1,029 |
Figure IV: Accuracy versus resource usage of the SVHN Classifier models. Note that models with different DSP and LUT consumption could land on the same point on this plot due to taking a linear combination of DSPs and LUTs.
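The overlap mentioned in the caption can be illustrated with a small sketch. The weight given to one DSP relative to one LUT (`DSP_WEIGHT` below) is an assumed, illustrative value that is not stated in this section, and `combined_resource` is a hypothetical helper. The LUT and DSP counts are copied from Tab. II.

```python
DSP_WEIGHT = 55  # assumption: illustrative LUT-equivalent cost of one DSP


def combined_resource(lut, dsp, dsp_weight=DSP_WEIGHT):
    """Collapse LUT and DSP counts into one number via a linear combination."""
    return lut + dsp_weight * dsp


# (LUT, DSP) counts copied from Tab. II.
table_ii = {
    "AQP": (38_795, 70),
    "HGQ-4": (34_435, 13),
    "HGQ-6": (27_982, 6),
}
combined = {
    name: combined_resource(lut, dsp) for name, (lut, dsp) in table_ii.items()
}

# Two made-up designs with different LUT/DSP splits that share one coordinate.
design_a = combined_resource(lut=1_000, dsp=2)
design_b = combined_resource(lut=1_055, dsp=1)
same_point = design_a == design_b

for name, value in combined.items():
    print(f"{name}: {value}")
print("made-up designs coincide:", same_point)
```

Because the resource axis is a weighted sum, trading one DSP for `DSP_WEIGHT` LUTs leaves a model's position on the plot unchanged, which is the effect the caption warns about.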

V-C Muon Tracker

We also compare the resolution, latency, and on-chip resource consumption of HGQ-trained models with those of the baseline models on a regression task proposed in [58].

This task involves predicting the polar angle of a simulated muon track in a simplified detector. The inputs are one $3\times 50$ and two $3\times 50$ binary-valued arrays, representing the hit patterns in three detector layers. The output is a single scalar value representing the polar angle of the track in milliradians. The architecture of the model is a multistage neural network taken from the original work [58]. The exact model architecture is available in Fig. VIII in extended data.

The results are presented in Tab. III and Fig. V. The Qf models presented in [58] are all trained quantization aware with QKeras using manually tuned parameters.

The HGQ-trained models, HGQ-1 through HGQ-6, are taken from a single training run during which the $\beta$ value was gradually increased. We initialize the models with 6 fractional bits (bits after the binary point) for activations and 6 bits in total for weights. The $\beta$ value was systematically increased from $3\times 10^{-6}$ to $6\times 10^{-4}$ over approximately 600,000 epochs. The complete run takes around 16 hours on a single consumer-grade GPU.

The HGQ models consistently outperform the baseline models with a $\sim 40\%$ reduction in resource consumption, while maintaining the same resolution and comparable latency.
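The latency columns of Tabs. II and III quote both clock cycles (cc) and wall-clock time, which lets a reader recover the clock period that each table implies. The sketch below does this for one row of each table. The helper names are hypothetical, and the resulting periods are inferred from the rounded table entries rather than quoted from the source.

```python
def clock_period_ns(latency_cycles, latency_ns):
    """Clock period implied by a latency given in both cycles and nanoseconds."""
    return latency_ns / latency_cycles


def frequency_mhz(period_ns):
    """Convert a clock period in nanoseconds to a frequency in MHz."""
    return 1e3 / period_ns


# BP 14-bit row of Tab. II: 1,035 cc quoted as 5.18 us (5,180 ns).
svhn_period = clock_period_ns(1_035, 5_180.0)
# Qf8 row of Tab. III: 17 cc quoted as 106.3 ns.
muon_period = clock_period_ns(17, 106.3)

print(f"SVHN: {svhn_period:.2f} ns per cycle, {frequency_mhz(svhn_period):.0f} MHz")
print(f"Muon: {muon_period:.2f} ns per cycle, {frequency_mhz(muon_period):.0f} MHz")
```

Under this reading the two benchmarks run at different clocks (about 5 ns and about 6.25 ns per cycle), so latencies are best compared in clock cycles within a table and in wall-clock time across tables.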

TABLE III: Resource consumption and latency of the Muon Tracker models. The resource usage reported for HGQ models is after place & route. In this task, the HGQ-1.25 outperforms the baseline Qf6 model in both accuracy and resource consumption, while the HGQ-3.00 outperforms the baseline Qf5 model in both accuracy and resource consumption.
| Model | Resolution (mrad) | Latency (cc) | DSP (%) | LUT (%) | FF (%) | BRAM (%) | II (cc) |
|---|---|---|---|---|---|---|---|
| Qf8 [58] | 1.95 | 17 (106.3 ns) | 57.4 (1,762) | 8.8 (37,867) | 1.0 (8,443) | 5.6 (37.5) | 1 |
| Qf7 [58] | 1.97 | 11 (68.8 ns) | 45.2 (1,389) | 8.0 (34,848) | 0.6 (5,433) | 5.6 (37.5) | 1 |
| Qf6 [58] | 2.04 | 13 (81.3 ns) | 10.5 (324) | 12.6 (54,638) | 0.8 (6,525) | 5.6 (37.5) | 1 |
| Qf5 [58] | 2.15 | 11 (68.8 ns) | 2.9 (88) | 9.3 (40,039) | 0.4 (3,419) | 5.6 (37.5) | 1 |
| Qf4 [58] | 2.45 | 10 (62.5 ns) | 0.8 (24) | 6.6 (28,526) | 0.3 (2,954) | 5.6 (37.5) | 1 |
| Qf3 [58] | 2.78 | 9 (56.3 ns) | 0.0 (2) | 5.0 (21,682) | 0.3 (2,242) | 5.6 (37.5) | 1 |
| HGQ-1 | 1.95 | 11 (68.8 ns) | 17.0 (522) | 9.12 (39,413) | 0.70 (6,043) | 1.16 (25.0) | 1 |
| HGQ-2 | 2.00 | 11 (68.8 ns) | 5.01 (154) | 7.98 (34,460) | 0.61 (5,263) | 1.16 (25.0) | 1 |
| HGQ-3 | 2.09 | 12 (75.0 ns) | 2.21 (68) | 5.77 (24,941) | 0.54 (4,677) | 1.74 (37.5) | 1 |
| HGQ-4 | 2.20 | 13 (81.3 ns) | 1.33 (41) | 4.99 (21,557) | 0.54 (4,699) | 1.74 (37.5) | 1 |
| HGQ-5 | 2.39 | 10 (62.5 ns) | 0.88 (27) | 3.92 (16,918) | 0.29 (2,484) | 1.74 (37.5) | 1 |
| HGQ-6 | 2.63 | 12 (75.0 ns) | 0.33 (10) | 3.08 (13,306) | 0.40 (3,429) | 1.16 (25.0) | 1 |
Figure V: Accuracy versus resource consumption of the muon tracking models. Note that models with different DSP and LUT consumption could land on the same point on this plot as a result of taking the linear combination of DSPs and LUTs.

VI Conclusion and Future Work

In this work, we present HGQ, a novel method to optimize quantized neural networks for real-time applications on Field-Programmable Gate Arrays (FPGAs). Maximally leveraging the ability of FPGAs to perform fully heterogeneous computation, we introduce a new algorithm for precisely determining the optimal quantization precision for each weight and activation to minimize resource consumption without sacrificing the accuracy of the original model. To facilitate adoption, we have developed a user-friendly library that simplifies the application of this method. The HGQ approach enables the optimization of quantization bitwidths at arbitrary granularity, down to the level of individual parameters, through a gradient descent approach that is conscious of both resource use and loss minimization. Additionally, the library offers an easy-to-use interface for defining quantized neural networks and training them with our method, as well as for deploying these networks on FPGAs by integrating with hls4ml.

Our findings show that HGQ achieves up to a 95% reduction in resource consumption compared to leading compression techniques, without compromising performance. We further demonstrate that a single training session with HGQ is sufficient to explore a broad spectrum of trade-offs between performance and resource utilization, efficiently recovering the Pareto frontier and making model optimization both faster and more effective. Through its interface with hls4ml, HGQ provides a bit-accurate conversion from software to FPGA firmware models without the need for user interaction, significantly simplifying and streamlining the workflow from training to deployment. Moreover, we introduce EBOPs, a metric providing an accurate estimation of the final on-chip resource consumption as a linear combination of LUTs and DSPs. This estimation is available at training time, allowing for efficient software-hardware co-design.
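As an illustration of what recovering a Pareto frontier means here, the sketch below filters a set of (resource, accuracy) points down to the non-dominated ones. It uses the LUT counts and accuracies from Tab. II as a single resource axis, which is a simplification chosen for this example (the figures combine LUTs and DSPs), and `pareto_front` is a hypothetical helper rather than part of the HGQ library.

```python
def pareto_front(points):
    """Return names of non-dominated models.

    points: iterable of (name, resource, accuracy), where lower resource
    and higher accuracy are better.
    """
    front = []
    best_accuracy = float("-inf")
    # Walk from cheapest to most expensive; ties go to the more accurate model.
    for name, _resource, accuracy in sorted(points, key=lambda p: (p[1], -p[2])):
        # Keep a model only if it beats every cheaper model on accuracy.
        if accuracy > best_accuracy:
            front.append(name)
            best_accuracy = accuracy
    return front


# (name, LUT count, accuracy in %) copied from Tab. II.
svhn_points = [
    ("Q 7-bit", 150_981, 94.0),
    ("QP 7-bit", 111_152, 94.0),
    ("AQ", 48_027, 88.0),
    ("AQP", 38_795, 88.0),
    ("HGQ-1", 69_407, 93.9),
    ("HGQ-2", 47_314, 93.1),
    ("HGQ-3", 40_032, 91.9),
    ("HGQ-4", 34_435, 90.9),
    ("HGQ-5", 30_766, 89.9),
    ("HGQ-6", 27_982, 88.8),
]
front = pareto_front(svhn_points)
print(front)
```

On this LUT-only axis, AQ and AQP are dominated by HGQ models that use fewer LUTs at higher accuracy, while all six HGQ checkpoints from the single training run remain on the frontier.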

In the future, we plan to extend support for more operations and layers. We also aim to support other training back-ends, such as PyTorch [49] and JAX [64]. Furthermore, we plan to include energy estimates as well as more fine-grained resource estimations in the library.

VII Code Availability

We have made our library publicly available under the Apache 2.0 license at https://www.github.com/calad0i/HGQ. The scripts to reproduce the results in this paper are also available at https://www.github.com/calad0i/HGQ-demos under the Apache 2.0 license.

To use this library, one needs a forked version of hls4ml available at https://www.github.com/calad0i/hls4ml#HGQ-integration. The fork will be merged into the main hls4ml repository in the future, and one may check https://github.com/fastmachinelearning/hls4ml/pull/914 for the pull request status.

VIII Data Availability

The data used for training and evaluation in this work are all publicly available datasets. The jet tagging dataset is available at https://dx.doi.org/10.5281/zenodo.2603255. The SVHN dataset is available at http://ufldl.stanford.edu/housenumbers/. The muon tracking dataset is available at https://dx.doi.org/10.57967/hf/2084. Results shown in this work can be reproduced using the code available at https://www.github.com/calad0i/HGQ-demos.

IX Author contributions

C.S. conceived, designed, and implemented the HGQ method and library and performed the experiments. C.S. and V.L. implemented HGQ support in hls4ml. C.S. and T.Å. wrote the manuscript. All authors reviewed and edited the manuscript.

X Acknowledgements

C.S. is supported by the Caltech Danny Koh grad fellowship. C.S. and M.S. acknowledge support from the U.S. Department of Energy (DOE), Office of Science, Office of High Energy Physics grant DE-SC0011925. T.Å. is supported by the Swiss National Science Foundation Grant No. PZ00P2_201594. J.N. is supported by the U.S. Department of Energy (DOE), Office of Science, Office of High Energy Physics “Designing efficient edge AI with physics phenomena” Project (DE-FOA-0002705). V.L. is supported by the NSF Institute for Accelerated AI Algorithms for Data-Driven Discovery (A3D3), under the NSF grant #PHY-2117997.

XI Competing Interests

The authors declare no competing interests.


XII Extended Data

XII-A Architectures and networks

Figure VI: Schematic of the model used for jet classification. The kernel size for the Dense layers refers to <output dimension, input dimension>. The <*> notation on an arrow indicates the shape of the array passed along that arrow.
Figure VII: Schematic of the model used for the SVHN classification. The kernel size for the Dense layers refers to <output dimension, input dimension>. The kernel size for the Conv2D layers refers to <filter width, filter height, input channel>. The <*> notation on an arrow indicates the shape of the array passed along that arrow.
Figure VIII: Figure modified from [58]. Schematic of the model used for muon tracking. The kernel size for the Dense layers refers to <output dimension, input dimension>. The kernel size for the Conv1D layers refers to <filter width, input channels>. The <*> notation on an arrow indicates the shape of the array passed along that arrow. Arrows without a specified array shape pass arrays of shape 50.