Gradient-based Automatic Per-Weight Mixed Precision Quantization for Neural Networks On-Chip

Chang Sun*¹,², Thea K. Årrestad¹, Vladimir Loncar³,⁴, Jennifer Ngadiuba⁵, Maria Spiropulu²
¹ETH Zürich, Zürich, Switzerland
²California Institute of Technology, Pasadena, CA, USA
³Massachusetts Institute of Technology, Cambridge, MA, USA
⁴Institute of Physics Belgrade, Serbia
⁵Fermi National Accelerator Laboratory, Batavia, IL, USA
Email: {chang.sun, thea.aarrestad, vladimir.loncar, jennifer.ngadiuba, maria.spiropulu}@cern.ch
*Corresponding author

Abstract

Model size and inference speed at deployment time are major challenges in many deep learning applications. A promising strategy to overcome these challenges is quantization. However, a straightforward uniform quantization to very low precision can result in significant accuracy loss. Mixed-precision quantization, based on the idea that certain parts of the network can accommodate lower precision without compromising performance compared to other parts, offers a potential solution. In this work, we present High Granularity Quantization (HGQ), an innovative quantization-aware training method that automatically fine-tunes the per-weight and per-activation precision of ultra-low latency and low power neural networks to be deployed on FPGAs. We demonstrate that HGQ can outperform existing methods by a substantial margin, achieving resource reduction by up to a factor of 20 and latency improvement by a factor of 5 while preserving accuracy.

I Introduction

Edge computing has significantly increased the importance of real-time deep neural network (DNN) inference on specialized hardware [1]. The typical latency threshold for real-time inference is $\mathcal{O}(1)$ ms [2, 3, 4]. Nevertheless, certain domains require sub-microsecond inference times. At the CERN Large Hadron Collider (LHC), detectors generate tens of terabytes of data every second from collisions occurring every 25 nanoseconds. This data throughput is managed by a real-time selection system, the trigger. This system determines the fate of each collision event - whether it should be preserved for analysis or discarded - with a decision-making latency ceiling of a few microseconds [5, 6]. The trigger’s precision is vital to retain only interesting events, thereby managing the bandwidth effectively and reducing the event rate significantly. The system consists of $\mathcal{O}(1000)$ field programmable gate arrays (FPGAs) mounted on custom boards. Several algorithms run in parallel on each FPGA. As a result, resources are scarce and the memory footprint of each algorithm should be minimal. In anticipation of the LHC’s upgrade to the High Luminosity-LHC (HL-LHC) [7], which will increase the collision rate by a factor of $2\sim 3$ compared to the current one [5, 6], machine learning techniques are being explored to enhance the speed and accuracy of the computational tasks in the hardware trigger.

However, integrating demanding models - when resource consumption and latency are strictly limited - without compromising performance is a hurdle. Efforts in recent years have focused on algorithmic efficiency, with strategies ranging from the design of compact networks to weight pruning and quantization [8, 9]. Quantization converts model parameters into lower-precision formats, causing some loss in performance. Although post-training quantization is computationally cheaper to perform, it generally incurs a significant loss in performance compared to the full-precision baseline. To mitigate this, quantization-aware training has been proposed, which adheres to a fixed numerical precision throughout training.

To satisfy the latency requirements, neural networks on FPGAs for LHC physics experiments are usually fully unrolled and pipelined - each arithmetic operation is performed by a dedicated component in the circuit without overlapping, maximizing throughput and minimizing latency. To exploit this property, recent studies [10, 11] have suggested that applying varying levels of quantization to different layers could further optimize accuracy against computational costs.

In this paper, we introduce the high-granularity quantization (HGQ) method, which allows models to be trained quantization aware at arbitrary granularity. In contrast to the QAT library QKeras, where weights and activations are processed in layerwise blocks for quantization, HGQ enables weights and activations within one layer to have different bitwidths. For a fully unrolled implementation, every weight and activation can have its own unique bitwidth. We illustrate the key difference between the HGQ method and conventional block-wise quantization methods in Fig. I. Optimizing quantization parameters at higher granularity allows HGQ to find a better trade-off between model accuracy and resource consumption. Furthermore, by optimizing these individual bitwidths alongside the network using gradient descent, the need to train the network multiple times to search for a favorable quantization bitwidth for each block of the network can also be eliminated.

Figure I: Overview of the HGQ method, showing activations (circles) and weights (lines) with thickness indicating bitwidth. Connections are dropped when weight or activation values are constantly zero. Top left: baseline network with high precision throughout. Top right: network quantized layer-wise, e.g., using QKeras. Bottom right: network both quantized layer-wise and pruned. Bottom left: network quantized using HGQ, applying more detailed quantization and assigning high bitwidths only where needed, on a per-weight and activation basis. This approach reduces resource use by maximally utilizing FPGA’s heterogeneous computation.

When multiplication operations in neural networks primarily involve low-bitwidth operands implemented with look-up tables (LUTs), HGQ could demonstrate a substantial reduction in on-chip resource consumption by eliminating unnecessary computations without compromising performance. Depending on the specific task, we demonstrate that HGQ has the potential to outperform AutoQKeras and achieve resource reduction by up to a factor of 20, and latency improvement by a factor of 5 while preserving accuracy.

A functional HGQ framework has been developed using Tensorflow and Keras, and we have open-sourced it as free software. The Vivado/Vitis FPGA back-end is supported through integration with hls4ml. The library guarantees an exact correspondence between the software and firmware models, provided that no numeric overflow occurs and intermediate values are representable by float32.

The work presented in this paper makes the following contributions:

• We present a new algorithm for obtaining surrogate gradients of parameter bitwidths, from both the loss function and the estimated model resource consumption, enabling full gradient-based optimization of bitwidths;

• We enable heterogeneous quantization of a specific model at arbitrary granularity up to per-parameter level, aiming to minimize hardware resource usage while preserving high accuracy. This approach naturally includes sparse pruning of network parameters by setting their bitwidth to zero, further reducing resource cost;

• We have made the method available online as an easy-to-use library, called HGQ (https://github.com/calad0i/HGQ), where simple drop-in replacement of Tensorflow Keras layers makes it straightforward for users to transform Keras models into their equivalent deep heterogeneously quantized versions, which are trained quantization aware;

• We have added support for quantized HGQ models in the hls4ml library, which converts these pre-trained quantized models into highly-parallel FPGA firmware for ultra low-latency inference;

• Using HGQ in combination with hls4ml ensures exact bit-level accuracy between the HGQ software model and the corresponding firmware model, making the library safe and easy to use for non-experts;

• We propose a new metric called Effective Bit Operations (EBOPs) for a more accurate estimation of on-chip resource consumption;

• We demonstrate a resource reduction of up to 95% and a 5-fold improvement in latency, all while maintaining accuracy compared to other state-of-the-art methods.

II Related work

Network compression has been shown to be an effective way to reduce the computational cost of neural networks on FPGAs. Quantization is a widely adopted method for compressing deep neural networks (DNNs) for implementing them on hardware devices such as FPGAs or ASICs. Previous studies have utilized low precision quantization, such as binary or ternary, across networks to enhance throughput and reduce latency. Binary quantization restricts weights to $\alpha\times\{-1,1\}$, and ternary to $\alpha\times\{-1,0,1\}$, with $\alpha$ as a scaling factor. Key examples include DoReFa Net [12], ABC-net [13], Binaryconnect [14], XNOR-net [15], TWN [16], TTQ [17], and [18]. These methods achieve high compression but at the cost of reduced performance compared to standard floating-point networks. Using binary network principles, several studies have moved to multi-bit network designs that represent numbers through binary bases and values, highlighted in works like [19, 20, 13, 21, 22]. Mix&Match [23], in particular, uses power-of-two bases for better hardware compatibility.

Many studies have investigated heterogeneous quantization with layer-specific precision to lessen the performance loss due to quantization. In particular, HAQ [24] utilizes reinforcement learning to find the best bitwidth configuration. HAWQ, HAWQ-V2, PyHessian, and Q-BERT [25, 26, 27, 28] focus on optimizing bitwidths through Hessian-aware techniques. DNAS [29] and AutoQKeras [10] optimize bitwidths and network architecture simultaneously, with DNAS using stochastic sampling from a super network and AutoQKeras employing gradient-free methods like Gaussian Process, Hyperband, and stochastic search for hyperparameter optimization. Similarly, Meta-ML [30] applies iterative optimization to various hyperparameters, including bitwidths, weight pruning, and model architectures.

Some works, like RVQuant [31], BitsandBytes [32], and SpQR [33], have investigated heterogeneous quantization down to the sub-layer level, offloading outlier weights to higher precision formats primarily for model compression for large models rather than significant performance gains on FPGAs. AutoQ [34] utilizes reinforcement learning to optimize bitwidths for kernel weights and activations. A study more aligned with ours is the recent FILM-QNN [35], which optimizes weight and activation precision in a manner conducive to hardware efficiency. It categorizes convolution layer filters into groups of low and high precision, assigning them based on anticipated quantization loss for each filter.

Pruning is another technique used to compress neural networks, enhancing their speed during hardware inference. This method involves removing weights that have minimal impact on the overall accuracy of the network. This concept was first introduced in [36], and was applied to neural networks in [37]. Pruning can be categorized as structured, involving the removal of weights in specific blocks (as in [38, 39, 40]), or unstructured, targeting individual weights (as in [41, 42, 43, 44, 45, 40]). In this work, we consider pruning as a form of quantization where pruned weights are effectively quantized to zero bits.

The QKeras [10] framework, like ours, aims to train and optimize neural networks for deployment on FPGAs. QKeras is developed on top of Tensorflow Keras [46] and leverages hls4ml [47] for FPGA deployment. It specializes in training and optimizing neural networks, allowing for the use of arbitrary precision fixed-point numbers for both weights and activations. AutoQKeras, a feature within QKeras, enables automatic tuning of quantization settings for each layer using a gradient-free approach. This can lead to significant compression, including the use of binary or ternary networks [11]. Brevitas [48] serves as the PyTorch [49] equivalent of QKeras, commonly paired with the FINN and FINN-R frameworks from AMD Research [50, 51] for deploying on AMD FPGAs.

III High Granularity Quantization

In this paper, we introduce High Granularity Quantization (HGQ), a novel quantization approach that allows precision to be assigned at any granularity within a single layer, down to an individual bitwidth for each parameter in the network. We begin this section by outlining the fundamentals of quantization and Quantization-Aware Training (QAT). Subsequently, we introduce an innovative gradient-based technique for auto-tuning the quantization bitwidth during training. A comprehensive explanation of the HGQ method and its algorithm follows. This approach is designed to improve the accuracy-resource/latency balance compared to previously studied block-wise heterogeneous quantization methods in neural networks.

III-A Quantization

Quantization is a map, henceforth referred to as $\mathrm{f}^{q}$, that transforms a real number into a finite set of discrete values, mapping from the set of real numbers $\mathbb{R}$ to a discrete subset $\mathbb{Q}\equiv\left\{q_{i}|q_{i+1}>q_{i}\right\}\subset\mathbb{R}$. For hardware efficiency, we ensure that quantized weights and activations are represented as fixed-point numbers, a common practice in hardware for numerical representation. A fixed-point number is essentially an integer scaled by a predefined factor, typically a power of two. It is characterized by its bitwidth (total number of bits) and the number of bits allocated for the integer portion. The inclusion of the sign bit in the integer part, for signed numbers, varies by convention. In this context, we adhere to the convention used in Xilinx® Vivado®/Vitis® HLS, which includes the sign bit in the integer part if present. Following this convention, consider a fixed-point number with $b\in\mathbb{N}_{+}$ bits, of which $i\in\mathbb{Z}$ bits are dedicated to the integer part. We define $f$ as the number of fractional bits, given by $f\equiv b-i$. For signed numbers, the representable range is $[-2^{i-1},2^{i-1}-2^{-f}]$, with a step size of $2^{-f}$. For unsigned numbers, the range is $[0,2^{i}-2^{-f}]$, sharing the same step size.
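The ranges above can be checked with a few lines of Python. This is a minimal sketch; the helper name `fixed_point_range` is hypothetical and not part of any library discussed here.

```python
def fixed_point_range(b: int, i: int, signed: bool = True):
    """Return (min, max, step) of fixed<b,i> (signed) or ufixed<b,i> (unsigned)."""
    if b <= 0:
        raise ValueError("bitwidth b must be positive")
    f = b - i                 # number of fractional bits; may be negative
    step = 2.0 ** -f
    if signed:
        # The sign bit is counted as one of the i integer bits.
        return -(2.0 ** (i - 1)), 2.0 ** (i - 1) - step, step
    return 0.0, 2.0 ** i - step, step


print(fixed_point_range(8, 4))                # (-8.0, 7.9375, 0.0625)
print(fixed_point_range(8, 4, signed=False))  # (0.0, 15.9375, 0.0625)
```

Note that $i$ may exceed $b$ (negative $f$), in which case the step size is larger than one.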

One way of quantizing a real number into a fixed-point format, fixed<b,i>, can be expressed by a rounding function as follows:

$$\begin{aligned}\mathrm{f}^{q}(x)&=\left(\left(\left(\left[x\cdot 2^{f}\right]+2^{b-1}\right)\ \mathrm{mod}\ 2^{b}\right)-2^{b-1}\right)\cdot 2^{-f}\\&=\begin{cases}\left[x\cdot 2^{f}\right]\cdot 2^{-f},&\text{if }x\in[-2^{i-1},2^{i-1}-2^{-f}]\\ \mathrm{overflow}&\text{otherwise}\end{cases},\end{aligned}\tag{1}$$

where $[x]\equiv\lfloor x+\epsilon\rfloor$ with $\epsilon\in[0,1)$ and $f\equiv b-i$. Note that setting $\epsilon=1/2$ applies conventional rounding to the nearest integer. In this context, “overflow” implies that a value exceeds the representable limits of the fixed-point format, causing a cyclical wrap to the opposite end of the range. Although a quantization function could be designed to adjust values outside the permissible range to the closest valid value (for instance, by clipping them to the range limits), we intentionally do not use this approach because of its resource overhead. By judiciously selecting the quantization range, we ensure that overflow does not occur.
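The wrap-around behaviour can be illustrated with a minimal Python sketch of the signed quantizer with $\epsilon=1/2$. The function name `quantize_fixed` is hypothetical. The wrap is taken modulo $2^{b}$, the number of representable codes; this is an assumption made here so that in-range values pass through unchanged.

```python
import math


def quantize_fixed(x: float, b: int, i: int, eps: float = 0.5) -> float:
    """Quantize x to signed fixed<b,i>, wrapping around on overflow."""
    f = b - i
    n = math.floor(x * 2.0 ** f + eps)   # [x * 2^f], with [y] = floor(y + eps)
    half = 1 << (b - 1)
    n = (n + half) % (1 << b) - half     # wrap into [-2^(b-1), 2^(b-1) - 1]
    return n * 2.0 ** -f


print(quantize_fixed(1.3, 8, 4))      # 1.3125, rounded to the nearest step
print(quantize_fixed(7.9375, 8, 4))   # 7.9375, largest representable value
print(quantize_fixed(8.0, 8, 4))      # -8.0, overflow wraps to the other end
```

The last call shows why the integer bitwidth has to be chosen large enough: a value one step above the maximum lands on the minimum instead of saturating.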

For an unsigned fixed-point number, denoted as ufixed<b,i>, the quantization function is described below, using the same terminology:

$$\mathrm{f}^{q}(x)=\left(\left[x\cdot 2^{f}\right]\ \mathrm{mod}\ 2^{b}\right)\cdot 2^{-f}\tag{2}$$
$$\phantom{\mathrm{f}^{q}(x)}=\begin{cases}\left[x\cdot 2^{f}\right]\cdot 2^{-f},&\text{if }x\in[0,2^{i}-2^{-f}]\\ \mathrm{overflow}&\text{otherwise}\end{cases}.\tag{3}$$

In our approach, we only track the number of fractional bits of the fixed-point number during training. Before deploying the network for hardware synthesis (e.g., into HLS projects), we calculate the required number of integer bits to avoid overflow. This task is trivial for weights, as they are only constants after training. For intermediate accumulator and activation values, we employ a calibration dataset to gauge the extreme values (both maximum and minimum) the values might assume. This process involves running the dataset through the network and logging the extreme quantized values, $v^{q}_{\mathrm{max}}$ and $v^{q}_{\mathrm{min}}$. Given the fixed-point number’s range of $[-2^{i-1},2^{i-1}-2^{-f}]$, we can determine the necessary integer bitwidth, $i$, using:

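$$i=\max\left(\lfloor\log_{2}|v^{q}_{\mathrm{max}}|\rfloor+1,\ \lceil\log_{2}|v^{q}_{\mathrm{min}}|\rceil\right)+1.\tag{4}$$

The trailing $+1$ accounts for the sign bit, which is counted in $i$ under the convention adopted above.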

By ensuring the calibration dataset accurately reflects the input data distribution the network will encounter in deployment, we can guarantee that overflow will not occur. For extra safety, one may also add a margin to the computed range to account for potential outliers in the input data. This method eliminates the need to consider the representational range during the training phase. Therefore, the quantization function during training can be expressed as:
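As a sanity check of the calibration step, the following sketch finds the required integer bitwidth by brute force: it searches for the smallest $i$ whose representable range contains both logged extremes, rather than evaluating a closed-form expression. The function name `min_integer_bits` and the search bound are assumptions made for illustration.

```python
def min_integer_bits(v_max: float, v_min: float, f: int, signed: bool = True) -> int:
    """Smallest i such that [v_min, v_max] fits the fixed-point range with f fractional bits."""
    step = 2.0 ** -f
    # Start at i = 1 - f so that the total bitwidth b = i + f is at least 1.
    for i in range(1 - f, 1 - f + 128):
        if signed:
            lo, hi = -(2.0 ** (i - 1)), 2.0 ** (i - 1) - step
        else:
            lo, hi = 0.0, 2.0 ** i - step
        if lo <= v_min and v_max <= hi:
            return i
    raise ValueError("no integer bitwidth in the searched range fits the values")


# Extremes logged from a (hypothetical) calibration run, with f = 4 fractional bits.
print(min_integer_bits(3.5, -4.0, f=4))   # 3: the range is [-4, 3.9375]
print(min_integer_bits(4.0, -4.0, f=4))   # 4: 4.0 itself no longer fits with i = 3
```

A safety margin can be added by widening `v_max` and `v_min` before the search.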

$$\mathrm{f}^{q}(x)=\left[x\cdot 2^{f}\right]\cdot 2^{-f}=\lfloor x\cdot 2^{f}+\epsilon\rfloor\cdot 2^{-f}.\tag{5}$$

Without loss of generality, we assume $\epsilon=1/2$ for the rest of this section. This choice does not affect any of the results or conclusions drawn in this work.

III-B Quantization-Aware Training

Quantization-aware training (QAT) trains neural networks by applying quantization directly during the training pass. Previous work, e.g., [10], demonstrates that QAT significantly reduces the accuracy loss typically caused by quantization. In this work, we adopt the QAT method utilized in [10] as the foundational technique for our HGQ method. Specifically, we employ the straight-through estimator (STE) [52] for weights and activations quantization, which quantizes the values during the forward pass while acting as an identity for computing the gradients in the backward pass. This strategy maintains a good balance between effective quantization and overhead during training.

III-C FPGA resource consumption

A common metric for estimating on-chip resource usage in FPGAs is Bit Operations (BOPs) [53]. BOPs quantify resource usage on the FPGA by counting the number of bit operations performed during the network’s inference. For two numbers with bitwidths $b_{i}$ and $b_{j}$, the number of BOPs is $b_{i}\cdot b_{j}$ for a multiplication operation and $\max(b_{i},b_{j})+1$ for an addition operation. However, BOPs fall short of accurately reflecting resource consumption in many cases for a fully unrolled neural network on an FPGA.

This discrepancy arises from the fact that multiplication operations are usually between a fixed constant and a variable. In particular, for an unrolled implementation on hardware:

1. Declaring a constant in a fixed-point format of $b$ bits does not necessarily mean that all $b$ bits are used. For instance, a weight of 0.5 in an 8-bit fixed-point format only uses 1 bit instead of 8 bits, and counting it as 8 in BOPs leads to an overestimation of resource usage.

2. BOPs tend to double count an accumulation operation that directly follows a multiplication between a constant and a variable: such a multiplication can be implemented as a series of additions of shifted values of the variable. The operation count for a single multiplication involving bitwidths $b_{i}$ and $b_{j}$ thus becomes either $b_{i}\cdot(b_{j}-1)$ or $(b_{i}-1)\cdot b_{j}$. In scenarios involving multiplication-accumulation, the bit operations are approximated as $b_{i}\cdot b_{j}$.
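Both points can be illustrated with a small Python sketch of constant multiplication by shifts and additions. The helper `shift_add_multiply` is hypothetical and only handles non-negative integer constants. Synthesis tools may choose other decompositions, so the adder count is indicative only.

```python
def shift_add_multiply(constant: int, variable: int):
    """Multiply by a non-negative integer constant using only shifts and additions.

    Returns (product, number_of_adders).
    """
    if constant < 0:
        raise ValueError("this sketch handles non-negative constants only")
    # One shifted copy of the variable per non-zero bit of the constant.
    terms = [variable << k for k in range(constant.bit_length()) if (constant >> k) & 1]
    n_adders = max(len(terms) - 1, 0)
    return sum(terms), n_adders


print(shift_add_multiply(13, 5))  # (65, 2): 13 = 0b1101 has three non-zero bits
print(shift_add_multiply(8, 5))   # (40, 0): a single non-zero bit needs no adder
```

The second call corresponds to the first point above: a constant with a single non-zero bit reduces to a pure shift, regardless of its declared bitwidth.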

To address this discrepancy and offer a more precise estimation of on-chip resource usage, we propose a novel metric, Effective Bit Operations (EBOPs).

To address the first issue, the bitwidth used for constants in EBOPs is not the declared bitwidth, but the number of bits enclosed by the leading and trailing non-zero bits. For instance, a weight represented as 001xx1000, where x denotes an arbitrary bit, is counted as 4 bits rather than its full declared bitwidth. This approach ensures that the resource estimate is not inflated by the declared bitwidth.

To address the second issue, EBOPs quantifies only the cumulative BOPs conducted during multiplicative processes in a network. Let $\mathcal{M}=\left\{\{i,j\}_{n}\right\}$ be the set of all multiplication operations between operands with bitwidths $b_{i}$ and $b_{j}$ . The total number of EBOPs can then be expressed as:

$$\mathrm{EBOPs}=\sum_{\{i,j\}\in\mathcal{M}}b_{i}\cdot b_{j}.\tag{6}$$

Here, bit operations in accumulation processes are intentionally excluded, under the assumption that they are implicitly included within the EBOPs framework, which avoids the second issue mentioned above.
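A minimal sketch of how EBOPs could be counted for a fully unrolled dense layer is given below. It assumes that weights are supplied as integer codes (the weight value scaled by $2^{f}$) and that only the magnitude contributes to the effective bitwidth; the function names are hypothetical and do not correspond to the HGQ library API.

```python
def effective_bits(code: int) -> int:
    """Number of bits enclosed by the leading and trailing non-zero bits of a code."""
    n = abs(code)
    if n == 0:
        return 0                              # a zero weight is pruned
    trailing_zeros = (n & -n).bit_length() - 1
    return n.bit_length() - trailing_zeros


def ebops_dense(weight_codes, act_bits):
    """EBOPs of a dense layer; weight_codes[j][k] multiplies the j-th input."""
    return sum(effective_bits(w) * act_bits[j]
               for j, row in enumerate(weight_codes)
               for w in row)


print(effective_bits(0b001011000))   # 4: the pattern 1xx1 between the outer ones
print(effective_bits(64))            # 1: e.g. 0.5 stored with 7 fractional bits
print(ebops_dense([[64, 88], [0, 3]], act_bits=[6, 4]))  # 1*6 + 4*6 + 0*4 + 2*4 = 38
```

For comparison, counting every weight at a declared bitwidth of 8 would give $8\cdot 6\cdot 2+8\cdot 4\cdot 2=160$ for the multiplications of the same layer.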

Experimental findings validate EBOPs as a reliable estimator for on-chip resource consumption, closely mirroring a linear combination of LUT and DSP usage. Detailed results are discussed in Sec. V. To get an accurate resource estimation from EBOPs, one should only include operations that are executed in parallel. For instance, different inputs fed to the same multiplier through a FIFO buffer should be counted only once (e.g., the implementation of convolutions in hls4ml in general). Additionally, this estimation does not include overhead from non-multiplicative processes (e.g., buffers used in hls4ml’s io_stream implementation). Note, however, that these overheads can be estimated separately by other means and added to the final result.

III-D Gradient-based optimization of bitwidths

To obtain a fully-unrolled quantized neural network with minimum resource- or area-usage on-chip, we want each weight and activation bitwidth to be individually optimized. However, in this way, the number of bitwidth parameters could exceed the number of trainable parameters in the original network. The only feasible approach to managing such a vast parameter space is through gradient-based optimization. Nonetheless, direct optimization of these discrete bitwidth values via gradients is not possible due to the absence of a direct gradient path from the loss function to the bitwidths. Therefore, we address two main issues: a) make the discrete bitwidths optimizable with a gradient; and b) estimate a surrogate gradient for these bitwidths.

III-D1 Optimize discrete bitwidths with gradient

The first issue can be straightforwardly addressed by treating the discrete bitwidths similar to the discrete weights in a quantized network. In particular, we apply the straight-through estimator (STE) to real-numbered bitwidths as it is done for the weights, and we follow the STE implementation used in QKeras:

$$\mathrm{ste}(x)=x+\mathrm{sg}([x]-x),\tag{7}$$

where $\mathrm{sg}:\mathbb{R}\rightarrow\mathbb{R}$ is an identity function that detaches the gradient from the enclosed expression. In this way, the bitwidths can be optimized if they have gradients attached. Continuous values for the bitwidths are stored, and they are only rounded to integers as needed during forward passes. During backward passes, the rounding operations default to the identity.
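The effect of storing continuous bitwidths and rounding them only in the forward pass can be sketched without an autodiff framework. In the toy loop below, the loss $(f_{\mathrm{int}}-6)^{2}$ and the learning rate are hypothetical; the point is only that the gradient computed for the rounded bitwidth is applied unchanged to the stored continuous value.

```python
import math


def round_half_up(x: float) -> int:
    return math.floor(x + 0.5)


f_cont = 2.3      # continuous bitwidth that is stored and updated
lr = 0.1
history = []
for _ in range(60):
    f_int = round_half_up(f_cont)    # forward pass sees the integer bitwidth
    grad = 2.0 * (f_int - 6)         # d(toy_loss)/d(f_int), toy_loss = (f_int - 6)^2
    f_cont -= lr * grad              # STE: rounding acts as identity in the backward pass
    history.append(f_int)

print(history[0], history[-1])       # 2 6
```

The integer bitwidth changes only when the continuous value crosses a half-integer, while the stored value keeps accumulating small updates in between.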

III-D2 Surrogate gradient for bitwidths

To address the second issue, we first consider some parameter $x$ (e.g., weight or activation) in the network and its corresponding quantizer $\mathrm{f}^{q}(\cdot)$ . If we require that the quantized number has at most $f$ fractional bits, its associated quantization error $\delta_{f}$ can be expressed as follows with $\epsilon=1/2$ :

$$\delta_{f}\equiv x-\mathrm{f}^{q}(x)=x-\left[x\cdot 2^{f}\right]\cdot 2^{-f}.\tag{8}$$

During training, we assume $x$ to be a random variable following a certain smooth distribution $\mathbb{D}_{x}$ . We further assume that the variance of $\mathbb{D}_{x}$ is significantly larger than the quantization error $\delta_{f}$ in such a way that one can approximate the quantization error as a uniform distribution:

$$\delta_{f}\sim\mathrm{Uniform}(-2^{-f-1},2^{-f-1}).\tag{9}$$

Let the loss of the network be $\mathcal{L}$, and express the gradient of $\mathcal{L}$ with respect to $f$ as

$$\frac{\partial\mathcal{L}}{\partial f}=\frac{\partial\mathcal{L}}{\partial\delta}\cdot\frac{\partial\delta}{\partial f}.\tag{10}$$

In this expression, the first term $\frac{\partial\mathcal{L}}{\partial\delta}$ can be obtained trivially with backpropagation. The second term $\frac{\partial\delta}{\partial f}$ is not well-defined, as $f$ can only take integer values for a properly defined quantizer and thus has no gradient. To address this issue, we propose a surrogate gradient method that assigns a gradient to $f$ only on integer values.

We now express the loss as a function of the weights $\bm{\theta}$ and all the quantization errors $\bm{\delta}$ , $\mathcal{L}(\bm{\theta},\bm{\delta})$ . To obtain the surrogate gradient of $f$ , we assume that the loss function is sensitive to the magnitude of the quantization error, but not the sign: $\mathcal{L}(\bm{\theta},|\bm{\delta}|)$ .

For a parameter $x\sim\mathbb{D}_{x}$ to be quantized with $f\in\mathbb{Z}$ fractional bits, the corresponding absolute quantization error is $|\delta_{f}|\equiv|x-\mathrm{f}^{q}_{f}(x)|\sim\mathrm{Uniform}(0,2^{-f-1})$. By increasing $f$ by one, we obtain the absolute quantization error $|\delta_{f+1}|$ as a function of $f$ and $|\delta_{f}|$:

$$|\delta_{f+1}|=\begin{cases}|\delta_{f}|&|\delta_{f}|\leq 2^{-f-2}\\ 2^{-f-1}-|\delta_{f}|&|\delta_{f}|>2^{-f-2}\end{cases}.\tag{11}$$
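The piecewise relation in Eq. (11) can be verified numerically. The sketch below, with hypothetical helper names, compares the directly computed error at $f+1$ fractional bits with the value predicted from the error at $f$ bits, using round-to-nearest ($\epsilon=1/2$) on randomly drawn inputs.

```python
import math
import random


def abs_error(x: float, f: int) -> float:
    """|x - f^q(x)| for round-to-nearest with f fractional bits."""
    return abs(x - math.floor(x * 2.0 ** f + 0.5) * 2.0 ** -f)


def predicted_next(err_f: float, f: int) -> float:
    """|delta_{f+1}| predicted from |delta_f| by the piecewise relation."""
    if err_f <= 2.0 ** (-f - 2):
        return err_f
    return 2.0 ** (-f - 1) - err_f


rng = random.Random(0)
max_gap = 0.0
for _ in range(10000):
    x, f = rng.uniform(-4.0, 4.0), rng.randint(-2, 8)
    gap = abs(abs_error(x, f + 1) - predicted_next(abs_error(x, f), f))
    max_gap = max(max_gap, gap)

print(max_gap)  # expected to stay at floating-point noise level
```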

We can then obtain the gradient of $|\delta_{f}|$ with respect to $f$ using the finite difference approximation.

$$\frac{\partial|\delta_{f}|}{\partial f}\leftarrow|\delta_{f+1}|-|\delta_{f}|.\tag{12}$$

However, as the absolute quantization error is bounded by a geometric sequence of $2^{-f-1}$ , using a linear difference for approximation is suboptimal. Instead, we use the following heuristic expression to approximate the gradient, which recovers Eq. (12) at the limit of $|\delta_{f+1}|\rightarrow|\delta_{f}|$ :

$$\frac{\partial|\delta_{f}|}{\partial f}\leftarrow|\delta_{f}|\cdot\log\frac{|\delta_{f+1}|}{|\delta_{f}|}.\tag{13}$$

Expressing the ratio of $|\delta_{f+1}|$ and $|\delta_{f}|$ as a function of $|\delta_{f}|$ , we have

$$\frac{|\delta_{f+1}|}{|\delta_{f}|}=\begin{cases}1&|\delta_{f}|\leq 2^{-f-2}\\ \frac{2^{-f-1}}{|\delta_{f}|}-1&|\delta_{f}|>2^{-f-2}\end{cases}.\tag{14}$$

One may get a gradient surrogate by combining Eq. (13) and Eq. (14). However, using the local relation as expressed in Eq. (14) between $|\delta_{f+1}|$ and $|\delta_{f}|$ could lead to a loss landscape for $f$ with extensive high-frequency components that is hard to optimize. To mitigate this issue and smooth out the loss landscape, we take the expectation of the logarithmic factor in Eq. (13) over $|\delta_{f}|$:

$$\mathbb{E}_{|\delta_{f}|}\left[\log\frac{|\delta_{f+1}|}{|\delta_{f}|}\right]=-\log 2.\tag{15}$$
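The value $-\log 2$ can be reproduced numerically. Writing $u=|\delta_{f}|/2^{-f-1}$, which is uniform on $(0,1)$ under the assumption above, the ratio in Eq. (14) equals $1$ for $u\leq 1/2$ and $1/u-1$ otherwise, so the expectation is independent of $f$. The sketch below estimates it with a simple midpoint rule; the function name and the number of sample points are arbitrary choices.

```python
import math


def expected_log_ratio(n: int = 200_000) -> float:
    """Midpoint-rule estimate of E[log(|delta_{f+1}| / |delta_f|)] over u in (0, 1)."""
    total = 0.0
    for k in range(n):
        u = (k + 0.5) / n
        if u > 0.5:                        # for u <= 1/2 the ratio is 1, so log = 0
            total += math.log(1.0 / u - 1.0)
    return total / n


print(expected_log_ratio(), -math.log(2.0))  # the two values should agree closely
```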

By substituting Eq. (15) into Eq. (13) and multiplying both sides by $\mathrm{sign}(\delta_{f})$, we obtain the surrogate gradient for $f$:

$$\frac{\partial\delta_{f}}{\partial f}\leftarrow-\log 2\cdot\delta_{f}.\tag{16}$$

Hence, the forward pass of the quantizer, with respect to one input value $x$ and its float bitwidth $f$ , can be expressed as in Algorithm 1, and the backward pass is the auto-differentiation of the forward pass with the stop-gradient operations.

Data: $x$: the input value; $f$: the float bitwidth

Result: $x^{q}$: the quantized value of $x$ with float bitwidth $f$

$f\leftarrow\mathrm{ste}(f)$;

$x^{q}\leftarrow\mathrm{sg}([x\cdot 2^{f}]\cdot 2^{-f})$;

$\delta\leftarrow\mathrm{sg}(x-x^{q})$; // Standard STE-based quantization

$\delta\leftarrow\mathrm{sg}(\delta+\ln 2\cdot f\cdot\delta)-\ln 2\cdot f\cdot\delta$; // Attach gradients of $f$ to $\delta$

$x^{q}\leftarrow x-\delta$; // Attach gradients of $f$ and $x$ to $x^{q}$

return $x^{q}$

Algorithm 1 Quantizer forward pass
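To make the gradients implied by Algorithm 1 explicit, the following sketch computes the forward value together with a hand-written backward pass for a single scalar, in place of the automatic differentiation used in practice. The function name is hypothetical. Since $x^{q}=x-\delta$ and the surrogate in Eq. (16) gives $\partial\delta/\partial f=-\ln 2\cdot\delta$, the quantized value receives $\partial x^{q}/\partial x=1$ and $\partial x^{q}/\partial f=\ln 2\cdot\delta$.

```python
import math


def quantizer_forward_backward(x: float, f: float, grad_out: float = 1.0):
    """Forward value and surrogate gradients of the quantizer for one scalar.

    grad_out is dL/dx_q; returns (x_q, dL/dx, dL/df).
    """
    f_int = math.floor(f + 0.5)                       # ste(f): integer in the forward pass
    x_q = math.floor(x * 2.0 ** f_int + 0.5) * 2.0 ** -f_int
    delta = x - x_q                                   # treated as a constant (stop-gradient)
    grad_x = grad_out                                 # straight-through with respect to x
    grad_f = grad_out * math.log(2.0) * delta         # d(x_q)/df = +ln(2) * delta
    return x_q, grad_x, grad_f


x_q, grad_x, grad_f = quantizer_forward_backward(0.3, 2.0)
print(x_q, grad_x, grad_f)   # 0.25, 1.0 and ln(2) * 0.05 (about 0.0347)
```

The gradient on $f$ is proportional to the current quantization error, so a parameter that is already represented exactly ($\delta=0$) exerts no pull on its bitwidth through the loss.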

Quantization often results in higher loss values, causing the gradients from the loss function to propagate to the bitwidths and increase them. To address this, we introduce a regularization term to prevent the bitwidths from growing too large. We use EBOPs as this regularization term, incorporating it into the loss function with a regularization coefficient $\beta$ to balance accuracy against on-chip resource usage. Moreover, for network values not involved in multiplicative operations (such as last-layer outputs or inputs to non-linear activations), we apply an L-1 regularization with a coefficient $\gamma$ to the bitwidths, preventing them from expanding unnecessarily. The final loss function is given by

$$\mathcal{L}=\mathcal{L}_{\mathrm{base}}+\beta\cdot\mathrm{EBOPs}+\gamma\cdot\mathrm{L1}_{\mathrm{norm}},\tag{17}$$

with gradients attached to the bitwidths.

As all additional gradients introduced in this section are directly added to the bitwidths, the loss landscape of the network’s weights remains unperturbed compared to that of networks with static quantization parameters. Consequently, it eliminates the alterations to the loss landscape which are introduced by regularization-based quantization methods, like [54, 55].

III-D3 Gradient for bitwidths with multiple parameters

In experiments, we noticed that when the parameter group size increases while keeping the same $\beta$, the corresponding bitwidth is more likely to collapse to zero, causing training to break down. We hypothesize that this effect is due to the non-uniformity of the gradients contributed by the parameters in the group. To mitigate such effects, we normalize the gradient on $f_{i}$ by $1/\sqrt{||g_{i}||}$, based on empirical observations. Here, $||g_{i}||$ denotes the number of elements in $g_{i}$. The quantizer’s forward pass with respect to a parameter group $g_{i}$ can be described as in Algorithm 2 during training and Eq. (5) during inference. The backward pass is derived from the forward pass automatically.

Data: $g_{i}$: the $i$-th parameter group; $f_{i}$: the number of fractional bits for the $i$-th parameter group

Result: $(g^{q})_{i}$: the quantized parameters of the $i$-th parameter group

if $g_{i}$ is a group of weights then

  $f_{i}\leftarrow\mathrm{ste}(f_{i})$;

else

  $f_{i}\leftarrow f_{i}$;

end if

$(g^{q})_{i}\leftarrow$ empty list;

$N\leftarrow||g_{i}||$;

forall $v_{j}$ in $g_{i}$ do

  $v^{q}\leftarrow\mathrm{sg}([v_{j}\cdot 2^{f_{i}}]\cdot 2^{-f_{i}})$;

  $\delta\leftarrow\mathrm{sg}(v_{j}-v^{q})$;

  $\delta\leftarrow\mathrm{sg}(\delta+\frac{\ln 2\cdot f_{i}\cdot\delta}{\sqrt{N}})-\frac{\ln 2\cdot f_{i}\cdot\delta}{\sqrt{N}}$;

  ${v^{q}}_{j}\leftarrow v_{j}-\delta$;

  append ${v^{q}}_{j}$ to $(g^{q})_{i}$;

end forall

return $(g^{q})_{i}$

Algorithm 2 Quantizer forward pass for a parameter group
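The group version can be sketched in the same hand-written style. The function below is hypothetical, treats the group as weights (so the shared bitwidth is rounded in the forward pass), and accumulates the gradient on the shared bitwidth from all members with the $1/\sqrt{N}$ normalization.

```python
import math


def group_quantizer(values, f: float, grads_out):
    """Quantize a group sharing one bitwidth; return (quantized values, dL/df).

    grads_out[j] is dL/d(v_q[j]) for the j-th member of the group.
    """
    n = len(values)
    f_int = math.floor(f + 0.5)
    scale = 2.0 ** f_int
    quantized, grad_f = [], 0.0
    for v, g in zip(values, grads_out):
        v_q = math.floor(v * scale + 0.5) / scale
        delta = v - v_q                                   # constant under stop-gradient
        quantized.append(v_q)
        grad_f += g * math.log(2.0) * delta / math.sqrt(n)   # 1/sqrt(N) normalization
    return quantized, grad_f


q, grad_f = group_quantizer([0.3, -0.3, 0.6, 0.1], 2.0, [1.0, 1.0, 1.0, 1.0])
print(q)        # [0.25, -0.25, 0.5, 0.0]
print(grad_f)   # ln(2) * (0.05 - 0.05 + 0.1 + 0.1) / 2, about 0.0693
```

Errors of opposite sign within a group partially cancel in the shared gradient, which is one way the per-member contributions can become non-uniform.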

III-D4 Connection to Pruning

From Eq. (5), it can be seen that the quantized value is zero whenever $-\epsilon\cdot 2^{-f}\leq x<(1-\epsilon)\cdot 2^{-f}$. Given that $f$ can take both positive and negative values, a sufficiently small $f$ with $\epsilon>0$ will cause certain parameters in the network to always be zero. This is equivalent to pruning the parameter. Assigning a distinct bitwidth to each parameter group in the network through HGQ thus automatically prunes the network during training in an unstructured manner, taking both loss and resource consumption into account.
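The pruning effect can be seen directly by lowering $f$ for a fixed value. The sketch below uses hypothetical helper names and $\epsilon=1/2$; it prints the interval of inputs that quantize to zero together with the quantized value of $x=0.3$.

```python
import math


def quantize(x: float, f: int, eps: float = 0.5) -> float:
    """Training-time quantizer: floor(x * 2^f + eps) * 2^-f."""
    return math.floor(x * 2.0 ** f + eps) * 2.0 ** -f


def zero_interval(f: int, eps: float = 0.5):
    """Half-open interval [lo, hi) of inputs that quantize to exactly zero."""
    return -eps * 2.0 ** -f, (1.0 - eps) * 2.0 ** -f


for f in (4, 0, -2):
    print(f, zero_interval(f), quantize(0.3, f))
# f = 4:  interval [-0.03125, 0.03125), 0.3 -> 0.3125
# f = 0:  interval [-0.5, 0.5),         0.3 -> 0.0
# f = -2: interval [-2.0, 2.0),         0.3 -> 0.0
```

Once the zero interval covers every value a parameter takes, the parameter is constantly zero and the corresponding connection can be removed.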

Listing 1: HGQ model example.
from tensorflow.keras.layers import Input
from tensorflow.keras.models import Model
from HGQ import HQuantize, HDense
r = 1e-6  # resource regularization factor; example value, to be chosen by the user
inp = Input((16,))
out = HQuantize(name='inp_q')(inp)  # additional layer that quantizes the inputs
out = HDense(64, activation='relu', bops_reg_factor=r)(out)
out = HDense(32, activation='relu', bops_reg_factor=r)(out)
out = HDense(5, activation='linear', bops_reg_factor=r)(out)
hgq_model = Model(inp, out)

Listing 2: Keras model example.
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model
inp = Input((16,))
out = Dense(64, activation='relu')(inp)
out = Dense(32, activation='relu')(out)
out = Dense(5, activation='linear')(out)
keras_model = Model(inp, out)

IV The High Granularity Quantization Framework

The HGQ algorithm is available as a user-friendly Python library, similar to QKeras [10]. It functions as a quantization API built on top of Keras [56] and uses hls4ml [47] for deployment. The framework also automatically converts a tensorflow.keras model into an hls4ml model that is bit-accurate with respect to a user-defined dataset, without requiring manual intervention.

Following the methodology of QKeras, HGQ encompasses most of the layers available in hls4ml, and adheres to the design principles of Keras. HGQ is engineered to carry out automatic quantization on all compatible layers according to the EBOPs regularization factor, $\beta$ . This approach eliminates the necessity for users to fine-tune quantization parameters for individual modules or undergo multiple training cycles to identify the best quantization scheme.

HGQ provides drop-in replacements for the most commonly used Keras layers, making it straightforward to move from a standard Keras model to an HGQ model with minimal adjustments. For instance, converting the Keras model in Listing 2 into the HGQ model in Listing 1 primarily involves substituting the existing layers with their HGQ counterparts and adding one layer to quantize the inputs. Within the HGQ framework, there are two categories of layers: Heterogeneous (H-) layers, which accept an additional parameter, beta, to control the layer’s resource usage regularization based on EBOPs, and Passive (P-) layers, which relay metadata without performing quantization. The H- layers also accept layer-specific kernel and pre-activation quantizer configurations (kq_config and paq_config, respectively) for finer control over the quantization behavior.

HGQ simplifies the process by auto-learning optimal quantization parameters during training, thus mostly freeing the user from manually specifying bit widths and scaling factors for each layer. Instead, users are primarily concerned with setting the EBOPs’ regularization factor, although manual parameter adjustment is still an option.

Beyond quantization-aware training, HGQ introduces a convenient intermediate layer, the proxy model, for transitioning a trained Keras model to hls4ml. It accommodates both HGQ and QKeras models and automates the creation of hls4ml configurations for precise conversions. Furthermore, the proxy model provides bit-accurate emulation of the compiled hls4ml model, which aids in debugging and validating the hls4ml model during development. The emulation remains bit-accurate even when overflows occur within the available bitwidth, up to the accuracy permitted by the float32 precision used in the emulation.

IV-A Resource Consumption Surrogate

We consider five major types of resources on a Xilinx FPGA chip: LUTs, DSPs, FFs, BRAMs, and URAMs. When fitting an unrolled neural network for ultra-low latency applications, such as the L1 triggers at the LHC, the limiting resources are usually either LUTs or DSPs. Empirically, operations with larger bitwidths are more likely to consume DSPs, while operations with smaller bitwidths are more likely to consume LUTs. In our experiments, we observed that EBOPs roughly predict a linear combination of the LUT and DSP consumption, namely, EBOPs $\approx$ LUTs + $55\times$ DSPs for models synthesized with io_type=io_parallel in hls4ml, where intermediate values are directly wired between layers/modules.

In Fig. II, we demonstrate the relationship between EBOPs and the actual on-chip resource consumption, represented by post place and route LUTs + $55\times$ DSPs, synthesized with Xilinx Vivado 2020.1/Vitis 2023.2 for models shown in this work. Although the relationship isn’t exact, we can still make a reasonable estimation of resource usage based on EBOPs, even during training. This suggests that treating one DSP as equivalent to approximately 55 LUTs could be a practical approximation for comparing resource usage across different models, although this may not hold universally. It’s important to note that EBOPs primarily account for operations involving a multiplication-accumulation of one constant and one variable. Therefore, if operations other than these significantly contribute to the consumption of on-chip resources, the EBOPs-based estimation might not be reliable. For instance, in the SVHN classifier model synthesized with io_type=io_stream, the resource usage by FIFO buffers isn’t factored in, leading to an underestimation of total resource consumption as predicted by EBOPs.
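The linear relation can be applied directly to the LUT and DSP counts reported in Table I. The snippet below is a small arithmetic sketch and not part of HGQ: the function name is ours, and the factor of 55 is the empirical weight quoted above for io_parallel designs.

```python
def lut_equivalent(luts, dsps, dsp_weight=55):
    """Combine LUT and DSP counts into one figure: LUTs + dsp_weight * DSPs."""
    return luts + dsp_weight * dsps


# (LUTs, DSPs) copied from Table I
bf = lut_equivalent(48_321, 1_826)  # BF baseline [10]
hgq1 = lut_equivalent(6_236, 34)    # HGQ-1
hgq4 = lut_equivalent(565, 0)       # HGQ-4

print(bf, hgq1, hgq4)               # 148751 8106 565
print(round(1 - hgq1 / bf, 3))      # 0.946, relative reduction of HGQ-1 vs. BF
```

Expressing both resources in LUT-equivalents gives a single number per model, which is what makes a comparison against the EBOPs estimated at training time possible.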

V Results

To evaluate the performance of the HGQ framework, we train and evaluate models on a classification, a computer vision, and a regression task: jet tagging at the LHC [10], SVHN digit classification [57], and muon tracking [58], respectively.

To demonstrate the trade-off between accuracy (or resolution) and resource usage, we methodically adjusted the $\beta$ factor for each task during training to map out various optimal points on the accuracy (resolution) vs. resource consumption Pareto Front. This process involved initiating with a notably low $\beta$ value and incrementally raising it through the training, capturing all models that align with the Pareto Front, defined by validation accuracy (or resolution) and estimated resource consumption via EBOPs. Meanwhile, we maintained the $\gamma$ value fixed at 2.e-6 for all experiments to avert the risk of layers diverging in bitwidths. Post-training, we conduct a reassessment of the models using the test set, providing details on accuracy or resolution based on c-synthesis, and detailing resource consumption after the place-and-route phase.
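The bookkeeping described above, keeping every checkpoint that lies on the accuracy versus EBOPs Pareto front, can be sketched as follows. This is an illustrative helper rather than the code used for the experiments; the function name and the checkpoint values are hypothetical.

```python
def pareto_front(points):
    """Return the (accuracy, ebops) pairs not dominated by any other pair.

    A point is dominated if another point has accuracy >= and ebops <=,
    with at least one of the two strictly better.
    """
    front = []
    # Sort by ascending cost; break ties by descending accuracy.
    for acc, ebops in sorted(points, key=lambda p: (p[1], -p[0])):
        # After sorting, a point is on the front only if it beats the
        # best accuracy seen so far at lower or equal cost.
        if not front or acc > front[-1][0]:
            front.append((acc, ebops))
    return front


# Hypothetical (validation accuracy, EBOPs) pairs collected during one run
checkpoints = [(0.76, 9000), (0.75, 4000), (0.74, 5000), (0.72, 1500), (0.75, 4500)]
front = pareto_front(checkpoints)
print(front)  # [(0.72, 1500), (0.75, 4000), (0.76, 9000)]
```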

V-A Jet Classification at the LHC

We conducted a comparison of the accuracy, latency, and on-chip resource utilization of models trained with HGQ against various quantized models from earlier research.

We use the dataset from [59]. This dataset is for classifying jets, a kind of particle shower produced by high-energy particles at the LHC experiments, into five classes based on their originating particles: quark (q), gluon (g), W boson, Z boson, and top (t) jets. The inputs for each jet are 16 scalar values representing physics-motivated high-level features. The model architecture is based on the full-precision baseline model described in the original work [10], which is a 4-layer fully connected neural network. The exact model architecture is shown in Fig. VI in extended data.

The results are summarized in Table I. In the table, the following models are cited from [10]: BF, BP, BH, Q6, QE, and QB. In that work, various techniques such as Quantization-Aware Training (QAT), pruning, and automated parameter optimization using a Gaussian Process (for the QE and QB models) were used to achieve low resource consumption. LogicNets JSC-M and JSC-L are cited from [60], where the networks are designed to use on-chip LUTs efficiently. BP-DSP-RF=2 is cited from [38], where the network is implemented in QKeras with a reuse factor of two, and pruned in a DSP-aware fashion to reduce the resource consumption. MetaML- $\alpha_{\mathrm{q}}\mathrm{=}1\%$ and MetaML- $\alpha_{\mathrm{q}}\mathrm{=}4\%$ are cited from [30]. These two networks went through iterative architecture search, quantization, and pruning for a better accuracy-resource trade-off. SymbolNet is cited from [61], which leverages a gradient-based method to search for optimal symbolic expressions. It also uses an adaptive dynamic pruning scheme to reduce on-chip resource consumption while maintaining accuracy.

As shown in Fig. III and Tab. I, the HGQ models outperform the baseline models by a significant margin in both accuracy and resource usage. Depending on the working point, HGQ models reduce the resource consumption by 50% to 95% while maintaining the same accuracy. At lower accuracy requirements, they also achieve resource consumption similar to that of an optimized symbolic regressor.

The HGQ-trained models, HGQ 1 through 8, are taken from the same training run, during which $\beta$ was ramped up. These models were initially set to use 2 bits for representing the activations’ floating-point part and 2 bits in total for the weights. Throughout the training process, which spanned roughly 300,000 epochs, $\beta$ was gradually increased from $10^{-6}$ to $10^{-4}$ . Due to the models’ compact size, this entire training process could be completed in just a few hours using a standard consumer-grade GPU.
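The text does not specify the functional form of the $\beta$ ramp. As one possible realization, the sketch below interpolates geometrically between the two end points quoted above, so that $\beta$ grows by a constant factor per epoch. The function name and the geometric shape are assumptions made for illustration only.

```python
def beta_schedule(epoch, total_epochs, beta_start=1e-6, beta_end=1e-4):
    """Geometric interpolation between beta_start and beta_end (assumed shape)."""
    t = min(max(epoch / total_epochs, 0.0), 1.0)  # training progress in [0, 1]
    return beta_start * (beta_end / beta_start) ** t


total = 300_000
for epoch in (0, 150_000, 300_000):
    print(epoch, f"{beta_schedule(epoch, total):.1e}")
# 0 1.0e-06
# 150000 1.0e-05
# 300000 1.0e-04
```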

We also studied the performance of the models trained with fixed, non-zero $\beta$ values. In Fig. III and Tab. I, these correspond to HGQ-c1, trained with a fixed $\beta$ of 2.1e-6, and HGQ-c2, using a $\beta$ of 1.2e-5. Each is trained for 5,000 epochs (lasting a few minutes on a consumer grade GPU). Models trained with either a constant or incrementally increasing $\beta$ value are capable of achieving a comparable balance between accuracy and resource consumption. From this, we conclude that the method of progressively ramping up $\beta$ throughout the training process is effective in generating a collection of models that represent an optimal compromise between accuracy and resource efficiency, situating them favorably on the accuracy-resource Pareto Frontier.

TABLE I: Resource consumption and latency of the jet tagging models. Resources reported for HGQ models are after place & route. In this task, HGQ models outperform the baseline models by a large margin in both accuracy and resource consumption. The LGQ model is not high-granularity quantized, but only uses gradient-based bitwidth optimization.

Model	Accuracy (%)	Latency (cc)	DSP (%)	LUT (%)	FF (%)	II (cc)
BF [10]	74.4	9 (45 ns)	56.0 (1,826)	4.09 (48,321)	0.8 (20,132)
BP [10]	74.8	14 (70 ns)	7.7 (526)	1.49 (17,577)	0.4 (10,548)
BH [10]	73.2	14 (70 ns)	1.3 (88)	1.34 (15,802)	0.3 (8,108)
Q6 [10]	74.8	11 (55 ns)	1.8 (124)	3.36 (39,782)	0.3 (8,128)
QE [10]	72.3	11 (55 ns)	1.0 (66)	0.77 (9,149)	0.1 (1,781)
QB [10]	71.9	14 (70 ns)	1.0 (69)	0.95 (11,193)	0.1 (1,771)
LogicNets JSC-M [60]	70.6	N/A	0 (0)	1.22 (14,428)	0.02 (440)
LogicNets JSC-L [60]	71.8	5 (13 ns)	0 (0)	3.21 (37,931)	0.03 (810)
BP-DSP-RF=2 [38]	76.3	21 (105 ns)	2.6 (175)	0.47 (5,504)	0.13 (3,036)	2
MetaML- $\alpha_{\mathrm{q}}\mathrm{=}1\%$ [30]	75.6	9 (45 ns)	0.7 (50)	0.57 (6,698)	N/A	1
MetaML- $\alpha_{\mathrm{q}}\mathrm{=}4\%$ [30]	72.8	8 (40 ns)	0.2 (23)	0.57 (7,224)	N/A	1
SymbolNet [61]	71.	2 (10 ns)	$<$ 0.1 (3)	0.01 (177)	$<$ 0.01 (109)	1
HGQ-1	76.4	6 (30 ns)	0.50 (34)	0.53 (6,236)	0.05 (1,253)	1
HGQ-2	75.9	4 (20 ns)	0.09 (6)	0.27 (3,162)	0.02 (550)	1
HGQ-3	75.0	4 (20 ns)	0.07 (5)	0.13 (1,540)	0.02 (370)	1
HGQ-4	73.9	3 (15 ns)	0.00 (0)	0.05 (565)	0.01 (140)	1
HGQ-5	72.5	2 (10 ns)	0.00 (0)	0.04 (468)	0.01 (131)	1
HGQ-6	71.0	2 (10 ns)	0.00 (0)	0.02 (256)	0.00 (66)	1
HGQ-c1	76.3	8 (40 ns)	0.26 (18)	0.50 (5,899)	0.09 (2,072)	1
HGQ-c2	74.2	3 (15 ns)	0.00 (0)	0.06 (678)	0.01 (172)	1

V-B SVHN Classifier

We also benchmark HGQ on a computer vision task and compare it to previous state-of-the-art work [57, 38] on the SVHN dataset [62]. The SVHN dataset consists of $32\times 32$ RGB images of house numbers taken from Google Street View. The task is to classify the digit in the center of the image into one of ten classes. The architecture of the model is a LeNet-like [63], convolution-dense network directly taken from [57]. The exact model architecture is shown in Fig. VII in extended data.

These results are summarized in Tab. II and Fig. IV. In the table, the following models are taken from [57]: AQP, AQ, QP 7-bit, Q 7-bit, and BP 14-bit. All models are quantized, and pruning to a sparsity of 50% is applied to AQP, QP and BP. AQP and AQ are heterogeneously quantized models, where the quantization configuration is obtained using a Gaussian Process hyperparameter optimization, and are trained quantization-aware with QKeras. QP 7-bit, Q 7-bit, and BP 14-bit are homogeneously quantized models trained quantization-aware with QKeras. BP-DSP-RF=3 is cited from [38], where the network is implemented in QKeras with a reuse factor of three, and pruned in a DSP-aware fashion to reduce the resource consumption.

The HGQ-trained models, HGQ 1 through 7, are taken from a single training run during which the $\beta$ value was gradually increased. We initialize the models using 6 bits for the floating points for activations, and 6 bits in total for weights. The $\beta$ value was systematically increased from $10^{-7}$ to $10^{-4}$ over approximately 12,000 epochs. Completing this training process required about 10 hours on a standard consumer-grade GPU.

This model is too large to fit on-chip when fully unrolled; instead, we use io_stream in hls4ml to partition the convolutional layers into smaller blocks. The resource consumption is estimated as the sum of the resource consumption of each block. For this reason, intra-layer heterogeneous activation quantization cannot be used, and only inter-layer heterogeneous weight quantization is performed. Nevertheless, HGQ still outperforms both baselines by a considerable margin of up to 30% in resource savings while maintaining similar accuracy and latency.

TABLE II: Resource usage and latency of the convolutional SVHN classifier models. Reported resource usage for HGQ models is after place & route. In this task, the HGQ-0.4 and HGQ-1.5 models outperform the baseline AQ and AQP models by a large margin in accuracy while also using fewer resources.

Model	Accuracy (%)	Latency (cc)	DSP (%)	LUT (%)	FF (%)	BRAM (%)	II (cc)
BF 14-bit [57]	87	1,035 (5.18 $\mu$ s)	93.23 (6,377)	19.36 (228,823)	3.40 (80,278)	3.08 (66.5)	1,030
BP 14-bit [57]	93	1,035 (5.18 $\mu$ s)	48.85 (3,341)	12.27 (145,089)	2.77 (65,482)	3.08 (66.5)	1,030
Q 7-bit [57]	94	1,034 (5.17 $\mu$ s)	2.56 (175)	12.77 (150,981)	1.51 (35,628)	3.10 (67.0)	1,029
QP 7-bit [57]	94	1,035 (5.18 $\mu$ s)	2.54 (174)	9.40 (111,152)	1.38 (32,554)	3.10 (67.0)	1,030
AQ [57]	88	1,059 (5.30 $\mu$ s)	1.05 (72)	4.06 (48,027)	0.64 (15,242)	1.48 (32.5)	1,029
AQP [57]	88	1,059 (5.30 $\mu$ s)	1.02 (70)	3.28 (38,795)	0.63 (14,802)	1.39 (30.5)	1,029
BP-DSP-RF=3 [38]	92	? (43.58 $\mu$ s)	17.76 (1,215)	5.01 (59,279)	1.97 (46,584)	35.88 (1,550)	35.88
HGQ-1	93.9	1,050 (5.25 $\mu$ s)	0.85 (58)	5.87 (69,407)	1.18 (27,853)	1.48 (32.0)	1,029
HGQ-2	93.1	1,061 (5.31 $\mu$ s)	0.44 (30)	4.00 (47,314)	0.87 (20,582)	1.30 (28.0)	1,029
HGQ-3	91.9	1,058 (5.29 $\mu$ s)	0.22 (15)	3.39 (40,032)	0.76 (18,087)	1.09 (23.5)	1,029
HGQ-4	90.9	1,059 (5.30 $\mu$ s)	0.19 (13)	2.91 (34,435)	0.73 (17,261)	1.04 (22.5)	1,029
HGQ-5	89.9	1,056 (5.28 $\mu$ s)	0.15 (10)	2.60 (30,766)	0.64 (15,205)	0.97 (21.0)	1,029
HGQ-6	88.8	1,056 (5.28 $\mu$ s)	0.09 (6)	2.37 (27,982)	0.62 (14,736)	0.97 (21.0)	1,029

V-C Muon Tracker

We also compare the resolution, latency, and on-chip resource consumption of HGQ-trained models with those of the baseline models on a regression task proposed in [58].

This task involves predicting the polar angle of a simulated muon track in a simplified detector. The inputs are one $3\times 50$ and two $3\times 50$ binary-valued arrays, representing the hit patterns in three detector layers. The output is a single scalar value representing the polar angle of the track in milliradians. The architecture of the model is a multistage neural network taken from the original work [58]. The exact model architecture is available in Fig. VIII in extended data.

The results are presented in Tab. III and Fig. V. The Qf models presented in [58] are all trained quantization aware with QKeras using manually tuned parameters.

The HGQ-trained models, HGQ 1 through 7, are taken from a single training run during which the $\beta$ value was gradually increased. We initialize the models with 6 bits for the floating points for activations, and 6 bits in total for weights. The $\beta$ value was systematically increased from 3.e-6 to 6.e-4 over approximately 600,000 epochs. The complete run takes around 16 hours on a single consumer-grade GPU.

The HGQ models consistently outperform the baseline models with a $\sim 40$ % reduction in resource consumption, while maintaining the same resolution with comparable latency.

TABLE III: Resource consumption and latency of the Muon Tracker models. The resource usage reported for HGQ models is after place & route. In this task, the HGQ-1.25 model outperforms the baseline Qf6 model in both accuracy and resource consumption, while the HGQ-3.00 model outperforms the baseline Qf5 model in both accuracy and resource consumption.

Model	Resolution (mrad)	Latency (cc)	DSP (%)	LUT (%)	FF (%)	BRAM (%)	II (cc)
Qf8 [58]	1.95	17 (106.3 ns)	57.4 (1,762)	8.8 (37,867)	1.0 (8,443)	5.6 (37.5)	1
Qf7 [58]	1.97	11 (68.8 ns)	45.2 (1,389)	8.0 (34,848)	0.6 (5,433)	5.6 (37.5)	1
Qf6 [58]	2.04	13 (81.3 ns)	10.5 (324)	12.6 (54,638)	0.8 (6,525)	5.6 (37.5)	1
Qf5 [58]	2.15	11 (68.8 ns)	2.9 (88)	9.3 (40,039)	0.4 (3,419)	5.6 (37.5)	1
Qf4 [58]	2.45	10 (62.5 ns)	0.8 (24)	6.6 (28,526)	0.3 (2,954)	5.6 (37.5)	1
Qf3 [58]	2.78	9 (56.3 ns)	0.0 (2)	5.0 (21,682)	0.3 (2,242)	5.6 (37.5)	1
HGQ-1	1.95	11 (68.8 ns)	17.0 (522)	9.12 (39,413)	0.70 (6,043)	1.16 (25.0)	1
HGQ-2	2.00	11 (68.8 ns)	5.01 (154)	7.98 (34,460)	0.61 (5,263)	1.16 (25.0)	1
HGQ-3	2.09	12 (75.0 ns)	2.21 (68)	5.77 (24,941)	0.54 (4,677)	1.74 (37.5)	1
HGQ-4	2.20	13 (81.3 ns)	1.33 (41)	4.99 (21,557)	0.54 (4,699)	1.74 (37.5)	1
HGQ-5	2.39	10 (62.5 ns)	0.88 (27)	3.92 (16,918)	0.29 (2,484)	1.74 (37.5)	1
HGQ-6	2.63	12 (75.0 ns)	0.33 (10)	3.08 (13,306)	0.40 (3,429)	1.16 (25.0)	1

VI Conclusion and Future Work

In this work, we present HGQ, a novel method to optimize quantized neural networks for real-time applications on Field-Programmable Gate Arrays (FPGAs). Maximally leveraging the ability of FPGAs to perform fully heterogeneous computation, we introduce a new algorithm for precisely determining the optimal quantization precision for each weight and activation to minimize resource consumption without sacrificing the accuracy of the original model. To facilitate adoption, we have developed a user-friendly library that simplifies the application of this method. The HGQ approach enables the optimization of quantization bitwidths at arbitrary granularity up to individual parameter level, through a gradient descent approach that is conscious of both resource use and loss minimization. Additionally, the library offers an easy-to-use interface for defining quantized neural networks and training them with our method, as well as for deploying these networks on FPGAs by integrating with hls4ml.

Our findings show that HGQ achieves up to a 95% reduction in resource consumption compared to leading compression techniques, without compromising performance. We further demonstrate that a single training session with HGQ is sufficient to explore a broad spectrum of trade-offs between performance and resource utilization, efficiently recovering the Pareto frontier and making the model optimization process both more efficient and more effective. Through its interface with hls4ml, HGQ provides a bit-accurate conversion from software to FPGA firmware models without the need for user interaction, significantly simplifying and streamlining the workflow from training to deployment. Moreover, we introduce EBOPs, a metric providing an accurate estimation of the final on-chip resource consumption as a linear combination of LUTs and DSPs. This estimation is available at training time, allowing for efficient software-hardware co-design.

In the future, we plan to extend support for more operations and layers. We also aim to support other training back-ends, such as PyTorch [49] and JAX [64]. Furthermore, we plan to include energy estimates as well as more fine-grained resource estimations in the library.

VII Code Availability

We have made our library publicly available under the Apache 2.0 license at https://www.github.com/calad0i/HGQ. The scripts to reproduce the results in this paper are also available at https://www.github.com/calad0i/HGQ-demos under the Apache 2.0 license.

To use this library, one needs a forked version of hls4ml available at https://www.github.com/calad0i/hls4ml#HGQ-integration. The fork will be merged into the main hls4ml repository in the future, and one may check https://github.com/fastmachinelearning/hls4ml/pull/914 for the pull request status.

VIII Data Availability

The data used for training and evaluation in this work are all publicly available datasets. The jet tagging dataset is available at https://dx.doi.org/10.5281/zenodo.2603255. The SVHN dataset is available at http://ufldl.stanford.edu/housenumbers/. The muon tracking dataset is available at https://dx.doi.org/10.57967/hf/2084. Results shown in this work can be reproduced using the code available at https://www.github.com/calad0i/HGQ-demos.

IX Author contributions

C.S. conceived, designed, and implemented the HGQ method and library and performed the experiments. C.S. and V.L. implemented HGQ support in hls4ml. C.S. and T.Å. wrote the manuscript. All authors reviewed and edited the manuscript.

X Acknowledgements

C.S. is supported by the Caltech Danny Koh grad fellowship. C.S. and M.S. acknowledge support from the U.S. Department of Energy (DOE), Office of Science, Office of High Energy Physics grant DE-SC0011925. T.Å. is supported by the Swiss National Science Foundation Grant No. PZ00P2_201594. J.N. is supported by the U.S. Department of Energy (DOE), Office of Science, Office of High Energy Physics “Designing efficient edge AI with physics phenomena” Project (DE-FOA-0002705). V.L. is supported by the NSF Institute for Accelerated AI Algorithms for Data-Driven Discovery (A3D3), under the NSF grant #PHY-2117997.

XI Competing Interests

The authors declare no competing interests.

References

[1] Singh, R. & Gill, S. S. Edge ai: A survey. Internet of Things and Cyber-Physical Systems 3, 71–92 (2023). URL https://www.sciencedirect.com/science/article/pii/S2667345223000196.
[2] Niu, W. et al. Grim: A general, real-time deep learning inference framework for mobile devices based on fine-grained structured weight sparsity. IEEE Trans. Pattern Anal. Mach. Intell. 44, 6224–6239 (2022). URL https://doi.org/10.1109/TPAMI.2021.3089687.
[3] Huang, K. & Gao, W. Real-time neural network inference on extremely weak devices: agile offloading with explainable ai. In Proceedings of the 28th Annual International Conference on Mobile Computing And Networking, MobiCom ’22, 200–213 (Association for Computing Machinery, New York, NY, USA, 2022). URL https://doi.org/10.1145/3495243.3560551.
[4] Yang, Y. et al. Streamvc: Real-time low-latency voice conversion (2024). URL https://google-research.github.io/seanet/stream_vc/.
[5] The CMS Collaboration. The Phase-2 Upgrade of the CMS Level-1 Trigger. Tech. Rep., CERN, Geneva (2020). URL https://cds.cern.ch/record/2714892. Final version.
[6] The ATLAS Collaboration. Technical Design Report for the Phase-II Upgrade of the ATLAS TDAQ System. Tech. Rep., CERN, Geneva (2017). URL https://cds.cern.ch/record/2285584.
[7] Zurbano Fernandez, I. et al. High-Luminosity Large Hadron Collider (HL-LHC): Technical design report. CERN Yellow Reports: Monographs 10/2020 (2020).
[8] Menghani, G. Efficient deep learning: A survey on making deep learning models smaller, faster, and better. ACM Computing Surveys 55, 1 – 37 (2021). URL https://api.semanticscholar.org/CorpusID:235446458.
[9] Li, Z., Li, H. & Meng, L. Model compression for deep neural networks: A survey. Computers 12 (2023). URL https://www.mdpi.com/2073-431X/12/3/60.
[10] Coelho, C. N. et al. Automatic heterogeneous quantization of deep neural networks for low-latency inference on the edge for particle detectors. Nature Machine Intelligence 3, 675–686 (2021). URL https://doi.org/10.1038%2Fs42256-021-00356-5.
[11] Ngadiuba, J. et al. Compressing deep neural networks on fpgas to binary and ternary precision with hls4ml. Machine Learning: Science and Technology 2, 015001 (2020). URL https://dx.doi.org/10.1088/2632-2153/aba042.
[12] Zhou, S. et al. Dorefa-net: Training low bitwidth convolutional neural networks with low bitwidth gradients. CoRR abs/1606.06160 (2016). URL http://arxiv.org/abs/1606.06160. 1606.06160.
[13] Lin, X., Zhao, C. & Pan, W. Towards accurate binary convolutional neural network. In Guyon, I. et al. (eds.) Advances in Neural Information Processing Systems, vol. 30 (Curran Associates, Inc., 2017). URL https://proceedings.neurips.cc/paper_files/paper/2017/file/b1a59b315fc9a3002ce38bbe070ec3f5-Paper.pdf.
[14] Courbariaux, M., Bengio, Y. & David, J. Binaryconnect: Training deep neural networks with binary weights during propagations. CoRR abs/1511.00363 (2015). URL http://arxiv.org/abs/1511.00363. 1511.00363.
[15] Rastegari, M., Ordonez, V., Redmon, J. & Farhadi, A. Xnor-net: Imagenet classification using binary convolutional neural networks. In Leibe, B., Matas, J., Sebe, N. & Welling, M. (eds.) Computer Vision – ECCV 2016, 525–542 (Springer International Publishing, Cham, 2016).
[16] Li, F., Liu, B., Wang, X., Zhang, B. & Yan, J. Ternary weight networks (2022). 1605.04711.
[17] Zhu, C., Han, S., Mao, H. & Dally, W. J. Trained ternary quantization (2017). 1612.01064.
[18] He, Z. & Fan, D. Simultaneously optimizing weight and quantizer of ternary neural network using truncated gaussian approximation. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 11430–11438 (2018).
[19] Xu, C. et al. Alternating multi-bit quantization for recurrent neural networks. CoRR abs/1802.00150 (2018). URL http://arxiv.org/abs/1802.00150. 1802.00150.
[20] Guo, Y., Yao, A., Zhao, H. & Chen, Y. Network sketching: Exploiting binary structure in deep cnns. CoRR abs/1706.02021 (2017). URL http://arxiv.org/abs/1706.02021. 1706.02021.
[21] Zhang, D., Yang, J., Ye, D. & Hua, G. Lq-nets: Learned quantization for highly accurate and compact deep neural networks. CoRR abs/1807.10029 (2018). URL http://arxiv.org/abs/1807.10029. 1807.10029.
[22] Qu, Z., Zhou, Z., Cheng, Y. & Thiele, L. Adaptive loss-aware quantization for multi-bit networks. CoRR abs/1912.08883 (2019). URL http://arxiv.org/abs/1912.08883. 1912.08883.
[23] Chang, S.-E. et al. Mix and match: A novel fpga-centric deep neural network quantization framework. In 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA), 208–220 (2021).
[24] Wang, K., Liu, Z., Lin, Y., Lin, J. & Han, S. Hardware-centric automl for mixed-precision quantization. International Journal of Computer Vision 128, 2035–2048 (2020). URL https://doi.org/10.1007/s11263-020-01339-6.
[25] Dong, Z., Yao, Z., Gholami, A., Mahoney, M. W. & Keutzer, K. HAWQ: hessian aware quantization of neural networks with mixed-precision. CoRR abs/1905.03696 (2019). URL http://arxiv.org/abs/1905.03696. 1905.03696.
[26] Dong, Z. et al. HAWQ-V2: hessian aware trace-weighted quantization of neural networks. CoRR abs/1911.03852 (2019). URL http://arxiv.org/abs/1911.03852. 1911.03852.
[27] Yao, Z., Gholami, A., Keutzer, K. & Mahoney, M. W. Pyhessian: Neural networks through the lens of the hessian. 2020 IEEE International Conference on Big Data (Big Data) 581–590 (2019). URL https://api.semanticscholar.org/CorpusID:209376531.
[28] Choi, J. et al. Bridging the accuracy gap for 2-bit quantized neural networks (QNN). CoRR abs/1807.06964 (2018). URL http://arxiv.org/abs/1807.06964. 1807.06964.
[29] Wu, B. et al. Mixed precision quantization of convnets via differentiable neural architecture search. CoRR abs/1812.00090 (2018). URL http://arxiv.org/abs/1812.00090. 1812.00090.
[30] Que, Z. et al. Metaml: Automating customizable cross-stage design-flow for deep learning acceleration. In 2023 33rd International Conference on Field-Programmable Logic and Applications (FPL), 248–252 (2023).
[31] Park, E., Yoo, S. & Vajda, P. Value-aware quantization for training and inference of neural networks. In Ferrari, V., Hebert, M., Sminchisescu, C. & Weiss, Y. (eds.) Computer Vision – ECCV 2018, 608–624 (Springer International Publishing, Cham, 2018).
[32] Dettmers, T., Lewis, M., Shleifer, S. & Zettlemoyer, L. 8-bit optimizers via block-wise quantization. 9th International Conference on Learning Representations, ICLR (2022).
[33] Dettmers, T. et al. Spqr: A sparse-quantized representation for near-lossless llm weight compression (2023). 2306.03078.
[34] Lou, Q., Guo, F., Kim, M., Liu, L. & Jiang., L. Autoq: Automated kernel-wise neural network quantization. In International Conference on Learning Representations (2020). URL https://openreview.net/forum?id=rygfnn4twS.
[35] Sun, M. et al. Film-qnn: Efficient fpga acceleration of deep neural networks with intra-layer, mixed-precision quantization. Proceedings of the 2022 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (2022). URL https://doi.org/10.1145/3490422.3502364.
[36] Le Cun, Y., Denker, J. S. & Solla, S. A. Optimal brain damage. In Proceedings of the 2nd International Conference on Neural Information Processing Systems, NIPS’89, 598–605 (MIT Press, Cambridge, MA, USA, 1989).
[37] Hassibi, B., Stork, D. & Wolff, G. Optimal brain surgeon and general network pruning. In IEEE International Conference on Neural Networks, 293–299 vol.1 (1993).
[38] Ramhorst, B., Constantinides, G. A. & Loncar, V. Fpga resource-aware structured pruning for real-time neural networks (2023). 2308.05170.
[39] Meng, F. et al. Pruning filter in filter. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M. & Lin, H. (eds.) Advances in Neural Information Processing Systems, vol. 33, 17629–17640 (Curran Associates, Inc., 2020). URL https://proceedings.neurips.cc/paper_files/paper/2020/file/ccb1d45fb76f7c5a0bf619f979c6cf36-Paper.pdf.
[40] Li, Y. et al. Differentiable transportation pruning (2023). 2307.08483.
[41] Zhang, S., Wang, M., Liu, S., Chen, P.-Y. & Xiong, J. Why lottery ticket wins? a theoretical perspective of sample complexity on sparse neural networks. In Ranzato, M., Beygelzimer, A., Dauphin, Y., Liang, P. & Vaughan, J. W. (eds.) Advances in Neural Information Processing Systems, vol. 34, 2707–2720 (Curran Associates, Inc., 2021). URL https://proceedings.neurips.cc/paper_files/paper/2021/file/15f99f2165aa8c86c9dface16fefd281-Paper.pdf.
[42] Vischer, M. A., Lange, R. T. & Sprekeler, H. On lottery tickets and minimal task representations in deep reinforcement learning. In International Conference on Learning Representations (2022). URL https://openreview.net/forum?id=Fl3Mg_MZR-.
[43] Frankle, J. & Carbin, M. The lottery ticket hypothesis: Finding sparse, trainable neural networks. In International Conference on Learning Representations (2019). URL https://openreview.net/forum?id=rJl-b3RcF7.
[44] Miao, L. et al. Learning pruning-friendly networks via frank-wolfe: One-shot, any-sparsity, and no retraining. In International Conference on Learning Representations (2022). URL https://openreview.net/forum?id=O1DEtITim__.
[45] Chijiwa, D., Yamaguchi, S. y., Ida, Y., Umakoshi, K. & INOUE, T. Pruning randomly initialized neural networks with iterative randomization. In Ranzato, M., Beygelzimer, A., Dauphin, Y., Liang, P. & Vaughan, J. W. (eds.) Advances in Neural Information Processing Systems, vol. 34, 4503–4513 (Curran Associates, Inc., 2021). URL https://proceedings.neurips.cc/paper_files/paper/2021/file/23e582ad8087f2c03a5a31c125123f9a-Paper.pdf.
[46] Abadi, M. et al. TensorFlow: Large-scale machine learning on heterogeneous systems (2015). URL https://www.tensorflow.org/. Software available from tensorflow.org.
[47] Fahim, F. et al. hls4ml: An open-source codesign workflow to empower scientific low-power machine learning devices. CoRR abs/2103.05579 (2021). URL https://arxiv.org/abs/2103.05579. 2103.05579.
[48] Alessandro, Franco, G., nickfraser, Umuroglu, Y. & vfdev. Xilinx/brevitas: Release version 0.2.1 (2021). URL https://doi.org/10.5281/zenodo.4507794.
[49] Paszke, A. et al. PyTorch: An imperative style, high-performance deep learning library. In Wallach, H., Larochelle, H., Beygelzimer, A., d’Alché-Buc, F., Fox, E. & Garnett, R. (eds.) Proceedings of the 33rd International Conference on Neural Information Processing Systems (Curran Associates Inc., Red Hook, NY, USA, 2019).
[50] Umuroglu, Y. et al. FINN: A framework for fast, scalable binarized neural network inference. In Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (ACM Press, 2017). 1612.07119.
[51] Blott, M. et al. FINN-R: An end-to-end deep-learning framework for fast exploration of quantized neural networks. ACM Trans. Reconfigurable Technol. Syst. 11 (2018). 1809.04570.
[52] Bengio, Y., Léonard, N. & Courville, A. C. Estimating or propagating gradients through stochastic neurons for conditional computation. CoRR abs/1308.3432 (2013). URL http://arxiv.org/abs/1308.3432. 1308.3432.
[53] Baskin, C. et al. UNIQ. ACM Transactions on Computer Systems 37, 1–15 (2019). URL https://doi.org/10.1145/3444943.
[54] Elthakeb, A. T. et al. WaveQ: Gradient-based deep quantization of neural networks through sinusoidal adaptive regularization (2020). 2003.00146.
[55] Nguyen, H. D., Alexandridis, A. & Mouchtaris, A. Quantization aware training with absolute-cosine regularization for automatic speech recognition. In Interspeech (2020). URL https://api.semanticscholar.org/CorpusID:226203265.
[56] Chollet, F. et al. Keras. https://keras.io (2015).
[57] Aarrestad, T. et al. Fast convolutional neural networks on FPGAs with hls4ml. Machine Learning: Science and Technology 2, 045015 (2021). URL https://dx.doi.org/10.1088/2632-2153/ac0ea1.
[58] Sun, C., Nakajima, T., Mitsumori, Y., Horii, Y. & Tomoto, M. Fast muon tracking with machine learning implemented in FPGA. Nuclear Instruments and Methods in Physics Research Section A: Accelerators, Spectrometers, Detectors and Associated Equipment 1045, 167546 (2023). URL http://dx.doi.org/10.1016/j.nima.2022.167546.
[59] Pierini, M., Duarte, J. M., Tran, N. & Freytsis, M. hls4ml LHC jet dataset (150 particles) (2020). URL https://doi.org/10.5281/zenodo.3602260.
[60] Umuroglu, Y., Akhauri, Y., Fraser, N. J. & Blott, M. LogicNets: Co-designed neural networks and circuits for extreme-throughput applications. 2020 30th International Conference on Field-Programmable Logic and Applications (FPL) 291–297 (2020). URL https://doi.org/10.1109/FPL50879.2020.00055.
[61] Tsoi, H. F., Loncar, V., Dasu, S. & Harris, P. SymbolNet: Neural symbolic regression with adaptive dynamic pruning (2024). 2401.09949.
[62] Netzer, Y. et al. Reading digits in natural images with unsupervised feature learning. NIPS Workshop on Deep Learning and Unsupervised Feature Learning (2011).
[63] LeCun, Y. et al. Backpropagation applied to handwritten zip code recognition. Neural Computation 1, 541–551 (1989). URL https://api.semanticscholar.org/CorpusID:41312633.
[64] Bradbury, J. et al. JAX: composable transformations of Python+NumPy programs (2018). URL http://github.com/google/jax.
[65] Tange, O. GNU Parallel 20240122 (’Frederik X’) (2023). URL https://doi.org/10.5281/zenodo.10558745.