1 Introduction
Machine learning (ML) has become ubiquitous in various domains, such as healthcare [
24], automotive [
26], and cybersecurity [
54], among others. ML has also made significant advances in terms of performance albeit with increased development costs—e.g., training a recent ML model is estimated to cost over $4.6M [
48]. The increasing demand for ML models along with their costly development has created a favorable market for selling ML models directly as a service to the customers [
74]. The model provider can either host the trained model on the cloud or deploy it directly on an edge device like a surveillance camera [
80]. The recent trend is indeed to directly deploy the ML model on an edge platform for better performance and privacy [
103]. These valuable ML model IPs should be kept confidential. High-fidelity model extraction attacks that use mathematical and/or cryptanalytic approaches to reverse engineer the internals of the model have already been proposed in prior literature [
13,
41]. The shift from cloud-based servers to edge devices has additionally exposed the ML models to the easily applicable and extremely potent physical side-channel attacks [
4,
22,
70,
83,
93,
95,
99]. Furthermore, the number of queries needed for a successful side-channel attack is orders of magnitude lower than that of mathematical model extraction attacks. ML models may also be more vulnerable to such attacks than cryptographic implementations, because the latter usually have a built-in key-refresh mechanism that places an upper bound on the number of traces an adversary can capture, whereas ML models do not [
94].
Physical side-channel attacks exploit information leakage of the secret through physical properties of the device such as power consumption and electromagnetic (EM) emanations. Differential Power Analysis (DPA), for instance, exploits the inherent correlations between the secret-key-dependent data being processed and the power consumption of the device [44]. Since its introduction, many cryptographic implementations have been shown to be vulnerable to such attacks [
15,
51,
67]. Researchers have accordingly proposed effective ways to mitigate such attacks [
29,
40,
62,
68,
However, side-channel analysis research has focused primarily on protecting cryptographic implementations, because cryptography was long the only domain that required confidentiality. Lately, ML models have also become lucrative targets for side-channel attacks. Stealing the model internals also assists adversarial attacks that aim to create misclassifications in the model for malicious purposes [
100].
Physical side-channel attacks are easily applicable if the adversary has physical access to the device, which is why edge-based ML accelerators are unsurprisingly susceptible to such attacks [
4,
22,
93,
99]. With the recent developments in remote physical side-channel attacks, it is also possible to extend reverse engineering of ML models to a multi-tenant cloud-based FPGA setting [
57,
77,
101]. However, the research on developing adequate countermeasures is still quite immature. With the latest market research predicting a tremendous growth in the sales of edge-based ML hardware in the coming years [
37], there is an urgent need to develop efficient and robust side-channel defenses for ML applications.
The existing works on building countermeasures against power-based side-channel attacks on ML accelerators extend the ones developed for cryptographic applications. The only two available techniques from our earlier works either utilize a combination of masking and hiding for side-channel resilience or a full Boolean masking approach [
21,
22]. Hiding aims to equalize power consumption throughout the execution, typically using a precharge and differential logic-based circuit. Masking splits (or encodes) the original secret into multiple statistically independent shares to break the correlation of the secret data with the power consumption. The former approach is cost-conscious but challenging to implement effectively due to the precise control required in the back-end flow for the hiding countermeasure. The latter approach of Boolean masking is relatively easy to implement effectively, because it is an algorithm-level defense and does not require precise back-end control. However, it incurs higher performance and area overheads when implemented naively.
In this article, we extend our previous work on Boolean masking that was published at the IEEE/ACM International Conference on Computer Aided Design 2020 [
21]. The earlier work presents the first fully masked ML accelerator design that provides protection against power side-channel attacks. The extensions over the previous work include the following.
•
Our previous work used a pipelined ripple-carry architecture for the masked adder design, which had a latency of 100 cycles. In this extension, we design and implement a masked Kogge-Stone architecture for the adder that reduces its latency from 100 cycles to 31 cycles. We quantify that this change improves the area-delay overhead of the whole design from 5.4\( \times \) to 4.7\( \times \) without changing the side-channel security. We quantify the area reduction mainly in terms of the number of look-up tables (LUTs), because LUTs are the limiting factor in FPGAs; flip-flops are available in abundance. Section 4.3 discusses the details of the design, Section 6.2.3 illustrates the side-channel resilience, and Section 6.3 presents the implementation results.
•
We propose a shuffle countermeasure to improve the side-channel resilience and make it more difficult to conduct a second-order side-channel attack on our first-order masked design. Specifically, the scheme randomizes the sequence in which neurons are computed for each layer by adopting the
random start index (RSI) method originally developed for protecting AES implementations [
55,
92]. Shuffling introduces noise in the temporal domain and increases the number of measurements for a successful attack by
\( N^2 \) , where
N is the number of hidden layer nodes. We discuss the details of the implementation in Section
5. We also demonstrate that shuffling helps to increase even the second-order side-channel security of the design up to at least 3M traces in a low noise setup.
•
We also explore an alternative implementation, in which we replace the pipelined Trichina’s AND gates of our earlier proposed adder [90] with a five-input masked LUT. We show that such a design further reduces the area-delay product by 2\( \times \) but leaks information under a first-order side-channel attack. The leakage, however, is much smaller than that of an unprotected design, becoming statistically significant only after capturing 270k measurements. We discuss the details of the implementation and quantify the side-channel leakage in Section
7.1.1.
The rest of this manuscript is organized as follows: Section
2 discusses our assumptions on the adversary and the victim; Section
3 briefly discusses related work on ML model extraction, existing side-channel defenses in cryptography, neural networks and describes our baseline hardware design; Section
4 describes in detail the hardware design of our proposed masked neural network components; Section
5 discusses the implementation of the shuffle countermeasure; Section
6 presents our hardware implementation results and the side-channel evaluation; Section
7 discusses potential ways to reduce the costs further or provide provable security; finally, we conclude our article in Section
8.
2 Threat Model
We adopt the standard DPA threat model in which an adversary has direct physical access to the target device running inference [
4,
22,
45], or can obtain power measurements remotely when the device executes neural network computations [
101]. The adversary can control the inputs and observe the corresponding outputs from the device as in chosen-plaintext attacks.
Figure
1 shows our threat model where the training phase is trusted but the trained model is deployed to an inference engine operating in an untrusted environment. The adversary is after the trained model parameters (e.g., weights and biases)—input data privacy is out of scope [
93].
We assume that the trained ML model is stored in a protected memory and the standard techniques are used to securely transfer it (i.e., bus snooping attacks are out of scope) [
9]. The adversary, therefore, has gray-box access to the device, i.e., it knows all the design details up to the level of each logic gate but does not know the trained ML model. We restrict the secret variables to just the parameters and not the hyperparameters such as the number of neurons, following earlier work [
22,
42,
89]. In fact, an adversary will still not be able to clone the model with just the hyperparameters if it does not possess the required compute power or training dataset. This is analogous to the scenario in cryptography where an adversary, even after knowing the implementation of a cipher, cannot break it without the correct key.
We target a hardware implementation of the neural network, not software. The design fully fits on the FPGA. Therefore, it does not involve any off-chip memory access and executes with constant flow in constant time. These attributes make the design resilient to any type of digital (memory, timing, access-pattern, etc.) side-channel attack. However, physical side-channels like power and EM emanations still exist; we address the power-based side-channel leakages in our work. Other implementation attacks on neural networks, such as fault attacks, are out of scope [
7,
8].
3 Background and Related Work
This section presents related work on the privacy of ML applications, the current state of side-channel defenses, preliminaries on BNNs, and our BNN hardware design.
3.1 ML Model Extraction
Recent developments in the field of ML point to several motivating scenarios that demand asset confidentiality. First, training is a computationally intensive process and hence requires the model provider to invest money in high-performance compute resources (e.g., a GPU cluster). The model provider might also need to invest money to buy a labeled dataset for training or label an unstructured dataset. Therefore, knowledge about either the parameters or hyperparameters can provide an unfair business advantage to the user of the model, which is why the ML model should be private.
Theoretical model extraction analyzes the query-response pairs obtained by repeatedly querying an unknown ML model to steal the parameters [
13,
41,
64,
71]. This type of attack is similar to theoretical cryptanalysis in the cryptography literature. Digital side-channels, by contrast, exploit the leakage of secret-data-dependent intermediate computations, such as access patterns or timing, in the neural network computations to steal the parameters [
20,
23,
33,
34,
96], which can usually be mitigated by making the secret computations constant-flow and constant-time. Physical side-channels target the leak in the physical properties like CMOS power-draw or electromagnetic emanations that will still exist in a constant-flow/constant-time algorithm’s implementation [
4,
22,
83,
93,
95,
99]. Mitigating physical side-channels in hardware accelerator design is thus harder than mitigating digital side-channels, and it has been extensively studied in the cryptography community.
3.2 Side-channel Defenses
Researchers have proposed numerous countermeasures against DPA. These countermeasures can be broadly classified as either
hiding-based or
masking-based. The former aims to make the power-consumption constant throughout the computation by using power-balancing techniques [
61,
87,
98]. The latter splits the sensitive variable into multiple statistically independent shares to ensure that the power consumption is independent of the sensitive variable throughout the computation [
2,
6,
28,
40,
65,
90].
The security provided by hiding-based schemes hinges upon the precision of the back-end design tools to create a near-perfect power-equalized circuit by balancing the load capacitances and synchronizing their activity across the leakage prone paths. This is not a trivial task and prior literature shows how a well-balanced dual-rail-based defense is still vulnerable to localized EM attacks [
36]. By contrast, masking transforms the algorithm itself to work in a secure way by never evaluating the secret variables directly. This keeps the security largely independent of the back-end design and makes masking a favorable choice over hiding.
3.3 Neural Network Classifiers
Neural network algorithms learn how to perform a certain task. In the learning phase, the user sends a set of inputs and expected outputs to the machine (a.k.a., training), which helps it to approximate (or learn) the function mapping the input-output pairs. The learned function can then be used by the machine to generate outputs for unknown inputs (a.k.a., inference).
A neural network consists of units called neurons (or nodes) and these neurons are usually grouped into layers. The neurons in each layer can be connected to the neurons in the previous and next layers. Each connection has a weight associated with it, which is computed during the training. The neurons work in a feed-forward fashion passing information from one layer to the next.
The weights and biases can be initialized to be random values or a carefully chosen set before training [
66]. These weights and biases are the
critical parameters that our countermeasure aims to protect. During training, a set of inputs along with the corresponding labels are fed to the network. The network computes the error between the actual outputs and the labels and tunes the weights and biases to reduce it, converging to a state where the accuracy is acceptable.
3.4 Binarized Neural Networks
Figure
2 depicts the neuron computation in a fully connected BNN. The weights and biases of a neural network are typically floating-point numbers. However, the high area, storage, and power demands of floating-point hardware do not fare well with the requirements of resource-constrained edge devices.
Binarized Neural Networks (BNNs) [
35], with their low hardware cost and power needs, fit this use case very well while providing reasonable accuracy. BNNs restrict the weights and activations to binary values (+1 and
\( -1 \) ), which can easily be represented in hardware by a single bit. This reduces the storage costs for the weights from floating-point values to binary values. The XNOR-POPCOUNT operation implemented using XNOR gates replaces the large floating-point multipliers resulting in a huge area and performance gain [
69]. In Figure
2, the neuron in the first hidden layer multiplies the input values with their respective binarized weights. The generated products are added to the bias, and the result is fed to the activation function, which is a sign function that binarizes non-negative and negative inputs to +1 and \( -1 \), respectively. Hence, the activations in the subsequent layer are also binarized.
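To make the XNOR-POPCOUNT arithmetic concrete, the following is a minimal Python sketch of a single hidden-layer neuron under the encoding described above (bit 1 for +1, bit 0 for \( -1 \)); the function names and the toy inputs are illustrative and not part of our hardware.

```python
# Toy model of one hidden-layer BNN neuron: bit 1 encodes +1, bit 0 encodes -1.
def binarize(x):
    """Sign activation: non-negative -> +1 (bit 1), negative -> -1 (bit 0)."""
    return 1 if x >= 0 else 0

def bnn_neuron(act_bits, weight_bits, bias):
    n = len(act_bits)
    # XNOR-POPCOUNT: count the positions where activation and weight agree.
    agreements = sum(1 for a, w in zip(act_bits, weight_bits) if a == w)
    # A sum of n products in {+1, -1} equals agreements - disagreements.
    return binarize(2 * agreements - n + bias)

print(bnn_neuron([1, 0, 1, 1], [1, 1, 1, 1], bias=0))   # -> 1, i.e., +1
```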
Prior works have demonstrated the computational efficiency of XNOR-POPCOUNT-based arithmetic operations in DNNs. A previously proposed GPU kernel exploiting XNOR-POPCOUNT operations achieved
\( 23\times \) faster matrix multiplication compared to a naive baseline implementation [
17]. The designed kernel is
\( 3.4\times \) faster than cuBLAS and the MLP runs
\( 7\times \) faster using the XNOR-POPCOUNT kernel compared to the baseline. Well known semiconductor companies like Xilinx, Intel, and Apple are showing interest in BNNs due to these advantages [
43,
84,
91]. Owing to their low memory footprint and lightweight nature of operations, BNNs are considered attractive for edge applications such as FPGA-accelerators [
25], cryptographic neural network inference systems [
50], and for designing low-bitwidth ConvNets [
102], among many other applications. Despite their low-bitwidth operations, the accuracy obtained by BNNs is comparable to that obtained by full-precision neural networks. For instance, the accuracy loss of a BNN-ConvNet with 1-bit weights and activations was found to be less than
\( 0.5\% \) compared to a full-precision ConvNet with seven convolutional layers and one dense layer [
102] when evaluated on the Google
Street View House Number (SVHN) dataset. Similarly, another work [
49] achieved less than
\( 5\% \) accuracy loss on the ImageNet dataset [
19], a challenging dataset notorious for its complexity in the computer vision community, using a ResNet-like network built entirely using binary convolution blocks, compared to a full precision network.
3.5 Our Baseline BNN Hardware Design
We consider a BNN having an input layer of 784 nodes, three hidden layers of 1,010 nodes each, and an output layer of 10 nodes. The 784 input nodes denote the 784 pixel values in the 28
\( \times \) 28 grayscale images of the
Modified National Institute of Standards and Technology (MNIST) database, and the 10 output nodes represent the 10 classes of handwritten numerical digits.
3.5.1 Weighted Summations.
We choose to use a single adder in the design and sequentialize all the additions in the algorithm to reduce the area costs. Figure
3 shows our baseline BNN design. The computation starts from the input layer pixel values stored in the Pixel Memory. For each node of the first hidden layer, the hardware multiplies the 784 input pixel values with their corresponding weights one by one and accumulates the sum of these products. The final summation is added to the bias, reusing the adder through a multiplexed input, and the result is fed to the activation function. The hardware uses XNOR and POPCOUNT operations to perform the weighted summations in the hidden layers. The final layer summations are sent to the output logic.
In the input layer computations, the hardware multiplies an 8-bit unsigned input pixel value with its corresponding weight. The weight values are binarized to either 0 or 1 (representing a
\( -1 \) or +1, respectively). Figure
4 shows the realization of this multiplication with a multiplexer that takes in the pixel value (
a) and its 2’s complement (
\( -a \) ) as the data inputs and weight (
\( \pm \) 1) as the select line. The 8-bit unsigned pixel value, when multiplied by
\( \pm \) 1, needs to be sign-extended to 9-bits, resulting in a 9-bit 2
\( \times \) 1 multiplexer.
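A behavioral sketch of this multiplexer-based multiplication (in Python, with illustrative names) is shown below; it simply selects between the zero-extended pixel and its 9-bit 2’s complement based on the weight bit.

```python
# Behavioral model of the Figure 4 multiplexer: an 8-bit unsigned pixel times a
# binarized weight, realized as a 9-bit 2x1 multiplexer.
MASK9 = (1 << 9) - 1

def mul_pixel_weight(pixel, weight_bit):
    """weight_bit = 1 encodes +1 (pass the pixel), 0 encodes -1 (pass its 2's complement)."""
    assert 0 <= pixel < 256
    return (pixel if weight_bit else -pixel) & MASK9

print(mul_pixel_weight(200, 0))   # -> 312, i.e., -200 in 9-bit 2's complement
```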
3.5.2 Activation Function.
The activation function binarizes the non-negative and negative inputs to +1 and \( -1 \) , respectively, for each node of the hidden layer. In hardware, this is implemented using a simple NOT gate that takes the MSB of the summations as its input.
3.5.3 Output Layer.
The summations in the output layer represent the confidence score of each output class for the provided image. Therefore, the final classification result is the class having the maximum confidence score. Figure
3 shows the hardware for computing the classification result. As the adder generates output layer summations, they are sent to the output logic block that performs a rolling update of the max register (
max) if the newly received sum is greater than the previously computed max. In parallel, the hardware also stores the index of the current max node. The index stored after the final update is sent out as the final output of the neural network. The hardware takes 2.8M cycles to finish one inference.
Exemplary DPA Attack on Baseline Implementation. A DPA adversary can target the 20-bit accumulator register at the output of the adder in Figure
3. The register accumulates the sum of the products of image pixels and weights in every cycle. Thus, the power consumption of any of those cycles can be modeled using a Hamming distance (HD) model. For example, if the adversary targets the fourth cycle, the power model M is given as follows:
\( M = \mathrm{HD}\left(\sum_{i=0}^{2} p_i \cdot w_i,\ \sum_{i=0}^{3} p_i \cdot w_i\right), \)
where \( p_i \) and \( w_i \) denote the \( i{\text{th}} \) image pixel and weight, respectively. The number of hypotheses in this case is 16 (\( 2^4 \)), since the adversary attacks four weights \( w_0, w_1, w_2 \), and \( w_3 \). Since the hardware adds the bias in the \( 785{\text{th}} \) cycle, the adversary needs to extract the 784 weights first and then construct the power model for the sum with bias to attack it.
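For illustration, the following Python sketch enumerates the 16 Hamming-distance hypotheses for one known image; the pixel values and function names are illustrative, and the correlation step against measured traces is omitted.

```python
# Enumerate the 16 HD hypotheses for the first four weights, for one known image.
def hw(x):
    return bin(x).count("1")

def hd_hypotheses(pixels, width=20):
    mask = (1 << width) - 1
    hyps = []
    for guess in range(16):
        w = [(guess >> i) & 1 for i in range(4)]                  # w_i = 1 encodes +1
        prod = [pixels[i] if w[i] else -pixels[i] for i in range(4)]
        acc3 = sum(prod[:3]) & mask                               # register after cycle 3
        acc4 = (acc3 + prod[3]) & mask                            # register after cycle 4
        hyps.append(hw(acc3 ^ acc4))                              # Hamming distance
    return hyps

# Each hypothesis vector would then be correlated against the traces at that cycle.
print(hd_hypotheses([18, 250, 7, 133]))
```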
4 Fully Masking the Neural Network
This section discusses the hardware design and implementation of all components in the masked neural network. Prior work on masking of neural networks shows that arithmetic masking alone cannot mask integer addition due to a leakage in the sign-bit [
22]. Hence, we apply gate-level
Boolean masking to perform integer addition in a secure fashion. We express the entire computation of the neural network as a sequence of AND and XOR operations and apply gate-level masking on the resulting expression. XORs, being linear, do not require any additional masking, and AND gates are replaced with secure, Trichina style AND gates [
90]. Furthermore, we design specialized circuits for BNN’s unique components like Masked Multiplexer and Masked Output Layer.
We first explain the notations in equations and figures. Any variable without a subscript or superscript represents an N-bit number. We use the subscript to refer to a single bit of the N-bit number. For example, \( a_7 \) refers to the \( 8{\text{th}} \) bit of a. The superscript in masking refers to the different secret shares of a variable. To refer to a particular share of a particular bit of an N-bit number, we use both the subscript and the superscript. For example, \( a_{4}^{1} \) refers to the second Boolean share of the \( 5{\text{th}} \) bit of a. If a variable only has the superscript (say i), then we are referring to its full N-bit \( i{\text{th}} \) Boolean share; N can also be equal to 1, in which case a is simply a bit. r (or \( r_i \)) denotes a fresh, random bit. The operation \( \oplus \) represents a bitwise XOR of operands.
4.1 A Glitch-Resilient Trichina’s AND Gate
Among the closely related masking styles [
72], we chose to implement Trichina’s method due to its simplicity and implementation efficiency. Figure
5 (left) shows the basic structure and functionality of Trichina’s gate, which implements a masked, two-input AND operation of
\( c=a \cdot b \) . Each input (
a and
b) is split into two shares (
\( a^0 \) and
\( a^1 \) s.t.
\( a=a^0\oplus a^1 \) ,
\( b^0 \) and
\( b^1 \) s.t.
\( b=b^0\oplus b^1 \) ). These shares are sequentially processed with a chain of AND gates initiated with a fresh random bit (
r). A single AND operation thus uses three random bits. The technique ensures that output is the Boolean masked output of the original AND function, i.e.,
\( c=c^0 \oplus c^1 \) , while all the intermediate computations are randomized.
Unfortunately, the straightforward adoption of Trichina’s AND gate can lead to information leakage due to glitches [
62]. For instance, in Figure
5 (left), if the products \( a^0\cdot b^0 \) and \( a^0\cdot b^1 \) reach the inputs of the second XOR gate before the random mask r reaches the input of the first XOR gate, the output of the second XOR gate will temporarily evaluate (glitch) to \( (a^0\cdot b^0)\oplus (a^0\cdot b^1)=a^0\cdot (b^0\oplus b^1) \), which unmasks the secret value b. Therefore, we opted for an extension of Trichina’s AND gate that adds flip-flops to synchronise the arrival of inputs at the XOR gates (see Figure 5, right). The only XOR gate without a flip-flop at its input is the leftmost XOR gate in the path of \( c^1 \), which is not a problem, because a glitching output at this gate does not combine two shares of the same variable. Similar techniques have been used in the past [
3]. Masking styles like the Threshold gates [
52,
53,
86] may be considered for even stronger security guarantees, but they will add further area-performance-randomness overhead.
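The following Python sketch models the Boolean function of Trichina’s masked AND gate on single-bit shares and exhaustively checks its correctness; it abstracts away the flip-flops that the hardware adds for glitch resilience, and the function name is ours.

```python
# Boolean model of Trichina's masked AND on single-bit shares; the registers inserted
# in hardware for glitch resilience are not modeled.
def trichina_and(a0, a1, b0, b1, r):
    """Return shares (c0, c1) such that c0 ^ c1 == (a0 ^ a1) & (b0 ^ b1)."""
    c0 = r
    # The partial products are folded in one at a time, starting from the fresh mask r,
    # so no intermediate value depends on an unmasked secret.
    c1 = (((r ^ (a0 & b0)) ^ (a0 & b1)) ^ (a1 & b0)) ^ (a1 & b1)
    return c0, c1

# Exhaustive functional check over all secrets, shares, and masks.
for a in (0, 1):
    for b in (0, 1):
        for a0 in (0, 1):
            for b0 in (0, 1):
                for r in (0, 1):
                    c0, c1 = trichina_and(a0, a ^ a0, b0, b ^ b0, r)
                    assert c0 ^ c1 == a & b
```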
4.2 Masked Ripple Carry Adder
We adopt the ripple-carry style of implementation for the adder first. It is composed of N 1-bit full adders where the carry-out from each adder is the carry-in for the next adder in the chain, starting from the LSB. The ripple-carry configuration therefore eases parameterization and modular design of the Boolean masked adders.
4.2.1 Design of a Masked Full Adder.
A 1-bit full adder takes as input two operands and a carry-in and generates the sum and the carry, which are a function of the two operands and the carry-in. If the input operand bits are denoted by
a and
b and carry-in bit by
c, then the Boolean equation of the sum
S and the carry
C can be described as follows:
\( S = a \oplus b \oplus c, \qquad C = (a \cdot b) \vee (a \cdot c) \vee (b \cdot c). \)
However, the non-linear OR operation in the carry function is usually replaced with the linear XOR operator to simplify the masking of the carry:
\( C = (a \cdot b) \oplus (a \cdot c) \oplus (b \cdot c). \qquad (2) \)
Figure
6 shows the regular, 1-bit full adder (on the left), and the resulting masked adder with Trichina’s AND gates (on the right). In the rest of the subsection, we will discuss the derivation of the masked full adder equations.
The first step is to split the secret variables (a, b, and c) into Boolean shares. The hardware samples a fresh, random mask from a uniform distribution and XORs it with the original variable. If we represent the random masks as \( a^{0} \), \( b^{0} \), and \( c^{0} \), then the masked values \( a^{1} \), \( b^{1} \), and \( c^{1} \) can be generated as follows:
\( a^{1} = a \oplus a^{0}, \qquad b^{1} = b \oplus b^{0}, \qquad c^{1} = c \oplus c^{0}. \)
A masking scheme always works on the two shares independently without combining them at any point in the operation, because that will reconstruct the secret and create a side-channel leak.
The function of sum-generation is linear, making it easy to directly and independently compute the Boolean shares of
S:
\( S = S^{0} \oplus S^{1}, \)
where
\( S^{0} = a^{0} \oplus b^{0} \oplus c^{0} \quad \text{and} \quad S^{1} = a^{1} \oplus b^{1} \oplus c^{1}. \)
Unlike the sum-generation, carry-generation is a non-linear operation due to the presence of an AND operator. Hence, the hardware cannot directly and independently compute the Boolean shares
\( C^0 \) and
\( C^1 \) of
C. We use the Trichina’s construction explained in Section
4.1 to mask carry-generation.
The hardware uses three Trichina’s AND gates to mask the three AND operations in Equation (
2) using three random masks. This generates two Boolean shares from each Trichina AND operation. At this point, the expression is linear again, and therefore, the hardware can redistribute the terms, similar to the masking of sum operation.
In the following equations, we use
\( TG(x,y,r) \) to represent the product
\( x\cdot y \) implemented via Trichina’s AND Gate as illustrated in the following equation:
\( x \cdot y = TG(x, y, r) = m^{0} \oplus m^{1}, \)
where
\( m^0 \) and
\( m^1 \) are the two Boolean shares of the product. Replacing each AND operation in Equation (2) with TG, we can write
\( a \cdot b = TG(a, b, r_{0}) = d^{0} \oplus d^{1}, \qquad (5) \)
\( a \cdot c = TG(a, c, r_{1}) = e^{0} \oplus e^{1}, \qquad (6) \)
\( b \cdot c = TG(b, c, r_{2}) = f^{0} \oplus f^{1}, \qquad (7) \)
where \( d^0 \), \( d^1 \), \( e^0 \), \( e^1 \), \( f^0 \), and \( f^1 \) are the output shares from each Trichina Gate. From Equations (2), (5), (6), and (7), we get
\( C = TG(a, b, r_{0}) \oplus TG(a, c, r_{1}) \oplus TG(b, c, r_{2}). \)
Replacing the TGs from Equations (5), (6), and (7) and rearranging the terms, we get
\( C = (d^{0} \oplus e^{0} \oplus f^{0}) \oplus (d^{1} \oplus e^{1} \oplus f^{1}), \)
which can also be written as a combination of two Boolean shares \( C^0 \) and \( C^1 \), where
\( C^{0} = d^{0} \oplus e^{0} \oplus f^{0} \quad \text{and} \quad C^{1} = d^{1} \oplus e^{1} \oplus f^{1}. \)
Therefore, we create a masked full adder that takes in the Boolean shares of the two bits to be added along with a carry-in and gives out the Boolean shares of the sum and carry-out.
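A behavioral Python sketch of this masked full adder, reusing the trichina_and function sketched above, is given below; r0, r1, and r2 stand for the three fresh random masks.

```python
# Masked full adder built from the trichina_and sketch above; r0, r1, r2 are the
# three fresh random masks for the three AND operations of Equation (2).
def masked_full_adder(a0, a1, b0, b1, c0, c1, r0, r1, r2):
    # Linear sum: each share set is XORed independently (S^0 and S^1).
    s0 = a0 ^ b0 ^ c0
    s1 = a1 ^ b1 ^ c1
    # Non-linear carry: a.b, a.c, and b.c each go through a Trichina AND gate.
    d0, d1 = trichina_and(a0, a1, b0, b1, r0)
    e0, e1 = trichina_and(a0, a1, c0, c1, r1)
    f0, f1 = trichina_and(b0, b1, c0, c1, r2)
    # Linear recombination into the carry shares C^0 and C^1.
    return s0, s1, d0 ^ e0 ^ f0, d1 ^ e1 ^ f1
```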
4.2.2 The Modular Design of Pipelined N-bit Full Adder.
The masked full adders can be chained together to create an N-bit masked adder that can add two masked N-bit numbers. Figure
7 (top) shows how to construct a 4-bit masked adder as an example. We pipeline the N-bit adder to yield a throughput of one by adding registers between the full-adders corresponding to each bit (see Figure
7 (bottom)).
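The chaining can be sketched as follows (Python, behavioral only); the pipeline registers of Figure 7 (bottom) are not modeled, and rand_bit stands in for the design's fresh-mask source.

```python
import secrets

def rand_bit():
    return secrets.randbits(1)          # stands in for the design's fresh-mask source

def masked_rca(x0, x1, y0, y1, n=20, cin0=0, cin1=0):
    """Add two n-bit values given as Boolean shares; returns shares of the sum."""
    s0 = s1 = 0
    c0, c1 = cin0, cin1
    for i in range(n):                  # carry shares ripple from LSB to MSB
        si0, si1, c0, c1 = masked_full_adder(
            (x0 >> i) & 1, (x1 >> i) & 1, (y0 >> i) & 1, (y1 >> i) & 1,
            c0, c1, rand_bit(), rand_bit(), rand_bit())
        s0 |= si0 << i
        s1 |= si1 << i
    return s0, s1

# (s0 ^ s1) equals (x + y) mod 2^n whenever x = x0 ^ x1 and y = y0 ^ y1.
```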
4.3 Masked Kogge Stone Adder
We also propose and implement the masked design for a baseline Kogge Stone adder (KSA) architecture. KSA is a type of parallel prefix adder that has a (logarithmically) lower latency compared to that of the ripple carry adder. KSA builds on the concept of carry look ahead. It starts by computing the
generate (
\( g_i \) ) and
propagate (
\( p_i \) ) bits for each position, given by the following equations:
\( g_i = a_i \cdot b_i, \qquad p_i = a_i \oplus b_i, \qquad (9) \)
where
\( a_i \) and
\( b_i \) are the
\( i{\text{th}} \) bits of the operands. The generate bit being 1 implies that the carry will definitely be asserted at that position. The propagate bit being 1 implies that the carry will only be asserted if there is an incoming carry from the previous step. Therefore, the carry can only be asserted if the generate bit is 1 or the propagate bit is 1 with an incoming carry from the previous position.
This concept of generates and propagates for a single position can be extended to compute the so-called
group generate and
group propagate bits that denote whether a group of bits generates or propagates a carry. For example, the following two equations illustrate how to compute the group generate (
\( G_{1:0} \) ) and propagate (
\( P_{1:0} \) ) for the group of least significant two bits from the individual generates and propagates
\( g_0,p_0,g_1 \) , and
\( p_1 \) :
\( G_{1:0} = g_1 \oplus (p_1 \cdot g_0), \qquad P_{1:0} = p_1 \cdot p_0. \)
Larger groups can be created from smaller groups by combining them in a similar fashion. Let
\( G_{i:k} \) ,
\( P_{i:k} \) ,
\( G_{k:j} \) , and
\( P_{k:j} \) represent the group generate and propagate bits for the group of bits from
\( i{\text{th}} \) to
\( k{\text{th}} \) position, and
\( k{\text{th}} \) to
\( j{\text{th}} \) position, respectively. The generic equation to combine these quantities and get a group generate and propagate for the combined group from
\( i{\text{th}} \) to
\( j{\text{th}} \) bit is given below:
\( G_{i:j} = G_{i:k} \oplus (P_{i:k} \cdot G_{k:j}), \qquad (10) \)
\( P_{i:j} = P_{i:k} \cdot P_{k:j}. \qquad (11) \)
Figure
8 shows the baseline KSA schematic for 8-bit operands.
First, the KSA computes the sum bits without carry by element-wise XORing of the operands. Next, it keeps combining the generates and propagates in parallel until the group generate and propagate at each position represent the group from that position down to the least significant bit, effectively creating a direct dependence of each carry on the carry-in of the adder. Finally, the adder computes the carry at each position directly from the final group generate, group propagate, and the carry-in, and XORs it with the sum without carry to calculate the result.
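As a behavioral reference (not our HDL description), the following Python sketch implements the baseline KSA for n-bit operands using the XOR-based combination function described above; in the masked version every AND below becomes a synchronised Trichina AND gate.

```python
# Behavioral sketch of the baseline Kogge-Stone adder: per-bit generate/propagate,
# log2(n) prefix stages with the XOR-based combination, then the final sum.
def kogge_stone_add(a, b, cin=0, n=8):
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(n)]   # p_i = a_i xor b_i (linear)
    g = [((a >> i) & (b >> i)) & 1 for i in range(n)]   # g_i = a_i and b_i (non-linear)
    G, P = g[:], p[:]
    dist = 1
    while dist < n:                                     # prefix stages
        nG, nP = G[:], P[:]
        for i in range(dist, n):
            nG[i] = G[i] ^ (P[i] & G[i - dist])         # 1 XOR + 1 AND
            nP[i] = P[i] & P[i - dist]                  # 1 AND
        G, P, dist = nG, nP, dist * 2
    carries = [cin] + [G[i] ^ (P[i] & cin) for i in range(n)]
    s = sum((p[i] ^ carries[i]) << i for i in range(n)) # sum-without-carry XOR carries
    return s, carries[n]

assert kogge_stone_add(0xB7, 0x5C) == ((0xB7 + 0x5C) & 0xFF, (0xB7 + 0x5C) >> 8)
```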
We have designed a masked version of the KSA. Figure
9 shows the schematic of this design. The adder in the masked design receives two Boolean shares of each operand. The function to compute the sum without carry is an XOR operation, which is linear. Therefore, the adder directly and independently computes the Boolean shares of the sum without carry. Similarly, it also computes the Boolean shares of the individual propagate bits, because they only involve an XOR operation (Equation (9)). The computation of the individual generate bits, however, involves the non-linear AND operation. The adder thus replaces the regular AND gates with the synchronised Trichina’s AND gate discussed in Section
4.1 to produce the Boolean shares of the generate bits. The produced generate and propagate bits are propagated down the adder tree and combined in each stage using the combination function from Equations (
10) and (
11). The combination function uses 1 XOR and 2 AND gates. The XOR, again, does not require any additional masking, whereas two AND gates are replaced by two synchronised Trichina’s AND gates. In the final stage, the adder XORs the respective shares of the sum without carry and the generate bits to calculate the Boolean shares of the actual sum.
In the masked KSA architecture, the adder’s latency is reduced from the 100 cycles of the masked RCA to 31 cycles. The reduced latency directly helps to reduce the area of the throughput-optimization circuitry that is discussed in Section
4.6. The adder latency decides the number of buffers needed to store the concurrent partial summations and the data width of the corresponding demultiplexer and multiplexer blocks. For instance, in the RCA-based design the hardware uses 101 buffers to compute the summations for 101 nodes concurrently, because the adder only produces 1 output in 101 cycles.
4.4 Masking of Activation Function
The baseline hardware implements the activation function as an inverter, as discussed in Section 3.5.2. In the masked version, the MSB output from the adder is a pair of Boolean shares. To perform the NOT operation in a masked way, the hardware simply inverts one of the Boolean shares, as Figure
10 shows.
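In code, the masked activation is a one-liner (Python sketch, illustrative names):

```python
# Masked sign activation (Figure 10): invert the MSB of exactly one share.
def masked_activation(msb0, msb1):
    return msb0 ^ 1, msb1    # shares of NOT(MSB), i.e., of the +1/-1 decision
```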
4.5 Masking the Output Layer
The hardware stores the 10 output layer summations in the form of Boolean shares. To determine the classification result, it needs to find the maximum value among the 10 masked output nodes. Specifically, it needs to compare two signed values expressed as Boolean shares. We transform the problem of masked comparison to masked subtraction.
Figure
11 shows the hardware design of the masked output layer. The hardware subtracts each output node value from the current maximum and swaps the current maximum (old max shares) with the node value (new max shares) if the MSB is 1 using a masked multiplexer. An MSB of 1 signifies that the difference is negative and hence the new sum is greater than the latest max. Instead of building a new masked subtractor, we reuse the existing masked adder to also function as a subtractor through a
sub flag, which is set while computing max. In parallel, the hardware uses one more masked multiplexer-based update-circuit to update the Boolean shares of the index corresponding to the current max node (not shown in the figure). This is to prevent known-ciphertext attacks, ciphertext being the classification result in our case. Finally, the Masked Output Logic computes the classification result in the form of (Boolean) shares of the node’s index having the maximum confidence score.
Subtraction is essentially adding one number to the 2’s complement of another. The 2’s complement is computed by taking the bitwise 1’s complement and adding 1. The bitwise inverse is implemented as an XOR with 1, and the addition of 1 is implemented by setting the initial carry-in to 1. The additional XOR, being a linear operator, changes nothing with respect to the masking of the new adder-subtractor circuit.
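A behavioral Python sketch of this masked compare-by-subtraction, building on the masked_rca sketch above, is shown below; the masked multiplexer that performs the swap and the index update are omitted, and the 20-bit width is illustrative.

```python
# Masked compare-by-subtraction for the output layer, reusing masked_rca above.
def masked_sub(x0, x1, y0, y1, n=20):
    mask = (1 << n) - 1
    # x - y = x + ~y + 1: the bitwise NOT is a (linear) XOR of one share with all-ones,
    # and the +1 enters through the carry-in, so masking is unaffected.
    return masked_rca(x0, x1, y0 ^ mask, y1, n=n, cin0=1, cin1=0)

def new_max_flag_shares(max0, max1, node0, node1, n=20):
    d0, d1 = masked_sub(max0, max1, node0, node1, n=n)
    # MSB of (max - node) is 1 iff the new node value exceeds the current maximum;
    # the masked multiplexer consumes these two shares as its select input.
    return (d0 >> (n - 1)) & 1, (d1 >> (n - 1)) & 1
```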
4.6 Scheduling of Operations
We optimize the scheduling in such a way that the hardware maintains a throughput of 1 addition per cycle. The latency of the masked 20-bit RCA is 100 cycles. Therefore, the result from the adder is only available 101 cycles after it samples the inputs. The hardware cannot feed the next input in the sequence until the previous sum is available because of the data dependency between the accumulated sum and the next accumulated sum. This incurs a stall of 101 cycles, leading to a total of \( 784\times 101=\text{79,184} \) cycles for each node computation. That is a \( 784\times \) performance drop over the unmasked implementation with a regular adder.
We solve the problem by finding useful work for the adder during the stalls that is independent of the summation in flight. We observe that computing the weighted summation of one node is completely independent of the next node’s computation. The hardware utilizes this independence to improve the throughput by starting the next node computation while the result for the first node arrives. Similarly, up to 101 nodes can be computed concurrently using the same adder, achieving the exact same throughput as the baseline design. This comes at the expense of additional registers (see Figure 12) for storing 101 summations plus some control logic, but a throughput gain of 784\( \times \) (or 1,010\( \times \) in the hidden layers) is worthwhile. The optimization only works if the number of next-layer nodes is greater than, and a multiple of, 101. This restricts optimizing the output layer (of 10 nodes) and contributes to the 3.5% increase in the latency of the masked design.
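The issue order can be sketched as follows (Python, with illustrative names and cycle accounting); the point is simply that up to latency+1 independent node accumulations rotate through the single adder so that a new addition issues every cycle.

```python
def interleaved_schedule(num_nodes, num_terms, adder_latency):
    """Yield (cycle, node, term) in issue order for one layer."""
    group = adder_latency + 1       # 101 for the masked RCA, 32 for the masked KSA
    assert num_nodes % group == 0   # the optimization needs a multiple of the group size
    cycle = 0
    for base in range(0, num_nodes, group):
        for term in range(num_terms):
            for node in range(base, base + group):
                # this node's previous partial sum returned exactly `group` cycles ago
                yield cycle, node, term
                cycle += 1
```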
The masked KSA design extension that we propose has a latency of only 31 cycles. This directly impacts the size of the throughput optimization circuit. Since the adder produces a result every 32 cycles, the hardware requires a register file of only 32 entries to store the parallel partial summations, and accordingly a multiplexer and demultiplexer of width 32. Hence, we reduce the area cost of the throughput optimization circuitry by 3\( \times \), which is consistent with the synthesis results shown in Table
2.
5 Shuffling the Neural Network Computations
Shuffling is a commonly used technique to defend against side-channel attacks [
92]. The premise of shuffling is to introduce noise in the temporal domain by either randomizing the order of independent operations and/or by adding random delays within the algorithm. In DPA attacks, the adversary tries to correlate its hypotheses with the side-channel measurements at a fixed point of interest over multiple measurements assuming that the targeted operation happens
at the same time in each trace. Since, in a shuffled implementation, the targeted operation happens at the same time instant only in a small subset of traces, the number of traces needed for a successful attack increases. Prior work by Mangard et al. demonstrates that the number of traces required for a successful attack increases by a factor of
\( p^2 \) if the probability of occurrence of the operation of attack is
\( 1/p \) [
51]. Therefore, shuffling would increase the number of measurements needed to run a successful side-channel attack by
\( p^2 \) . Shuffling improves the side-channel resilience in general for all orders due to the temporal perturbation introduced. However, we note that our proposed masking scheme already provides first-order side-channel security and thus, the adversary will naturally try a second-order attack as the next step. Therefore, shuffling makes it more difficult to conduct a second-order side-channel attack on our first-order masked design.
Earlier studies have adopted shuffling as a countermeasure against physical side-channel attacks on both software and hardware implementations of cryptographic primitives [
92]. Depending on the performance and area budget, the shuffle is implemented either as
random permutation (RP) or as a
random starting index (RSI) [
55]. RP-based shuffling typically uses the Fisher-Yates shuffle algorithm
that generates unbiased permutations of a given sequence with
n elements leading to
\( n! \) possible permutations. In RSI, the original sequence is cyclically rotated by a randomly chosen amount, which is the starting index, leading to
n possible permutations. We identify that the computation of the activation value of each node in a layer is independent of other nodes. This allows the hardware to shuffle the order of computation of activation values, which is by default done sequentially in the unshuffled implementation. The RSI-based shuffle is almost free in hardware, because only the starting read address of the memory storing the parameters needs to be initialized with a random index. This makes RSI a very good candidate for a low area-budget design. Given the area overheads and constant-time enforcement challenges of Fisher-Yates shuffle, we have opted for applying the RSI in this work.
RSI-based shuffling enables the number of possible permutations to be equal to the number of nodes in a hidden layer, i.e., 1,024. Based on the analysis by Mangard et al. [
51], the number of traces for a successful attack increases by
\( 1,024^2\times \), i.e., by about six orders of magnitude. The large number of independent operations in neural networks allows RSI to provide significantly higher side-channel security than in block ciphers, where the gain is typically around two orders of magnitude due to the smaller number of independent operations, for instance, the number of SBox computations. Implementing RP-based shuffling offers better security but has significant area costs for the following reasons. Every step of the Fisher-Yates shuffle requires generating a uniform random number within a dynamically changing range. However, PRNGs typically generate random numbers only within ranges that are powers of 2. One way to confine the generated random number to the required range is to perform a modulo reduction, but that leads to biased permutations [47]. This is a well-studied problem, and one way to solve it is to discard the random numbers that create the bias (a.k.a. rejection sampling), which makes the algorithm variable-time and can introduce timing-based side-channel vulnerabilities. The rejection sampler and the modulo operation with a non-power-of-2 modulus incur significant area overheads when implemented in hardware.
In the shuffled neural network implementation, the hardware generates a random index to start the node computations in a layer. The hardware then evaluates all the nodes sequentially from the randomly chosen node until the last node, and then the remaining nodes from the first node until the starting node. The hardware can further shuffle the reads of the input and hidden layer nodes during the weighted summation, because addition is commutative. However, we chose not to implement that shuffling in this work, because the number of input layer nodes is 784, which is not a power of 2 and hence requires a variable-time random number generator for an unbiased generation of numbers between 0 and 783 [
47].
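A minimal Python sketch of the RSI order generation is shown below; secrets.randbelow stands in for the on-chip PRNG, and the power-of-two node count is what keeps the draw unbiased and constant-time.

```python
import secrets

def rsi_order(num_nodes):
    """Cyclic node-evaluation order starting from a random start index (RSI)."""
    assert num_nodes & (num_nodes - 1) == 0, "power-of-two count keeps the draw unbiased"
    start = secrets.randbelow(num_nodes)   # hardware: one PRNG draw per layer/inference
    return [(start + i) % num_nodes for i in range(num_nodes)]

# One of num_nodes rotations is chosen afresh for every inference.
print(rsi_order(8))
```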
8 Conclusions and Future Outlook
Physical side-channel analysis of neural networks is a new, promising direction in hardware security where the attacks are rapidly evolving compared to defenses. We proposed the first fully masked neural network, demonstrated the security with up to 2M traces, and quantified the overheads of a potential countermeasure. We addressed the key challenge of masking integer addition [
22] through Boolean masking. We furthermore presented ideas on how to mask the unique linear and non-linear computations of a fully connected neural network that do not exist in cryptographic applications. We also demonstrated how to couple a first-order masked design with a lightweight shuffle countermeasure to provide additional second-order side-channel security. The combination of the two defenses can raise the side-channel resilience to a level at which a theoretical/cryptanalytic attack may succeed with fewer queries than a side-channel attack.
The large variety in neural network architectures in terms of the quantization level, the types of layer operations (e.g., Convolution, Maxpool, Softmax), and the activation functions (e.g., ReLU, Sigmoid, Tanh) presents a large design space for neural network side-channel defenses. This work focused on BNNs, because we believe that BNNs are both excellent candidates for resource-constrained devices deployed at the edge and the flavor of ML algorithms closest to block ciphers (on which a large chunk of the side-channel literature focuses). Our ideas serve as a benchmark to analyze the vulnerabilities that exist in neural network computations and to construct more robust and efficient countermeasures.