CSINN Batina Paper
Abstract

Machine learning has become mainstream across industries. Numerous examples prove its validity for security applications. In this work, we investigate how to reverse engineer a neural network by using side-channel information such as timing and electromagnetic (EM) emanations. To this end, we consider multilayer perceptron and convolutional neural networks as the machine learning architectures of choice and assume a non-invasive and passive attacker capable of measuring those kinds of leakages.

We conduct all experiments on real data and commonly used neural network architectures in order to properly assess the applicability and extendability of those attacks. Practical results are shown on an ARM Cortex-M3 microcontroller, which is a platform often used in pervasive applications using neural networks, such as wearables, surveillance cameras, etc. Our experiments show that a side-channel attacker is capable of obtaining the following information: the activation functions used in the architecture, the number of layers and neurons in the layers, the number of output classes, and the weights in the neural network. Thus, the attacker can effectively reverse engineer the network using merely side-channel information such as timing or EM.

1 Introduction

Machine learning, and more recently deep learning, have become hard to ignore for research in distinct areas such as image recognition [25], robotics [21], natural language processing [47], and also security [53, 26], mainly due to their unquestionable practicality and effectiveness. The ever increasing computational capabilities of today's computers and the huge amounts of data available are resulting in much more complex machine learning architectures than was envisioned before. As an example, the AlexNet architecture, consisting of 8 layers, was the best performing algorithm in the image classification task ILSVRC2012 (http://www.image-net.org/challenges/LSVRC/2012/). In 2015, the best performing architecture for the same task was ResNet, consisting of 152 layers [15]. This trend is not expected to stagnate any time soon, so it is prime time to consider machine/deep learning from a novel perspective and in new use cases. Also, deep learning algorithms are gaining popularity in IoT edge devices such as sensors or actuators, as they are indispensable in many tasks, like image classification or speech recognition. As a consequence, there is an increasing interest in deploying neural networks on low-power processors found in always-on systems, e.g., ARM Cortex-M microcontrollers.

In this work, we focus on two neural network algorithms: multilayer perceptron (MLP) and convolutional neural networks (CNNs). We consider feed-forward neural networks and, consequently, our analysis is conducted on such networks only.

With the increasing number of design strategies and elements to use, fine-tuning the hyper-parameters of those algorithms is emerging as one of the main challenges. When considering distinct industries, we are witnessing an increase in intellectual property (IP) model strategies. Basically, in cases when optimized networks are of commercial interest, their details are kept undisclosed. For example, EMVCo (formed by MasterCard and Visa to manage specifications for payment systems and to facilitate worldwide interoperability) nowadays requires deep learning techniques for security evaluations [43]. This has an obvious consequence: 1) security labs generate (and use) neural networks for the evaluation of security products and 2) they treat them as IP, exclusively for their customers.

There are also other reasons for keeping the neural network architectures secret. Often, these pre-trained models might provide additional information regarding the training data, which can be very sensitive. For example, if the model is trained on a medical record of a patient [9], confidential information could be encoded into the network during the training phase. Also, machine learning models that are used for guiding medical treatments are often based on a patient's genotype, making this extremely sensitive from the privacy perspective [10]. Even if we disregard privacy issues,
2.1.1 Multilayer Perceptron

A very simple type of neural network is called a perceptron. A perceptron is a linear binary classifier applied to the feature vector, i.e., a function that decides whether or not an input belongs to some specific class. Each vector component has an associated weight w_i and each perceptron has a threshold value θ. The output of a perceptron equals "1" if the dot product between the feature vector and the weight vector is larger than the threshold and "-1" otherwise. A perceptron classifier works only for data that are linearly separable, i.e., if there is some hyperplane that separates all the positive points from all the negative points [34].

By adding more layers to a perceptron, we obtain a multilayer perceptron algorithm. A multilayer perceptron (MLP) is a feed-forward neural network that maps sets of inputs onto sets of appropriate outputs. It consists of multiple layers of nodes in a directed graph, where each layer is fully connected to the next one. Consequently, each node in one layer connects with a certain weight w to every node in the following layer. The multilayer perceptron algorithm consists of at least three layers: one input layer, one output layer, and one hidden layer. Those layers must consist of nonlinearly activating nodes [7]. We depict a model of a multilayer perceptron in Figure 1. Note that if there is more than one hidden layer, it can be considered a deep learning architecture. Differing from the linear perceptron, an MLP can distinguish data that are not linearly separable. To train the network, the backpropagation algorithm is used, which is a generalization of the least mean squares algorithm in the linear perceptron. Backpropagation is used by the gradient descent optimization algorithm to adjust the weights of neurons by calculating the gradient of the loss function [34].

2.1.2 Convolutional Neural Network

CNNs represent a type of neural network that was first designed for 2-dimensional convolutions, as it was inspired by the biological processes of animals' visual cortex [28]. From the operational perspective, CNNs are similar to ordinary neural networks (e.g., the multilayer perceptron): they consist of a number of layers where each layer is made up of neurons. CNNs use three main types of layers: convolutional layers, pooling layers, and fully-connected layers. Convolutional layers are linear layers that share weights across space. Pooling layers are non-linear layers that reduce the spatial size in order to limit the number of neurons. Fully-connected layers are layers where every neuron is connected with all the neurons in the neighboring layer. For additional information about CNNs, we refer interested readers to [12].

2.1.3 Activation Functions

An activation function of a node is a function f defining the output of a node given an input or set of inputs, see Eq. (1). To enable the calculation of nontrivial functions by an ANN using a small number of nodes, one needs nonlinear activation functions as follows.

y = Activation(∑(weight · input) + bias).  (1)

In this paper, we consider the logistic (sigmoid) function, the tanh function, the softmax function, and the Rectified Linear Unit (ReLU) function. The logistic function is a nonlinear function giving smooth and continuously differentiable results [14]. The range of the logistic function is [0, 1], which means that all the values going to the next neuron will have the same sign.

f(x) = 1 / (1 + e^{-x}).  (2)

The tanh function is a scaled version of the logistic function, the main difference being that it is symmetric over the origin. The tanh function ranges in [−1, 1].

f(x) = tanh(x) = 2 / (1 + e^{-2x}) − 1.  (3)

The softmax function is a type of sigmoid function able to map values into multiple outputs (e.g., classes). The softmax function is ideally used in the output layer of the classifier in order to obtain the probabilities defining a class for each input [5]. To denote a vector, we represent it in bold style.

f(x)_j = e^{x_j} / ∑_{k=1}^{K} e^{x_k},  for j = 1, . . . , K.  (4)
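To make the computation targeted later in the paper concrete, the following is a minimal C sketch (our own illustrative code, not the implementation attacked in this work) of a single neuron evaluating Eq. (1) with the activations of Eqs. (2)-(4) and ReLU. Note that library calls such as expf() and tanhf() are typically not constant-time, which is what the timing analysis in later sections exploits.

#include <math.h>
#include <stddef.h>

/* Activation functions from Eqs. (2)-(4) and ReLU; softmax is applied to a
   whole output vector, the others element-wise. */
static float sigmoid_act(float x) { return 1.0f / (1.0f + expf(-x)); }
static float tanh_act(float x)    { return tanhf(x); }
static float relu_act(float x)    { return x > 0.0f ? x : 0.0f; }

/* One neuron: y = activation(sum(weight * input) + bias), cf. Eq. (1). */
static float neuron(const float *input, const float *weight, size_t n,
                    float bias, float (*activation)(float))
{
    float acc = bias;
    for (size_t i = 0; i < n; i++)
        acc += weight[i] * input[i];   /* the multiplications later attacked via CPA */
    return activation(acc);
}

/* Softmax over the output layer, cf. Eq. (4). */
static void softmax(const float *x, float *out, size_t k)
{
    float sum = 0.0f;
    for (size_t j = 0; j < k; j++) { out[j] = expf(x[j]); sum += out[j]; }
    for (size_t j = 0; j < k; j++) out[j] /= sum;
}

/* Example use for one hidden neuron: y = neuron(x, w, 6, b, sigmoid_act); */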
3.2 Experimental Setup

Here we describe the attack methodology, which is first validated on an Atmel ATmega328P. Later, we also demonstrate the proposed methodology on an ARM Cortex-M3.

The side-channel activity is captured using a Lecroy WaveRunner 610zi oscilloscope. For each known input, the attacker gets one measurement (or trace) from the oscilloscope. In the following, nr. of inputs and nr. of traces are used interchangeably. Each measurement is composed of [...]

[Figure panel (a) ReLU: EM trace, amplitude vs. time samples (×10^5).]

As already stated above, the exploited leakage model of the target device is the Hamming weight (HW) model. A microcontroller loads sensitive data to a data bus to perform the indicated instructions. This data bus is pre-charged to all '0's before every instruction. Note that the data bus being pre-charged is a natural behavior of microcontrollers and not a vulnerability introduced by the attacker. Thus, the power consumption (or EM radiation) assigned to the value of the data being loaded is modeled as the number of bits equal to '1'. In other words, the power consumption of loading data x is:

HW(x) = ∑_{i=1}^{n} x_i,  (6)

where x_i are the individual bits of the n-bit data x.
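As an illustration of this leakage model, a small helper like the following (our own sketch, not code from the measurement setup) computes the Hamming weight hypothesis an attacker assigns to an intermediate value:

#include <stdint.h>

/* Hamming weight of a value, cf. Eq. (6): the number of bits set to '1'.
   Used as the predicted leakage of data placed on the pre-charged bus. */
static unsigned hamming_weight32(uint32_t x)
{
    unsigned hw = 0;
    while (x) {
        hw += x & 1u;
        x >>= 1;
    }
    return hw;
}

/* Example: hamming_weight32(0x3D) == 5. */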
[Figure 5 panels: (a) First byte mantissa for weight = 2.43; (b) Second byte mantissa for weight = 2.43; (c) Third byte mantissa for weight = 2.43.]
Table 2: Code snippet of the returned assembly for multiplication: x = x · w (= 2.36, or 0x3D0A1740 in IEEE 754 representation). The multiplication itself is not shown here, but from the register assignments, our leakage model assumption holds.

  #    Instruction       Comment
  11a  ldd r22, Y+1      0x01
  11c  ldd r23, Y+2      0x02
  11e  ldd r24, Y+3      0x03
  120  ldd r25, Y+4      0x04
  122  ldi r18, 0x3D     61
  124  ldi r19, 0x0A     10
  126  ldi r20, 0x17     23
  128  ldi r21, 0x40     64
  12a  call 0xa0a        multiplication
  12e  std Y+1, r22      0x01
  130  std Y+2, r23      0x02
  132  std Y+3, r24      0x03
  134  std Y+4, r25      0x04

[...] the weight directly. For this experiment, we target the result of the multiplication m of known input values x and unknown weight w. For every input, we assume different possibilities for the weight values. We then perform the multiplication and estimate the IEEE 754 binary representation of the output. To deal with the growing number of possible candidates for the unknown weight w, we assume that the weight is bounded in a range [−N, N], where N is a parameter chosen by the adversary, and the size of the set of possible candidates is denoted as s = 2N/p, where p is the precision when dealing with floating-point numbers.

Then, we perform the recovery of the 23-bit mantissa of the weight. The sign and exponent can be recovered separately. Thus, we are observing the leakage of 3 registers, and based on the best CPA results for each register, we can reconstruct the mantissa. Note that the recovered mantissa does not directly relate to the weight, but with the recovery of the sign and exponent, we can obtain the unique weight value. The traces are measured when the microcontroller performs the secret weight multiplication with uniformly random values between −1 and 1 (x ∈ [−1, 1]) to emulate normalized input values. We set N = 5 and, to reduce the number of possible candidates, we assume that each floating-point value has a precision of 2 decimal points, p = 0.01. Since we are dealing with the mantissa only, we can then check only the weight candidates in the range [0, N], thus reducing the number of possible candidates. We note here that this range [−5, 5] is based on the previous experiments with MLP. In the later phase of the experiment, however, we targeted the floating-point and fixed-point representation (2^32 candidates in the worst-case scenario on a 32-bit microcontroller, but possibly fewer if the value is, for example, normalized) instead of the real value, which could in principle cover all possible floating-point values.

In Figure 5, we show the result of the correlation for each byte with the measured traces. The horizontal axis shows the time of execution and the vertical axis the correlation. The experiments were conducted on 1 000 traces for each case. In the figure, the black plot denotes the correlation of the "correct" mantissa weight (|m(ŵ) − m(w)| < 0.01), whereas the red plots are from all other weight candidates in the range described earlier. Since we are only attacking the mantissa in this phase, several weight candidates might have similar correlation peaks. After the recovery of the mantissa, the sign bit and exponent can be recovered similarly, which narrows down the candidate list to a unique weight. Another observation is that the correlation value is not very high and is scattered across different clock cycles. This is because the measurements are noisy and, since the operation is not constant-time, the interesting time samples are distributed across multiple clock cycles. Nevertheless, it is shown that the side-channel leakage can be exploited to recover the weight up to a certain precision. Multivariate side-channel analysis [42] can be considered if distributed samples hinder recovery.
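To illustrate the hypothesis-building step described above, the sketch below (our own illustrative code; names such as predict_mantissa_hw are not from the paper) enumerates weight candidates in [−N, N] with precision p, multiplies each by the known input, and derives the Hamming weight of the three mantissa bytes of the IEEE 754 single-precision product, which is then correlated against the measured traces:

#include <stdint.h>
#include <string.h>
#include <stdio.h>

/* Hamming weight of one byte. */
static unsigned hw8(uint8_t b)
{
    unsigned hw = 0;
    for (; b; b >>= 1) hw += b & 1u;
    return hw;
}

/* For a known input x and one weight candidate w_hat, predict the HW of the
   three mantissa bytes of the IEEE 754 representation of the product x * w_hat. */
static void predict_mantissa_hw(float x, float w_hat, unsigned hw_out[3])
{
    float prod = x * w_hat;
    uint32_t bits;
    memcpy(&bits, &prod, sizeof bits);         /* reinterpret the float as raw bits */
    uint32_t mantissa = bits & 0x007FFFFFu;    /* low 23 bits */
    hw_out[0] = hw8((mantissa >> 16) & 0x7Fu); /* top 7 mantissa bits */
    hw_out[1] = hw8((mantissa >> 8) & 0xFFu);
    hw_out[2] = hw8(mantissa & 0xFFu);
}

/* Enumerate candidates w_hat in [-N, N] with step p (s = 2N/p hypotheses). */
static void enumerate_candidates(float x, float N, float p)
{
    int steps = (int)(2.0f * N / p);
    for (int i = 0; i <= steps; i++) {
        float w_hat = -N + i * p;
        unsigned hw[3];
        predict_mantissa_hw(x, w_hat, hw);
        printf("w_hat=%.2f  HW(mantissa bytes)=%u %u %u\n",
               w_hat, hw[0], hw[1], hw[2]);
    }
}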
[Figure 6: plots of correlation vs. number of traces (targeted value in black, incorrect values in red); panels (a) weight = 1.635 and (b) weight = 0.890.]
Figure 6: Correlation comparison between the correct and incorrect mantissa of the weights. (a) The correct mantissa can be recovered (the correct value/black line has a higher value than the maximum of the incorrect values/red line). (b) A special case where an incorrect value of the mantissa has a higher correlation, recovering 0.890625 (1100100000..00) instead of 0.89 (1100011110...10), still within precision error limits and thus resulting in attack success.

[Figure 7: plots of correlation vs. number of traces (targeted value vs. incorrect values); panels (a) First byte recovery (sign and 7-bit exponent) and (b) Second byte recovery (lsb of exponent and mantissa).]
Figure 7: Recovery of the weight.

We emphasize that attacking real numbers, as in the case of the weights of an ANN, can be easier than attacking cryptographic implementations. This is because cryptography typically works on fixed-length integers and exact values must be recovered. When attacking real numbers, small precision errors due to rounding off the intermediate values still result in useful information.

To deal with more precise values, we can target the mantissa multiplication operation directly. In this case, the search space can either be [0, 2^23 − 1] to cover all possible values of the mantissa (hence, more computational resources will be required), or we can focus only on the most significant bits of the mantissa (fewer candidates, but also lower precision). Since the 7 most significant bits of the mantissa are processed in the same register, we can aim to target only those bits, assigning the rest to 0. Thus, our search space is now [0, 2^7 − 1]. The mantissa multiplication can be performed as 1.mantissa_x × 1.mantissa_w, then taking the 23 most significant bits after the leading 1, and normalizing (updating the exponent if the result overflows) if necessary.
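A sketch of this reduced hypothesis (our own code, assuming the 7-bit candidate occupies the top mantissa bits and the remaining bits are set to 0) could look as follows; the HW of the relevant byte(s) of the returned mantissa is then used as the CPA hypothesis, exactly as in the full-product case.

#include <stdint.h>

/* Hypothesis for the mantissa-only attack: multiply the implicit-1 significands
   1.mantissa_x and 1.mantissa_w, keep the 23 bits after the leading 1, and
   renormalize if the product overflows into [2, 4). */
static uint32_t mantissa_product(uint32_t mant_x, uint32_t mant_w_top7)
{
    /* Build 24-bit significands with the implicit leading 1. Only the 7 most
       significant mantissa bits of the weight are guessed; the remaining
       16 bits are assumed to be 0. */
    uint64_t sig_x = (1u << 23) | (mant_x & 0x7FFFFFu);
    uint64_t sig_w = (1u << 23) | ((uint64_t)(mant_w_top7 & 0x7Fu) << 16);

    uint64_t prod = sig_x * sig_w;   /* 48-bit product, value in [1, 4) */

    if (prod & (1ull << 47))         /* product >= 2: shift once more, exponent +1 */
        return (uint32_t)((prod >> 24) & 0x7FFFFFu);
    else                             /* product in [1, 2): no renormalization */
        return (uint32_t)((prod >> 23) & 0x7FFFFFu);
}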
In Figure 6, we show the result of the correlation between the HW of the first 7-bit mantissa of the weight and the traces. Except for Figure 6b, the results show that the correct mantissa can be recovered. Although the correlation is not increasing, it is important that the difference becomes stable after a sufficient number of traces is used, eventually distinguishing the correct weight from wrong weight hypotheses. The most interesting result is shown in Figure 6b, which at first glance looks like a failure of the attack. Here, the target value of the mantissa is 1100011110...10, while the attack recovers 1100100000..00. Considering the sign and exponent, the attack recovers 0.890625 instead of 0.89, i.e., a precision error at the 4th place after the decimal point. Thus, in both cases, we have shown that we can recover the weights from the SCA leakage.

In Figure 7, we show the composite recovery of 2 bytes of the weight representation, i.e., a low-precision setting where [...]
[Figure: EM activity (amplitude vs. time). (a) One hidden layer with 6 neurons; (b) 2 hidden layers (6 and 5 neurons each); (c) 3 hidden layers (6, 5, 5 neurons each).]
4 Experiments with ARM Cortex-M3

In the previous section, we proposed a methodology to reverse engineer sensitive parameters of a neural network, which we practically validated on an 8-bit AVR (Atmel ATmega328P). In this section, we extend the presented attack to a 32-bit ARM microcontroller. ARM microcontrollers form a fair share of the current market, with huge dominance in mobile applications, but are also seeing rapid adoption in markets like IoT, automotive, virtual and augmented reality, etc. Our target platform is the widely available Arduino Due development board, which contains an Atmel SAM3X8E ARM Cortex-M3 CPU with a 3-stage pipeline, operating at 84 MHz. The measurement setup is similar to the previous experiments (Lecroy WaveRunner 610zi oscilloscope and RF-U 5-2 near-field EM probe from Langer). The point of measurement was determined by a benchmarking code running AES encryption. After capturing the measurements for the target neural network, one can perform reverse engineering. Note that the ARM Cortex-M3 (as well as the M4 and M7) has support for deep learning in the form of the CMSIS-NN implementation [27].

The timing behavior of the various activation functions is shown in Figure 10. The results, though different from the previous experiments on AVR, have unique timing signatures, allowing identification of each activation function. Here, the sigmoid and tanh activation functions have similar minimal computation times, but the average and maximum values are higher for the tanh function. To distinguish them, one can obtain multiple inputs to the function, build patterns, and do pattern matching to determine which type of function is used. The activity of a single neuron is shown in Figure 11a, which uses sigmoid as the activation function (the multiplication operation is shown separated by a vertical red line).

A known-input attack is mounted on the multiplication to recover the secret weight. One practical consideration in attacking the multiplication is that different compilers will compile it differently for different targets. Modern microcontrollers also have dedicated floating-point units for handling operations like multiplication of real numbers. To avoid discrepancies between different implementations of the multiplication operation, we target the output of the multiplication. In other words, we target the point when the multiplication operation with the secret weight is completed and the resulting product is updated in general-purpose registers or memory. Figure 11b shows the success of the attack recovering a secret weight of 2.453, with a known input. As stated before, side-channel measurements on a modern 32-bit ARM Cortex-M3 may have a lower SNR, thus making the attack slightly harder. Still, the attack is shown to be practical even on ARM with 2× more measurements. In our setup, getting 200 extra measurements takes less than a minute. Similarly, the setup and number of measurements can be updated for other targets like FPGAs, GPUs, etc.
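For completeness, the distinguisher itself is ordinary CPA: for every weight hypothesis, the predicted Hamming weights are correlated against the measured samples, and the hypothesis with the highest absolute correlation is kept, as in Figure 11b. A minimal sketch (our own code, with hypothetical array names) of the Pearson correlation used for this comparison:

#include <math.h>
#include <stddef.h>

/* Pearson correlation between predicted leakages (one HW value per trace)
   and the measured samples at one time index, computed over n traces. */
static double pearson(const double *hyp, const double *meas, size_t n)
{
    double sum_h = 0.0, sum_m = 0.0;
    for (size_t i = 0; i < n; i++) { sum_h += hyp[i]; sum_m += meas[i]; }
    double mean_h = sum_h / n, mean_m = sum_m / n;

    double cov = 0.0, var_h = 0.0, var_m = 0.0;
    for (size_t i = 0; i < n; i++) {
        double dh = hyp[i] - mean_h;
        double dm = meas[i] - mean_m;
        cov   += dh * dm;
        var_h += dh * dh;
        var_m += dm * dm;
    }
    return cov / sqrt(var_h * var_m);
}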
[Figure 11: (a) Observing pattern and timing of multiplication and activation function; (b) Correlation comparison between correct and incorrect mantissa for weight = 2.453; (c) SEMA on hidden layers with 3 hidden layers (6, 5, 5 neurons each).]

Finally, the full network layout is recovered. The activity of a full network with 3 hidden layers composed of 6, 5, and 5 neurons each is shown in Figure 11c. All the neurons are observable by visual inspection. The layer boundaries (shown by a solid red line) can be determined by attacking the multiplication operation and following the approach discussed in Section 3.6.
4.1 Reverse Engineering MLP

The migration of our testbed to ARM Cortex-M3 allowed us to test bigger networks, which are used in some relevant case studies. First, we consider an MLP that is used in profiling side-channel analysis [41]. Our network of choice comes from the domain of side-channel analysis, which has seen the adoption of deep learning methods in the past. With this network, a state-of-the-art profiled SCA was conducted when considering several datasets, some of which even contain implemented countermeasures. Since certification labs use machine learning to evaluate the resilience of cryptographic implementations to profiled attacks, an attacker able to reverse engineer that machine learning model would be able to use it to attack implementations on his own. The MLP we investigate has 4 hidden layers with dimensions (50, 30, 20, 50), uses the ReLU activation function, and has Softmax at the output. The whole measurement trace is shown in Figure 12(a), with a zoom on one neuron in the third layer in Figure 12(b). When measuring at 500 MSamples/s, each trace had ∼ 4.6 million samples. The dataset is DPAcontest v4 with 50 samples and 75 000 measurements [46]. The first 50 000 measurements are used for training and the rest for testing. We experiment with the Hamming weight model (meaning there are 9 output classes). The original accuracy equals 60.9% and the accuracy of the reverse engineered network is 60.87%. While the previously developed techniques are directly available, there are a few practical issues.

• As the average run time is 9.8 ms, each measurement would take a long time considering the measurement and data saving time. To boost the SNR, averaging is recommended. We could use the oscilloscope's built-in feature for averaging. Overall, the measurement time per trace was slightly over one second after averaging 10 times.

• The measurement period was too long to measure the whole period easily at a reasonable resolution. This was resolved by measuring two consecutive layers at a time [...]
[2] Albericio, J., Judd, P., Hetherington, T., Aamodt, T., Jerger, N. E., and Moshovos, A. Cnvlutin: Ineffectual-Neuron-Free Deep Neural Network Computing. In 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA) (June 2016), pp. 1–13.

[3] Ateniese, G., Mancini, L. V., Spognardi, A., Villani, A., Vitali, D., and Felici, G. Hacking Smart Machines with Smarter Ones: How to Extract Meaningful Data from Machine Learning Classifiers. Int. J. Secur. Netw. 10, 3 (Sept. 2015), 137–150.

[4] Bhasin, S., Guilley, S., Heuser, A., and Danger, J.-L. From cryptography to hardware: analyzing and protecting embedded Xilinx BRAM for cryptographic applications. Journal of Cryptographic Engineering 3, 4 (2013), 213–225.

[12] Goodfellow, I., Bengio, Y., and Courville, A. Deep Learning. MIT Press, 2016. http://www.deeplearningbook.org.

[13] Hachez, G., and Quisquater, J.-J. Montgomery exponentiation with no final subtractions: Improved results. In International Workshop on Cryptographic Hardware and Embedded Systems (2000), Springer, pp. 293–301.

[14] Haykin, S. Neural Networks: A Comprehensive Foundation, 2nd ed. Prentice Hall PTR, Upper Saddle River, NJ, USA, 1998.

[15] He, K., Zhang, X., Ren, S., and Sun, J. Deep Residual Learning for Image Recognition. CoRR abs/1512.03385 (2015).