
CSI NN: Reverse Engineering of

Neural Network Architectures Through


Electromagnetic Side Channel
Lejla Batina, Radboud University, The Netherlands; Shivam Bhasin and
Dirmanto Jap, Nanyang Technological University, Singapore; Stjepan Picek,
Delft University of Technology, The Netherlands
https://www.usenix.org/conference/usenixsecurity19/presentation/batina

This paper is included in the Proceedings of the 28th USENIX Security Symposium.
August 14–16, 2019 • Santa Clara, CA, USA
ISBN 978-1-939133-06-9

Open access to the Proceedings of the 28th USENIX Security Symposium is sponsored by USENIX.
Abstract

Machine learning has become mainstream across industries. Numerous examples prove its validity for security applications. In this work, we investigate how to reverse engineer a neural network by using side-channel information such as timing and electromagnetic (EM) emanations. To this end, we consider multilayer perceptrons and convolutional neural networks as the machine learning architectures of choice and assume a non-invasive and passive attacker capable of measuring those kinds of leakages.

We conduct all experiments on real data and commonly used neural network architectures in order to properly assess the applicability and extendability of those attacks. Practical results are shown on an ARM Cortex-M3 microcontroller, a platform often used in pervasive applications of neural networks such as wearables, surveillance cameras, etc. Our experiments show that a side-channel attacker is capable of obtaining the following information: the activation functions used in the architecture, the number of layers and neurons in the layers, the number of output classes, and the weights in the neural network. Thus, the attacker can effectively reverse engineer the network using merely side-channel information such as timing or EM.

1 Introduction

Machine learning, and more recently deep learning, have become hard to ignore for research in distinct areas, such as image recognition [25], robotics [21], natural language processing [47], and also security [53, 26], mainly due to their unquestionable practicality and effectiveness. The ever increasing computational capabilities of today's computers and the huge amounts of data available are resulting in much more complex machine learning architectures than was envisioned before. As an example, the AlexNet architecture, consisting of 8 layers, was the best performing algorithm in the image classification task ILSVRC2012 (http://www.image-net.org/challenges/LSVRC/2012/). In 2015, the best performing architecture for the same task was ResNet, consisting of 152 layers [15]. This trend is not expected to stagnate any time soon, so it is prime time to consider machine/deep learning from a novel perspective and in new use cases. Also, deep learning algorithms are gaining popularity in IoT edge devices such as sensors or actuators, as they are indispensable in many tasks, like image classification or speech recognition. As a consequence, there is an increasing interest in deploying neural networks on low-power processors found in always-on systems, e.g., ARM Cortex-M microcontrollers.

In this work, we focus on two neural network algorithms: multilayer perceptrons (MLPs) and convolutional neural networks (CNNs). We consider feed-forward neural networks and, consequently, our analysis is conducted on such networks only.

With the increasing number of design strategies and elements to use, fine-tuning the hyper-parameters of those algorithms is emerging as one of the main challenges. Across distinct industries, we are witnessing an increase in strategies that treat models as intellectual property (IP). Basically, when optimized networks are of commercial interest, their details are kept undisclosed. For example, EMVCo (formed by MasterCard and Visa to manage specifications for payment systems and to facilitate worldwide interoperability) nowadays requires deep learning techniques for security evaluations [43]. This has an obvious consequence: 1) security labs generate (and use) neural networks for the evaluation of security products, and 2) they treat those networks as IP, exclusive to their customers.

There are also other reasons for keeping neural network architectures secret. Often, these pre-trained models might reveal additional information regarding the training data, which can be very sensitive. For example, if the model is trained on the medical records of patients [9], confidential information could be encoded into the network during the training phase. Also, machine learning models that are used for guiding medical treatments are often based on a patient's genotype, making this extremely sensitive from the privacy perspective [10]. Even if we disregard privacy issues,



obtaining useful information from neural network architectures can help in acquiring trade secrets from the competition, which could lead to competitive products without violating intellectual property rights [3]. Hence, determining the layout of the network with trained weights is a desirable target for the attacker. One could ask the following question: Why would an attacker want to reverse engineer the neural network architecture instead of just training the same network on their own? There are several reasons complicating this approach. First, the attacker might not have access to the same training set in order to train their own neural network. Although this is admittedly a valid point, recent work shows how to overcome those limitations [49]. Second, as architectures have become more complex, there are more and more parameters to tune and it could be extremely difficult for the attacker to pinpoint the same values for the parameters as in the architecture of interest.

After motivating our use case, the main question that remains is the feasibility of reverse engineering such architectures. Physical access to a device could readily allow reverse engineering based on binary analysis. However, in a confidential IP setting, standard protections like blocking binary readback, blocking JTAG access [20], code obfuscation, etc. are expected to be in place, preventing such attacks. Nevertheless, even when this is the case, a viable alternative is to exploit side-channel leakages.

Side-channel analysis attacks have been widely studied in the information security and cryptography community, due to their potentially devastating impact on otherwise (theoretically) secure algorithms. Practically, the observation that various physical leakages such as timing delay, power consumption, and electromagnetic (EM) emanation become available during computation with the (secret) data has led to a whole new research area. By statistically combining the physical observation of a specific internal state with a hypothesis on the data being manipulated, it is possible to recover the intermediate state processed by the device.

In this study, our aim is to highlight the potential vulnerabilities of standard (perhaps still naive from the security perspective) implementations of neural networks. At the same time, we are unaware of any neural network implementation in the public domain that includes side-channel protection. For this reason, we do not just point to the problem but also suggest some protection measures for neural networks against side-channel attacks. Here, we start by considering some of the basic building blocks of neural networks: the number of hidden layers, the basic multiplication operation, and the activation functions.

For instance, the complex structure of the activation function often leads to conditional branching due to the necessary exponentiation and division operations. Conditional branching typically introduces input-dependent timing differences, resulting in different timing behavior for different activation functions and thus allowing function identification. Also, we notice that by observing side-channel leakage, it is possible to deduce the number of nodes and the number of layers in the networks.

In this work, we show it is possible to recover the layout of unknown networks by exploiting side-channel information. Our approach does not need access to training data and allows for network recovery by feeding known random inputs to the network. By using the well-known divide-and-conquer approach of side-channel analysis (i.e., the attacker's ability to work with a feasible number of hypotheses due to, e.g., the architectural specifics), the information at each layer can be recovered. Consequently, the recovered information can be used as input for recovering the subsequent layers.

We note that there exists somewhat parallel research to ours on reverse engineering by "simply" observing the outputs of the network and training a substitute model. Yet, this task is not so simple, since one needs to know what kind of architecture is used (e.g., convolutional neural network or multilayer perceptron, the number of layers, the activation functions, access to training data, etc.) while limiting the number of queries to ensure the approach is realistic [39]. Some more recent works have tried to overcome a few of the highlighted limitations [49, 18].

To the best of our knowledge, this kind of observation has never been used before in this context, at least not for leveraging (power/EM) side-channel leakages with reverse engineering the neural network architecture as the main goal. We position our results against prior work in the following sections in more detail. To summarize, our main motivation comes from the ever more pervasive use of neural networks in security-critical applications and the fact that the architectures are becoming proprietary knowledge for the security evaluation industry. Hence, reverse engineering a neural network has become a new target for adversaries, and we need a better understanding of the vulnerabilities to side-channel leakages in those cases to be able to protect the users' rights and data.

1.1 Related Work

There are many papers considering machine learning, and more recently deep learning, for improving the effectiveness of side-channel attacks. For instance, a number of works have compared the effectiveness of classical profiled side-channel attacks, so-called template attacks, against various machine learning techniques [30, 19]. Lately, several works have explored the power of deep learning in the context of side-channel analysis [32]. However, this line of work uses machine learning to derive a new side-channel distinguisher, i.e., the selection function leading to the key recovery.

On the other hand, using side-channel analysis to attack machine learning architectures has been much less investigated. Shokri et al. investigate the leakage of sensitive information from machine learning models about individual



data records on which they were trained [44]. They show that such models are vulnerable to membership inference attacks, and they also evaluate some mitigation strategies. Song et al. show how a machine learning model from a malicious machine learning provider can be used to obtain information about the training set of a model [45]. Hua et al. were the first to reverse engineer two convolutional neural networks, namely AlexNet and SqueezeNet, through memory and timing side-channel leaks [17]. The authors measure the side channel through an artificially introduced hardware trojan. They also need access to the original training data set for the attack, which might not always be available. Lastly, in order to obtain the weights of the neural networks, they attack a very specific operation, i.e., zero pruning [40]. Wei et al. have also performed an attack on an FPGA-based convolutional neural network accelerator [52]. They recovered the input image from the collected power consumption traces. The proposed attack exploits a specific design choice, i.e., the line buffer in a convolution layer of a CNN.

In a nutshell, both previous reverse engineering efforts using side-channel information were performed on very special designs of neural networks, and the attacks had very specific and different goals. Our work is more generic than those two, as it assumes just a passive adversary able to measure physical leakages, and our strategy remains valid for a range of architectures and devices. Although we show the results on chips that were depackaged prior to the experiments in order to demonstrate the leakage available to powerful adversaries, our findings remain valid even without depackaging. Basically, with EM as an available source of side-channel leakage, it comes down to using properly designed antennas and more advanced setups, which is beyond the scope of this work.

Several other works doing somewhat related research are given as follows. Ohrimenko et al. used a secure implementation of MapReduce jobs and analyzed intermediate traffic between reducers and mappers [37]. They showed how an adversary observing the runs of typical jobs can infer precise information about the inputs. In a follow-up work, they discuss how machine learning algorithms can be exploited through various side channels [38]. Consequently, they propose data-oblivious machine learning algorithms that prevent exploitation of side channels induced by memory, disk, and network accesses. They note that side-channel attacks based on power and timing leakages are out of the scope of their work. Xu et al. introduced controlled-channel attacks, a type of side-channel attack allowing an untrusted operating system to extract large amounts of sensitive information from protected applications [54]. Wang and Gong investigated both theoretically and experimentally how to steal hyper-parameters of machine learning algorithms [51]. In order to mount the attack in practice, they estimate the error between the true hyper-parameter and the estimated one.

In this work, we further explore the problem of reverse engineering of neural networks from a more generic perspective. The closest previous works to ours have reverse engineered neural networks by using cache attacks that work on distinct CPUs and are basically micro-architectural attacks (albeit using a timing side channel). Our approach utilizes the EM side channel on small embedded devices, and it is supported by practical results obtained on a real-world architecture. Finally, our attack is able to recover both the hyper-parameters (parameters external to the model, e.g., the number of layers) and the parameters (parameters internal to the model, like weights) of neural networks.

1.2 Contribution and Organization

The main contributions of this paper are:
1. We describe full reverse engineering of neural network parameters based on side-channel analysis. We are able to recover key parameters such as the activation functions, the pre-trained weights, and the number of hidden layers and neurons in each layer. The proposed technique does not need any information on the (sensitive) training data, as that information is often not even available to the attacker. We emphasize that, for our attack to work, we require the knowledge of some inputs/outputs and side-channel measurements, which is a standard assumption for side-channel attacks.
2. All the proposed attacks are practically implemented and demonstrated on two distinct microcontrollers (i.e., 8-bit AVR and 32-bit ARM).
3. We highlight some interesting aspects of side-channel attacks when dealing with real numbers, unlike in everyday cryptography. For example, we show that even a side-channel attack that failed can provide sensitive information about the target due to the precision error.
4. Finally, we propose a number of mitigation techniques rendering the attacks more difficult.

We emphasize that the simplicity of our attack is its strongest point, as it minimizes the assumptions on the adversary (no pre-processing, chosen-plaintext messages, etc.).

2 Background

In this section, we give details about the artificial neural networks we consider in this paper and their building blocks. Next, we discuss the concepts of side-channel analysis and several types of attacks we use in this paper.

2.1 Artificial Neural Networks

Artificial neural networks (ANNs) is an umbrella notion for all computer systems loosely inspired by biological neural networks. Such systems are able to "learn" from examples, which makes them a strong (and very popular) paradigm in



the machine learning domain. Any ANN is built from a number of nodes called artificial neurons. The nodes are connected in order to transmit a signal. Usually, in an ANN, the signal at the connection between artificial neurons is a real number, and the output of each neuron is calculated as a nonlinear function of the sum of its inputs. Neurons and connections have weights that are adjusted as the learning progresses. Those weights are used to increase or decrease the strength of a signal at a connection. In the rest of this paper, we use the notions of an artificial neural network, a neural network, and a network interchangeably.

Figure 1: Multilayer perceptron.

2.1.1 Multilayer Perceptron

A very simple type of neural network is called a perceptron. A perceptron is a linear binary classifier applied to the feature vector, i.e., a function that decides whether or not an input belongs to some specific class. Each vector component has an associated weight wi, and each perceptron has a threshold value θ. The output of a perceptron equals "1" if the direct sum between the feature vector and the weight vector is larger than zero, and "-1" otherwise. A perceptron classifier works only for data that are linearly separable, i.e., if there is some hyperplane that separates all the positive points from all the negative points [34].

By adding more layers to a perceptron, we obtain the multilayer perceptron algorithm. A multilayer perceptron (MLP) is a feed-forward neural network that maps sets of inputs onto sets of appropriate outputs. It consists of multiple layers of nodes in a directed graph, where each layer is fully connected to the next one. Consequently, each node in one layer connects with a certain weight w to every node in the following layer. The multilayer perceptron algorithm consists of at least three layers: one input layer, one output layer, and one hidden layer. Those layers must consist of nonlinearly activating nodes [7]. We depict a model of a multilayer perceptron in Figure 1. Note that, if there is more than one hidden layer, it can be considered a deep learning architecture. Differing from the linear perceptron, an MLP can distinguish data that are not linearly separable. To train the network, the backpropagation algorithm is used, which is a generalization of the least mean squares algorithm in the linear perceptron. Backpropagation is used by the gradient descent optimization algorithm to adjust the weights of the neurons by calculating the gradient of the loss function [34].

2.1.2 Convolutional Neural Network

CNNs represent a type of neural network that was first designed for 2-dimensional convolutions, inspired by the biological processes of the animal visual cortex [28]. From the operational perspective, CNNs are similar to ordinary neural networks (e.g., multilayer perceptrons): they consist of a number of layers where each layer is made up of neurons. CNNs use three main types of layers: convolutional layers, pooling layers, and fully-connected layers. Convolutional layers are linear layers that share weights across space. Pooling layers are non-linear layers that reduce the spatial size in order to limit the number of neurons. Fully-connected layers are layers where every neuron is connected with all the neurons in the neighboring layer. For additional information about CNNs, we refer interested readers to [12].

2.1.3 Activation Functions

An activation function of a node is a function f defining the output of the node given an input or set of inputs, see Eq. (1). To enable the calculation of nontrivial functions for an ANN using a small number of nodes, one needs nonlinear activation functions:

y = Activation(∑(weight · input) + bias). (1)

In this paper, we consider the logistic (sigmoid) function, the tanh function, the softmax function, and the Rectified Linear Unit (ReLU) function. The logistic function is a nonlinear function giving smooth and continuously differentiable results [14]. The range of the logistic function is [0, 1], which means that all the values going to the next neuron will have the same sign.

f(x) = 1 / (1 + e^(-x)). (2)

The tanh function is a scaled version of the logistic function; the main difference is that it is symmetric over the origin. The tanh function ranges in [-1, 1].

f(x) = tanh(x) = 2 / (1 + e^(-2x)) - 1. (3)

The softmax function is a type of sigmoid function able to map values into multiple outputs (e.g., classes). The softmax function is ideally used in the output layer of the classifier in order to obtain the probabilities defining a class for each input [5]. To denote a vector, we represent it in bold style.

f(x)_j = e^(x_j) / ∑_{k=1}^{K} e^(x_k), for j = 1, ..., K. (4)


The Rectified Linear Unit (ReLU) is a nonlinear function that differs from the previous activation functions in that it does not activate all the neurons at the same time [35]. By activating only a subset of the neurons at any time, we make the network sparse and easier to compute [2]. Consequently, such properties make ReLU probably the most widely used activation function in ANNs today.

f(x) = max(0, x). (5)

2.2 Side-channel Analysis

Side-channel analysis (SCA) exploits weaknesses at the implementation level [33]. More specifically, all computations running on a certain platform result in unintentional physical leakages, a sort of physical signature from the reaction time, power consumption, and electromagnetic (EM) emanations released while the device is manipulating data. SCA exploits those physical signatures, aiming at the recovery of the key (secret data). In its basic form, SCA was proposed to perform key recovery attacks on implementations of cryptography [23, 22]. One advantage of SCA over traditional cryptanalysis is that SCA can apply a divide-and-conquer approach. This means that SCA typically recovers small parts of the key (sub-keys) one by one, which reduces the attack complexity.

Based on the analysis technique used, different variants of SCA are known. In the following, we recall a few techniques used later in the paper. Although the original terms suggest power consumption as the source of leakage, the techniques apply to other side channels as well. In particular, in this work, we use the EM side channel, and the corresponding terms are adapted to reflect this.

Simple Power (or Electromagnetic) Analysis (SPA or SEMA). Simple power (or EM) analysis, as the name suggests, is the most basic form of SCA [22]. It targets information from the sensitive computation that can be recovered from a single trace or a few traces. As a common example, SPA can be used against a straightforward implementation of the RSA algorithm to distinguish the square from the multiply operation, leading to key recovery. In this work, we apply SPA, or rather SEMA, to reverse engineer the architecture of the neural network.

Differential Power (or Electromagnetic) Analysis (DPA or DEMA). DPA or DEMA is an advanced form of SCA, which applies statistical techniques to recover secret information from physical signatures. The attack normally tests for dependencies between the actual physical signatures (or measurements) and hypothetical physical signatures, i.e., predictions on intermediate data. The hypothetical signature is based on a leakage model and a key hypothesis. Small parts of the secret key (e.g., one byte) can be tested independently. The knowledge of the leakage model comes from the adversary's intuition and expertise. Some commonly used leakage models for representative devices are the Hamming weight for microcontrollers and the Hamming distance for FPGA, ASIC, and GPU platforms [4, 31]. As the measurements can be noisy, the adversary often needs many measurements, sometimes millions. Next, statistical tests like correlation [6] are applied to distinguish the correct key hypothesis from the wrong guesses. In the following, DPA (DEMA) is used to recover the secret weights from a pre-trained network.

3 Side-channel Based Reverse Engineering of Neural Networks

In this section, we discuss the threat model we use, the experimental setup, and the reverse engineering of various elements of neural networks.

3.1 Threat Model

The main goal of this work is to recover the neural network architecture using only side-channel information.

Scenario. We select to work with MLPs and CNNs since: 1) they are commonly used machine learning algorithms in modern applications, see e.g., [16, 11, 36, 48, 25, 21]; 2) they consist of different types of layers that also occur in other architectures, like recurrent neural networks; and 3) in the case of MLP, the layers are all identical, which makes it more difficult for SCA and can consequently be considered the worst-case scenario.

We choose our attack to be as generic as possible. For instance, we make no assumption on the type of the inputs or their source, as we work with real numbers. If the inputs are in the form of integers (like the MNIST database), the attack becomes easier, since we would not need to recover mantissa bytes and deal with precision. We also assume that the implementation of the machine learning algorithm does not include any side-channel countermeasures.

Attacker's capability. The attacker in consideration is a passive one. This implies that he/she acquires measurements of the device while it operates "normally", without interfering with its internal operations by evoking faulty computations and behavior through, e.g., glitching the device. In more detail, we consider the following setting:

1. The attacker does not know the architecture of the used network but can feed random (and hence known) inputs to the architecture. We note that the attacks and analysis presented in our work do not rely on any assumptions on the distribution of the inputs, although a common assumption in SCA is that they are chosen uniformly at random. Basically, we assume that the attacker has physical access to the device (which can be remote, via EM signals) and he/she knows that the device runs some neural net. The attacker only controls the execution through selecting the inputs, but


he/she can observe the outputs and side-channel information (but not individual intermediate values). This attack scenario is often referred to as a known-plaintext attack. An adequate use case would be when the attacker legally acquires a copy of the network with API access to it and aims at recovering its internal details, e.g., for IP theft.

2. The attacker is capable of measuring side-channel information leaked from the implementation of the targeted architecture. The attacker can collect multiple side-channel measurements while the data is processed and use different side-channel techniques for the analysis. In this work, we focus on timing and EM side channels.

(a) Target 8-bit microcontroller, mechanically decapsulated. (b) Langer RF-U 5-2 near-field electromagnetic passive probe. (c) The complete measurement setup.

Figure 2: Experimental setup

3.2 Experimental Setup

Here we describe the attack methodology, which is first validated on an Atmel ATmega328P. Later, we also demonstrate the proposed methodology on an ARM Cortex-M3.

The side-channel activity is captured using a LeCroy WaveRunner 610Zi oscilloscope. For each known input, the attacker gets one measurement (or trace) from the oscilloscope. In the following, the number of inputs and the number of traces are used interchangeably. Each measurement is composed of many samples (or points). The number of samples (or the length of the trace) depends on the sampling frequency and the execution time. As shown later, depending on the target, the number of samples can vary from thousands (for a multiplication) to millions (for a whole CNN network). The measurements are synchronized with the operations through common handshaking signals, like the start and stop of the computation. To further improve the quality of the measurements, we opened the chip package mechanically (see Figure 2a). An RF-U 5-2 near-field electromagnetic (EM) probe from Langer is used to collect the EM measurements (see Figure 2b). The setup is depicted in Figure 2c. We use the probe as an antenna for spying on the EM side-channel leakage from the underlying processor running the ML computation. Note that EM measurements also allow observing the timing of all the operations, and thus the setup allows for timing side-channel analysis as well. Our choice of the target platforms is motivated by the following considerations:

• Atmel ATmega328P: This processor typically allows for high-quality measurements. We are able to achieve measurements with a high signal-to-noise ratio (SNR), making this a perfect tuning phase to develop the methodology of our attacks.
• ARM Cortex-M3: This is a modern 32-bit microcontroller architecture featuring multiple pipeline stages, on-chip co-processors, low-SNR measurements, and wide application. We show that the developed methodology is indeed versatile across targets, given a relevant update of the measurement capability.

In addition, real-world use cases also justify our platforms of choice. Similar microcontrollers are often used in wearables like Fitbit (ARM Cortex-M4), several hardware crypto wallets, smart home devices, etc. Additionally, SCA on a GPU or an FPGA platform has been practically demonstrated in several instances, so our methodology can be directly adapted for those cases as well. For different platforms, the leakage model could change, but this would not limit our approach and methodology. In fact, adequate leakage models are known for platforms like FPGA [4] and GPU [31]. Moreover, as for the ARM Cortex-M3, the low SNR of the measurements might force the adversary to increase the number of measurements and apply signal pre-processing techniques, but the main principles behind the analysis remain valid.

As already stated above, the exploited leakage model of the target device is the Hamming weight (HW) model. A microcontroller loads sensitive data onto a data bus to perform the indicated instructions. This data bus is pre-charged to all '0's before every instruction. Note that the data bus being pre-charged is natural behavior of microcontrollers and not a vulnerability introduced by the attacker. Thus, the power consumption (or EM radiation) assigned to the value of the data being loaded is modeled as the number of bits equal to '1'. In other words, the power consumption of loading data



Figure 3: Observing pattern and timing of multiplication and activation function

x is:

HW(x) = ∑_{i=1}^{n} x_i,  (6)

where x_i represents the i-th bit of x. In our case, it is the secret pre-trained weight which is regularly loaded from memory for processing and results in the HW leakage. To conduct the side-channel analysis, we perform the divide-and-conquer approach, where we target each operation separately. The full recovery process is described in Section 3.6.
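As a small sketch (the helper name is ours, not from the paper's implementation), the HW leakage model of Eq. (6) can be written as:

```python
# Minimal sketch of the Hamming weight leakage model of Eq. (6): the leakage
# of a value loaded on the pre-charged data bus is modeled as the number of
# its bits equal to '1'.
def hamming_weight(x: int) -> int:
    """HW(x) = sum of the bits x_i of x."""
    return bin(x).count("1")

# Loading the byte 0xB5 = 0b10110101 is modeled as a leakage of 5.
assert hamming_weight(0xB5) == 5
```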
Several pre-trained networks are implemented on the board. The training phase is conducted offline, and the trained network is then implemented in C language and compiled on the microcontroller. In these experiments, we consider multilayer perceptron architectures consisting of a different number of layers and nodes in those layers. Note that, with our approach, there is no limit on the number of layers or nodes we can attack, as the attack scales linearly with the size of the network. The methodology is developed to demonstrate that the key parameters of the network, namely the weights and activation functions, can be reverse engineered. Further experiments are conducted on deep neural networks with three hidden layers, but the method remains valid for larger networks as well.

3.3 Reverse Engineering the Activation Function

We remind the reader that nonlinear activation functions are necessary in order to represent nonlinear functions with a small number of nodes in a network. As such, they are elements used in virtually any neural network architecture today [25, 15]. If the attacker is able to deduce the information on the type of used activation functions, he/she can use that knowledge together with information about input values to deduce the behavior of the whole network.

Figure 4: Timing behavior for different activation functions ((a) ReLU, (b) Sigmoid, (c) Tanh, (d) Softmax)
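For reference, the four activation functions considered below can be written as follows (textbook definitions, not the paper's C implementation; softmax is the only one whose cost grows with the number of output neurons):

```python
import math

def relu(x: float) -> float:
    return max(0.0, x)

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))    # requires e^x and a division

def tanh(x: float) -> float:
    return math.tanh(x)                   # also exponentiation-based

def softmax(z: list) -> list:
    exps = [math.exp(v) for v in z]       # one e^v per output neuron
    total = sum(exps)
    return [e / total for e in exps]
```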


Table 1: Minimum, Maximum, and Mean computation time (in ns) for different activation functions

Activation Function   Minimum   Maximum     Mean
ReLU                    5 879     6 069     5 975
Sigmoid               152 155   222 102   189 144
Tanh                   51 909   210 663   184 864
Softmax               724 366   877 194   813 712

We analyze the side-channel leakage from different activation functions. We consider the most commonly used activation functions, namely ReLU, sigmoid, tanh, and softmax [14, 35]. The timing behavior can be observed directly on the EM trace. For instance, as shown later in Figure 8a, a multiplication is followed by activation with individual signatures. For a similar architecture, we test different variants with each activation function. We collect EM traces and measure the timing of the activation function computation from the measurements. The measurements are taken when the network is processing random inputs in the range x ∈ [−2, 2]. A total of 2 000 EM measurements are captured for each activation function. As shown in Figure 3, the timing behavior of the four tested activation functions has distinct signatures allowing easy characterization.

Different inputs result in different processing times. Moreover, the timing behavior for the same inputs varies largely depending on the activation function. For example, we can observe that ReLU requires the shortest amount of time, due to its simplicity (see Figure 4a). On the other hand, tanh and sigmoid might have similar timing delays, but with a different pattern considering the input (see Figure 4b and Figure 4c), where tanh is more symmetric in pattern compared to sigmoid, for both positive and negative inputs. We can observe that the softmax function requires most of the processing time, since it requires the exponentiation operation, which also depends on the number of neurons in the output layer. As neural network algorithms are often optimized for performance, the presence of such timing side channels is often ignored. A function such as tanh or sigmoid requires computation of e^x and division, and it is known that such functions are difficult to implement in constant time. In addition, constant-time implementations might lead to substantial performance degradation. Other activation functions can be characterized similarly. Table 1 presents the minimum, maximum, and mean computation time for each activation function over 2 000 captured measurements. While ReLU is the fastest one, the timing difference for the other functions stands out sufficiently to allow for a straightforward recovery. To distinguish them, one can also do some pattern matching to determine which type of function is used, if necessary. Note that, although sigmoid and tanh have similar maximum and mean values, the minimum value differs significantly. Moreover, the attacker can sometimes pre-characterize (or profile) the timing behavior of the target activation function independently for better precision, especially when common libraries are used for standard functions like multiplication, activation function, etc.

3.4 Reverse Engineering the Multiplication Operation

A well-trained network can be of significant value. The main distinguishing factors for a well-trained network against a poorly trained one, for a given architecture, are the weights. With fine-tuned weights, we can improve the accuracy of the network. In the following, we demonstrate a way to recover those weights by using SCA.

For the recovery of the weights, we use Correlation Power Analysis (CPA), i.e., a variant of DPA using Pearson's correlation as a statistical test.¹ CPA targets the multiplication m = x · w of a known input x with a secret weight w. Using the HW model, the adversary correlates the activity of the predicted output m for all hypotheses of the weight. Thus, the attack computes ρ(t, w), for all hypotheses of the weight w, where ρ is the Pearson correlation coefficient and t is the side-channel measurement. The correct value of the weight w will result in a higher correlation, standing out from all other wrong hypotheses w∗, given enough measurements. Although the attack concept is the same as when attacking cryptographic algorithms, the actual attack used here is quite different. Namely, while cryptographic operations are always performed on fixed-length integers, in ANN we are dealing with real numbers.

We start by analyzing the way the compiler is handling floating-point operations for our target. The generated assembly is shown in Table 2, which confirms the usage of an IEEE 754 compatible representation as stated above. The knowledge of the representation allows one to better estimate the leakage behavior. Since the target device is an 8-bit microcontroller, the representation follows a 32-bit pattern (b31...b0), being stored in 4 registers. The 32 bits consist of: 1 sign bit (b31), 8 biased exponent bits (b30...b23), and 23 mantissa (fractional) bits (b22...b0). It can be formulated as:

(−1)^b31 × 2^((b30...b23)_2 − 127) × (1.b22...b0)_2.

For example, the value 2.43 can be expressed as (−1)^0 × 2^((10000000)_2 − 127) × (1.00110111000010100011111)_2. The measurement t is considered when the computed result m is stored back to the memory, leaking in the HW model, i.e., HW(m). Since the 32-bit m is split into individual 8-bit chunks, each byte of m is recovered individually. Hence, by recovering this representation, it is enough to recover the estimation of the real number value.

To implement the attack, two different approaches can be considered. The first approach is to build the hypothesis on

¹ It is called CEMA in the case of the EM side channel.



Figure 5: Correlation of different weight candidates on the multiplication operation ((a) first, (b) second, (c) third byte of the mantissa for weight = 2.43)
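The bit-level view described above can be reproduced directly; the following sketch (our own helper, assuming standard IEEE 754 single precision) splits a float32 into the sign, biased exponent, and mantissa fields and checks the 2.43 example:

```python
import struct

def ieee754_fields(x: float):
    """Split a float32 into (sign, biased exponent, 23-bit mantissa)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF

# 2.43 = (-1)^0 x 2^(128 - 127) x (1.00110111000010100011111)_2
sign, exponent, mantissa = ieee754_fields(2.43)
assert (sign, exponent) == (0, 128)
assert format(mantissa, "023b") == "00110111000010100011111"
```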

Table 2: Code snippet of the returned assembly for multiplication: x = x · w (= 2.36 or 0x3D0A1740 in IEEE 754 representation). The multiplication itself is not shown here, but from the register assignments, our leakage model assumption holds.

#    Instruction      Comment
11a  ldd r22, Y+1     0x01
11c  ldd r23, Y+2     0x02
11e  ldd r24, Y+3     0x03
120  ldd r25, Y+4     0x04
122  ldi r18, 0x3D    61
124  ldi r19, 0x0A    10
126  ldi r20, 0x17    23
128  ldi r21, 0x40    64
12a  call 0xa0a       multiplication
12e  std Y+1, r22     0x01
130  std Y+2, r23     0x02
132  std Y+3, r24     0x03
134  std Y+4, r25     0x04

the weight directly. For this experiment, we target the result of the multiplication m of known input values x and unknown weight w. For every input, we assume different possibilities for weight values. We then perform the multiplication and estimate the IEEE 754 binary representation of the output. To deal with the growing number of possible candidates for the unknown weight w, we assume that the weight will be bounded in a range [−N, N], where N is a parameter chosen by the adversary, and the size of possible candidates is denoted as s = 2N/p, where p is the precision when dealing with floating-point numbers.

Then, we perform the recovery of the 23-bit mantissa of the weight. The sign and exponent could be recovered separately. Thus, we are observing the leakage of 3 registers, and based on the best CPA results for each register, we can reconstruct the mantissa. Note that the recovered mantissa does not directly relate to the weight, but with the recovery of the sign and exponent, we could obtain the unique weight value. The traces are measured when the microcontroller performs the secret weight multiplication with uniformly random values between −1 and 1 (x ∈ [−1, 1]) to emulate normalized input values. We set N = 5 and, to reduce the number of possible candidates, we assume that each floating-point value will have a precision of 2 decimal points, p = 0.01. Since we are dealing with the mantissa only, we can then only check the weight candidates in the range [0, N], thus reducing the number of possible candidates. We note here that this range [−5, 5] is based on the previous experiments with MLP. Although, in the later phase of the experiment, we targeted the floating-point and fixed-point representation (2^32 in the worst-case scenario on a 32-bit microcontroller, but could be less if the value is, for example, normalized), instead of the real value, which could in principle cover all possible floating values.

In Figure 5, we show the result of the correlation for each byte with the measured traces. The horizontal axis shows the time of execution and the vertical axis the correlation. The experiments were conducted on 1 000 traces for each case. In the figure, the black plot denotes the correlation of the "correct" mantissa weight (|m(ŵ) − m(w)| < 0.01), whereas the red plots are from all other weight candidates in the range described earlier. Since we are only attacking the mantissa in this phase, several weight candidates might have similar correlation peaks. After the recovery of the mantissa, the sign bit and exponent can be recovered similarly, which narrows down the candidate list to a unique weight. Another observation is that the correlation value is not very high and is scattered across different clock cycles. This is due to the reason that the measurements are noisy and, since the operation is not constant-time, the interesting time samples are distributed across multiple clock cycles. Nevertheless, it is shown that the side-channel leakage can be exploited to recover the weight up to a certain precision. Multivariate side-channel analysis [42] can be considered if distributed samples hinder recovery.


Figure 6: Correlation comparison between the correct and incorrect mantissa of the weights. (a) weight = 1.635: the correct mantissa can be recovered (correct value/black line has a higher value compared to max incorrect values/red line). (b) weight = 0.890: a special case where an incorrect value of the mantissa has a higher correlation, recovering 0.890625 (1100100000..00) instead of 0.89 (1100011110...10), still within precision error limits, resulting in attack success

Figure 7: Recovery of the weight ((a) first byte recovery (sign and 7-bit exponent), (b) second byte recovery (lsb exponent and mantissa))

We emphasize that attacking real numbers as in the case of weights of an ANN can be easier than attacking cryptographic implementations. This is because cryptography typically works on fixed-length integers and exact values must be recovered. When attacking real numbers, small precision errors due to rounding off the intermediate values still result in useful information.

To deal with more precise values, we can target the mantissa multiplication operation directly. In this case, the search space can either be [0, 2^23 − 1] to cover all possible values for the mantissa (hence, more computational resources will be required) or we can focus only on the most significant bits of the mantissa (fewer candidates but also lesser precision). Since the 7 most significant bits of the mantissa are processed in the same register, we can aim to target only those bits, assigning the rest to 0. Thus, our search space is now [0, 2^7 − 1]. The mantissa multiplication can be performed as 1.mantissa_x × 1.mantissa_w, then taking the 23 most significant bits after the leading 1, and normalization (updating the exponent if the result overflows) if necessary.

In Figure 6, we show the result of the correlation between the HW of the first 7-bit mantissa of the weight with the traces. Except for Figure 6b, the results show that the correct mantissa can be recovered. Although the correlation is not increasing, it is important that the difference becomes stable after a sufficient amount of traces is used, eventually distinguishing the correct weight from wrong weight hypotheses. The most interesting result is shown in Figure 6b, which at first glance looks like a failure of the attack. Here, the target value of the mantissa is 1100011110...10, while the attack recovers 1100100000..00. Considering the sign and exponent, the attack recovers 0.890625 instead of 0.89, i.e., a precision error at the 4th place after the decimal point. Thus, in both cases, we have shown that we can recover the weights from the SCA leakage.

In Figure 7, we show the composite recovery of 2 bytes of the weight representation, i.e., a low precision setting where


we recover sign, exponent, and the most significant part of the mantissa. Again, the targeted (correct) weight can be easily distinguished from the other candidates. Hence, once all the necessary information has been recovered, the weight can be reconstructed accordingly.

3.5 Reverse Engineering the Number of Neurons and Layers

After the recovery of the weights and the activation functions, we now use SCA to determine the structure of the network. Mainly, we are interested to see if we can recover the number of hidden layers and the number of neurons for each layer. To perform the reverse engineering of the network structure, we first use SPA (SEMA). SPA is the simplest form of SCA, which allows information recovery in a single (or a few) traces with methods as simple as visual inspection. The analysis is performed on three networks with different layouts.

The first analyzed network is an MLP with one hidden layer with 6 neurons. The EM trace corresponding to the processing of a randomly chosen input is shown in Figure 8a. By looking at the EM trace, the number of neurons can be easily counted. The observability arises from the fact that the multiplication operation and the activation function (in this case, the Sigmoid function) have completely different leakage signatures. Similarly, the structures of deeper networks are also shown in Figure 8b and Figure 8c. The recovery of the output layer then provides information on the number of output classes. However, distinguishing different layers might be difficult, since the leakage pattern is only dependent on the multiplication and activation function, which are usually present in most of the layers. We observe minor features allowing identification of layer boundaries, but only with low confidence. Hence, we develop a different approach based on CPA to identify layer boundaries.

The experiments follow a similar methodology as the previous experiments. To determine if the targeted neuron is in the same layer as previously attacked neurons, or in the next layer, we perform a weight recovery using two sets of data.

Let us assume that we are targeting the first hidden layer (the same approach can be applied to other layers as well). Assume that the input is a vector of length N0, so the input x can be represented as x = {x1, ..., x_N0}. For the targeted neuron y_n in the hidden layer, we perform the weight recovery on 2 different hypotheses. For the first hypothesis, assume that y_n is in the first hidden layer. Perform weight recovery individually using x_i, for 1 ≤ i ≤ N0. For the second hypothesis, assume that y_n is in the next hidden layer (the second hidden layer). Perform weight recovery individually using y_i, for 1 ≤ i ≤ (n − 1). For each hypothesis, record the maximum (absolute) correlation value, and compare both. Since the correlation depends on both inputs to the multiplication operation, the incorrect hypothesis will result in a lower correlation value. Thus, this can be used to identify layer boundaries.

3.6 Recovery of the Full Network Layout

The combination of the previously developed individual techniques can thereafter result in full reverse engineering of the network. The full network recovery is performed layer by layer, and for each layer, the weights of each neuron have to be recovered one at a time. Let us consider a network consisting of N layers, L0, L1, ..., LN−1, with L0 being the input layer and LN−1 being the output layer. Reverse engineering is performed with the following steps:

1. The first step is to recover the weight w_L0 of each connection from the input layer (L0) to the first hidden layer (L1). Since the dimension of the input layer is known, the CPA/CEMA can be performed n_L0 times (the size of L0). The correlation is computed for 2^d hypotheses (d is the number of bits in the IEEE 754 representation, normally 32 bits, but to simplify, 16 bits can be used with lesser precision for the mantissa). After the weights have been recovered, the output of the sum of multiplications can be calculated. This information provides us with the input to the activation function.

2. In order to determine the output of the sum of the multiplications, the number of neurons in the layer must be known. This can be recovered by the combination of the SPA/SEMA and DPA/DEMA techniques described in the previous subsection (2 CPAs for each weight candidate w, so in total 2 · n_L0 · 2^d CPAs required), in parallel with the weight recovery. When all the weights of the first hidden layer are recovered, the following steps are executed.

3. Using the same set of traces, timing patterns for different inputs to the activation function can be built, similar to Figure 4. Timing patterns or average timing can then be compared with the profile of each function to determine the activation function (a comparison can be based on simple statistical tools like correlation, distance metrics, etc.). Afterward, the output of the activation function can be computed, which provides the input to the next layer.

4. The same steps are repeated for the subsequent layers (L1, ..., LN−1, so in total at most 2 · N · n_L · 2^d, where n_L is max(n_L0, ..., n_LN−1)) until the structure of the full network is recovered.

The whole procedure is depicted in Figure 9. In general, it can be seen that the attack scales linearly with the size of the network. Moreover, the same set of traces can be reused for various steps of the attack and for attacking different layers, thus reducing measurement effort.


Figure 8: SEMA on hidden layers ((a) one hidden layer with 6 neurons, (b) 2 hidden layers (6 and 5 neurons each), (c) 3 hidden layers (6, 5, 5 neurons each))

Figure 9: Methodology to reverse engineer the target neural network
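The worst-case effort implied by step 4 of the methodology can be made concrete with a small helper (our own illustration of the 2 · N · n_L · 2^d bound; the example network size is hypothetical):

```python
def max_cpa_runs(num_layers: int, widest_layer: int, d: int = 16) -> int:
    """Upper bound 2 * N * n_L * 2^d on CPA/CEMA runs for full recovery."""
    return 2 * num_layers * widest_layer * (2 ** d)

# A small 4-layer network whose widest layer has 6 neurons, attacked with
# 16-bit (reduced-precision) weight hypotheses:
runs = max_cpa_runs(num_layers=4, widest_layer=6, d=16)
assert runs == 3_145_728   # linear in both the depth N and the width n_L
```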

4 Experiments with ARM Cortex-M3

In the previous section, we proposed a methodology to reverse engineer sensitive parameters of a neural network, which we practically validated on an 8-bit AVR (Atmel ATmega328P). In this section, we extend the presented attack to a 32-bit ARM microcontroller. ARM microcontrollers form a fair share of the current market, with huge dominance in mobile applications, but are also seeing rapid adoption in markets like IoT, automotive, virtual and augmented reality, etc. Our target platform is the widely available Arduino Due development board, which contains an Atmel SAM3X8E ARM Cortex-M3 CPU with a 3-stage pipeline, operating at 84 MHz. The measurement setup is similar to previous experiments (Lecroy WaveRunner 610zi oscilloscope and RF-U 5-2 near-field EM probe from Langer). The point of measurement was determined by a benchmarking code running AES encryption. After capturing the measurements for the target neural network, one can perform reverse engineering. Note that ARM Cortex-M3 (as well as M4 and M7) has support for deep learning in the form of the CMSIS-NN implementation [27].

The timing behavior of various activation functions is shown in Figure 10. The results, though different from the previous experiments on AVR, have unique timing signatures, allowing identification of each activation function. Here, the sigmoid and tanh activation functions have similar minimal computation times, but the average and maximum values are higher for the tanh function. To distinguish them, one can obtain multiple inputs to the function, build patterns, and do pattern matching to determine which type of function is used. The activity of a single neuron is shown in Figure 11a, which uses sigmoid as the activation function (the multiplication operation is shown separated by a vertical red line).

A known-input attack is mounted on the multiplication to recover the secret weight. One practical consideration in attacking the multiplication is that different compilers will compile it differently for different targets. Modern microcontrollers also have dedicated floating-point units for handling operations like multiplication of real numbers. To avoid the discrepancy between multiplication implementations, we target the output of the multiplication. In other words, we target the point when the multiplication operation with the secret weight is completed and the resultant product is updated in general-purpose registers or memory. Figure 11b shows the success of the attack recovering the secret weight of 2.453, with a known input. As stated before, side-channel measurements on a modern 32-bit ARM Cortex-M3 may have lower SNR, thus making the attack slightly harder. Still, the attack is shown to be practical even on ARM with 2× more measurements. In our setup, getting 200 extra measurements takes less than a minute. Similarly, the setup and number of measurements can be updated for other targets like FPGA, GPU, etc.

Finally, the full network layout is recovered. The activity


Figure 10: Timing behavior for different activation functions ((a) ReLU, (b) Sigmoid, (c) Tanh)

Figure 11: Analysis of a (6,5,5) neural network ((a) observing pattern and timing of multiplication and activation function, (b) correlation comparison between correct and incorrect mantissa for weight = 2.453, (c) SEMA on hidden layers with 3 hidden layers (6, 5, 5 neurons each))

of a full network with 3 hidden layers composed of 6, 5, and 5 neurons each is shown in Figure 11c. All the neurons are observable by visual inspection. The layer boundaries (shown by a solid red line) can be determined by attacking the multiplication operation and following the approach discussed in Section 3.6.

4.1 Reverse Engineering MLP

The migration of our testbed to ARM Cortex-M3 allowed us to test bigger networks, which are used in some relevant case studies. First, we consider an MLP that is used in profiling side-channel analysis [41]. Our network of choice comes from the domain of side-channel analysis, which has seen the adoption of deep learning methods in the past. With this network, a state-of-the-art profiled SCA was conducted considering several datasets, where some even contain implemented countermeasures. Since certification labs use machine learning to evaluate the resilience of cryptographic implementations to profiled attacks, an attacker able to reverse engineer that machine learning would be able to use it to attack implementations on his own. The MLP we investigate has 4 hidden layers with dimensions (50, 30, 20, 50); it uses the ReLU activation function and has Softmax at the output. The whole measurement trace is shown in Figure 12(a), with a zoom on one neuron in the third layer in Figure 12(b). When measuring at 500 MSamples/s, each trace had ∼ 4.6 million samples. The dataset is DPAcontest v4 with 50 samples and 75 000 measurements [46]. The first 50 000 measurements are used for training and the rest for testing. We experiment with the Hamming weight model (meaning there are 9 output classes). The original accuracy equals 60.9% and the accuracy of the reverse engineered network is 60.87%. While the previously developed techniques are directly applicable, there are a few practical issues.

• As the average run time is 9.8 ms, each measurement would take long considering the measurement and data saving time. To boost the SNR, averaging is recommended. We could use the oscilloscope's built-in feature for averaging. Overall, the measurement time per trace was slightly over one second after averaging 10 times.

• The measurement period was too big to measure the whole period easily at a reasonable resolution. This was resolved by measuring two consecutive layers at a time


in independent measurements. It is important to always measure two consecutive layers, and not an individual layer, to determine layer boundaries. This issue otherwise can be solved with a high-end oscilloscope.

• We had to resynchronize traces each time according to the target neuron, which is a standard pre-processing step in side-channel attacks.

Figure 12: (a) Full EM trace of the MLP network from [41], (b) zoom on one neuron in the third hidden layer showing 20 multiplications, followed by a ReLU activation function. 50 such patterns can be seen in (a), identifying the third layer in the (50, 30, 20, 50) MLP

Next, we experiment with an MLP consisting of 4 hidden layers, where each layer has 200 nodes. We use the MNIST database as input to the MLP [29]. The MNIST database contains 60 000 training images and 10 000 testing images, where each image has 28 × 28 pixel size. The number of classes equals 10. The accuracy of the original network is equal to 98.16%, while the accuracy of the reverse engineered network equals 98.15%, with an average weight error converging to 0.0025.

We emphasize that both attacks (on DPAcontest v4 and MNIST) were performed following exactly the same procedure as in the previous sections, leading to a successful recovery of the network parameters. Finally, in accordance with the conclusions that our attack scales linearly with the size of the network, we did not experience additional difficulties when compared to attacking smaller networks.

4.2 Reverse Engineering CNN

When considering CNN, the target is the CMSIS-NN implementation [27] on ARM Cortex-M3, with the measurement setup the same as in previous experiments. Here, as input, we target the CIFAR-10 dataset [24]. This dataset consists of 60 000 32 × 32 color images in 10 classes. Each class has 6 000 images, and there are in total 50 000 training images and 10 000 test images. The CNN we investigate is the same as in [27], and it consists of 3 convolutional layers, 3 max pooling layers, and one fully-connected layer (in total 7 layers).

We choose as target the multiplication operation of the input with the weight, similar to previous experiments. Differing from previous experiments, the operations on real values are here performed using fixed-point arithmetic. Nevertheless, the idea of the attack remains the same. In this example, numbers are stored using an 8-bit data type – int8 (q7). The resulting multiplication is stored in a temporary int variable. This can also be easily extended to int16 or int32 for more precision. Since we are working with integer values, we use the Hamming weight model of the hypothetical outputs (since the Hamming weight model is more straightforward in this case).

If the storing of the temporary variable is targeted, as can be seen from Figure 13(a), around 50 000 traces will be required before the correct weight can be distinguished from the wrong weights. This is based on 0.01 precision (the absolute difference from the actual weight in floating number). However, in this case, it can be observed that the correlation value is quite low (∼ 0.1). In the case that the conversion to int8 is performed after the multiplication, this can also be targeted. In Figure 13(b), it can be seen that after 10 000 traces, the correct weight candidate can be distinguished, and the correlation is slightly higher (∼ 0.34).

Next, for the pooling layer, once the weights in the convolution part are recovered, the output can be calculated. Most CNNs use max pooling layers, which makes it also possible to simply guess the pooling layer type. Still, because the max pooling layer is based on the conditional instruction if (a > max) max = a, it is straightforward to differentiate it from the average pooling that has summation and division operations. This technique is then repeated to reverse engineer any number of convolutional and pooling layers. Finally, the CNN considered here uses the ReLU activation function and has one fully-connected layer, which are reverse engineered as discussed in the previous sections. In our experiment, the original accuracy of the CNN equals 78.47% and the accuracy of the recovered CNN is 78.11%. As can be seen, by using sufficient measurements (e.g., ∼ 50 000), we are able to reverse engineer the CNN architecture as well.

Figure 13: The correlation of correct and wrong weight hypotheses for different numbers of traces, targeting the result of the multiplication operation stored as different variable types: (a) int, (b) int8.

5 Mitigation

As demonstrated, various side-channel attacks can be applied to reverse engineer certain components of a pre-trained network. To mitigate such a recovery, several countermeasures can be deployed:

1. Hidden layers of an MLP must be executed in sequence, but the multiplication operations in individual neurons within a layer can be executed independently. An example is shuffling [50], a well-studied side-channel countermeasure. It involves shuffling/permuting the order of execution of independent sub-operations. For example, given N sub-operations (1, . . . , N) and a random permutation σ, the order of execution becomes (σ(1), . . . , σ(N)) instead. We propose to shuffle the order of multiplications of individual neurons within a hidden layer during every classification step. Shuffling modifies the time window of operations from one execution to another, mitigating a classical DPA/DEMA attack.

2. Weight recovery can be hindered by the application of masking countermeasures [8, 42]. Masking is another widely studied side-channel countermeasure that is even accompanied by a formal proof of security. It involves ensuring that sensitive computations are performed with random values to remove the dependencies between actual data and side-channel signatures, thus preventing the attack. Every computation of f(x, w) is transformed into fm(x ⊕ m1, w ⊕ m2) = f(x, w) ⊕ m, where m1, m2 are uniformly drawn random masks, and fm is the masked function which applies mask m at the output of f, given masked inputs x ⊕ m1 and w ⊕ m2. If each neuron is individually masked with an independently drawn uniformly random mask for every iteration and every neuron, the proposed attacks can be prevented. However, this might result in a substantial performance penalty.

3. The proposed attack on activation functions is possible due to their non-constant timing behavior. Most of the considered activation functions perform an exponentiation operation. Constant-time implementation of exponentiation has been widely studied in the domain of public-key cryptography [13]. Such ideas can be adapted to implement constant-time activation function processing.

Note that the techniques we discuss here represent well-explored methods of protecting against side-channel attacks. As such, they are generic and can be applied to any implementation. Unfortunately, all those countermeasures also come with an area and performance cost. Shuffling and masking require a true random number generator that is typically very expensive in terms of area and performance. Constant-time implementations of exponentiation [1] also degrade performance. Thus, the optimal choice of protection mechanism should be made after a systematic resource and performance evaluation study.

6 Further Discussions and Conclusions

Neural networks are a widely used family of machine learning algorithms due to their versatility across domains. Their effectiveness depends on the chosen architecture and fine-tuned parameters, along with the trained weights, which can be proprietary information. In this work, we practically demonstrate reverse engineering of neural networks using side-channel analysis techniques. Concrete attacks are performed on measured data corresponding to implementations of chosen networks. To make our setting even more general, we do not assume any specific form of the input data (except that inputs are real values).
We conclude that, using an appropriate combination of SEMA and DEMA techniques, all sensitive parameters of the network can be recovered. The proposed methodology is demonstrated on two different modern controllers, a classic 8-bit AVR and a 32-bit ARM Cortex-M3 microcontroller. As also shown in this work, the attacks on modern devices are



typically somewhat harder to mount, due to lower SNR for side-channel attacks, but remain practical. In the presented experiments, the attack took twice as many measurements, requiring roughly 20 seconds of extra time. Overall, the attack methodology scales linearly with the size of the network.
The attack might be easier in some settings where a new network is derived from a well-known network like VGG-16, AlexNet, etc., by tuning hyper-parameters or transfer learning. In such cases, the side-channel based approach can reveal the remaining secrets. However, analysis of such partial cases is currently out of scope.
The proposed attacks are both generic in nature and more powerful than the previous works in this direction. Finally, suggestions on countermeasures are provided to help the designer mitigate such threats. The proposed countermeasures are borrowed mainly from the side-channel literature and can incur huge overheads. Still, we believe that they could motivate further research on optimized and effective countermeasures for neural networks. Besides continuing to work on countermeasures, as the main future research goal, we plan to look into more complex CNNs. Naturally, this will require stepping aside from low-power ARM devices and using, for instance, FPGAs. Additionally, in this work, we considered only feed-forward networks. It would be interesting to extend our work to other types of networks like recurrent neural networks. Since such architectures share many elements with MLPs and CNNs, we believe our attack should be (relatively) easily extendable to such neural networks.

References

[1] Al Hasib, A., and Haque, A. A. M. M. A comparative study of the performance and security issues of AES and RSA cryptography. In Convergence and Hybrid Information Technology, 2008. ICCIT'08. Third International Conference on (2008), vol. 2, IEEE, pp. 505–510.

[2] Albericio, J., Judd, P., Hetherington, T., Aamodt, T., Jerger, N. E., and Moshovos, A. Cnvlutin: Ineffectual-Neuron-Free Deep Neural Network Computing. In 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA) (June 2016), pp. 1–13.

[3] Ateniese, G., Mancini, L. V., Spognardi, A., Villani, A., Vitali, D., and Felici, G. Hacking Smart Machines with Smarter Ones: How to Extract Meaningful Data from Machine Learning Classifiers. Int. J. Secur. Netw. 10, 3 (Sept. 2015), 137–150.

[4] Bhasin, S., Guilley, S., Heuser, A., and Danger, J.-L. From cryptography to hardware: analyzing and protecting embedded Xilinx BRAM for cryptographic applications. Journal of Cryptographic Engineering 3, 4 (2013), 213–225.

[5] Bishop, C. M. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag, Berlin, Heidelberg, 2006.

[6] Brier, E., Clavier, C., and Olivier, F. Correlation power analysis with a leakage model. In International Workshop on Cryptographic Hardware and Embedded Systems (2004), Springer, pp. 16–29.

[7] Collobert, R., and Bengio, S. Links Between Perceptrons, MLPs and SVMs. In Proceedings of the Twenty-first International Conference on Machine Learning (New York, NY, USA, 2004), ICML '04, ACM, pp. 23–.

[8] Coron, J.-S., and Goubin, L. On boolean and arithmetic masking against differential power analysis. In International Workshop on Cryptographic Hardware and Embedded Systems (2000), Springer, pp. 231–237.

[9] Dowlin, N., Gilad-Bachrach, R., Laine, K., Lauter, K., Naehrig, M., and Wernsing, J. CryptoNets: Applying Neural Networks to Encrypted Data with High Throughput and Accuracy. In Proceedings of the 33rd International Conference on International Conference on Machine Learning - Volume 48 (2016), ICML'16, JMLR.org, pp. 201–210.

[10] Fredrikson, M., Lantz, E., Jha, S., Lin, S., Page, D., and Ristenpart, T. Privacy in Pharmacogenetics: An End-to-End Case Study of Personalized Warfarin Dosing. In USENIX Security (2014), pp. 17–32.

[11] Gilmore, R., Hanley, N., and O'Neill, M. Neural network based attack on a masked implementation of AES. In 2015 IEEE International Symposium on Hardware Oriented Security and Trust (HOST) (May 2015), pp. 106–111.

[12] Goodfellow, I., Bengio, Y., and Courville, A. Deep Learning. MIT Press, 2016. http://www.deeplearningbook.org.

[13] Hachez, G., and Quisquater, J.-J. Montgomery exponentiation with no final subtractions: Improved results. In International Workshop on Cryptographic Hardware and Embedded Systems (2000), Springer, pp. 293–301.

[14] Haykin, S. Neural Networks: A Comprehensive Foundation, 2nd ed. Prentice Hall PTR, Upper Saddle River, NJ, USA, 1998.

[15] He, K., Zhang, X., Ren, S., and Sun, J. Deep Residual Learning for Image Recognition. CoRR abs/1512.03385 (2015).



[16] Heuser, A., Picek, S., Guilley, S., and Mentens, N. Lightweight Ciphers and their Side-channel Resilience. IEEE Transactions on Computers (2017), 1–1.

[17] Hua, W., Zhang, Z., and Suh, G. E. Reverse Engineering Convolutional Neural Networks Through Side-channel Information Leaks. In Proceedings of the 55th Annual Design Automation Conference (New York, NY, USA, 2018), DAC '18, ACM, pp. 4:1–4:6.

[18] Ilyas, A., Engstrom, L., Athalye, A., and Lin, J. Black-box Adversarial Attacks with Limited Queries and Information. CoRR abs/1804.08598 (2018).

[19] Jap, D., Stöttinger, M., and Bhasin, S. Support vector regression: exploiting machine learning techniques for leakage modeling. In Proceedings of the Fourth Workshop on Hardware and Architectural Support for Security and Privacy (2015), ACM, p. 2.

[20] Khan, A., Goodhue, G., Shrivastava, P., Van Der Veer, B., Varney, R., and Nagaraj, P. Embedded memory protection, Nov. 22 2011. US Patent 8,065,512.

[21] Kober, J., and Peters, J. Reinforcement Learning in Robotics: A Survey, vol. 12. Springer, Berlin, Germany, 2012, pp. 579–610.

[22] Kocher, P., Jaffe, J., and Jun, B. Differential power analysis. In Annual International Cryptology Conference (1999), Springer, pp. 388–397.

[23] Kocher, P. C. Timing attacks on implementations of Diffie-Hellman, RSA, DSS, and other systems. In Annual International Cryptology Conference (1996), Springer, pp. 104–113.

[24] Krizhevsky, A., Nair, V., and Hinton, G. CIFAR-10 (Canadian Institute for Advanced Research).

[25] Krizhevsky, A., Sutskever, I., and Hinton, G. E. ImageNet Classification with Deep Convolutional Neural Networks. In Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 1 (USA, 2012), NIPS'12, Curran Associates Inc., pp. 1097–1105.

[26] Kučera, M., Tsankov, P., Gehr, T., Guarnieri, M., and Vechev, M. Synthesis of Probabilistic Privacy Enforcement. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security (New York, NY, USA, 2017), CCS '17, ACM, pp. 391–408.

[27] Lai, L., Suda, N., and Chandra, V. CMSIS-NN: Efficient Neural Network Kernels for Arm Cortex-M CPUs. CoRR abs/1801.06601 (2018).

[28] LeCun, Y., Bengio, Y., et al. Convolutional networks for images, speech, and time series. The handbook of brain theory and neural networks 3361, 10 (1995).

[29] LeCun, Y., and Cortes, C. MNIST handwritten digit database.

[30] Lerman, L., Poussier, R., Bontempi, G., Markowitch, O., and Standaert, F.-X. Template attacks vs. machine learning revisited (and the curse of dimensionality in side-channel analysis). In International Workshop on Constructive Side-Channel Analysis and Secure Design (2015), Springer, pp. 20–33.

[31] Luo, C., Fei, Y., Luo, P., Mukherjee, S., and Kaeli, D. Side-channel power analysis of a GPU AES implementation. In Computer Design (ICCD), 2015 33rd IEEE International Conference on (2015), IEEE, pp. 281–288.

[32] Maghrebi, H., Portigliatti, T., and Prouff, E. Breaking cryptographic implementations using deep learning techniques. In International Conference on Security, Privacy, and Applied Cryptography Engineering (2016), Springer, pp. 3–26.

[33] Mangard, S., Oswald, E., and Popp, T. Power Analysis Attacks: Revealing the Secrets of Smart Cards. Springer, December 2006. ISBN 0-387-30857-1, http://www.dpabook.org/.

[34] Mitchell, T. M. Machine Learning, 1 ed. McGraw-Hill, Inc., New York, NY, USA, 1997.

[35] Nair, V., and Hinton, G. E. Rectified Linear Units Improve Restricted Boltzmann Machines. In Proceedings of the 27th International Conference on International Conference on Machine Learning (USA, 2010), ICML'10, Omnipress, pp. 807–814.

[36] Naraei, P., Abhari, A., and Sadeghian, A. Application of multilayer perceptron neural networks and support vector machines in classification of healthcare data. In 2016 Future Technologies Conference (FTC) (Dec 2016), pp. 848–852.

[37] Ohrimenko, O., Costa, M., Fournet, C., Gkantsidis, C., Kohlweiss, M., and Sharma, D. Observing and Preventing Leakage in MapReduce. In Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security (New York, NY, USA, 2015), CCS '15, ACM, pp. 1570–1581.



[38] Ohrimenko, O., Schuster, F., Fournet, C., Mehta, A., Nowozin, S., Vaswani, K., and Costa, M. Oblivious Multi-party Machine Learning on Trusted Processors. In Proceedings of the 25th USENIX Conference on Security Symposium (Berkeley, CA, USA, 2016), SEC'16, USENIX Association, pp. 619–636.

[39] Papernot, N., McDaniel, P., Goodfellow, I., Jha, S., Celik, Z. B., and Swami, A. Practical Black-Box Attacks Against Machine Learning. In Proceedings of the 2017 ACM on Asia Conference on Computer and Communications Security (New York, NY, USA, 2017), ASIA CCS '17, ACM, pp. 506–519.

[40] Parashar, A., Rhu, M., Mukkara, A., Puglielli, A., Venkatesan, R., Khailany, B., Emer, J., Keckler, S. W., and Dally, W. J. SCNN: An accelerator for compressed-sparse convolutional neural networks. In 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA) (June 2017), pp. 27–40.

[41] Picek, S., Heuser, A., Jovic, A., Bhasin, S., and Regazzoni, F. The Curse of Class Imbalance and Conflicting Metrics with Machine Learning for Side-channel Evaluations. IACR Transactions on Cryptographic Hardware and Embedded Systems 2019, 1 (Nov. 2018), 209–237.

[42] Prouff, E., and Rivain, M. Masking against side-channel attacks: A formal security proof. In Annual International Conference on the Theory and Applications of Cryptographic Techniques (2013), Springer, pp. 142–159.

[43] Riscure. https://www.riscure.com/blog/automated-neural-network-construction-genetic-algorithm/, 2018.

[44] Shokri, R., Stronati, M., Song, C., and Shmatikov, V. Membership Inference Attacks Against Machine Learning Models. In 2017 IEEE Symposium on Security and Privacy (SP) (May 2017), pp. 3–18.

[45] Song, C., Ristenpart, T., and Shmatikov, V. Machine Learning Models That Remember Too Much. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security (New York, NY, USA, 2017), CCS '17, ACM, pp. 587–601.

[46] Telecom ParisTech SEN research group. DPA Contest (4th edition), 2013–2014. http://www.DPAcontest.org/v4/.

[47] Teufl, P., Payer, U., and Lackner, G. From NLP (Natural Language Processing) to MLP (Machine Language Processing). In Computer Network Security (Berlin, Heidelberg, 2010), I. Kotenko and V. Skormin, Eds., Springer Berlin Heidelberg, pp. 256–269.

[48] Thomas, P., and Suhner, M.-C. A New Multilayer Perceptron Pruning Algorithm for Classification and Regression Applications. Neural Processing Letters 42, 2 (Oct 2015), 437–458.

[49] Tramèr, F., Zhang, F., Juels, A., Reiter, M. K., and Ristenpart, T. Stealing Machine Learning Models via Prediction APIs. CoRR abs/1609.02943 (2016).

[50] Veyrat-Charvillon, N., Medwed, M., Kerckhof, S., and Standaert, F.-X. Shuffling against side-channel attacks: A comprehensive study with cautionary note. In International Conference on the Theory and Application of Cryptology and Information Security (2012), Springer, pp. 740–757.

[51] Wang, B., and Gong, N. Z. Stealing Hyperparameters in Machine Learning. CoRR abs/1802.05351 (2018).

[52] Wei, L., Liu, Y., Luo, B., Li, Y., and Xu, Q. I Know What You See: Power Side-Channel Attack on Convolutional Neural Network Accelerators. CoRR abs/1803.05847 (2018).

[53] Xu, X., Liu, C., Feng, Q., Yin, H., Song, L., and Song, D. Neural Network-based Graph Embedding for Cross-Platform Binary Code Similarity Detection. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security (New York, NY, USA, 2017), CCS '17, ACM, pp. 363–376.

[54] Xu, Y., Cui, W., and Peinado, M. Controlled-Channel Attacks: Deterministic Side Channels for Untrusted Operating Systems. In Proceedings of the 2015 IEEE Symposium on Security and Privacy (Washington, DC, USA, 2015), SP '15, IEEE Computer Society, pp. 640–656.

