Deep-Spying: Spying Using Smartwatch and Deep Learning
Master Thesis
IT University of Copenhagen
Copenhagen, Denmark
December 2015
Abstract
Wearable technologies are on the rise today, becoming more common and
broadly available to mainstream users. In fact, wristband and armband
devices such as smartwatches and fitness trackers have already taken an
important place in the consumer electronics market and are becoming
ubiquitous. By their very nature of being wearable, however, these devices
provide a new pervasive attack surface that threatens, among other things,
users' privacy.
The goal of this work is to raise awareness about the potential risks
related to the motion sensors built into wearable devices and to demonstrate
abuse opportunities leveraged by advanced neural network architectures.
Acknowledgments
I would first like to deeply thank Sebastian Risi for his insightful advice
during the entire duration of this thesis and his immediate interest in the
project idea. Special thanks also go to the seven voluntary participants who
took some of their time to help me collect valuable data.
Finally, I would like to thank the REAL lab, the PITlab, and the IT
department at the IT University of Copenhagen for providing me with
hardware and computational resources, allowing me to conduct the
experiments detailed in this work.
Contents
Abstract i
Acknowledgments ii
Contents iii
List of Tables x
1 Introduction 1
1.1 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2 Related Work 5
2.1 Key Concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.1.1 Computer Security . . . . . . . . . . . . . . . . . . . . 6
2.1.2 Wearable Computing . . . . . . . . . . . . . . . . . . . 6
2.1.3 Machine Learning . . . . . . . . . . . . . . . . . . . . . 8
2.2 Background: Artificial Neural Network . . . . . . . . . . . . . 8
2.2.1 Recurrent Neural Network . . . . . . . . . . . . . . . . 11
2.2.2 Long Short-Term Memory . . . . . . . . . . . . . . . . 12
2.2.3 Backpropagation . . . . . . . . . . . . . . . . . . . . . 15
2.3 Literature Review . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.3.1 Motion-Based Keystroke Inference Attack . . . . . . . 17
2.3.1.1 Keylogging . . . . . . . . . . . . . . . . . . . 18
2.3.1.2 Touchlogging . . . . . . . . . . . . . . . . . . 19
2.3.2 Classification of Motion Sensors Signal . . . . . . . . . 20
2.3.2.1 Classifier Model . . . . . . . . . . . . . . . . . 20
3 Attack Description 25
3.1 Attacker Goals . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.2 Threat Model . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.3 Attack Scenarios . . . . . . . . . . . . . . . . . . . . . . . . . 27
4 System 29
4.1 System Architecture . . . . . . . . . . . . . . . . . . . . . . . 29
4.2 Client . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
4.2.1 Wearable Application . . . . . . . . . . . . . . . . . . . 30
4.2.2 Mobile Application . . . . . . . . . . . . . . . . . . . . 31
4.2.3 Training Application . . . . . . . . . . . . . . . . . . . 33
4.3 Server . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
5 Data Analytics 36
5.1 Data Acquisition . . . . . . . . . . . . . . . . . . . . . . . . . 38
5.2 Pre-processing . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
5.2.1 Calibration . . . . . . . . . . . . . . . . . . . . . . . . 38
5.2.2 Median Filter . . . . . . . . . . . . . . . . . . . . . . . 40
5.2.3 Butterworth Filter . . . . . . . . . . . . . . . . . . . . 41
5.2.4 Kalman Filter . . . . . . . . . . . . . . . . . . . . . . . 43
5.3 Feature Extraction . . . . . . . . . . . . . . . . . . . . . . . . 46
5.3.1 Sensor Fusion . . . . . . . . . . . . . . . . . . . . . . . 46
5.3.2 Segmentation . . . . . . . . . . . . . . . . . . . . . . . 49
5.3.2.1 Segmentation from Labels . . . . . . . . . . . 49
5.3.2.2 Heuristic Segmentation . . . . . . . . . . . . . 51
5.3.3 Statistical Features . . . . . . . . . . . . . . . . . . . . 53
5.4 Classifier Model . . . . . . . . . . . . . . . . . . . . . . . . . . 53
5.4.1 Performance Evaluation . . . . . . . . . . . . . . . . . 54
6 Evaluation 61
6.1 Empirical Data Collection . . . . . . . . . . . . . . . . . . . . 61
6.2 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
6.2.1 Experiment 1: Touchlogging Attack . . . . . . . . . . . 64
6.2.2 Experiment 2: Keylogging Attack . . . . . . . . . . . . 66
6.2.3 Experiment 3: from Touchlogging to Keylogging . . . . 69
6.3 Discussions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
7 Conclusion 72
7.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
7.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
Bibliography 75
Appendices 85
A Backpropagation 86
B Signal Pre-processing 88
B.1 Gyroscope . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
B.2 Accelerometer . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
D Experiment Results 98
D.1 Results for Experiment 1: Touchlogging Attack . . . . . . . . 99
D.1.1 FNN-Sigmoid . . . . . . . . . . . . . . . . . . . . . . . 99
D.1.2 FNN-Tanh . . . . . . . . . . . . . . . . . . . . . . . . . 101
List of Figures
List of Tables
5.1 Fusion strategy benchmark results (average values for 100 training epochs) . . . 57
5.2 Hidden layer benchmark (see Figure 5.13 for graphical representation) . . . 59
1 Introduction
This chapter will first introduce the reader to the problem being addressed
in this research and its implications. It will then briefly describe the
methodology employed to provide a practical proof-of-concept system.
The keyboard is one of the oldest human-computer interfaces and still one
of the most common devices used to input information into various types of
machines. Some of this information can be sensitive and highly valuable,
such as passwords, PINs, social security numbers, and credit card numbers.
Related works (detailed in Chapter 2) have shown that the data from the
motion sensors of a smartphone can be used to infer keystrokes entered on
its touchscreen [16, 84, 66]. Other research has proved that the motion
sensors of a smartphone standing on a flat surface can be used to infer the
keystrokes typed on a nearby physical computer keyboard [61]. Moreover,
recently published works have demonstrated that smartwatch motion sensors
can be exploited to infer keystrokes on both virtual and physical keyboards
[81, 59, 56].
Chapter 1. Introduction
1.2 Methodology
tion sensors data and perform experimental indirect passive attacks such as
keylogging and touchlogging. Second, experiments are conducted to collect
data in a deployment environment. Finally, the results are interpreted and
discussed.
2 Related Work
This chapter's goal is twofold. The first goal is to define the key concepts
of the theoretical and technical background on which this work is based.
This research project is highly multidisciplinary, established at the
intersection of various research fields (as illustrated in Figure 2.1). The
second goal is thus to review and reflect on previous relevant studies and
the current state-of-the-art in the related fields. The core focus is the
security of wearable technologies, relying primarily on machine learning
methods for data analysis.
Chapter 2. Related Work
Motion Sensors: Modern mobile and wearable devices usually come with
built-in motion sensors measuring the movements of the device. Analyzing
the output data of such sensors allows the estimation of specific types of
motion that the device undergoes, such as translation, tilt, shake,
rotation, or swing. The typical motion sensors available in standard
devices are listed in Table 2.1. Software-based sensors usually derive
their data from hardware-based sensors, namely the accelerometers (one for
each of the axes x, y, and z) and the gyroscope [42, 40, 43, 1].
The network is activated by feeding its input layer with a feature vector
that will be mapped to an output vector thanks to the network's internal
structure. The neurons map inputs to outputs by using a predefined
activation function (examples listed in Table 2.2). The output value of a
given neuron i can be computed as follows:

y_i = φ(x_i) = φ(Σ_{j=1}^{n} W_{ij} y_j)    (2.1)
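As a toy illustration of Equation 2.1 (not code from the thesis), the output of a single neuron is a weighted sum of its inputs passed through an activation function; the weights, input values, and the sigmoid choice below are arbitrary assumptions:

```python
import math

def neuron_output(weights, inputs, phi):
    """Equation 2.1: y_i = phi(sum_j W_ij * y_j)."""
    x = sum(w * y for w, y in zip(weights, inputs))  # weighted input x_i
    return phi(x)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Three incoming neurons a, b, and c, as in Figure 2.3 (arbitrary values).
y = neuron_output([0.5, -0.3, 0.8], [1.0, 2.0, 0.5], sigmoid)
```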
Normalized Exponential (Softmax): φ(x_i) = e^{x_i} / Σ_{j=1}^{n} e^{x_j},
with derivative ∂φ(x_i)/∂x_j = φ(x_i)(δ_{ij} − φ(x_j)), where δ_{ij} = 1
if i = j and 0 if i ≠ j.
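The softmax function and its derivative from Table 2.2 can be sketched directly; the max-subtraction below is a numerical-stability detail added here, not part of the table:

```python
import math

def softmax(xs):
    """Normalized exponential: phi(x_i) = e^{x_i} / sum_j e^{x_j}."""
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def softmax_derivative(p, i, j):
    """d phi(x_i) / d x_j = phi(x_i) * (delta_ij - phi(x_j))."""
    delta = 1.0 if i == j else 0.0
    return p[i] * (delta - p[j])

p = softmax([1.0, 2.0, 3.0])  # a valid probability distribution
```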
Figure 2.3: A standard feedforward cell i with three other neurons a, b, and
c connected to its input.
An ANN is trained by adjusting its weights until the correct output vector
is generated from a given input so as to minimize the global error. The terms
Feedforward Neural Network (FNN) and Vanilla Neural Network are used to
refer to the most basic ANN architecture where the neurons are connected
forward in an acyclic way. That is, the activation flow is unidirectional from
the input layer to the output layer.
with x_t the new input vector at time t, y_{t−1} the previously produced
output vector, and the activation function φ′ the Hyperbolic Tangent (Tanh).
As shown in Figure 2.5, RNNs can be seen as unfolded deep FNNs where each
layer is connected to its past instance. It is thus possible to use an RNN
to map one input to one output, one input to many outputs, many inputs to
one output, or many inputs to many outputs.
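A minimal scalar sketch of such a recurrent mapping (hypothetical weights, not the thesis's network) shows how each step feeds the previous output back in:

```python
import math

def rnn_step(x_t, y_prev, w_in, w_rec):
    """One recurrent step: y_t = tanh(w_in * x_t + w_rec * y_{t-1})."""
    return math.tanh(w_in * x_t + w_rec * y_prev)

def rnn_unfold(xs, w_in=0.5, w_rec=0.9):
    """Unfold the recurrence over a sequence (many inputs -> many outputs)."""
    y, ys = 0.0, []
    for x in xs:
        y = rnn_step(x, y, w_in, w_rec)
        ys.append(y)
    return ys

ys = rnn_unfold([1.0, 0.0, -1.0])
```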
(a) RNN with one hidden recurrent unit. (b) Unfolded RNN.
Figure 2.5: An RNN can be seen as an unfolded deep FNN. The depth
corresponds to the length n of the input sequence.
Despite the interesting properties of RNNs, Bengio et al. [10] have shown
that standard RNNs are in practice unable to learn long-term dependencies
in contexts where information needs to be connected over long time
intervals. In fact, training RNNs with gradient descent methods such as
Backpropagation (details in Section 2.2.3) leads to gradually vanishing
gradients because of nested activation functions. In the case of RNNs,
where the depth can be both layer-related and time-related, the network
becomes unable to associate information separated over long periods
because the error cannot be preserved over such intervals.
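The vanishing-gradient effect can be illustrated numerically: backpropagating through n unfolded steps multiplies the gradient by a per-step factor, and when that factor is below one the gradient decays exponentially. The weight and input values below are arbitrary:

```python
import math

def tanh_grad(x):
    """Derivative of tanh, which is at most 1."""
    t = math.tanh(x)
    return 1.0 - t * t

def gradient_magnitude(n_steps, w_rec=0.9, x=0.5):
    """Backpropagating through n unfolded steps multiplies the gradient
    by w_rec * tanh'(x) at every step; with |w_rec * tanh'(x)| < 1 the
    gradient vanishes exponentially in n."""
    g = 1.0
    for _ in range(n_steps):
        g *= w_rec * tanh_grad(x)
    return g

g_short, g_long = gradient_magnitude(5), gradient_magnitude(50)
```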
With W the weights, x_t the new input vector, y_{t−1} the previously
produced output vector, c_{t−1} the previously produced cell state output,
φ the Logistic Function (Sigmoid), and φ′ the Hyperbolic Tangent (Tanh).
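A common scalar LSTM formulation consistent with this description (sigmoid gates φ, Tanh φ′) can be sketched as follows; this is an illustrative sketch, not the thesis's implementation, the shared weight values are arbitrary assumptions, and biases are omitted:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x_t, y_prev, c_prev, W):
    """One scalar LSTM step with input gate i, forget gate f, output
    gate o, and candidate g: c_t = f*c_{t-1} + i*g, y_t = o*tanh(c_t)."""
    i = sigmoid(W['xi'] * x_t + W['yi'] * y_prev)    # input gate (phi)
    f = sigmoid(W['xf'] * x_t + W['yf'] * y_prev)    # forget gate (phi)
    o = sigmoid(W['xo'] * x_t + W['yo'] * y_prev)    # output gate (phi)
    g = math.tanh(W['xg'] * x_t + W['yg'] * y_prev)  # candidate (phi')
    c_t = f * c_prev + i * g                         # new cell state
    y_t = o * math.tanh(c_t)                         # new output
    return y_t, c_t

W = {k: 0.5 for k in ('xi', 'yi', 'xf', 'yf', 'xo', 'yo', 'xg', 'yg')}
y1, c1 = lstm_step(1.0, 0.0, 0.0, W)
```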
2.2.3 Backpropagation
∂E/∂W_{ij} = e_i y_j    (2.12)

e_i = (∂φ(x_i)/∂x_i)(T_i − y_i)                 if i ∈ output layer,
e_i = (∂φ(x_i)/∂x_i) Σ_{j=1}^{n} W_{ij} e_j     otherwise.    (2.13)

With ∂φ(x_i)/∂x_i the derivative of the activation function (see Table
2.2), x the input to the neuron (computed from Equation 2.1), T the target
expected output¹, and y the predicted output. Since the error is being
backpropagated, j refers to neurons connected to i in the next higher
layer. The gradient thus represents the change to all the weights with
regard to the change in the global output error. The weights can finally
be updated such that:
With η the learning rate, a constant value usually chosen in the range
(0.0, 1.0), used to tune the training algorithm by determining how much
the weights are updated at each training iteration. A high learning rate
can quicken the training process by taking large training steps but can
prevent the global minimum from being reached if too large. A low learning
rate allows precise steps towards the solution but can lead to convergence
in a local minimum if too small.
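The effect of the learning rate η can be demonstrated on a toy one-dimensional error surface E(w) = w² (an illustrative stand-in, not the thesis's network): a well-chosen rate converges, a tiny one crawls, and an overly large one diverges.

```python
def gradient_descent(eta, w0=5.0, steps=100):
    """Minimize E(w) = w^2 with the update w <- w - eta * dE/dw."""
    w = w0
    for _ in range(steps):
        w -= eta * 2.0 * w  # dE/dw = 2w
    return w

w_small = gradient_descent(0.001)  # too small: slow convergence
w_good = gradient_descent(0.1)     # converges close to the minimum w = 0
w_big = gradient_descent(1.1)      # too large: the steps overshoot
```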
Many methods have been developed to further improve and optimize ANN
training (e.g. Adaptive Learning Rate, Bias, Weight Decay). Moreover,
different variants of the Backpropagation algorithm have been implemented
to increase the performance of the algorithm or adapt the technique to
different ANN architectures. A significant alternative is Backpropagation
Through Time [82], which is used to train RNNs.
2.3 Literature Review

The goal of this section is to investigate and understand the methods and
techniques used in previous studies from relevant similar research fields.
¹ For regression tasks, the expected output vector usually consists of one
or more continuous values. For classification tasks, the targeted output
vector usually consists of binary values. For example, for three classes:
class a: ⟨1, 0, 0⟩, class b: ⟨0, 1, 0⟩, and class c: ⟨0, 0, 1⟩.
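The binary target encoding described in this footnote (commonly called one-hot encoding) can be expressed in a couple of lines:

```python
def one_hot(label, classes):
    """Encode a class label as a binary target vector, e.g. b -> <0, 1, 0>."""
    return [1 if c == label else 0 for c in classes]

target = one_hot('b', ['a', 'b', 'c'])  # the target vector for class b
```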
Studies have shown the great potential of recovering sound, music, voice
conversations, and even typing by simply observing slight vibrations in
the environment produced by physical events [24, 6]. Such investigations
have shown that motion invisible to the human eye can convey a surprisingly
significant amount of information, establishing motion as a pertinent
source of valuable data and thus a reliable side-channel. Although a camera
can also be used to detect motion, we are here interested in motion
sensors because they are currently available on the majority of WADs,
which is not the case for cameras.
specifications, the keyboard layout, the user habits, and the relative
user position and motion. This assumption leads many security researchers
to question the practicality of such an attack. However, studies [17, 3]
have shown that motion-based keystroke inference attacks remain effective
and practical despite the obvious assumption that the previously
enunciated factors might alter the robustness and the accuracy of the
inference. Motion is thus established as a significant side-channel
allowing the leakage of sensitive information.
2.3.1.1 Keylogging
Marquardt et al. [61] have shown that the motion sensor output of a
smartphone standing on a flat surface can be used to infer keystrokes
typed on a nearby physical computer keyboard standing on the same surface.
Their attack scenario is based on two observations. First, that access to
the accelerometer data was not protected by any mobile operating system,
thus allowing any installed application to monitor the accelerometer
events. Secondly, that many users place their smartphones on the desk next
to their computer keyboard when they are working. In their experiment
setup, an iPhone device collects the accelerometer data and sends it to a
remote server where data processing and classification are performed. They
demonstrated the ability to recover up to 80% of typed content by matching
abstracted words with candidate dictionaries after classification. This
research, therefore, shows the great potential of successfully inferring
keystrokes from subtle motions such as small vibrations.
2.3.1.2 Touchlogging
Related works [16, 84, 66, 63] have shown that the data from the motion
sensors of a smartphone can be used to infer keystrokes entered on its
touchscreen. Cai et al. [16] demonstrated that a malicious Android
application can infer as much as 70% of the keystrokes entered on a
number-only virtual keyboard on an Android device. For their attack to
work, however, the user needs to grant the application access to the
motion sensors at install time. Cai et al. believed this assumption is not
unrealistic, considering that most users will not treat motion data as
being as sensitive as camera or microphone data, for example.
In their paper, Owusu et al. [66] proposed a system that reads
accelerometer data to extract 6-character passwords on an Android device.
Their experiment consists of a QWERTY virtual keyboard used to perform the
keystroke reconstruction attack. Additionally, a data collection screen is
used to collect ground truth from acceleration measurements matching key
presses at specific screen regions. They managed to break 59 of 99
passwords using only the accelerometer data.
Miluzzo et al. [63] have demonstrated that the motion sensors built into
smartphones and tablets could be used to infer keystrokes entered on a
complete 26-letter keyboard with an accuracy reaching as much as 90%. In
their work, they have also shown that combining both the accelerometer and
the gyroscope can improve the accuracy of the classification. Their
approach combines the results of multiple shallow classifiers to improve
the prediction quality.
As mentioned by Cai et al., the smaller the required training set, the
easier the attack. In their implementation [17], they showed that the
inference accuracy level stabilizes when the training set reaches a
certain size (i.e. 12 for an alphabet-only keyboard and 8 for a
number-only keyboard). It is also important to note that the choice of the
motion sensor can affect the quality of the classification. In fact,
studies have shown that the gyroscope is a better side-channel than the
accelerometer for keystroke inference [1, 17, 63].
In their work, they build feature vectors from time-domain data such as
the duration of the motion data segment, the peaks' time difference, the
number of spikes, the peaks' interval, the attenuation rate, and the
vertex angle between peaks. In another work, Cai et al. [16] experimented
with the patterns produced by the motion during keystrokes. They first
identified the starting and ending times of keystrokes by calculating the
Peak-to-Average Power Ratio of the pitch angle and the roll angle. Then
they observed that when these angle values are plotted, distinctive lobes
appear on the pattern with some interesting properties. In fact, they
noticed that lobe directions are similar for the same keys while the
angles of the lobes vary for different keys. Based on this 2D
representation, they built three pairs of features consisting of geometric
metrics such as the angle between the direction axis of the upper lobe and
the lower lobe with the x-axis, the angles of the two dominating edges,
and finally the average width of both the upper and lower lobes.
Owusu et al. [66] solved the sampling rate problem by using an approach
involving linear interpolation to create consistent sampling intervals
throughout the recorded accelerometer data. They extracted the individual
motion signals from each keypress using Root Mean Square anomalies for
spike detection. In their work, they used a set of 46 features consisting
of 44 acceleration-stream features (i.e. min, max, Root Mean Square,
number of local peaks, number of local crests, etc.) and two meta-features
(i.e. total time and window size). A wrapper [49] was then used for
feature subset selection to maximize the accuracy of the prediction.
In their work, Marquardt et al. [61] used a 100 ms long time window, as
Asonov et al. [2] did, to extract features from the signal. They overcame
the sampling rate problem by using a combination of domains to build the
feature vectors. That is, time-domain features (i.e. Root Mean Square,
skewness, variance, kurtosis), spectral-domain features (i.e. Fast Fourier
Transform), and cepstrum features (i.e. Mel-frequency Cepstral
Coefficients). For their
Xu et al. [84] selected features from the signal in a time window bounded
by the touch-down and touch-up events triggered when the user interacts
with the touchscreen. They used three features to determine whether the
touch event occurred on the left or the right side of the screen (i.e.
roll angle variations) and three features to determine whether the touch
event occurred on the top or the bottom of the screen (i.e. pitch angle
variations).
3 Attack Description
This chapter will provide an overview of three envisioned attack scenarios
where an attacker (Eve) uses a smartphone wirelessly paired with a WAD
worn by the victim (Alice) to perform a motion-based keystroke inference
attack. All scenarios are similar regarding the attacker's goal, threat
model, and methods employed. They differ only in the type of keypad being
eavesdropped.
Chapter 3. Attack Description
Once the WAD is paired with the attacker's device, applications can be
installed wirelessly on the WAD. Because the security risks of motion
sensors are not well understood and often underestimated, current
smartwatch operating systems (i.e. Android Wear 5.1 Lollipop, Apple Watch
watchOS 2) do not require any user permission for an application to use
the motion sensors. Additionally, applications can run as Services in the
background without displaying any GUI. Alice would, therefore, be unaware
that an unknown application installed on her WAD by Eve is monitoring her
motions.
² Wireless networks offer by definition more opportunities to
eavesdroppers than traditional wired networks because of their very nature
of wireless transmission. The data is transmitted using radio waves
through the air, allowing anyone with a suitable receiver to collect and
decode signals exchanged between two parties. The range of Bluetooth
technology is application-specific: the Core Specification [13] mandates a
minimum range of 10 meters, while the signal can still be transmitted up
to 100 meters. External antennas can also potentially be used by an
attacker to receive the signal from further away.
4 System
This chapter's purpose is twofold. First, to introduce the reader to the
system architecture, its different components, and their relationships.
Second, to describe each component's respective purpose and the methods
employed for their implementation.
The system should take WAD sensor data as input and infer keystrokes as
output. The main architectural model adopted is Client-Server because of
its flexibility. In fact, this distributed system paradigm allows client
machines with limited computational resources (e.g. mobile devices,
wearable computers) to delegate heavy computations to more powerful
machines such as a networked server. A server host provides services to
the different clients connected to it; the clients initiate communication
sessions with the server, which awaits incoming requests.
Chapter 4. System
the data to the processing server. In the training phase, a second client
is in charge of sending the labels to the server. One can see here an
important advantage of the Client-Server model: this architecture allows
the flexibility to experiment with different types of training devices
without having to change the rest of the system. That is, the very same
services provided by the networked server are directly available to any
client connected to the same local network.
4.2 Client
This mobile application is needed for the relay device; it was tested on
an LG Nexus 4 smartphone and implemented in Java to target devices running
Android API level 19. As shown in Figure 4.2, a recording session begins
when a user requests the mobile client to start recording. The smartphone
then sends one message to the server to initiate a new session and one
message to the WAD to start listening to motion sensor events. When the
user is typing on the training client, labels with timestamps¹ are sent to
the server. In the meantime, the user's hand motions are recorded by the
WAD and reported to the relay device as an Android Wearable Data Layer
message containing a timestamp value, a three-dimensional array (i.e. one
dimension per axis), and the type of sensor that emitted the event (i.e.
gyroscope or accelerometer). The relay device stores the data locally in a
buffer for each sensor type. Once a buffer is full (i.e. reaches a defined
size limit), the data are serialized to the JSON data-interchange format
[65] and sent to the server through a TCP socket connection. JSON makes it
easy for humans to read the data and for machines to parse it, which
enables fast-to-implement and less
¹ The timestamps are measured in ms and require the devices (i.e. the WAD
and the training device) to have synchronized clocks.
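The buffering-and-serialization step could look roughly like the following sketch; the field names, buffer size, and payload layout are hypothetical, not the thesis's actual schema:

```python
import json

BUFFER_LIMIT = 4  # assumed size limit; the real limit is not specified here

class SensorBuffer:
    """Buffers motion events per sensor type and serializes to JSON
    once the defined size limit is reached (hypothetical schema)."""

    def __init__(self, sensor_type):
        self.sensor_type = sensor_type
        self.events = []

    def add(self, timestamp, x, y, z):
        """Store one motion event; return the JSON payload when full."""
        self.events.append({'t': timestamp, 'values': [x, y, z]})
        if len(self.events) >= BUFFER_LIMIT:
            payload = json.dumps({'sensor': self.sensor_type,
                                  'events': self.events})
            self.events = []  # flush the buffer after serialization
            return payload
        return None

buf = SensorBuffer('gyroscope')
payloads = [buf.add(t, 0.1, 0.2, 0.3) for t in range(5)]
```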
A recording session stops when the user requests the mobile client to stop
recording. The relay device then sends a message to the WAD to stop
listening to motion sensors and waits until it receives the last message
that was waiting to be transmitted.² Once the last message from the WAD is
received, the smartphone flushes the buffer by serializing and
transmitting all of its remaining data. Finally, the relay smartphone
sends a message to the server to close the session.
² As a result of Bluetooth's limited throughput and the important amount
of motion events to be sent, the relay device sometimes needs to wait a
few minutes for all the packets to be received successfully from the WAD.
(a) Web application for touchlogging attack scenario. (b) Physical device
prototype for keylogging attack scenario.
4.3 Server
The server is needed to fulfil two main tasks: first, to receive data from
the different clients, organize them, and save them persistently;
secondly, and most importantly, to perform data analytics using the
previously saved data, namely keystroke inference from motion sensor
measurements.
The TCP socket and HTTP modules used to manage the data acquisition
process are implemented in Java, and the data analytics process is
implemented in Python to benefit from flexible data structures and
scientific computation tools.⁴ When the server receives the end-of-session
message from the relay device, the data are first sorted by timestamps
because they are not guaranteed to be ordered when received. The server
then saves them
⁴ The ANNs are implemented using modules available in the PyBrain
open-source library [7], with a C++ wrapper additionally used to speed up
the computations. Some experiments were also performed using Torch [21]
and the programming language Lua.
in one CSV file per sensor. Figure 4.4 presents the main components of the
data analytics pipeline and their connections. The raw data can initially
be pre-processed to mitigate the effect of noise and measurement
inaccuracies. Features are then extracted from the data in time windows
corresponding to the keystrokes' durations. During the training phase, the
classification model is trained with the extracted features by iteratively
evaluating the prediction outputs until a satisfying accuracy is reached
or a maximum number of iterations has occurred. At the end of this phase,
the trained model is serialized in XML and saved persistently. During the
logging phase, the classification model is deserialized from the file
system, and its inputs are activated with newly recorded data to attempt
to predict the labels.
5 Data Analytics
The first objective of this chapter is to detail how the data are recorded
and what methodologies have been used to clean the signal. As shown in
Figure 5.2, the raw signal is subject to noise and can, therefore, be
pre-processed before being used for data analysis purposes. On one hand,
the important amount of noise can potentially obfuscate patterns and
largely alter the classification accuracy. On the other hand, a deep
neural network architecture would, in theory, be able to handle such noisy
data. Pre-processing can thus be applied optionally depending on the
experiments.
Chapter 5. Data Analytics
(a) Gyroscope. (b) Accelerometer.
Sensor Recording: The data are acquired from the gyroscope and
accelerometer sensors built into the smartwatch. Sensor event data are
stored in tuples (t_i, x_i, y_i, z_i), i = 1...n, where t_i is the time in
ms at which the event occurred, x_i, y_i, z_i are the values along the
three axes x, y, and z, respectively, and n is the total number of motion
sensor events in an entire recording session. We observed that while the
sampling rate was not constant, the delay between sensor events varied
little enough for us to initially ignore the sampling rate problem during
pre-processing.
Label Recording: The training device reports labels in the form of tuples
(t_j, l_j), j = 1...m, where t_j is the time in ms at which the keystroke
happened, l_j is the label (i.e. the value of the entered key), and m is
the total number of keystrokes in the entire recording session.
5.2 Pre-processing
5.2.1 Calibration
Both motion sensors need to be calibrated to align all three axes. In
fact, the accelerometer axes contain values in different absolute ranges
because of the effect of gravity (as illustrated in Figure 5.2 (b)).
Although the gyroscope axes should average to zero, a small non-zero
difference was observed. Calibration is performed by subtracting from each
sensor value the mean of its axis, such that:
f(v_i) = v_i − v̄,  for v ∈ {X, Y, Z}    (5.1)
Where v_i is the amplitude value on the given axis and v̄ the mean of that
axis. The result of this operation can be seen in Figure 5.3.
(a) Gyroscope. (b) Accelerometer.
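Equation 5.1's per-axis mean subtraction can be sketched in a few lines; the example values below are hypothetical accelerometer readings offset by gravity:

```python
def calibrate(axis_values):
    """Equation 5.1: subtract the axis mean from every sample on that axis."""
    mean = sum(axis_values) / len(axis_values)
    return [v - mean for v in axis_values]

# Hypothetical accelerometer z-axis samples offset by gravity.
z_axis = [9.7, 9.9, 9.8, 9.8]
calibrated = calibrate(z_axis)  # now centered around zero
```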
5.2.2 Median Filter

The moving median filter is a first pre-processing step used to mitigate
the effect of noise in the data. The moving mean filter is a possible
alternative but has the disadvantage of attenuating the trends in the
data. The moving median removes the noise while preserving the signal
pattern and is applied with a sliding window computing the median value in
a fixed range [46], that is:
Where v is the amplitude value on the given axis and w is an odd number
representing the size of the sliding window. Since the sensors' sampling
frequencies are different, the number of data-points at the end of a
recording session differs. Hence, the sliding window size has to be
different for each sensor to remove noise optimally while preserving the
signal as much as possible. Experiments show that w = 9 and w = 5 provide
satisfying filtering results for the gyroscope and the accelerometer,
respectively. As shown in Figure 5.4, the operation helps to remove noise,
but the signals need to be further processed to become smoother.
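A sliding-window median filter consistent with this description can be sketched as follows; the shrinking-window behavior at the signal edges is an assumption, as the edge handling is not specified here:

```python
from statistics import median

def moving_median(signal, w):
    """Sliding-window median filter; w must be odd (w = 9 for the
    gyroscope, w = 5 for the accelerometer in this chapter)."""
    half = w // 2
    out = []
    for i in range(len(signal)):
        lo, hi = max(0, i - half), min(len(signal), i + half + 1)
        out.append(median(signal[lo:hi]))  # window shrinks at the edges
    return out

noisy = [0.0, 0.1, 5.0, 0.2, 0.1, 0.0, 0.2]  # one noise spike at index 2
smooth = moving_median(noisy, 5)
```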
(a) Gyroscope. (b) Accelerometer.
5.2.3 Butterworth Filter

The Android API makes it possible to define a delay at which the events
are received. The documentation, however, clearly stipulates that the
specified delay is just a hint to the system and that events can be
received faster or slower than the specified rate. In our context, knowing
the maximum delay between sensor events in the worst-case scenario allows
us to optimize the pre-processing algorithm to clean the signal
appropriately. The maximum delay for the gyroscope and the accelerometer
was found to be 10 000 µs and 62 500 µs, respectively. Elementary physics
tells us that the sampling frequency can be computed from the sampling
delay d (in µs) such that:

f = 1 / (d · 10⁻⁶)    (5.3)
For the gyroscope, we only want to keep signals with frequencies lower
than a certain cutoff frequency and attenuate signals with higher
frequencies, hence the use of a low-pass filter. On the contrary, for the
accelerometer, we want to attenuate signals with frequencies lower than
the cutoff frequency, thus the need for a high-pass filter [14]. Using the
frequencies previously calculated, the filters can be applied with Nyquist
frequencies set to 50 Hz and 8 Hz for the gyroscope and the accelerometer,
respectively. The resulting signals can be seen in Figure 5.5.
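Such Butterworth filtering could be sketched with SciPy as below (not the thesis's implementation); the filter order, the 8 Hz cutoff, and the zero-phase `filtfilt` application are assumptions:

```python
import numpy as np
from scipy.signal import butter, filtfilt

def butterworth(signal, cutoff_hz, fs_hz, kind, order=3):
    """Zero-phase Butterworth filtering; the cutoff is normalized by
    the Nyquist frequency fs/2, and kind is 'low' or 'high'."""
    b, a = butter(order, cutoff_hz / (fs_hz / 2.0), btype=kind)
    return filtfilt(b, a, signal)

fs = 100.0                                 # gyroscope: 1 / (10000e-6) = 100 Hz
t = np.arange(0, 1, 1 / fs)
slow = np.sin(2 * np.pi * 2 * t)           # 2 Hz component to keep
noise = 0.3 * np.sin(2 * np.pi * 40 * t)   # 40 Hz component to remove
lowpassed = butterworth(slow + noise, cutoff_hz=8.0, fs_hz=fs, kind='low')
```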
The Kalman filter algorithm produces estimates that minimize the Mean Squared
Error by observing a series of measurements. Even with data containing
statistical noise, the filter can produce estimates that let patterns emerge
more distinctly from the signal [32]. This advanced filtering technique
proved useful in our application context by smoothing the signal evenly
and attenuating irregular peaks and pits (as shown in Figure 5.6).
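A minimal one-dimensional Kalman filter with a random-walk state model illustrates the idea (the process-noise and measurement-noise variances q and r are illustrative values, not the thesis' tuned parameters):

```python
import numpy as np

def kalman_smooth(measurements, q=1e-4, r=0.05):
    """Minimal 1-D Kalman filter assuming the state follows a random walk.

    q: process-noise variance, r: measurement-noise variance.
    """
    x, p = measurements[0], 1.0   # state estimate and its variance
    estimates = []
    for z in measurements:
        p = p + q                 # predict: variance grows with the random walk
        k = p / (p + r)           # Kalman gain
        x = x + k * (z - x)       # update the estimate with the new measurement
        p = (1.0 - k) * p         # shrink the variance after the update
        estimates.append(x)
    return np.array(estimates)

noisy = 1.0 + 0.1 * np.random.default_rng(0).standard_normal(200)
smooth = kalman_smooth(noisy)
```

With a small q the filter trusts its own prediction more than each new measurement, which is what evens out the irregular peaks and pits.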
Figure 5.6: Signals ready for feature extraction after the last step of the
pre-processing pipeline.
As mentioned in Section 2.3.2, studies have shown that sensor fusion can
increase the robustness of motion sensor output classification. Sensors can
be subject to inaccurate measurements while recording; merging their outputs
with those of other sensors can minimize uncertainty and provide more
accurate measurements. However, the accelerometer is recorded with a sampling
frequency significantly lower than the gyroscope1, making sensor fusion hard
using the recorded data as such. Moreover, data-points need to be evenly
distributed to allow trends to emerge from the data. The sampling rate can be
made constant by distributing the data-points evenly over the complete
recording session duration. The implemented algorithm is described as follows:
k = \frac{1}{\alpha}(t_n - t_0) + 1 \qquad (5.4)
Where t is the timestamp at which a motion sensor event has been
recorded, n is the total number of events in an entire recording session,
and α is a constant integer referring to the target sampling interval, which
was defined to be 2 ms. Then populate T' such that:

t'_i = t_0 + \alpha i, \qquad t'_i \in T' \qquad (5.5)
1 By a factor of 6.25 according to the sensors' maximum delays measured in the data
acquisition step. This factor was confirmed experimentally by dividing the total number
of data-points recorded for the gyroscope by the total number of data-points recorded for
the accelerometer during a recording session.
f(v_i) = v_{i-1} + \frac{v_j - v_{i-1}}{j - (i - 1)}, \qquad v \in \{X', Y', Z'\} \qquad (5.6)
5. Finally, since the known values have now been used to compute the
missing data-points, the last step consists of keeping only the tuples
where t'_i ∈ T', the previously generated target set with constant time
intervals.
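The resampling steps above (Equations 5.4 to 5.6) can be sketched with NumPy's linear interpolation, which plays the role of Equation 5.6 (a sketch, not the thesis' exact step-by-step algorithm):

```python
import numpy as np

def resample_uniform(timestamps, values, alpha_ms=2.0):
    """Resample an unevenly sampled sensor axis onto a uniform alpha_ms grid."""
    t = np.asarray(timestamps, dtype=float)
    k = int((t[-1] - t[0]) / alpha_ms) + 1        # Equation 5.4
    t_target = t[0] + alpha_ms * np.arange(k)     # Equation 5.5: the set T'
    # Linear interpolation of the known values onto the target grid.
    return t_target, np.interp(t_target, t, values)

t_raw = [0.0, 3.0, 7.0, 12.0]   # uneven timestamps in ms
v_raw = [0.0, 3.0, 7.0, 12.0]   # a linear signal, for illustration
t_new, v_new = resample_uniform(t_raw, v_raw)
```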
Figure 5.8: Gyroscope and accelerometer mean signal aligned with time fitting.
The sensor fusion algorithm returns a vector for each time frame ti and
allows multiple combinations of axes for different classification
experiments. For example, ⟨gxi, gyi, gzi, ai⟩ is a possible four-dimensional
vector returned by the fusion algorithm, with x, y, and z denoting values
along the different axes of the accelerometer a and the gyroscope g. Mean
values gi and ai are simply the average of the three axis values (i.e.
gi = (1/3)(gxi + gyi + gzi)).
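Building such a fusion vector can be sketched as follows, here for the ⟨gxi, gyi, gzi, ai⟩ combination from the example above (other axis combinations are assembled the same way):

```python
import numpy as np

def fuse(gx, gy, gz, ax, ay, az):
    """Stack per-timestep vectors <gxi, gyi, gzi, ai>, where ai is the
    accelerometer mean over its three axes."""
    a_mean = (np.asarray(ax) + np.asarray(ay) + np.asarray(az)) / 3.0
    # One row per time frame, one column per selected feature.
    return np.column_stack([gx, gy, gz, a_mean])

v = fuse([1.0, 2.0], [3.0, 4.0], [5.0, 6.0],
         [0.0, 3.0], [0.0, 3.0], [0.0, 3.0])
```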
5.3.2 Segmentation
Where α is half the size of the sampling window. Considering that Asonov
et al. [2] determined the duration of a key-press to be approximately 100 ms,
and knowing that our target sampling interval was defined to be 2 ms (see
Section 5.3.1), we defined a sampling window of 50 data-points. Thus α = 25
in our implementation.
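Extracting a window of 2α data-points around a keystroke can be sketched as:

```python
import numpy as np

def extract_segment(signal, index, alpha=25):
    """Return the 2*alpha data-points centered on a keystroke index.

    With a 2 ms sampling interval and a ~100 ms key-press, alpha = 25
    yields 50-point segments.
    """
    signal = np.asarray(signal)
    start = max(index - alpha, 0)   # clamp at the start of the recording
    return signal[start:index + alpha]

segment = extract_segment(np.arange(1000), index=500)
```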
1. First of all, we observed that the gyroscope's signal peaks were better
aligned with the actual keystroke timestamps than the accelerometer's.
Considering that the sensor data are three-dimensional, the signals on
the three axes are merged by simply calculating the gyroscope's mean
signal g such that:

g_i = \frac{1}{3}(g_{x_i} + g_{y_i} + g_{z_i}) \qquad (5.8)
2. Secondly, the mean signal can now be used to compute the Peak-to-
Average Power Ratio, defined as the square Crest Factor as follows:

f(v_i) = \frac{v_i^2}{r(g)} \qquad (5.9)

Where v_i is the amplitude of the mean signal g at the index position i
and r(g) returns the Root Mean Square of the signal such that:

r(g) = \sqrt{\frac{1}{n}\sum_{i=1}^{n} g_i^2} \qquad (5.10)
peak(ri ) → (ri > ri−1 ) ∧ (ri > ri+1 ) ∧ (ri > α) (5.11)
Figure 5.10: Example of peaks detected (shown as green circles) from the
gyroscope Peak-to-Average Power Ratios with actual keystroke positions
(shown as vertical dashed lines).
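The mean signal, Peak-to-Average Power Ratio, and peak detection rule (Equations 5.8 to 5.11) can be sketched together as follows; the threshold value passed to the detector is illustrative:

```python
import numpy as np

def mean_signal(gx, gy, gz):
    """Equation 5.8: average the three gyroscope axes."""
    return (np.asarray(gx) + np.asarray(gy) + np.asarray(gz)) / 3.0

def papr(g):
    """Equations 5.9-5.10: squared amplitude over the signal's RMS."""
    rms = np.sqrt(np.mean(g ** 2))
    return g ** 2 / rms

def detect_peaks(r, alpha):
    """Equation 5.11: local maxima above the threshold alpha."""
    return [i for i in range(1, len(r) - 1)
            if r[i] > r[i - 1] and r[i] > r[i + 1] and r[i] > alpha]

g = mean_signal([0.0, 1.0, 0.0], [0.0, 1.0, 0.0], [0.0, 1.0, 0.0])
peaks = detect_peaks(papr(g), alpha=1.0)
```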
With RMS the Root Mean Square, ρ the number of peaks in the signal
detected using the approach described in Section 5.3.2.2, λ the Crest Factor
computed from Equation 5.9, σ the skewness, κ the kurtosis, and γ the
variance. This statistical vector is computed for all three axes of both
sensors, returning a statistical feature vector of length 48.
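Six of the named per-axis statistics can be computed as follows. This is a partial sketch: the full vector has eight features per axis (48 over six axes), and the remaining two are not listed in this excerpt, so only the named ones are shown:

```python
import numpy as np
from scipy.stats import skew, kurtosis

def axis_features(v, peak_count):
    """Per-axis statistics: RMS, peak count, Crest Factor, skewness,
    kurtosis, and variance (a subset of the thesis' feature set)."""
    v = np.asarray(v, dtype=float)
    rms = np.sqrt(np.mean(v ** 2))
    crest = np.max(np.abs(v)) / rms   # peak amplitude over RMS
    return [rms, peak_count, crest, skew(v), kurtosis(v), np.var(v)]

features = axis_features([0.0, 1.0, -1.0, 0.5, -0.5, 2.0], peak_count=2)
```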
The classifier takes accelerometer and gyroscope data as input and outputs
classes corresponding to keystrokes. A typical input vector consists of values
normalized in the range [−1, 1], and the output is a binary vector of the same
length as the number of labels. That is, each label is assigned a binary
representation stored in a look-up table. Different multi-class classification
models were implemented to compare their respective efficiency in processing
the data. Each model is trained with supervised learning on a dataset
containing labeled data-points. An online approach is used to update the
weights each time a training example is shown to the network. The weights are
updated using an improved variant of the Backpropagation iterative gradient
descent termed Rprop- [69, 37]. For result reproducibility,
E = \frac{1}{n}\sum_{i=1}^{n}(T_i - y_i)^2 \qquad (5.13)
With n the number of neurons in the output layer, T the target expected
output, and y the predicted output.
P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN} \qquad (5.14)
both precision and recall to provide an overall performance score such that:

F = (1 + \beta^2)\,\frac{P \cdot R}{\beta^2 \cdot P + R} \qquad (5.15)

R = 1 - \frac{S}{\log n} \qquad (5.16)

S = -\sum_{i=1}^{n} y_i \log y_i \qquad (5.17)
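The evaluation metrics of Equations 5.14 to 5.17 can be sketched as:

```python
import numpy as np

def precision_recall(tp, fp, fn):
    """Equation 5.14: precision and recall from confusion counts."""
    return tp / (tp + fp), tp / (tp + fn)

def f_score(p, r, beta=1.0):
    """Equation 5.15: the F-measure; beta = 1 gives the harmonic mean."""
    return (1 + beta ** 2) * (p * r) / (beta ** 2 * p + r)

def reliability(y):
    """Equations 5.16-5.17: one minus the normalized entropy of the
    prediction vector y. Confident (one-hot-like) outputs score near 1."""
    y = np.asarray(y, dtype=float)
    nz = y[y > 0]                     # skip zero terms to avoid log(0)
    s = -np.sum(nz * np.log(nz))      # Shannon entropy S
    return 1.0 - s / np.log(len(y))   # normalized by log n
```

For example, a classifier with 8 true positives, 2 false positives, and 2 false negatives has P = R = 0.8 and F1 = 0.8, while a uniform prediction over 4 labels has reliability 0.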
tion. However, the main downside of this method, termed holdout, is that the
data used for evaluation are never used to train the classifier and vice versa,
leading to a less general performance assessment. Different techniques
have been developed to address this issue. A popular solution termed K-Fold
Cross-Validation is used in this project to assess the quality of the different
classification models. This method first consists of shuffling the dataset and
splitting it into k partitions (i.e. folds) approximately equal in size. The
classifier is then trained with k − 1 partitions and evaluated on the remaining
one. This process is repeated k times by selecting a different training
set and evaluation set such that every partition is used exactly once for
evaluation and k − 1 times for training. The evaluation results are finally
averaged to represent the global performance of the classifier. All data are
thus used for both training and evaluation to provide a more general and
accurate performance assessment [48].
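The K-Fold Cross-Validation procedure described above can be sketched as:

```python
import numpy as np

def k_fold_indices(n_samples, k=5, seed=0):
    """Shuffle indices, split them into k roughly equal folds, and yield
    (train, eval) index pairs so every fold is evaluated exactly once."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(n_samples)
    folds = np.array_split(indices, k)
    for i in range(k):
        train = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train, folds[i]

splits = list(k_fold_indices(10, k=5))
```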
Experiments were performed to select the sensor fusion strategy yielding the
best classification results. Observations showed that signals resulting from
motions occurring while typing on keys in opposite corners (i.e. 1, 3, ∗, #)
were distinguishable enough to be recognized with the naked eye by simply
looking at the resulting signal patterns (example in Figure 5.9). Thus,
motions from keystrokes on these specific keys were recorded to construct a
toy dataset, based on the assumption that a good classifier model should be
able to do as well as the human eye on these simple patterns. The toy dataset
contains a total of 120 keystrokes targeting 4 labels, resulting in 30
instances per key.
Table 5.1: Fusion strategy benchmark results (average values for 100 training
Epochs).
To assess the quality of the different fusion strategies, the vectors were
used to train an RNN with one hidden LSTM layer of 9 units (as illustrated
in Figure 5.11) for 100 Epochs using the toy dataset. That is, 100 passes
through the entire training dataset. The number of units in the linear input
layer depends on the length of the feature vector returned by the fusion
algorithm. The output layer is a standard linear layer and the evaluation
is performed with the same dataset used for training. Although this would
be a bad approach for evaluating the performance of the classifier itself,
the goal here is only to assess the quality of the features returned from the
fusion algorithm.
Now that a fusion strategy has been chosen to improve the quality of the
predictions, it is possible to compare different types of models. Each ANN
is built following the architecture template illustrated in Figure 5.12 and
trained for 100 Epochs on the same toy dataset used in Section 5.4.2. A
linear layer is first used to forward the input vector to the internal
structure of the network. Each layer is then fully connected to the next layer
in the network. Since classification is performed, a Softmax layer is finally
used as output (see Table 2.2 for the mathematical definition of Softmax). The
compared models differ only in the type of hidden layer employed and the type
of features they are trained on. The hidden layers each contain 128 hidden
units. The following results are measured using K-Fold Cross-Validation
with k = 5. The results are therefore averaged means and averaged standard
deviations. Confusion matrices generated during this benchmark can be seen
in Appendix C.
Multilayer FNN: The standard FNN is one of the simplest ANN architectures
and is interesting for comparing its performance with more advanced
ANNs such as networks with recurrent architectures. Layers with different
activation functions are compared to select an appropriate FNN architecture
for the problem at hand. Standard Rprop- is used for training. As presented
in Table 5.2 and Figure 5.13, FNNs with Sigmoid hidden layers can process
statistical features significantly more efficiently than Tanh layers. However,
when the network is trained with data segments directly, Tanh layers are
able to make more reliable predictions with a smaller standard deviation.
Ref.  Hidden Layer   Features     F1 Score (Mean / Std. Dev)   Reliability (Mean / Std. Dev)
A     Sigmoid        Statistical  0.816 / 0.056                0.999 / 0.0014
B     Tanh           Statistical  0.762 / 0.167                0.979 / 0.0192
C     Sigmoid        Segment      0.908 / 0.061                0.972 / 0.0134
D     Tanh           Segment      0.866 / 0.016                0.982 / 0.0045
E     LSTM           Segment      0.891 / 0.042                0.924 / 0.0247
F     LSTM peephole  Segment      0.866 / 0.055                0.924 / 0.0315

Table 5.2: Hidden layer benchmark (see Figure 5.13 for graphical representation).
together with the first data-point of the sequence. During evaluation, the
predictions from an FNN are returned as generated by the neural network.
However, because LSTM units first need to be activated to initialize their
memory cell internal structure, the outputs generated for the whole sequence
contribute to the final prediction result. The output vectors are simply
added together and normalized before being returned. Two different LSTM
implementations were studied: a standard recurrent LSTM consisting of a
forget gate, and an LSTM with peephole connections. Table 5.2 and Figure
5.13 show that while the LSTM with peepholes is able to make remarkably
good predictions in some cases, the standard deviation remains higher than
when a standard recurrent LSTM layer is used, making the peephole
implementation less robust in this specific application context.
Figure 5.13: Hidden layer benchmark (see Table 5.2 for references).
6 Evaluation
This chapter's first goal is to describe the setup of the different experiments
used to collect empirical data from various people; secondly, to describe the
different results returned by the system; and finally, to interpret the results
and their relation to the research questions stated in Section 1.1.
Seven people not involved in this research, aged between 23 and 30,
participated in the following experiments. Each person was asked to enter
multiple series of keystrokes on a touchscreen and a keypad while wearing
a WAD on their wrist. To prevent the influence of external motions, the
participants were required to sit in a comfortable position allowing them
to stay still for the entire duration of the recording session. Each dataset
contains 240 keystrokes with 20 instances of each of the 12 labels (i.e. 1, 2,
3, 4, 5, 6, 7, 8, 9, 0, ∗, #). If a classifier makes its predictions at random,
the probability of correctly classifying a keystroke K is P(K) = 1/12.
6.2 Experiments
(a) FNN-Sigmoid. (b) FNN-Tanh. (c) RNN-LSTM.
Figure 6.2: F1 Score for the different ANN architectures during the touchlogging experiment.
Data Prep.   F1 Score (Mean / Std. Dev)   Reliability (Mean / Std. Dev)
P-T          0.672 / 0.0193               0.982 / 0.0029
P-H          0.720 / 0.0339               0.969 / 0.0094
R-T          0.614 / 0.0029               0.980 / 0.0022
R-H          0.570 / 0.0029               0.982 / 0.0024

Data Prep.   F1 Score (Mean / Std. Dev)   Reliability (Mean / Std. Dev)
P-T          0.685 / 0.0106               0.948 / 0.0019
P-H          0.672 / 0.0415               0.912 / 0.0080
R-T          0.635 / 0.0029               0.840 / 0.0021
R-H          0.625 / 0.0183               0.864 / 0.0170
Data Prep.   F1 Score (Mean / Std. Dev)   Reliability (Mean / Std. Dev)
P-T          0.697 / 0.0128               0.902 / 0.0097
P-H          0.737 / 0.0088               0.935 / 0.0057
R-T          0.647 / 0.0460               0.808 / 0.0118
R-H          0.660 / 0.0147               0.822 / 0.0074
Figure 6.3: F1 Score for the different ANN architectures on the keylogging experiment.
Despite the fact that the FNN-Sigmoid is convinced by its predictions, the
quality of its decisions is far from satisfying. In fact, Table 6.4 shows that
the average reliability score is high with a small standard deviation even
though the standard deviation of the F1 Score is very large. The prediction
scores remain close to or below average.
Data Prep.   F1 Score (Mean / Std. Dev)   Reliability (Mean / Std. Dev)
P-T          0.515 / 0.1078               0.973 / 0.0153
P-H          0.518 / 0.1745               0.973 / 0.0088
R-T          0.454 / 0.1417               0.964 / 0.0149
R-H          0.415 / 0.1868               0.969 / 0.0101
for raw data classification are indeed below average. The FNN-Tanh
reliability fluctuates with a large standard deviation, showing that the model
has difficulty generating strong predictions. This ANN architecture clearly
struggles to achieve Unsupervised Feature Learning on the keypad dataset.
Data Prep.   F1 Score (Mean / Std. Dev)   Reliability (Mean / Std. Dev)
P-T          0.524 / 0.1493               0.913 / 0.0276
P-H          0.523 / 0.1594               0.908 / 0.0244
R-T          0.425 / 0.1234               0.775 / 0.0179
R-H          0.368 / 0.1468               0.759 / 0.0275
Data Prep.   F1 Score (Mean / Std. Dev)   Reliability (Mean / Std. Dev)
P-T          0.556 / 0.0110               0.871 / 0.0201
P-H          0.554 / 0.0200               0.858 / 0.0184
R-T          0.541 / 0.0110               0.758 / 0.0021
R-H          0.593 / 0.0310               0.776 / 0.0225
For this experiment, the classifier is trained for 200 Epochs with the complete
dataset recorded during Experiment 1, where users enter keys on a
smartphone touchscreen. The logging phase is later performed using the
full measurement collection recorded for Experiment 2, with users typing on
an ATM-like keypad. It is worth noting that the classifier is trained and
later evaluated with features generated by the same data scheme. That is,
when the model is trained with pre-processed data with timestamp-based
segmentation (i.e. P-T), the same data scheme is applied to the evaluation
dataset. This experiment is conducted using the RNN-LSTM model
because it is the model yielding the best results in both previous experiments.
The results presented in Table 6.7 are calculated at once when the
evaluation dataset is shown to the classifier. Although the returned
predictions are far from excellent, the RNN-LSTM is still able to recognize
patterns from unknown signals recorded when users are typing on a different
keyboard than the one used for training. In this application context, it is
worth noting that the classifier performs better when it is both trained and
evaluated with pre-processed data. Similarly to the results presented in
Figure 6.2 and Table
6.3, the RNN-LSTM has trouble learning from raw data with timestamp-based
segmentation (i.e. R-T).1
Table 6.7: Results from RNN-LSTM trained for touchlogging and evaluated
for keylogging with data segments used as features.
6.3 Discussions
ms. Even though these time values are expected to match actual keystrokes
when aligned with the signal (as shown in Figure 5.9), it is likely that small
time-measurement inaccuracies can lead to worse classification results.
Second, the heuristic segmentation works by measuring physical properties of
the signal (as explained in Section 5.3.2.2). With this in mind and given the
experiment results, it is reasonable to assume that these physical properties
are consistent across keystrokes, making this a robust method for signal
segmentation.
7 Conclusion
The purpose of this chapter is mainly to summarize and reflect on our
findings. Conceivable future work to extend this project is also considered.
7.1 Summary
The system developed in this work can perform touchlogging and keylogging
with an accuracy of 73% and 59%, respectively. Although these results are
lower than the ones claimed in related works, our classifier can perform
equally successfully when confronted with raw unprocessed data. This
demonstrates that deep neural networks can make keystroke inference attacks
based on motion sensors easier to achieve by removing the need for
non-trivial pre-processing pipelines and carefully engineered feature
extraction strategies, on which all related works rely heavily, as presented
in Chapter 2.
To minimize the risk of such attacks, users should always wear their WAD
on the hand they use less for device interaction. For example, a
right-handed person should wear the WAD on the left arm. Because of the
demonstrated risks, the different operating systems powering wearable
technologies should require user permission before any application is allowed
to use the accelerometer and the gyroscope. Furthermore, a permission system
should restrict or allow access to the motion sensors in specific contexts or
for trusted applications only.
A user's movements can create significant motion interference that can
potentially obfuscate keystroke signal patterns. Experiments in such
conditions could be conducted to compare the results and the likelihood of
such an attack in these contexts (e.g. a controlled environment where the
user is sitting, an uncontrolled environment where she is walking).
The motion of a WAD could also potentially be used for the identification
and tracking of users, as studied in similar research [64, 38].
Some WAD models come with a wide range of built-in sensors, including
Galvanic Skin Response (GSR), heart rate, Electromyography (EMG), or ambient
light sensors. Fusing motion sensors with one or many of these additional
sensors might further improve the accuracy and the robustness of the
keystroke predictions.
Bibliography
[1] Ahmed Al-Haiqi, Mahamod Ismail, and Rosdiadee Nordin. On the best
sensor for keystrokes inference attack on android. Procedia Technology,
11:989–995, 2013.
[3] Adam J Aviv, Benjamin Sapp, Matt Blaze, and Jonathan M Smith.
Practicality of accelerometer side channels on smartphones. In Proceed-
ings of the 28th Annual Computer Security Applications Conference,
pages 41–50. ACM, 2012.
[4] Dirk Balfanz, Diana K Smetters, Paul Stewart, and H Chi Wong. Talk-
ing to strangers: Authentication in ad-hoc wireless networks. In NDSS,
2002.
[5] Ling Bao and Stephen S Intille. Activity recognition from user-annotated
acceleration data. In Pervasive computing, pages 1–17. Springer, 2004.
[6] Andrea Barisani and Daniele Bianco. Sniffing keystrokes with
lasers/voltmeters. Proceedings of Black Hat USA, 2009.
[7] Justin Bayer, Tom Schaul, and Thomas Rückstieß. Pybrain: Python-
based reinforcement learning, artificial intelligence and neural network
library. https://github.com/pybrain/pybrain, Online, accessed 04-
11-2015.
[10] Yoshua Bengio, Patrice Simard, and Paolo Frasconi. Learning long-term
dependencies with gradient descent is difficult. Neural Networks, IEEE
Transactions on, 5(2):157–166, 1994.
[11] Yigael Berger, Avishai Wool, and Arie Yeredor. Dictionary attacks using
keyboard acoustic emanations. In Proceedings of the 13th ACM confer-
ence on Computer and communications security, pages 245–254. ACM,
2006.
[16] Liang Cai and Hao Chen. Touchlogger: Inferring keystrokes on touch
screen from smartphone motion. In HotSec, 2011.
[17] Liang Cai and Hao Chen. On the practicality of motion based keystroke
inference attack. Springer, 2012.
[18] Claude Castelluccia and Pars Mutaf. Shake them up!: a movement-
based pairing protocol for cpu-constrained devices. In Proceedings of
the 3rd international conference on Mobile systems, applications, and
services, pages 51–64. ACM, 2005.
[19] Shuo Chen, Rui Wang, XiaoFeng Wang, and Kehuan Zhang. Side-
channel leaks in web applications: A reality today, a challenge tomor-
row. In Security and Privacy (SP), 2010 IEEE Symposium on, pages
191–206. IEEE, 2010.
[22] W3C (World Wide Web Consortium). HTTP - hypertext transfer proto-
col. http://www.w3.org/Protocols/, Online, accessed 30-09-2015.
[25] Denis Foo Kune and Yongdae Kim. Timing attacks on pin input de-
vices. In Proceedings of the 17th ACM conference on Computer and
communications security, pages 678–680. ACM, 2010.
[26] Felix Gers, Jürgen Schmidhuber, et al. Recurrent nets that time and
count. In Neural Networks, 2000. IJCNN 2000, Proceedings of the IEEE-
INNS-ENNS International Joint Conference on, volume 3, pages 189–
194. IEEE, 2000.
[27] Felix A Gers, Jürgen Schmidhuber, and Fred Cummins. Learning to for-
get: Continual prediction with lstm. Neural computation, 12(10):2451–
2471, 2000.
[29] Michael T Goodrich, Michael Sirivianos, John Solis, Gene Tsudik, and
Ersin Uzun. Loud and clear: Human-verifiable authentication based
on audio. In Distributed Computing Systems, 2006. ICDCS 2006. 26th
IEEE International Conference on, pages 10–10. IEEE, 2006.
[30] Klaus Greff, Rupesh Kumar Srivastava, Jan Koutnı́k, Bas R Steune-
brink, and Jürgen Schmidhuber. LSTM: A search space odyssey. arXiv
preprint arXiv:1503.04069, 2015.
[31] Jiawei Han, Micheline Kamber, and Jian Pei. Data mining: concepts
and techniques: concepts and techniques. Elsevier, 2011.
[32] Simon Haykin. Kalman filtering and neural networks, volume 47. John
Wiley & Sons, 2004.
[35] Michael Hüsken and Peter Stagge. Recurrent neural networks for time
series classification. Neurocomputing, 50:223–235, 2003.
[36] IDSIA. Brainstorm: Fast, flexible and fun neural networks. https:
//github.com/IDSIA/brainstorm, Online, accessed 04-11-2015.
[37] Christian Igel and Michael Hüsken. Empirical evaluation of the improved
rprop learning algorithms. Neurocomputing, 50:105–123, 2003.
[48] Ron Kohavi et al. A study of cross-validation and bootstrap for accuracy
estimation and model selection. In Ijcai, volume 14, pages 1137–1145,
1995.
[49] Ron Kohavi and George H John. Wrappers for feature subset selection.
Artificial intelligence, 97(1):273–324, 1997.
[52] Martin Längkvist, Lars Karlsson, and Amy Loutfi. A review of un-
supervised feature learning and deep learning for time-series modeling.
Pattern Recognition Letters, 42:11–24, 2014.
[53] Yann LeCun and Yoshua Bengio. Convolutional networks for images,
speech, and time series. The handbook of brain theory and neural net-
works, 3361(10), 1995.
[54] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Na-
ture, 521(7553):436–444, 2015.
[55] Jiayang Liu, Lin Zhong, Jehan Wickramasuriya, and Venu Vasudevan.
uwave: Accelerometer-based personalized gesture recognition and its
applications. Pervasive and Mobile Computing, 5(6):657–675, 2009.
[56] Xiangyu Liu, Zhe Zhou, Wenrui Diao, Zhou Li, and Kehuan Zhang.
When good becomes evil: Keystroke inference with smartwatch. In
Proceedings of the 22nd ACM SIGSAC Conference on Computer and
Communications Security, pages 1273–1285. ACM, 2015.
[58] Paul Lukowicz, Holger Junker, Mathias Stäger, Thomas von Bueren,
and Gerhard Tröster. Wearnet: A distributed multi-sensor system for
context aware wearables. In UbiComp 2002: Ubiquitous Computing,
pages 361–370. Springer, 2002.
[59] Anindya Maiti, Murtuza Jadliwala, Jibo He, and Igor Bilogrevic.
(smart) watch your taps: side-channel keystroke inference attacks using
smartwatches. In Proceedings of the 2015 ACM International Sympo-
sium on Wearable Computers, pages 27–30. ACM, 2015.
[61] Philip Marquardt, Arunabh Verma, Henry Carter, and Patrick Traynor.
(sp) iphone: decoding vibrations from nearby keyboards using mobile
phone accelerometers. In Proceedings of the 18th ACM conference on
Computer and communications security, pages 551–562. ACM, 2011.
[65] JSON org. Ecma-404 the json data interchange standard. http://www.
json.org/, Online, accessed 30-09-2015.
[66] Emmanuel Owusu, Jun Han, Sauvik Das, Adrian Perrig, and Joy Zhang.
Accessory: password inference using accelerometers on smartphones. In
Proceedings of the Twelfth Workshop on Mobile Computing Systems &
Applications, page 9. ACM, 2012.
[69] Martin Riedmiller and Heinrich Braun. A direct adaptive method for
faster backpropagation learning: The rprop algorithm. In Neural Net-
works, 1993., IEEE International Conference on, pages 586–591. IEEE,
1993.
[71] Roman Schlegel, Kehuan Zhang, Xiao-yong Zhou, Mehool Intwala, Apu
Kapadia, and XiaoFeng Wang. Soundcomber: A stealthy and context-
aware sound trojan for smartphones. In NDSS, volume 11, pages 17–33,
2011.
[72] Thomas Schlömer, Benjamin Poppinga, Niels Henze, and Susanne Boll.
Gesture recognition with a wii controller. In Proceedings of the 2nd
international conference on Tangible and embedded interaction, pages
11–14. ACM, 2008.
[75] Dawn Xiaodong Song, David Wagner, and Xuqing Tian. Timing analysis
of keystrokes and timing attacks on ssh. In USENIX Security Sympo-
sium, volume 2001, 2001.
[77] Ajay Kumar Tanwani, Jamal Afridi, M Zubair Shafiq, and Muddassar
Farooq. Guidelines to select machine learning scheme for classification of
biomedical datasets. In Evolutionary Computation, Machine Learning
and Data Mining in Bioinformatics, pages 128–139. Springer, 2009.
[79] Berkeley Vision and Learning Center. Caffe: a fast open framework
for deep learning. https://github.com/BVLC/caffe, Online, accessed
04-11-2015.
[81] He Wang, Ted Tsung-Te Lai, and Romit Roy Choudhury. Mole: Motion
leaks through smartwatch sensors. In Proceedings of the 21st Annual
International Conference on Mobile Computing and Networking, pages
155–166. ACM, 2015.
[82] Paul J Werbos. Backpropagation through time: what it does and how
to do it. Proceedings of the IEEE, 78(10):1550–1560, 1990.
[83] Jiahui Wu, Gang Pan, Daqing Zhang, Guande Qi, and Shijian Li. Ges-
ture recognition with a 3-d accelerometer. In Ubiquitous intelligence and
computing, pages 25–38. Springer, 2009.
[84] Zhi Xu, Kun Bai, and Sencun Zhu. Taplogger: Inferring user inputs on
smartphone touchscreens using on-board motion sensors. In Proceedings
of the fifth ACM conference on Security and Privacy in Wireless and
Mobile Networks, pages 113–124. ACM, 2012.
[85] Li Zhuang, Feng Zhou, and J Doug Tygar. Keyboard acoustic emana-
tions revisited. ACM Transactions on Information and System Security
(TISSEC), 13(1):3, 2009.
Appendices
A Backpropagation
This appendix provides additional details about the Backpropagation
algorithm [31, 12, 70, 33] to support Section 2.2. The total network error E
is computed from a loss function, such as the Mean Squared Error formalized
in Equation 5.13. According to the chain rule, the gradient can be expressed
as:

\frac{\partial E}{\partial W_{ij}} = \frac{\partial E}{\partial y_i} \frac{\partial y_i}{\partial x_i} \frac{\partial x_i}{\partial W_{ij}} \qquad (A.1)
First, the partial derivative with respect to Wij can be computed from
Equation 2.1 as follows:

\frac{\partial x_i}{\partial W_{ij}} = y_j \qquad (A.2)

\frac{\partial y_i}{\partial x_i} = \frac{\partial \phi(x_i)}{\partial x_i} \qquad (A.3)
The term ∂E/∂y_i depends on whether neuron i belongs to the output layer:

\frac{\partial E}{\partial y_i} =
\begin{cases}
(T_i - y_i) & \text{if } i \in \text{output layer},\\
\sum_{j=1}^{n} W_{ij} \dfrac{\partial E}{\partial y_j} & \text{otherwise;}
\end{cases} \qquad (A.4)

Combining Equations A.3 and A.4 yields the error term e_i:

e_i = \frac{\partial E}{\partial y_i} \frac{\partial y_i}{\partial x_i} =
\begin{cases}
\dfrac{\partial \phi(x_i)}{\partial x_i} (T_i - y_i) & \text{if } i \in \text{output layer},\\
\dfrac{\partial \phi(x_i)}{\partial x_i} \sum_{j=1}^{n} W_{ij}\, e_j & \text{otherwise;}
\end{cases} \qquad (A.5)

\frac{\partial E}{\partial W_{ij}} = \frac{\partial E}{\partial y_i} \frac{\partial y_i}{\partial x_i} \frac{\partial x_i}{\partial W_{ij}} = e_i\, y_j \qquad (A.6)

W_{ij} = W_{ij} - \eta\, e_i\, y_j \qquad (A.7)
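The derivation can be illustrated numerically for a single sigmoid output neuron. The weights, inputs, target, and learning rate below are illustrative; note that for the subtraction in Equation A.7 to descend the error surface, the error term here carries the sign of (y_i − T_i):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# One output neuron fed by two inputs y_j (illustrative values).
eta = 0.5                        # learning rate
W = np.array([0.3, -0.2])        # weights W_ij
y_in = np.array([1.0, 0.5])      # activations y_j from the layer below
T = 1.0                          # target output T_i

x = W @ y_in                     # weighted input x_i (Equation 2.1)
y = sigmoid(x)                   # activation y_i = phi(x_i)
# Error term: sigmoid derivative y(1-y) times (y_i - T_i), so that the
# subtraction in Equation A.7 moves the output toward the target.
e = y * (1 - y) * (y - T)
W_new = W - eta * e * y_in       # weight update (Equation A.7)
```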
B Signal Pre-processing
This appendix illustrates the different pre-processing operations applied to
the sensor signals. In the following figures, both sensors' data were recorded
during the same typing session, and the values along the three axes (i.e. x,
y, and z) are processed.
B.1 Gyroscope
B.2 Accelerometer
C Confusion Matrices from Model Benchmark
This appendix shows the confusion matrices generated during the benchmark
detailed in Section 5.4.3 and performed to compare different neural network
architectures on different types of features.
D Experiment Results
This appendix provides result details (i.e. classifier loss during training,
confusion matrices from evaluation) to support Chapter 6.
D.1.1 FNN-Sigmoid
D.1.2 FNN-Tanh
D.1.3 RNN-LSTM
D.2.1 FNN-Sigmoid
D.2.2 FNN-Tanh
D.2.3 RNN-LSTM