The Threat of Screenshot-Taking Malware: Analysis, Detection and Prevention
Hugo Sbai
Balliol College
University of Oxford
A thesis presented for the degree of
Doctor of Philosophy
February 2022
Abstract
Among the various types of spyware, screenloggers are distinguished by their ability to
capture screenshots. This gives them considerable capacity for harm, enabling the theft
of sensitive data or, failing that, serious invasions of users' privacy. Several
examples of attacks relying on this screen-capture feature have been documented in
recent years.
Moreover, the available countermeasures either suffer from a lack of usability that
prevents their large-scale use or offer limited effectiveness.
Our detection model achieves an accuracy of 97.4%, versus 94.3% for a standard
state-of-the-art detection model. It was trained and tested on the first complete and
representative dataset dedicated to malicious and legitimate screenshot-taking
applications.
Our prevention mechanism is based on the retinal persistence property of the human
visual system. Its usability was tested with a panel of 119 users.
This thesis is written in accordance with the regulations for the degree of Doctor of
Philosophy. The thesis has been composed by myself and has not been submitted
in any previous application for any degree. The work presented in this thesis is
my own.
Acknowledgements
This research was supervised by Michael Goldsmith and Jassim Happa, and
I would like to thank them for their continued support, guidance and advice
throughout the project. Aside from these direct supervisors, I am very grateful
to the rest of our research group, who have all provided advice throughout.
I would also like to thank Professors David De Roure, Kurt Debattista, Ivan Mar-
tinovic and Kasper Rasmussen, who, through their comments at my Transfer of
Status, Confirmation of Status and final viva, provided me with invaluable feed-
back and direction, with which this thesis was much improved.
Thank you especially to all the friends who have made this process both more
manageable and more enjoyable. Finally, I would like to thank my family for
everything they have done.
Table of Contents
1 Introduction
1.1 Context and motivation
1.2 Research questions and contributions
1.3 Detailed design
1.4 Thesis structure
1.5 List of publications
2 Literature Review
2.1 Screenlogger behaviour analysis and construction of a dataset
2.1.1 Screenlogger behaviour analysis
2.1.2 Existing datasets
2.2 Malware detection
2.2.1 Signature-based detection
2.2.2 Anomaly-based detection
2.2.3 Behaviour-based detection
2.2.4 Synthesis
2.3 Screenlogging prevention
2.3.1 Screenlogger prevention: the authentication case
2.3.2 Screenlogger prevention: the general case
2.3.3 Synthesis
3 Adversary Model
3.1 Screenloggers’ capabilities and comparison with other attacks
3.1.1 Credential theft on virtual keyboards
3.1.2 Sensitive data breach
3.1.3 Spying on the victim’s activity
3.1.4 Blackmail
3.1.5 Reconnaissance
3.1.6 Synthesis
3.2 Attack scenarios
3.2.1 Online banking customer attack (capability 1 + capability 2)
3.2.2 Real-time monitoring of the victim’s activity (capability 3)
3.2.3 Blackmail (capability 4)
3.3 System Model
3.3.1 Targeted systems
3.3.2 Targeted victims
3.4 Threat Model
3.4.1 General Description
3.4.2 Operating process
3.4.3 Scope
4 Threat Analysis
4.1 Methodology
4.1.1 Extensive study of security reports
4.1.2 Proposed taxonomy
4.2 Screen capturing
4.2.1 Used API
4.2.2 Screenshot-triggering
4.2.3 Captured area
4.3 Screenshots storage
4.3.1 Image compression
4.3.2 Storage media
4.4 Screenshots exfiltration
4.4.1 Communication protocol
9 Evaluation and Security Analysis of the Proposed Countermeasures
9.1 Evaluation criteria
9.1.1 Security analysis
9.1.2 Usability
9.1.3 Real-time
9.1.4 Network bandwidth
9.2 Methodology
9.2.1 Security analysis
9.2.2 Usability
9.2.3 Real-time
9.2.4 Network bandwidth
9.3 Results
9.3.1 Security analysis
9.3.2 Usability
9.3.3 Real-time
9.3.4 Network bandwidth
9.4 Discussion
10 Conclusion
10.1 Contributions
10.1.1 Threat analysis (Chapters 4 and 6)
10.1.2 Dataset construction (Chapter 5)
10.1.3 Screenlogger detection (Chapter 7)
10.1.4 Retinal persistence-based mitigation technique (Chapters 8, 9 and 10)
10.2 Limitations
10.3 Final remarks and future work
List of Figures
4.1 Need for a screenshot command to start screen capturing (in the malware listed by MITRE [14]).
4.2 Screenshot-triggering (in the malware listed by MITRE [14]).
4.3 Captured area.
4.4 Image file formats.
4.5 File storage.
4.6 Communication protocol.
4.7 Image file encryption.
List of Tables
7.1 Detection results for the basic approach using features from the literature with the RF algorithm (k=10).
7.2 Detection results for the basic approach using features from the literature with the KNN algorithm (k=10).
7.3 Detection results for the basic approach using features from the literature with the SVM algorithm (k=10).
7.4 Detection results for the basic approach using features from the literature with the RF algorithm (k=3).
7.5 Detection results for the basic approach using features from the literature with the RF algorithm (k=5).
7.6 Detection results for the basic approach using features from the literature with the RF algorithm (k=7).
7.7 Detection results for the basic approach using features from the literature with the RF algorithm (k=12).
7.8 Detection results for the basic approach using features from the literature with the RF algorithm (k=15).
7.9 Detection results for the optimised approach using our specific features.
Acronyms
API Application Programming Interface.
CAPTCHA Completely Automated Public Turing test to tell Computers and Humans Apart.
DD Desktop Duplication.
FD Factor of Displacement.
FPS frames per second.
GDI Graphics Device Interface.
KNN K-nearest neighbour.
ML Machine learning.
OCR Optical Character Recognition.
RAT Remote Access Trojan.
RF Random Forest.
1 | Introduction
Among the aforementioned spyware modules, screenloggers provide one of the most
dangerous functionalities in today’s spyware, as they greatly contribute to attackers
achieving their goals, as illustrated in Figure 1.1.
Screenlogger users can be divided into two categories: financially motivated ac-
tors and state-sponsored attackers. The first category targets important industrial
companies (e.g., BRONZE BUTLER [19]), online banking users (e.g., RTM [21],
FIN7 [22], Svpeng [19]), and even banks themselves (e.g., Carbanak [21], Silence
[23]). The second category, which is even more problematic, targets critical in-
frastructure globally. For instance, the malware TinyZbot [24], a variant of Zeus
[25], has targeted critical infrastructure in more than 16 countries. More precisely,
the targets can be democratic institutions; for instance, XAgent targeted the US
Democratic Congressional Campaign Committee and the Democratic National
Committee [26]. In Europe, Regin [23] took screenshots for at least six years in
the IT network of the EU headquarters. Diplomatic agencies have also been compromised;
for example, the North Korean malware ScarCruft has targeted several of them [27]. US
defence contractors have also been hit by spyware such as Iron Tiger.
Screenloggers have the advantage of being able to capture any information dis-
played on the screen, offering a large set of possibilities for the attacker compared
to other spyware functionalities. Moreover, malware authors are inventive when
maliciously using screen captures. Indeed, screen captures have a wide range of
purposes. Some malware, such as Cannon and Zebrocy, only take one screenshot
during their entire execution as reconnaissance to see if the victim is worth infect-
ing [28, 29]. Others hide what is happening on the victim’s screen by displaying
a screenshot of their current desktop (FinFisher [29, 30]). Others take numerous
screen captures to closely monitor the victim’s activity. This can allow the theft of
sensitive intellectual property (Bronze Butler [4]) or banking credentials (RTM [21],
FIN7 [22], XAgent [26]), or the monitoring of bank clerks’ day-to-day activity to
understand banks’ internal mechanisms (Carbanak [31], Silence [23]).
These examples show that the screenshot functionality is widely used today in
modern malware programs and can be particularly stealthy, enabling powerful at-
tacks. Even in the case where no specific attack is performed, the simple fact of
monitoring and observing all the victims’ activity on their device is a serious inva-
sion of privacy. Moreover, screenshots are likely to contain personally identifiable
information [33].
What makes the screenlogger threat even more problematic is that, on desktop
environments, the screenshot functionality is legitimate and, as such, is used by
many benign applications (e.g., screen sharing for work or troubleshooting, saving
important information, creating figures, monitoring employees). The need to capture
the screen has, for instance, increased with telework, even on sensitive machines.
Teleworkers, including bank employees or lawyers, may need to control their office
computer remotely from home, or to share their screens during remote meetings.
Paradoxically, only a few works appear in the literature about screenloggers, leav-
ing the threat relatively unknown. This gap is revealed by the low number of oc-
currences of the words ‘screenlogger’ and affiliated keywords on Google Scholar
compared to other threats: 153 for ‘screenlogger’ vs 6,960 for ‘keyloggers’, 6,540
for ‘packet-sniffing’ or 20,600 for ‘social engineering’.
To the best of our knowledge, the only studies focusing on screenloggers have
been limited to specific questions, such as the Android Debug Bridge (ADB)
vulnerability that allows screenshot-taking on Android smartphones [34–37], or
screenshot protection during authentication using virtual keyboards [38–40]. No
overall view is available of the threat represented by screenshot-taking malware
and how it can be countered. Therefore, the objective of this thesis is to address
the lack of emphasis in the literature on screenloggers by studying their behaviour
and proposing a defence-in-depth approach to fight them.
6
University of Oxford Balliol College
• Use of the constructed dataset to identify the existing features which are the
most adapted to the detection of screenloggers.
• Use of the retinal persistence property of the human eye to mitigate manual
and automatic screenshot exploitation while trying to achieve the best
possible usability. Forcing malware programs to take screenshots more frequently
makes them more easily detectable by our specific behavioural detection system.
Several experiments were conducted to ensure the method’s robustness and the
possibility to deploy it widely, on four main aspects: security, usability,
real-time performance and network bandwidth.
The first part is a behaviour-based dynamic detection system. As soon as a program
makes a screenshot request, the detection system starts monitoring its API calls and
network traffic. It also tracks the different background and foreground processes
linked to this event. All this data is fed to an ML model that we trained to
distinguish benign from malicious patterns.
use the special ‘Print Screen’ key on the keyboard, taking the necessary hardware
precautions to avoid the keypress being simulated.
The mitigation system has two major effects. The first is to make it harder for
an attacker to recover information from the screenshots. They need to take several
screenshots in a limited time window and to use a specific algorithm to reconstruct
the screen. Indeed, we designed the alteration mechanism so that a simple
majority rule does not suffice: a pixel may be visible for more of the time than it
is hidden, or the reverse. Therefore, an Optical Character Recognition (OCR) step
is necessary to analyse the successive images and find which parts are hidden,
followed by an algorithm which combines the different parts.
The second effect of this mitigation mechanism is that, as the attacker needs to
take more screenshots, it becomes harder to remain undetected. Indeed, it is no
longer possible to capture the screen at spaced intervals of time to remain stealthy.
Therefore, this mechanism is complementary to the detection module and makes its
task easier.
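The following minimal sketch illustrates the principle with illustrative parameters only (the thesis mechanism itself is presented in Chapter 8): each content pixel is visible in a random subset of frames whose size may be below half, so a per-pixel majority vote over captured frames fails, while the union over enough frames, combined with reconstruction, succeeds.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 8                                                # frames per persistence window
content = rng.integers(0, 2, (4, 4)).astype(bool)    # toy binary "text" image

# Each pixel is shown in 2..K-1 frames, so some pixels are visible in a
# minority of frames and some in a majority.
shown = rng.integers(2, K, content.shape)
masks = np.zeros((K, 4, 4), dtype=bool)
for i in range(4):
    for j in range(4):
        masks[rng.choice(K, size=shown[i, j], replace=False), i, j] = True

frames = masks & content                  # what each displayed frame reveals
majority = frames.sum(axis=0) > K // 2    # naive attacker: per-pixel majority vote
union = frames.any(axis=0)                # attacker with many frames + reconstruction

print(bool((majority == content).all()),  # typically False: majority vote fails
      bool((union == content).all()))     # True: all frames together recover it
```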
The main works partially responding to the issues raised in the thesis, namely the
detection and prevention of screenloggers, are presented and analysed in Chapter 2.
The studied prevention approaches have been intentionally extended to anti-shoulder-surfing
countermeasures, even though this type of attack is outside the scope of the thesis.
This choice was made because of the strong similarities between shoulder-surfing
attacks and those that are the subject of our work, and because of the lack of
countermeasures against screenloggers.
Chapter 4 presents our threat analysis methodology. This work aims to remedy
the lack of behavioural studies specific to screenloggers in the literature. The
behaviour of screenlogging malware is studied and analysed in detail at each stage
of its operating mode, using novel criteria defined for this purpose. The
chapter concludes with a definition of the completeness criteria for our screenlogger
dataset.
2 | Literature Review
In this chapter, we present an overview of the state of the art regarding the five
contributions listed in Section 1.2.
To the best of our knowledge, the only studies focusing on screenloggers are limited
to specific questions: the ADB vulnerability that allows screenshot-taking on
Android smartphones [44–47], and screenshot protection during authentication using
virtual keyboards [5, 48, 49].
No general view is available of the operating mode of this kind of spyware and
the behaviours it exhibits at different stages of its execution. Only MITRE
dedicates a page to screenshot-taking malware [14], citing 127 security reports.
However, these reports are not compiled or analysed to give a clear view of the
threat represented by screenshot-taking malware and the capabilities it possesses.
Although many malware datasets containing diverse categories of malware are
available [50, 51], no existing dataset is dedicated to gathering diverse forms of
screenshot-taking malware. Screenlogger samples can only be found in general
malware datasets, mixed with other types of malware. However,
such general datasets would not allow us to test our future detection approach nor
to gather meaningful insights into screenloggers’ behaviour. There are several
reasons for that.
First, general dataset authors do not indicate whether their dataset contains
screenshot-taking malware. They usually simply indicate that the dataset contains
‘spyware’ without giving information about its functionalities, in particular the
screenshot functionality. To the best of our knowledge, only one dataset explicitly
includes screenloggers’ network traffic, related to the well-known malware programs
Zeus and Ares [52]. However, the collected network traffic of these screenloggers is
not representative of real screenlogger behaviour, because the authors only consider
a periodic screen recording of 400 seconds, whereas recent screenloggers are highly
diverse. Thus, existing malware datasets include few screenlogger samples, and those
they include are limited because they only reflect a small number of characteristics
describing one specific screenlogger behaviour.
[61]), parental or employee control (e.g., Verity [62], Kidlogger [63], Norton Online
Family [64]), or screenshot-taking and editing (e.g., PicPick [65], Snipping
Tool [66], FastStone Capture [67]). Thus, to effectively analyse spyware using the
screenshot capture feature, it is critical to understand and identify the
similarities and differences between their behaviour and that of legitimate
applications. However, there is currently no dataset of legitimate screenshot-taking
applications. Moreover, even when some legitimate applications contained in a
dataset can take screenshots, it is important to ensure that this functionality is
triggered at runtime (for example, by enabling screen sharing during a Skype call).
The basic idea of signature-based methods is that the signature of a program is
extracted from its files and matched against known signatures. These methods work in
two steps: first, known and already detected malware programs are analysed,
and their signatures are stored in a database; second, the signatures are compared
to detect possible malware. Bazrafshan et al. described these methods using three
components: (1) a data collector which performs static or dynamic analysis, (2) an
interpreter in charge of converting the collected data into an intermediate format,
(3) a matcher, which matches the extracted features with already known behaviour
signatures [68].
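A toy illustration of these three components (our own sketch; real signature databases use far richer patterns than whole-file hashes, and the sample digest below is hypothetical):

```python
import hashlib

def file_signature(path: str) -> str:
    """Collector + interpreter: reduce a binary to a canonical digest."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

KNOWN_SIGNATURES = {
    "4f2a...": "Zeus variant",   # hypothetical entry for a known sample
}

def matcher(path: str):
    """Matcher: report the malware family if the digest is known, else None."""
    return KNOWN_SIGNATURES.get(file_signature(path))
```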
These methods perform well for known malware programs, with a low false-positive
rate. However, they are vulnerable to obfuscation techniques, by which the attacker
can alter the code and avoid detection because the signature is changed [69].
Moreover, signature-based methods cannot detect malware employing polymorphism or
metamorphism, because the signature of the malware changes every time a machine is
infected, nor can they detect zero-day attacks. Therefore, the signature database
must be updated continuously and must hold a signature for each variant of every
known malware program.
The drawback of these methods for detecting malware in computer systems is that
such systems contain many processes and many possible behaviours, making it
difficult to define normal behaviour. This can result in a high false-positive rate.
Static analysis
Features can be extracted in a static way, which means that the binary files of the
malware are analysed without executing it. Different features can be extracted
from the binary files. The most frequently used features in the literature are:
• Control flow graph: after disassembling the binary file of a program, a con-
trol flow instruction graph can be created. Nodes represent blocks of in-
structions, and edges represent control flow instructions. A method using
control flow graphs was presented by Eskandari et al. [72]. Another exam-
ple of recent work using control flow graphs is [73], which is an approach
for detecting Java bytecode malware programs using static analysis with a
deep Convolutional Neural Network classifier to determine maliciousness.
• N-gram: n-grams are all possible substrings of length N that can be obtained
from a larger string. Using this method, malware programs are analysed
using text processing algorithms.
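For instance, a minimal byte-level n-gram extractor (our own sketch) looks like this:

```python
# Sketch of byte n-gram extraction: a binary is treated as a long string and
# every overlapping substring of length N becomes a feature.
from collections import Counter

def byte_ngrams(data: bytes, n: int = 3) -> Counter:
    """Count all overlapping n-grams in a byte string."""
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))

sample = b"\x55\x8b\xec\x83\xec\x08\x55\x8b\xec"   # toy instruction bytes
print(byte_ngrams(sample).most_common(2))          # e.g. [(b'U\x8b\xec', 2), ...]
```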
Moreover, malware can encrypt itself with a different key each time it infects a
machine, making it impossible to extract its features statically without decrypting
it. More specifically, packing is a technique that uses multiple layers of
compression and encryption to hide the malicious code. Anezi et al. argue that 80%
of current malware uses this technique, and that half of it consists of repacked
versions of old malware programs [78].
Dynamic analysis
Malware can be run in a real environment, taking the necessary precautions to avoid
its spreading, or in virtual environments [79, 80]. Megira et al. present
a more detailed description of the tools used to analyse malware [81]. Dynamic
analysis can overcome obfuscation techniques, since it analyses the runtime behaviour
of the malware. This technique is therefore widely used in recent malware detection
works, which consider diverse types of dynamic features, as described in the
following subsections.
The basic principle of these methods is illustrated by Han et al. [77]. The authors
developed a detection system called MalDAE, which employs both static and dynamic
API call analysis. Their method proceeds in several steps. First, data is collected
by analysing the portable executable section of the binary files to extract the API
call sequences statically. Then, the program is run in a sandbox, its dynamic API
calls are extracted, and pruning is performed to eliminate redundant API calls and
no-op instructions. This step helps to eliminate noisy API calls inserted by malware
designers. Then, the API call sequences from both methods are merged, and the
features are generated by taking the N API calls with the highest contribution.
Finally, the vector combining the N API calls is filled with the number of
occurrences of each one, and this vector is passed as input to an ML model to
classify the programs. This method was evaluated using the VirusShare dataset [85],
considering five types of malware.
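A sketch of the Top-N feature-generation step described above (our own simplification; the API names and traces are illustrative):

```python
# MalDAE-style feature generation: keep the N API calls that occur most across
# the training traces, then represent each program by their occurrence counts.
from collections import Counter

def top_n_apis(traces: list[list[str]], n: int) -> list[str]:
    """Select the N most frequent API calls over all merged traces."""
    totals = Counter(call for trace in traces for call in trace)
    return [api for api, _ in totals.most_common(n)]

def to_vector(trace: list[str], vocab: list[str]) -> list[int]:
    counts = Counter(trace)
    return [counts[api] for api in vocab]

traces = [["CreateFileW", "WriteFile", "WriteFile"],
          ["BitBlt", "BitBlt", "send", "WriteFile"]]
vocab = top_n_apis(traces, n=3)               # e.g. ['WriteFile', 'BitBlt', ...]
print([to_vector(t, vocab) for t in traces])  # input rows for the ML model
```

Note that, exactly as argued next, `vocab` is entirely dataset-driven: if no trace contains screenshot-related calls, they can never become features.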
Malware developers, however, use ever more sophisticated methods to evade detection,
and, as argued by Weijie et al., traditional methods based on API call sequences may
be insufficient [82]. For instance, they can be cheated by malware with a strong
imitation ability, which can reproduce the same API calls as legitimate programs. As
a result, detection methods based on API calls [86, 87] have failed to detect some
malware. Another drawback is that the feature vector generation is based on the
Top-N API calls that were the most frequent in the training samples [77, 82]. Thus,
the selected API calls strongly depend on the nature of the dataset. Hence, if the
dataset contains no or very few screenshot-taking malware samples, screenshot calls
will not be taken into account. A recent work aiming to solve these issues is
presented in [88]. The authors propose a novel ensemble adversarial dynamic
behaviour detection method targeting three properties of malicious API sequences,
namely Immediacy, Locality and Adversary. Their results show that this technique
provides more resilience than existing ones.
the GetWindowDC and BitBlt functions, and uses the system service dispatcher
table, the table that contains pointers to each kernel function. Then, to classify a
screenshot-taking program as spyware or benign, they used a decision tree (with the
J48 algorithm) considering the following features: frequency of repetition,
uniqueness of the applicant process, state of the applicant process (hidden or not),
and the values of parameters in the system calls. The results showed that the
proposed method could detect screenloggers with an accuracy of 92% and an error rate
of 7%.
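The classification step could be approximated as follows (our own sketch: scikit-learn's CART decision tree stands in for J48/C4.5, and the feature encoding is an assumption):

```python
# Rough analogue of the described classifier. Assumed feature encoding:
# [repetition frequency (captures/min), process is hidden (0/1),
#  process previously seen on this host (0/1)].
from sklearn.tree import DecisionTreeClassifier

X = [
    [0.1, 0, 1],   # occasional capture, visible, known tool   -> benign
    [30.0, 1, 0],  # frequent capture, hidden, unknown process -> spyware
    [0.5, 0, 1],
    [12.0, 1, 0],
]
y = [0, 1, 0, 1]

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(tree.predict([[25.0, 1, 0]]))   # -> [1]
```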
RFC 3697 (the IPv6 Flow Label Specification) defines a traffic flow as ‘a sequence
of packets sent from a particular source to a particular unicast, anycast, or
multicast destination that the source desires to label as a flow’. Lashkari et al.
defined a network flow as a sequence of packets sharing the same values for five
attributes: source IP, destination IP, source port, destination port and
protocol [93]. TCP flows are usually terminated upon connection teardown (by FIN
packets), while UDP flows are terminated by a flow timeout.
The flow-based network features from previous works in this domain have been
used to detect different malware types, such as botnets, adware or ransomware. The
features can be classified using the four categories defined by Lashkari et al.:
byte-based (features based on byte counts), packet-based (features based on packet
statistics), time-based (features depending on time) and behaviour-based (features
representing specific flow behaviour). To these, we can add transport-layer features
and host-based features. Numerous features may not be used by the classification
algorithm because they do not carry useful information (depending on the malware
type).
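As an illustration of flow construction and of the first feature categories (our own sketch; the packet records are synthetic and no capture library is assumed):

```python
# Group packets into flows by Lashkari et al.'s 5-tuple, then compute a few
# byte-, packet- and time-based features per flow.
from collections import defaultdict

# (timestamp, src_ip, dst_ip, src_port, dst_port, proto, payload_bytes)
packets = [
    (0.00, "10.0.0.5", "198.51.100.7", 49152, 443, "TCP", 512),
    (0.40, "10.0.0.5", "198.51.100.7", 49152, 443, "TCP", 1460),
    (0.90, "10.0.0.5", "203.0.113.9", 53124, 53, "UDP", 64),
]

flows = defaultdict(list)
for ts, sip, dip, sp, dp, proto, size in packets:
    flows[(sip, dip, sp, dp, proto)].append((ts, size))

for key, pkts in flows.items():
    times, sizes = zip(*pkts)
    print(key, {
        "packets": len(pkts),                   # packet-based
        "bytes": sum(sizes),                    # byte-based
        "duration_s": max(times) - min(times),  # time-based
    })
```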
Some existing research has developed models for general malware detection based on
network traffic [92–94]. To characterise general malware traffic, Lashkari et al.
proposed an Android malware detection model based on nine network features: flow
packet length (min, max), backward data-byte variance, number of packets with FIN,
flow forward and backward bytes, maximum idle time, initial forward window and
minimum forward segment size [93]. These features were selected from an initial set
of 17 by applying three feature-selection algorithms. The proposed approach showed
an average accuracy of 91.41% and a false-positive rate of 8.5%.
Boukhtouta et al. used two approaches: deep packet inspection and IP packet header
classification [92]. Deep packet inspection is an advanced method for examining the
content of packets and identifying the specific applications or services they come
from. It is a form of packet filtering that locates, identifies, classifies, reroutes
or blocks packets with specific data that conventional packet filtering, which
examines only packet headers, cannot detect. The evaluation results showed that it
is possible to detect malicious traffic using the J48 and Boosted J48 algorithms
with 99% precision and a false-positive rate below 1%.
Nari and Ghorbani proposed a framework for the automated classification of malware
samples based on network behaviour [94]. They considered different malware families,
such as Bifrose, Ardamax, Mydoom and LDPinch, and identified network flows from
traffic traces gathered during the execution of malware samples. They then created a
behaviour graph for each sample to represent the malware activity. The behavioural
profile is built using dependencies between network flows in addition to the flow
information. The classification features are extracted from the behaviour graphs.
These include, for example, the graph size (the number of nodes in the graph, which
also gives the number of flows in the network trace of the malware), the root
out-degree (which represents the number of independent flows in the network trace)
and the maximum out-degree (which gives the maximum number of dependent flows in the
network trace).
The Canadian Institute for Cybersecurity offers two datasets intended for intrusion
detection: CICIDS2017 [96] and CSE-CIC-IDS2018 [97]. The datasets cover six attack
profiles: DoS/DDoS, SSH/FTP, Web attack, infiltration, bot and port scan. Note that
CSE-CIC-IDS2018 includes a botnet attack with a screenshot-taking and keylogging
scenario [97]. For both datasets, 80 network traffic features were extracted, and
RandomForestRegressor was used to select the best feature set for each attack [96].
For instance, the best set of features for botnet detection consisted of subflow
forward bytes, total length of forward packets, forward packet length mean and
backward packets per second.
Saad et al. and Beigi et al. used features characterising the length of the flow along
with other general features to detect peer-to-peer botnet traffic [98, 99]. To select
the features, they assumed that botnet traffic generated by bots is more uniform
than traffic generated by legitimate users showing highly diverse behaviour.
Note that screenloggers have received less attention from researchers than other
malware types, despite their impact on personal user data. Although malware programs
share some common characteristics, they have different communication patterns with
their C&C server. For instance, in P2P botnets, the bots frequently communicate with
each other and send ‘keep alive’ messages. To detect screenloggers based on their
network traffic, it may be necessary to use specific features related to their
characteristics, such as the data type or the fact that their traffic is asymmetric.
To the best of our knowledge, the only available screenlogger network traffic is
related to the Zeus and Ares malware programs [96, 97] and is not representative of
existing screenloggers, as it only covers periodic sending over 400 seconds.
Screenloggers may exhibit various behaviours to remain stealthy, such as sending at
irregular intervals or reducing the packet size by capturing only part of the
screen, lowering the resolution or using compression. Our experiments showed that a
screenshot of a virtual keyboard can be reduced to 33 KB by using low resolution and
compression while still being exploitable by OCR tools.
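The size reduction is easy to reproduce (a sketch assuming Pillow is installed; the input file name is hypothetical):

```python
# Downscale a captured screenshot and re-encode it as low-quality JPEG,
# then measure the resulting size in memory.
import io
from PIL import Image

img = Image.open("screenshot.png")                 # e.g. a 1920x1080 capture
small = img.resize((img.width // 2, img.height // 2))

buf = io.BytesIO()
small.convert("RGB").save(buf, format="JPEG", quality=30)
print(f"{buf.tell() / 1024:.1f} KB")               # often a few tens of KB
```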
Malware detection based on resource use Some works monitor resource consumption,
such as CPU or memory use, to detect malware.
Many works in the literature are based on the number and sequences of file system or
registry operations. Indeed, to perform their tasks, malicious programs need to
perform operations on the file system and the registry, such as creating, deleting
or altering files or registry entries.
An example of a method using these parameters was proposed by Mao et al., who used
directed graphs whose edges represent the dependencies between processes, files and
registry entries [100]. Once the dependency graph is created, a PageRank-inspired
algorithm is used to quantify the importance of each node. Then heuristic rules are
used to detect the malware. The authors compared their approach with three
classifiers: KNN, linear regression and RF. Other works consider memory and registry
operations, along with other features [82, 83].
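A toy version of the PageRank-style scoring used by Mao et al. on a process/file/registry dependency graph (our own sketch; the graph, edge set and damping factor are illustrative, and dangling-node mass is ignored for brevity):

```python
# Power iteration over a small dependency graph; higher rank = more central
# to the observed activity.
graph = {
    "proc:dropper.exe": ["file:payload.dll", "reg:Run\\updater"],
    "file:payload.dll": ["proc:injected.exe"],
    "reg:Run\\updater": ["proc:dropper.exe"],
    "proc:injected.exe": [],
}

nodes = list(graph)
rank = {n: 1 / len(nodes) for n in nodes}
d = 0.85                                  # damping factor
for _ in range(50):
    new = {}
    for n in nodes:
        incoming = sum(rank[m] / len(graph[m]) for m in nodes if n in graph[m])
        new[n] = (1 - d) / len(nodes) + d * incoming
    rank = new

print(max(rank, key=rank.get))  # most "important" node in the activity graph
```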
A work using memory access patterns is presented by Banin et al. [101]. The authors
use an automated virtualised environment and Intel Pin, a dynamic binary
instrumentation tool, to record the memory access sequences produced by malicious
and benign executables. Dynamic binary instrumentation can be used for live analysis
of binary executables and facilitates the analysis of different properties of the
execution, such as memory activity or addressing space. The authors focused on basic
memory access operations (read and write) and their n-grams. They tested their
method with several ML models, such as KNN and Artificial Neural Networks (ANN).
However, these works only focus on the number or the sequence of memory or file
operations and do not consider other parameters, such as the size of the written
files, which may be necessary for screenlogger detection and differentiation from
legitimate screenshot-taking applications.
All these methods target mobile environments and, to the best of our knowledge, CPU
usage has not been employed to detect malware in desktop environments. Although it
may not be effective alone, in the screenlogger case it could be useful alongside
other parameters, such as screenshot API calls, to help distinguish between
screen-recording malware and legitimate screenshot-taking applications.
Weijie et al. proposed MalInsight, a framework that allows for the detection and
classification of malware programs using several parameters [82]. The framework
consists of four stages: (1) data collection; (2) profiling based on the programs’
API calls, file and registry access, and network use; (3) feature generation
(numbers of occurrences of API calls, network operations and I/O operations); and
(4) training of four ML models. The dataset used contains benign applications and
five malware types: backdoor, constructor, email-worm, net-worm and
trojan-downloader. The performance of each model is compared when combining the
different features, to show the impact of each feature on the performance of the
models, which depends on the malware type.
SVM, classification tree, KNN and Perceptron. The SVM model showed the best
performance. The authors argued that their method showed a precision of 99% in
classification and 98% in clustering.
However, considering too many features may lead to overfitting. Indeed, depending on
the malware type, some features may be more relevant than others; adding too many
unnecessary features may therefore mislead the classifier. Weijie et al. tested
their method on completely unknown malware samples having no link with their
training dataset, and the results showed that, for this malware type, considering
only high-level features related to file, registry and network access is better
than relying on a complete feature set including API calls, DLLs and static
features [82].
Urooj et al. [112] conducted a study of the machine learning techniques proposed in
the literature over the last three years to detect ransomware. All these methods are
based on behavioural detection with dynamic features. The study shows that many
recent techniques use deep learning algorithms (ANN), alone or combined with
traditional ML, to detect ransomware. However, to be efficient, this
2.2.4 Synthesis
Signature-based methods give good results for known malware programs, with a low
false-positive rate. However, they are vulnerable to the obfuscation techniques that
are increasingly used by modern malware. Moreover, signature-based methods cannot
detect malware integrating polymorphism and metamorphism mechanisms, because the
signature of the malware changes every time a machine is infected.
Considering many features for general malware detection is not ideal either, as it
makes it easier for malware programs to pretend to be legitimate by taking advantage
of overfitting.
Figure 2.1 sums up why existing malware detection approaches are not adapted to
the screenlogger case. There is a need to design a behaviour-based approach with
dynamic features specific to screenloggers (API calls and network monitoring).
Most works aiming at protecting the screen content during authentication cover
the case of shoulder-surfing attacks from an external observer. However, some of
them are also relevant to screenshot-based attacks.
(a) [114] (b) [115] (c) [116]
Figure 2.2: Examples of anti-shoulder surfing authentication techniques using obfuscation and confusion.
One proposed solution consists of adding artefacts (noise) on the screen when a
click occurs [117]; for example, displaying an artificial mouse pointer to prevent
the malware from knowing which part of the screen has been clicked. This approach
would not stand up to malware recording the click coordinates.
Several works propose using virtual keyboards in which keys are mixed or hidden.
Agarawal et al. suggested a dynamic virtual keyboard that mixes key layouts after
each click and hides them at click events (Figure 2.2 (a)) [114]. A colour code is
used to easily remember character positions. The user can enter one character at a
time: the position of the character to be typed is noted, the user clicks on the
‘hide keys’ button to hide all characters, and the position of the desired key is
then located using its colour code.
The keyboard proposed by Srinivasan et al. consists of 72 keys that are logically
divided into four groups – A, B, C and D, each comprising 18 keys [118]. When
the keyboard is shown to the user, the keys’ positions are randomised. The user
must locate and remember the current position of the desired key. Each key is
indexed using its position and its group ID. Thus, the user needs to note down
the index value and group ID corresponding to the required key (for example, A2,
where A is the group ID and 2 is the index of the required key). At the next step,
the user shuffles the keyboard keys and hides their labels so that only their indexes
appear. After a key is entered, the keyboard is randomised and switched to visible
mode. This process continues until the user enters all the password characters and
chooses to submit.
A similar approach was proposed by Parekh et al., with a virtual keyboard that
changes its appearance and disposition over time [115]. When the user clicks on a
key, all the keys of the keyboard are transformed into asterisks, and the keyboard
layout is changed randomly after each click (Figure 2.2 (b)).
All these techniques have in common that they rely on the attacker’s short-term
memory: the attacker cannot remember the position of each character between clicks,
because the layout is mixed and the keys are hidden at click events. Such techniques
are effective against shoulder-surfing attacks performed by a human observer, but
not when the process is recorded using a camera. Moreover, they are effective
against screenloggers that take screenshots at each click, but not against those
taking screenshots at regular, short intervals.
Cognitive tasks Other techniques have attempted to make games out of the
authentication procedure without needing to hide any information (Figure 2.3).
These techniques use cognitive tasks to increase the difficulty of the login session.
35
(5) The verifier repeats steps 1–4 four times, one time for each of the four digits
D1, . . . , D4 that constitute the prover!s PIN.
Overall, 16 input/output rounds have to be completed, four rounds per digit and four
repetitions for the four digits of the prover!s
Single-set scheme is the basic S3PAS scheme. In this
PIN. If any of the set intersections contains
Without loss of generality, we assume that the user Alice’s
either no digit
scheme, or more
the available passwordthan
icons setone
T is digit
the set ofthenoriginal
an error
passwordoccurred during
k is “A1B3”. Since input.
the length In that case, the
of the pass-
University
all the printable
verifier notifies charactersof
justOxford
like the conventional textual
the|T |prover Balliol
word is, |k| = 4, based on the basic click-rule, College
Alice has to
password system, and = 94. Thereof is athe
stringerror,
k which increases thecorrectly
click four times overallin thecount of false
right sequence attempts for the
to be au-
alleged prover, and offers to repeat the entire procedure unless three false attempts were
is user’s password previously chosen and memorized by the thenticated. The four combinations of password in order
user, which is named “original password”. The characters are “A1B”, “1B3”, “B3A” and “3A1”. The login procedure
counted. Otherwise,
in k are called
However, theuseverifier
the of cognitive
“original pass-characters”. verifies that is
challenges
consists the notdigits
alwaysfour
of the following , . . .and
Defficient
1steps ,D constitute
against
is4also
screen- the correct
shown in
Initially, the system randomly scatters the set T in the Figure 2 (a) to (d).
PIN.login
Fig. 1 illustrates steps 1–3.
image as shown in Figure 1(a) and 1(b).
recording attacks depending on the proposed scheme.
1. Alice finds her pass-characters “A”, “1” and “B”, then
clicks inside the pass-triangle or input a session pass-
Online Banking S3PAS Powered
! 0 ? N ] A } " @ 1 character inside !A1B (e.g., “P”).
O ^ z input w
| # 2 { P _ y input b input w
Easy, Secure, and Free. ~ $ a Q ' x ` % 4 2. Alice finds her pass-characters “1”,
input“B”b and “3”, then
C R 3 w & 5 D S b v
Conventional Textual Password 6 E T c u ( 7 F U clicks inside the pass-triangle or input a session pass-
Graphical Interface (S3PAS) 1 d2 3 e t ) 8 G 1 2 3
V B s * 1 character
2 3 inside !1B3 1 2 (e.g.,
3 “D”).
9 H W g r + : I x h
q
4 5 6 , ; y i p
4 5 6- < K
4 Alice
3. 5 6finds her pass-characters
4 5 6 “B”, next
“3”digit
and “A”, then
Z j o . = L [ k n >
/ M \ J l m f
or clear
clicks inside the pass-triangle or input a session pass-
Proceed to Login
7 8 9 7 8 9 7 character
8 9 inside !B3A 7 8 (e.g.,
9 “5”).
Sign In
system, the original passwords and the session passwords. (a) pass-triangle !A1B (b) pass-triangle !1B3
Users choose their original passwords when creating their
accounts. In every login process, (b)users
[120]input different ses- (c) [121]
sion passwords so that they can protect their original pass- ! 0 ? N ] A } " @ 1 ! 0 ? N ] A } " @ 1
words from releasing. O ^ z | # 2 { P _ y O ^ z | # 2 { P _ y
Figure 2.3: Examples of
The click-rule for single-set scheme is as follows. For anti-shoulder~ $ surfing
a Q ' xauthentication
` % 4 ~ $ a techniques
Q ' x ` % 4using
C R 3 w & 5 D S b v C R 3 w & 5 D S b v
cognitive
the user’s password string tasks.
k, we number the first charac- 6 E T c u ( 7 F U 6 E T c u ( 7 F U
ter in k as k1 , the second k2 , the third k3 , etc. Then we d e t ) 8 G V B s * d e t ) 8 G V B s *
have k1 , k2 , k3 , . . . , kn−1 , kn , n = |k|. To login, users 9 H W g r + : I x h 9 H W g r + : I x h
have to find out k1 , k2 , k3 , . . . , kn−1 , kn in the login im- q , ; y i p - < K q , ; y i p - < K
age. Then the first click must be inside the pass-triangle Z j o . = L [ k n > Z j o . = L [ k n >
formed by kRoth1 , k2 and etkal. presented
3 . The anmust
second click alternative
be in- PIN
/ M entry
\ J method
l m f called / M‘the
\ cognitive
J l m f trap-
side the pass-triangle formed by k2 , k3 and k4 . Recursively,
the i-th clickdoor
must be game’
inside to
the prevent shoulder
pass-triangle formed bysurfing [119]. The main idea is to consecutively
Sign In Sign In
kimod|k| , k(i+1)mod|k| and k(i+2)mod|k| , i = 1 . . . n. This (c) pass-triangle !B3A (d) pass-triangle !3A1
display the set of PIN digits as two partitions. Instead of clicking on the targeted
is the “basic click-rule.”
To show the login process, let us follow an example. Figure 2. S3PAS Login Process
digit, the user must indicate the partition to which it belongs. This process is re-
peated four times for each of the four PIN digits. Roth et al. proposed a similar
approach in another work [122]. These approaches make use of the limited human
visual short-term memory to counter shoulder surfing, because a human observer
cannot remember the combination of the 16 partitions entered by the user with the
Other approaches are more difficult to break, even with screenshots, but may be
subject to brute-force attacks. This is the case for the approach proposed by
Surjushe et al., which aims to bolster the security of virtual keyboards against
shoulder surfing without hiding characters [124]. During the registration process,
the user specifies an email address where a Factor of Displacement (FD) and a
traversal direction are received. The FD is a random number between one and five
specifying the number of shifts from one key to another; the traversal direction
indicates the shift direction. When logging on, for each character of the password,
the user must calculate the new character to type by applying the FD shifts. For
example, if the original character of the password is ‘q’, with an FD equal to one
and a horizontal-right traversal direction, the new character to be typed is ‘w’.
The weakness of this approach lies in the fact that the adversary can try all the
combinations of FD and direction, which makes only 20 possibilities (five FD values
times four directions).
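The enumeration is trivial to script (our own sketch, using a simplified QWERTY grid with wrap-around; the real scheme's key layout may differ):

```python
# Enumerate all 20 (FD, direction) hypotheses and reverse each shift to
# recover candidate original characters from one observed keystroke.
ROWS = ["qwertyuiop", "asdfghjkl;", "zxcvbnm,./"]   # rows padded to length 10

def locate(c):
    for r, row in enumerate(ROWS):
        if c in row:
            return r, row.index(c)
    raise ValueError(c)

def shift(c, fd, direction):
    r, i = locate(c)
    if direction == "right": i = (i + fd) % 10
    if direction == "left":  i = (i - fd) % 10
    if direction == "down":  r = (r + fd) % 3
    if direction == "up":    r = (r - fd) % 3
    return ROWS[r][i]

INVERSE = {"right": "left", "left": "right", "down": "up", "up": "down"}
typed = "w"  # the attacker observed this keystroke
candidates = {(fd, d): shift(typed, fd, INVERSE[d])
              for fd in range(1, 6) for d in INVERSE}
print(candidates)  # 20 hypotheses; the true 'q' (FD=1, right) is among them
```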
Some works propose password schemes that are resistant both to shoulder surfing
and to screen-recording attacks. Deulgaonkar et al. propose a pair-based scheme.
Jermyn et al. proposed a scheme for a graphical input display that allows for
drawing any shape or character as a password (Figure 2.5) [2].
This scheme was extended by Martinez et al. [128]. The proposed authentication
system is based on behavioural biometric characteristics extracted from the dynamics
of the drawing process, such as speed and acceleration. In other words, the attacker
would have to imitate not only what the user draws, but also how the user draws it,
which makes it more difficult to impersonate the user even with screen recording.
However, if screenshots are taken at a sufficiently high frequency, it is still
possible to deduce the speed at which the form was drawn. It may also be possible
for malware to use the phone’s sensors, such as the accelerometer. Moreover, this
technique is specific to the authentication process and cannot be extended to
protect arbitrary information displayed on the screen.
One drawback is that special hardware is needed. Moreover, the technique is specific
to the authentication process and cannot be used for general screen content
protection.
Luca et al. presented an authentication mechanism named XSide that uses the
front and the back of smartphones to enter stroke-based passwords (Figure 2.7)
[4]. Users can switch sides during input to minimise the risk of shoulder surfing.
The authors explored how switching sides during authentication affects usability
and the security of the system. This technique is specific to smartphone devices
and is also limited to the authentication process.
A few works have proposed methods based on human vision features to prevent
screenloggers from stealing credentials during authentication. Lim et al. proposed a
solution based on retinal persistence (Figure 2.8) [5]. The idea is to divide each
character into segments and display them one after the other in quick succession. At
a sufficiently high speed, a human sees the whole character, whereas on a screenshot
the digits are never fully visible. However, this solution is limited to digital
numbers composed of a small number of segments. Also, an adversary can deduce
possible characters from a single image (knowing that one segment is on may greatly
reduce the set of possible digits).
Figure 2.8: Formation of the number ‘4’ using visual persistence [5].
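A small sketch makes both the mechanism and the leak concrete (standard seven-segment encodings; the frame splitting is our own illustration):

```python
# Segments a-g lit for each digit on a standard seven-segment display.
SEGMENTS = {
    0: "abcdef", 1: "bc", 2: "abdeg", 3: "abcdg", 4: "bcfg",
    5: "acdfg", 6: "acdefg", 7: "abc", 8: "abcdefg", 9: "abcdfg",
}

def frames_for(digit, per_frame=2):
    """Split a digit's lit segments into successive frames; shown quickly
    enough, retinal persistence fuses them into the whole character."""
    segs = SEGMENTS[digit]
    return [segs[i:i + per_frame] for i in range(0, len(segs), per_frame)]

# One screenshot captures only one frame, yet it still leaks information:
for frame in frames_for(4):        # ['bc', 'fg']
    possible = sorted(d for d, s in SEGMENTS.items() if set(frame) <= set(s))
    print(frame, possible)         # 'bc' -> [0,1,3,4,7,8,9]; 'fg' -> [4,5,6,8,9]
```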
belonging to the same form. The idea is to divide the keyboard into tiles that
contain either a random or a pre-calculated texture. This technique stimulates the
brain to detect the pre-calculated tiles on which the same rotation is applied as
whole shapes while disregarding the random textures as background. In doing
so, a human sees a traditional keyboard while a program only detects noise. This
method can also allow for the prevention of screenshot exploitation by a human:
motion perception allows a human to distinguish a given form through motion.
When a screenshot is taken, however, there is no motion, and even a human can-
not read the content. This idea is extended by Bacara et al. [48], who designed
a virtual keyboard using motion perception and two other human vision proper-
ties : visual assimilation that allows a human to perceive the apparent brightness
of an object depending on the contrast between the object and its surroundings,
and visual interpolation, which allows a human to perceive the complete shape
of an object even with missing parts [130]. Although these works afford ade-
quate protection against screenloggers during authentication, it may be difficult
to use them for general content protection because the user’s visual comfort is
significantly affected due to the techniques used, resulting in greyscale and noisy
images. Another obstacle for the adoption of this method as a general solution
against screenshots is that it is content-dependent and requires computations to be performed on each pixel, which is time- and resource-consuming. Moreover,
these methods are limited to textual content protection.
A different approach is proposed by Nyang et al. [6]. The authors provide a se-
cure authentication protocol that overcomes several attacks, including keyloggers
and screenloggers, by relying on the use of two devices (a smartphone and a com-
puter, for example). A blank virtual keyboard is displayed on the first screen, and
after scanning a QR code, randomly dispersed keyboard keys are shown on the
second screen. The user enters the password using a mouse pointer on the blank
virtual keyboard of the computer while seeing the arrangement of keys through
the smartphone (Figure 2.9). However, this technique requires having two devices
and cannot be used for general screen content protection.
Shoulder surfing

Other works go beyond the authentication case and aim to protect arbitrary on-screen content from malicious viewers. This is the case of Brudy et al., who proposed a solution to address the shoulder-
surfing threat in public spaces for large displays (fixed device) [131]. The authors
exploited the spatial relationships between the user, the shoulder surfer and the
screen. Their solution warns the user about the presence of a shoulder surfer by
flashing the border of the screen and providing a 3D model in which the position
and gaze of the surfer are indicated. The user can then hide the personal windows or gather them together on one side of the screen. Moreover, the positions of the user's head and of the shoulder surfer relative to the display are estimated in order to darken the regions visible to the shoulder surfer while leaving the shielded display area unaltered.
However, this technique presents several drawbacks, such as its inapplicability to
cases where it is a camera instead of a human recording the content of the screen,
or in the case of a screenlogger. Moreover, this system may disturb the user in
the case of false positives, where faces are detected belonging to persons who are not shoulder surfers looking at the screen, or to persons to whom the user wants to show information.
Another technique uses face detection and thus suffers from the same drawbacks
[7]. The authors proposed the Shuffling Texts Method (STM) – a method aim-
ing to prevent shoulder-surfing attacks. The STM displays shuffled texts to the
malicious shoulder surfers. In plain mode, when only the user is reading the
document, the shuffling level is 0. As soon as an intruder appears, STM provides
shuffled texts with a higher shuffling level (Figure 2.10). Like the solution proposed
by Brudy et al., this method, which relies on face detection, is not applicable to
the case of an adversary using a camera to film the content of the screen, nor to screenloggers. Moreover, usability is affected, since users must move the cursor while reading to indicate which zones must not be shuffled.

Figure 2.10: Shuffling text protection method [7].

Another solution, proposed by Khamis et al., is to use gaze-tracking [132]. The authors proposed to display textual content on smartphones by tracking the user's gaze while hiding the rest of the screen using different masks (blackout, crystallise or a fake text). However, this solution is not effective against screen-recording attacks.

Other solutions give an active role to the users or let them choose what content they want to protect.
Eiband et al. proposed replacing texts with the user’s handwriting to protect tex-
tual content from shoulder-surfing attacks [133]. This method assumes that the
shoulder surfers are slower when reading the unfamiliar handwriting of other per-
sons. The proposed scheme is split into two steps in which the user’s handwriting
is collected word-by-word then combined into original sentences of textual con-
tent. However, this method presents several drawbacks, including poor usability; moreover, it is not necessarily effective against screen-recording attacks because modern OCR systems are able to recognise handwritten content.
Mitchell et al. offered a method to replace sensitive information with less mean-
ingful data to combat shoulder surfing [8]. As illustrated in Figure 2.11, the user
must define a list of sensitive words that will be replaced by aliases. Because users must choose the data they want to protect, the scope of the proposed approach is significantly reduced: most users are not security aware and would not do so. Moreover, the approach is quite intrusive because users cannot see the protected data.
Figure 2.11: Sample mappings of sensitive private data elements to their corre-
sponding Cashtag alias [8].
For the protection of graphical content, Zezschwitz et al. proposed the application
of distortions to images displayed on smartphones to protect them from shoulder
surfing [9].

Figure 2.12: Filters applied to photos against shoulder surfing [9].

To conclude, some of the proposed solutions against shoulder surfing are not effective against screen-recording attacks because they rely on spatial information, such as face recognition or gaze-tracking. Other solutions use subjective data from the users, such as their handwriting, or let them choose which data they want to protect or completely hide. These approaches are, however, highly intrusive and cannot be adopted widely.

These drawbacks stem from the general difficulty of preventing shoulder surfing, which is quite a challenging problem. An attacker next to a user seeing the same screen must be prevented from acquiring sensitive data, while users must be able
to utilise their devices and read the content of their screens. The problem is quite
different in the case of screen-recording spyware, and other solutions have been
proposed to cover this case.
Anti-OCR techniques
During the last decades, OCR techniques have rapidly evolved. OCR systems
take as inputs images containing text and produce a textual representation of the
extracted characters. The recognition process is composed of two stages. The first stage is segmentation: features are extracted from the images, and a series of separate individual characters is derived accordingly. The second stage is recognition: the characters are recognised using ML, and their textual representations are produced.
Usually, OCR algorithms are legitimately used for digitising and making search-
able typed or handwritten text from scanned images. In some cases, however, they
are used in online bots to bypass a text-based Completely Automated Public Tur-
ing test to tell Computers and Humans Apart (CAPTCHA). A CAPTCHA is used
to protect online services from attacks, such as denial of service and spamming,
and allows only human interactions. To prevent OCR algorithms from recognising
the content of text-based CAPTCHAs, researchers have proposed different nois-
ing techniques. Each of them works on specific stages and attempts to introduce
noise automatically.
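As an illustration of this pipeline, the short sketch below runs an off-the-shelf OCR engine over a screenshot; pytesseract, Pillow and a local Tesseract installation are assumptions for the example and are not tied to any particular work cited here.

```python
from PIL import Image
import pytesseract  # Python wrapper around the Tesseract OCR engine

# Tesseract internally performs the two stages described above:
# segmentation of the image into characters, then character recognition.
image = Image.open("screenshot.png")  # hypothetical captured screenshot
print(pytesseract.image_to_string(image))
```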
Anti-segmentation techniques allow for the production of images with text, where
the position, size and other characteristics are intentionally chosen to prevent seg-
mentation. In contrast, anti-recognition techniques are used to mislead the recog-
nition process. In this section, we review recent studies proposing techniques to
prevent OCR.
Bursztein et al. presented a new text-based CAPTCHA scheme [134]. The aim
was to design a user-friendly scheme that minimises user error. The authors used
three categories of features: visual features, anti-segmentation features and anti-
recognition features. Visual features describe the visual aspect of the generated
image, which influences users’ errors. The anti-segmentation features include
character overlaps, random dot size, random dot count, type of lines, line positions, count and width, and similar foreground and background colours. Anti-
recognition features encompass rotated character count, rotation degree, vertical
character shifting, character size variation and character distortions. The impact
of each of these features was evaluated with users and quantified according to
solving time and error.
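As a rough illustration of combining such features, the sketch below draws rotated, vertically shifted characters and overlays random dots and lines using Pillow; the parameter values are arbitrary choices, not those evaluated in [134].

```python
import random
from PIL import Image, ImageDraw, ImageFont

def noisy_captcha(text, size=(200, 70)):
    """Render `text` with simple anti-segmentation and anti-recognition noise."""
    img = Image.new("RGB", size, (230, 230, 230))
    draw = ImageDraw.Draw(img)
    font = ImageFont.load_default()
    x = 10
    for ch in text:
        # Anti-recognition: rotate each character and shift it vertically.
        tile = Image.new("RGBA", (30, 40), (0, 0, 0, 0))
        ImageDraw.Draw(tile).text((5, 5), ch, font=font, fill=(40, 40, 40, 255))
        tile = tile.rotate(random.uniform(-30, 30), expand=True)
        img.paste(tile, (x, random.randint(5, 20)), tile)
        x += 22  # small advance causes character overlap (anti-segmentation)
    # Anti-segmentation: random dots and lines crossing the characters.
    for _ in range(200):
        draw.point((random.randrange(size[0]), random.randrange(size[1])),
                   fill=(120, 120, 120))
    for _ in range(3):
        draw.line([(0, random.randrange(size[1])),
                   (size[0], random.randrange(size[1]))], fill=(90, 90, 90))
    return img

noisy_captcha("3FQ7").save("captcha.png")
```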
Another technique targets recognition through the use of similar objects, which are shapes intentionally introduced
into images to mislead the recognition. Humans can recognise them as shapes,
while OCR recognises them as characters since they are segmented as characters.
During the recognition stage, the OCR returns a false classification. ADAMS
uses a randomised colour palette in both the foreground and background to dis-
turb segmentation, as segmentation algorithms are highly influenced by the use of
colours. The proposed solution has been validated using different segmentation
and recognition methods.
Even if new CAPTCHA techniques are proposed, OCR evolves rapidly, and it has
been demonstrated that many anti-OCR mechanisms can be broken using specific
techniques. Bursztein et al. offer a review of the weaknesses and strengths of cur-
rent text-based CAPTCHA systems [137]. Table 2.1 provides an overview of their
findings. Moreover, Bursztein et al. present an algorithm for enhancing the break-
ing of recently used text-based CAPTCHAs [138]. The authors start by describing an
anti-segmentation technique called negative kerning, which collapses characters
together to prevent segmentation. The algorithm they present could break a dataset of currently used CAPTCHAs employing this technique. First, the algorithm segments
the images into all possible cuts and then applies a KNN-based recognition algorithm to each of them to identify the most plausible cuts.
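A minimal sketch of that cut-scoring loop is shown below; the `segments` helper and the candidate-cut generation are hypothetical stand-ins, and the confidence measure is a simple proxy rather than the exact criterion of [138].

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier  # the KNN recogniser

def best_segmentation(image, candidate_cuts, clf, segments):
    """Score each candidate segmentation and keep the most confident one.

    `clf` is a KNeighborsClassifier assumed to be trained on single-character
    images; the hypothetical `segments(image, cuts)` slices the CAPTCHA at
    the given cut positions and returns one feature vector per character.
    """
    best_conf, best_cuts = -1.0, None
    for cuts in candidate_cuts:
        chars = segments(image, cuts)
        # Mean top-class probability over the slices as a confidence proxy.
        conf = float(np.mean([clf.predict_proba([c]).max() for c in chars]))
        if conf > best_conf:
            best_conf, best_cuts = conf, cuts
    return best_cuts, best_conf
```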
Applying such anti-OCR techniques to general screen content would display text in a noisy format, which is difficult to read. A solution may be to limit the use of CAPTCHA techniques to
sensitive data; however, the issue would then be to define what must be considered
as sensitive data, requiring either the intervention of the users (reducing the scope
of the solution) or using automatic techniques such as Natural Language Process-
ing (NLP). It may be the case that an entire text is considered sensitive for a user.
Finally, the use of CAPTCHA techniques only prevents automatic screenshot exploitation and is not applicable when a human adversary directly inspects the screenshots.
Screenshot prevention
Several solutions have been proposed to protect the user’s data from malicious
screenlogging.
Some solutions allow the user to open sensitive documents through an application
that prevents screenshots [143].
All these solutions have in common that they are not intended for the overall pro-
tection of the system but rather target specific applications or files, often chosen
by the user. This assumes that users are security aware to a certain extent, whereas
most of them have not been made aware of the threat, do not have enough skills
or may be negligent or misled by malicious attackers. This general lack of secu-
rity awareness is demonstrated in several surveys which reveal, for example, that
more than half of respondents in Germany and Italy did not know anything about
ransomware [143]. Therefore, solutions requiring an active role from the users are
quite limited in scope and effectiveness.
Note that some operating systems, such as Windows, allow application developers to prevent other programs from taking screenshots of their application by setting a specific flag via SetWindowDisplayAffinity [144]. However,
this solution relies on the fact that developers will think of activating this flag for
their sensitive applications, which is not a safe assumption. Moreover, some ap-
plications are so wide that developers cannot know in advance if the information
displayed will be sensitive or not. Such is the case of Microsoft Office, which can
be used to display benign content as well as sensitive, confidential documents.
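For illustration, the sketch below opts a window out of capture from Python via ctypes; SetWindowDisplayAffinity and WDA_MONITOR are the documented Windows API names, while the use of the foreground window handle is merely an assumption for the example.

```python
import ctypes

user32 = ctypes.windll.user32       # Windows only
WDA_MONITOR = 0x00000001            # window appears blacked out in captures

# An application would normally pass its own top-level window handle;
# the current foreground window is used here purely for illustration.
hwnd = user32.GetForegroundWindow()
if not user32.SetWindowDisplayAffinity(hwnd, WDA_MONITOR):
    raise ctypes.WinError()
```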
Automatically identifying sensitive content is also difficult: intellectual property data may look like normal data, whereas it is highly sensitive data in
industrial contexts. Users may define what information they want to protect, but it
cannot be a widely adopted solution because, as already explained, most users are
not security aware. Moreover, this solution is rather extreme because it prevents
users from taking screenshots containing sensitive information even if they would
legitimately want to do so.
Theoretical solutions using the HVS

Different solutions are proposed in the literature. Instead of forbidding screenshot operations, these solutions propose specific methods, based on properties of the HVS, to display information so that it is visible to users looking at their screen, whereas an isolated screenshot does not provide any meaningful information.
The HVS is a combination of two main parts. The first is the perception part,
which involves the eye and its components, and the second is the visual cortex,
where perception is processed [146]. The optic nerves connect the two parts [147]. The HVS is a powerful system, as it can perform
many image-processing tasks, many of which are too complex or even impossible
to perform on computers.
For this reason, HVS properties can be used to distinguish machines from humans.
We can cite, for example, the law of Prägnanz [148], according to which the HVS
interprets images in the simplest possible way. Thanks to this property, the use of
lines passing through the text does not disturb the human viewer, as opposed to
automatic tools, which may make errors in the segmentation step.
Hou et al. achieve protection through a visual cryptography algorithm that divides
the target images into several frames [10]. The technique consists of encrypting
N successive images by assigning new values to each pixel of each image. In
addition to the pixel value in the previous encrypted image, the new value of the
pixel is affected by whether it belongs to the foreground or the background of the
original image. Pixels in the foreground are modified across the N images while
pixels in the background are not. Decryption is done by sequentially displaying the encrypted images, each for a calculated duration, which allows the human eye to recover the displayed images (Figure 2.13). However, two screenshots allow an adversary to determine which pixels change values (foreground pixels) and which do not. A similar solution is proposed by Grange et al. (Figure 2.14)
[11].
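The principle, and the two-screenshot weakness just mentioned, can be sketched in a few lines; the binary-image encoding below is a simplification for illustration, not Hou et al.'s actual visual cryptography scheme.

```python
import numpy as np

def encode_frames(image, n, rng=None):
    """Split a binary image (1 = foreground) into n noisy frames.

    Background pixels keep one fixed random value across all frames,
    while foreground pixels are re-randomised in every frame: over time
    they flicker and are perceived, yet each frame alone looks like noise.
    """
    rng = rng or np.random.default_rng()
    background = rng.integers(0, 2, size=image.shape)
    return [np.where(image == 1, rng.integers(0, 2, size=image.shape),
                     background) for _ in range(n)]

# Weakness: any pixel that differs between two captured frames must be a
# foreground pixel, so a handful of screenshots reveals the hidden shape.
img = np.zeros((4, 4), dtype=int); img[1:3, 1:3] = 1
f = encode_frames(img, 5)
print((f[0] != f[1]).astype(int))  # nonzero entries lie inside the shape
```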
Figure 2.13: Samples of results from the visual cryptography approach proposed by Hou et al. [10].
for Chia et al. applied distortion to images representing the content of the display by
ng g signal,
ring signal,but
signal, butbut randomly selecting N distortion planes (Figure 2.15)as
[12]. The aim is to limit the
thethe
the
the background
background
background
background (type
(type
(type1)1)
(type 1)
are
are 1)
areare perceived
perceived
perceived
perceived asas gray
gray as
pixels
pixels gray
graywhen
when pixels
pixels when
when
than
r thanthose
than thoseof
those of of meaningful
1/f
1/f1/f
m
mm is visual
is
m larger
is static
larger
larger thancontents
than
than . ofIn
ttcc..tccIn
In a display
contrast, from
contrast,
contrast, beingfor
pixels
pixels
pixels captured
forfor by screenshots.
content
content
content (type
(type
(type
ee S
age S11S1forforforFig.
1 Fig.
Fig. 2) 2) appear
appear
2)appear
appear
2)authors
The
as as
exploited
a flickering
a flickering
asasa aflickering
flickeringsignal signal
signal
signal
image-processing because
because
techniques
because
because their
their
theirtemporal
totheir
distort temporal
the
temporal
temporal
du-
du-
visual data of a
du-du-
et‘A’
‘A’‘A’ from
from fromthe
thethe
ared ration
ration
ration is at
is is
atatleast
least
least n
nn frames,
frames,
frames, which
which
which is not
is is able
not
not able
able to induce
toto induce
induce the
thethe
dred
d image.
image.
image. display and present distorted data to the viewer. Given that a screenshot captures
ration there temporal
temporal
temporal summation.
summation.
summation. As
AsAsaa result,
a result,
result, the
the thehuman
human
human eye
eye can
eye distin-
can
can distin-
distin-
tion
tion therethere is ais
is a a guish between the contents pixels and the background pixels,
ea guish
distorted
guish between
visual
between the
contents,
the contents
the method
contents pixels
yields
pixels and and
limitedthe
the background
useful data. The
background pixels,
idea
pixels, is to
a in
in
in thethe
the synthe-
synthe-
synthe- and can read (or recognize) the protected content.
and
and can
can read
read (or
(or recognize)
recognize) the the protected
protected content.
content.
ethe background
e background
background
varying
arying
arying thethe
the vi-vi-
vi- 58
fficult
ult
ultcult to to
to achieve.
achieve.
achieve. 4.4. EXPERIMENTAL
EXPERIMENTAL EVALUATION
EVALUATION
noise
oise
oise from
from from thethe
the
e basis
basis matrices
matrices InIn
In this
this
this section,
section,
section, wewe
we demonstrate
demonstrate
demonstrate thethe
the e↵ectiveness
e↵ectiveness
e↵ectiveness of of
of our
our
our
basis matrices methodin
method
method in interms
termsof
terms of ofcontent
contentprotection
content protectionthrough
protection throughdeep
through deepex-
deep ex-ex-
perimental
perimental
perimental analyses,
analyses,
analyses, which
which
which include
include
include objective
objective
objective andand
and subjective
subjective
subjective
# tests.
tests.
tests. Similar
Similar
Similar to the
toto the
the study in
study
study inin[18], we
[18],
[18], wewe generated
generated
generated 400 ⇥
⇥⇥
400
400 400400
400
University of Oxford Balliol College
University of Oxford Balliol College
exploit
University
University the
ofof HVS to allow viewers to automatically recover the distorted
Oxford
Oxford contents
Balliol
Balliol College
College
into a meaningful form in real-time.
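Based on the constraints stated in [12] (each distorting plane is random, each distorted value D_j(x, y) + I(x, y) stays within the displayable range [α, β], and the n planes sum to zero so that the time-average of the frames equals the original image I), a minimal numpy sketch can be written; the sampling below is an illustration, not the authors' iterative implementation, and the paper reports using around n = 22 planes.

```python
import numpy as np

def distort_planes(I, n=22, alpha=0.0, beta=255.0, rng=None):
    """Draw n random distorting planes for image I (float array) such that
    every distorted value D_j + I stays in [alpha, beta] and the planes
    sum to zero, so averaging the n distorted frames recovers I.
    A sketch of the published constraints, not the authors' own code.
    """
    rng = rng or np.random.default_rng()
    I = np.asarray(I, dtype=float)
    lo, hi = alpha - I, beta - I            # per-pixel bounds on each plane
    planes, total = [], np.zeros_like(I)
    for j in range(n - 1):
        k = n - 1 - j                       # planes still to be drawn later
        # Tighten the bounds so the remaining k planes can cancel the sum.
        low = np.maximum(lo, -total - k * hi)
        high = np.minimum(hi, -total - k * lo)
        D = rng.uniform(low, high)
        planes.append(D)
        total += D
    planes.append(-total)                   # last plane closes the sum to 0
    return planes

# Displaying the frames I + D_j in quick succession lets the eye average
# them back to I, while any single screenshot shows I plus random noise.
```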
All the approaches discussed present the same three main drawbacks.

The first is their limited scope: they are restricted to textual content. Moreover, the techniques require images to be converted to greyscale.

The second drawback is poor usability: the quality of the visualised images is significantly degraded compared to the original.
The third drawback is the heavy computation required: these techniques must determine, for each pixel, whether the pixel is inside or outside a letter ([10, 12, 48, 49]). This implies that the content to
protect must be known in advance, which prevents them from being performed on
the fly.
All these points are obstacles to the adoption of these solutions for the general
goal of protecting any content displayed on the screen.
Figure 2.15: System overview and distortion planes used by Chia et al. [12].
Retinal Persistence

Some approaches found in the literature partially overcome the previously mentioned obstacles. These approaches are based on a property of the HVS called retinal persistence (also known as the afterimage effect). This property is defined by the fact that the HVS processes 10 to 12 images per second. After staring at an image for a fixed time, and due to photochemical properties of the retina, the image will be retained for a few milliseconds, even when it is no longer displayed in front of the viewer [149]. Thus, when an image is replaced
by another image at a rate above fifteen images per second, there is an illusion of continuity [150]. This implies that if subparts of an image are displayed at
high speed, the human viewer can see the whole image with no changes, whereas
one frame alone only contains a subpart of the original information.
As seen in Section 2.3.1.2, the first work to use retinal persistence against screen-
loggers targets the specific case of authentication with a virtual keyboard [5].
Park et al. extended the use of retinal persistence to pictures [13]. Their goal was
to prevent screenshots of images published on online social networks to avoid
identity theft. The proposed mechanism is to use ‘privacy black bars’ that move
sufficiently fast so that the viewer does not perceive them, but when using the
screenshot function, the bars appear in the resulting image. Contrary to the ap-
proaches based on visual cryptography ([10–12]), this approach does not require
the image to be in greyscale form and does not apply computations by pixel but
instead by groups of pixels. The consequences are that the image quality as mea-
sured using peak signal to noise ratio is much less affected and that computations
can be performed on the fly in real-time.
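A minimal sketch of the moving-bar masking is given below; the bar height, step and frame count are illustrative assumptions rather than the parameters used in [13].

```python
import numpy as np

def bar_frames(image, bar_height=20, step=7, n_frames=30):
    """Yield copies of `image` with a horizontal black bar whose offset
    advances by `step` pixels per frame.  Rendered quickly enough, the
    moving bar stays below the threshold of perception, while any single
    screenshot still contains a bar occluding one band of the image.
    """
    height = image.shape[0]
    for i in range(n_frames):
        frame = image.copy()
        y = (i * step) % height
        frame[y:y + bar_height, :] = 0   # mask one horizontal band
        yield frame
```

As discussed next, the periodicity of such a mask is also its weakness: aligning a few screenshots taken at different offsets recovers the full image.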
However, the use of a precomputed mask and a periodical pattern makes it rela-
tively simple to reconstruct the original images from a few screenshots. Moreover,
in the case of textual data, the use of large bars may not be the best solution be-
cause important parts of the text would still be visible on one screenshot, which
may be enough for an attacker to extract sensitive data.
As a result, the authors proposed to display random blur blocks on the image
instead of black periodic bars. However, only the usability of this method was
measured, and not the security in the case of images or sensitive textual content.
Figure 2.16: Projection of images using moving privacy black bars [13].
2.3.3 Synthesis
Most of the works aiming at protecting the screen’s content target the shoulder-
surfing attack. Most of the approaches proposed in this context specifically focus
on the login operation [119, 121]. The goal is to prevent attackers from stealing
passwords by looking at the victim’s screen. The limitation of these approaches
in our context is that they rely on authentication as a dynamic process with dif-
ferent actions performed by the user, whereas we also aim at protecting screens
containing static information with a passive user.
Few works address the shoulder-surfing problem in the general context.
These studies mainly aim to protect confidential documents from malicious view-
ers. Many of these methods use spatial data [7, 131]. However, the screen-
recording problem does not imply any spatial considerations as the adversary re-
motely observes the victim’s screen by using the screen logging function offered
by the OS. Moreover, anti-shoulder surfing solutions are highly intrusive. Some
of them replace displayed data with aliases [8], while others propose to display
documents with the user’s handwriting [133] or display images using filters [9].
These techniques reduce the usability of these approaches. Therefore, these solu-
tions must be limited to sensitive data. Determining what data is sensitive may be
done automatically using NLP techniques [151], but the drawback is that what is
considered as sensitive may be different for different users. Moreover, this addi-
tional processing may prevent real-time display. Another solution is to ask users to
explicitly specify what data they want to be hidden. This is the approach proposed
by Mitchell et al. [8]. However, this greatly reduces the scope of these approaches
as they are limited to security-aware users who must be well-informed and care
about the threat. This issue is intrinsic to the anti-shoulder surfing solutions as the
attacker sees exactly the same information as the victim by looking at the same
screen, as opposed to screenlogging attacks, for which it may be possible to find
less intrusive solutions that are transparent to the users.
Few works in the literature specifically address the screenlogging threat. Some
target the authentication process [48, 49]. Specific treatments are applied to the
pixels to display the virtual keyboard’s keys in a way that a screenshot does not
contain any meaningful information while the user is able to see the content of the
keys. These methods are based on the Gestalt laws, which include several proper-
ties of the HVS [152, 153]. The application of all these properties results in noisy
images in greyscale representing the keys of the virtual keyboard. Therefore, this
is not adapted to protect general content displayed on the screen.
Closer to the subject of this work, few methods aim at countering screen cap-
ture in a more general context not limited to authentication. However, these
methods present several drawbacks. Some achieve screenshot protection using
image-processing techniques [12] or visual cryptography concepts [10], incurring heavy computations, particularly when applied to large images. Moreover,
they require converting the target image to greyscale. The resulting image qual-
ity displayed is significantly degraded compared to the original. Finally, these
approaches are only secure against adversaries taking a single screenshot. These
constraints are obstacles to the adoption of such methods for general screenshot
protection.
As for the approach based on moving bars [13], a few screenshots suffice to reconstruct the original image. Moreover, even a single screenshot can contain sensitive information between the widely spaced bars.
Table 2.2 shows that existing anti-screenshot approaches are all limited in terms
of scope, usability and security.
For example, the visual cryptography approaches ([10], [12]) aim to protect copyright images from screenshots by moving the pixels inside letters while the others stay fixed. Their limitations are: noisy images; the need to identify the pixels inside letters; restriction to textual content (as opposed to images); and the fact that two screenshots suffice to determine which pixels move and which do not.
Therefore, all these works have in common that they are not intended for overall
protection of the system but rather target specific applications or files, often chosen
by the user. This assumes that users are security aware to a certain extent, whereas
most of them have not been made aware of the threat or do not have enough skills.
Moreover, users might be negligent or misled by malicious attackers. This gen-
eral lack of security awareness is well illustrated by a recent international survey
[143] that revealed, for example, that more than 50% of the respondents in Italy
and Germany were unaware of what ransomware is. Application developers also
have tools against malicious screen recording, such as security flags forbidding
screen recording of specific pages. However, this solution relies on developers
thinking of activating this flag for their sensitive applications, which is not a safe
assumption. Moreover, some applications are so broad that developers cannot
know in advance whether the information displayed will be sensitive or not.
3 | Adversary Model
Screenloggers have the advantage of being able to capture any information dis-
played on the screen, offering a large set of possibilities for the adversary com-
pared to other spyware functionalities. We therefore start this chapter by defining
the different capabilities of a screenlogger (Section 3.1). We show that taking a
screenshot is relatively simple and frequently does not require specific privileges.
This threat is compared to other possible attacks aiming at the same goals, show-
ing that screenloggers are often simpler yet less studied in the literature. We then
define some application scenarios where the identified capabilities can be put into practice and be particularly harmful (Section 3.2). Finally, we present the system
and threat models that define the scope of the thesis (Sections 3.3 and 3.4).
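As an illustration of how simple the capture itself is, the following sketch grabs the whole screen from an ordinary, unprivileged Python process; the Pillow library is an assumption for the example, and real malware would typically call the underlying OS APIs directly.

```python
from PIL import ImageGrab  # no special privileges needed on Windows/macOS

# Capture everything currently displayed, exactly as a screenlogger would.
screenshot = ImageGrab.grab()
screenshot.save("capture.png")
```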
On mobile systems, taking screenshots of other apps is possible only on Android, through exploiting the ADB or the MediaProjection API vulnerabilities.
The ADB screencap is another way to take screenshots of apps on Android de-
vices [46]. ADB is an Android SDK tool used by developers to debug Android
applications [157]. The Android device must be connected to a PC, with its USB debugging option activated, in order to launch ADB. ADB commands, such as install, debug and more, can then be sent. 'Screencap' is a shell command that allows screen-
shots of any app without any required permissions. Several malware programs
exploiting ADB have been found [158–160]. A screenlogger using ADB to take
screenshots was implemented [46].
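A minimal sketch of driving screencap from a PC is shown below; it assumes a device connected over USB with debugging enabled and the adb binary on the PATH.

```python
import subprocess

# Take a screenshot on the device without any app-level permission,
# then pull the resulting image back to the connected machine.
subprocess.run(["adb", "shell", "screencap", "-p", "/sdcard/shot.png"], check=True)
subprocess.run(["adb", "pull", "/sdcard/shot.png", "shot.png"], check=True)
```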
The main capabilities of screenloggers are described in the remainder of this sec-
tion. For each capability, alternative methods of execution are discussed and com-
pared to screenshot-taking.
The process of entering credentials has always been a very sensitive task tar-
geted by attackers. For this purpose, keyloggers, which can record the victim’s
keystrokes, are widely used. However, due to the very high number of machines
infected by keyloggers (in 2006, the SANS Institute estimated that up to 9.9 mil-
lion machines in the US were affected [32]), some website developers, especially
banks, have switched to virtual keyboards (e.g. State Bank of India, ICICI Bank,
Bank Millennium, Seven Bank, Bank ABC, Canara Bank, HDFC Bank, Credit
Libanais, Heritage Bank, Syndicate Bank, LCL france, Societé Generale, la-
The first way of stealing credentials entered with a virtual keyboard is to take
screenshots during login, either at regular time intervals or at each user event.
The screenshots are then sent to the attacker for analysis to infer the password. A
recent example is the well-known mobile banking malware family Svpeng, which added screenshot capability to its functionalities in July 2017 [25].
The other techniques which can be used by attackers to recover a password entered
with a virtual keyboard are:
The smartphone keylogger landscape is relatively different from that of PCs. In-
band attacks are almost infeasible, except on rooted or compromised devices be-
cause of the restrictions of the OS security design [164]. Indeed, according to the
Android security model, an app cannot read touch events performed by the user
on other apps.
In the literature, several studies about keyloggers on mobile devices focus on tap
inference using side channels [164]. These methods consist of using smartphone
sensor data, such as accelerometers or gyroscopes, to infer the coordinates where
the user tapped. However, the methods have the drawback of not being accurate
in noisy data situations. For example, a prediction model to infer a 4-digit PIN
using accelerometers had an accuracy of 43% (within five attempts) when the
users entered the PIN in a controlled setting while sitting, whereas the accuracy
was 20% in uncontrolled settings, such as entering the PIN while walking [36].
Moreover, the model of the targeted smartphone must be known to compute the
correspondence between the keys and the coordinates.
Web extensions are JavaScript modules installed in the browser to offer functionalities such as blocking ads or showing the number of inbox emails. It was
found that both Google Chrome and Firefox web extensions have serious security
issues and can access various types of data, including the user’s web history and
entered passwords [37]. Indeed, they can access all the data entered by the user,
including passwords, if they are granted the ‘access all website data’ permission.
A statistical study was performed showing that 71% of the top 500 most-used
web extensions require this permission [38]. Using this permission, the malicious
extension injects scripts into the web pages the user is visiting. Since the scripts
are running in the page’s environment, they can read the password the user enters
from a Document Object Model (DOM), which defines the logical structure of
documents and the way a document is accessed and manipulated [39]. However,
this attack is much more difficult or even impossible when a website requires the
user to enter a password with a virtual keyboard. An example is illustrated in
Figure 3.1 (LCL bank). The customer uses a keyboard provided by the web site,
which is dynamically loaded at each visit. Instead of using the standard password
field, the virtual keyboard uses a hidden input field called postClavier. When the
user presses a key, the field records the key’s position instead of its value.
For example, in Figure 3.1, when the user enters ‘0’, the recorded value is 10,
which is the key’s position on the virtual keyboard. Another example is illustrated
in Figure 3.2 with Oney’s virtual keyboard: when the user clicks on number 0,
a hidden field is filled with the pair (1, 3) where 1 is the row index and 3 is the
column index. In this situation, the malware developer cannot know the password
without taking screenshots.
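The mechanism can be sketched as follows; the function names and the per-session shuffling are illustrative assumptions, not the actual implementation of the LCL or Oney keyboards.

```python
import random

def new_layout():
    """Server side: a fresh random digit layout for every login session."""
    digits = list("0123456789")
    random.shuffle(digits)
    return digits                    # layout[i] = digit shown at position i

def positions_for(pin, layout):
    """Client side: the hidden field stores positions, never digit values,
    so a DOM-reading extension or keylogger learns nothing useful."""
    return [layout.index(d) for d in pin]

def pin_from(positions, layout):
    """Server side: map the submitted positions back to the PIN."""
    return "".join(layout[p] for p in positions)

layout = new_layout()
assert pin_from(positions_for("4901", layout), layout) == "4901"
```

Only an image of the rendered keyboard reveals the mapping, which is precisely why such sites push attackers towards screenshot-taking.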
Phishing attacks
The first phishing scenario targets a web application using a virtual keyboard. The at-
tacker uses social engineering techniques such as email or SMS to redirect the
user to a website with the same look as the real one. The fake site reproduces
the same virtual keyboard, and users are asked for their login information. The
phishing threat, however, is widely studied, and there exist many anti-phishing mechanisms which can, for instance, block suspicious emails. Automated ML so-
lutions are employed such as probabilistic models, rule-based classification and
more. Several anti-phishing techniques have been proposed [40].
The second phishing scenario steals credentials entered on a mobile phone app
with the phone’s virtual keyboard. Making use of SYSTEM_ALERT_WINDOW
and the ability to overlay other apps, the malicious application detects when a
banking app is opened and shows unsuspecting users a window mimicking the
targeted app’s password prompt, into which users enters their account logins and
passwords [35].
Shoulder surfing
One of the weaknesses of virtual keyboards is ‘shoulder surfing’. This form of at-
tack occurs when a user types a password using a virtual keyboard and the attacker
is behind watching the displayed characters. In some cases, cameras can be used.
Generally, a trained user takes precautions to avoid shoulder surfing, whereas oth-
ers do not take measures to cover their inputs. To perform this attack, the attacker
must have the opportunity to be physically present near the user, and guessing the
password requires several observations [167].
In addition to stealing credentials, sensitive data theft is another important goal for
an attacker. The stolen information could be intellectual property or industrial data
that the attacker can sell to a competitor or use to make a rival product. Possible
stolen data also include a target’s identity cards, credit cards or banking details
that could allow an attacker to impersonate the victim or make transactions.
Screenloggers can capture any data displayed on the screen, including locally
stored documents, visited websites or any software employed by the user and
displayed on the screen.
An attacker has several other options to reach their goals. The following are the
principal attack methods.
Keyloggers
As mentioned above, keyloggers can be used to steal sensitive data when typed
using the keyboard; however, they cannot steal data visualised by the user.
Network-based attacks
Three main kinds of network-based attacks can be used to steal sensitive data
from victims: sniffing, Address Resolution Protocol (ARP) spoofing and Domain
Name System (DNS) spoofing.
A sniffing attack allows attackers to intercept packets and capture data from a
computer network.
In an ARP spoofing attack, the attacker sends falsified ARP messages so that traffic intended for a victim host is diverted to the attacker's machine. Similarly, DNS spoofing consists of corrupting the DNS by introducing data into
the DNS resolver’s cache, causing the traffic to be diverted to the attacker’s com-
puter to steal sensitive information [168].
However, encryption and the use of SSL/TLS ensures that data sent across a net-
work from one host to another is unreadable to a third party. Moreover, there is a
wide range of mechanisms for ARP spoofing detection and prevention. Examples
include using static ARP entries or dedicated anti-ARP tools [169]. For DNS spoofing, there are also protection mechanisms in updated DNS software applications (e.g., BIND versions 9.5.0-P1 and above).
Note that mobile systems have a file access pattern which is quite different from
that of desktop environments. The different apps are strictly separated and can-
not access other apps’ data. The apps also cannot access the user’s data un-
less specific permission has been explicitly granted by the user (such as the
A web extension can use the ‘access all websites data’ permission to access the
DOM of every web page. Some extensions legitimately need access to this data to
perform their function. For example, video-blocking software must read a web-
site’s code to see that a clip is set to play automatically and to know how to stop
it. However, the ‘access all websites data’ permission can be diverted. Malware
makers have hijacked or even bought legitimate extensions from their original
developers and used the access to pump invasive advertisements into web pages
[166]. Thus, these malicious extensions can access any sensitive information dis-
played in the browser, with a few exceptions such as the virtual keyboard case
(explained in the previous section). However, malicious extensions cannot access
user data that is outside the browser's activity. Some techniques have been developed to detect suspicious behaviours exhibited by browser extensions [171].
Monitoring the victim’s activities is another possible goal for an attacker. The
attacks performed in this case are usually targeted attacks.
databases, poker calculators, etc.). Once installed, the screenlogger takes screen-
shots if the victim is running PokerStars or Full Tilt Poker. The screenshots are
then sent to the attacker’s remote computer to extract the players ID and the hands
of the infected opponents. Both targeted poker sites allow searching for players
by their player IDs, hence the attacker can easily connect to the tables on which
they are playing and start cheating. Researchers have found Odlanor on machines
in several Eastern European countries.
Screenlogging, therefore, is the only way to observe with precision all the victim’s
computer activities as it can capture everything displayed on the screen.
3.1.4 Blackmail
Another possible goal for an attacker is to blackmail the victim for money by
acquiring sensitive or private information. The means to achieve this goal are
similar to those enabling the theft of sensitive data as described earlier. In addition
to these attacks, we must add ransomware and webcam blackmail.
Webcam blackmail
According to the UK National Crime Agency, webcam blackmail, also called sex-
tortion, happens when criminals use a fake identity online to persuade the vic-
tims to perform sexual acts in front of their webcam, often by using an attractive
woman to entice the victim to participate [173]. The attacker then threatens to share or publish the recorded images if the victim does not pay an amount of money. This attack may be effective and has the advantage of not requiring any
infection of the victims’ devices because they turn on the webcam deliberately.
One such method is to activate the webcam through clickjacking (e.g. via the Flash Player webcam settings dialog). Clickjacking occurs when a user clicks on some-
thing seen on the screen but is actually clicking on something controlled by the
attacker [174]. Several defence mechanisms against clickjacking have been pro-
posed in the literature and included into browsers [175].
Ransomware
Behavioural detection tools can stop many ransomware attacks by monitoring abnormal file system activity. More-
over, numerous countermeasures have been proposed since the first ransomware
appearance, and a common way of protecting against ransomware is to back up
files.
3.1.5 Reconnaissance
The simplest way to gain information about a person in preparation for an attack
is to gather publicly available information about the targeted victim by looking
at search results and public data on social networks. However, the amount of
information that can be collected in this way is limited, especially if the victim's social
networks are private.
Social engineering
Packet-sniffing
Usually, the analysis of network traffic is done in the context of network recon-
naissance, which is a way to test potential vulnerabilities in a computer network
before an attack, such as through port scanning. Network analysis is done once
the target is already chosen and the hacker is planning the attack process. If the
goal instead is to collect information about potential victims to choose whether
to infect them, then network analysis is more limited. The adversary can per-
form packet-sniffing to see the type of applications the victims use and the type
of data they send and receive, but this is highly limited by packet encryption and by the attacker's need to have access to the LAN.
3.1.6 Synthesis
In this scenario, the attacker targets online banking customers. The goal is to steal
login credentials to access bank accounts and perform transactions (capability 1),
or to impersonate the victims and make changes to their banking data by stealing
personal information (capability 2). The process is illustrated in Figure 3.3.
In this scenario, the attacker goes beyond stealing credentials by continuing to take
screenshots while the user is logged in. This process can enable the attacker to ob-
tain sensitive data and personally identifiable information such as the customer’s
name, banking balance or credit card data. Again, the screenshots are compressed
and sent over the network, and they can be processed either manually or using
automatic tools. In this case, besides OCR, the adversary can use NLP tools to
automatically detect sensitive information in the screenshots’ content [151]. The
retrieved information can allow the performance of several malicious actions. For
example, credit card data can be used to make online payments. Moreover, ob-
taining the customer’s phone number can be useful in case of two-factor authen-
tication using SMS. Indeed, it would allow the attacker to intercept the received SMS by exploiting well-known weaknesses, such as those of the SS7 protocol [188]. The
attacker can also use the victims’ personal information to impersonate them and
change their bank account data by calling the bank, for example.
In this scenario, we suppose that the attacker wants to have a real-time view of the
content visualised on the victim’s screen. Many cases require such a capability:
• Observing the victim’s screen in real time has also been used for an attack
targeting poker players to see their hands and cheat [172].
• Another case is when the attacker is spying on the victim for business pur-
poses, such as when competitors want to have real-time data about deals in
preparation. Attackers might also want to anticipate financial market move-
ments to buy or sell securities by targeting stock-exchange employees.
The malware can monitor on-screen activity, as explained in the previous scenario.
In this case, the target applications depend on the attackers’ purpose: for exam-
ple, messaging applications in the context of an armed conflict, or poker websites
in the case of an attack targeting poker players. The screenshots are sent, com-
pressed and analysed. In a scenario targeting specific people, it is most likely that
the screenshots will be seen by a human directly. However, screenshots could also
be exploited automatically using OCR and NLP tools to search for specific infor-
mation. The retrieved information can then be used to act in real time depending
on the attacker’s goal.
With blackmail, screenshots are used to obtain private content that the victim does
not want to be published, such as messages or photos (Figure 3.5).
If made public, the revealed content would harm the victim’s social or professional
life. Once a device is infected, the malware detects when the victim consults
emails, social media accounts, messaging applications, dating websites and more.
The compromised data can also be a video conversation, such as on Skype or
Viber. The malware then takes screenshots of the visualised content and uses it to
blackmail the victim.
The targeted systems are desktop environments. The main reason why our work
focuses on computer operating systems is that the screenshot functionality is a
legitimate functionality offered to any application. In contrast, on smartphones,
the principle is that apps cannot take screenshots of other apps, and the only way
to accomplish this is to exploit specific vulnerabilities or to divert some libraries.
However, many limitations exist for these techniques, such as permission required
from the user at the beginning of each session, or a recording icon displayed in the
notification bar. In sum, the architecture designs of mobile systems and computer
systems are fundamentally different, which may lead to different solutions.
Targeted victims may be any individual or organisation, ranging from typical lap-
top users to small companies or powerful institutions.
Another assumption in our system model is that the victims are not particularly
security aware, which implies they are not necessarily cognizant of the existing
threats and will not install a specific protection against screenshots, such as a specialised anti-screenshot tool.
Our threat model is composed of a victim, an attacker and spyware with a screen-
shot functionality.
The adversary’s goals are diverse. They can range from general activity monitor-
ing, which requires seeing the whole screen, to sensitive data theft, which can be
limited to some areas of the screen.
Attackers may infect a system using common methods such as trojans, social en-
gineering or through a malicious insider. The adversary has no physical access
to the victim’s device (except in the case of a malicious insider). They have no
knowledge about the system and tools installed on it before infection. We also as-
sume they have not compromised the victim's device at the kernel level. Apart from
that, the attacker can use any technique to evade detection, including hiding by
injecting API calls into system or legitimate processes, dividing their tasks between
multiple processes, or making the API calls out of sequence, spaced out in time, or
interleaved with other API calls.
To reach their objective, attackers take screenshots of the victim’s device. The
data may be either (1) extracted automatically using OCR tools inside the vic-
tim’s device locally, then sent to the attacker’s server using the victim’s network
interface or (2) extracted, also using OCR tools, on the attacker’s server after
screenshots have been transferred from the victim’s machine to the attacker’s as
compressed image files. The screenshots can also be analysed manually by the
attacker. Moreover, the screenshots may be taken and sent at regular or irregular
rates. These different options are depicted in Figure 3.6.
The adversary does not need any advanced hardware: only a remote machine with
network access that is known to the malware. They also have state-of-the-art
OCR and NLP tools at their disposal and enough resources to run such tools and
store the screenshots. They have no interaction with the victim’s device except
the capability of receiving files transmitted by the spyware via the network and
optionally the possibility to send commands to the spyware, as illustrated in Figure
3.6.
3.4.3 Scope
Shoulder-surfing attacks, in which the attacker directly observes the victim's
screen, are excluded. For this analysis, we assume that a hacker is intervening
from a remote site using software and networking capabilities.
Figure 3.6: Threat model message sequence chart with different settings.
Upper chart: Screenlogging attack where screenshots are triggered by a screen capture
command and sent at a regular frequency f after being compressed. Lower chart: Other
settings where screenshots are triggered by a specific event on the victim’s machine, such
as opening a specific website, and taken or sent irregularly. In the illustrated case, the
malware program performs local processing to extract data from the screenshots and then
sends only the extracted information.
Session replay lets app developers record a screen and play it back to see how
users interacted with the app, in order to determine whether something did not
work or an error occurred. Every tap, button push or keyboard entry is recorded –
effectively screenshotted – and sent back to the app developers [191].
Some firms, such as GlassBox, offer customers the possibility to record the
screens of the users of their apps [192]. This has raised serious concerns as it
results in applications sending unencrypted screenshots containing sensitive data
such as passport or credit card numbers [191].
As dangerous as session replay can be, we exclude it from the scope of this thesis
because the screenshots are not taken by an adversary but by the application itself.
The DOM is an API for valid HTML and well-formed XML documents. As men-
tioned above, a DOM defines the logical structure of documents and the manner
in which a document is accessed and manipulated [193].
In most browsers, the content of webpages is represented using the DOM. Some
attacks consist in stealing information by abusively accessing the DOM of
webpages.
These kinds of attacks are excluded from our work, even though they allow
stealing some of the data targeted by screenloggers: their operating mode is
limited to browsers, unlike that of screenloggers.
4 | Threat Analysis
Analysing the behaviours displayed by screenloggers in the wild allows for the
definition of completeness criteria (Section 4.5) for the malicious dataset, which
will be used for detection.
4.1 Methodology
The first step in analysing the behaviour of screenshot-taking malware was to
gather high-level information from 100 security reports (Section 4.1.1) recovered
from the MITRE ATT&CK database [14]. Next, a set of novel criteria that can
discriminate screenshot-taking malware from each other was identified (Section
4.1.2).
Screen capture (T1113) has been identified as one of the techniques that can be
used by the adversary to implement a collection tactic. As a result, 127 reports
from security firms such as Symantec have been compiled and referenced on a
webpage [14]. These reports cover about 103 different screenshot-taking malware
programs targeting desktop environments. The list is regularly updated (at the
time of writing, the last update was 24 March 2020).
Therefore, we assume that by analysing these reports, we can have a realistic view
of the behaviours displayed by screenloggers in the wild.
Although the MITRE ATT&CK database provides the most exhaustive list of
screenshot-taking malware, it fails to give insight into their behaviour. Indeed,
merely referencing security reports does not in itself provide the knowledge
needed to develop effective countermeasures.
Table 4.1: Criteria of completeness.

The criteria we used are recapitulated in Table 4.1. Our criteria cover the main
operating steps of the screenlogging process: screenshot-taking, screenshot storage
and screenshot exfiltration.

Screen capturing Depending on the adversary's objective, screenshots can be
taken at different moments. Moreover, the functionalities offered by the Windows
operating system for screenshot-taking offer several possibilities. The criteria we
use at the screen-capturing stage are:
• Used API: Starting from Windows 8, the Desktop Duplication (DD) API was
introduced. This API was created precisely to make high-frequency
screenshot-taking less cumbersome. Knowing which API is used by malware
provides information about its behaviour. On the one hand, malware using
the DD API intends to monitor all the victim's activity; this is the scenario,
for example, in the case of RATs. On the other hand, malware using the
GDI API might be looking for more flexibility in the screenshot-taking.
• Captured area: The GDI API offers several possibilities regarding the
screenshot's content. Indeed, two distinct functions can alternatively be
called to retrieve the DC: GetDC and GetWindowDC. On the one hand,
GetDC returns the DC of a given window's client area; if it is called with
the NULL argument, the DC of the whole screen is returned, and a subpart
of the screen can then be captured by passing the coordinates of the desired
rectangle to the blitting function. On the other hand, GetWindowDC obtains
the DC of an entire window if a handle to the window is provided. Even a
window that is not in the foreground can be captured. If the NULL argument
is passed instead, the DC of the full screen is returned.
Again, depending on the attacker’s goal, different screen areas can be tar-
geted. For instance, when specific applications are targeted, it is possible
to capture only the corresponding windows. On the other hand, generalist
malware programs not looking for specific information might tend to take
the whole screen.
Moreover, the captured area has a direct impact on the screenshot size (in
bytes) and thus on the storage and network usage. If the malware author
wants to avoid notice, a limited area of interest can be captured (e.g., the
area around the mouse pointer in the case of a virtual keyboard).
Screenshot storage Once the screenshot is taken, an image file can be created,
stored and manipulated on the victim’s machine. The criteria we consider at this
stage are:
• Storage media: The image file can either be stored in memory for short-term
use or on disk for long-term use. Memory storage might imply that the
screenshot will be sent over the network shortly after it was taken, whereas
disk storage is better suited to local processing of the screenshot (e.g.,
using OCR tools) or delayed sending.
Knowing the encryption methods used to send the data can be important for
detecting the exfiltration of image-type files, which are more easily discovered
when there is no encryption or only a very simple encoding (base64, ZIP, RAR).

4.2.1 Used API

On Windows systems, which constitute the target of this work, two main libraries
can be used: Windows GDI [194] and the DD API. Our analysis showed that
existing malware does not seem to use the functionalities offered by the DD API.
However, some programs (e.g., Azorult [195], Bandook [196], RTM [197] and
Proton) use VNC, a legitimate remote desktop manager which uses the DD API
and GDI.
4.2.2 Screenshot-triggering
As illustrated in Figure 4.1, a vast majority of screenloggers (67%) wait for a
command from the C2 server to start capturing the screen. This imposes an
important constraint on our malicious dataset: we must gather both the server and
the client parts of each malware program.
Figure 4.1: Need for a screenshot command to start screen capturing (in the mal-
ware of Mitre [14]).
[Figure 4.2 (bar chart): percentage of malware per screenshot-triggering event –
Frequency, Punctual command, Application of interest, Mouse clicks, Unique
screenshot upon infection, Not specified.]
Other malware programs take screenshots upon receipt of an independent punctual
screenshot command (e.g., Azorult [195], Carbanak [21], NetWire [64]).
Finally, some malware take one screenshot for reconnaissance during their entire
execution to see if the victim is worth infecting (e.g., Cannon [206], Zebrocy
[207]).
Regarding the captured screen area, even though this information is often
unavailable (no information for 55% of the security reports), it follows from our
study that more than 37% of screenloggers capture the entire screen without
targeting a particular area or window (Figure 4.3). Nevertheless, three other
operating modes are represented, albeit in small proportions.
(T9000 [204]), of all overlapping windows (InvisiMole [208]) or of a delimited
area of the screen (ZeusPanda [209]). Interestingly, the Remexi malware [198]
proposes a parameter for full screen or only active window screenshots.
It emerges from our analysis that the formats preferred by screenloggers are,
unsurprisingly, the most common among the general public. In fact, 43% of
screenloggers for which information is available use the JPG format and 32% the
PNG format. However, some other malware types opt for other formats, such as
BMP (12%) and, to a lesser extent, AVI, WCRT and RAR, as shown in Figure 4.4.
[Figure 4.3 (bar chart): captured screen area – Full screen, Target window,
Limited area, Overlapping windows, Not specified.]
[Figure 4.4 (bar chart): image file formats – JPG, PNG, BMP, AVI, DAT, WCRT,
RAR, Not specified.]
[Figure 4.5 (bar chart): screenshot storage media – Disk, Memory, Not specified.]
All the malware programs of this study transmit the captured screenshots to
remote malicious servers. This even includes malware taking only one screenshot
of the on-screen activity, such as FruitFly.
The communication protocols used cover a wide spectrum (Figure 4.6): 40% of
the screenloggers for which the protocol is identified use HTTP, which increases
their chances of going unnoticed in the large volume of HTTP traffic passing
through almost all machines. HTTPS, FTP, SMTP and SOAP complete the list of
protocols used. Note that a non-negligible proportion of malware (20%) does not
use an application-layer protocol and simply uses the transport layer by sending TCP
packets. Finally, only one of the studied screenloggers, Biscuit, uses a proprietary
and non-standard network protocol [199].
[Figure 4.6 (bar chart): communication protocols – HTTP, HTTPS, FTP, SMTP,
SFTP, SOAP, TCP, custom, N/A, and combinations thereof.]
4.4.2 Encryption
The encryption technique used by malware to exfiltrate data was unavailable for
almost half of the malware programs studied. This result may be because the
information is difficult to obtain. Another possible explanation could be that some
malware programs do not encrypt their data.
However, for the malware for which this information was available, we observed
that the encryption techniques are highly diverse. As shown in Figure 4.7, each
malware program designs its own encryption method, often by combining several
techniques. The encryption methods are more or less sophisticated: for instance,
some malware only use an exclusive OR (XOR) operation (e.g., T9000 [204]) or
base64 encoding (e.g., POWRUNER [210]). Others combine several encryption
techniques.
Note that Proton and Micropsia [213] simply send a ZIP/RAR archive, which may
be protected by a password.
[Figure 4.7 (bar chart): encryption techniques – XOR, AES, RC4, 3DES, Blowfish,
RSA, SSL/TLS, base64, ZIP/RAR, N/A, and combinations thereof.]
There were no reports of malware locally exploiting the screenshots using OCR
tools. The security reports we analysed all mention that the screenshots are sent
as is, without local processing.
Regarding the event triggering screenshot sending, most of the time this infor-
mation is not available. However, we were able to identify four main possible
behaviours among the studied screenloggers.
The first one, which is also the most common, is the immediate sending of the
captured image directly after the screen capture (e.g., Magic Hound [215], RTM
[197]). The explanation could be that many of these malware programs use
screenshots for a real-time purpose, such as remote control of the victim's
machine or real-time observation of the victim's activity.
The three other behaviours that were found include sending screenshots at a reg-
ular frequency (e.g., Micropsia [213], Flame [205], Rover [216]), when a specific
command is received from the C2 server (e.g., Biscuit [199], Powruner [210]),
or each time a predefined number of screenshots is taken (e.g., RTM after six
screenshots [197], Pteranodon after a configurable number of screenshots [217]).
The results we obtained by analysing the 127 security reports of the MITRE
ATT&CK database enabled us to define completeness criteria for our malicious
dataset.
4.5.1 Behavioural completeness

A first definition that can be given to the word 'completeness' is that our dataset
will be complete if its malware samples display all the behaviours found in the
security reports, regardless of their proportions. This definition assumes that the
MITRE ATT&CK database is exhaustive enough to encompass all the behaviours
found in the wild but not necessarily in representative proportions.
The different behaviours that our dataset should include based on this first defini-
tion are displayed in Table 4.2.
Table 4.2: Criteria of completeness.
4.5.2 Proportional completeness

Table 4.3: Screenlogger behaviours found in the MITRE security reports.

Criterion                      Behaviour                        Reports
Used API                       DD API                           -
Screenshot capture triggering  Frequency                        35
                               App. of interest                 9
                               Mouse clicks / user trig.        3
                               Punctual command                 38
                               Unique capture upon infection    5
                               Frequency range                  2 s to 15 min
Format                         JPG                              25
                               PNG                              19
                               BMP                              7
                               Video                            1
                               Other                            6
Storage                        Memory                           5
                               Disk                             59
Captured area                  Full screen                      37
                               Coordinates                      2
                               Difference                       -
                               Other                            6
Exfiltration trigger           Remote command                   6
                               Scheduled                        10
                               Other                            13
No encryption                                                   -
No file sending                                                 -
Comm. protocol                 HTTP                             36
                               HTTPS/FTP/SMTP/RFB               36
                               Proprietary                      1
                               TCP                              18
The second definition we can give to the word ‘completeness’ is that our dataset
will only be complete if the behaviours at the different stages of the screenlog-
ger operating process are represented in the same proportions as in the security
reports. These proportions can be found in Table 4.3.
The assumption made is strong as we assume that the security reports compiled
by MITRE provide an accurate overview of the behaviours exhibited by screen-
loggers, with the same proportions.
4.5.3 Discussion
When assessing the completeness of our dataset, we had to choose between the
two definitions given above.
On the one hand, the assumption made for the behavioural completeness defini-
tions seems realistic as the MITRE ATT&CK database is renowned for document-
ing different kinds of attacks and is updated regularly.
On the other hand, the assumption made for the proportional completeness defini-
tion seems less likely to hold for two reasons. The first is that much information
is missing. For example, only 45% of the security reports indicated what area of
the screen was targeted by the screenshots. The second reason is that, even if the
reports were complete, there are many more than 103 screenshot-taking malware
programs in the wild, and most of them probably use the most naïve and unso-
phisticated behaviours. However, one could argue that the attacks referenced in
the MITRE ATT&CK database are the ones with the most critical consequences
and that we precisely aim to detect the most stealthy and unusual screenlogger
behaviours. Moreover, as the detection is based on ML algorithms, if the dataset
is overwhelmingly composed of basic and naïve behaviours, the features that
would be selected would not allow for the detection of more sophisticated vari-
ants, which will be under-represented.
Therefore, even if it might appear less realistic than the first definition, we chose
to implement the proportional completeness definition in our dataset.
5 | Dataset Construction
In this chapter, we present our methodology to build the first dataset dedicated to
screenshot-taking malware (Section 5.1) and legitimate screenshot-taking appli-
cations (Section 5.2).
The dataset is not only composed of malware and legitimate samples but also
contains execution reports. These reports monitor two aspects of the samples’
execution: API calls and network behaviour. They are used to train and test our
detection models.
The names and hashcodes of our samples were collected from MITRE [14], which
has the advantage of centralising and listing 103 malware programs with the
screenshot functionality. The hashcodes can be found in the indicators of com-
promise section of the security reports. Based on these hashcodes, we were able
to collect 600 spyware samples from VirusShare [85], AVCaesar [51], Malshare
[218] and VirusSign [219]. We also collected some screenlogger source codes on
open archives, such as Github.
Given the nature of the software studied, the execution was carried out in a se-
cure environment. Subsequently, their behaviour was analysed using dedicated
software.
To analyse the spyware, we used Cuckoo sandbox [79], which allows generating
reports on the API calls made by malware programs, and on their network traffic
in pcap files. We also used Wireshark [221] as well as API Monitor [222], which
provides more precise information than Cuckoo on API calls (a log of each call
and when it was made).
While employing a novel technique based on the monitoring of API call se-
quences, we realised that our samples were not taking screenshots during ex-
ecution. We then investigated the reasons for this unexpected outcome, which
allowed us to propose a solution based on widely used screenshot tools.
To ensure that the collected samples were taking screenshots, it was necessary to
find a sequence of API calls that characterises screenshot-taking. This sequence
must be precise enough to exclude non-screenshot cases and, simultaneously, not
so precise as to produce false negatives (i.e., excluding cases where screenshots
were actually taken). The difficulty lies in the fact that there is no unique
screenshot function proposed by Windows APIs but rather a sequence of 5–6 func-
tions that can be used alternatively with other functions. Moreover, each function
in the sequence can individually be used in other contexts. Thus, it is impossible
to deduce that a screenshot was taken by only looking at a unique function.
As explained in Section 4.1.3, two main libraries can be used to take a screenshot:
the Windows GDI and the DD API, which replaced mirror drivers from Windows
8 onward. To define the criteria characterising a screen capture, screenloggers
and legitimate screenshot-taking applications were analysed. Results showed that
different API call sequences can be used for screenshot-taking. We therefore had
to make flowcharts with different alternatives at each step (Figure 5.1). The charts
were then transcribed into a script that takes the API call reports as an input and re-
turns information such as the number of screen captures taken and their frequency
(if the screenshots are taken at a regular time interval).
Using the flowcharts, for the GDI library, we identified a screen capture by the
call of one function that retrieves the content of the screen (GetDC,
GetWindowDC, CreateDCA or CreateDCW), followed by a call to the BitBlt or
StretchBlt function. The script ensures that BitBlt's hdcSource parameter is the value
that was returned by the function capturing the screen’s content. Sometimes, the
GetDC function is called only one time at the beginning of the screenshot session
and the subsequent functions are called multiple times with the same hdcSource
parameter.
As the functions in the sequence take as parameters the return values of the previ-
ous functions, it is impossible for them to be called out-of-order. Moreover, as the
return values are kept in memory until they are used as parameters, the screenshot
is detected even if the API calls are spaced in time.
Finally, a stealthier way of taking screenshots using GDI is to call only the GetDC
function for each screenshot, with the GdipCreateBitmapFromScan0 function,
from the GDI+ API, in the call stack. Indeed, API calls made deeper in the call
stack are not recorded by API Monitor. However, we were able to make our
scripts detect this way of taking screenshots by noticing that, in this case, the
GetDC function is called by the gdiplus.dll module, which is not the case in other
situations.
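To make this concrete, the following minimal Python sketch illustrates the kind
of stateful sequence matching described above. The trace format and field names
('func', 'args', 'ret', 'module') are simplifying assumptions made for readability
and do not correspond to the actual API Monitor report format.

SCREEN_DC_FUNCS = {"GetDC", "GetWindowDC", "CreateDCA", "CreateDCW"}
BLIT_FUNCS = {"BitBlt", "StretchBlt"}

def detect_captures(events):
    """Return the indices of detected screen captures in a simplified trace."""
    screen_dcs = set()  # DC handles known to refer to the screen content
    captures = []
    for i, ev in enumerate(events):
        if ev["func"] in SCREEN_DC_FUNCS:
            # The returned handle may be reused by many later blits, so
            # captures are matched even if the calls are spaced out in time.
            screen_dcs.add(ev["ret"])
            # Stealthy GDI+ variant: GetDC issued from gdiplus.dll.
            if ev["func"] == "GetDC" and ev.get("module") == "gdiplus.dll":
                captures.append(i)
        elif ev["func"] in BLIT_FUNCS and ev["args"].get("hdcSource") in screen_dcs:
            captures.append(i)
    return captures

trace = [
    {"func": "GetDC", "args": {"hWnd": None}, "ret": 0x42, "module": "app.exe"},
    {"func": "BitBlt", "args": {"hdcSource": 0x42}, "ret": 1, "module": "app.exe"},
    {"func": "BitBlt", "args": {"hdcSource": 0x42}, "ret": 1, "module": "app.exe"},
]
print(len(detect_captures(trace)))  # -> 2

Because the screen DC handle is remembered once returned, blits that are spaced
out in time or interleaved with other calls are still matched, and the capture
frequency can be estimated from the intervals between detected captures.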
Zhao et al. proposed a method for screenlogger detection and also identified
criteria based on Windows API calls to characterise a screen capture operation
[223]. Their solution was to examine the GetWindowDC and BitBlt function calls. This
approach is similar to the criteria we identified but considers neither the other
equivalent functions (StretchBlt, GetDC, etc.) nor the other library that can be
used (the DD API). Thus, examining only the GetWindowDC and BitBlt function
calls does not allow for the identification of all screen-capturing applications.
Moreover, the parameters of the functions are not verified, particularly the
hdcSource parameter of the BitBlt function, which must refer to the screen
content. If the hdcSource parameter does not correspond to the DC of the screen,
BitBlt was used for a different purpose, and some applications that do not capture
the screen can be wrongly identified as taking screenshots, which is problematic
as well.
[Figure 5.1: GDI screen-capture flowchart. GetDC, GetWindowDC, CreateDCA or
CreateDCW returns the source DC (hdcSource); CreateCompatibleDC returns the
destination DC (hdcDestination); CreateCompatibleBitmap or CreateDIBSection
creates the bitmap, attached via SelectObject; BitBlt or StretchBlt then copies
from hdcSource to hdcDestination.]
Problem
By running our scripts on the collected samples, we realised that only one sample
out of 600 was actually taking screenshots. Several reasons can be put forward to
explain this unexpected result.
• Many malware programs infect their targets through different steps. The
screenshot module is often not present from the beginning and is rather
downloaded from the server afterwards. For example, the Prikormka mal-
ware has a ‘downloader module’ that downloads other modules that are not
present at the time of infection [202]. Silence downloads a dropper which
will communicate with the server to obtain the screenshot plugin [224].
However, almost all the malware servers are currently dead. Even if they
were not, we would not allow the samples to connect to the network for
obvious security reasons.
Solution
The solution we propose to this challenge results from the observation that mal-
ware authors often do not bother implementing the screenshot functionality them-
selves. Instead, they reuse RATs or ‘Pentest tools’ widely available on the internet.
The only element that varies between different malware is therefore the infection
medium, which is not relevant for this study. Several well-known cybercriminal
groups, responsible for significant attacks, have employed these kinds of tools. By
way of illustration, and in a non-exhaustive manner, one could cite:
• The Group5 group uses njRAT and Nano Core RAT [227]
Analysing these malicious tools seems a reasonable attempt to cover the majority
of screenshot-taking malware. As they are widely available, we were able to get
working samples (with the client and server parts) and even source codes where
available. We constructed a dataset dedicated specifically to screenshot-taking
malware, containing 118 samples. Some were source codes that we had to com-
pile and make operable. We ran each of them to ensure that the screen-capturing
function was triggered. For that, it was necessary that each selected malware be
composed of executable client and server parts to allow us to trigger the screen
capture ourselves.
These malicious tools have then been analysed to gather the same type of infor-
mation as from the MITRE security reports. When we had the source code, we
directly looked there. If not, we ran the executable files. Given the dangerous
nature of the studied software, their execution was carried out in a secure
environment, with the client and server parts run on two different VirtualBox
virtual machines. The network was configured by creating a virtual network
interface named vboxnet0. The two machines were attached to a host-only
network to isolate them from the real network interface and to allow
communication between them.
This dataset is, to the best of our knowledge, the first one dedicated to screenshot-
taking malware and is available on Github [230].
We used the source codes and execution reports at our disposal to analyse the mal-
ware samples of our dataset using the criteria presented in Section 4.1.2 (Sections
5.1.3.1, 5.1.3.2, 5.1.3.3).
Screen capturing
Used API The malware of our dataset exclusively uses the GDI library to capture
screenshots. The DD API, which may be used by some malware programs as seen
in Section 4.2.1, is therefore not represented.
Screenshot-triggering All the malware samples composing the dataset must
receive a command to start taking screenshots. This corresponds to the
observations made in Section 4.2.2, as most malware had to wait for a command
to capture the screen. However, the opposite behaviour (starting to take
screenshots automatically upon infection) is not represented.
Regarding the event triggering the screen-capturing functionality, the two most
frequent behaviours encountered are the capture of the screen at a given frequency
or on-demand upon receipt of a specific command allowing for one screenshot to
be taken (Figure 5.2). Several screenloggers, such as Xtreme RAT, SpyNet and
NetWire, offer the two possibilities.
Remcos constitutes a particular case in the dataset because its screenshots are
triggered by the occurrences of some target keywords in the titles of the opened
windows.
[Figure 5.2 (bar chart): screenshot-triggering events in the malware of the
dataset – Frequency, Punctual command, Application of interest.]
Two behaviours observed in the previous section are not represented here: screen-
shots triggered by mouse clicks and unique screen capture during the whole exe-
cution for reconnaissance purposes.
Note that all malware samples in our dataset take screenshots using hidden pro-
cesses with no user interaction. Some of them even inject themselves in legitimate
processes such as the default browser (Figure 5.3).
Figure 5.3: Malware infiltrating the default browser process ('Navegador padrão').
Captured area Regarding the captured area, as shown in Figure 5.4, a majority
of the selected malware (86%) captures the entire screen at each shot. This
corresponds to the observations made in Section 4.2.3. Capturing the whole screen
might be used to recover as much information as possible. Relatively advanced
software can subsequently be used to extract relevant data after the transfer to the
malicious servers.
However, 14% of malware have smarter and more optimised behaviours as they
capture either a specified zone using its coordinates (7%, e.g., SpyNet, Xtreme
RAT) or only the difference (changes) between two successive screens (Gh0st).
Moreover, we found a behaviour that we did not observe in Section 4.2.3, which
is the capture of the zone around the mouse click, an option proposed by Pupy.
This enables sending smaller, and thus stealthier, packets, and can be useful in
attacks targeting online banking users who enter their password using a virtual
keyboard, as proposed by many banks [41].
[Figure 5.4 (bar chart): captured area in the malware of the dataset – Full screen,
Zone coordinates, Difference, Mouse area.]
Screenshot storage
Image compression In the constituted dataset, 57% of malware use JPG coding
to represent images. The remaining ones use BMP and PNG formats in equal pro-
portions (Figure 5.5). These observations correspond to those made in Section
4.3.1, except that several minority formats, such as AVI, RAR and WCRT, are
missing.
Figure 5.5: Image files format (in the malware of the dataset).
Figure 5.6: Image files storage (in the malware of the dataset).
Screenshot exfiltration
All the malware samples of our dataset send the captured screenshots
immediately after capturing the screen. This is unfortunately not sufficiently
representative of the potential behaviours presented in Section 4.4.3. Indeed, the
exfiltration modes that are more advanced than systematic sending after screen
capture are not represented in this sample.
Problem
Moreover, Table 5.1 shows the differences between the proportions found in our
dataset and the proportions found by analysing the security reports of MITRE.
Table 5.1: Screenlogger behaviours in our dataset vs those found in the security
reports selected from Mitre [14].

Criterion                      Behaviour                        Reports   Dataset
Used API                       DD API                           -         -
Screenshot capture triggering  Frequency                        35        78
                               App. of interest                 9         3
                               Mouse clicks / user trig.        3         -
                               Punctual command                 38        19
                               Unique capture upon infection    5         -
                               Frequency range                  2 s to 15 min   17 ms to 67 s
Format                         JPG                              25        56
                               PNG                              19        24
                               BMP                              7         21
                               Video                            1         -
                               Other                            6         -
Storage                        Memory                           5         66
                               Disk                             59        34
Captured area                  Full screen                      37        87
                               Coordinates                      2         6
                               Difference                       -         3
                               Other                            6         3
Exfiltration trigger           Remote command                   6         -
                               Scheduled                        10        -
                               Other                            13        -
No encryption                                                   -         75
No file sending                                                 -         -
Comm. protocol                 HTTP                             36        6
                               HTTPS/FTP/SMTP/RFB               36        -
                               Proprietary                      1         -
                               TCP                              18        94
Taking the form of a 'screenlogger generator', our solution asks the user various
questions at the beginning of each execution. To generate a screenlogger with
certain characteristics, questions include 'screenshots must be taken every ?
clicks', 'should the malware start the screenshots right away or wait for a
command?', 'what protocol should be used for exfiltration?' and others (Figure 5.7).
This screenlogger generator takes the form of a builder that generates a payload
according to the specified parameters. Then, this payload is run on the victim’s
machine, whereas the server part waits for connections on the attacker’s machine.
Measures were taken to prevent the tool from being used for malicious purposes
(a ‘malware’ flag on network messages and a warning message constantly dis-
played to the user). The screenlogger generator will also be declared on open and
dedicated databases.
As illustrated in Figures 5.7 and 5.8, two types of parameters can be specified
when building the payload.
[Figure 5.8: Screenlogger generator workflow. Step 1 (payload building): static
parameters (IP address, communication port, communication protocol, image
format, storage) and dynamic parameters (resolution, capturing area,
screen-capture triggering, screenshot-sending triggering). Step 2 (execution): with
commands, the dynamic parameters are specified in the C&C server's commands;
in automatic mode, they are specified at building time.]
On the one hand, static parameters are predefined before execution and cannot
be modified. These are the network parameters that will be used for the con-
nection between the server and the client parts (IP address of the attacker’s ma-
chine, port, communication protocol - TCP/HTTP/SMTP/FTP), but also the li-
brary that will be used to take screenshots (GDI/DD API), the compression for-
mat (PNG/JPEG/ZIP/BITMAP/AVI) and the location of storage (disk storage or
memory-only).
On the other hand, dynamic parameters can be modified during execution using a
command, if this option is chosen. These are the resolution, the capturing area,
the screen-capture triggering and the screenshot-sending triggering.
If the option 'without command' is selected, all the parameters are set at building
time and the payload starts taking screenshots according to these parameters
directly upon infection of a new victim (Figure 5.9 b). If the option 'with
command' is selected, the parameters cannot be specified at building time and the
screenlogger waits for a command from the C2 server to start capturing the
screen. This command may be used to take isolated or continuous screenshots
(Figure 5.9 a).
Figure 5.9: Screenlogger generator during execution: (a) When the option ‘with
command’ is chosen (b) When the option ‘without command’ is chosen.
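To make the two-step design concrete, the sketch below models the build step in
Python. The parameter names mirror Figure 5.8 but are otherwise hypothetical;
the actual generator's interface may differ.

from dataclasses import dataclass
from typing import Optional

@dataclass
class StaticParams:  # fixed at build time
    ip_address: str
    port: int
    protocol: str       # e.g. "TCP", "HTTP", "SMTP" or "FTP"
    library: str        # "GDI" or "DD"
    image_format: str   # e.g. "PNG", "JPEG", "ZIP", "BITMAP", "AVI"
    storage: str        # "disk" or "memory"

@dataclass
class DynamicParams:  # modifiable at run time when 'with command' is chosen
    resolution: tuple = (1920, 1080)
    capture_area: str = "full_screen"
    capture_trigger: str = "frequency"
    sending_trigger: str = "immediate"

def build_payload(static: StaticParams, with_command: bool,
                  dynamic: Optional[DynamicParams] = None) -> dict:
    # 'With command': the payload waits for the C2 server to supply the
    # dynamic parameters. 'Without command': they are frozen at build time.
    payload = {"static": static, "wait_for_command": with_command}
    if not with_command:
        payload["dynamic"] = dynamic or DynamicParams()
    return payload

payload = build_payload(
    StaticParams("10.0.0.2", 8080, "TCP", "GDI", "JPEG", "memory"),
    with_command=False)
print(payload["dynamic"].capture_trigger)  # -> "frequency"

With 'with command' selected, the payload carries only the static parameters and
waits for the C2 server to supply the dynamic ones, mirroring the behaviour
distinguished in Figure 5.9.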
Using the generator to make the dataset complete In addition to adding sam-
ples with the missing behaviours to meet the ‘behavioural’ completeness require-
ment defined in Section 4.5.1, we also had to match the proportions found in the
security reports, as explained in Sections 4.5.2 and 4.5.3. For this, 31 generated
samples have been added to our 118 existing samples.
The proportions to match had to account for data that was missing in security
reports. For example, if for a given criterion we have 60% unknown, 20% of be-
haviour A and 20% of behaviour B, the proportions to match are 50% of behaviour
A and 50% of behaviour B.
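The renormalisation used in this example can be written as the following small
sketch (the dictionary layout is purely illustrative):

def renormalise(proportions):
    """Drop the 'unknown' share and rescale the known behaviours so that
    they sum to 1, e.g. {unknown: .6, A: .2, B: .2} -> {A: .5, B: .5}."""
    known = {k: v for k, v in proportions.items() if k != "unknown"}
    total = sum(known.values())
    return {k: v / total for k, v in known.items()}

print(renormalise({"unknown": 0.6, "A": 0.2, "B": 0.2}))  # {'A': 0.5, 'B': 0.5}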
The generated screenloggers that make the dataset complete can be found along
with the tools mentioned in Section 5.1.2.3 on the following Github repository:
[230]. (For security reasons, the executable is not provided.)
In the case of employee monitoring, the screenshots are taken with the
employer's consent, even if the actual user of the device (the employee) did not
agree to the screenshot-taking.
• Remote control: These tools allow a user not only to see another computer’s
screen in real time but also to remotely control it through a given protocol.
They are generally used by system administrators in companies to access
and work on remote computers inside their information systems. Remote
control can be used for after-sell or maintenance purposes as well. We could
also mention the rise of remote computer fixing with operators asking the
users to share their screens to help them resolve the problems they encounter
[233]. Some examples of this category are TeamViewer [56], Netviewer
[57], GoToMyPC [58], AnyDesk [234], Apple Remote Desktop [235] and
Chrome Remote Desktop [236].
Diverging behaviours
Each of the five categories presented above has specificities regarding the way
they take and process screenshots:
• Screen sharing: Screenshots are often taken at a high frequency to allow for
a smooth view of the screen activity. The real-time constraint mandates that
screenshots must be sent right after they are taken.
• Remote control: The behaviours mentioned for screen sharing also apply
to remote control. The main difference lies in the triggering of the screen-
shot operations. Indeed, screenshots start to be taken when the user of the
remote-control application chooses to start the session. This behaviour is
quite similar to malware programs that start taking screenshots when they
receive a command from the malicious server.
These applications have been analysed with the same tools used to assess mal-
ware: API Monitor for API calls and Wireshark for network activity. This analysis
was focused on the criteria presented in Section 4.1.2. Information regarding the
selected criteria was not always available. However, the objective was to obtain
the maximum amount of information on all applications regarding the maximum
number of criteria.
6 | Behavioural Comparison
In this chapter, we first present the results of our analysis of legitimate screenshot
taking applications according to the criteria proposed in Chapter 4 (Sections 6.1,
6.2, 6.3).
Then, by comparing these results with those we obtained through the analysis
of the security reports referenced on MITRE, we gather insights and formulate
hypotheses about the promising criteria for screenlogger detection and differenti-
ation from legitimate screenshot-taking applications (Section 6.4).
Legitimate applications use the two main available libraries to take screenshots:
GDI (76%) and the DD API (17%). The DD API is more represented than in the
malware case because it is used by several real-time applications (screen sharing,
e.g. Skype; remote control, e.g. TightVNC, ShowMyPC, UltraVNC, Ehorus).
Indeed, this API is well adapted to this type of application: it allows screenshots
to be taken in a fast and robust way by sending only the difference between two
consecutive screens. However, it is a relatively recent API (Windows 8 and above)
compared to GDI, which could explain why it is not used in malware. Malware
developers rarely develop all the modules themselves and, in many cases, they use
stable and widely tested plugins, most of which are implemented using native
technologies like GDI. Another reason could be that the GDI library is much more
flexible, allowing screenshots to be taken at the desired frequency and size.
Screen captures are triggered only while the application is executed, and usually
not by a command from a distant server, unlike screenloggers. However, this ob-
servation is unfortunately not systematic as some types of legitimate applications
such as remote control respond to distant commands to establish the connection
and start the screen sharing session.
Another observation is that some applications may exhibit a behaviour very sim-
ilar to malware in the way screenshots are triggered. For example, Spyrix Free
Keylogger, an application for monitoring children’s activities, takes screenshots
each time the user switches windows or opens a new window. Hubstaff, an
employee-monitoring application, takes screenshots irregularly, and we were not
able to identify the triggering event. It may be linked to the 'performance factor'
of the employee, which is a function of several elements like mouse clicks,
keystrokes, opened windows and more.
The capturing areas of the legitimate applications making up our sample are
illustrated in Figure 6.1. The largest share of applications captures the full screen
(37%), but some samples target more precise areas of the screen (18% use
coordinates, 14% target specific windows). A quarter of the tested applications
capture the whole screen but in an optimised way: only the areas of the screen
that have been updated are captured. We can note that none of our legitimate
applications target overlapping windows.
[Figure 6.1 (bar chart): captured area of the legitimate applications – Full screen,
Difference, Limited area, Target window, N/A.]
Determining which format is used for the screenshots captured by legitimate
applications is not an easy task: it is often time-consuming and complex because
of the encryption used and because many samples only keep a memory
representation of the screenshot. However, it was possible to observe that,
unsurprisingly, a large proportion of the legitimate applications for which the
information could be obtained use standard and portable compression formats
(PNG and JPG) to save captured screenshots (Figure 6.2), with a large majority
opting for the PNG format. Other applications use video formats such as AVI,
MPEG4 and WMV to store images: these are the screencasting applications
(e.g. VLC, TinyTake, CamStudio).
[Figure 6.2 (bar chart): image file formats of the legitimate applications – JPG,
PNG, WMV, TREC, AVI, MPEG4, N/A.]
[Figure 6.3 (bar chart): image file storage of the legitimate applications – Memory,
Disk.]
A substantial proportion of the legitimate applications do not send the screenshots
over the network because this is not their purpose. Indeed, the screenshot-editing,
employee-control and screencasting applications do not send the screenshots by
default. The other applications often use proprietary protocols, as shown in
Figure 6.4. Few use standard, well-known protocols such as RFB (Remote Frame
Buffer) or HTTPS.
[Figure 6.4 (bar chart): screenshot-sending protocols of the legitimate
applications – Proprietary, Standard, No sending.]
In real-time contexts (screen sharing, remote control), the screenshots are sent
immediately after they are taken. As expected, children/employee-monitoring
applications do not send their screenshots right away. As illustrated in Figure 6.4,
the remaining 45% of legitimate applications never send the screenshots.
6.3.3 Encryption
Applications of our dataset that send the screenshots mainly use either SSL or
TLS encryption before transmitting files. A few of them, such as TightVNC and
UltraVNC, do not encrypt files, as shown in Figure 6.5.
[Figure 6.5 (bar chart): encryption used by the legitimate applications – TLS, SSL,
No encryption, No sending.]
We formulate hypotheses about which criteria seem to be the most suitable for
screenlogger detection.
Used API Regarding screenshot-taking, the malware programs in our sample
only use the GDI library to capture screenshots, while 17% of legitimate
applications use the DD API. This might be an interesting criterion for
differentiating these software classes, but it is limited in the sense that, even if it
does not seem to be the case nowadays, malware developers could choose to use
the DD API in the future.
Screenshot triggering Our study revealed that most malware (67%) starts tak-
ing screenshots in response to commands coming from the C&C server, whereas
in the case of legitimate applications, screen capturing is strongly correlated with
the use of the application and most of the time cannot be done in response to
distant commands or events unrelated to the current application.
Finally, contrary to the observations made about our malware samples in Section
5.1.3.1, most legitimate applications do not try to conceal their screenshot-taking
activity. This could be an interesting criterion for screenlogger detection.
Captured area The vast majority of malware captures the full screen whereas
it is more diverse for legitimate applications: only 37% of them capture the full
screen, and the others either capture only part of the screen, a target window or
only the difference between two successive screens. This contrast may be ex-
plained by the fact that most screenshot-taking malware does not look for specific
information on the infected device, but rather spies on the user’s activity in gen-
eral.
Storage Legitimate applications tend to keep the screenshots in memory rather
than writing them to disk. This trend is much less evident in the case of malware.
Indeed, the theoretical study of malware security reports shows a strong
inclination towards the use of the hard drive for storage. This might be because a
significant proportion of screenloggers do not send the images right away but
rather wait for specific events (e.g. frequency, buffer size reached, command for
screenshot sending).
Most of the legitimate applications that send the screenshots do so directly after
the screen capture, as they are real-time applications (remote control, screen
sharing). A majority of malware programs exhibit the same behaviour, although
some display different sending patterns, such as sending the image files after a
certain number of captures or upon reception of a command. However, the fact
that the exfiltration of screenshots is delayed is not sufficient to conclude that a
program is malware, because children/employee-control applications may
behave similarly, by sending the screenshots at the end of the control session for
example.
6.5 Discussion
Based on our experiments and execution analysis of screen-logging malware and
legitimate applications, Table 6.1 summarises the difference levels for each crite-
rion.
In this table, we mainly distinguish three classes of criteria according to the degree
of differentiation between malware and legitimate applications.
Indeed, criteria like the compression format and screen capture frequency do not
highlight enough differences between legitimate applications and malware to be
used effectively in a detection methodology.
Table 6.1: Screenlogger behaviours of legitimate applications vs malware, with
the degree of differentiation of each criterion (+ to +++).

Criterion                  Behaviour                      Legit.          Malware         Diff.
Used API                   DD API                         17              -
Screen capture triggering  Frequency                      63              35              ++
                           App. of interest               3               9
                           Mouse / user trig.             23              3
                           Punctual command               -               38
                           Capture upon infection         -               5
                           Frequency range                9 ms to 1 h     2 s to 15 min   +
Format                     JPG                            7               25              +
                           PNG                            17              19
                           BMP                            -               7
                           Video                          17              1
                           Other                          -               6
Storage                    Memory                         57              5               ++
                           Disk                           43              59
Captured area              Full screen                    37              37              ++
                           Coordinates                    19              2
                           Difference                     26              -
                           Other                          14              6
Exfiltration trigger       Remote command                 -               6
                           Scheduled                      3               10
                           Other                          -               13
No encryption                                             7               -               +++
No file sending                                           50              -               +++
Comm. protocol             HTTP                           -               36              +++
                           HTTPS/FTP/SMTP                 24              36
                           Proprietary                    24              1
                           TCP                            -               18
The third and final class includes criteria with significant degrees of differenti-
ation: sending of screenshot files, encryption and the communication protocol.
Indeed, software that uses a proprietary protocol with TLS/SSL encryption for
network communication or which does not send the screenshots over the network
is likely to be a legitimate application, whereas one that does not encrypt data
or just uses base64 encoding and that uses a standard protocol such as TCP, FTP
or SMTP might be malware. However, a detection methodology limited only to
these criteria cannot be effective because of the potential for false positives.
7 | Behavioural Screenlogger Detection
Using the malicious and benign datasets constructed in Chapter 5, we were able
to identify the features found in the literature which are the most performant for
screenlogger detection. Moreover, we trained and tested a malware detection
model with new features adapted to the specifics of screenlogger behaviour.
During their execution, the behaviour of the malicious and benign samples was
monitored using API Monitor and Wireshark.
To implement and test our detection models, we used the Weka framework, which
is a collection of ML algorithms for solving real-world data mining problems
[245].
More precisely, we used it to process the run-time analysis reports, select the best
detection features, select the classification algorithms, train and test the models,
and visualise the detection results.
The measures used to assess the performance of our detection models are the
following:
In the case of malware detection, it is crucial that all malware programs be
detected, to prevent them from causing important damage. On the other hand,
classifying a legitimate application as malware, even if it can be inconvenient for
the user, might not be as critical. As a result, we give particular importance to the
false-negative and recall metrics.
API call features and network features were tested both independently and con-
jointly.
When running the samples from our malicious and benign datasets in a controlled
environment, we collected reports on two aspects of their behaviours: API calls
(API Monitor reports) and network activity (Wireshark reports).
API calls
This category of features was extracted from the reports produced by API Monitor.
The first feature consisted in counting the number of occurrences of each API
call. For each malicious and benign API-call report, the number of occurrences of
each API call it contains was extracted into a .csv file.
In the literature, we found that malware programs try to conceal their malicious
functionality by introducing benign API calls into their API call sequences. A
popular way of performing malware detection using API calls is therefore to use
the number of occurrences of API call sequences rather than API calls taken
alone. For this, the concept of N-grams is used: N-grams are sequences of N API
calls made successively by the studied program.
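As an illustration, N-gram counts can be extracted from a call trace as in the
sketch below (N = 1 reduces to the plain per-call counts used as the first feature):

from collections import Counter

def ngram_counts(api_calls, n):
    """Count each sequence of n successive API calls in a trace."""
    return Counter(zip(*(api_calls[i:] for i in range(n))))

calls = ["GetDC", "CreateCompatibleDC", "BitBlt",
         "GetDC", "CreateCompatibleDC", "BitBlt"]
print(ngram_counts(calls, 2)[("CreateCompatibleDC", "BitBlt")])  # -> 2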
Network traffic
Using the .pcap files produced by Wireshark and the Argus tool to isolate
network flows [246], we extracted 47 network features found in the literature (a
toy computation of two of them follows the list below). These features belong to
four categories:
• Source IP address
• Destination IP address
• Source port
• Destination port
• Total number of bytes in the flow over the number of packets in the flow
• Ratio between the number of incoming packets and the number of outgoing
packets
• Standard deviation of the time a flow was idle before becoming active
• Standard deviation of the time a flow was active before becoming idle
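As announced above, the following toy sketch computes two of the listed features
from a simplified flow record; real values would be derived from the Argus
output, whose format is not reproduced here.

def flow_features(packets):
    """Mean bytes per packet and incoming/outgoing packet ratio."""
    total_bytes = sum(p["bytes"] for p in packets)
    n_in = sum(1 for p in packets if p["dir"] == "in")
    n_out = len(packets) - n_in
    return {
        "bytes_per_packet": total_bytes / len(packets),
        "in_out_ratio": n_in / n_out if n_out else float("inf"),
    }

flow = [{"dir": "out", "bytes": 1400},
        {"dir": "out", "bytes": 1400},
        {"dir": "in", "bytes": 60}]
print(flow_features(flow))  # bytes_per_packet ~953.3, in_out_ratio 0.5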
Our detection model uses the RF algorithm [248]. This algorithm trains several
DTs and uses a majority vote to classify observations. Each DT is trained on a
random subset of the training dataset using a random subset of features.
The main shortcoming of DTs is that they are highly dependent on the order in
which features are used to split the dataset. RF addresses this issue by training
multiple trees on different feature subsets.
• Minimum number of instances per leaf (by default 1, but it can be raised to
prevent overfitting).
It emerged from our experiments that the default parameters described above
provided the best results; bootstrapping was used. Algorithms other than RF were
also tested: KNN (K=7; LinearNNSearch) and SVM (polynomial kernel,
normalised training data). As shown in Section 7.3.5, RF yielded the best results.
Using an artificial neural network was also considered, but we did not have a
sufficient number of samples in our dataset; indeed, neural networks require a
large amount of training data.
To train and test our model, we used the k-fold cross-validation method (with
k = 10). This value of k allowed us to reach the best tradeoff between bias and
variance. As seen in Section 7.3.5, other values were tested but led to degraded
performance.
This method consists in dividing our dataset into k blocks of the same size. The
blocks all have the same proportions of malware and legitimate applications. For
each block, we train the model on the k − 1 other blocks and test it on the current
block. The final detection results are obtained by adding the results of each block.
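The thesis implements this procedure in Weka [245]. Purely for illustration, an
equivalent stratified 10-fold run can be sketched with scikit-learn, using synthetic
stand-in data instead of our execution reports:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in for the real feature vectors and labels.
X, y = make_classification(n_samples=300, n_features=47, random_state=0)

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
predictions, truths = [], []
for train_idx, test_idx in skf.split(X, y):
    # Each block keeps the malware/legitimate proportions (stratification).
    clf = RandomForestClassifier(random_state=0)
    clf.fit(X[train_idx], y[train_idx])
    predictions.extend(clf.predict(X[test_idx]))
    truths.extend(y[test_idx])

# The final results aggregate the predictions made on each held-out block.
print(f"accuracy: {accuracy_score(truths, predictions):.3f}")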
Using cross-validation, we trained and tested our model using first API call fea-
tures only, then network features only, and, finally, using both categories of fea-
tures.
Due to the high number of features used, it was necessary to select the most
useful ones to avoid overfitting. A feature is useful if it is informative enough for
our classification task, that is, if it enables the model to effectively distinguish
between malicious and benign behaviours.
For this task we used the Recursive Feature Elimination (RFE) method [249].
Given a number of features to select, this method iteratively trains our RF model
using cross-validation and removes the least important features at each iteration.
The importance of a feature is given by the average of its Gini impurity score for
each DT in which it is used.
The Gini impurity of a feature that splits the samples at a node of a DT reflects
how 'pure' the subsets produced by the split are. In our case, a subset is purer
if it contains mostly screenloggers or mostly legitimate screenshot-taking appli-
cations. For instance, a subset containing 75% malware and 25% legitimate ap-
plications is purer than a subset that contains 50% malware and 50% legitimate
applications.
The Gini impurity of a feature is the weighted average of the impurity scores of
the subsets it produces. The weights are computed using the number of samples
contained in each subset.
When the features are numerical values (which is our case), instead of computing
the impurity of the subsets produced by each single value, intervals are used. More
precisely, the Gini impurity of the feature is obtained through the following steps:
• Step 1: The samples are sorted in ascending order of the feature's value.
• Step 2: The average of each pair of consecutive values is computed; these
averages serve as candidate split thresholds.
• Step 3: For each average value from Step 2, the Gini impurity of the feature
if the samples were split using this value is computed.
• Step 4: The Gini impurity of the feature is the minimum among the Gini
impurities from Step 3.
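A minimal sketch of Steps 1 to 4 for a single numerical feature, with binary labels
(1 for screenlogger, 0 for legitimate):

def gini(labels):
    """Gini impurity of a set of binary class labels."""
    n = len(labels)
    p_pos = sum(labels) / n
    return 1 - p_pos**2 - (1 - p_pos)**2

def feature_gini(values, labels):
    """Steps 1-4: sort by feature value, build candidate thresholds from
    consecutive-value averages, and keep the lowest weighted impurity."""
    pairs = sorted(zip(values, labels))                      # Step 1
    xs = [v for v, _ in pairs]
    thresholds = [(a + b) / 2 for a, b in zip(xs, xs[1:])]   # Step 2
    best = 1.0
    n = len(pairs)
    for t in thresholds:                                     # Step 3
        left = [l for v, l in pairs if v <= t]
        right = [l for v, l in pairs if v > t]
        if not left or not right:
            continue
        # Weighted average of the two subsets' impurity scores.
        score = len(left) / n * gini(left) + len(right) / n * gini(right)
        best = min(best, score)                              # Step 4
    return best

# A feature that separates the two classes perfectly scores 0.
print(feature_gini([0.1, 0.2, 0.8, 0.9], [0, 0, 1, 1]))  # -> 0.0

Low scores therefore indicate highly discriminating features, which RFE keeps in
the model.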
7.3.5 Evaluation
Table 7.1 contains the results we obtained for the first detection approach using
features found in the literature with the RF algorithm and k=10.
Table 7.1: Detection results for the basic approach using features from the litera-
ture with the RF algorithm (k=10).
Table 7.2 contains the results we obtained for the first detection approach using
features found in the literature with the KNN algorithm and k=10.
Table 7.2: Detection results for the basic approach using features from the litera-
ture with the KNN algorithm (k=10).
Table 7.3 contains the results we obtained for the first detection approach using
features found in the literature with the SVM algorithm and k=10.
Table 7.3: Detection results for the basic approach using features from the litera-
ture with the SVM algorithm (k=10).
For all categories of feature except 3-gram API calls, RF outperforms KNN and
SVM.
We can observe that network features seem to give better results overall than API
call features. Regarding API calls, using sequences of two and three calls
significantly decreases the performance of the model, with more than 10% of
malware classified as legitimate (vs 3.8% when individual API calls are used).
Combining network features and API call features does not improve the results
compared to using network features alone.
Tables 7.4, 7.5, 7.6, 7.7 and 7.8 show the detection results for k=3, k=5, k=7,
k=12 and k=15, respectively, with the RF algorithm. We can see that k=10 yields
the best accuracy (see Table 7.1). With k=10, the variance is 2.111% and the
training time is 0.03 s.
Table 7.4: Detection results for the basic approach using features from the litera-
ture with the RF algorithm (k=3).
Table 7.5: Detection results for the basic approach using features from the litera-
ture with the RF algorithm (k=5).
Table 7.6: Detection results for the basic approach using features from the litera-
ture with the RF algorithm (k=7).
Table 7.7: Detection results for the basic approach using features from the litera-
ture with the RF algorithm (k=12).
Table 7.8: Detection results for the basic approach using features from the litera-
ture with the RF algorithm (k=15).
Finally, using RFE, we identified the most relevant API calls for screenlogger
detection:
We also identified the most relevant state of the art network features:
Our second detection approach relies on novel features tailored to screenlogger
detection. These features target specific behaviours that allow screenlogging and
legitimate screenshot-taking behaviours to be distinguished.
The first novelty of this approach is that, instead of using hundreds of features and
trusting a ML model to select the most discriminating ones, we use features that
reflect a specific behaviour. The advantage is that malware authors will not be
able to mislead the detection system without changing their core functionality.
Indeed, as seen in Section 2.2, existing detection models are prone to overfitting
and can easily be misled by malware authors acting on features unrelated to the
malicious functionalities of their programs.
The second novelty lies in the way our features are collected. Indeed, in the mal-
ware detection field, it is commonplace to automatically run millions of malware
samples in a controlled environment to collect features without interacting with
the samples. Such an approach would unfortunately not work for screenloggers,
as their malicious functionality needs to be triggered at run time through interac-
tion with the malware program. To collect our features, we took particular care
to ensure that each malicious or legitimate program worked as intended.
A prerequisite to the following features is that the studied application has been
identified as having a screenshot-taking activity (using the API call sequences
from Section 5.1.2.1).
User interaction
To extract this feature, we had to identify the API calls that result from user
interaction.
We found that, on Windows, some API calls involved in user interaction can be
called on other applications’ windows. As such, they could easily be called by a
malware program pretending to interact with the user, whereas in fact, it does not
even have a window.
Other API calls, mainly those involved in drawing on the window, can only be
called by the application that created the window. If they are called by another
application, their return value is false. Therefore, we monitor this second category
of functions and, even when they are called, we verify their return value.
We saw in Section 6.4.1 that, unless they infiltrate legitimate processes, all the
malicious samples of our dataset take screenshots through background processes
hidden from the user. Legitimate screenshot-taking applications, apart from
children/employee monitoring tools and some applications that create a
background process for the screenshot-taking (e.g. TeamViewer), use foreground
processes.
Thus, the fact that the screenshots are taken by a background process increases
the probability of malicious activity.
Image sending
A malware program may defer or schedule the sending of screenshots. In such a
case, even if image packets are not sent during the monitoring time, it can be that
these packets will be sent later. Therefore, our ‘Image sending’ feature only
reflects whether or not screenshots are sent during the monitoring time, and
cannot be used to affirm that an application does not send the screenshots it takes.
Note that, to measure this feature accurately, it was necessary that the API calls
and network reports be generated at the exact same time.
By analysing our samples, we found that the maximum duration between the com-
mand and the screenshot is 46 772 ms, the minimum duration is 0.0059 ms, the
average duration is 83.044 ms and the median duration is 33.115 ms. We con-
ducted experiments with these different values for T .
Although this was not observed in our dataset, we account for the case where the
process receiving the command differs from the process taking the screenshots.
To the best of our knowledge, our detection model, through this feature, is the first
to make a correlation between two kinds of events (reception of a command and
screenshot API call sequences) for malware detection.
Asymmetric traffic
Therefore, instead of measuring the ratio between the number of incoming and
outgoing packets, we use the ratio between the numbers of bytes exchanged in
both directions.
Captured area
In Section 6.4.1, we saw that almost all malware capture the full screen as opposed
to legitimate applications which may target more specific areas of the screen de-
pending on their purpose. As a result, we implemented a ‘captured area’ feature
which takes three values: full screen, coordinates and target window.
We had to identify, in our screenshot API call sequences, the elements that show
what area of the screen is captured. However, as discussed in Section 5.1.2.1 for
the screenshot sequences, there is not only one way to capture a given area of
the screen, but several. For instance, to capture a zone with given coordinates,
one might get a cropped DC from the beginning using the GetDC function with
the desired coordinates as parameters, or take the whole DC and do the cropping
afterwards when copying the content of the screen in the destination bitmap using
BitBlt’s arguments.
Therefore, for each of the three values of the ‘captured area’ feature, we listed the
possible API call sequences which might be used.
Note that we consider that an application capturing more than three quarters of
the screen's area captures the full screen. This is to prevent malware programs
from pretending to capture a precise area when, in fact, only a few pixels are
removed from the whole screen.
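As an illustration, the three-valued feature and this three-quarters rule can be sketched as follows (hypothetical names; in practice the inputs are recovered from the parameters of calls such as GetDC and BitBlt in the logged sequences):

    def captured_area(capture_w, capture_h, screen_w, screen_h, targets_window):
        # Anything covering more than three quarters of the screen is treated
        # as a full-screen capture, so that removing a few pixels does not let
        # malware pretend to capture a precise area.
        if capture_w * capture_h > 0.75 * screen_w * screen_h:
            return "full screen"
        # Otherwise, the logged API call sequence tells us whether the DC was
        # obtained for a specific window or for given screen coordinates.
        return "target window" if targets_window else "coordinates"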
Screenshot frequency
Therefore, it is possible that not all screenshots are taken at the same time
interval.
Each time a screenshot API call sequence is found, we record its time stamp.
Then, we subtract the timestamps of consecutive sequences and compare the in-
tervals obtained. If more than ten intervals are found to be equal, the feature
takes the value of this interval. Else, it takes the value ‘no frequency’. Note that
screenshots taken using different sequences are accounted for in this frequency
calculation.
As seen in Section 5.1.3.1, some malware programs may try to evade detection by
dynamically changing the screenshot frequency using random numbers. To cover
this case, we consider that two intervals are equal if they are within 15 s of each
other.
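A minimal sketch of this extraction, with hypothetical names (timestamps, in milliseconds, of the detected screenshot API call sequences):

    import numpy as np

    def screenshot_frequency(timestamps_ms, tolerance_ms=15_000, min_equal=10):
        intervals = np.diff(np.sort(np.asarray(timestamps_ms, dtype=float)))
        for ref in intervals:
            # Two intervals are considered equal if they are within 15 s of
            # each other, to cover malware that randomises its frequency.
            equal = np.abs(intervals - ref) <= tolerance_ms
            if equal.sum() > min_equal:  # more than ten equal intervals
                return float(np.mean(intervals[equal]))
        return None  # 'no frequency'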
7.4.2 Evaluation
Table 7.9 contains the results we obtained for the second detection approach using
the screenlogger-specific features we implemented.
Table 7.9: Detection results for the optimised approach using our specific features.
We can see that the detection performance is improved on all metrics: with only
7 features, our model outperforms the first model based on hundreds of standard
features. That is because our features capture specific malicious behaviours.
Moreover, a malware author would not be able to act on these features to mislead
the classifier without changing the malicious functionality. Indeed, to mislead the
classifier, a malware program would have to alter the very behaviours that make
it malicious.
7.5 Discussion
We built a first RF detection model using only API calls and network features from
the literature. This model was trained and tested using our malicious and benign
datasets. Using RFE with Gini importance, we identified the most informative
existing features for screenlogger detection.
Then, we built a second model including novel features adapted to the screenlog-
ging behaviour. These features were collected using novel techniques. Particu-
larly, we can cite:
• Making a correlation between the API calls made by an application and its net-
work activity. During their execution, the API calls and network activity of
our samples were simultaneously monitored. This allowed us to extract fea-
tures such as the reception of a network packet before starting the screenshot
activity or the sending of taken screenshots over the network.
When adding these novel features to the detection model, the detection accuracy
increased by at least 3.108%. Indeed, it is well known that, when little data is
available, a detection model based on fewer features is less likely to overfit.
Moreover, a detection model based on features which have a logical meaning
and reflect specific behaviours is less prone to the evasion techniques often used
by malware authors.
More generally, our results show that, for some categories of malware, a tailored
detection approach might be more effective and difficult to mislead than a gener-
alist approach relying on a great number of seemingly meaningless features fed to
a ML model.
8 | Leveraging Retinal Persistence for a Usable Countermeasure to Malicious Screenshot Exploitation
• The RTM malware mimics a registry check by using the regedit icon and
a design similar to Window’s progress bars. Then, a fake error message is
shown to the user. Clicking on one of the two options runs a process with
administrator privileges [197].
• The Prikormka malware records Skype calls using the Skype Desktop API.
The use of this API by a third-party application causes Skype to display a
warning asking the user to allow or deny access. The malware then creates a
thread that attempts to find the window and click the ‘Allow access’ button
programmatically, without human interaction [202].
• The Janicab malware uses the right-to-left override character in its file’s
name. This way, a .fdp.app file becomes a .ppa.pdf one and users think that they
are opening a PDF instead of an executable [250].
All of these examples demonstrate that malware authors can deploy countless
strategies to mystify users or completely bypass them by simulating events.
• Being real-time: This is an implication of the fact that the solution must
be user-friendly. Indeed, if the solution requires heavy processing of each
screenshot, it will not allow screenshot-taking applications such as screen-
sharing tools to keep working smoothly.
Concretely, the proposed approach does not aim to change the general way in
which information is displayed on the screen for the user’s day-to-day activity,
but only alters the screenshot function.
The idea, instead of returning a screenshot containing the whole information dis-
played on the screen in response to a screenshot API call, is to return an altered
image (Figure 8.2). On Windows devices, screenshots are usually taken using the
GDI API. The returned bitmap is generated from a DC obtained with functions
such as GetDC.
Using hooking techniques such as the one offered by the Microsoft Detours li-
brary [251], we can intercept the API call used to take the screenshot (e.g. BitBlt)
and modify its return value. More precisely, the functions' definitions are modified
by inserting jump instructions. The jump instructions modify the program's
execution flow by redirecting it to our code (Figure 8.1).
The difficulty, in our case, is that the API calls that compose the screenshot-taking
sequence are not used for the sole purpose of taking screenshots. For example,
the BitBlt function, which copies the content of a source DC into a Bitmap object,
may well be called by an application that does not take screenshots. Therefore,
we have to make sure that the intercepted function is indeed part of a screenshot
sequence before redirecting it to our image-altering code (Algorithm 1). To the
best of our knowledge, this way of using the hooking functionality (i.e. by redi-
recting the program to our code only if the function is called in a specific context)
is novel.
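Our detour itself is written against the native GDI functions with Detours; the context check it performs (Algorithm 1) can nevertheless be sketched in Python with hypothetical names and a deliberately simplified pattern list:

    # Known screenshot-taking sequences ending in BitBlt (simplified; the real
    # patterns come from Section 5.1.2.1).
    SCREENSHOT_PATTERNS = [
        ("GetDC", "CreateCompatibleDC", "CreateCompatibleBitmap",
         "SelectObject", "BitBlt"),
    ]

    def is_screenshot_context(call_history):
        # True only if the recent call history ends with a known sequence.
        return any(tuple(call_history[-len(p):]) == p
                   for p in SCREENSHOT_PATTERNS)

    def alter_bitmap(bitmap):
        pass  # placeholder for the image-altering code of Section 8.2

    def bitblt_detour(call_history, real_bitblt, dest_bitmap, *args):
        result = real_bitblt(dest_bitmap, *args)  # let the original copy run
        if is_screenshot_context(call_history + ["BitBlt"]):
            alter_bitmap(dest_bitmap)  # hide parts of the captured image
        return result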
Figure 8.2: Screenshot request flow. Without the countermeasure, the operating system returns the full screenshot to the requesting application or malware; with the countermeasure, an altered screenshot is returned.
Note that this solution does not completely prevent malicious screenshot exploita-
tion, but only makes it more difficult. Indeed, with a sufficient number of screen-
shots, it is possible to see the whole screen's content. This limitation comes
from the important requirement that users should not be involved and that, at
the same time, legitimate screenshot-taking applications must continue working
as intended.
The process used to alter the image must not require overly complex or heavy
computations, so that frames can be displayed at a high enough rate. This allows
legitimate screenshot-taking applications already taking screenshots at a high fre-
quency (such as Skype screen share) to keep working thanks to retinal persistence.
However, this solution prevents legitimate applications that take screenshots only
occasionally from obtaining a complete and clean screenshot, as they cannot
leverage retinal persistence. Two solutions have been studied to remedy this
challenge:
On the one hand, Solution 1 does not require any hardware change to be applied
and can therefore be deployed immediately on old devices. However, a malware
that takes screenshots at a frequency lower than f will be unaffected by our coun-
termeasure.
To be used widely for general screen content protection, our approach must re-
spect three main constraints: security (preventing the screenshot exploitation by
the attacker, be it automatic or manual), usability (as users of legitimate applica-
tions must not be overly disturbed) and real time (so that incomplete images are
generated and displayed at a high frequency to leverage retinal persistence).
However, these constraints can be contradictory. This is, for instance, the case
with security and usability. Indeed, to achieve better security, a large portion of
the screen must be hidden; this comes at the expense of usability.
To achieve the best trade-off between these constraints, three main parameters
can vary: the pattern used in the hidden areas (Section 8.2.1), the way areas to
be hidden are determined (Section 8.2.2) and the frequency at which incomplete
images are displayed (Section 8.2.3).
We implement several algorithms that offer different options regarding these three
parameters.
Each one of the proposed patterns has strengths and weaknesses with regard to
the three constraints evoked earlier (usability, security and real-time).
Uniform patterns
The simplest pattern that can be used to hide a part of the screen is a uniform
colour, as illustrated in figure 8.4.
This pattern has the advantage of not requiring any additional computation. In-
deed, it is independent of the hidden content. This makes it the most suitable
pattern for real-time use.
However, hiding the screen’s content with a colour that is independent from it is
not an optimal solution for user comfort [13].
Gaussian blur
We implement a second filter that uses blur, as illustrated in figure 8.5. Blur
allows us to minimise the differences between hidden and visible areas.
More precisely, we use Gaussian blur, which is known to be one of the best blur-
ring algorithms for preserving edges [253]. Thanks to this property, it is harder
for an algorithm to detect.
The level of blur can be set to obtain the best trade-off between usability and
security. The radius of the blur defines the standard deviation of the Gaussian
function, i.e. how many pixels on the screen blend into each other; thus, a larger
value will create more blur while a value of 0 leaves the input unchanged.
However, contrary to the uniform pattern, the blur pattern varies according to what
is displayed inside the hidden area. As a result, it requires applying a filter to each
pixel of the image, which can be detrimental to real-time display. Moreover, even
if the blur level is low, blur detection algorithms can identify the parts of the image
that are blurred, as the contrast is much lower than in other areas (e.g. in the case
of text). This can allow malware programs to automatically discard the blurred
zones and ‘superimpose’ the different incomplete screenshots, which would make
the solution insecure.
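For illustration, hiding given regions with a Gaussian blur takes only a few lines with the Pillow library (a sketch, not our Windows implementation; the region boxes are hypothetical inputs):

    from PIL import Image, ImageFilter

    def blur_regions(screenshot, boxes, radius=6):
        # boxes: list of (left, top, right, bottom) areas to hide; radius sets
        # the standard deviation of the Gaussian kernel (larger = more blur,
        # 0 leaves the input unchanged).
        out = screenshot.copy()
        for box in boxes:
            patch = out.crop(box).filter(ImageFilter.GaussianBlur(radius))
            out.paste(patch, box)
        return out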
Hierarchical patterns
The third pattern we implement tries to address this issue by hiding parts of the
screen while keeping a high contrast, as illustrated in figure 8.6. We introduce a
level of hierarchy by dividing each hidden area into a given number of columns.
These columns are then randomly mixed. This allows us to keep a high contrast in
the hidden areas, which makes it harder for attackers to automatically know which
parts of the screen are hidden. However, the computation required at runtime
might reduce the frequency at which incomplete screenshots can be displayed.
Periodical algorithms
For the sake of comparison with the approach proposed by Park et al. [13], we
implemented vertical bars that are periodically sliding on the screen (figure 8.7).
The space between the bars is equal to the bars’ width. This value can be chosen
ranging from ten pixels to one hundred pixels. We also implement horizontal
bars with configurable height, diagonal bars with configurable dimensions, and
concentric circles (figure 8.8).
Figure 8.7: Examples of text screenshots altered with the periodical vertical slid-
ing bars algorithm (captured at different times)
Figure 8.8: Examples of text screenshots altered with the periodical concentric
circles algorithm (captured at different times)
Random algorithms
The smaller the w and h parameters, the more calculations the algorithm performs
and the longer its execution time. However, small values for these two parameters
allow us to hide numerous, small portions of the screen.
Probability of being hidden Each rectangle with coordinates (i, j) has a prob-
ability P_ij(k) of being hidden at iteration k, with i ranging from 1 to I = W/w
and j ranging from 1 to J = H/h.
Two constraints are imposed on this random choice. The first states that no more
than U neighbouring rectangles can be hidden in the same iteration. This first
constraint aims at making the solution more usable by limiting the maximum area
of the screen that can be hidden. The second constraint states that no more than V
neighbouring rectangles can be visible in the same iteration. This second con-
straint aims at making the solution more secure by limiting the maximum area of
the screen that can be visible on a given screenshot.
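A minimal sketch of the per-iteration probability update (hypothetical names; the U and V neighbourhood constraints would be enforced by an additional pass over the grid):

    import numpy as np

    def next_iteration(P, rng, E=0.05):
        # P[j, i]: probability that rectangle (i, j) is hidden, initialised at 0.5.
        hidden = rng.random(P.shape) < P  # draw which rectangles to hide
        # A rectangle hidden at iteration k becomes more likely to be shown at
        # iteration k+1 (and vice versa) by the probability step E.
        P = np.where(hidden, P - E, P + E)
        return hidden, np.clip(P, 0.0, 1.0)

    # Example: a 1920x1080 screen divided into rectangles of width w=10 and
    # height h=75, giving a J x I grid with I = W/w and J = H/h.
    rng = np.random.default_rng(0)
    P = np.full((1080 // 75, 1920 // 10), 0.5)
    hidden, P = next_iteration(P, rng)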
Examples of screenshots altered using this algorithm can be found in figures 8.9
and 8.10.
Figure 8.9: Examples of text screenshots altered with the random vertical rectan-
gles algorithm (captured at different times)
Figure 8.10: Examples of text screenshots altered with the random horizontal
rectangles algorithm (captured at different times)
Circle patterns We implement a second algorithm where the shapes are circles
instead of rectangles. The radius and the minimum distance between two cen-
tres are given as parameters. At each iteration, coordinate pairs are randomly
chosen. Then, circles with the specified radius are drawn with these coordinate
pairs as their centres. These circles can overlap if the distance between two cen-
tres is smaller than two times the radius. Here, the advantage is that the transi-
tion between hidden and visible areas is smoother, which could improve usability.
However, choosing random numbers and drawing circles at each iteration requires
significant computation, which limits the frequency at which images can be dis-
played. Examples of screenshots altered using this algorithm can be found in
Figure 8.11.
Figure 8.11: Examples of text screenshots altered with the random circles algo-
rithm (captured at different times).
The frequency at which images are displayed is a key factor. Indeed, retinal per-
sistence precisely relies on displaying images at a high frequency. It has already
been shown that greater frequency implies a higher usability [13].
However, it seems that when a certain threshold is reached, increasing the fre-
quency does not increase the usability much anymore [13]. This can be explained
by the fact that, for the retina to keep an image ‘in memory’, it must be shown for
a minimum duration.
Imposing a very high frequency (a period below 25 ms, i.e. more than 40 FPS) is
not ideal either. Indeed, the time constraint limits the computations that can be
performed on each screenshot, which could be detrimental to security. Moreover,
the number of incomplete screenshots sent over the network would explode,
without any guarantee of improved usability.
To test these different implications of frequency, each algorithm can be run with
a display period ranging from 1 ms to 1000 ms.
8.3 Hypotheses
As discussed in Section 8.2, for our approach against screenloggers to be effective
and usable by a wide spectrum of users, it must reach the best trade-off between
three main constraints: security, usability and real-time.
Before thoroughly testing the various screenshot alteration algorithms that we are
proposing, we can already make a number of hypotheses about the performance
of some parameter configurations on the three criteria. These hypotheses arise
from the characteristics of our different algorithms and their behaviours:
• Security: When black and blur patterns are used, we believe it will be pos-
sible to automatically reconstitute a complete screenshot from incomplete
ones, with more or less computation involved. In contrast, when the hierar-
chical pattern is used, it is less certain that reconstruction will be possible.
Experiments with state of the art OCR algorithms will be conducted to as-
sess this hypothesis.
Besides the pattern used, the way hidden areas are chosen also has an in-
fluence on security. Indeed, predictability is the main obstacle to the effec-
tiveness of periodical patterns. In the same way, the more randomness there
is in random patterns, the more secure we believe the solution will be. As
a result, a low probability update (E) between successive iterations should
yield the best security. Moreover, to minimise the limits posed by the U (re-
spectively, V ) parameter on security, this parameter should be maximised
(respectively, minimised).
• Usability: The blur and hierarchical patterns should give the best results
because there will be less contrast with visible areas than when the black
pattern is used.
The size of the hidden area should also have an impact on usability. Indeed,
we can suppose that wide hidden areas will be more visible to the user and
thus, more detrimental to usability. In the case of the random rectangles
algorithm, the parameters that have an impact on the size of the hidden
areas are w, h, and U.
• Real-time: Large black rectangles or vertical bars should allow image files
to be generated at the highest speed. On the other hand, the use of blur or
hierarchical patterns should lower the achievable frame rate.
The hypotheses made in this section have been assessed through a rigorous and
in-depth testing phase which is the subject of Chapter 9.
9 | Evaluation and Security Analysis of the Proposed Countermeasures
Different tests were conducted to compare the proposed algorithms based on four
key criteria (Section 9.1). After presenting our methodology for evaluating our
algorithms using these criteria (Section 9.2), we show the results we obtained
(Section 9.3) and discuss them (Section 9.4).
As explained in Section 8.2, our approach does not aim at making malicious
screenshot exploitation impossible because it is possible to reconstitute the origi-
nal screen content from incomplete screenshots.
Therefore, the security of our model depends on the number of screenshots the
adversary would need to reach their objective.
Two kinds of attacker goals can be distinguished: obtaining all the content dis-
played on the screen and obtaining a particular piece of sensitive information
displayed on it.
In the first case, we look for the number of screenshots needed to get the whole
screen content. Due to the random nature of our algorithms, this number will vary
from one execution to the other. Let P1_n(A) be the probability that the attacker
will need n screenshots or fewer to read the whole screen content when algorithm
A is used. Our objective is to find the maximum N such that:
In the second case, we look for the number of screenshots needed to obtain a par-
ticular piece of sensitive information displayed on the screen. Let P2_n(A) be the
probability that the attacker will need n screenshots or fewer to read the sensitive
information when algorithm A is used. Our objective is to find the maximum N
such that:
9.1.2 Usability
In order for our approach to be scalable and usable in a general way, legitimate
screenshot-taking applications must remain usable even with altered screenshots.
Contrary to existing works, we concretely measure usability with actual users
rather than with theoretical measures such as peak signal-to-noise ratio (PSNR).
• Reading time increase: ∆t(A) = (t(A)/T − 1) × 100, where t(A) is the
reading time with algorithm A and T the reading time on the unaltered text.
• Usability score: A subjective feedback given by the user regarding reading
comfort. For each algorithm A, it consists of a grade s(A) ranging from
0 (very difficult and unpleasant) to 10 (very easy and pleasant).
These metrics were designed to measure the usability of our algorithms on text.
We chose not to include images in this usability test because we think it would
require further optimisation and a specific usability test which could be conducted
as future work.
9.1.3 Real-time
We have seen in section 8.2.3 that frequency is a crucial aspect for the usability of
our approach.
However, heavy on-the-fly computations can limit the maximum frequency
that can be reached with a given algorithm. To ensure a certain comfort of use,
it is essential that the algorithms used to hide parts of the screenshots allow real-
time execution. Indeed, the fluidity of the succession of images on the screen
should not constitute an obstacle to the use of the countermeasure.
Network bandwidth consumption is a further criterion. Since the screenshots are
altered with many parts hidden, good compression algorithms could considerably
reduce their size. We would therefore be able to send more screenshots with the
same network bandwidth consumption as when no countermeasure is used.
The objective is to determine, for each algorithm, to what extent we can increase
the frequency of screenshots without increasing network use.
In order to do this, we first had to determine the compression ratio r that we can
obtain with each algorithm (r = size of the compressed altered screenshot / size
of the compressed normal screenshot).
Then we computed: (FPS with normal screenshots) × 1/r.
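As a worked example (hypothetical sizes), with a compression ratio r of about 0.5, the same bandwidth carries roughly twice as many altered screenshots:

    def bandwidth_neutral_fps(base_fps, altered_size, normal_size):
        # r = compressed size of the altered screenshot divided by the
        # compressed size of the normal screenshot; sending base_fps / r
        # altered images per second uses the same bandwidth as sending
        # normal screenshots at base_fps.
        r = altered_size / normal_size
        return base_fps / r

    print(bandwidth_neutral_fps(11, 500, 1000))  # 22.0 FPS instead of 11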
9.2 Methodology
In this section, we present the methodology used to assess our models using the
criteria explained in Section 9.1.
Some aspects of our approach (security against a human adversary, usability) re-
quired user tests to be conducted.
We gathered 119 users and had them take a 20-minute test on our ‘Persistest’
website [254]. As this study involved human participants, we obtained the
approval of the Departmental Ethics Committee (reference number
CS_C1A_20_022). Before undertaking the test, the users gave their informed
consent [255].
As explained in our threat model (Section 3.4), regarding the malicious exploita-
tion of the stolen screenshots, two types of adversaries with different capabilities
can be found. The first kind of adversary exploits the screenshots manually and
the second kind uses algorithms to automatically exploit the screenshots. As man-
ual exploitation involves a human, it is more suited to targeted attacks whereas
automatic exploitation allows for large-scale attacks.
We have evaluated the security of our approach both against a human adversary
(Section 9.2.1.1) and against an automatic adversary (Section 9.2.1.2). Then, we
measured the impact of having to take more screenshots on malware detection
(Section 9.2.1.3).
In this section, the objective is to determine the number of screenshots the human
attacker would need to reach their objective.
Six algorithms have been tested to determine the impact of some parameters on
security:
In the case of human users, it was not necessary to test the impact of the hiding
pattern. Indeed, be it black, blur or column shuffling, the user will not be able to
read the hidden area.
Frequency was also irrelevant regarding security. Indeed, once the human attacker
has the screenshots, they can display them in any way they like to try to exploit
them, regardless of the frequency at which they were taken.
To simulate the first kind of adversary (Step 3 of the user tests), users were faced
with an incomplete screenshot of a text (the first one generated by the current
algorithm). Before starting Step 3, users received detailed instructions (Figure
9.1). The users had to click on ‘Unreadable’ if they were not able to read or infer
the whole text (Figure 9.2). This would add one incomplete screenshot, which is
alternated with the first screenshot every 25 ms. The user has to repeat
the operation until they are able to read or infer the whole text. In this case, they
click on ‘Readable’. This allows us to count the number of incomplete screenshots
they needed to read or infer the whole text.
Figure 9.2: Step 3 of the user test: determine the number of incomplete screenshots necessary for the user to be able to read the text
To simulate the second kind of adversary (Step 4 of the user tests), the procedure
is the same, but the user clicks on ‘Unreadable’ until they are able to see a specific
piece of information (and not the whole screen content). Before starting Step 4, users
received detailed instructions (Figure 9.3). To confirm that the user has indeed
acquired the desired information, they have to enter it instead of just clicking on a
button (Figure 9.4). Their input is compared to the expected result and they cannot
proceed to the next step until they enter the right information.
Figure 9.4: Step 4 of the user test: determine the number of incomplete screenshots necessary for the user to read a specific piece of information on the screen (example of a hotel reservation where the user must read the arrival date, departure date and city)
As presented in our threat model (Section 3.4), we assume that the attacker has at
their disposal state of the art OCR tools to exploit the stolen screenshots.
Note that a simple majority or minority rule would not allow the hidden areas to be
discarded. Indeed, when the pattern is periodical, the hidden zones have exactly
the same area as the visible ones. This implies that each given zone on the screen
is visible 50% of the time and hidden 50% of the time. When the pattern is
random, the probability of each area being hidden is initialised at 0.5 and changed
at each iteration. It is impossible to predict in advance whether a given area of the
screen will be visible or hidden most of the time. For example, in our experiments,
one area was found to be visible 8 times out of 20 screenshots whereas another
one was visible 13 times out of 20 screenshots.
The results have been used to iteratively improve our solution until we found the
best possible trade-off: forcing the attacker to take more screenshots while at the
same time letting the human eye read what is displayed on the screen.
Forcing malware programs to take more screenshots not only makes the attacker's
task harder, but also makes them more visible and thus more easily detectable.
Even if the malware program tries to hide by injecting itself into other running
processes or dividing itself between multiple processes, we will see an abnormally
high number of screenshots taken in the background.
9.2.2 Usability
The objective of our usability study was to measure the performance of several
obfuscation algorithms and parameter combinations according to the metrics
presented in Section 9.1.2.
Design
The independent variables of our experiment are the pattern used inside hidden ar-
eas, the algorithm used to determine areas to be hidden (with its different parame-
ters) and the frequency at which altered screenshots are shown. All these variables
are within-participants: each user was presented with all test cases. Even if this
choice reduced the number of test cases, we chose to have the maximum amount
of data for each case for the sake of statistical significance.
The dependent variables are the metrics presented in Section 9.1.2: ∆t(A), e(A)
and s(A).
In order to test the impact of each independent variable on the dependent variables,
we tested several parameter combinations. Parameters vary one at a time to ob-
serve the individual impact of each. For instance, when the hiding pattern varies,
the determination of hidden areas and the frequency of display remain constant.
Here again, we compared these methods by only varying the identification of hid-
den parts (same hiding pattern and same frequency).
We chose not to test periodical horizontal bars because it turned out that, as the
texts are arranged in lines, hiding complete lines was too detrimental to usability,
the user having to wait until the bar passes to resume reading the current line.
Materials
We used 10 texts of 350 characters to test our algorithms. These texts were ex-
tracted from 10 different BBC news articles. Each time a user took the test, the
texts were randomly assigned to the tested algorithms: different users tested algo-
rithms on different texts.
Our usability test did not require the participants to be physically present. The test
is deployed on a publicly accessible website [254]. As a result, participants took
the test in different physical conditions, either on mobile or desktop devices.
Participants
A first subset of 20 participants was recruited among Oxford students using snow-
ball sampling.
The rest of the users (99) were recruited through the SurveyCircle website.
Due to GDPR and ethics requirements, we did not collect any personal in-
formation on users (e.g. name, age, address) other than the fact that they had
normal vision and wore glasses when needed.
Procedure
The participants were first introduced to the objectives of the experiment and gave
their consent to participate.
Two scenarios were used to measure usability according to the three dependent
variables:
• In the first scenario (Step 1 of the user tests), the user is passive and simply
has to read different texts with the same number of characters (350). On
each text, a different algorithm is applied. One of the texts is displayed
normally for the sake of comparison.
Then, for each text, they had to click the ‘Start Reading’ button to see the
text (Figure 9.6).
This triggers a timer that stops when users click on the ‘I finished reading’
button (Figure 9.7).
After clicking on this button, users had to rate the visual comfort for the text
they just read (Figure 9.8).
At the end of Step 1, after reading all texts, users were given the possibility
to adjust all the grades (Figure 9.9).
This first scenario was designed to measure two metrics: reading time and
usability score.
• In the second scenario (Step 2 of the user tests), the user is active and must
enter different 5-digit codes displayed for 3.5 seconds with different algo-
rithms, plus one code displayed normally for comparison. This duration was
chosen because it seems to correspond to a reasonable time for an average
person to read and memorise a 5-digit code.
Then, for each algorithm, they had to click the ‘Show the Code’ button to
see a 5-digit code for 3.5 seconds (Figures 9.11 and 9.12).
Figure 9.11: Step 2 before clicking on the ‘Show the Code’ button
Figure 9.12: Step 2 after clicking on the ‘Show the Code’ button
After the 3.5 seconds, the code was hidden and users had to enter it to
proceed to the next algorithm (Figure 9.13).
This second scenario was used to measure the error rate. Let r(A) be an
integer ranging from 0 to 5 which represents the average number of digits
incorrectly entered by the users for algorithm A. The error rate is given by
the formula: e(A) = r(A)/5.
9.2.3 Real-time
Our objective was to see how many frames per second can be generated with each
of our algorithms.
For each algorithm, we varied the parameters and calculated the number of images
generated per second.
• RAM: 8 GB.
In addition to the width of the patterns, we varied the set of parameters specific to
the random rectangles algorithm (pattern height, maximum number of successive
hidden patterns in a row, maximum number of successive shown patterns in a row,
probability step) in order to explore the efficiency of all possible configurations.
The objective of this part of our experiments is to figure out how the various algo-
rithms that we studied affect screenshot file sizes. Indeed, an increase (respectively
decrease) in image size after an algorithm is applied could be a disadvantage
(respectively advantage) to its use.
To carry out this part of the experiment, we selected five images with different
properties in terms of colours, text and background.
The images correspond to screenshots resized to 800 × 450 px, and the TAR
and ZIP formats were used for compression.
9.3 Results
From a security point of view, that is, the difficulty for a human to visually extract
sensitive information from a screenshot, several interesting observations can be
made from Table 9.1. The first column represents the average number of screen-
shots necessary for a user to read a text, to which a protection algorithm has been
applied beforehand. The second column shows the number of screenshots the
users needed to obtain some specific piece of information contained in an image
to which algorithms were applied.
First, one can notice that, paradoxically, when looking for a specific piece of in-
formation, users systematically need more screenshots than when reading a whole
text. This can be explained by the fact that, when reading a meaningful text, hu-
man users can infer a substantial amount of hidden information, whereas, when
having to copy a date or an exotic city name (column 2), they have to see all the
characters without being able to infer them and therefore need more screenshots.
At first glance, it would seem that algorithms 4 and 5 perform very well in terms
of the number of screenshots needed by the user. However, this result should be
treated with caution because the periodic pattern moves slowly in these algo-
rithms. In fact, it would suffice to have the first screenshot and the one where
the pattern has completely moved (completely left the areas hidden on the first
screenshot). Thus, two well-selected screenshots would be enough to extract the
data.
With vertical rectangles (algorithm 1), users can more easily infer the hidden
information and need fewer screenshots, unlike the case of horizontal rectangles
(algorithms 2 and 3). When comparing more specifically algorithms 1 and 3, we
see that even with a lower E, a lower maximum number of contiguous visible rect-
angles and a higher maximum number of contiguous hidden rectangles, vertical
rectangles are less effective than horizontal ones.
Algorithms 2 and 3 are identical from all points of view except the parameter E
(the probability step). From the test results, we can see that Algorithm 3 requires
fewer screenshots. This is quite logical because, in Algorithm 3, E = 0.4, which
concretely means that if a pattern is hidden in screenshot number I, its probability
of being displayed in screenshot number I + 1 will increase by 0.4 instead of 0.05.
Therefore, the number of screenshots required for all areas to be visible at least
once is significantly lower.
We can see that, contrary to human attackers, OCR algorithms need more images
when vertical rectangles are used than when horizontal rectangles are used. This
might be because vertical rectangles are too thin to disturb human inference but are
sufficient to mislead OCR algorithms.
Table 9.2: Probability that a state of the art OCR will read the full text (350
characters) with different numbers of incomplete screenshots.

Number of screenshots                                   6      7      8      9      10
Algo 1 - Random rectangles (10px, 75px, 7, 3, 0.05)     8%     16%    20%    48%    100%
Algo 2 - Random rectangles (100px, 10px, 6, 4, 0.05)    12%    48%    100%   100%   100%
Table 9.3 focuses on the case where the attacker’s objective is to get a specific
piece of information from the screen (in this case from the same hotel bookings
as for human adversaries).
Table 9.3: Probability that a state of the art OCR will extract a specific piece of
information (350 characters) with different numbers of incomplete screenshots.

Number of screenshots                                   2      3      4      5      6      7
Algo 1 - Random rectangles (10px, 75px, 7, 3, 0.05)     14%    22%    58%    76%    86%    100%
Table 9.4 shows that, even if the improvement is relatively small, forcing malware
programs to take more screenshots makes the false negative rate drop to 0 (all
malware programs are detected as malicious). In parallel, the number of legitimate
applications classified as malware is lowered as well.
9.3.2 Usability
• Passive user: The results obtained are summarised in Table 9.5. The first
column shows the reading time increase for each algorithm compared to an
unaltered text (∆t(A)). The second column is the average usability score
given by the users (s(A)).
We can also notice that a shorter reading time is not always correlated with
a better usability (algorithms 4 and 6).
The parameters of the algorithms appearing in Table 9.5 have been chosen to
be able to accurately compare the impact of certain factors on the usability.
The results of algorithms 1, 2, 3 and 5 show that blurring areas with random
overlapping circles instead of random rectangles produces a lower usability
on both measures whereas we anticipated that overlapping patterns would
provide smoother transitions and therefore better usability. This may be due
to the fact that the snowfall algorithm is completely random and does not
use constraints to make sure that a given area will be more likely to appear if
it is hidden in a given iteration. Algorithms 1, 2, 3 and 5 also confirm that,
even if areas are randomly hidden, and thus not necessarily visible more
than half of the time, the reading time increase compared to a clear text
only ranges from -4% to 15%.
Algorithms 1 and 6 are the same in all respects except the use of blurring in
Algorithm 1 and a simple black colour in Algorithm 6 to hide parts of the
image. We see that the use of the black colour increases the reading time by
about 10% and, at the same time, halves the usability score, which marks a
significant discomfort caused to users by the black colour.
In the same way, we can see, by comparing the results of algorithms 2 and
7, that the use of mixed rectangles instead of blurred ones significantly im-
pacts usability, increasing the reading time and reducing the usability score.
This poor usability of the mixed pattern can be explained by the fact that it
is harder to distinguish from real text than blur. This can cause some
confusion for the user and increase their reading time.
Regarding periodical patterns (algorithms 4 and 8), we notice that they pro-
duce among the highest increases in reading time, second only to random
mixed rectangles. This is due to the fact that the user must wait for the
pattern to move to be able to read the text under it. We can notice that the
subjective grade given by the users is also lower than for algorithms 1, 2, 3
and 5. Moreover, contrary to the random algorithms, the vertical bars algo-
rithm allows each area to be visible half of the time. This does not correlate
with improved usability.
• Active user: All of the 119 users correctly returned the five characters for
each algorithm. This shows a certain relevance of the parameters selected
for each algorithm, and also that, over a reasonable duration (3.5 seconds),
human users have no trouble reading the masked data with any of the tested
algorithms.
9.3.3 Real-time
The number of images generated per second for each algorithm is given in Table
9.6.
As expected, the number of generated images is always higher when the black
pattern is used, as it requires fewer computations. The two other patterns produce
similar results, except for the concentric circles algorithm, for which the use of
the mixed pattern drastically decreases performance.
The results, which oscillate around 20 images per second, exceed the average
frequency currently used by legitimate screenshot-taking applications (11 FPS).
This corresponds to a period of 50 ms, which is compatible with the usability
tests presented earlier.
We applied each (hiding algorithm, pattern) pair to each of the five images
listed in Section 9.2.4. The file sizes obtained allowed us to calculate the
compression rates shown in Table 9.7.
It appears quite clearly in Table 9.7 that the black pattern produces, in almost
all cases and with all the algorithms, the best compression rate compared to the
other patterns. This is a relatively predictable finding.
We can note that, apart from those involving the concentric circles algorithm and
the mixed pattern, all the other combinations give interesting compression rates.
These rates are even particularly good with the black filter, reaching 52%, which
means that the size of the returned screenshot is about half the size of the original
image.
However, in the case of video formats, our approach would increase the amount
of data to be sent over the network. Indeed, in such formats, individual frames are
not compressed separately. Instead, only the changes between successive frames
are encoded. Therefore, with different parts of the screen being hidden in each
frame, there would be more changes to record.
9.4 Discussion
Our experiment was carried out on four criteria which are of major importance
for any effective widespread countermeasure against screenshot-taking spyware.
Each of these criteria has brought out a subset of efficient (algorithm, pattern)
pairs and, often, another subset leading to degraded performance.
From a security point of view, the random rectangles and the snowfall algorithms
have proven to achieve the best performance in terms of the number of screen-
shots required for the reading of a text or the extraction of a piece of sensitive
information. More specifically, for the random rectangles algorithm, in the case
of a human reader, horizontal rectangles were more effective whereas, in the case
of an OCR, vertical rectangles consistently yielded better results.
On the usability criterion, the random blur rectangles, vertical blur bars and blur
snowfall algorithms obtained the best scores from the panel of testers. This score
should nevertheless be put in perspective for the vertical bars algorithm as it sig-
nificantly increases reading time.
Regarding bandwidth, the vertical bars algorithm offers the best compression
rate. The random rectangles and snowfall algorithms present interesting compres-
sion rates as well, potentially allowing the number of images sent to be increased
by 20%.
The random rectangles algorithm with the blur pattern seems to produce the best
results when considering all the criteria. It is therefore a relevant choice as a coun-
termeasure to spyware taking screenshots. However, our multiple experiments
with various parameter configurations of this algorithm show that the parameter
values must be chosen with the greatest care.
In the future, we could make our approach more robust by preventing automatic
screenshot merging. This would involve using techniques to raise the OCR confi-
dence level of hidden areas, while lowering the confidence level of visible areas.
If this can be achieved, the only way for an attacker to exploit stolen screenshots
would be to manually analyse them, which would drastically reduce the scope of
screenlogging attacks.
10 | Conclusion
After recapitulating the contributions brought about in this thesis (Section 10.1),
we acknowledge the limitations of our work (Section 10.2), and conclude with
final remarks (Section 10.3).
10.1 Contributions
This study showed that there are several reasons explaining the lack of knowledge
on screenloggers. The main one is that, in the majority of cases, a specific event
such as the reception of a command, the opening of an application of interest, or
a number of mouse clicks is needed to trigger the screen capture functionality,
which explains why it is neglected in malware detection work.
Across the security reports, we identified recurring steps in the screenlogger op-
erating process. This allowed us to define criteria for establishing a novel screen-
logger taxonomy.
We built a first RF detection model using only API calls and network features from
the literature. This model was trained and tested using our malicious and benign
datasets. Using RFE with Gini importance, we identified the most informative
existing features for screenlogger detection.
Then, we built a second model including novel features adapted to the screenlog-
ging behaviour. These features were collected using novel techniques. Particu-
larly, we can cite:
• Making a correlation between the API calls made by an application and its net-
work activity. During their execution, the API calls and network activity of
our samples were simultaneously monitored. This allowed us to extract fea-
tures such as the reception of a network packet before starting the screenshot
activity or the sending of taken screenshots over the network.
When adding these novel features to the detection model, the detection accuracy
increased by 3.108%. Indeed, it is well known that a detection model based on
fewer features is less likely to overfit. Moreover, a detection model based on
features which have a logical meaning and reflect specific behaviours is less
prone to the evasion techniques often used by malware authors.
To account for the cases in which a screenlogger might not be detected, we de-
signed and implemented a novel technique to mitigate malicious screenshot ex-
ploitation, with the strong constraint of not requiring any action from the user.
This work is the first to impose such a constraint; existing works focus on pre-
venting screenshot exploitation in sensitive scenarios using intrusive techniques
that are impossible to deploy at a large scale.
We analysed the security of our approach against a human adversary and against
an automatic adversary. For the latter case, we implemented an adversarial al-
gorithm dedicated to merging incomplete screenshots depending on the pattern
used to hide parts of the screen. Then, we counted the number of altered screen-
shots necessary for an adversary, with state of the art OCR tools at their disposal,
to reach their goal in two scenarios: a whole text and a piece of sensitive infor-
mation. Using these results, we conducted experiments to demonstrate that an
attacker forced to take more screenshots is more likely to be detected by our de-
tection system.
The usability of our approach was tested with a panel of 119 users with three met-
rics: reading time, reading accuracy and a subjective usability score. To the best
of our knowledge, this is the first time an approach based on retinal persistence
was tested with actual users.
Finally, the performance of our approach was tested on two additional criteria:
real time and network bandwidth.
10.2 Limitations
The work proposed in this thesis presents five main limitations:
Moreover, another constraint on the size of the dataset was the necessity to
interact with the samples at run time, which prevented their automatic exe-
cution. Indeed, even if we had millions of samples, it would not have been
possible to run them all and generate their API call and network reports be-
cause of the time-consuming requirement of interacting at run time to trigger
the screenshots.
• The lack of usability for images and videos: Currently, our model can gen-
erate 20 images per second, which is less than the rate of some legitimate
screenshot-taking applications (up to 45 FPS). This rate of 20 FPS would
be insufficient when screen-sharing a video. Moreover, the usability of our
algorithms has only been tested on textual content.
More generally, some of the conclusions we have drawn can be extended to other
fields. For instance, our study has shown the need to take particular care over
the functionalities triggered by malware samples while they are executed in a
secure environment. Moreover, the performance of our detection system shows
that a tailored approach might, in some cases, avoid overfitting and make detection
models more robust to evasion techniques. One last example is the usability results
we obtained regarding retinal persistence, which can be of use to anyone interested
in this property of the HVS.
However, even in sensitive contexts, it might not be possible to completely prevent
the screenshot functionality, particularly with the recent rise of telecommuting and
screen sharing. Therefore, in the future, one solution might be to make the screen
content impossible to exploit unless a specific device with a shared secret, such as
programmable glasses or a screen overlay, is used to view the screenshots.
References
[1] A. Drozhzhin, “The greatest heist of the century: hackers stole $1 bln.”
[Online]. Available: https://www.kaspersky.com/blog/billion-dollar-apt-
carbanak/7519/
[3] A. Bianchi, I. Oakley, and D. S. Kwon, “The secure haptic keypad: A tac-
tile password system,” in Proceedings of the SIGCHI Conference on Hu-
man Factors in Computing Systems, ser. CHI ’10. New York, NY, USA:
Association for Computing Machinery, 2010, pp. 1089–1092.
[5] J. Lim, “Defeat spyware with anti-screen capture technology using visual
persistence,” in Proceedings of the 3rd Symposium on Usable Privacy and
Security, ser. SOUPS ’07. New York, NY, USA: Association for Comput-
ing Machinery, 2007, pp. 147–148.
238
University of Oxford Balliol College
[8] M. Mitchell, A.-I. A. Wang, and P. Reiher, “Cashtags: Protecting the input
and display of sensitive data,” in Proceedings of the 24th USENIX Confer-
ence on Security Symposium, ser. SEC’15. USA: USENIX Association,
2015, pp. 961–976.
[10] J.-U. Hou, D. Kim, H.-J. Song, and H.-K. Lee, “Secure image display
through visual cryptography: Exploiting temporal responsibilities of the
human eye,” in Proceedings of the 4th ACM Workshop on Information Hid-
ing and Multimedia Security, ser. IH&MMSec ’16. New York, NY, USA:
Association for Computing Machinery, 2016, pp. 169–174.
239
University of Oxford Balliol College
[13] S. Park and S.-U. Kang, “Visual quality optimization for privacy protection
bar-based secure image display technique,” KSII Transactions on Internet
and Information Systems, vol. 11, pp. 3664–3677, 07 2017.
[17] G. Zhao, K. Xu, L. Xu, and B. Wu, “Detecting apt malware infections based
on malicious dns and traffic analysis,” Access, IEEE, vol. 3, pp. 1132–1142,
01 2015.
[18] S. Bahtiyar, “Anatomy of targeted attacks with smart malware: Targeted at-
tacks with smart malware,” Security and Communication Networks, vol. 9,
02 2017.
240
University of Oxford Balliol College
2015_the-great-bank-robbery-carbanak-cybergang-steals--1bn-from-100-
financial-institutions-worldwide
[22] S. David, E. and P. Nicole, “Bank hackers steal millions via malware.”
[Online]. Available: https://www.nytimes.com/2015/02/15/world/bank-
hackers-steal-millions-via-malware.html
[24] Z. Charline, “Viruses and malware: Research strikes back.” [Online]. Avail-
able: https://news.cnrs.fr/articles/viruses-and-malware-research-strikes-
back
[26] S. Lukas, “New telegram-abusing android rat discovered in the wild, we-
livesecurity by eset.” [Online]. Available: https://www.welivesecurity.com/
2018/06/18/new-telegram-abusing-android-rat/
241
University of Oxford Balliol College
[35] Y. Fratantonio, C. Qian, S. Chung, and W. Lee, “Cloak and dagger: From
two permissions to complete control of the ui feedback loop,” 05 2017, pp.
1041–1057.
242
University of Oxford Balliol College
[37] PandaSecurity, “Watch out for chrome and firefox web extensions
that access browser history and rob passwords.” [Online]. Avail-
able: https://www.pandasecurity.com/en/mediacenter/malware/malicious-
web-extensions/
[38] S. Heule, D. Rifkin, A. Russo, and D. Stefan, “The most dangerous code in
the browser,” in Proceedings of the 15th USENIX Conference on Hot Topics
in Operating Systems, ser. HOTOS’15. USA: USENIX Association, 2015,
p. 23.
[39] L. Bauer, S. Cai, L. Jia, T. Passaro, and Y. Tian, “Analyzing the dangers
posed by chrome extensions,” 10 2014.
243
University of Oxford Balliol College
[44] C. Lin, H. Li, X. Zhou, and X. Wang, “Screenmilker: How to milk your
android screen for secrets,” in NDSS, 2014.
[47] S. Hwang, S. Lee, Y. Kim, and S. Ryu, “Bittersweet adb: Attacks and
defenses,” in Proceedings of the 10th ACM Symposium on Information,
Computer and Communications Security, ser. ASIA CCS ’15. New York,
NY, USA: Association for Computing Machinery, 2015, pp. 579–584.
244
University of Oxford Balliol College
https://virusshare.com/
245
University of Oxford Balliol College
246
University of Oxford Balliol College
[76] P. Shijo and A. Salim, “Integrated static and dynamic analysis for malware
detection,” Procedia Computer Science, vol. 46, pp. 804–811, 12 2015.
247
University of Oxford Balliol College
[88] C. Jing, Y. Wu, and C. Cui, “Ensemble dynamic behavior detection method
for adversarial malware,” Future Generation Computer Systems, vol. 130,
pp. 193–206, 2022. [Online]. Available: https://www.sciencedirect.com/
science/article/pii/S0167739X21004945
248
University of Oxford Balliol College
249
University of Oxford Balliol College
[101] S. Banin, A. Shalaginov, and K. Franke, “Memory access patterns for mal-
ware detection,” 2016.
250
University of Oxford Balliol College
[107] Yara, “The pattern matching swiss knife for malware.” [Online]. Available:
https://virustotal.github.io/yara/
[108] T. Liu, X. Guan, Y. Qu, and Y. Sun, “A layered classification for malicious
function identification and malware detection,” Concurrency and Compu-
tation: Practice & Experience, vol. 24, pp. 1169–1179, 08 2012.
[109] A. Hellal and L. Romdhane, “Minimal contrast frequent pattern mining for
malware detection,” Computers & Security, vol. 62, 06 2016.
251
University of Oxford Balliol College
[119] V. Roth and K. Richter, “How to fend off shoulder surfing,” Journal of
Banking & Finance, vol. 30, pp. 1727–1751, 02 2006.
252
University of Oxford Balliol College
[126] H. Sun, S. Chen, J. Yeh, and C. Cheng, “A shoulder surfing resistant graph-
ical authentication system,” IEEE Transactions on Dependable and Secure
Computing, vol. 15, no. 2, pp. 180–193, 2018.
253
University of Oxford Balliol College
[131] F. Brudy, D. Ledo, and S. Greenberg, “Is anyone looking? mediating shoul-
der surfing on public displays (the video),” in CHI ’14 Extended Abstracts
on Human Factors in Computing Systems, ser. CHI EA ’14. New York,
NY, USA: Association for Computing Machinery, 2014, pp. 159–160.
254
University of Oxford Balliol College
[136] C. Tangmanee, “Effects of text rotation, string length, and letter format
on text-based captcha robustness,” Journal of Applied Security Research,
vol. 11, pp. 349–361, 07 2016.
[141] Raymond, “What is the best anti keylogger and anti screen capture
software?” [Online]. Available: https://www.raymond.cc/blog/what-is-
the-best-anti-keylogger-and-anti-screen-capture-software/2/
255
University of Oxford Balliol College
[147] H. Quigley, A. Brown, J. Morrison, and S. Drance, “The size and shape of
the optic disc in normal human eyes.” Archives of ophthalmology, vol. 108
1, pp. 51–7, 1990.
[152] S. Lehar, “The world in your head : A gestalt view of the mechanism of
conscious experience,” 2003.
256
University of Oxford Balliol College
[153] A. Desolneux, L. Moisan, and J. Morel, “Gestalt theory and computer vi-
sion,” 2004, pp. 71–101.
[155] Cwac-security, “About the flag secure child window issue.” [On-
line]. Available: https://github.com/commonsguy/cwac-security/blob/
master/docs/FLAGSECURE.md
[161] S. Heron, “The rise and rise of the keyloggers,” Network Security, vol.
2007, pp. 4–6, 06 2007.
257
University of Oxford Balliol College
[165] Shadow, “Shadow - kid’s key logger app.” [Online]. Available: https:
//play.google.com/store/apps/details?id=simpllekeyboard.main&hl=en
[167] O. Wiese and V. Roth, “See you next time: A model for modern shoulder
surfers,” in Proceedings of the 18th International Conference on Human-
Computer Interaction with Mobile Devices and Services, ser. MobileHCI
’16. New York, NY, USA: Association for Computing Machinery, 2016,
pp. 453–464.
[168] S. Son and V. Shmatikov, “The hitchhiker’s guide to dns cache poisoning,”
vol. 50, 09 2010, pp. 466–483.
258
University of Oxford Balliol College
259
University of Oxford Balliol College
[183] X. Li, J. Smith, T. Dinh, and M. Thai, “Privacy issues in light of reconnais-
sance attacks with incomplete information,” 10 2016, pp. 311–318.
260
University of Oxford Balliol College
[191] W. Zack, “Many popular iphone apps secretly record your screen without
asking.” [Online]. Available: https://techcrunch.com/2019/02/06/iphone-
session-replay-screenshots/?guccounter=1
261
University of Oxford Balliol College
[197] F. Matthieu and B. Jean-Ian, “Read the manual: A guide to the rtm
banking trojan.” [Online]. Available: https://www.welivesecurity.com/wp-
content/uploads/2017/02/Read-The-Manual.pdf
262
University of Oxford Balliol College
[207] I. Ionut, “Russian hackers hide zebrocy malware in virtual disk images.”
[Online]. Available: https://www.bleepingcomputer.com/news/security/
russian-hackers-hide-zebrocy-malware-in-virtual-disk-images/
263
University of Oxford Balliol College
able: https://unit42.paloaltonetworks.com/unit42-patchwork-continues-
deliver-badnews-indian-subcontinent/
[215] L. Bryan and F. Robert, “Magic hound campaign attacks saudi targets.”
[Online]. Available: https://unit42.paloaltonetworks.com/unit42-magic-
hound-campaign-attacks-saudi-targets/
[216] R. Vicky and H. Kaoru, “New malware ‘rover’ targets indian ambassador
to afghanistan.” [Online]. Available: https://unit42.paloaltonetworks.com/
new-malware-rover-targets-indian-ambassador-to-afghanistan/
264
University of Oxford Balliol College
[225] B. Parys, “The key boys are back in town.” [Online]. Available: https:
//vx-underground.org/archive/APTs/2017/2017.11.02(2)/Keyboys.pdf
[233] XpressTex, “How remote computer repairs can help you!” [Online].
Available: https://www.xpresstex.com.au/why-remote-computer-repairs/
265
University of Oxford Balliol College
[239] M. Polly and M. Anders, “Employers deploy spy software to monitor at-
home workers.” [Online]. Available: https://www.insurancejournal.com/
news/national/2020/03/27/562594.htm
266
University of Oxford Balliol College
[248] L. Breiman, “Random forests,” Mach. Learn., vol. 45, no. 1, pp. 5–32, Oct.
2001.
267