Efficient Code Obfuscation For Android: Faculty of Science, Technology and Communication
Author: Alexandrina Kovacheva
Supervisor: Prof. Alex Biryukov
Reviewer: Prof. Jean-Sébastien Coron
Advisor: Dr. Ralf-Philipp Weinmann
August 2013
Declaration
I, Alexandrina Kovacheva, declare that this thesis titled, “Efficient Code Obfuscation
for Android” and the work presented in it are my own. I confirm that:
This work was done wholly while in candidature for a master's degree at the
University of Luxembourg.
Where I have consulted the published work of others, this is always clearly at-
tributed.
Where I have quoted from the work of others, the source is always given. With the
exception of such quotations, this thesis is entirely my own work.
Where the thesis is based on work done by myself jointly with others, I have made
clear exactly what was done by others and what I have contributed myself.
Signed:
Date:
Acknowledgements
I would like to thank my two supervisors for trusting me to work on this topic without
me having prior knowledge on the subject and for guiding me through the way. The last
six months have been the most self-growing period of my master studies. I learned a lot
and I had fun doing so.
I would also like to thank the brave hearted, adventurous and self-taught musicians in
my life. Your music inspires me, it makes my days. Without you my life is a cappella.
Abstract
Recent years have witnessed a steady shift in technology from desktop computers to
mobile devices. In the global picture of available platforms, Android stands out as a
dominant participant on the market and its popularity continues to rise. While beneficial
for its users, this growth simultaneously creates a fertile environment for exploitation
by malicious developers who write malware or reuse software illegally obtained by reverse
engineering. A class of programming techniques known as code obfuscation targets the
prevention of intellectual property theft by passing an input application through a set of
algorithms that aim to make its source code computationally harder and more time-consuming
to recover. This work focuses on the development and application of such algorithms on
the bytecode of Android, Dalvik. The main contributions are: (1) a study on samples
obtained from the official Android market which shows how feasible it is to reverse a
targeted application; (2) a proposed obfuscator implementation whose transformations
defeat current popular static analysis tools while maintaining a low level of added time
and memory overhead; (3) an attempt to initiate a discussion on what techniques known
from the x86 architecture can(not) be applied on Dalvik bytecode and why.
Contents
Introduction
1.1 Android architecture overview
1.2 The Android package file in details
1.2.1 APK structure
1.2.2 APK build and installation processes
1.2.3 DEX file format overview
1.3 Android security overview
Final Remarks
5.1 Remarks on obfuscating Dalvik bytecode
5.1.1 Static obfuscation techniques
5.1.2 Dynamic obfuscation techniques
5.2 Discussion
Appendix
Introduction
Ever since the early 1990s, devices combining telephony and computing have been offered
for sale to the general public. In 1997, the term smartphone was introduced for the first
time with the release of Ericsson’s GS88 “Penelope” [44]. Although one might quip that
smartphones are merely sixteen years old, their rapid development and extensive usage
nowadays are indisputable. A report from February 2013 estimated that the total number of
smartphone devices sold in 2012 alone surpassed 1.75 billion units with a record peak
in the last quarter [21].
In addition to making and receiving calls, smartphones allow their users to generate, store
and share multimedia by accessing the Internet through various applications. Tablet
computers, another class of mobile devices, offer similar functionality. Due to their
wide-ranging applicability and high mobility, both smartphones and tablets have been preferred
over stationary or laptop computers as access devices to personal information services
such as e-mail, social network accounts or e-commerce websites. These services are easily
made available to the end user via online mobile application markets. By the end of
2012, the market was dominated with a ratio of 70% by the Android platform [25].
This huge market share as well as the sensitivity of the user data processed by most
applications raise an important security question regarding the source code visibility
of the developed mobile software. Firstly, developers have an interest in protecting
their intellectual property against piracy. Moreover, an alarming 99% of the mobile
malware developed in 2012 has been reported to target Android platform users and
inspections reveal both qualitative and quantitative growth [20]. In terms of quality,
Android malware has evolved from applications sending SMS messages to premium-rate
numbers without the user’s authorization to sophisticated code that is able to infect
legitimate applications and propagate via Google Play (the official Android market) [7].
Hence, Android application code protection is crucial to maintaining a high level of trust
between vendors and users which in turn reflects in a correct functioning of the Google
Play market itself.
In general, there are two main approaches towards software protection: enforcing legal
software usage policies or applying various forms of technical protection to the code. This
work concentrates on the latter, more precisely on a technique called code obfuscation. In
the context of information security the term obfuscation encompasses various deliberately
made modifications on the control-flow and data-flow of programs such that they become
computationally hard to reverse engineer by a third party. The applied changes should
be semantic preserving with ideally negligible or minor memory-time penalty. Prior to
elaborating on how to apply obfuscation on Android software, an introduction to the
platform fundamentals is necessary.
[Figure 1.1: Android system stack (labels recovered from the diagram). Applications; Application Framework: Activity Manager, Resource Manager, Package Manager, Telephony Manager; Linux Kernel: Sensor Drivers, Power Management.]
The underlying entity of the system is its kernel which bridges the hardware of the device
and the remaining software components. Being a Linux-based kernel, it allows remote
access to the device via a Linux shell as well as the execution of standard Unix com-
mands.
Going up one level in the system stack abstraction is the Dalvik Virtual Machine (DVM).
The DVM is highly tailored to work according to the specifications of the Android
platform. It is optimized for a CPU slower than that of a stationary machine and
works with relatively little RAM: 20 MB after the high-level system services
have started [5]. The DVM is register-based, differing from the standard Java Virtual
Machine (JVM) which is stack-based. Such a solution is motivated by the fact that
register-based architectures require fewer executed instructions than stack-based archi-
tectures. Although register-based code is approximately 25% larger than the stack-based,
the increase in the instructions fetching time is negligible: 1.07% extra real machine loads
[13]. Moreover, the Android OS has no swap space, so the virtual machine must
work without swap. Finally, mobile devices are powered by a battery, thus the DVM
is optimized to be as energy-efficient as possible. Besides being highly efficient, the
DVM is also designed to be replicated quickly because each application runs within a
“sandbox”: a context containing its own instance of the virtual machine, assigned a unique
Unix user ID.
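The claim that register-based code executes fewer instructions can be illustrated with a toy example. The sketch below is not Dalvik or JVM bytecode: the two miniature instruction sets and the helper names run_stack and run_register are hypothetical, chosen only to count the instructions needed for a = b + c on each kind of machine.

```python
def run_stack(program, variables):
    """Run a toy stack-machine program and return how many instructions executed."""
    stack = []
    executed = 0
    for op, arg in program:
        executed += 1
        if op == "load":        # push a variable onto the operand stack
            stack.append(variables[arg])
        elif op == "add":       # pop two operands, push their sum
            right = stack.pop()
            left = stack.pop()
            stack.append(left + right)
        elif op == "store":     # pop the result into a variable
            variables[arg] = stack.pop()
        else:
            raise ValueError(f"unknown stack opcode: {op}")
    return executed


def run_register(program, registers):
    """Run a toy register-machine program and return how many instructions executed."""
    executed = 0
    for op, dest, src1, src2 in program:
        executed += 1
        if op == "add":         # dest-then-source operand order
            registers[dest] = registers[src1] + registers[src2]
        else:
            raise ValueError(f"unknown register opcode: {op}")
    return executed


# a = b + c on the stack machine: two loads, one add, one store.
stack_vars = {"a": 0, "b": 2, "c": 3}
stack_count = run_stack(
    [("load", "b"), ("load", "c"), ("add", None), ("store", "a")], stack_vars
)

# The same statement on the register machine: a single three-operand add.
regs = {"v0": 0, "v1": 2, "v2": 3}
reg_count = run_register([("add", "v0", "v1", "v2")], regs)
```

The register instruction is wider because it names three operands, which matches the trade-off described above: fewer executed instructions at the cost of larger code.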
At the same abstraction level as the virtual machine are the native libraries of the system.
Written in C/C++, they permit low level interaction between the applications and the
kernel through Java Native Interface (JNI). Although a limited set has been shown on
Fig 1.1, the functionalities provided by these libraries expand to cover features such as
text rendering, application window management, drawing of 2D and 3D graphics etc. A
noteworthy library of this layer is SQLite since mobile applications often store a user’s
identifiable information in such a database which, if not protected adequately, might be
accessed by a third party for malicious purposes.
The next layer is the application framework which provides generic functionality to mobile
software through Android’s application programming interface (API). The following
are key structural concepts of Android applications:
Activity. The unitary concept which all applications are built upon. From a design
perspective, an activity corresponds to a single screen with a user interaction inter-
face. Each activity has standard defined methods for managing its lifecycle which
is initiated with the onCreate() method. The control between activities is inter-
changed by an “intent” which can be either direct or indirect depending on whether
the application invokes a concrete activity or calls external applications. It is ex-
actly the Activity classes of the application which are usually infected by malicious
software and thus must be properly protected.
Service. Services are application processes which most often run in background assum-
ing no user interaction is needed to keep them alive. They can also serve as supply
components from the current application to external ones. Malicious code can be
packed into a legitimate application by exploiting weaknesses of services which are
not managed adequately [7].
Content provider. Content providers are an interface for managing the access to a
structured set of data of the current or external applications. In addition to
encapsulating data, these components provide mechanisms for defining data security
[16].
Broadcast receiver. Broadcast announcements are made upon events which affect the
entire system such as an incoming phone call, a screen turn off or wireless avail-
ability. A broadcast receiver responds to such an announcement and is often used
to trigger the execution of malicious code [7].
The top layer of the Android OS stack is where custom applications are compiled, in-
stalled and executed. The file format of the install ready application is called Android
Package (APK) and all the mobile software is distributed over Google Play in this for-
mat. The APK format is a package management system based on the ZIP file archive
format. Further details about the contents of Android applications are provided in the
subsequent section.
To show that Android targets a wide range of devices, including resource-constrained
ones, the minimal device hardware requirements [13] are given in Table 1.1. Currently,
most smartphones and tablets far exceed the listed requirements.
Table 1.1: Minimal device hardware requirements
Feature            Requirement
Chipset            ARM-based
Memory             128 MB RAM; 256 MB Flash
External Storage   Mini or Micro SD
Primary Display    QVGA TFT LCD or larger, 16-bit color or better
Navigation Keys    5-way navigation with 5 application keys, power, camera and volume controls
Camera             2MP CMOS
USB                Standard mini-B USB interface
Bluetooth          1.2 or 2.0
META-INF
res Contains the raw resources of the application such as images and audio files.
classes.dex The container of the classes of the application in the Dalvik Executable
bytecode format. This file is of key importance: if not protected, the application’s
reversing is straightforward.
Although not obligatory, it is common for applications to have a lib directory with the
pre-compiled native code for a specific processor architecture.
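Because an APK is a ZIP-based archive, its top-level layout can be inspected with any ZIP library. The following is a minimal sketch using Python's standard zipfile module; the archive is built in memory with placeholder contents, and the entry names res/drawable/icon.png and META-INF/MANIFEST.MF as well as the helper names are assumptions chosen for illustration only.

```python
import io
import zipfile


def build_demo_apk():
    """Create an in-memory ZIP archive that mimics the layout of an APK (placeholder data)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as apk:
        apk.writestr("META-INF/MANIFEST.MF", "Manifest-Version: 1.0\n")
        apk.writestr("res/drawable/icon.png", b"placeholder image bytes")
        apk.writestr("classes.dex", b"placeholder bytecode")
    return buffer.getvalue()


def list_apk_entries(apk_bytes):
    """Return the sorted entry names of a ZIP-based package."""
    with zipfile.ZipFile(io.BytesIO(apk_bytes)) as apk:
        return sorted(apk.namelist())


entries = list_apk_entries(build_demo_apk())
has_bytecode = "classes.dex" in entries
```

Running the same list_apk_entries helper on the bytes of a real APK would list its actual entries, which is typically the first step a reverse engineer takes to locate classes.dex.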
On the Android platform, the build process differs after the point when the .class
files have been generated. Once generated, they are forwarded to the “dx” tool
which is part of the standard Android SDK. This tool converts all .class files into
a single classes.dex file, i.e. the .dex file is the sole bytecode container for all the
application’s classes. After it has been created, classes.dex is forwarded to the
ApkBuilder tool together with the application resources and shared object (.so) files
which, if present, contain native code. As a result, the APK archive is created and the
final compulsory step is its signing. Figure 1.2 shows the APK build process and the
possible obfuscation manipulations which are optional during the build stages. The next
chapter provides more details on bytecode analysis and protection.
[Figure 1.2: APK build process. .java files -> javac -> .class files -> dx -> classes.dex (bytecode) -> ApkBuilder (together with app resource files and .so files) -> APK file -> jarsigner. Optional obfuscation: (a) on source code, (b) on bytecode.]
Upon installation, two notable steps are performed: first the APK verification
and then the bytecode optimization. For security reasons, applications whose
legitimate signature or correct classes.dex structure cannot be verified are
rejected for installation by the OS. Once verified, the .dex file is forwarded for opti-
mization: a necessary step due to the high diversity of Android running hardware. Thus,
Dalvik executable is a generic file format which needs additional processing to achieve
best performance for the concrete device architecture. The command to manually in-
voke the optimizer is dexopt which outputs an .odex (optimized DEX) pre-processed
version of the classes.dex file and stores it locally in /data/dalvik-cache. The opti-
mization step removes the classes.dex from the original APK archive and loads in
memory the .odex file upon execution. This step occurs only once, during the initial run
of the application, which explains why the first launch is usually slower than
subsequent ones.
data repetition and applying implicit typing and labeling. Figure 1.3 shows the .dex
file structure and compares a .jar archive composed of multiple .class files with an
APK containing the same classes packed in a single .dex file. Also, the mappings from
the sections of the .class file to the ones in the .dex file are shown. Although not
depicted, the remaining .class files are mapped analogically.
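[Figure 1.3: a .jar archive of multiple .class files compared with an APK holding a single .dex file. Sections of each .class file: Header, heterogeneous constant pool, Class, Field, Method, Attributes. Sections of the .dex file: Header, string_ids constant pool, type_ids constant pool, proto_ids constant pool, field_ids constant pool, method_ids constant pool, Class Def Table, Method List, Data.]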
Each .class file has its own heterogeneous constant pool which may contain
duplicated data. For example, multiple methods which return variables of the same type,
say String, will result in a repeating Ljava/lang/String; in each of the method’s
signatures. The memory efficiency of a .dex file comes primarily from the type-specific
constant pools used to store the data. This means that in the previously given example,
the constant Ljava/lang/String; will be present only once in the type_ids pool
and will be referenced by each method using it. As a consequence, there are significantly
more references within a .dex file compared to a .class file. This optimized .dex
design ensures data granularity and allows compression as efficiently as up to 44% of the
size of an equivalent .jar archive [13].
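The deduplication effect of a shared, type-specific pool can be sketched as follows. This is a simplified model, not the real .class or .dex layout: the class names A and B, the descriptor lists and both helper functions are hypothetical.

```python
def per_class_pools(classes):
    """Model of .class files: every class keeps its own pool of type descriptors."""
    return {name: sorted(set(descriptors)) for name, descriptors in classes.items()}


def shared_type_pool(classes):
    """Model of a .dex file: one sorted pool of unique descriptors, referenced by index."""
    pool = sorted({d for descriptors in classes.values() for d in descriptors})
    index = {descriptor: position for position, descriptor in enumerate(pool)}
    references = {
        name: [index[d] for d in descriptors] for name, descriptors in classes.items()
    }
    return pool, references


# Type descriptors used by the methods of two hypothetical classes.
classes = {
    "A": ["Ljava/lang/String;", "I", "Ljava/lang/String;"],
    "B": ["Ljava/lang/String;", "Ljava/lang/Object;"],
}

separate = per_class_pools(classes)
separate_entries = sum(len(pool) for pool in separate.values())
pool, references = shared_type_pool(classes)
```

In this model Ljava/lang/String; is stored once per class in the separate pools but only once overall in the shared pool, at the price of every use becoming an index reference.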
Regarding the Dalvik bytecode, some general remarks on the instruction format are a
necessary prerequisite to the next chapters. As already mentioned, the DVM is
register-based. Registers are considered 32 bits wide to store values such as integers or floating
point numbers. Adjacent register pairs are used to store 64-bit values. There is no
alignment requirement for these register pairs [33]. If a method has N arguments, they land
in order in the last N registers of the method’s invocation frame [35]. The corresponding
instruction mnemonic of the method is formatted in a dest-then-source ordering for its
arguments. During the install-time optimization process, some instructions may be altered.
In total, there are 218 valid opcodes in use in Dalvik bytecode [33][34].
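The placement of arguments in the last registers of a frame can be made concrete with a small calculation. The sketch below assumes that a 64-bit argument occupies an adjacent register pair, as stated above; the function name argument_registers and the frame size used in the example are hypothetical.

```python
def argument_registers(registers_size, arg_widths):
    """Map each argument to its register(s) at the end of the invocation frame.

    registers_size: total number of registers in the frame.
    arg_widths: registers needed per argument, 1 for 32-bit and 2 for 64-bit values.
    """
    total = sum(arg_widths)
    if total > registers_size:
        raise ValueError("arguments do not fit into the frame")
    next_register = registers_size - total   # first register used by the arguments
    layout = []
    for width in arg_widths:
        layout.append(tuple(f"v{next_register + k}" for k in range(width)))
        next_register += width
    return layout


# A frame of 6 registers and a method taking (int, long, int):
# v0 and v1 remain for locals, the arguments occupy v2 to v5.
layout = argument_registers(6, [1, 2, 1])
```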
[Figure: communication channels between App01 and App02 (labels recovered from the diagram). Android IPC: Binders, Services, Intents, Content providers. Unprotected channels: Network sockets, Openly writable files. Internet.]
Reverse engineering and code protection are opposing processes, yet
neither can be classified as inherently good or bad; what differs is the intention of the agent
performing either action. From a “good” developer’s viewpoint, code protection is a
means towards intellectual property preservation and reverse engineering can be used to
detect malware. Flipping the coin, an adversary would use code protection to make their
malicious code analyst-resistant and perform reverse engineering to examine potential
applications as attack targets.
Either way, to recover the original code of an application, bytecode analysis is most
often used. By applying both dynamic and static techniques, it is possible to detect
an overprivileged application design, find patterns of malicious behavior or trace user
data such as login credentials. Dynamic analysis is the process of extracting the desired
information during runtime. This method requires simulation of the complete input
domain of the examined application to reach high precision in the evaluation of the
program behavior or to successfully track the desired data. By contrast, static analysis
is executed on raw bytecode. Usually, an automatic tool is run through the targeted
code and outputs an approximation of its control flow and data flow. The approximation
accuracy depends on the reverse engineering algorithms used by the analysis tool as well
as on what forms of technical protection the examined code has undergone. In the best
(or worst) case despite the applied protection on the input, the entire source code is
completely recovered.
dexdump Included as a part of the standard Android SDK, this is the most easily
accessible tool for a developer disassembling Dalvik bytecode [15]. The
implemented analysis algorithm is linear sweep, i.e. it traverses the bytecode and
expects each next valid instruction to directly follow the currently analyzed one. For
non-obfuscated code the disassembly will be successful; however, a modification
of the control flow can cause the recovery process to fail.
dedexer A disassembler tool for dex files [27]. Outputs the recovered bytecode in a
Jasmin-like syntax.
baksmali One of the most popular Dalvik bytecode disassemblers [32]. Due to the more
sophisticated underlying analysis algorithm, recursive traversal, the recovery rate of
baksmali is greater than that of the previously presented tools. The algorithmic
improvement lies in the fact that the next instruction need not immediately
follow the current one, i.e. jumps are successfully processed. However, this
approach only minimizes but does not eliminate the effects of some control flow
manipulations as will be shown later. Due to its popularity, baksmali is used by
multiple reverse engineering tools as a base disassembler, amongst which is the
well-known apktool.
dex2jar A binary file conversion tool which takes as its input a .dex file and generates
its corresponding .jar archive containing the extracted .class files [28]. To view
the source code, any Java decompiler such as JAD or JD-GUI can be used.
radare2 An interactive console tool for both bytecode disassembling and analysis which
allows very precise control from the user regarding the decompilation process [31].
For specific bytecode functions, decompilation is done with the integration of the
open-source boomerang decompiler. Besides the usage of recursive traversal, the
user may specify decompilation starting at a specific address. Because of this hybrid
approach, some obfuscation techniques that break other decompilers are reversible
with radare2, though not automatically.
androguard An analysis and disassembling tool processing both Dalvik bytecode and
optimized bytecode [26]. The tool has three different decompilers: DAD, DED and
JAD. The one used by default is DAD which is also the fastest due to the fact
it is a native decompiler. Its underlying algorithm is recursive traversal. Also,
androguard has a large online open-source database with known malware pat-
terns. Additional features such as measuring efficiency of obfuscators by comparing
a program with its obfuscated version, visualizing the application as a graph and
permissions examination are available as separate scripts.
dexter An online analysis tool [29] processing APK files and displaying a rich set of
results amongst which: application’s defined and used permissions; ratio of ob-
fuscated versus non-obfuscated code; ratio of internal versus external packages;
broadcast receivers and content providers etc. This tool also allows graph visualization
of the application and a full list of strings used by the application. Although
free to use, dexter keeps its code closed on the server side, and the only available
information about the underlying algorithms is that it currently performs
solely static analysis.
dexguard Introduced in June 2013, a set of scripts currently targeting mainly auto-
mated strings deobfuscation and recovery of the .dex file [6]. This tool has a
hybrid approach of dynamic and static analysis and is comprised of: (a) .dex file
reader, (b) Dalvik disassembler, (c) basic Dalvik emulator, (d) .dex file parser.
At the time of this work’s submission this tool is not publicly available, and
the developers plan to keep its code closed on the server side in the future.
IDA Pro A widely used commercial tool [12] for reverse engineering under multiple
supported architectures. IDA Pro has multiple features such as program graph
visualization and support of plug-ins which extend its standard functionality.
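The difference between the two disassembly strategies mentioned for the tools above, linear sweep and recursive traversal, can be sketched on a toy instruction set. The encoding below is hypothetical and much simpler than Dalvik: four opcodes, one-byte operands holding absolute addresses, and no error handling for invalid bytes.

```python
# Toy instruction set (not Dalvik): opcode -> (mnemonic, length in bytes).
OPCODES = {0x01: ("nop", 1), 0x02: ("goto", 2), 0x03: ("ret", 1), 0x04: ("if", 2)}


def decode(code, address):
    """Decode one instruction; two-byte instructions carry a one-byte target address."""
    name, length = OPCODES[code[address]]
    operand = code[address + 1] if length == 2 else None
    return name, length, operand


def linear_sweep(code):
    """Decode instructions strictly one after another, ignoring control flow."""
    listing, address = [], 0
    while address < len(code):
        name, length, operand = decode(code, address)
        listing.append((address, name, operand))
        address += length
    return listing


def recursive_traversal(code, entry=0):
    """Decode only addresses reachable by following the control flow from the entry."""
    seen, worklist = {}, [entry]
    while worklist:
        address = worklist.pop()
        if address in seen or address >= len(code):
            continue
        name, length, operand = decode(code, address)
        seen[address] = (address, name, operand)
        if name == "goto":
            worklist.append(operand)                      # unconditional jump
        elif name == "if":
            worklist.extend([operand, address + length])  # both branches
        elif name != "ret":
            worklist.append(address + length)             # fall through
    return [seen[address] for address in sorted(seen)]


# "goto 3" jumps over a junk byte (0x02) placed at address 2.
code = bytes([0x02, 0x03, 0x02, 0x01, 0x03])
swept = linear_sweep(code)
traversed = recursive_traversal(code)
```

Linear sweep decodes the junk byte as the start of a goto and swallows the real nop at address 3, while recursive traversal follows the jump and recovers the executed sequence.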
Evidently, there are numerous tools at the disposal of the reverse engineer which can be used
either separately or to complement each other. The same diversity cannot be claimed for
code protection software, which is presented in the following section.
ProGuard A Java source code obfuscator [30]. ProGuard performs variable identifiers
name scrambling for packages, classes, methods and fields. It shrinks the code size
by automatically removing unused classes, detects and highlights dead code, but
leaves the developer to remove it manually.
APKfuscator Another open-source bytecode obfuscation tool [41] which applies mul-
tiple variations of dead code injection.
DexGuard A commercial Android obfuscator [37] working both on bytecode and source
code level (not to be confused with the dexguard analysis tool). It performs various
techniques including string encryption, encryption of app resources, tamper detection
and removal of logging code.
The open-source bytecode obfuscation tools described here are proof-of-concept
software rather than tools used in regular practice by application developers.
To show the ease with which source code can be retrieved from Android mobile software,
a case study on applications including both legitimate and malware apps was performed
and the results are presented in the upcoming chapter.
2. (polynomial slowdown) The program size and running time of O(P) are at most
polynomially larger than those of P.
The following techniques are sorted in ascending order according to the computational
difficulty for their reverse engineering. Whenever a technique is used by an obfuscation
tool, this is explicitly noted with accompanying details on the concrete implementation.
Identities name scrambling. This technique affects the layout of the program and
can be implemented both on source code and bytecode level. Its purpose is to obfus-
cate the program on an abstract level by replacing the meaningful names of variables,
methods, classes, files with ones which provide no metadata information regarding the
code. Identities name scrambling is implemented both in ProGuard and in APKfuscator
with some major differences. ProGuard works on Java source code and uses replace-
ment with minimal lexical-sorted strings {a, b, c, ..., aa, ab, ...} to have
little space penalty cost which is essential on mobile devices [24]. APKfuscator works
on bytecode level and exploits the Unix filesystem restriction that a class name should
not exceed 255 characters [42]. This exploit is possible also on Dalvik bytecode due to
the class definition item structure used in the .dex file format [34]. As shown in
Figure 2.2.1, one may replace the class name with a larger one stored in the ubyte[] data
type constant. A .dex format requirement is to have all strings sorted alphabetically
without repeated entries [34]. Furthermore, any displacement of the
entries in the .dex header tables requires a corresponding offset change in all
references pointing to that particular table entry. To avoid such a risky manipulation,
APKfuscator implements name scrambling by simply appending data to the class name
without modifying its position in the constant pool table.
[Figure 2.2.1: reference chain from a class definition to its name. class_def_item (class_idx, access_flags, superclass_idx, ...) -> type_ids (descriptor_idx) -> string_id_item (string_data_off) -> string_data_item (utf16_size: uleb128, data: ubyte[]).]
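The replacement sequence {a, b, c, ..., aa, ab, ...} can be generated lazily. The sketch below only illustrates the idea of handing out the shortest unused name first and is not ProGuard's actual implementation; the identifiers being renamed and the function names are hypothetical.

```python
import itertools
import string


def short_names():
    """Yield a, b, ..., z, aa, ab, ...: the shortest names first, in lexical order."""
    for length in itertools.count(1):
        for letters in itertools.product(string.ascii_lowercase, repeat=length):
            yield "".join(letters)


def scramble(identifiers):
    """Map each meaningful identifier to the next unused short name."""
    names = short_names()
    return {identifier: next(names) for identifier in identifiers}


first_names = list(itertools.islice(short_names(), 28))
mapping = scramble(["LoginActivity", "checkPassword", "userToken"])
```

The mapping has to be applied consistently to every definition and reference of an identifier; the short names carry no metadata and keep the space penalty small.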
Encoding manipulations. This transformation regards both the file layout and the
data structures of the program. By specification, the byte ordering in the .dex format is
little-endian. The ARM Architecture Reference Manual [2] states that ARM processors
support mixed-endian access in hardware, meaning that they can operate in either little-
endian or big-endian modes. Hence, the DVM verifier is supposed to be able to detect
the encoding of the interpreted .dex file and convert big-endian to little-endian and vice
versa. While changing the encoding is not hard to implement, it has been suggested as
potentially effective since the majority of the Dalvik bytecode analysis tools work only
with little-endian encoded files [42].
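How a tool could recognise the byte ordering of a .dex file can be sketched with the endian_tag field of the header. The constants and the field offset below follow the published .dex format description (endian_tag at byte offset 40, value 0x12345678, which reads as 0x78563412 when the file is byte-swapped); the helper names and the zero-filled demonstration headers are assumptions for illustration.

```python
import struct

HEADER_SIZE = 0x70              # fixed header length given in the .dex documentation
ENDIAN_TAG_OFFSET = 40          # after magic, checksum, signature, file_size, header_size
ENDIAN_CONSTANT = 0x12345678
REVERSE_ENDIAN_CONSTANT = 0x78563412


def dex_byte_order(header):
    """Report the byte order of a .dex header by reading endian_tag as little-endian."""
    (tag,) = struct.unpack_from("<I", header, ENDIAN_TAG_OFFSET)
    if tag == ENDIAN_CONSTANT:
        return "little"
    if tag == REVERSE_ENDIAN_CONSTANT:
        return "big"
    raise ValueError("unrecognised endian_tag")


def demo_header(byte_order):
    """Build a zero-filled header carrying only an endian_tag (illustration only)."""
    header = bytearray(HEADER_SIZE)
    fmt = "<I" if byte_order == "little" else ">I"
    struct.pack_into(fmt, header, ENDIAN_TAG_OFFSET, ENDIAN_CONSTANT)
    return bytes(header)


little = dex_byte_order(demo_header("little"))
big = dex_byte_order(demo_header("big"))
```

An analysis tool that skips this check and always reads little-endian values would misinterpret every multi-byte field of a byte-swapped file.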
Strings obfuscation. This technique is a well known data transformation applied often
on source code level. Although it is not implemented by any of the examined open-source
obfuscators, it is possible to adjust it to the level of Dalvik bytecode. String obfuscation
prevents the extraction of metadata information and is effective against static analysis.
Since many applications process personal data, it is rather common to store strings such
as user credentials in a database. However, the consequence of keeping the latter in
plaintext is making them an easy target for the reverse engineer. There is a signifi-
cant difference between obfuscating the strings of a program and scrambling the variable
names: changing the latter does not affect the semantics of the program. By contrast,
strings need to be on one hand encrypted to prevent static extraction and on the other
hand, they need to be available as plaintext during runtime such that a process like user
verification is performed successfully. Depending on whether obfuscation is applied on
source code or bytecode level the effort needed to obtain the plaintext string varies. What
can be done on source code level is passing the string s as an argument to an invertible
transformation function F: it is F(s) which is stored in the code. Whenever the plaintext
string is needed during runtime, the program returns F^{-1}(F(s)) = s. Hence,
performing string obfuscation requires the implementation of a custom encryption/decryption
algorithm or preferably, the usage of a standardized algorithm. On Android, with this
approach the encrypted strings will be stored in the string_ids constant pool, i.e.
the ciphertext would be visible to the reverser and obtaining the plaintext relies on the
hardness of breaking the encryption algorithm. As a remark to the latter, previous work
reveals usages of deprecated algorithms [18] as well as implementations of custom XOR
ciphers [46] which clearly are poor security practices. While theoretically possible, it
is not feasible to perform obfuscation by storing encrypted strings in the constant pool
on bytecode level. The entire string_ids table would have to be shuffled and later reassembled
such that (a) the ordering of the content is alphanumeric, (b) it does not contain repeated
entries and (c) all table reference offsets across the bytecode are fixed, which requires a huge
programming effort and is highly error-prone. An alternative improved
approach is converting each string first into a byte array, encrypting the bytes and storing
the encrypted bytes instead of the encrypted string. This makes it significantly harder
for a third party to obtain the plaintext since the encrypted bytes will no longer appear
in the string_ids constant pool forcing the reverse engineer to manually scan the
bytecode to discover the encrypted string.
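The round trip F^{-1}(F(s)) = s on a byte array can be sketched as follows. The transformation used here, a keystream derived with SHA-256 and XORed over the bytes, is only a self-contained placeholder and must not be read as a secure cipher; as noted above, a standardized algorithm is preferable. The key, the sample string and all function names are hypothetical.

```python
import hashlib


def keystream(key, length):
    """Derive `length` pseudo-random bytes from the key (placeholder, not a vetted cipher)."""
    out = bytearray()
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + counter.to_bytes(4, "big")).digest()
        counter += 1
    return bytes(out[:length])


def transform(data, key):
    """F: XOR the bytes with the keystream. F is its own inverse, so F^{-1} = F."""
    return bytes(a ^ b for a, b in zip(data, keystream(key, len(data))))


KEY = b"demo-key"   # hypothetical; in practice the key itself needs protection

# What the obfuscator would store in the code: a byte array instead of a string constant.
ENCRYPTED_PROMPT = list(transform("Enter password".encode("utf-8"), KEY))


def reveal(encrypted_bytes, key=KEY):
    """Recover the plaintext at runtime: F^{-1}(F(s)) = s."""
    return transform(bytes(encrypted_bytes), key).decode("utf-8")


plaintext = reveal(ENCRYPTED_PROMPT)
```

Stored this way, the protected value is a sequence of integers in the method body rather than an entry of the string constant pool.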
Dead code injection variants. Dead code injection is another transformation which
is borrowed from x86. It affects the control flow of the application and is implemented on
bytecode level by both dalvik-obfuscator and APKfuscator, each of the tools using
its own variation of the technique. In essence, this algorithm modifies the control flow
by inserting code which will never be executed, yet adds nodes and edges to the program
graph, which correspondingly increases its complexity. To guarantee that the execution will
not go through the introduced bogus paths, a conditional branch is used for redirection.
Thus, it is necessary that this condition is specially chosen to produce a result known
a priori to the programmer, yet one which is computationally hard to estimate at
runtime, i.e. it is either always true (directing to “good” paths) or always false (never
directing to “bad” paths). Such conditional constructs are called opaque predicates and
they have been used, among others, in Java source code obfuscation [11]. At bytecode
level, the dead code injection variants implemented in the two obfuscators use
instructions which are legitimately defined in the documentation but somewhat special.
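An opaque predicate guarding dead code can be sketched at source level. The predicate below relies on the arithmetic fact that the product of two consecutive integers is always even; it is a deliberately simple illustration rather than one that would resist serious analysis, and the function names and the PIN check are hypothetical.

```python
def opaquely_true(x):
    """Always True: x * (x + 1) is a product of consecutive integers, hence even."""
    return (x * (x + 1)) % 2 == 0


def check_pin(pin, x):
    """Original logic guarded by an opaque predicate; the second branch is dead code."""
    if opaquely_true(x):
        return pin == "1234"          # "good" path, always taken
    # Bogus path: never executed, but it adds nodes and edges to the program graph.
    return pin[::-1] == "0000"


always_true = all(opaquely_true(x) for x in range(-1000, 1000))
accepted = check_pin("1234", 7)
rejected = check_pin("0000", 7)
```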
[Figure: bytecode layout in two panels, (a) and (b). Recoverable labels: condition, fill-array-data, instr (repeated).]
Both linear sweep and recursive traversal algorithms fail to recover the correct bytecode
sequence because of the preceding opaque predicate. Linear sweep cannot handle any
“jumping” control flow manipulation. Recursive traversal will discover the presence of the
fill-array-data-payload instruction because of the condition, but will consider
it a legitimate branching leaving untouched the overlapped instructions. The result is
displaying the method internals as a sequence of bytes instead of source code.
In APKfuscator three different variations of dead code injection are implemented [42]:
(a) inserting illegal opcodes in dead code; (b) using legitimately defined opcodes into
“bad” objects; (c) injection of code in the .dex header by exploiting a discrepancy be-
tween the claims of the official .dex file format documentation and what the Dex Verifier
does in reality.
(a) Since the injected code will contain illegal opcodes, the Dex Verifier must be taken
into account when using this technique. To implement this variant successfully,
the illegal opcodes should be injected into classes which are not used in the
application, i.e. the dead code itself contains the illegal opcodes. If bad opcodes were
used in meaningful classes, the application would crash not being able to execute them.
Furthermore, the dead code should not be removed by the optimizer, otherwise the trans-
formation is meaningless.
(b) This injection variant exploits the fact that there exist multiple legitimate, but
unused Dalvik opcodes, e.g. 0xFC, 0xFD, 0xFE, 0xFF [33]. Consider the following
injected bytecode sequence:
1201 // load 0 in v1
3801 0300 // if v1 == 0 (always true), jump ahead
1A00 FF00 // load const-string at index 0xFF (not existing)
The verification of the above sequence succeeds since all opcodes are legitimate, but
because the value 0xFF does not correspond to any valid address, some disassembling
tools fail to recover the entire application, while others fail to process only the
obfuscated file [42].
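To make the byte layout of the sequence in (b) explicit, the following Python sketch decodes exactly those three instructions. It covers only the opcodes const/4 (0x12), if-eqz (0x38) and const-string (0x1A) as defined in the Dalvik bytecode documentation; the function name is hypothetical and this is not a general disassembler.

```python
import struct

# The injected sequence from variant (b), as raw bytes.
SEQUENCE = bytes.fromhex("1201" "38010300" "1A00FF00")


def decode_sequence(code: bytes):
    """Decode the three instruction formats used in the example."""
    listing, pc = [], 0
    while pc < len(code):
        op = code[pc]
        if op == 0x12:                        # const/4 vA, #+B (format 11n)
            reg = code[pc + 1] & 0x0F         # low nibble: destination register
            literal = code[pc + 1] >> 4       # high nibble: signed 4-bit literal
            if literal >= 8:
                literal -= 16
            listing.append(f"const/4 v{reg}, {literal}")
            pc += 2
        elif op == 0x38:                      # if-eqz vAA, +BBBB (format 21t)
            reg = code[pc + 1]
            (offset,) = struct.unpack_from("<h", code, pc + 2)
            listing.append(f"if-eqz v{reg}, {offset:+d}")
            pc += 4
        elif op == 0x1A:                      # const-string vAA, string@BBBB (21c)
            reg = code[pc + 1]
            (index,) = struct.unpack_from("<H", code, pc + 2)
            listing.append(f"const-string v{reg}, string@{index:04x}")
            pc += 4
        else:
            raise ValueError(f"opcode 0x{op:02x} is not handled by this sketch")
    return listing
```

Calling decode_sequence(SEQUENCE) yields the mnemonics const/4 v1, 0, then if-eqz v1, +3, then const-string v0, string@00ff, which matches the comments next to the bytes.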
(c) The third injection variant performed by APKfuscator is based on the observation
by the tool's author that there is an inconsistency between the official .dex file
format specification and what the Dex Verifier actually does. For the header_item the
documentation claims that the header has a fixed length of header_size = 0x70 [34].
Since Android is an open source platform, it is possible to review the code and observe
the following for the Dex Verifier:
2.2. BYTECODE PROTECTION TOOLS 15
Self modifying code. Self modifying code is a known code transformation applied
successfully on the x86 architecture whose purpose is to hinder dynamic analysis. Often
used by malware in combination with buffer overflow attacks, it has also found its
application in obfuscation techniques for legitimate software. Protecting a program
against static analysis results in a control flow which is more complex, yet identical
upon every execution.
By contrast, dynamic code changes have an effect at runtime altering the execution path
The applicability of executable compression, self modifying code as well as other known
dynamic obfuscation algorithms on Android bytecode is discussed in the final chapter of
this work. An obfuscation technique often needs to be designed with a balance between
the added program complexity and the robustness of the modified code against analysis.
In this respect, dynamic obfuscation techniques increase resilience considerably, but it
can be a challenge to apply them uniformly on an input APK file, which is why a chapter
is dedicated to that topic.
The next chapter presents a case study whose purpose is to justify the claim that current
analysis tools are powerful enough to analyze free applications retrieved from Google
Play. We also show that a very small proportion of the examined files are deliberately
preprocessed to resist analysis.
A Case Study on Applications
There exists an extensive set of works examining applications from the viewpoint of
privacy invasion, as was cited in the Introduction chapter. The current case study aims
to show that bytecode undergoes little protection. If present, obfuscation is very
limited with regard to the potential transformation techniques which could be applied,
even for apps which were found to protect their code. The study was performed in two
stages. Initially, automated static analysis scripts were run on bytecode for a coarse
classification, the purpose of which was profiling the apps according to a set of chosen
criteria. A secondary, fine-grained examination consisted of manually selecting a few
“interesting” apps and looking through the code at hand. All applications studied were
available through the official Google Play market as of March 2013.
ing it (left to be analyzed entirely manually) while the other two decompilers hindered
significantly.
The following criteria were used to profile the apps:
3. Dynamic loading. Dynamic loading allows invocation of external code not installed
as an official part of the application. It has been discovered as a technique
applied in practice by applications executing malicious code [19]. For the initial
automation phase its presence was detected only by pattern matching the classes
against the packages:
Ldalvik/system/DexClassLoader
Ljava/security/ClassLoader
Ljava/security/SecureClassLoader
Ljava/net/URLClassLoader
4. Native code. The class definition table was filtered for the usage of code accessing
system-related information and resources or interfacing with the runtime
environment. For the coarse run, only the presence of native code in the following
packages was considered:
Ljava/lang/System
Ljava/lang/Runtime
5. Reflection. The class definition table was filtered for the presence of the Java
reflection packages for access to methods, fields and classes.
6. Header size. Referring to the bytecode injection possibility in the .dex header
by exploiting the discrepancy between the format documentation and the actual file
verification, the header size was also checked.
7. Encoding. A simple flag check in the binary file for whether an application uses
the mixed endianness support of the ARM processor.
Ljavax/crypto/
Ljava/security/spec/
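A coarse profiling pass of this kind can be sketched in a few lines of Python. The header field offsets and endian constants are those of the official .dex format documentation; everything else is an assumption made for illustration: the descriptors are searched as raw bytes in the file rather than through the class definition table, the reflection pattern Ljava/lang/reflect/ is not named in the text, and the two crypto-related packages are assigned to the CRY flag used in the tables. The scripts actually used for the study are not reproduced here.

```python
import struct

HEADER_SIZE_OFFSET = 0x24            # header_item.header_size
ENDIAN_TAG_OFFSET = 0x28             # header_item.endian_tag
ENDIAN_CONSTANT = 0x12345678         # little-endian file
REVERSE_ENDIAN_CONSTANT = 0x78563412

# Flag names follow the table abbreviations; REF and CRY patterns are assumptions.
PATTERNS = {
    "DYN": [b"Ldalvik/system/DexClassLoader", b"Ljava/security/ClassLoader",
            b"Ljava/security/SecureClassLoader", b"Ljava/net/URLClassLoader"],
    "NAT": [b"Ljava/lang/System", b"Ljava/lang/Runtime"],
    "REF": [b"Ljava/lang/reflect/"],
    "CRY": [b"Ljavax/crypto/", b"Ljava/security/spec/"],
}


def profile_dex(data: bytes) -> dict:
    """Return a coarse profile of a classes.dex file given as bytes."""
    if data[:4] != b"dex\n":
        raise ValueError("not a .dex file")
    (tag,) = struct.unpack_from("<I", data, ENDIAN_TAG_OFFSET)
    if tag == ENDIAN_CONSTANT:
        order = "<"
    elif tag == REVERSE_ENDIAN_CONSTANT:
        order = ">"
    else:
        raise ValueError("unknown endian_tag")
    (header_size,) = struct.unpack_from(order + "I", data, HEADER_SIZE_OFFSET)
    profile = {"HEAD": header_size == 0x70, "LIT": order == "<"}
    for flag, needles in PATTERNS.items():
        profile[flag] = any(needle in data for needle in needles)
    return profile
```

Reading the bytes of a classes.dex file extracted from an APK and passing them to profile_dex gives one row of flags per application.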
All 1691 applications were profiled according to the criteria listed above. For the
malware-alert raising set of 94 apps, the initial automation also included the following:
After being processed according to the criteria listed above, the malware-alert
files were studied for similarity with over 200 available malware samples. Since file
comparison is a time-costly operation, to improve efficiency the malware samples
themselves were classified into clusters by comparing them with each other. This
“clusterification” reduced the initial set to 153 malware files, which in turn had a
noticeably positive impact on processing time. To summarize, in total the malware-alert
apps were processed in three stages: (a) general profiling; (b) coarse comparison to
determine the cluster to which they belong; (c) fine comparison with each application in
the cluster. For all similarity tests the androsim.py tool, part of androguard, was
used. Merely giving a similarity score with known malware based on static analysis is
not sufficient to classify an application as malicious, but because the primary topic of
this work is not related to malware detection and analysis, no further processing was
conducted. All 94 files were reported to Google together with the corresponding
accompanying information. As a result, 24 applications listed in the appendix were
removed from the market.
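The coarse-then-fine comparison scheme can be sketched as follows. This is an illustration under stated assumptions, not the procedure used in the study: the similarity score of androsim.py is replaced by a Jaccard index over arbitrary feature sets, the clustering is a simple greedy pass, and the threshold and all names are hypothetical.

```python
from typing import Callable, Dict, List, Set, Tuple

Features = Set[str]
Similarity = Callable[[Features, Features], float]


def jaccard(a: Features, b: Features) -> float:
    """Stand-in similarity score in [0, 1]."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def cluster(samples: Dict[str, Features], sim: Similarity,
            threshold: float) -> Dict[str, List[str]]:
    """Greedy clustering: map each representative to the members it covers."""
    clusters: Dict[str, List[str]] = {}
    for name, features in samples.items():
        for representative in clusters:
            if sim(features, samples[representative]) >= threshold:
                clusters[representative].append(name)
                break
        else:
            clusters[name] = [name]      # no close representative: new cluster
    return clusters


def best_match(app: Features, samples: Dict[str, Features],
               clusters: Dict[str, List[str]], sim: Similarity) -> Tuple[str, float]:
    """Coarse comparison with representatives, then fine comparison in one cluster."""
    representative = max(clusters, key=lambda r: sim(app, samples[r]))
    member = max(clusters[representative], key=lambda m: sim(app, samples[m]))
    return member, sim(app, samples[member])
```

The saving comes from the coarse stage: an application is compared with one representative per cluster instead of with every sample, and only the members of the closest cluster are compared in full.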
OBF   100%   (100 − 80]   (80 − 60]   (60 − 40]   (40 − 20]   (20 − 0)   0%      Total
#     82     291          196         166         283         423        250     1691
%     4.85   17.21        11.59       9.82        16.74       25.01      14.78   100%
Table 3.1: Obfuscation ratio. The row marked # gives the absolute number of
applications whose share of obfuscated classes falls in the given range. The row marked
% gives the percentage this number represents in the total set of applications.
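The percentage row of Table 3.1 follows directly from the absolute counts. A short Python check, using only the numbers of the table (the dictionary keys are plain-text renderings of the column labels):

```python
# Absolute counts from Table 3.1, keyed by obfuscation ratio range.
COUNTS = {
    "100%": 82, "(100-80]": 291, "(80-60]": 196, "(60-40]": 166,
    "(40-20]": 283, "(20-0)": 423, "0%": 250,
}
TOTAL = sum(COUNTS.values())


def percentages(counts):
    """Share of each range in the whole set, rounded to two decimals."""
    total = sum(counts.values())
    return {label: round(100 * n / total, 2) for label, n in counts.items()}
```

The counts add up to the 1691 profiled applications, and percentages(COUNTS) reproduces the % row of the table.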
Table 3.2: Profiling the set of applications according to the given criteria: OBF (total
obfuscated classes), B64 (number of apps containing base64 strings), NAT (number of
apps with native code), DYN (number of apps with dynamic code), REF (number of
apps with reflection), CRY (number of apps with crypto code), HEAD (number of apps
with header size of 0x70), LIT (number of apps with little endian byte ordering). The
row with # marks the absolute numbers of occurrences, % marks the percentage this
number represents in the set of the total applications.
     OBF     B64     NAT     DYN     REF   CRY     HEAD   LIT   REC     SER     PRO
#    1433    67      13      30      94    48      94     94    79      89      3
%    38.10   71.28   13.83   31.91   100   31.91   100    100   84.04   94.68   3.91
Table 3.3: Profiling the set of malicious applications according to the given criteria.
The annotations are analogous to the ones in Table 3.2 with the addition of: REC
(total number of applications having receivers), SER (total number of applications having
services), PRO (total number of applications having providers).
Table 3.4: Classification of the base64 encoded strings. Categories are denoted as follows:
TXT for text, MUL for multimedia, OTH for other.
3.4. MANUAL REVIEW 21
Several applications were selected for manual review, with selection criteria meant
to encompass a wide range of possible scenarios. Among the files were: (1) the most
highly obfuscated (89.7%) malware-alert application; (2) a highly popular social
application with no obfuscation and a large number of packages; (3) a popular mobile
Internet browser with 100% obfuscated packages; (4) an application which androguard
(DAD) and dexter failed to process; (5) an application which is known to use string
encryption and is claimed to be obfuscated as well; (6) an application containing many
base64 encoded strings; (7-10) four other applications, both legitimate and
malware-alert, chosen at random. Additionally, the permissions usage of all
malware-alert files was reviewed and analyzed.
With the exception of application (4) all files were successfully processed by
androguard. The source code of all checked obfuscated methods was successfully recovered
to correct Java code with the androguard plugin for Sublime Text 3. The control-flow
graphs of all analyzed files were recovered successfully with androgexf.py. However,
in some applications the excessive number of packages made adequate analysis
impractical, thus the graphs were filtered by pattern-matching the labels of
their nodes. Simplifying the graphs of all applications revealed practices such as
implementing a custom pair of string encryption and decryption functions and hiding
their source code in a native library (seen in two of the analyzed files).
Reviewing the graph of application (4) was key to understanding why some tools
break during analysis: they simply do not handle cases of Unicode method or field names
(e.g. 文章:Ljava/util/ArrayList;). On the other hand, baksmali did fully recover
the mnemonics of the application, Unicode names representing no obstacle.
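Filtering a graph by node labels amounts to keeping the matching nodes and the edges between them. The Python sketch below shows the idea on a graph stored as a plain adjacency dictionary; this representation and the function name are assumptions made for illustration and do not correspond to the file format written by androgexf.py.

```python
import re
from typing import Dict, Set

Graph = Dict[str, Set[str]]      # node label -> labels of its successors


def filter_graph(graph: Graph, pattern: str) -> Graph:
    """Keep only nodes whose label matches the pattern, and edges between them."""
    keep = re.compile(pattern)
    nodes = {label for label in graph if keep.search(label)}
    return {label: {t for t in graph[label] if t in nodes} for label in nodes}
```

For example, filtering with the pattern ^Lcom/app/ on a graph whose labels are method descriptors leaves only the nodes of that (hypothetical) package and hides library code.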
3
http://www.sublimetext.com/
4
http://contagiominidump.blogspot.co.il/
Implementing a Dalvik Bytecode
Obfuscator
The results in the previous chapter confirmed that little protection is used on Android
applications in practice. This chapter describes a possible implementation of a Dalvik
bytecode obfuscator including four transformations whose implementation emphasizes
the generic and cheap properties.
In the context of this work the term “generic” denotes that the transformations are
constructed with the aim of encompassing a large set of applications without preliminary
assumptions which must hold for the processed file. On Android this can be a real
challenge since an application has to run on a wide range of devices, OS versions and
architectures. Applications which are not obfuscated at all may have limited device
support either because the developers intentionally decided so, or due to a limitation
such as a lack of test devices. Thus, it is crucial that any applied code protection
does not reduce the set of devices on which the application runs. When a transformation
is characterized as “cheap” this refers to previously published work by Collberg et
al. on classifying obfuscating transformations [10]. By definition, a technique is cheap
if executing the obfuscated program P′ requires O(n) more resources than executing the
original P. Resources encompass processing time and memory usage: two essential
performance considerations, especially for mobile devices.
Following is a description of the general structure of the Dalvik bytecode obfuscator 1 as
well as details on the four transformations applied.
[Figure: obfuscator workflow. The original APK (.dex, META-INF) is disassembled into
.smali files, the bytecode is modified, and the result is assembled, packed and signed
into a new APK.]
Adopting this workflow has the advantage of accelerating the development process by
building on an existing .dex file assembler and disassembler pair. However, a
disadvantage is that the implemented obfuscator is bound by the limitations of the
external tools used. As will be shown in the next section, this approach constrains the
range of the transformations’ applicability.
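The outer part of such a workflow can be sketched with the Python standard library alone, since an APK is a zip archive. In the sketch below the disassemble, modify and assemble steps are represented by a caller-supplied function transform_dex (hypothetical, e.g. a wrapper around an assembler and disassembler pair). Dropping the META-INF entries is an assumption based on the fact that the old signature no longer matches modified content; signing the result is left to an external tool and is not shown.

```python
import io
import zipfile
from typing import Callable


def rebuild_apk(apk_bytes: bytes, transform_dex: Callable[[bytes], bytes]) -> bytes:
    """Return an unsigned APK whose classes.dex has been transformed."""
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(apk_bytes)) as src, \
            zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as dst:
        for item in src.infolist():
            if item.filename.startswith("META-INF/"):
                continue                      # stale signature files are dropped
            data = src.read(item.filename)
            if item.filename == "classes.dex":
                data = transform_dex(data)    # disassemble, modify, assemble
            dst.writestr(item.filename, data)
    return out.getvalue()
```

All other entries (resources, native libraries, the manifest) are copied unchanged, which is why the approach is limited to what can be expressed by rewriting the bytecode file.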
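[Figure: adding native call wrappers. (a) A class containing native calls; (b) the same
class with each native call replaced by an invocation of its wrapper; (c) the program
graph before and (d) after the wrappers (wrapper-1, wrapper-2, wrapper-3) are scattered
across classes.]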
The application is first scanned for the location of native calls by pattern matching
the mnemonics in the method declarations. Consider a class containing native calls,
which are highlighted in colors in (a). For each unique native method a corresponding
wrapper with additional arguments is constructed, redirecting the native call. To
complicate the control flow, the wrappers are scattered randomly in classes other than
those in which the native calls are originally located. As a final step each native call
is replaced with an invocation of its respective wrapper as shown in (b).
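The scanning and scattering steps can be sketched in Python as follows. The sketch works on smali text held in a dictionary from class name to file content, detects native methods from the access flags in their .method declarations, and picks a random host class for each wrapper. All names are hypothetical, and generating the wrapper bodies and rewriting the call sites are deliberately left out.

```python
import random
from typing import Dict, List, Tuple


def find_native_methods(smali_text: str) -> List[str]:
    """Return the 'name(signature)return' part of every native method declaration."""
    found = []
    for line in smali_text.splitlines():
        parts = line.split()
        # e.g. ".method public static native get_frame_to_play([B)V"
        if parts[:1] == [".method"] and "native" in parts[1:-1]:
            found.append(parts[-1])
    return found


def plan_wrappers(classes: Dict[str, str],
                  rng: random.Random) -> Dict[Tuple[str, str], str]:
    """Map (declaring class, native method) to the class chosen to host its wrapper."""
    names = sorted(classes)
    plan = {}
    for declaring in names:
        for method in find_native_methods(classes[declaring]):
            hosts = [name for name in names if name != declaring]
            if not hosts:
                raise ValueError("at least two classes are needed to scatter wrappers")
            plan[(declaring, method)] = rng.choice(hosts)
    return plan
```

Applying the planning step repeatedly, each time treating the previous wrappers as the methods to be wrapped, gives the nested structure described in the text.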
The overall impact of this transformation on the program graph can be seen as a
transition from what is depicted in (c) to the final result in (d). Initially, the
locality of the native method calls gives a hint on what the containing class is doing.
For example, during the manual application review it was trivial to locate a class
containing calls to a custom encryption implemented in a native library
(Lcom/.../util/SimpleEncryption;
encryptString(Ljava/lang/String; I) Ljava/lang/String;), i.e. knowing
exactly which class to track accelerates reversing the custom encryption algorithm.
By contrast, after applying the transformation suggested here once, the reversing time
and effort are increased by locating the wrapper, reviewing its code and concluding that
there is no logical connection between the class containing the wrapper and the native
invocation. If the transformation is applied more than once, the entire nesting of
wrappers has to be resolved. Usually, a mobile application has hundreds of classes over
which to scatter the nested wrapping structures: a setting that definitely slows down
the reversing process.
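[Figure: packing numeric constants. (a) const/4 and const/16 values are replaced by
getConst() calls to a packer class which is replicated 3-10 times; (b) a single class
retrieving its constants from different packer replicas.]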
Put this way, the transformation adds little complexity to the program.
To further challenge the reverser, the packer class creates between 3 and 10 replicas of
itself, each time applying anew the shuffling and the selection of the numeric
transformation to the array. This means that even if the obfuscated application has
several packer classes which apply the XOR-twice transformation, in each of them the two
random numbers for the transformation will differ, as will the data array index of every
unique numeric value. Designed like this, the transformation has the disadvantage of
data duplication. However, an advantage made possible by this duplication is that a
single class containing constants no longer needs to call the get-constant method
of the same packer, as shown in (b) in the figure above.
To summarize, control flow is complicated by multiple factors. Firstly, additional classes
are introduced to the application i.e. more data processing paths in the program graph for
the reverser to track. Then, in each packer class the array constant values will be seem-
ingly different. Lastly, different packers are addressed to retrieve the numeric constants
in a single class and the reverser would have to establish that the connection between
each of the different packer calls is merely data duplication. Metadata information is hid-
den on an abstract level with the supplementary graph paths and the modified numeric
values. Therefore by applying this transformation both static and dynamic analysis are
hindered.
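The packer replicas described above can be modelled in Python as follows. This is a sketch under assumptions: the XOR-twice transformation is taken to mean XOR with two random numbers, which is one possible reading of the description; the class and function names are hypothetical, and real packers are generated as smali classes rather than Python objects.

```python
import random
from typing import Dict, Iterable, List


class Packer:
    """One packer replica: a shuffled array of transformed constants."""

    def __init__(self, constants: Iterable[int], rng: random.Random):
        # Two distinct random numbers, so that the two XORs do not cancel out.
        self._r1, self._r2 = rng.sample(range(1, 1 << 16), 2)
        values = sorted(set(constants))
        rng.shuffle(values)                       # index of a value differs per replica
        self._data = [v ^ self._r1 ^ self._r2 for v in values]
        self.index: Dict[int, int] = {v: i for i, v in enumerate(values)}

    def get_const(self, i: int) -> int:
        """Undo the transformation for the entry at index i."""
        return self._data[i] ^ self._r1 ^ self._r2


def make_packers(constants: Iterable[int], rng: random.Random) -> List[Packer]:
    """Create between 3 and 10 replicas holding the same constants."""
    constants = list(constants)
    return [Packer(constants, rng) for _ in range(rng.randint(3, 10))]
```

A class containing the constant 7 could then fetch it as packers[k].get_const(packers[k].index[7]) for any replica k, so that two uses of the same constant need not refer to the same packer.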
4.2. BYTECODE TRANSFORMATIONS 27
[Figure: string decryption classes, replicated 3-10 times.]
tered out. A unique key is generated for and stored inside each such class. All
strings in a class are encrypted with the same class-assigned key. Encryption yields a
byte sequence corresponding to each unique string, which is stored as a data array in a
private static class field. As a result, the strings are removed from the constant pool
upon application re-assembly and are no longer visible to static analysis. Static class
fields were chosen for storing the encrypted strings because of their relatively small
performance impact. Decryption occurs at runtime, the strings being decoded once upon
the first invocation of the containing class.
Whenever a given string is needed, it is retrieved from the relevant class field.
Analogously to the previous transformations, control flow complexity is added at the
cost of duplication. The obfuscator parses a decryption class template and creates
between 3 and 10 semantically equivalent replicas of it in the processed application as
shown in the figure. Each class containing strings randomly chooses its corresponding
decryption class. A simple trick applied with the aim of increasing potency (i.e.
confusing a human reader, not an automated tool [10]) is naming the replicas with
plausible names which give no hint as to what is contained in the decryption class.
Normally, a human reader would not expect decryption functionality in a class called
InternalLoggerResponse.
To summarize, our suggested implementation offers several minor improvements
over what was found in related works. Encrypting the strings in each class with a unique
key slows down automatic decryption because the keys are placed at different positions
and need to be located separately for each class. Designing the transformation with
a decryptor-template approach allows, in principle, the developer to modify this
template: they can either choose to strengthen potency and resilience or easily change
the underlying encryption/decryption algorithm pair. Finally, control flow complexity
is increased by the supplementary decryption classes.
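The per-class scheme can be modelled in Python as shown below. The cipher is RC4, chosen here because the evaluation later in this chapter notes that a reverser would recognize RC4 in the decryption class; key length, class and method names are assumptions, and the real decryptor is a smali class template rather than Python code.

```python
import os
from typing import List, Optional, Sequence


def rc4(key: bytes, data: bytes) -> bytes:
    """RC4 stream cipher; encryption and decryption are the same operation."""
    s = list(range(256))
    j = 0
    for i in range(256):                          # key scheduling
        j = (j + s[i] + key[i % len(key)]) % 256
        s[i], s[j] = s[j], s[i]
    i = j = 0
    out = bytearray()
    for byte in data:                             # keystream generation
        i = (i + 1) % 256
        j = (j + s[i]) % 256
        s[i], s[j] = s[j], s[i]
        out.append(byte ^ s[(s[i] + s[j]) % 256])
    return bytes(out)


class EncryptedStrings:
    """Model of one obfuscated class: its own key plus encrypted byte arrays."""

    def __init__(self, strings: Sequence[str], key: Optional[bytes] = None):
        self._key = key if key is not None else os.urandom(16)
        self._blobs = [rc4(self._key, s.encode("utf-8")) for s in strings]
        self._plain: Optional[List[str]] = None

    def get(self, i: int) -> str:
        if self._plain is None:                   # decrypt once, on first use
            self._plain = [rc4(self._key, b).decode("utf-8") for b in self._blobs]
        return self._plain[i]
```

Only the byte arrays and the key are stored, so the plaintext strings no longer appear among the constants of the class; each class instance carries a different key.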
purpose to defy popular static analysis tools without claiming to be highly resilient. In
fact, the contrary holds. We show that a simple combination of known exploits is enough
to cause certain tools to crash and produce an output error. Two types of tools are
targeted: decompilers and disassemblers performing static analysis. The techniques used
are classified in previous works as “preventive” [10] since they exploit weaknesses of
current analysis tools.
To thwart decompilers an advantage is taken from the discrepancy between what is
representable as legitimate Java code and its translation into Dalvik bytecode. Similar
techniques have been proposed for Java bytecode protection [4]. The Java programming
language does not implement a goto statement, yet when loop or switch structures are
converted into bytecode this is done with a goto Dalvik instruction. Thus by working
directly on bytecode it is possible to inject verifiable sequences composed of goto
statements which either cannot be processed by the decompilers or do not translate back
to correct Java source code. In this particular implementation a bogus method is created
containing goto statements which recursively call each other.
    :labelA
    goto :labelC
    :labelB
    goto :labelD
    :labelC
    goto :labelB
    :labelD
    goto :labelA
With this underlying idea in common, different variations are generated to make
automatic detection harder. Above is given the skeleton of an example recursive goto
code sequence with an indirect recursion whose inner code is not detectable as dead code
by the Dalvik optimizer.
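Variations of such a skeleton can be generated mechanically. The Python sketch below emits labels and goto lines in smali syntax so that all gotos form a single cycle through every label, in a random order. The function name and the label naming are hypothetical, and the inner code placed between the jumps in the actual bogus method is not modelled.

```python
import random


def goto_skeleton(n_labels: int, rng: random.Random) -> str:
    """Emit n labels whose gotos form one cycle visiting every label."""
    if n_labels < 2:
        raise ValueError("at least two labels are needed")
    labels = [f":label{i}" for i in range(n_labels)]
    order = labels[1:]
    rng.shuffle(order)
    cycle = [labels[0]] + order                   # visiting order of the cycle
    target = {cycle[i]: cycle[(i + 1) % n_labels] for i in range(n_labels)}
    lines = []
    for label in labels:                          # labels appear in textual order
        lines += [label, f"goto {target[label]}"]
    return "\n".join(lines)
```

Because the jump order is shuffled independently of the textual order of the labels, two generated skeletons of the same size rarely look alike, while each still loops through all of its labels.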
To thwart disassemblers several “bad” instructions are injected directly in the bytecode.
Execution of the bad code is avoided by a preceding opaque predicate which redirects the
execution to the correct paths. This technique has already been shown to be successful
[40]. However, since its publication new tools have appeared and others have been fixed.
The minor modifications suggested here are to include in the dead code branch: (1) an
illegal invocation of the first entry in the application methods table; (2) a packed
switch table with large indexes for its size; (3) a call to the bogus method we
previously created such that it looks as if it is being used (so that it is not removed
as dead code). The bytecode sequences corresponding to the first two items are given
below with their mnemonics.
(1) 7400 0000 0000 invoke-virtual/range {} method@0000
(2) 2b01 fdff ffff packed-switch v1, fdff ffff
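The fields of the two sequences can be read off with the Python struct module. The sketch below assumes the instruction formats 3rc (invoke-virtual/range, opcode 0x74) and 31t (packed-switch, opcode 0x2b) from the Dalvik bytecode documentation; the function names are hypothetical.

```python
import struct


def decode_invoke_virtual_range(code: bytes):
    """Format 3rc: opcode, argument count, method index, first argument register."""
    op, count, method_index, first_register = struct.unpack("<BBHH", code)
    if op != 0x74:
        raise ValueError("not an invoke-virtual/range instruction")
    return count, method_index, first_register


def decode_packed_switch(code: bytes):
    """Format 31t: opcode, tested register, signed 32-bit offset to the payload."""
    op, register, offset = struct.unpack("<BBi", code)
    if op != 0x2B:
        raise ValueError("not a packed-switch instruction")
    return register, offset
```

Applied to the bytes 7400 0000 0000, the first function returns an argument count of 0 for method index 0. Applied to 2b01 fdff ffff, the second returns register 1 and a payload offset of -3 code units, i.e. a switch table located before the instruction itself.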
Table 4.1: Profiles of the test applications. The label abbreviations are identical to those
in the case study of applications. The black bullet marks a presence of the criteria.
The label MISC stands for “miscellaneous” and indicates notable app features. In the
facebook app, CCL stands for the custom class loader.
Defined so to ease the editing of smali code, this has its restrictions on our transforma-
tion. Fortunately, on average an application has around 10% of methods using more than
15 registers which is not a severe limitation.
Packing Numeric Variables is applied only to the 4-bit and 16-bit registers, because
there is a risk of overflow due to the applied transformation when extended to larger
registers. Clearly, a transformation shifts the range of the possible input values.
Regarding the simple XOR-based modifications, the scope is preserved but a linear mapping
shrinks the interval of possible values. Also, packing variables was restricted to
numeric constant types because in Dalvik registers have associated types, i.e. packing
heterogeneous data together might be a dangerous operation with respect to type
conversion. In the last chapter more details are given on this particular part of the
DVM as well as the limitations it implies.
Table 4.2: Testing the obfuscated applications on HTC Desire and Sony Xperia tablet.
The transformations abbreviations are as follows: w adding native wrappers, o obfus-
cating strings, p packing variables, b adding bad bytecode. The black bullet indicates
successful install and run after applying the series of transformations.
During the development process all transformations were tested and verified to work
separately. Table 4.2 gives the results of their combined application in accordance with
the order specified by the automaton in Figure 4.2. The plus sign should be interpreted
as meaning that the transformations have been applied consecutively (e.g. w+o+p means
adding wrappers, then obfuscating strings, then packing variables).
With the exception of the bad code injection on the facebook application, every
application undergoing the possible combinations of transformations was installed
successfully on both test devices. An observation on the error console logs for the
facebook application suggests that it might implement its own bytecode verifier, or at
least that it passes the bytecode through a custom parser which conflicts with the
injected bad code. The rest of the transformations did not make the app crash. For the
Korean ebay app no crash occurred, but not all of the UTF-8 strings were decrypted
successfully, i.e. some messages which should have been in Korean appeared as their
UTF-8 equivalent byte sequence. The most probable reason is that large alphabets are
separated in different Unicode ranges and smali implements a custom UTF-8
encoding/decoding 2 which might have a slight discrepancy with the encoding of python
for some ranges. Finally, the voxer communication app did not initially run with the
injected bad code. This led to implementing the possibility to toggle the verification
upon bytecode injection. By setting a constant in the method as verified, its
install-time verification can be suppressed. Enabling this feature let the voxer app run
without problems. However, verifier suppression is disabled by default for security
considerations.
Apart from the above, no other anomalies were noted on the tested applications.
No noticeable runtime performance slowdown was detected while testing manually. The
memory overhead added by each transformation separately is shown in Table 4.3. Because
the applications differ significantly in size, for a better visual representation only
the impact on the least significant megabyte is shown.
2
https://code.google.com/p/smali/source/browse/dexlib/src/main/java/org/jf/dexlib/Util/Utf8Utils.java
4.5. TESTING ANALYSIS TOOLS ON MODIFIED BYTECODE 31
When attempting to view the source code of the five found methods, all of them were
empty. For example:
method: get_frame_to_play ([B)V [public static native] size:0
This means that their actual code is located in a native library and cannot be seen
with static analysis. However, here we look for their usage, not their implementa-
tion. By assigning to a unique variable each of the native methods, we can use the
androguard function show_Paths() to track the usage. In this particular case, our
wrapper was located in the class Landroid/support/v4/util/AtomicFile and had the
name d(Lcom/rebelvox/voxer/System/NativeSystem; [B)V. The next step is to locate
Lcom/rebelvox/voxer/System/LocalNotificationManager;-><clinit>()V
static LocalNotificationManager()
{
v1 = new byte[150];
v1 = {205, 159, 2, ......, 119, 127};
v0 = new com.actionbarsherlock.BasicRandomEventHandler(v1);
v1 = new byte[5];
v1 = {136, 88, 68, 135, 21};
com.rebelvox.voxer.System.LocalNotificationManager. p7890 = v0.up(v1);
v1 = new byte[6];
v1 = {12, 90, 93, 245, 185, 102};
com.rebelvox.voxer.System.LocalNotificationManager. e1951 = v0.up(v1);
.
.
.
}
We can see that the initialized variables are static string class fields:
field: e1951 Ljava/lang/String; [private static java.lang.String]
field: p7890 Ljava/lang/String; [private static java.lang.String]
Looking at the recovered source code of the methods, none of them appears to call any
of the other methods, although a correlation between the constructor and RGB can be
established due to the similarity of the performed actions. The reverser has to look
at the mnemonics of the up method to see that it invokes the RGB method for
decryption. An experienced reverser would recognize the RC4 algorithm, but to decrypt
they need to re-write the disassembled code to recover the plaintext or emulate the
execution. A tool which claims to do this automatically is dexguard, however it was
unavailable at submission time so we could not challenge our transformation [6].
Moreover, even if this process is automated, each time the stream needs to be
re-initiated manually with the uniquely generated decryption class key. Another tool
which does automatic
strings decryption is part of dex2jar and is called dex-tool-0.0.9.123 . In this
case it is useless against our encryption because it handles only methods with the signa-
ture Ljava/lang/String en(dec)crypt(Ljava/lang/String); but we repre-
sent the encrypted strings as byte data arrays.
In total our transformation encrypted 9725 strings which were distributed in more than
2000 of the 3539 classes, i.e. more than 2000 unique keys to decrypt with. A rough
estimation of the time and effort needed to reverse all strings is left to the reader.
3
URL: https://code.google.com/p/dex2jar/wiki/DecryptStrings
Output
..
.
23 (0000004a) packed-switch-payload 12b0000:
24 (00000052) AG:invalid_instruction (OP:fd)
25 (00000054) AG:invalid_instruction (OP:ff)
26 (00000056) fill-array-data-payload \x00\x00\x12\x10\x54\x85\x75\x06\x72\x40
\xe1\x14\x95\xba\x0c\x04\x54\x85\x77\x06\x72\x40\xfd\x14\x45\xb0\x0a\x05\x38\x05
\x31\x00\x54\x85\x77\x06\x72\x10\xfc\x14\x05\x00\x0b\x02\x54\x85\x76\x06\x22\x06
\x61\x09\x70\x10\xbf\x4a\x06\x00\x1a\x07\xfb\x29\x6e\x20\xca\x4a\x76\x00\x0c\x06
\x6e\x10\x2e\x4a\x01\x00\x0c\x06\x70\x20\x4c\x49\x65\x00\x27\x05\x11\x04
Note: The entire app was successfully processed by androguard, but the output pro-
duced the methods internal code as a packed switch data array. Some methods for which
injection is not applicable were recovered successfully (see also dedexer).
Output
UNEXPECTED TOP-LEVEL EXCEPTION:
org.jf.dexlib.Util.ExceptionWithContext: regCount does not match the number of
arguments of the method
at org.jf.dexlib.Util.....withContext(ExceptionWithContext.java:54)
at org.jf.dexlib.Code.....IterateInstructions(InstructionIterator.java:91)
at org.jf.dexlib.CodeItem.readItem(CodeItem.java:154)
at org.jf.dexlib.Item.readFrom(Item.java:77)
at org.jf.dexlib.OffsettedSection.readItems(OffsettedSection.java:48)
at org.jf.dexlib.Section.readFrom(Section.java:143)
at org.jf.dexlib.DexFile.<init>(DexFile.java:431)
at org.jf.baksmali.main.main(main.java:280)
Caused by: java.lang.RuntimeException: regCount does not match the number of
arguments of the method
at org.jf.dexlib.Code.Format.Instruction3rc.checkItem(Instruction3rc.java:129)
at org.jf.dexlib.Code.Format.Instruction3rc.<init>(Instruction3rc.java:79)
at org.jf.dexlib.Code.Format.Instruction3rc.<init>(Instruction3rc.java:44)
at org.jf.dexlib.Code.Format....$Factory.makeInstruction(Instruction3rc.java:145)
at org.jf.dexlib.Code.....IterateInstructions(InstructionIterator.java:82)
... 6 more
Error occurred at code address 152
code_item @0x91074
Note: Since apktool is based on baksmali their console outputs were identical.
DARE decompiler
Executed command
dare -d testDare com.rebelvox.voxer.apk
Output
Processing class #2486: Lnet/hockeyapp/android/internal/ExpiryInfoView;
W/dalvikvm(11427):Error while translating ao opcode:type object - constant:103
W/dalvikvm(11427):Unknown instruction format
W/dalvikvm(11427):Error while translating ao opcode:type object - constant:103
W/dalvikvm(11427):Unknown instruction format
W/dalvikvm(11427):Error while translating ao opcode:type object - constant:103
W/dalvikvm(11427):Unknown instruction format
Note: According to the project website, DARE is the improved, Android-targeting
version of the DED decompiler [43]. When attempting to process the modified application
with DARE, a large console log similar to the output above was produced. After some
point, the decompiler looped endlessly: for the testing it was left to run 3 hours with
no success. When interrupted from the keyboard, the result was a nested hierarchy of
directories corresponding to the packages of the application as well as to its optimized
version. In the end, the application was not processed at all since those directories
were empty.
dedexer
Executed command
java -jar ddx1.25.jar -d testDedexer classes.dex
Output without injecting junk code sequences after the opaque predicate 4
Processing android/...ServiceInfoCompat$AccessibilityServiceInfoStubImpl
Processing android/...ServiceInfoCompat$AccessibilityServiceInfoIcsImpl
Processing android/support/v4/accessibilityservice/AccessibilityServiceCompat
Processing android/...ServiceInfoCompat$AccessibilityServiceInfoVersionImpl
Processing android/support/v4/app/Fragment
Unknown instruction 0xFF at offset 000A4CBC
Note: Only a small part of the app (the upper listed 5 classes) was successfully processed
by dedexer.
Output with injecting junk code sequences after the opaque predicate 4
Note: The entire app was processed, but when looking inside a .ddx file few parts of
the code were translated back to legitimate mnemonics. The majority of the recovered
code looked like the data array bytes given above. The recursively calling goto sequence
can be seen between the addresses l92876 and l9289a. The method internal code is
represented as a data array on address l9289c. It is not always applicable to inject the
bad code sequences. For example methods which are static, native or abstract are not
processed because they do not have the necessary registers to inject the opaque predicate.
Hence, some methods were reversed successfully.
dex2jar
Executed command
./d2j-dex2jar.sh com.rebelvox.voxer.apk
Output
dex2jar touched-com.rebelvox.voxer.apk -> touched-com.rebelvox.voxer-dex2jar.jar
...DexException: while accept method:[Landroid/...ModernAsyncTask$3;.done()V]
at com.googlecode.dex2jar.reader.DexFileReader.acceptMethod(DexFileReader.java:701)
at com.googlecode.dex2jar.reader.DexFileReader.acceptClass(DexFileReader.java:448)
at com.googlecode.dex2jar.reader.DexFileReader.accept(DexFileReader.java:330)
at com.googlecode.dex2jar.v3.Dex2jar.doTranslate(Dex2jar.java:84)
at com.googlecode.dex2jar.v3.Dex2jar.to(Dex2jar.java:239)
at com.googlecode.dex2jar.v3.Dex2jar.to(Dex2jar.java:230)
at com.googlecode.dex2jar.tools.Dex2jarCmd.doCommandLine(Dex2jarCmd.java:109)
at com.googlecode.dex2jar.tools.BaseCmd.doMain(BaseCmd.java:168)
at com.googlecode.dex2jar.tools.Dex2jarCmd.main(Dex2jarCmd.java:34)
Caused by:...DexException: while accept code in method:[...AsyncTask$3;.done()V]
at com.googlecode.dex2jar.reader.DexFileReader.acceptMethod(DexFileReader.java:691)
... 8 more
Caused by: java.lang.ArrayIndexOutOfBoundsException: 0
at com.googlecode.dex2jar.reader.DexOpcodeAdapter.xrc(DexOpcodeAdapter.java:791)
at com.googlecode.dex2jar.reader.DexCodeReader.acceptInsn(DexCodeReader.java:625)
at com.googlecode.dex2jar.reader.DexCodeReader.accept(DexCodeReader.java:337)
... 8 more
Note: After failing with this error, dex2jar produces an empty .jar file.
dexter
To the benefit of the reverser or the disappointment of the code protector, dexter did
not fall for any of our bytecode injection tricks. This result was expected, since our
bytecode injection uses an approach similar to the one described by one of the tool’s
authors [40]. During development about 20 applications were analyzed with dexter, four of
which produced an error. Since the server-side code is closed and no error log was
available on the website, we can only conjecture what caused the errors; the conjecture is
offered here solely as feedback for improving the tool. Three of the four applications
which crashed had UTF-8 class names (e.g. NotifierSettings$容), which most likely
indicates that dexter does not yet handle such cases. The fourth problematic app was
successfully reversed with androguard.
JD-GUI
Output
public void setUpdateThrottle(long paramLong)
{
    if (('å' % 2 == 0) || ((-1 + 'å' * 'å') % 8 != 0))
        while (true)
            new String[3];
    this.mUpdateThrottle = paramLong;
    if (paramLong != 0L)
        this.mHandler = new Handler();
}
Note: To test the effect of the recursive goto sequences on decompilers separately, bad
code injection was removed. In JD-GUI some classes produced //INTERNAL ERROR//.
The remaining classes were translated into Java code which does not compile, yet is
relatively easy to correct by manual examination. Although not intentional, the
transformation also affected how the variables of the predicate are rendered: they appear
as character literals instead of numeric variables. An obvious drawback of the currently
used opaque predicates is the ease with which they can be detected and removed manually.
This weakness stems from Dalvik’s requirement that registers have known types: the
registers had to be initialized with a value before being used by the predicates. Trying
to avoid this resulted in a verifier error.
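The predicate recovered by JD-GUI can be checked by hand. For every odd x, x^2 - 1 =
(x - 1)(x + 1) is a product of two consecutive even numbers and is therefore divisible by 8,
so both disjuncts are false and the while (true) branch is dead. The Python sketch below
confirms this numerically. It assumes that the character literal in the decompiled output
stands for the constant 229 (the code point of that character) loaded into the register;
this is our reading of the output, not something reported by the tool. The function names
are illustrative only.

```python
def opaque_predicate(x: int) -> bool:
    """Condition as shown in the decompiled output: x % 2 == 0 || (-1 + x * x) % 8 != 0."""
    return x % 2 == 0 or (x * x - 1) % 8 != 0


# Assumption: the literal in the JD-GUI output is the constant 229 (U+00E5), an odd value.
SEED = ord("\u00e5")


def dead_for_all_odd(limit: int) -> bool:
    """True if the predicate is false for every odd x in [1, limit)."""
    return not any(opaque_predicate(x) for x in range(1, limit, 2))


if __name__ == "__main__":
    print(SEED, opaque_predicate(SEED), dead_for_all_odd(100_000))
```

Because the register holds a compile-time constant, a reverser (or a constant-folding pass)
can evaluate the condition once and delete the branch, which is exactly the weakness noted
above.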
4.6 Summary
This chapter proposes a possible implementation of a Dalvik bytecode obfuscator. The
obfuscator is half-jokingly called “Innocent Dalvik Obfuscator” for two reasons. Firstly,
none of the transformations applied alone is robust enough against an experienced
reverser armed with multiple analysis tools. Secondly, our transformations combined
have a very reasonable impact on the underlying application: no more than 1 MB
of additional memory altogether and no noticeable CPU slowdown when tested on an
old phone. An efficient obfuscator often has to find a balance between resilience,
potency, stealth and cost, which can lead to a compromise in either performance or
security. Moreover, one is not limited to working solely at the bytecode level. Given the
current state of most freely available Android analysis tools, our four bytecode
transformations combined with UTF-8 class and method naming at the source code level
already provide a good level of protection against all the tools tested here.
Final Remarks
The final chapter focuses on topics which arose naturally during the development of
the Dalvik obfuscator. The following section attempts to initiate a discussion on
applying obfuscation techniques known from x86 to Dalvik bytecode. Both static
and dynamic techniques are reviewed. The concluding discussion summarizes the
contributions of this work and outlines possible future development.
The non-obfuscated code and its corresponding control flow are given in (a). The
generated address mapping function M is shown in (b). The result of redirecting the
control flow through M is shown in (c). To increase the potency by hiding the real address
values in the branch function M, one could store their hashed values and reverse the
hash to return the real value at runtime. This improvement is possible on x86 because the
architecture allows direct manipulation of registers. Moreover, the instruction pointer
itself is a register, i.e. its value can be changed at runtime to a computed address (for
example through an indirect jump or a modified return address). For Dalvik bytecode, a
verification function enforces constraints on the targets of branch instructions.
This can be seen in the following (only the most relevant parts of the code are cited):
FILE: /dalvik/vm/analysis/DexVerify.cpp, LINE: 717, PLATFORM v4.2.2
 1: if (!selfOkay && offset == 0) {
 2:     LOG_VFY_METH(meth, "VFY: branch offset of zero "
 3:         "not allowed at %#x", curOffset);
 4:     return false;
 5: }
    ...
 6: if (((s8)curOffset + (s8)offset) != (s8)(curOffset+offset))
 7: {   LOG_VFY_METH(meth, "VFY: branch target overflow %#x +%d",
 8:         curOffset, offset);
 9:     return false;
10: }
    ...
11: if (absOffset < 0 || absOffset >= insnCount ||
12:     !dvmInsnIsOpcode(insnFlags, absOffset))
13: {
14:     LOG_VFY_METH(meth,
15:         "VFY: invalid branch target %d (-> %#x) at %#x",
16:         offset, absOffset, curOffset);
17:     return false;
18: }
When the code is loaded, the DVM first scans it and marks the start address of each
instruction. Each instruction is flagged with the width it occupies, leaving all
unflagged code units to be interpreted as data or as parts of a longer instruction. The
main reason why jumps to arbitrary addresses are impossible is that the DVM expects each
branch target to be constant, i.e. its value must be known at compile time and cannot be
altered at runtime. On line 1, the code cited above asserts that instructions do not
branch to themselves, with the exception of the few that are allowed to do so. On line 6,
a check against 32-bit overflow is performed. On line 11, the check prevents jumps to
arbitrary locations: only offsets inside the method which are marked as the start of an
instruction can be jump targets. To summarize, the DVM expects valid instructions as jump
destinations and manages them as constant offsets. Code violating these requirements
causes a verifier (VFY) error.
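The three checks can be modelled outside the DVM to experiment with what the verifier
accepts. The following Python sketch is a simplified model written for this discussion, not
a port of DexVerify.cpp: the list is_opcode_start is a hypothetical stand-in for the
insnFlags array queried by dvmInsnIsOpcode, and the 64-bit cast comparison of line 6 is
replaced by an explicit signed 32-bit range test.

```python
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1


def check_branch_target(cur_offset, offset, is_opcode_start, self_okay=False):
    """Return None if the branch is acceptable, otherwise the reason for rejection.

    is_opcode_start[i] is True when code unit i begins an instruction
    (hypothetical stand-in for the verifier's instruction flags).
    """
    insn_count = len(is_opcode_start)

    # Check 1: an instruction may not branch to itself unless explicitly allowed.
    if not self_okay and offset == 0:
        return "branch offset of zero not allowed"

    # Check 2: the absolute target must fit in a signed 32-bit value.
    abs_offset = cur_offset + offset
    if not INT32_MIN <= abs_offset <= INT32_MAX:
        return "branch target overflow"

    # Check 3: the target must lie inside the method and start an instruction.
    if abs_offset < 0 or abs_offset >= insn_count or not is_opcode_start[abs_offset]:
        return "invalid branch target"

    return None


if __name__ == "__main__":
    # Code units 1 and 4 are operands of two-unit instructions, not opcodes.
    flags = [True, False, True, True, False, True]
    print(check_branch_target(0, 2, flags))   # lands on an instruction start
    print(check_branch_target(0, 1, flags))   # lands inside an instruction
```

In this model a jump into the middle of an instruction, the basis of many x86
anti-disassembly tricks, is rejected by the third check.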
10: return a;
11: }
After flattening the same sequence of code and its graph would look like the following:
next = 0
switch(next)
the code cannot simply be copied into the branch statement; an analysis of the register
types, their usage and possible side effects is needed. If a side effect of the code is,
for example, the modification of more than one value, these values cannot be returned by
the flattened function and shared class fields must be used to maintain the semantics.
Moreover, the register types at all entry and exit points of the branches need to be
verified. While this is technically possible, it is hard to implement generically and
might be quite an unsafe operation.
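To make the idea of flattening concrete at the source level, the following Python sketch
shows a small loop before and after the transformation. The function (Euclid's algorithm)
is a hypothetical example chosen for brevity and is not the listing used in this chapter;
the dispatcher variable next_block plays the role of the switch variable.

```python
def gcd_plain(a: int, b: int) -> int:
    """Original control flow: a loop followed by a return."""
    while b != 0:
        a, b = b, a % b
    return a


def gcd_flattened(a: int, b: int) -> int:
    """Same semantics; every basic block is a case of one dispatcher loop."""
    next_block = 0
    while True:
        if next_block == 0:        # block 0: loop condition
            next_block = 1 if b != 0 else 2
        elif next_block == 1:      # block 1: loop body
            a, b = b, a % b
            next_block = 0
        elif next_block == 2:      # block 2: exit
            return a


if __name__ == "__main__":
    print(gcd_plain(48, 18), gcd_flattened(48, 18))
```

At the source level the variables a and b are simply shared by all blocks. On Dalvik
bytecode the same sharing is what makes the transformation delicate, since each register
must have a consistent type at every point where control flow merges in the dispatcher.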
Injecting dead code in a method is a code splitting technique. In the previous chapter
one possible variant of dead code injection with opaque predicates was shown. A
straightforward “trick” to guarantee that this modification is type safe is to inject the
dead code before any registers are used, i.e. just after the method declaration. However,
a thorough preliminary static analysis is needed if the bogus branch is to split the
method in two. Two implementation possibilities are proposed here. The simpler one is to
trace which registers are free at the point of insertion and use only those registers
to construct the opaque predicate. Although relatively type safe, such an
implementation will very likely restrict the strength of the inserted opaque predicate:
by default the bytecode is optimized to use as few additional registers as possible.
Empirical testing showed that at most three registers were freely available at
suitable insertion points, which makes it a challenge to design a strong opaque
predicate. The alternative is to allocate as many registers as needed for a strong
predicate, which has the drawback of being risky. Firstly, the registers need to be
checked for availability by tracing both before and after the insertion point, because
jumps are possible in either direction. Secondly, the register types need to be checked
with regard to the possible jumps. Finally, after inserting the dead code all used
registers need to be converted back to the types that the succeeding code expects to
receive. Another restriction is that not all register types can be freely converted into
each other, as can be seen in the merge table in the
/dalvik/vm/analysis/CodeVerify.cpp file. Again, this is technically possible on
bytecode, but much more feasible to apply at the source code level.
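Tracing which registers are free at an insertion point amounts to a liveness analysis that
follows jumps in both directions. The sketch below is a minimal Python illustration under
strong assumptions: instructions are given as hypothetical (defs, uses, successors)
tuples rather than real Dalvik instructions, and register types and wide register pairs,
which a real implementation would have to respect, are ignored.

```python
def live_in_sets(instructions):
    """Backward data-flow analysis; returns the set of live registers before each instruction.

    instructions[i] = (defs, uses, successors), where successors are instruction indices.
    """
    live_in = [set() for _ in instructions]
    changed = True
    while changed:                       # iterate until a fixed point is reached
        changed = False
        for i in reversed(range(len(instructions))):
            defs, uses, successors = instructions[i]
            live_out = set()
            for s in successors:
                live_out |= live_in[s]
            new_in = set(uses) | (live_out - set(defs))
            if new_in != live_in[i]:
                live_in[i] = new_in
                changed = True
    return live_in


def free_registers(instructions, registers, point):
    """Registers whose current value is not needed if code is inserted before `point`."""
    return set(registers) - live_in_sets(instructions)[point]


# Hypothetical method: v0 counts up to v1 in a loop, v2 is never used.
METHOD = [
    ({"v0"}, set(), [1]),            # 0: const v0, 0
    ({"v1"}, set(), [2]),            # 1: const v1, 10
    (set(), {"v0", "v1"}, [3, 5]),   # 2: if v0 >= v1 goto 5
    ({"v0"}, {"v0"}, [4]),           # 3: add v0, v0, 1
    (set(), set(), [2]),             # 4: goto 2
    (set(), {"v0"}, []),             # 5: return v0
]

if __name__ == "__main__":
    for point in range(len(METHOD)):
        print(point, sorted(free_registers(METHOD, ["v0", "v1", "v2"], point)))
```

In this example v1 is not free inside the loop body even though no later instruction in
program order reads it; the backward jump to the comparison keeps it live, which is the
reason tracing has to consider jumps in either direction.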
Java does not support pointer operations for data integrity reasons. Thus, the custom
class loader would act as part of the DVM itself, having access to the virtual machine’s
memory where the code resides and altering the program behavior. While possible, this
is clearly not a generic transformation; it needs to be tailored to the concrete target
program. For example, in C/C++ programs, a possible dynamic change technique is to
duplicate the semantics of a method in two syntactically different versions which
interchange calls at runtime [9]. In the DVM, JIT compilation requires that one tailors
such techniques by adding means to locate the methods of interest during execution (e.g.
with a variable whose value is known a priori).
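The idea of two syntactically different but semantically equivalent versions which are
interchanged at runtime can be illustrated independently of the platform. The Python sketch
below is only a conceptual model: the checksum functions and the SwappingDispatcher class
are hypothetical names introduced here, and nothing in it reflects how such a swap would be
realized inside the DVM.

```python
def checksum_v1(data: bytes) -> int:
    """Version 1: explicit loop with a running remainder."""
    total = 0
    for byte in data:
        total = (total + byte) % 65521
    return total


def checksum_v2(data: bytes) -> int:
    """Version 2: same result, different syntax."""
    return sum(data) % 65521


class SwappingDispatcher:
    """Routes each call to the next version, so consecutive calls execute different code."""

    def __init__(self, *versions):
        self._versions = list(versions)
        self._calls = 0

    def __call__(self, *args):
        version = self._versions[self._calls % len(self._versions)]
        self._calls += 1
        return version(*args)


if __name__ == "__main__":
    checksum = SwappingDispatcher(checksum_v1, checksum_v2)
    print(checksum(b"abc"), checksum(b"abc"))
```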
Dynamic code loading. Used by both malware and legitimate applications to load
external code, this technique has been shown to work with the help of a custom class
loader [39]. To answer the question of whether it can be applied generically, the
following has to be considered. Suppose one would like to load some given classes
externally. This means that all accesses to those classes, be it to read static class
fields or to create class objects, have to go through the custom class loader. This
implies that the external class loader could induce a noticeable performance slowdown if
not implemented optimally. Moreover, the case study on market applications shows that a
large proportion of the apps use Java reflection. If one would like reflection to work
with dynamic class loading, the entire application needs to be processed by the custom
loader, which is a challenge with regard to performance. To maintain good performance,
only selected classes should be loaded dynamically, which imposes a constraint on the
usage of reflection. Therefore the genericity of dynamic code loading is restricted.
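The cost argument can be seen in a small model. In the Python sketch below, a hypothetical
CachingLoader is the only way to reach an externally stored class; the first access pays
for loading and later accesses are served from a cache. Python source strings and exec
stand in for external bytecode and class definition, so the sketch models the access
pattern only and not Android's class loading API.

```python
class CachingLoader:
    """Every access to an external class goes through load_class (hypothetical model)."""

    def __init__(self, external_sources):
        self._sources = dict(external_sources)
        self._cache = {}
        self.load_count = 0   # number of expensive loads actually performed

    def load_class(self, name):
        if name not in self._cache:
            namespace = {}
            # Stand-in for defining a class from externally stored code.
            exec(self._sources[name], namespace)
            self._cache[name] = namespace[name]
            self.load_count += 1
        return self._cache[name]


# Hypothetical externally stored class.
EXTERNAL = {
    "Secret": "class Secret:\n    LIMIT = 3\n    def answer(self):\n        return 42\n",
}

if __name__ == "__main__":
    loader = CachingLoader(EXTERNAL)
    print(loader.load_class("Secret").LIMIT)       # static field access
    print(loader.load_class("Secret")().answer())  # object creation
    print(loader.load_count)
```

A lookup by name for a class the loader does not know fails in this model, which mirrors
the constraint on reflection when only selected classes are loaded dynamically.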
Code encryption. Several considerations need to be taken into account to adapt this
technique for the Android platform. While it is clear that the encryption
would be performed on the application .dex file, there are some subtleties regarding the
decryption at runtime. During the unpacking process, after the successful decryption
of the .dex file, the file has to be passed to the DVM for loading and execution. Dynamic
loading is possible due to the support of reflection in Dalvik, but the contained public
methods can only be executed if the file is stored in the file system. Thus, by saving the
decrypted and decompressed .dex file on the device’s storage, the previously applied
protection is largely undone. Moreover, the bytecode is optimized upon its initial
launch and the .odex file is stored in the cache, secured by enforced system permissions.
A custom dex file loader that interfaces directly with the libraries within the DVM can
bypass these restrictions. To summarize, encryption can be implemented analogously
to dynamic code loading, which brings up the considerations on performance and lack of
genericity mentioned above. In this case, the performance is also highly dependent on the
efficiency of the chosen pair of encryption and decryption algorithms. Finally, the key
must be stored in the decryption stub, i.e. it is available to the reverser, and if it is
not hidden appropriately this technique is ineffective.
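The last point, that the key necessarily travels with the decryption stub, can be
demonstrated with a few lines of Python. The sketch assumes RC4 purely because it is short
to write down; the text above does not prescribe a cipher, and RC4 is not recommended for
new designs. STUB_KEY, PAYLOAD, pack and unpack_and_run are hypothetical names, and a
Python source string stands in for the protected .dex file.

```python
def rc4(key: bytes, data: bytes) -> bytes:
    """Plain RC4; encryption and decryption are the same operation."""
    s = list(range(256))
    j = 0
    for i in range(256):                      # key scheduling
        j = (j + s[i] + key[i % len(key)]) % 256
        s[i], s[j] = s[j], s[i]
    i = j = 0
    out = bytearray()
    for byte in data:                         # keystream generation
        i = (i + 1) % 256
        j = (j + s[i]) % 256
        s[i], s[j] = s[j], s[i]
        out.append(byte ^ s[(s[i] + s[j]) % 256])
    return bytes(out)


# Assumption: the key is embedded in the stub, so anyone holding the stub holds the key.
STUB_KEY = b"hypothetical-key"
PAYLOAD = b"def secret():\n    return 'valuable logic'\n"


def pack(payload: bytes) -> bytes:
    """Performed once, before distribution."""
    return rc4(STUB_KEY, payload)


def unpack_and_run(blob: bytes) -> dict:
    """Performed by the stub at runtime: decrypt, then hand the plaintext to the runtime."""
    source = rc4(STUB_KEY, blob).decode()
    namespace = {}
    exec(source, namespace)
    return namespace


if __name__ == "__main__":
    blob = pack(PAYLOAD)
    print(blob != PAYLOAD, unpack_and_run(blob)["secret"]())
```

A reverser who extracts STUB_KEY from the stub can call rc4 on the packed data directly,
and on Android the plaintext additionally has to be written to storage before loading, as
discussed above.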
This subsection concludes with a remark regarding the stealth of the dynamic
transformations listed here when applied on Android. Applying either of them to the
entire application is not efficient in terms of performance, yet selecting a subset of
classes to load dynamically or encrypt gives an immediate hint as to where the valuable
code is. It can be the case that the code which needs to be protected is also critical
for the performance of the application. If so, obfuscation adds a further layer of
processing time and allocated memory. Therefore each application which makes use of
dynamic modifications has to be treated as a special case in which it must be determined
which technique to use and how.
5.2 Discussion
This work highlighted several important aspects of code obfuscation for the Android
mobile platform. To begin with, we confirmed that reverse engineering Android applications
is currently a lightweight task in terms of invested time and computational resources.
We studied more than 1600 applications for applied code transformations, but
found no protection more sophisticated than variable name scrambling or its slightly
more resilient variation of giving Unicode names to classes and methods. In some
applications we also found encryption applied to strings generated at runtime. Yet,
these applications themselves had hardcoded strings visible with analysis tools.
Having demonstrated the feasibility of examining randomly selected applications, we
proposed a proof-of-concept open-source Dalvik obfuscator with the purpose of
introducing a reasonable slowdown in the reversing process. Our obfuscator performs four
transformations, three of which target both data flow and control flow. The last
transformation is a slight modification of a technique proven efficient in previous work.
We challenged various analysis tools with our modified code, showed that the majority of
them are defeated, and proposed a supplementary source-code transformation, already used
in practice, to target the remaining ones.
During the development process it was occasionally necessary to look through the source
code of the DVM. Also, apart from several blog posts, no previous comments were found on
which obfuscation techniques known from the x86 architecture can be applied on Android.
This motivated the writing of the last chapter: our attempt to initiate such a discussion
by summarizing how popular techniques can be adapted for Dalvik bytecode.
Android has been on the market for only five years, but because of its commercial growth
much research is conducted around it. The evolution of the platform is a constantly
ongoing process. It can be seen in the source code that some of the now unused bytecode
instructions were formerly implemented test instructions. Possible future opcode changes
may invalidate the effects of our transformations. Moreover, analysis tools will keep
getting better, and to defeat them newer, craftier obfuscation techniques will need to be
applied. This outwitting competition between code protectors and code reverse engineers
has existed ever since obfuscation was established as a topic of practical importance.
So far, the evidence suggests that this game will continue to be played.
Bibliography
[3] Boaz Barak, Oded Goldreich, Russell Impagliazzo, Steven Rudich, Amit Sahai,
Salil P. Vadhan, and Ke Yang, On the (im)possibility of obfuscating programs,
Proceedings of the 21st Annual International Cryptology Conference on Advances in
Cryptology (London, UK), CRYPTO ’01, Springer-Verlag, 2001, pp. 1–18.
[4] Michael R. Batchelder, Java bytecode obfuscation, Master’s thesis, McGill University
School of Computer Science, Montréal, 2007.
[5] Dan Bornstein, Dalvik vm internals, Google I/O Session Videos and Slides, URL:
https://sites.google.com/site/io/dalvik-vm-internals (2008).
[7] Carlos A. Castillo, Android malware: Past, present, and future, McAfee Mobile
Security Working Group (2011).
[8] Frederick B. Cohen, Operating system protection through program evolution, Com-
puters and Security 12 (1993), no. 6, 565 – 584.
[9] Christian Collberg and Jasvir Nagra, Surreptitious software: Obfuscation, water-
marking, and tamperproofing for software protection, no. ISBN-13: 978-0321549259,
Addison-Wesley Professional, 2009.
[10] Christian Collberg, Clark Thomborson, and Douglas Low, A taxonomy of obfuscating
transformations, Technical Report 148, Department of Computer Science, University
of Auckland, New Zealand, July 1997.
[11] Christian Collberg, Clark Thomborson, and Douglas Low, Manufacturing cheap,
resilient, and stealthy opaque constructs, Principles of Programming
Languages 1998, POPL ’98, 1998, pp. 184–196.
[14] Adrienne Porter Felt et al., Android permissions demystified, University of
California, Berkeley, URL:
http://www.cs.berkeley.edu/~afelt/felt-permissions-ccs.pdf (2011).
[17] Devon Long and Hannah Gommerstadt, Android application security: A thorough model
and two case studies: K9 and talking cat, Harvard University.
[18] Peter Hornyack, Seungyeop Han, Jaeyeon Jung, Stuart Schechter, and David
Wetherall, These aren’t the droids you’re looking for: retrofitting android to pro-
tect data from imperious applications, Proceedings of the 18th ACM conference on
Computer and communications security (New York, NY, USA), CCS ’11, ACM,
2011, pp. 639–652.
[19] Xuxian Jiang, Security alert: New stealthy android spyware - plankton - found in
official android market, Department of Computer Science, North Carolina State
University, URL: http://www.csc.ncsu.edu/faculty/jiang/Plankton/.
[20] Kaspersky Lab, 99% of all mobile threats target android devices, URL:
http://www.kaspersky.com/about/news/virus/2013/99_of_all_mobile_threats_target_Android_devices.
[22] Cullen Linn and Saumya Debray, Obfuscation of executable code to improve
resistance to static disassembly, Proceedings of the 10th ACM conference on Computer
and communications security (New York, NY, USA), CCS ’03, ACM, 2003, pp. 290–299.
[23] Cypherpunks (mailing list archives), Rc4 source code, URL:
http://cypherpunks.venona.com/archive/1994/09/msg00304.html, 1994.
[33] The Android Open Source Project, Bytecode for the dalvik vm, URL:
http://source.android.com/devices/tech/dalvik/dalvik-bytecode.html, 2007.
[36] David Reiss, Under the hood: Dalvik patch for facebook for android, URL:
http://www.facebook.com/notes/facebook-engineering/under-the-hood-dalvik-patch-for-facebook-for-android/10151345597798920,
2013.
[42] , Dex education: Practicing safe dex, Blackhat USA, URL:
http://www.strazzere.com/papers/DexEducation-PracticingSafeDex.pdf, 2012.
[43] Systems and Internet Infrastructure Security, Dare: Dalvik retargeting, URL:
http://siis.cse.psu.edu/dare/, 2012.
[44] 296/5-21-1974 U.S. Patent No 3 727 003/4-10-1973 U.S. Patent No 3 842 208/10-
15-1974 U.S. Patent No 3, 812, (apparatus for generating and transmitting digital
information), (decoding and display apparatus for groups of pulse trains), (sensor
monitoring device).
[45] John von Neumann, First draft of a report on the edvac, University of Pennsylvania
(1945).
Feature   Value
Model     HTC Desire
CPU       ARMv7 Processor rev 2 (v7l), 1 GHz
GPU       Adreno 200 (AMD Z430)
RAM       512 MB
Storage   405 MB built-in
SD card   2 GB Micro SD
OS        Android 2.3.7 Gingerbread
ROM       CyanogenMod-7.2.0.1-bravo
MISC      A-GPS, Micro USB, Camera, Bluetooth 2.1, Wifi 802.11
Feature   Value
Model     Sony Xperia Tablet SGPT121
CPU       Nvidia Tegra 3 Quad-core, 1.7 GHz
GPU       OnBoard Graphic
RAM       1024 MB
Storage   16 GB built-in
SD card   None present
OS        Android 4.1.1 Ice Cream Sandwich
ROM       Sony proprietary firmware
MISC      A-GPS, USB, Camera, Bluetooth 3.0, Wifi 802.11