Using Machine Learning to Detect Software Vulnerabilities
by
Mingyue Yang
Abstract
Using Machine Learning to Detect Software Vulnerabilities
Mingyue Yang
Master of Applied Science
Graduate Department of Computer Engineering
University of Toronto
2020
Due to the size of modern software projects, it is unscalable to protect all parts of a project through manual code auditing. Moreover, existing code analysis tools are not effective enough at finding vulnerabilities. Lightweight static
analysis tools are imprecise, while expensive symbolic execution strategies do not scale. Dynamic analysis
techniques such as fuzzing can only verify the executed paths in a program, and it is impractical to exhaustively explore every path. As a result, we propose using machine learning to detect software vulnerabilities: this approach is more
precise than lightweight static analysis, but less expensive than symbolic execution. Prediction results
from machine learning can be used to guide fuzzing. We evaluate two machine learning strategies: a coarse-grained statistical model and a fine-grained raw feature representation. The statistical model requires less data but does not capture all relationships in code. The raw feature representation learns subtle relationships in programs and does not require manual feature engineering, but needs more training data.
Acknowledgements
I would like to express my thanks to my supervisor, Professor David Lie, for his advice, guidance, and financial support, which helped me throughout my research. I appreciate his kindness, patience, and the
amount of time he spent. I am also grateful to Ivan, James, and Vasily for their insightful feedback on
this project. I would like to thank my labmates who are always willing to help. I am thankful for the
financial support provided by the University of Toronto and the Ontario Graduate Scholarship. Lastly,
I would like to thank my family for their support throughout my studies.
Contents
1 Introduction 1
1.1 Thesis Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
2 Background 3
2.1 LLVM Language Representations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.1.1 Abstract Syntax Tree (AST) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.1.2 LLVM IR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.1.3 Basic Block & Control Flow Graph . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.1.4 Program Dependences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.2 Vulnerability Databases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2.1 Version Control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2.2 CVE/NVD Database . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2.3 SARD Juliet Test Suite . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.3 Machine Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.3.1 Machine Learning Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.3.2 Evaluation Criteria . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
3 Preliminaries 9
3.1 Definition for Vulnerability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
3.2 Dataset Preparation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
3.3 Challenges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
7 Related Work 47
7.1 Code Pattern Discovery . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
7.1.1 Vulnerable Clone Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
7.1.2 Matching Vulnerable AST Patterns . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
7.1.3 Program Metrics for Vulnerabilities . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
7.2 Machine Learning for Vulnerable Code Patterns . . . . . . . . . . . . . . . . . . . . . . . . . 49
7.2.1 Automatically Learning Vulnerable Patterns . . . . . . . . . . . . . . . . . . . . . . . 49
7.2.2 Program Slicing with Machine Learning For Vulnerability Detection . . . . . . . . 50
7.3 Vulnerability Discovery with Machine Learning to Guide Fuzzing . . . . . . . . . . . . . . 51
Bibliography 54
List of Tables
List of Figures
List of Algorithms
Listings
Chapter 1
Introduction
Although software vulnerabilities have raised many concerns, security compromises resulting from them are still common. For example, the Wannacry ransomware exploited a vulnerability in Microsoft Windows, causing billions of dollars in losses [1]. Apple’s FacePalm vulnerability allowed attackers to eavesdrop on victims through self-answering FaceTime calls [2]. While there are many automated
code analysis approaches to detect software vulnerabilities, their effectiveness is limited. For these ap-
proaches, the tradeoffs between the time/computing power invested and the accuracy of analysis cannot
satisfy the need for vulnerability discovery in modern-day projects.
To detect vulnerabilities, static analysis tools such as FlawFinder [3] and Cppcheck [4] find prede-
fined vulnerability patterns in code. While these tools are lightweight, the predictions they make are
inaccurate: not all vulnerabilities match predefined rules exactly. More precise analysis tools such as
CodeSonar [5] add symbolic execution [6] to static analysis. Symbolic execution is a technique that
explores execution paths in programs and determines what an input needs to be for a part of a program
to execute. However, the cost for analysis greatly increases, as it is expensive to track relationships
between all variables that can affect targeted paths in a program. Also, as it is impractical to precisely
analyze all values for every variable, some paths found by symbolic execution are infeasible.
In contrast, dynamic code analysis techniques such as fuzzing find vulnerabilities by triggering them.
Although the triggered vulnerabilities are real, fuzzing can only check executed parts of the code. To
fully analyze a program, fuzzing has to exhaustively execute every single path in the program. This
makes the analysis intractable.
As a result, for commonly used open-source projects and commercial products, vulnerability discovery still largely depends on manual auditing, in which human developers inspect the source code to check
for vulnerabilities. However, modern software systems usually contain millions of lines of code. This
renders manual code audits impractical to protect all parts of software projects. To solve this problem,
we explore the hypothesis that there are common patterns in source code that are indicative
of the presence of security vulnerabilities.
These patterns, although detectable, are not explicitly stated. Thus, in this thesis, we propose using machine learning to identify these vulnerable code patterns. Machine learning is more precise than simple pattern-matching techniques, as it aims to find vulnerability patterns in code that are not explicitly stated. Machine learning is also more lightweight than approaches using symbolic execution. Unlike symbolic execution, machine learning simply collects features to represent the code. It does not
have to deeply analyze all relationships between instructions/variables and exhaustively solve for feasible
inputs for every single path in the program.
Predictions made by machine learning classifiers can help human reviewers limit their scope for
vulnerability search. Machine learning can also be used to help dynamic analysis techniques such as
fuzzing. Knowing which parts of the code are more likely to be vulnerable, a fuzzer does not have to explore all paths in the program, and can test the more suspicious parts more thoroughly. Also,
fuzzing can be used to automatically verify vulnerabilities found by machine learning, eliminating false
positives in predictions.
Other than the described benefits, there are also challenges for the machine learning approach to
detect software vulnerabilities. Vulnerabilities in the real world are diverse and have implicit patterns.
This means an expressive, and thus complex machine learning model is required to learn the vulnerable
patterns. However, in real-world projects, there are not many samples for each type of vulnerability. A
complex model is likely to overfit on its training data, and the patterns it learns may not be generalizable.
These challenges impose conflicting requirements that a machine learning classifier needs to satisfy.
As a result, we experiment with two different machine learning techniques: a coarse-grained statistical model and a fine-grained graph classifier. Each approach deals with one aspect of the conflicting requirements. For the coarse-grained statistical approach, we collect statistics relevant to vulnerabilities as features, and train classifiers to find vulnerability patterns in those statistical features. While the features and models are simple, this approach requires less data to perform well. On the other hand, we build a graph classifier
to learn raw features from programs. Information such as the types of instructions and the relationships
between the instructions are used to train the classifier. Although this model needs more data to work
well, its complexity allows it to find subtle patterns in code related to vulnerabilities.
Chapter 2
Background
In this chapter, we present some concepts and tools later used in this thesis. Section 2.1 explains concepts
about the LLVM language such as instructions, basic blocks, and program dependences in the LLVM IR.
These concepts are used during feature extraction for the two machine learning approaches we propose.
Section 2.2 describes the databases we collect vulnerable samples from: they contain both real-world vulnerabilities and synthetic vulnerability test cases. Lastly, Section 2.3.2 explains the evaluation criteria we later use to measure the performance of machine learning models.
2.1.2 LLVM IR
The LLVM IR [7] is an intermediate representation (IR) used by the LLVM compiler project. The
IR is a representation used by the compiler after processing the source code, but before generating
platform-dependent machine code: the LLVM compiler first parses the source code into ASTs, and then emits LLVM IR from the AST representation. The LLVM IR is usually used for code optimization. Unlike the AST, the LLVM IR preserves the semantic meaning of the program but does not contain as much syntactic information, such as tokens from the source code and code structures
like if/else statements. On the other hand, the platform-independent nature of the LLVM IR makes it
generalizable to all computer architectures. As a result, we work on the LLVM IR level to ensure our
predictions are not project-specific.
The LLVM IR consists of instructions in Static Single Assignment (SSA) form: every value is assigned at most once, which simplifies code optimization. As the IR is platform-independent, the concept of registers is not used. Rather, operands of instructions may be results of other instructions,
forming def-use relationships. For example, in Listing 2.1, by using instruction %3 as an operand,
instruction %4 adds the loaded result from %3 with the literal value 2. The add instruction %4 is also
used by a store instruction as an operand.
While no values can be repeatedly assigned, different store instructions can store values into the same
variable. In Listing 2.1, for instance, the first two store instructions can both store values (3 and %4)
into the same variable allocated by %1. The stored values can also be loaded back by load instructions.
Array accesses with literal indices are treated as separate variables. However, array accesses whose offsets are variables are all treated as the same variable, as the value of the index cannot be easily determined without precise and expensive analysis.
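As a minimal illustration at the C source level (a hypothetical example of ours, not a listing from this thesis), consider the following accesses:

/* buf[0] and buf[1] use literal indices and are tracked as separate
 * variables, while buf[i] and buf[j] use variable offsets and are both
 * treated as the same variable buf, since the index values cannot be
 * determined without precise and expensive analysis. */
void example(int *buf, int i, int j) {
    buf[0] = 1;
    buf[1] = 2;
    buf[i] = 3;
    buf[j] = 4;
}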
Def-Use Relationship: As shown in Section 2.1.2, some instructions are used by other instructions
as operands. This results in def-use relationships between instructions.
Control Flow Dependence Within Basic Block: For instructions in the same basic block, as
program execution is sequential inside a basic block, instructions have control flow dependences on their
immediate predecessor instructions.
Control Flow Dependence Across Basic Blocks: For basic blocks in the control flow graph,
a successor basic block has control flow dependence on its predecessor block. In terms of instructions,
we consider the first instruction of the successor block to be dependent on the last instruction of the
predecessor block.
In a bidirectional LSTM (BiLSTM), information flows both forward and backward. A BiLSTM
consists of two LSTMs: the outputs of the two LSTMs can be merged together with operations such as addition, multiplication, and concatenation. The output of an LSTM may be fed into other machine learning models for tasks
such as classification.
Word2Vec Skip-gram Model: The word2vec skip-gram model [12] is commonly used in natural
language processing. It generates an embedding representation for every word in the language. The skip-
gram model first finds neighbors for every word in a corpus of training text. Context size is specified to
limit the number of neighbors before and after each word. For example, a context size of 3 means that
only the 3 words before and after the target word are considered neighbors.
Every input word is represented as a one-hot embedding, whose dimension is the same as the size
of the vocabulary. The input embeddings are then fed into a neural network classifier. The output of
the classifier is also a vector, whose size is the same as the size of the input embedding. To train the
skip-gram model, each dimension of the output layer is set to 1 if its corresponding word is a neighbor
of the input word, and set to 0 otherwise. After the loss is minimized, the weight matrix at the hidden layer holds the final word embeddings. To get the embedding for a certain word, the one-hot embedding of that word can be fed into the trained neural network. The vector representation obtained at the hidden layer is
the word embedding. In Section 5.2.2 of this thesis, the word2vec skip-gram model is used to generate
embeddings for instructions in programs.
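As a concrete (hypothetical) example of the dimensions involved: with a vocabulary of 10,000 instruction tokens and a hidden layer of 100 units, the hidden-layer weight matrix has size 10,000 × 100, and the 100-dimensional row corresponding to a token serves as that token's embedding.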
The precision-recall curve (PRC) plots how precision varies with recall under different classification thresholds: as the threshold changes, one of the two metrics typically decreases while the other increases. The area under the PRC curve (PRC AUC), as its name suggests, is the area under the PRC curve and ranges between 0 and 1. PRC AUC is class-specific: a higher PRC AUC means the classifier generally has better precision and recall for a certain class. When testing on an imbalanced dataset, the underrepresented class generally has a lower PRC AUC, while the overrepresented class has a higher PRC AUC.
True Positive Rate (TPR) & False Positive Rate (FPR): The true positive rate (TPR) is
defined as the recall for the positive class. The false positive rate (FPR) is the proportion of negative
samples wrongly classified as positive over all actual negative samples. The false positive rate (FPR)
can also be considered as 1 minus the recall for the negative class.
TPR = TP / (TP + FN) = recall for the positive class
FPR = FP / (TN + FP) = 1 − TN / (TN + FP) = 1 − recall for the negative class
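For example, a hypothetical classifier that finds 8 of 10 vulnerable samples (TP = 8, FN = 2) while wrongly flagging 30 of 100 non-vulnerable samples (FP = 30, TN = 70) has TPR = 8/10 = 0.8 and FPR = 30/100 = 0.3.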
ROC AUC: The ROC curve plots how TPR varies with FPR under different thresholds (its x-axis
is the FPR, while its y-axis is the TPR). The area under ROC curve (ROC AUC) is also a number
between 0 and 1. ROC AUC is measured for both classes: one can get the ROC curve for the negative
class by flipping the axes from the ROC curve for the positive class. ROC AUC measures how well
a classifier distinguishes between two classes. As a classifier is able to better separate data from two
classes, it is easier to get higher TPR with low FPR, and thus the ROC AUC is closer to 1. In other
words, with a good ROC AUC, it is easier to get good recall values for both classes.
ROC AUC can be interpreted as the probability that the model ranks a randomly chosen positive sample above a randomly chosen negative sample. An ROC AUC of 0.5 means the classifier does no better than a random classifier. Class imbalance
in test data generally does not affect the ROC AUC much. For example, with more negative samples,
there are both more true negatives and false positives, and thus the FPR on an imbalanced set would
be similar to the FPR on a balanced set. On the other hand, true positives and false negatives are not
affected by the number of negative samples, so TPR is unchanged.
Chapter 3
Preliminaries
Before model training can start, there are a few preparations that need to be done and a number of challenges
to be considered. In this chapter, we first define what constitutes a vulnerability in this project. Then
we present both the challenges we face and the methods we use to prepare the set of vulnerabilities.
3.3 Challenges
There are a number of challenges that need to be considered before designing a strategy for vulnerability
prediction. While these challenges may not be completely resolved, different techniques we present later
can achieve better performance in some aspects by aiming to solve some of the problems discussed below.
Size and Quality of Dataset: As the proportion of vulnerable and non-vulnerable code is imbal-
anced (there is much more non-vulnerable code than vulnerable code in the real world), compared to
the number of non-vulnerable samples, there are not as many samples for vulnerable code. In fact, in
the CVE database of reported vulnerabilities, the vulnerabilities with patches available exhibit more variance and complexity than a dataset of that size can represent. While there are vulnerable samples in
synthetic datasets, they do not accurately reflect the nature of real-world vulnerabilities. This presents
challenges for machine learning: with a limited amount of vulnerable data, it is hard for a classifier to
learn representative patterns for vulnerabilities.
Another challenge is that there is no ground truth for non-vulnerable samples. It could be tempting
to assume that a piece of code in which no vulnerability has been reported is non-vulnerable. However, a
previously undiscovered vulnerability may still exist there. Expert knowledge and manual inspection
could help sanitize the data, but this would be expensive in terms of time and human effort. In fact,
while expert inspection improves the quality of the dataset, it still cannot be guaranteed that a piece of code is entirely free of vulnerabilities after human inspection. Although the mislabeled non-vulnerable samples do not add up to a significant proportion, they could affect the performance of the classifier. If
some non-vulnerable samples exhibit certain characteristics of vulnerable samples, some false positive
predictions obtained by the classifier may actually be correctly predicted vulnerable samples.
Location of Vulnerability: One observation is that the difference between vulnerable and non-
vulnerable code is usually very small. Patches for vulnerabilities usually involve changes to only a
few lines of code. The change is typically much less than the size of a function/file, which could consist
of hundreds of lines of code.
As a result, it would be ineffective to simply feed detailed features representing a large chunk of code
into a machine learning classifier. If the representation is not well selected, the small amount of code
indicative of the difference between vulnerable and non-vulnerable code would have little or even no
effect on the final representation of the code: other features such as the functionality of the code may
instead dominate the difference in representation.
In addition, it is also a challenge to accurately identify which parts of the code a known vulnerability
resides in exactly, even after knowing the lines of code changed in patches: the lines changed in a patch
may not contain complete patterns for the vulnerability. For example, a fix for a buffer overflow may be
an added/changed check before memory access: the added/modified check, although related, should not
be considered vulnerable by itself. It is not clear, simply by looking at the patch, at which point the unexpected behavior may occur. For the buffer overflow example just described, as there could be a
lot of code guarded by the check, it is hard to identify which part is actually related to the vulnerability
without careful analysis.
Some interprocedural vulnerabilities may also have several checks in different functions. For this
type of vulnerability, it may not be clear in which function or part of the code a vulnerability is in. One
function could return an incorrectly calculated offset, while another function uses the result without checking it.
For simplicity, we currently only deal with intraprocedural vulnerabilities.
Chapter 4
Coarse-grained Statistics as
Function-Level Features
Although the difference between vulnerable and patched code usually involves only a few lines of code,
there are certain patterns at a larger scale of the code that indicate whether a function is more likely to have
vulnerabilities. We aim to distinguish these vulnerability patterns with coarse-grained function-level
statistics. For our statistic-based strategy, rather than trying to directly distinguish between functions
with or without vulnerabilities, the goal is thus to predict which functions are more likely to be
vulnerable.
For every function, a feature vector is computed, each dimension of which represents a certain statistical measure of the function. This model addresses the problem of a limited dataset previously mentioned:
simpler features and models need fewer training points to work well. It is also easier to see which
features are more relevant to the result of classification. However, these statistical features are rather
coarse-grained. This means not all meanings of instructions and relationships in code are preserved. In
addition, as the features are manually picked, the performance of the model is highly dependent on the
types of features selected.
For features that count certain types of instructions or function calls, the number of those instructions/function calls is divided by the total number of instructions
in the function. We use the proportion, rather than the number of certain instructions in the function,
because for larger functions that have more lines of code (and thus more instructions), the number of all
types of instructions increases proportionally. As a result, the number of certain instructions/function
calls does not represent the overall density for a specific type of operation. Scaling is performed, because
these features are supposed to be independent of function size. Below we describe the list of features
we use.
Size of Function: The numbers of lines, basic blocks, and instructions are all included as indicators of the size of a function. Function size is included for two reasons: 1) Larger functions have more
code and thus may also have more chances for a vulnerability to occur. 2) Machine learning models can
use these features as normalizers for other features so they do not overemphasize or deemphasize certain
patterns when they occur more often in larger functions.
There are also correlations between these three features. Fewer lines of code combined with more instructions mean every line of code written is more complex. Fewer instructions relative to the number of basic blocks can also indicate more control flow paths within a function.
Loads/Stores with Pointers: Both the proportion of load/store instructions and loads/stores
involving pointers/arrays in the function are included as features. This is based on the intuition that
memory errors are usually associated with loads/stores with pointers. For example, during a buffer
overflow, the program stores data into a buffer (represented as a pointer in LLVM) without properly
checking if the end of the buffer is hit, and overwrites sensitive data. Additionally, loading values from
pointers pointing to invalid addresses could result in program crashes and denial of service vulnerabilities.
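The following hypothetical C function (an illustrative example of ours, not a sample from our dataset) shows this pattern: data is stored through a pointer without checking whether the end of the buffer has been reached.

#include <stddef.h>

/* Copies an attacker-controlled string into a fixed-size buffer.  The
 * missing bound check against sizeof(name) lets a long input overwrite
 * memory past the end of the buffer (a classic stack buffer overflow). */
void copy_name(const char *input) {
    char name[16];
    size_t i = 0;
    while (input[i] != '\0') {
        name[i] = input[i];   /* store through a pointer, no bound check */
        i++;
    }
    name[i] = '\0';
}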
For every load/store with pointers, we also track pointer offsets and load operations to find the
original pointers involved. (For example, loads/stores to array[i], array[1] and *array all correspond
to the same pointer array.) The proportion of pointers loaded/stored is then kept as features. If there
are many load/store instructions but relatively few distinct pointers loaded/stored, this means a few important pointers are frequently used and the code associated with these pointers would be more complex. These patterns may be associated with vulnerabilities: with increased complexity and denser loads/stores through pointers, it takes more effort for programmers to check for proper usage of those pointers.
Pointer Cast Operations: Cast operations are also tracked: specifically, we take into account
casts with pointers. Both casts to and from pointers are considered. These operations are included,
as improper casts can result in vulnerabilities. For example, incorrectly casting between pointers to
different data structures can lead to misalignments. Accessing fields using the casted pointers may thus
yield unexpected results.
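A hypothetical C example (ours, for illustration only) of such an improper pointer cast:

#include <stdlib.h>

struct small { int a; };
struct large { int a; int b[8]; };

/* The allocation only covers struct small, but the pointer is cast to
 * struct large; accessing fields through the casted pointer reads and
 * writes memory outside the original allocation. */
void bad_cast(void) {
    struct small *s = malloc(sizeof(struct small));
    if (s == NULL)
        return;
    struct large *l = (struct large *)s;  /* improper pointer cast */
    l->b[7] = 42;                         /* out-of-bounds write    */
    free(s);
}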
Type Conversions: The proportion of convert instructions is included. Different types of convert
instructions are treated separately: the proportion of truncations, extensions, and conversions between
different literal types are separately calculated. Truncations limit the number of bits in types, while
extensions add more bits to existing types. Conversions between literal types, as the name suggests, convert one literal type to another (e.g., float to int).
Type conversions without proper checks could lead to vulnerabilities. For example, when converting
an 8-bit char to an unsigned integer, if the char is negative, the converted unsigned integer would become
a large positive number. When this large positive number is used later, for instance, as a length variable
for copying data, the program could overwrite sensitive parts of memory outside of the buffer boundary.
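A hypothetical C sketch of this sign-extension problem (our own example, with an assumed attacker-controlled length field):

#include <string.h>
#include <stddef.h>

/* If declared_len is negative (e.g. -1 on platforms where char is
 * signed), converting it to size_t produces a huge positive value, and
 * memcpy copies far past the end of buf, overwriting adjacent memory. */
void handle_packet(const char *payload, char declared_len) {
    char buf[128];
    size_t len = (size_t)declared_len;   /* -1 becomes SIZE_MAX */
    memcpy(buf, payload, len);           /* overflows buf       */
}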
# of Lines
# of Basic Blocks
# of Instructions
Proportion of Load Instructions with Pointers
Proportion of Store Instructions with Pointers
Proportion of Pointers Loaded
Proportion of Pointers Stored
Proportion of Cast Instructions
Proportion of Cast Instructions with Pointers
Proportion of Cast to Pointers
Proportion of Cast from Pointers
Proportion of Convert Instructions
Proportion of Extension Instructions
Proportion of Trunc Instructions
Proportion of Conversion Between Literal Types
Proportion of getelementptr Instructions (GEPI)
Proportion of GEPI with first zero index
Proportion of double GEPI (loaded GEPI as operand of GEPI)
Proportion of Cmp Instructions
Proportion of Cmp Instructions with Pointers
Proportion of Cmp Instructions with NULL Pointers
Proportion of Arithmetic Op Instructions
Proportion of Add & Minus
Proportion of Multiply & Divide
Proportion of MOD
Proportion of AND & OR
Proportion of XOR
Proportion of Shift
Proportion of Branch Instructions
Maximum Depth of Nested Loop
# of Top-Level Loops
Proportion of Functions Called
Proportion of Functions Called in the Same File
Proportion of specific LIBC Calls (Multiple Features Grouped)
Pointer Index Operation (GEPI): The getelementptr instruction in LLVM calculates a memory
location with relative offsets to a pointer. It takes a pointer operand and can have many indices. The
type pointed to by the pointer operand is usually a data structure such as an array or a struct, and the data structure is indexed by the subsequent indices. The first index of a getelementptr instruction represents the offset
to the pointer operand. A first index of zero means the calculated memory location is in the element
pointed to by the operand itself.
The proportion of getelementptr instructions is included as a feature, as this indicates the frequency
of pointer operations. We also keep track of the getelementptr instructions whose first offset is zero.
This feature is kept as it is related to direct pointer dereference. We define a double getelementptr as a getelementptr instruction whose operand is loaded from the result of another getelementptr instruction. This
pattern is associated with many levels of pointer offset calculations and deep data structures. These
features are included because more frequent and complex getelementptr instructions indicate complex pointer operations in the code, which can be correlated with vulnerabilities.
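As a rough C-level illustration (our own sketch; the exact IR depends on the compiler and optimization level), nested structure and array accesses of the following form typically lower to chains of getelementptr instructions, and the inner access is a double getelementptr in the sense defined above:

struct inner { int vals[4]; };
struct outer { struct inner *items[8]; };

/* o->items[i] typically becomes a getelementptr (with a leading zero
 * index) followed by a load; the loaded pointer is then the operand of
 * another getelementptr for ->vals[j], i.e. a double getelementptr. */
int read_value(struct outer *o, int i, int j) {
    return o->items[i]->vals[j];
}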
Cmp Instructions: The proportion of cmp instructions is also included, as they are usually necessary
checks in code guarding crucial operations such as memory copying and API calls. Comparisons with
pointers and null values are also tracked. Checks involving pointers and null values would be relevant
to null pointer dereferences. A function with more of these checks usually needs more careful attention, as a single missing check could cause a vulnerability.
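A hypothetical C example (ours) of a missing check of this kind; in the IR, the absent guard would correspond to a missing icmp against a null pointer:

#include <stdlib.h>
#include <string.h>

/* The result of malloc is used without a NULL comparison; if the
 * allocation fails, memcpy dereferences a NULL pointer and the program
 * crashes, a potential denial of service. */
void store_copy(const char *src, size_t n) {
    char *dst = malloc(n + 1);
    /* missing: if (dst == NULL) return; */
    memcpy(dst, src, n);
    dst[n] = '\0';
    free(dst);
}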
Arithmetic Operations: Arithmetic operations are included as features. Different types of arith-
metic operations are separately considered. Add and minus instructions are often used as counters in
programs. Multiply and divide may be used to calculate the size or offset of memory. MOD, XOR, AND,
OR and shift operations are not as frequently used and may be associated with specific functionality
of programs such as encryption. These features, when combined with other features such as pointer
operations, may reveal patterns in code that could be related to vulnerabilities.
Branches and Loops: The proportion of branch instructions indicates the complexity of the code:
more branch instructions mean more varied execution paths. The number of loops and the maximum
depth of the loops are also included. The existence of a loop could significantly increase the number
of potential execution paths and complicate the control/data dependence of instructions involved in
the loop. Nested loops introduce dependences between loops and further increase the amount of code analysis that needs to be done in order to guarantee correctness. These features are included because there is usually a relationship between the complexity of the code and the probability of it being vulnerable. The more complex a piece of code is, the more effort it takes for human programmers/reviewers, and the more difficult it is for a program analysis tool, to fully analyze the code and ensure its correctness. This increases the likelihood that a mistake or vulnerability occurs.
Functions Called: The number of functions called is included as a feature as well. Inappropriate usage of API functions, or failure to check their results, could result in vulnerabilities. More function calls also
mean potentially more complicated interactions between functions.
To distinguish between commonly used utility API functions in a project and local helper functions
with specific functionalities, we include the number of called functions in the same file. A set of commonly
used libc functions is also tracked; these functions are related to program input/output, file reading/writing, memory operations, and string manipulation. Improper usage of these libc functions
could also lead to vulnerabilities.
We evaluate different machine learning classifiers on a set of open-source projects. Cross-validation is used for both training and evaluation.
This leads to another issue that we need to deal with: dataset imbalance. With too many non-
vulnerable samples in the training set, classifiers would prefer making predictions in the non-vulnerable
class. In the extreme case, a classifier labeling every function as non-vulnerable could still have a good
loss, but this classifier is entirely useless for us.
As a result, for training, the non-vulnerable functions are subsampled: we randomly select only part
of the samples from the non-vulnerable dataset. To reduce the number of non-vulnerable samples, we
set the size of the non-vulnerable set to 1x, 1.5x, 5x, and 10x that of the vulnerable set. More non-vulnerable samples than vulnerable ones are selected to take into account the fact that there are more non-vulnerable functions in the real world.
Evaluation Criteria for Imbalanced Dataset: In addition, to measure performance over the
imbalanced datasets, using accuracy and loss alone is not a very good measure: bad performance in the
underrepresented class may still lead to good results overall. As a result, along with accuracy, we use
recall, precision and F1 score for both classes in our evaluation. We also incorporate the area under
curve (AUC) for ROC and PRC curves to get a measure for performance under all possible thresholds
for classification.
Another limitation caused by class imbalance is reduced precision of the trained classifiers. As there are many more non-vulnerable functions than vulnerable ones in practice, even a small probability of misclassifying non-vulnerable functions as vulnerable could produce a large number of false positives. On one hand, measures such as precision and the PRC curve may thus not be as meaningful, since they would be significantly affected by even a small chance of false positive predictions. On the other hand, this issue also has practical consequences, since reduced precision would increase the amount of work spent inspecting the predicted potentially vulnerable functions in order to find real vulnerabilities. As a result, precision and recall for the top-k percent of functions in the vulnerable class may be considered, to measure the amount of inspection work saved by using these machine learning classifiers.
Machine Learning Models: We evaluate our statistical features using a variety of machine learning
models. Table 4.3 below lists the models we use. Abbreviations that we later use to represent their
corresponding classifiers are also included.
Table 4.4 above shows the performance of different classifiers we use over all projects when training
with a balanced dataset. The non-vulnerable set is subsampled, so the number of samples in the non-
vulnerable set is the same as the number of samples in the vulnerable set. Overall, while there are differences in performance, the differences between classifiers are not large. The ROC AUC is
around 0.7-0.8, and the accuracy is around 0.7. This means around 70% of the samples in the balanced
dataset are correctly classified. The classifier has 70-80% probability to rank a vulnerable sample as more
likely to be vulnerable than a non-vulnerable one. While this performance is not excellent, the classifier
can learn some vulnerability patterns and make some correct predictions on these real-world projects
we use. Although different classifiers or tuning may cause slight differences in precision and recall, these two values are almost the same: the dataset is balanced, so there is no need to significantly trade off one metric for another to get a better loss.
For all models, the vulnerable class has slightly better PRC AUC scores than the non-vulnerable class.
This indicates that using the set of features we pick, it would be slightly easier to classify vulnerable
functions than to classify non-vulnerable ones. While this could demonstrate some effectiveness of the set of features picked, another reason could be that non-vulnerable samples have more diverse functionality and characteristics than the vulnerable patterns we capture with statistical features. It may also be due to the fact that some non-vulnerable functions exhibit characteristics of the vulnerable set: a function may appear highly likely to be vulnerable, but its code may have been checked carefully enough that potential vulnerabilities were eliminated. Indeed, some non-vulnerable functions may actually be vulnerable, with an existing vulnerability that has not yet been found or reported.
Table 4.4: 10-Fold Cross-Validation On All Projects with Different Classifiers (Balanced Dataset)
Performance on Individual Projects: As functions in the same project share more common
programming patterns, we also evaluated the performance of different classifiers on individual projects.
We train and test on samples obtained from the same project, to see how well vulnerable patterns
discovered may generalize within projects. The dataset used is balanced. We use F1 score as a measure
for performance: F1 score is used as it considers both the precision and recall of the measured class.
To show the effect of slightly imbalanced datasets, we train the classifiers with non-vulnerable datasets
both 1 and 1.5 times the size of the vulnerable dataset. Note that the number of vulnerable samples for
all projects is previously shown in Table 4.2. The evaluated results are displayed in Table 4.5.
In Table 4.5, all classifiers have better F1 scores for individual projects, compared with general
cross-validation among all 5 projects in Table 4.4. While this indicates that vulnerabilities within one project are less diverse and easier to learn and capture, it also presents a challenge for finding
vulnerabilities in new projects using existing data. We will evaluate this problem later in Section 4.2.3.
The performance of the classifiers on FFmpeg, Qemu and Wireshark is fair. With more data, Qemu
does better than FFmpeg and Wireshark. It can also be seen that Imagemagick has the best performance
across all classifiers. Its F1 scores are around 0.8-0.9. On the other hand, while Linux has the largest number of vulnerable samples (more than half of the total vulnerable samples we obtained), it has the worst intra-project evaluation, with an F1 score below 0.7 overall for the balanced dataset. While the set of obtained vulnerabilities in Linux is large, we believe that the patterns in those vulnerable samples are too diverse and are not well captured, compared to projects like Imagemagick. This may also result from the fact that errors in kernel programming have more catastrophic effects, such as crashes, and that code in Linux has higher quality: while certain functions may be more prone to vulnerabilities, the vulnerabilities in them have already been eliminated.
Difference in Classifiers: Different classifiers have different performance for both cross-fold vali-
dation and training within individual projects.
The random forest classifier achieves the best performance across almost all measures. While the
above shows results from mostly balanced training, the performance for the random forest is consistently
good with all different rates of subsampling, even when training with an imbalanced dataset: random
forest’s sampling method makes it perform better on imbalanced datasets. Interestingly, the random tree classifier, which can be considered a building block of the random forest, performs almost the worst among all models. For a random tree, the k randomly chosen features for the decision tree may not cover the most important features relevant to vulnerability patterns, whereas the random forest averages predictions over different trees and thus takes the most relevant features into account.
Another reason that the overall performance of random forest is better than that of classifiers such as the neural network may be that we have a large number of features but a small vulnerability dataset. The ability of random forests to handle a large number of features leads to their good overall performance. (Indeed, in our experiments with fewer selected features and lower performance, random forest does not perform the best.) With more vulnerable samples collected in the future, random forest may or may not remain the best algorithm to choose.
Similarly, the Bayesian network generally performs better than Naive Bayes, both in cross-validation and intra-project training: the Bayesian network takes into account the dependences between features, while Naive Bayes assumes the features are independent. The ability to learn correlated patterns between the features we picked results in better performance for the Bayesian network. While the Bayesian network does not outperform Naive Bayes by much, this shows that some correlations among the statistical features need to be captured in order to find vulnerable patterns.
For individual projects, while the neural network works better than the simple logistic regression classifier in most projects, there is not much difference between the performance of the two classifiers. As logistic regression can be considered similar to a one-layer neural network, this means the relationship between the picked statistical features is not too complex, and simple logistic regression can also work well. The reason the neural network may not work as well overall may be that the power of a neural network comes from its ability to learn complex patterns from large amounts of data. However, with a small amount of vulnerable data and coarse-grained statistical features, the neural network’s performance is limited. On the other hand, most of the vulnerable samples we collect are from Linux, and the fact that the neural network does not work as well on Linux affects its overall performance on all projects.
From Table 4.5, with more non-vulnerable samples used, all classifiers have better performance on
the non-vulnerable class and decreased performance in the vulnerable class in most cases. The exception
is for naive bayes: naive bayes does not work well with Wireshark on a balanced dataset, but it works
better with a sampling rate of 1.5x for the non-vulnerable dataset. This result may require further
inspection.
Training with Imbalanced Dataset: To demonstrate how the training works with imbalanced
datasets and varied proportions of non-vulnerable samples, we use different subsampling rate for the
non-vulnerable dataset: the size of the non-vulnerable dataset is kept as 1x, 1.5x, 3x, 5x, 10x the size
of the vulnerable dataset respectively. Both the training set and the test set are imbalanced: the ratio
of non-vulnerable samples to vulnerable samples in both the training set and the test set are kept the
same.
Table 4.6 demonstrates the effect of different subsampling ratios on the performance of the ran-
dom forest classifier. Random forest is picked as it performs better than other classifiers with varying
subsampling rates. Cross-fold validation is also applied.
With gradually increasing class imbalance, there is more chance of getting a correct guess by simply
classifying a function as non-vulnerable when it is unclear how to distinguish between the two classes.
This increases the false positives for the non-vulnerable class, and explains the higher recall than precision in the non-vulnerable class. For similar reasons, the random forest classifier trades off the
recall of the vulnerable class for precision: as ambiguous samples are more likely to be classified as non-
vulnerable, the classified vulnerable samples are more likely to be truly vulnerable (increased precision),
while not all vulnerabilities may be covered (decreased recall).
On the other hand, as classifiers gain more non-vulnerable samples, they learn more patterns in
that class. As a result, it is easier to classify non-vulnerable samples than vulnerable ones: accuracy
is thus no longer a good indicator for both classes. This is also reflected by the PRC AUC score:
while the subsampling rate increases, PRC AUC decreases for the vulnerable class, and increases for the
non-vulnerable class. The precision and recall of the non-vulnerable class are thus higher.
While the ROC AUC value is stable across all subsampling rates and shows how well a classifier distinguishes between the two classes, it aggregates results over both the vulnerable and non-vulnerable classes: ROC AUC does not show how the classifier performs on each class individually.
We also measure the precision and recall of the vulnerable class covered by the top-K percent of functions that are most likely to be vulnerable. The top-K percent recall/precision is used to show how much effort one saves when finding vulnerabilities using the results prioritized by the classifiers: higher recall means one needs to examine fewer functions to cover most vulnerabilities in the project, while higher precision means
one needs to inspect fewer functions to find one vulnerability. In Figure 4.1, we present top-K precisions
for the vulnerable class across different projects. Similarly, top-K recalls for the vulnerable class are
shown in Figure 4.2. In those figures, the y-axis represents precision/recall value between 0 and 1 (For
example, in Figure 4.1a, 0.125 on the y-axis means 12.5%).
Figure 4.1: Top-K Percent Precision for Cross-Project Evaluation (Random Forest)
The machine learning models achieve both higher precision and higher recall than random selection on all projects. Training with different subsampling rates for the non-vulnerable set mostly does not lead to a large difference in top-K precision and recall, and subsampling could be employed to increase the overall accuracy. For most projects, the top-K percent precision curve smoothly decreases
as more functions are inspected.
Imagemagick: Imagemagick has the best performance overall. Along with a comparable recall
curve, Imagemagick also has a steeper precision curve, with higher precision concentrated in a small number of the top-ranked predictions. This means most of the top-ranked vulnerable samples are indeed
vulnerable when evaluating on our Imagemagick dataset. Considering the fact that Imagemagick has
the best intra-project performance, it seems the vulnerability patterns in Imagemagick are more general
and easier to capture with our statistical features.
FFmpeg and Wireshark: FFmpeg and Wireshark also both have decent performance. As Wireshark contributes a small percentage of the vulnerable samples among all projects, evaluation on Wireshark has lower precision for the top-K percent of vulnerable samples. However, compared with random selection, machine learning still improves the prediction performance on Wireshark by a similar factor.
Qemu: However, Qemu is an exception. Qemu does not have as good precision and recall
as other projects. Qemu’s first several top-ranked vulnerable functions do not cover as high precision
Figure 4.2: Top-K Percent Recall for Cross-Project Evaluation (Random Forest)
as other projects do. The smoothly decreasing pattern on the precision curve in other projects is not
as obvious in Qemu. For Qemu, with a sampling rate of 1x and 1.5x, this pattern shows up with 20%
or more top-ranked vulnerable functions, while with sampling rates higher than 3x, it takes 40-50%
of the ranked functions to get this pattern. This means while there are some similarities between the
vulnerable patterns in Qemu and in other projects, the most obvious patterns for other projects do not
generalize to Qemu.
With lower overall performance, the subsampling rate starts to have an effect for Qemu. With different subsampling rates, evaluation on Qemu exhibits different precision and recall, and its precision curves over the top-ranked samples are not consistent. With a sampling rate of 1.5x or higher and a small percentage of the top-ranked functions, Qemu has roughly better precision, while its recall is just above random selection. Sampling with 1x does not work for the first several top-ranked functions, but its precision and recall start to spike up as more vulnerable functions are included.
Although the vulnerable patterns in Qemu do not generalize, non-vulnerable samples may share some
similar characteristics with functions in other projects. As previously discussed for imbalanced datasets,
with more non-vulnerable samples included, the classifiers are more likely to predict an ambiguous function as non-vulnerable. This increases the precision and decreases the recall of the vulnerable class. However, more non-vulnerable samples would decrease the precision in the vulnerable class as more top-ranked functions are included. One reason might be that the classifier learns patterns in the non-vulnerable class better than in the vulnerable class. Thus for Qemu, overall, using a non-vulnerable dataset 1.5x as large as the vulnerable set generally yields better precision and recall curves than training with non-vulnerable datasets of other sizes.
Linux: Linux also does not achieve as good a recall curve compared to other projects. While projects
such as FFmpeg and Wireshark can cover 80% of all vulnerable functions when inspecting 30-40% of
top-ranked vulnerable samples, Linux can only achieve around 60% recall with 40% of the top-ranking
samples. This results from the fact that Linux contributes more than half of the vulnerable functions
in our dataset: when testing on Linux, we need to train on other projects and cannot use samples from
Linux. As a result, there is not as much training data compared to the case when testing on other
projects. More samples may help the performance of our classifiers.
Another reason is that the types of vulnerabilities are more diverse in Linux. As shown in Table 4.5, although Linux has more samples within the project, it does not achieve as good performance as other projects when training on samples from Linux itself. However, compared with cross-project evaluation on Qemu, Linux still exhibits better precision and recall curves overall. This means that while the types of vulnerabilities within Linux could be more diverse, they still share similar patterns with vulnerabilities in other projects. On the other hand, while the patterns in Qemu are relatively easier to capture (Qemu has better intra-project evaluation results), the vulnerability patterns for Qemu do not generalize to other
projects we use.
Although project types have an effect on cross-project training, functionalities of the projects do not
dominate the performance of the classifiers. While some vulnerabilities are project-specific, patterns of
vulnerabilities are not limited to the functionalities of the projects.
Practical Usage: While the machine learning classifiers perform better than random selection and reduce the amount of work needed to check for vulnerabilities in a project, a substantial amount of work is still needed to inspect the functions ranked as most likely to be vulnerable.
1) Although the machine learning classifier eliminates some non-vulnerable samples, the precision of vulnerability prediction also decreases if the functions inside the project are less likely to be vulnerable. As the number of vulnerabilities inside a project is usually not high, this could require inspecting a large number of functions before finding a vulnerable one. On the other hand, to cover most vulnerabilities in a project, it is still necessary to analyze a sizable percentage of all the functions in
the project. While the classifier saves users’ work, we still expect manual analysis will be required for a
good proportion of the project to eliminate most vulnerabilities.
2) Although the statistic-based features have higher precision and recall for some projects, this does
not mean it is easier to discover real vulnerabilities with machine learning classifiers. As the top-ranking functions generated may be both larger and more complex (Section 4.4), it is much harder both to determine where a vulnerability is within a function and to guarantee that a function is mostly free of vulnerabilities. This holds both for human reviewers and for program analysis techniques. While
large, complex functions exhibit patterns that are easier for the classifiers to find, this also means more
work to analyze the predicted functions. This issue is further inspected in Section 4.4.
3) A user may also take into account the precision-recall tradeoff when using the machine learning
classifier. To cover more vulnerable functions, one may need to inspect more possibly vulnerable functions
predicted. But as the user inspects more functions, it is less likely for a newly inspected function to be
vulnerable: the precision of the classifier decreases when more lower-ranked functions are checked.
Table 4.7: Ordered Feature Coefficients for Logistic Regression; Subsampling: 1x
Table 4.8: Ordered Feature Coefficients for Logistic Regression; Subsampling: 1.5x
Table 4.9: Ordered Mean Decrease Impurity for Random Forest; Subsampling: 1x
Table 4.10: Ordered Mean Decrease Impurity for Random Forest; Subsampling: 1.5x
For logistic regression, the coefficients learned by the model are used to measure feature importance. The coefficients of logistic regression can be considered weights for the features. A large coefficient means its corresponding feature is important: a small change in the feature results in a large change in the weighted sum, and
thus the prediction result. A positive coefficient means its corresponding feature is positively correlated
to the vulnerable class: the larger its absolute value is, the more likely a function would be vulnerable.
Vice versa, if a coefficient is negative, a larger absolute value means the function is less likely to be
vulnerable.
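For reference, this interpretation follows from the standard logistic regression model (notation ours): the predicted probability that a function is vulnerable is P(vulnerable | x) = σ(b + Σ_i w_i x_i), with σ(z) = 1 / (1 + e^(−z)), so increasing a feature x_i with a positive coefficient w_i pushes the prediction toward the vulnerable class, while a negative w_i pushes it away.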
For random forest, we use mean decrease impurity. The mean decrease impurity for a feature f can
be considered as the gain in performance/purity when feature f is used. A node in a random
tree contains a decision based on one or several features, which splits the input dataset. The impurity
for a node in a random tree is the probability for that node to misclassify a sample, given by the Gini impurity Σ_{k ∈ classes} P_k (1 − P_k), where P_k is the proportion of items labeled with class k at that node. In this case, the
decrease in impurity between a parent node and its child node is considered. The decrease in impurity for
a feature is the sum of the decrease in impurity for nodes related to the decisions made by the feature,
weighted by the probability of reaching every node (can be estimated by the proportion of training
samples reaching that node). The mean decrease impurity for a feature is the decrease in impurity
averaged over all random trees in the forest.
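As a small hypothetical example, a node containing 40% vulnerable and 60% non-vulnerable samples has impurity 0.4(1 − 0.4) + 0.6(1 − 0.6) = 0.48; a split on some feature that produces purer child nodes lowers this value, and the (weighted) reduction is credited to that feature.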
Both the coefficients for logistic regression and the mean decrease impurity for random forest are
ranked by their absolute values, as shown in Table 4.7, 4.8, 4.9 and 4.10.
Logistic Regression: For logistic regression, features such as maximum depth of loop and number
of instructions, which are related to the function size and complexity, do not affect the prediction
result as much. Loads with pointers make a vulnerability more likely. Contrary to our assumption, logistic regression treats the proportion of callees as a weak feature negatively correlated with vulnerabilities. In addition, the number of loops and the depth of loops are weakly correlated with vulnerabilities: a vulnerable function tends to have deeper loops, but fewer top-level loops.
Some related features have conflicting implications. Conversion, cast, and comparison with pointers
are all important features, but some subtypes in conversions/casts make a function more likely to be
vulnerable, while others make it less likely to be vulnerable. More extension and truncation indicates
a vulnerability is more likely to occur, but more conversion between literal types means a vulnerability
is not likely to occur. Similarly, comparison with pointers is negatively correlated with vulnerabilities,
but comparisons with null values, which can be considered as a type of comparison with pointers,
are positively correlated. More cast operations mean a vulnerability is less likely to occur, but cast
from/to/between pointers (which are part of cast operations) all mean the corresponding function is
more likely to be vulnerable. It is not very clear what these patterns mean, as coefficients of related features cancel out some of each other's effects. While our features are coarse-grained, logistic
regression still attempts to capture relationships between the statistical features.
Because the model learns conflicting correlations between related features, different subsampling
rates can give some groups of features completely opposite correlations with vulnerabilities (a flip in the
sign of their coefficients). With a subsampling rate of 1, the proportion of pointers stored is negatively
correlated with vulnerabilities, while the proportion of stored pointers is positively correlated. However,
with a subsampling rate of 1.5, the reverse happens: the proportion of pointers stored is positively
correlated, while the proportion of stored pointers is negatively correlated with vulnerabilities. A similar
flip also occurs for the proportion of gepi instructions and of double gepi operations.
Random Forest: For random forest, different subsampling rates do not affect the decrease in impurity
as much. Unlike logistic regression, random forest considers function size an important
type of feature. The number of basic blocks and the number of lines in a function are ranked highly. The
number of instructions does not appear to be as important, possibly because this information can be
approximated using the number of lines and basic blocks in the function. Likewise, complexity-related
features such as the proportion of branches and loop-related features (top-level loops and depth of loops)
have a larger effect on random forest than on logistic regression.
The random forest classifier also identifies both the proportion of loads/stores and the proportion of
pointers loaded/stored as important features. The proportion of comparisons is also ranked highly. In addition,
compared to logistic regression, the proportion of callees has a more significant effect on the random
forest: the proportions of callees in total and in the same file are both ranked highly. While more callees
in the same file imply more callees in total, these features are not interchangeable: the
random forest learns some patterns that require both features to identify (e.g., more callees in the same
file, but fewer callees in total).
Mean decrease impurity for random forest does not identify pointer casts as significant. However,
this does not necessarily mean these features are unimportant. It is possibly because there are many
types of pointer cast features: when one such feature is omitted, some of its information can be inferred
from other related pointer cast features.
Unlike logistic regression, the random forest classifier does not rank the proportion of conversions
with literal types as highly. This is possibly because these features have large but conflicting
coefficients in logistic regression, and are not as significant when considered separately. In addition,
for the random forest, arithmetic-related features are the least important. This is somewhat consistent with
the result from logistic regression: for logistic regression, the proportion of arithmetic operations has a
coefficient with small absolute value, while the coefficients of related features such as add/minus and
multiply/divide cancel out each other.
Overall: For both logistic regression and random forest, the proportion of libc calls is not important.
This is possibly because some open-source projects use project-specific API functions instead of calling
libc functions directly. For both models, the proportion of comparisons (with/without pointers) is also relatively
more important compared to other groups of features. Some other features appear to be important
for one model, but not as important for the other.
While some related features for random forest are ranked similarly, its top-ranked features are more
diverse than those of logistic regression. This is possibly because logistic regression relies on conflicting
coefficients to learn vulnerability patterns.
Table 4.11 compares the average number of instructions for the top-50 ranked vulnerable
functions and for all functions in the evaluated projects. For both Figure 4.3 and Table 4.11, the number
of non-vulnerable functions in the training set is randomly sampled to be 1.5 times the number
of vulnerable ones. In Figure 4.3, random forest is used for classification, and in Table 4.11, logistic
regression is evaluated. While the observed patterns generalize to all sampling rates and classifiers,
for simplicity, we only show results for these classifiers and sampling rates. The range of the y-axes in
Figure 4.3 is limited in order to better show the general pattern across all samples: for some functions with a
large number of instructions, their bars extend above the top of the graphs.
Figure 4.3: Number of Instructions in Top-K Vulnerable Predictions for Cross-Project Evaluation
(Random Forest)
Table 4.11: Average Number of Instructions for Cross Project Evaluation (Logistic Regression)
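The size comparison in Table 4.11 is straightforward to reproduce once per-function prediction scores and instruction counts are available; the following is a minimal sketch with hypothetical data.

    # Sketch: compare the average instruction count of the top-50 ranked
    # functions against the average over all functions (values hypothetical).
    import numpy as np

    def top_k_size_ratio(scores, num_instructions, k=50):
        """scores[i] is the predicted vulnerability score of function i,
        num_instructions[i] its instruction count."""
        order = np.argsort(scores)[::-1]       # most vulnerable first
        top_k = order[:k]
        avg_top_k = np.mean(num_instructions[top_k])
        avg_all = np.mean(num_instructions)
        return avg_top_k, avg_all, avg_top_k / avg_all

    scores = np.random.rand(1000)
    num_instructions = np.random.randint(5, 2000, size=1000)
    print(top_k_size_ratio(scores, num_instructions))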
Across all cross-project evaluations in Figure 4.3, the ranked vulnerable functions have a generally
decreasing number of instructions, as if they were roughly sorted by function size. While there are
some small functions ranked highly, most large functions are ranked as more likely to be vulnerable.
Additionally, in Table 4.11, for most projects, the top-50 most vulnerable functions are around 6-7
times as large as the average function. For Wireshark, the top-50 ranked functions have more
than 90 times the number of instructions of an average function. Although function size is the least
important group of features for logistic regression (Section 4.3), logistic regression is also more
likely to predict larger functions as vulnerable.
This means users still need significant effort in order to analyze the predicted vulnerable functions.
Although the total number of inspected functions is reduced by using a machine learning classifier, the
time spent inspecting every function increases. As Table 4.11 shows, top-ranked functions have more
instructions. With more instructions, the program dependences between instructions can
also be more complex. Checking the top-50 ranked vulnerable functions could thus require far more
time and effort than checking 50 randomly picked functions. This is a limitation of our statistics-based
approach.
4.5 Summary
In conclusion, with function-level statistical features, some coarse-grained vulnerability patterns can be
captured. While the relationships between statistical features learned by the classifiers are not complex,
the ability to find correlations between features helps classification. With the small amount of vulnerable
data we obtain, some similarities and differences of vulnerability patterns across projects can be roughly
captured by statistical features. Some inaccuracies and the inability to learn general patterns may result
from both the quality of the dataset and the coarse nature of the statistical features.
When evaluating the importance of features, we find that features differ in importance between
logistic regression and random forest. Both classifiers attempt to learn correlations between
related features. However, for all models trained with coarse-grained statistical features, more complex
functions are more likely to be predicted as vulnerable. As classifiers are more likely to predict large and
complex functions as vulnerable, more work is required to inspect the predicted samples. This presents
a limitation of the statistics-based method and demonstrates a need to locate vulnerabilities at a
finer scale.
Chapter 5
Raw Feature Representation
To overcome the shortcomings of the statistics-based function-level features, we try raw feature representations
generated directly from code. This eliminates the need for manual feature engineering and allows classifiers to learn
patterns that are not recognizable by humans. More fine-grained relationships such as control and data
flow can also be represented. However, with raw features, it is not obvious which patterns are more
relevant to the classification results, or what the model actually learned. This strategy also requires more
data to prevent overfitting, as the complexity of the model increases. It is also computationally more
expensive.
This project extracts features from the LLVM IR [7]. The LLVM IR, the intermediate representation
used by the LLVM compiler to represent source code, keeps semantically relevant information. We work
at the IR level instead of on the AST, as the IR removes syntactic features that may not be transferable across
projects. Also, as the IR is architecture-independent, working at the IR level allows predictions to be made for all
hardware platforms.
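For reference, human-readable LLVM IR can be emitted from a C file with clang; the exact flags used in this work are not specified, so the following is only an illustrative invocation (unoptimized IR with debug information).

    # Sketch: emit human-readable LLVM IR (.ll) from a C source file with clang.
    import subprocess

    def emit_llvm_ir(c_file: str, out_file: str) -> None:
        subprocess.run(
            ["clang", "-S", "-emit-llvm", "-g", "-O0", c_file, "-o", out_file],
            check=True,
        )

    emit_llvm_ir("example.c", "example.ll")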
Program slicing traverses the program dependence graph using BFS along both control and data dependence edges, starting at the
slicing point. All traversed instructions are added to the program slice. There are two directions for
program slicing: forward slicing and backward slicing. Forward slicing traverses only along forward
dependence edges, while backward slicing traverses only along backward dependence edges.
In this project, forward slicing and backward slicing are performed separately in order to obtain the
context before the target instructions and the code that is affected by the target instructions. The
instructions traversed during both forward and backward slicing are put together to generate a program
slice. The number of instructions included by forward and backward slicing is limited, so the
scope of included code does not grow arbitrarily large when the function containing the slicing point is large.
For the implementation, we use the program slicing tool dg [14], which works on LLVM.
Consider the example code used for program slicing in Listing 5.1. Every line of C code
corresponds to LLVM IR instructions (we use C code here instead of LLVM IR because C code is easier to
read). If we slice on line 5 (b = 4), lines 2, 4, 5, 10, 11, 13 and 14 are included in the slice.
For backward slicing, line 4 is first included, as the if condition (a>0) guards b = 4, and thus line 5
has a control dependency on line 4. The slicing algorithm inspects dependences for the instruction on
line 4: line 2 (a = 3) is added, because (a>0) has a data dependency on a = 3.
For forward slicing, line 10 is added, as (b>2) has data dependence on b = 4. Then the slicing
algorithm finds line 11 and line 13 that have control dependence on line 10. Lastly, line 14 (c = 4 + d)
is added as it has data dependence on line 11.
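For illustration, a toy version of the bounded traversal described above is sketched below; the actual slicing is performed by dg, and the graph structures and the instruction limit here are hypothetical.

    # Sketch of the bounded slicing traversal: BFS along forward and backward
    # dependence edges, merged into one slice, with a cap on the slice size.
    from collections import deque

    def bounded_slice(edges, start, limit):
        """BFS from `start` along `edges`, returning at most `limit` instructions."""
        visited = {start}
        queue = deque([start])
        while queue and len(visited) < limit:
            node = queue.popleft()
            for nxt in edges.get(node, []):
                if nxt not in visited:
                    visited.add(nxt)
                    queue.append(nxt)
                    if len(visited) >= limit:
                        break
        return visited

    def program_slice(forward_edges, backward_edges, slicing_point, limit=50):
        # Forward and backward slices are computed separately and merged,
        # so neither direction can grow arbitrarily large.
        fwd = bounded_slice(forward_edges, slicing_point, limit)
        bwd = bounded_slice(backward_edges, slicing_point, limit)
        return fwd | bwd

    # Toy dependence graph for the Listing 5.1 example (line numbers as nodes).
    backward = {5: [4], 4: [2]}
    forward = {5: [10], 10: [11, 13], 11: [14]}
    print(sorted(program_slice(forward, backward, 5)))   # [2, 4, 5, 10, 11, 13, 14]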
The implementation of this technique is built upon the slicing tool dg [14] with some modifications.
When collecting neighboring instructions as context for embedding training, we do not distinguish one edge type from another: all neighbors reachable through data/control/use-def relationships
are treated the same. We use the default embedding dimension of 200, as presented by inst2vec [15],
and found 5 epochs to be a good number for training.
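As a rough illustration only (the thesis follows inst2vec's training procedure, which this sketch does not reproduce exactly), 200-dimensional skip-gram embeddings over instruction "sentences" built from graph neighborhoods could be trained with gensim (version 4 or later); the sentences and hyperparameters below are placeholders.

    # Illustrative only: skip-gram training of 200-dimensional instruction
    # embeddings with gensim; the real training pipeline may differ.
    from gensim.models import Word2Vec

    # Each "sentence" pairs a normalized instruction with its graph neighbors
    # (a context size of 1 corresponds to directly connected instructions).
    sentences = [
        ["load i32* <ID>", "store i32 <ID> i32* <ID>"],
        ["store i32 <ID> i32* <ID>", "icmp sgt i32 <ID> <INT>"],
    ]

    model = Word2Vec(
        sentences,
        vector_size=200,   # embedding dimension used in the thesis
        window=1,          # context size 1
        min_count=1,       # the thesis discards instructions seen < 300 times
        sg=1,              # skip-gram
        epochs=5,          # 5 training epochs
    )
    print(model.wv["load i32* <ID>"].shape)   # (200,)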
To ensure there are enough instruction occurrences to make the learned embeddings representative,
instructions that occur fewer than 300 times are discarded. The embedding of every discarded instruction
is instead calculated as the average of the embeddings of all instructions with the same opcode as the
discarded one. If there is no calculated instruction embedding with the same opcode, the instruction
is replaced by an embedding of all zeros.
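A minimal sketch of this fallback, with hypothetical count and embedding tables:

    # Sketch: rare instructions (< 300 occurrences) fall back to the average
    # embedding of same-opcode instructions, or a zero vector if none exists.
    import numpy as np

    def fallback_embedding(instruction, counts, embeddings, opcode_of, dim=200):
        if counts.get(instruction, 0) >= 300:
            return embeddings[instruction]
        same_opcode = [embeddings[i] for i in embeddings
                       if opcode_of(i) == opcode_of(instruction)]
        if same_opcode:
            return np.mean(same_opcode, axis=0)
        return np.zeros(dim)

    counts = {"load i32* <ID>": 1200, "load i64* <ID>": 12}
    embeddings = {"load i32* <ID>": np.ones(200)}
    opcode = lambda inst: inst.split()[0]
    print(fallback_embedding("load i64* <ID>", counts, embeddings, opcode)[0])  # 1.0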
For every question in the semantic analogy test (a:b; c:?), our tester program first evaluates the
target analogical embedding (b - a + c). It then finds the 5 instructions closest to the evaluated embedding
in the instruction set under test. If the target answer instruction is among the 5 instructions found, the
question is counted as correctly answered. The result of an analogy test is the percentage of questions
answered correctly, shown in Table 5.2. This technique is also used to evaluate the inst2vec
tool [15].
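A minimal sketch of this analogy check, assuming instruction embeddings are stored in a dictionary and using cosine similarity as the closeness measure (the thesis does not specify the distance metric for this test):

    # Sketch: for a question a:b ; c:?, compute b - a + c and check whether the
    # expected instruction is among the 5 nearest embeddings.
    import numpy as np

    def answer_in_top5(emb, a, b, c, expected):
        target = emb[b] - emb[a] + emb[c]
        def cos(u, v):
            return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-9))
        ranked = sorted(emb, key=lambda inst: -cos(emb[inst], target))
        return expected in ranked[:5]

    def analogy_score(emb, questions):
        """questions: list of (a, b, c, expected); returns fraction answered correctly."""
        correct = sum(answer_in_top5(emb, *q) for q in questions)
        return correct / len(questions)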
In the inst2vec paper [15], there are many types of semantic analogies (data structure, conversion, and
syntactic analogies). For data structure analogies, the instruction pairs use types of different data structures.
Instruction pairs for conversion analogies are instructions converting between different operand
types. Instruction pairs for syntactic analogies have different debug information that we do not track.
One example of a data structure analogy uses struct types; since we use %STRUCT_TYPE to represent
structs and do not keep specific struct contents such as { double, double } in our instruction
representation, this example is not used.
We do not use the data structure semantic analogies and syntactic analogies presented in the inst2vec
paper [15]. This is because our representation omits features that may not represent
vulnerability patterns: the contents of struct types and syntactic features such as options are not included
(Section 5.2.1). The conversion analogy tests we use are slightly different from the analogy tests used by
inst2vec, because not all instructions provided by the inst2vec tests are present in our set
of instructions. We thus use the available tests from inst2vec and generate some additional tests ourselves,
specific to the instructions we evaluate.
2) In semantic distance tests, we test whether instructions that perform similar operations (with the same
opcode) are close to each other in the embedding space (e.g., a load instruction should be more similar to
other load instructions than to store instructions). Instructions with the same opcode are grouped together
into the same category. For every instruction, the distances between the evaluated instruction and
instructions both inside and outside its corresponding category are compared (the distance between embeddings
can be calculated using the dot product). For every category, a score tracks the percentage of instruction
pairs inside the current category with a smaller distance than instruction pairs outside of the category.
The overall performance for the semantic distance test is the score averaged across all categories. Our semantic
test is the same as the semantic tests used by inst2vec [15].
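A minimal sketch of this per-category scoring, with hypothetical embeddings and opcode categories (dot product used as the similarity measure, so larger values mean closer):

    # Sketch: for each opcode category, score the fraction of same-category
    # instruction pairs that are closer than pairs crossing the category boundary.
    import numpy as np
    from itertools import product

    def category_score(emb, category, others):
        closer, total = 0, 0
        for a, b in product(category, category):
            if a == b:
                continue
            inside = emb[a] @ emb[b]
            for c in others:
                total += 1
                if inside > emb[a] @ emb[c]:
                    closer += 1
        return closer / total if total else 0.0

    def semantic_distance_score(emb, categories):
        """categories: dict opcode -> list of instructions; returns average score."""
        scores = []
        for op, insts in categories.items():
            others = [i for o, group in categories.items() if o != op for i in group]
            scores.append(category_score(emb, insts, others))
        return float(np.mean(scores))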
To select the best performing set of instruction embeddings, training results are tuned with different
types of edges and different context sizes. We find that the set of embeddings trained with neighbors reachable by
all edge types and a context size of 1 achieves the best performance. This set of instruction embeddings
is thus used for the vulnerability prediction tasks in later sections. During tuning, we found that some types
of instructions work better with certain types of neighbors and context sizes. While one could train
different instructions with different tuning parameters, this is left for future work.
Although some evaluation strategies we use may be similar to strategies used in natural language
processing, learning instruction representations on programming languages is a much harder problem,
and we should not expect instruction embedding training to achieve similar results as word embedding
training. Thus, the evaluation result for our chosen set of instruction embeddings and the result shown
in the inst2vec paper [15] are listed in Table 5.2. Note that our goal is not to compare our performance
with inst2vec. In fact, we train on a different set of projects and use different analogy tests. Rather,
we demonstrate the validity of our training result by showing that our instruction embedding achieves
comparable performance and can be used later for our vulnerability prediction task in Section 5.3.
A dense neural network takes as input the vector representation of the entire graph and outputs whether a sample
is vulnerable or not.
To deal with dataset imbalance, we use an oversampling rate of 5 to obtain more vulnerable samples
(every vulnerable sample is repeated 5 times in the training set). The non-vulnerable dataset is subsampled
so that it has a similar size to the oversampled vulnerable dataset. This is done in order to obtain a roughly
balanced dataset for training. Cross-validation is performed to remove the randomness of selecting
samples. The performance is shown in Table 5.3.
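A minimal sketch of this balancing step, with hypothetical sample lists:

    # Sketch: repeat each vulnerable sample 5 times and subsample the
    # non-vulnerable samples down to roughly the same size.
    import random

    def balance(vulnerable, non_vulnerable, oversample=5, seed=0):
        random.seed(seed)
        over = vulnerable * oversample                      # repeat vulnerable samples
        sub = random.sample(non_vulnerable, min(len(over), len(non_vulnerable)))
        data = [(s, 1) for s in over] + [(s, 0) for s in sub]
        random.shuffle(data)
        return data

    train = balance(["vuln_slice_%d" % i for i in range(10)],
                    ["benign_slice_%d" % i for i in range(500)])
    print(len(train))   # 100: 50 oversampled vulnerable + 50 subsampled non-vulnerable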
The graph classifier works well on the training set. On the almost balanced training set, the classifier
achieves slightly more than 80% precision and recall for the vulnerable class. While there may be some
tradeoff between recall and precision, its PRC AUC score for the vulnerable class indicates that the
model can obtain considerable recall and precision across changing thresholds. With more than 80% ROC
AUC, the classifier is also able to distinguish between the vulnerable and non-vulnerable classes.
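For reference, ROC AUC and PRC AUC values of the kind reported in Table 5.3 can be computed with scikit-learn; a minimal sketch with hypothetical labels and scores:

    # Sketch: compute ROC AUC and PRC AUC from the classifier's scores for the
    # vulnerable class (average precision is used here as the PR-curve area).
    import numpy as np
    from sklearn.metrics import roc_auc_score, average_precision_score

    y_true = np.array([1, 0, 1, 1, 0, 0, 0, 1])
    y_score = np.array([0.9, 0.2, 0.7, 0.4, 0.35, 0.1, 0.6, 0.8])

    roc_auc = roc_auc_score(y_true, y_score)
    prc_auc = average_precision_score(y_true, y_score)
    print(roc_auc, prc_auc)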
Its performance on the test set is not as good as on the training set. On a balanced test set, the PRC AUC
and ROC AUC for the classifier are fair. To estimate how it may perform on real-world datasets with more
non-vulnerable data, we also test the classifier with more non-vulnerable samples: non-vulnerable
slices from all the open-source projects we use are added to the test set. While the classifier retains
the same ability to distinguish between the vulnerable and non-vulnerable classes (similar ROC AUC),
with more non-vulnerable samples, mislabeled non-vulnerable samples can greatly reduce the precision
for the vulnerable class. As a result, its PRC AUC also drops greatly. On the more realistic, imbalanced
test set, while the graph classifier still achieves more than 80% recall, it achieves less than 2% precision on
average for vulnerable samples.
Overall, the graph classifier achieves fair performance on samples from the NVD database. We believe
this classifier may be improved by obtaining a larger dataset with better quality.
For the instruction sequence classifier, an oversampling rate of 10 is used for the vulnerable class (every vulnerable sequence is repeated 10 times). Results for the
experiments are shown in Table 5.4.
To show the difference between the graph classifier and the instruction sequence model, the results in
Table 5.4 and Table 5.3 are compared. While the BiLSTM classifier does not perform as well
on the training set, it has a better ROC curve on the balanced dataset. This means the BiLSTM better
distinguishes the vulnerable and non-vulnerable classes. Its ability to learn patterns in the non-vulnerable class also
results in more consistent accuracy on imbalanced test sets.
On the other hand, the BiLSTM has slightly worse PRC AUC for the vulnerable class, with slightly
worse precision and recall values in that class. Similarly, with more non-vulnerable samples added to
the test set, the BiLSTM also performs worse on the vulnerable dataset; its performance on the
vulnerable class decreases more quickly than that of the graph classifier (PRC AUC). In comparison, although
the graph classifier does not distinguish between the vulnerable and non-vulnerable classes as well, it
learns the vulnerable class slightly better.
Although the graph classifier has better performance on the training set, this does not give it much of an
advantage on the test set. This is because we do not have much data compared to the
diversity of the vulnerabilities. As the BiLSTM does not consider explicit control and data dependences, it
has fewer features and is less likely to overfit. In addition, the instruction sequence approach is more
lightweight: as the BiLSTM does not explicitly track the control and data dependences between instructions,
it takes less time to process the same number of samples. The instruction sequence representation
also requires fewer epochs to converge, compared with the graph classifier.
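As an illustration of the lighter-weight sequence approach, the following is a minimal PyTorch sketch of a BiLSTM over instruction-embedding sequences; the actual architecture and hyperparameters used in this work may differ.

    # Illustrative BiLSTM classifier over sequences of 200-dimensional
    # instruction embeddings (architecture and sizes are placeholders).
    import torch
    import torch.nn as nn

    class BiLSTMClassifier(nn.Module):
        def __init__(self, emb_dim=200, hidden=64):
            super().__init__()
            self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
            self.fc = nn.Linear(2 * hidden, 2)      # vulnerable / non-vulnerable

        def forward(self, x):                        # x: (batch, seq_len, emb_dim)
            out, _ = self.lstm(x)
            return self.fc(out[:, -1, :])            # classify from the last time step

    model = BiLSTMClassifier()
    batch = torch.randn(4, 30, 200)                  # 4 slices of 30 instructions each
    print(model(batch).shape)                        # torch.Size([4, 2])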
Performance for training on vulnerable loops is comparable to training with NVD vulnerabilities in
general. However, the classifier does not work as well on the test set. On a balanced dataset, the ROC
AUC is even a bit lower than 0.5. This means the classifier does worse than a random guess. The learned
patterns from the training set do not generalize to the test set. Similar patterns can be observed for
predictions on the imbalanced dataset.
This is possibly because code involving loops has more complex patterns. While the classifiers find
some patterns in vulnerable loops, the learned patterns are not real vulnerability patterns that generalize
to the test sets. Also, although a slice of code may involve loops, it is uncertain whether the vulnerability is
loop-carried or whether it only occurs in some parts of the code inside the loops. Additionally, while the classifier is
able to cover these patterns in the training set, it is not quite clear what the classifier actually learned.
This model has high performance on the SARD Juliet dataset. Precision and recall for the vulnerable
class are both very high. Its high ROC AUC score also indicates that the model distinguishes between
the vulnerable and non-vulnerable classes quite well. There might be multiple reasons that contribute to the
good performance on the synthetic dataset.
One reason that may contribute to the good performance is that the vulnerable patterns in the
synthetic dataset are simpler and more obvious. As the samples in SARD Juliet are mostly simple example
code with no real functionality, the structure of the code and the relationships between instructions are easy
to understand. Although real-world vulnerabilities are sliced to reduce noise, samples in SARD Juliet
are still smaller and less complex. In fact, in the SARD dataset, vulnerabilities are deliberately written to
represent typical vulnerabilities. In contrast, classifiers working on real-world vulnerabilities need to find
vulnerability patterns among complicated benign relationships between different instructions. Not only
for machine learning classifiers but also for human reviewers, spotting vulnerabilities in example code is
much easier than finding potentially vulnerable patterns in large, complex real-world functions.
The vulnerability examples in SARD Juliet are also written following the criteria of a specific vulnerability
category. These samples are constructed to represent cases of typical vulnerabilities.
In the real world, however, code may use different resources, could be considered as belonging to
several types of vulnerabilities, and may be harder to analyze. While the synthetic data is less diverse,
there is also much more vulnerable data in the synthetic dataset. With more data and less
varied features, the synthetic SARD Juliet dataset is both easier to learn and less prone to overfitting.
As a result, while this experiment shows that the graph classifier is able to learn the patterns in the
synthetic dataset, the vulnerability patterns learned may not generalize to other programs, so good
performance on the synthetic dataset by itself may not be useful. Further evaluation is thus performed to evaluate
how models trained on the synthetic dataset work on vulnerabilities in the real world.
Evaluating Synthetic Patterns on Real World Datasets: To examine how the vulnerability
patterns learned in the synthetic dataset may apply to real-world applications, we evaluate
classifiers trained on the SARD Juliet dataset against vulnerabilities in the NVD database. As previously
mentioned, because the instruction embeddings for the classifier are trained on both the Juliet dataset and
the open-source projects we use, the model works on both the SARD Juliet dataset and the open-source
projects containing real vulnerabilities in the CVE/NVD database. The performance is recorded in
Table 5.7 and Table 5.8.
The classifier trained on the synthetic dataset is first evaluated on the samples obtained from the
NVD database in Section 5.3.2. The vulnerable samples are vulnerabilities obtained from the NVD
database, while non-vulnerable samples are randomly sliced pieces of code in which no vulnerability has
been reported. Some samples that do not fit into memory are omitted.
The classifier trained on the synthetic dataset can predict some vulnerable patterns in the NVD
database. Considering the highly imbalanced nature of the test set, the classifier obtains fair PRC AUC
for the vulnerable class. It can find a bit more than half of all the vulnerable samples. However, its
ROC AUC is worse than that of a random classifier. This is because the classifier performs badly on the non-
vulnerable class and wrongly predicts many non-vulnerable samples as vulnerable: the precision for the
vulnerable class (0.00367) is much worse than the precision of random prediction (1/173.6 = 0.00576).
We further evaluate the classifier on NVD vulnerabilities with loops, using the dataset obtained in
Section 5.3.4: both vulnerable and non-vulnerable samples involve loops. However, the trained classifier
does not distinguish loop patterns very well. Although the classifier correctly finds almost all vulnerable
loops, it cannot distinguish vulnerable loops from non-vulnerable ones, and simply considers almost all
loops vulnerable. This is possibly because there are not many complex patterns such as loops in the
synthetic dataset, and the classifier associates complexity in code with a higher probability of being
vulnerable.
In addition, in Table 5.8, the classifier trained with synthetic data is evaluated on vulnerable code
slices in different categories of the NVD database. These categories are chosen because they are related to the
vulnerability types trained on in the synthetic dataset. Some categories have overlapping samples. The classifier
is able to identify more than half of the vulnerable slices in all categories. Execute-code and memory-corruption
vulnerabilities have the highest detection rates. This may result from the fact that patterns for these two
types of vulnerabilities are simpler and thus easier to identify.
Overall, while the model trained with synthetic data may be able to identify some simple vulnerable
patterns, it does not work very well on vulnerabilities with complex code structures. In particular, it
wrongly classifies many non-vulnerable samples as vulnerable. While we can infer some patterns based
on the results obtained from the experiments, it is also not very clear what the classifier actually learned.
Chapter 6
Directed Fuzzing on Predicted Vulnerabilities
Fuzzing is a dynamic testing technique commonly used to find defects in programs. During fuzzing,
a fuzzer automatically generates inputs for the target program and observes program states such as
crashes and hangs while it executes. It is only able to test the parts of the code that are executed during
fuzzing. Fuzzing aims to find test cases in the input space, called "seeds", that trigger corner cases
in the program and thus find unexpected program states.
While fuzzing directly triggers vulnerabilities and has fewer false positives, vulnerability analysis
using fuzzers has different challenges compared to analysis with human inspection/static code analysis.
A human can simply analyze a piece of suspicious code by inspecting the code itself and its surrounding
context. However, a fuzzer needs to execute a piece of code in order to analyze it. This means a fuzzer
needs to pass certain checks in the program that guard code to be analyzed. There also exists code that
cannot be easily reached, because checks guarding those pieces of code require the fuzzer to generate
inputs that satisfy complex constraints.
To find more vulnerabilities, fuzzers can use two strategies: 1) cover more code, and 2) more
thoroughly fuzz the parts of the code that are more likely to be vulnerable. Two different types of fuzzers are
designed based on these strategies. Coverage-guided fuzzers attempt to generate inputs that
cover more code in the program, and thus instrument the target program to collect coverage information.
Directed fuzzers attempt to guide execution towards a few user-defined target sites inside the program:
if users know which parts of the code are more likely to be buggy, directing execution towards
those parts may save time while finding more defects in programs.
In this chapter, we guide directed fuzzers towards vulnerable functions predicted by the machine learning
classifiers in Chapter 4. The intuition, similar to the directed fuzzing strategy, is to more thoroughly
test the more vulnerable parts of the program using knowledge from our machine learning models. In
addition, directed fuzzing is used to confirm predicted vulnerabilities: while there are some false positives
in the prediction results from machine learning models, fuzzing can eliminate those inaccurately classified
samples by directly triggering real vulnerabilities. We attempt to save fuzzing time by guiding fuzzers
towards predicted vulnerable functions, while increasing the precision of the machine learning predictions. To
demonstrate how effective the technique is, we measure how much better a directed fuzzer performs by
thoroughly testing the predicted vulnerable functions, compared to a normal coverage-guided fuzzer.
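As a rough sketch of how the prediction results could feed a directed fuzzer: AFLGo's build scripts take their targets as file:line entries in a text file (commonly named BBtargets.txt); the entries and the mapping from predicted functions to source locations below are hypothetical.

    # Sketch: write predicted vulnerable functions to a directed-fuzzing target
    # list in file:line format (all paths and names are hypothetical).
    predicted = [
        ("epan/dissectors/packet-foo.c", 120, "dissect_foo"),
        ("wiretap/bar.c", 88, "read_bar_header"),
    ]

    with open("BBtargets.txt", "w") as f:
        for path, line, _func in predicted:
            f.write(f"{path}:{line}\n")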
AFL and AFLGo monitor program crashes and hangs. A crash is an execution in which the program
terminates unexpectedly; for example, null pointer dereferences and unhandled exceptions can result in
crashes. A hang is an execution in which the program does not terminate. Causes of hangs include
infinite loops and deadlocks. Although it is impractical to wait for a program forever, a hang can be
detected by observing program inactivity until a timeout. We also measure the number of unique hangs
and paths. Unique paths are paths that explore different basic blocks, while unique hangs are hangs
triggered by executing unique paths.
Although no crash is found in our experiment, hangs are found across different runs. The unique
hangs and paths in Table 6.1 are averaged across the 8 fuzzing processes. Similarly, the percentage coverage
of reached functions in Table 6.2 is also averaged across all 8 fuzzing processes. However, the
proportion of target functions reached is aggregated over all runs: if any fuzzing process executes a target
function, we include that function in the final result. This is done because, for directed fuzzing, different
fuzzing processes target different functions.
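A minimal sketch of this aggregation, with hypothetical per-run results:

    # Sketch: hangs/paths are averaged across fuzzing processes, while reached
    # target functions are combined across runs.
    runs = [
        {"unique_hangs": 2, "unique_paths": 310, "targets_reached": {"f1", "f3"}},
        {"unique_hangs": 1, "unique_paths": 290, "targets_reached": {"f1"}},
    ]

    avg_hangs = sum(r["unique_hangs"] for r in runs) / len(runs)
    avg_paths = sum(r["unique_paths"] for r in runs) / len(runs)
    reached = set().union(*(r["targets_reached"] for r in runs))
    print(avg_hangs, avg_paths, sorted(reached))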
During our experiments, AFLGo has a similar execution speed as vanilla AFL. Compared to vanilla
AFL, directed fuzzing with AFLGo has a smaller number of unique execution paths: as AFLGo puts
more importance on test cases that execute paths close to target functions, the paths it explores are less
diverse. Also, although no crash is found during fuzzing, some hangs (timeouts) are spotted. Vanilla
AFL finds the same number of hangs as AFLGo.
AFLGo does not hit more target functions than vanilla AFL: both reach 9 of the 24 target
functions spread across 8 runs. Also, compared to vanilla AFL, AFLGo does not have better coverage
inside the target functions either. Although directed AFLGo spends more time fuzzing the target
functions, this does not result in higher coverage. From this, it seems that directed AFLGo does not work
better than vanilla AFL.
While AFLGo puts a higher priority on inputs that drive execution towards the target functions, the input
generation algorithms of AFL/AFLGo can only randomly mutate inputs. Certain
checks in the code require the input to be of a certain format. Such inputs can be hard for
AFL/AFLGo to generate, as random mutation is not capable of solving for inputs that satisfy strict
constraints. Code regions guarded by those checks may be hard to reach. Some generated inputs may
not even pass the validity checks at the beginning of a function; these inputs are
discarded without deeper execution. As a result, some target functions are not reached and some
code regions in the target functions remain unexplored.
To cover functions that AFLGo alone is not able to reach, we also tried using static analysis
techniques to generate input seeds that hit those functions. However, this technique also does not
increase the number of crashes or cover more code regions inside the target functions. Generally, there
are a few reasons why the directed fuzzing technique does not work very well with our prediction results
from machine learning:
1) While directed fuzzing may reach some target vulnerable functions, the fuzzer still may not
obtain good coverage inside these functions: some functions contain many sanitizing checks that
stop invalid inputs from progressing further. This problem is related to a drawback of the
coarse-grained machine learning model mentioned in Section 4.4: as the predicted vulnerable functions are
larger and more complex, there are more sanitizing checks a fuzzer needs to bypass and thus more regions
that are hard to reach. A machine learning classifier may be further improved by taking function
complexity into account when ranking vulnerable functions.
2) On the other hand, even if we are able to pass those sanitizing checks, the target function may
still be too complex to explore. Although a function is predicted as vulnerable, we do not
know which parts of the function are actually related to the vulnerability. As a result, we are not sure
whether the parts of the function reachable by the fuzzer are worth exploring. As future work, more
fine-grained machine learning analysis with higher accuracy may be used to assist fuzzing.
3) Although directed fuzzing attempts to execute the target functions, some deep or infrequently used
functions are still hard to reach. While static analysis attempts to find inputs that reach those functions,
the analysis can be expensive depending on the complexity of the program. Additionally, the static
analysis may not be very precise: sometimes static-analysis-assisted directed fuzzing still cannot
reach the target functions.
Chapter 7
Related Work
As software vulnerabilities raise wide concerns, there has been work trying to find software vulnerability
patterns, in order to reduce the amount of effort required from human reviewers. In this chapter, we present
other works that attempt to automatically detect or analyze software vulnerabilities. Both code pattern
analysis and machine learning techniques have been proposed by previous work. There has also been
work using machine learning to find vulnerable test cases to assist fuzzers.
The word2vec bag-of-words model [12] is used to train embeddings for every token. Both CNN
and RNN models are tried in order to extract features from the obtained token embeddings. A dense
neural network layer, taking the token embeddings as inputs, makes classification decisions. This work
aims at detecting memory corruption vulnerabilities such as buffer overflow, improper pointer usage and
improper restriction of boundary conditions. Their tool is able to achieve good recall, but it is still
challenging to obtain good precision with this technique, as there are far fewer vulnerable samples
than non-vulnerable samples in their dataset. By evaluating samples in different vulnerability
categories, it shows that this technique works better on some types of vulnerabilities than others.
To avoid project-specific syntax and coding styles, our project works on the LLVM IR [7] instead of the AST. While our approach also works on the
synthetic SARD Juliet dataset [11], we aim to learn patterns in more realistic scenarios, training only
on real-world vulnerabilities from the CVE database [8].
Chapter 8
Conclusion and Future Work
In this thesis, we use machine learning algorithms to learn vulnerability patterns in code, and thus predict
vulnerable code for further analysis. We experiment with two approaches: coarse-grained statistical
features and raw feature representations for sliced code. Coarse-grained statistical features trade off the
expressiveness of the model for a smaller amount of required data, while the raw feature representation
overfits on the training set due to the increased complexity/expressiveness of the model. However, the
quality and size of the vulnerability dataset still limit the performance of both techniques we propose.
For the function-level statistical features, while some vulnerability patterns can be learned, the
size/quality of the vulnerability dataset and the coarse-grained nature of the obtained features limit
the performance of the model. Additionally, the models are more likely to predict large and complex
functions as vulnerable. While these predictions may be correct, this behavior causes problems in
practice: it generally takes more time to analyze the predicted vulnerable functions than randomly selected
functions, and thus automatic tools such as fuzzers may not combine very well with the
prediction results from the models. More precision within large functions is needed for this strategy to
be useful.
To learn more fine-grained patterns in programs, we use graph-based classifiers with program slicing.
While experiments show that information such as control/data dependences between instructions is
useful, the performance on real-world projects is generally only fair for graph classifiers trained using raw
feature representations. This is due to the lack of a high-quality vulnerability dataset: the amount of
useful training data cannot cover the diversity of vulnerability patterns and the complexity of the
machine learning model. When training on loop samples, with less data and more complex patterns, the
classifier is likely to overfit. On the other hand, when training on the synthetic dataset, with more data
and simpler code structures, the graph classifier obtains good performance. However, synthetic
samples are usually simple and cannot capture the complex patterns in real-world vulnerabilities.
1) One possible direction is to learn better representations for sliced pieces of code. The vulnerability samples would hopefully have representations different from normal,
non-vulnerable samples. Similar types of vulnerabilities may also share similarities in their code rep-
resentations. As a result, anomaly detection strategies that detect outliers in program representations
may be used to find vulnerabilities in programs.
2) We may also better integrate prediction results from machine learning with fuzzing techniques. To
inspect prediction results with directed fuzzing, we may use tools such as afl-unicorn [34] to fuzz sliced
pieces of code or functions separately, without starting at the entry point of the program. Although
this may take more manual work to both analyze the program input and filter out false positives, this
technique could improve coverage of the target function.
3) To achieve better fuzzing coverage within the targeted functions, we can better combine directed
fuzzing with static analysis: future work may iteratively fuzz the program and generate inputs
for parts of the code that are hard to reach. Through fuzzing, we can find uncovered basic blocks that
are hard to reach with the test cases used. New inputs may then be generated with further static analysis
and constraint solving in order to lead execution paths towards the uncovered basic blocks.
Bibliography
[1] J. Berr, ““WannaCry” ransomware attack losses could reach $4 billion.” https://www.cbsnews.
com/news/wannacry-ransomware-attacks-wannacry-virus-losses, 2017.
[2] D. Goodin, “Apple pushes fix for facepalm, possibly its creepiest vulner-
ability ever.” https://arstechnica.com/information-technology/2019/02/
apple-pushes-fix-for-facepalm-possibly-its-creepiest-vulnerability-ever, 2019.
[4] D. Marjamäki, “Cppcheck: a tool for static c/c++ code analysis,” 2013.
[6] J. C. King, “Symbolic execution and program testing,” Communications of the ACM, vol. 19, no. 7,
pp. 385–394, 1976.
[11] P. E. Black, Juliet 1.3 Test Suite: Changes From 1.2. US Department of Commerce, National
Institute of Standards and Technology, 2018.
[12] T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient estimation of word representations in
vector space,” arXiv preprint arXiv:1301.3781, 2013.
[13] M. Weiser, “Program slicing,” IEEE Transactions on software engineering, no. 4, pp. 352–357, 1984.
[14] “LLVM static slicer: Dependence graph for programs.” https://github.com/mchalupa/dg, 2019.
[15] T. Ben-Nun, A. S. Jakobovits, and T. Hoefler, “Neural code comprehension: a learnable represen-
tation of code semantics,” in Advances in Neural Information Processing Systems, pp. 3585–3597,
2018.
[16] H. Dai, B. Dai, and L. Song, “Discriminative embeddings of latent variable models for structured
data,” in International conference on machine learning, pp. 2702–2711, 2016.
[17] Z. Li, D. Zou, S. Xu, X. Ou, H. Jin, S. Wang, Z. Deng, and Y. Zhong, “Vuldeepecker: A deep
learning-based system for vulnerability detection,” in Proceedings of the 25th Annual Network and
Distributed System Security Symposium, San Diego, California, USA, pp. 1–15, 2018.
[18] G. Lin, J. Zhang, W. Luo, L. Pan, Y. Xiang, O. De Vel, and P. Montague, “Cross-project transfer
representation learning for vulnerable function discovery,” IEEE Transactions on Industrial Infor-
matics, vol. 14, no. 7, pp. 3289–3297, 2018.
[20] M. Böhme, V.-T. Pham, M.-D. Nguyen, and A. Roychoudhury, “Directed greybox fuzzing,” in
Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security,
pp. 2329–2344, ACM, 2017.
[22] S. Kim, S. Woo, H. Lee, and H. Oh, “Vuddy: A scalable approach for vulnerable code clone
discovery,” in 2017 IEEE Symposium on Security and Privacy (SP), pp. 595–614, IEEE, 2017.
[23] Z. Li, D. Zou, S. Xu, H. Jin, H. Qi, and J. Hu, “Vulpecker: an automated vulnerability detec-
tion system based on code similarity analysis,” in Proceedings of the 32nd Annual Conference on
Computer Security Applications, pp. 201–213, ACM, 2016.
[24] F. Yamaguchi, M. Lottmann, and K. Rieck, “Generalized vulnerability extrapolation using ab-
stract syntax trees,” in Proceedings of the 28th Annual Computer Security Applications Conference,
pp. 359–368, ACM, 2012.
[25] F. Yamaguchi, N. Golde, D. Arp, and K. Rieck, “Modeling and discovering vulnerabilities with code
property graphs,” in 2014 IEEE Symposium on Security and Privacy, pp. 590–604, IEEE, 2014.
[26] X. Du, B. Chen, Y. Li, J. Guo, Y. Zhou, Y. Liu, and Y. Jiang, “Leopard: Identifying vulnerable
code for vulnerability assessment through program metrics,” in Proceedings of the 41st International
Conference on Software Engineering, pp. 60–71, IEEE Press, 2019.
[27] S. Wang, T. Liu, and L. Tan, “Automatically learning semantic features for defect prediction,”
in 2016 IEEE/ACM 38th International Conference on Software Engineering (ICSE), pp. 297–308,
IEEE, 2016.
[28] H. K. Dam, T. Tran, T. Pham, S. W. Ng, J. Grundy, and A. Ghose, “Automatic feature learning
for vulnerability prediction,” arXiv preprint arXiv:1708.02368, 2017.
[29] R. Russell, L. Kim, L. Hamilton, T. Lazovich, J. Harer, O. Ozdemir, P. Ellingwood, and M. Mc-
Conley, “Automated vulnerability detection in source code using deep representation learning,”
in 2018 17th IEEE International Conference on Machine Learning and Applications (ICMLA),
pp. 757–762, IEEE, 2018.
[31] Z. Li, D. Zou, S. Xu, H. Jin, Y. Zhu, Z. Chen, S. Wang, and J. Wang, “SySeVR: a framework for
using deep learning to detect software vulnerabilities,” arXiv preprint arXiv:1807.06756, 2018.
[32] Z. Li, D. Zou, J. Tang, Z. Zhang, M. Sun, and H. Jin, “A comparative study of deep learning-based
vulnerability detection system,” IEEE Access, vol. 7, pp. 103184–103197, 2019.
[33] G. Grieco, G. L. Grinblat, L. Uzal, S. Rawat, J. Feist, and L. Mounier, “Toward large-scale vul-
nerability discovery using machine learning,” in Proceedings of the Sixth ACM Conference on Data
and Application Security and Privacy, pp. 85–96, ACM, 2016.