
Shirley Yang Masc Thesis


Using Machine Learning to Detect Software Vulnerabilities

by

Mingyue Yang

A thesis submitted in conformity with the requirements


for the degree of Master of Applied Science
Graduate Department of Computer Engineering
University of Toronto

© Copyright 2020 by Mingyue Yang



Abstract
Using Machine Learning to Detect Software Vulnerabilities

Mingyue Yang
Master of Applied Science
Graduate Department of Computer Engineering
University of Toronto
2020

Due to the size of modern software projects, protecting every part of a project through manual
analysis does not scale, so automatically finding software vulnerabilities is an important problem.
However, existing code analysis tools are not effective enough at finding vulnerabilities. Lightweight
static analysis tools are imprecise, while expensive symbolic execution strategies do not scale. Dynamic
analysis techniques such as fuzzing can only verify the executed paths of a program, and it is
impractical to explore all paths in a program using dynamic analysis.

As a result, we propose using machine learning to detect software vulnerabilities: this approach is
more precise than lightweight static analysis, but less expensive than symbolic execution. Prediction
results from machine learning can be used to guide fuzzing. We evaluate two machine learning strategies:
a coarse-grained statistical model and a fine-grained raw feature representation. The statistical model
requires less data but does not capture all relationships in code. The raw feature representation learns
subtle relationships in programs and does not require manual feature engineering, but needs more
training samples to work well.



Acknowledgements

I would like to express thanks to my supervisor, Professor David Lie, for his advice, guidance, and
financial support that helps me throughout my research. I appreciate his kindness, patience, and the
amount of time he spent. I am also grateful to Ivan, James, and Vasily for their insightful feedback on
this project. I would like to thank my labmates who are always willing to help. I am thankful for the
financial support provided by the University of Toronto and the Ontario Graduate Scholarship. Lastly,
I would like to thank my family for their support throughout my studies.
Contents

1 Introduction
1.1 Thesis Structure

2 Background
2.1 LLVM Language Representations
2.1.1 Abstract Syntax Tree (AST)
2.1.2 LLVM IR
2.1.3 Basic Block & Control Flow Graph
2.1.4 Program Dependences
2.2 Vulnerability Databases
2.2.1 Version Control
2.2.2 CVE/NVD Database
2.2.3 SARD Juliet Test Suite
2.3 Machine Learning
2.3.1 Machine Learning Models
2.3.2 Evaluation Criteria

3 Preliminaries
3.1 Definition for Vulnerability
3.2 Dataset Preparation
3.3 Challenges

4 Coarse-grained Statistics as Function-Level Features
4.1 Statistics for Features
4.1.1 List of Features
4.1.2 Unscaled Features
4.2 Performance Evaluation
4.2.1 Experimental Setup and Evaluation Criteria
4.2.2 Overall Performance
4.2.3 Cross-Project Evaluation
4.3 Importance of Features
4.4 Complexity of Predicted Functions
4.5 Summary

5 Raw Feature Representation
5.1 Program Slicing
5.1.1 Slicing Technique
5.1.2 Slicing Point
5.2 Instruction Representation
5.2.1 Instruction Preprocessing
5.2.2 Training Instruction Embeddings
5.2.3 Evaluating Instruction Representations
5.3 Graph Representation and Classification
5.3.1 Graph Representation and Model
5.3.2 Overall Evaluation
5.3.3 Comparison with Technique using Instruction Sequence
5.3.4 Evaluation on Specific Vulnerabilities
5.3.5 Evaluation on Synthetic Dataset

6 Directed Fuzzing on Predicted Vulnerabilities
6.1 AFL & AFLGo
6.2 Evaluation on Directed Fuzzing

7 Related Work
7.1 Code Pattern Discovery
7.1.1 Vulnerable Clone Detection
7.1.2 Matching Vulnerable AST Patterns
7.1.3 Program Metrics for Vulnerabilities
7.2 Machine Learning for Vulnerable Code Patterns
7.2.1 Automatically Learning Vulnerable Patterns
7.2.2 Program Slicing with Machine Learning For Vulnerability Detection
7.3 Vulnerability Discovery with Machine Learning to Guide Fuzzing

8 Conclusion and Future Work
8.1 Future Work

Bibliography
List of Tables

4.1 List of Function-Level Features
4.2 Number of Vulnerable Functions in Different Projects
4.3 List of Machine Learning Models Evaluated
4.4 10-Fold Cross-Validation On All Projects with Different Classifiers (Balanced Dataset)
4.5 F1 Score for 10-Fold Cross-Validation Within Projects
4.6 Performance of Random Forest Over Different Subsampling Rate
4.7 Ordered Feature Coefficient for Logistic Regression; Subsampling: 1x
4.8 Ordered Feature Coefficient for Logistic Regression; Subsampling: 1.5x
4.9 Ordered Mean Decrease Impurity for Random Forest; Subsampling: 1x
4.10 Ordered Mean Decrease Impurity for Random Forest; Subsampling: 1.5x
4.11 Average Number of Instructions for Cross Project Evaluation (Logistic Regression)

5.1 Examples of Preprocessed Instructions
5.2 Evaluation for Instruction Embeddings
5.3 5-Fold Cross-Validation on NVD Dataset and Open-Source Projects
5.4 Performance of LSTM on NVD Database
5.5 Performance on NVD Loops
5.6 Performance on SARD Juliet Dataset
5.7 Evaluation of Synthetic Training on Real Vulnerabilities in NVD Database
5.8 Evaluation of Synthetic Training on Different Types of Real Vulnerabilities

6.1 General Fuzzing Performance for FFmpeg
6.2 Fuzzing Results on Target Functions for FFmpeg
List of Figures

4.1 Top-K Percent Precision for Cross-Project Evaluation (Random Forest)
4.2 Top-K Percent Recall for Cross-Project Evaluation (Random Forest)
4.3 Number of Instructions in Top-K Vulnerable Predictions for Cross-Project Evaluation (Random Forest)

5.1 Model Structure
List of Algorithms

1 Graph Representation with Embedding Loopy Belief Propagation
Listings

2.1 Example LLVM IR
5.1 Simple Code Example for Program Slicing
Chapter 1

Introduction

Although software vulnerabilities raise many concerns these days, many security compromises still
result from them. For example, the WannaCry ransomware exploited a vulnerability in Microsoft
Windows, causing billions of dollars in losses [1]. Apple’s FacePalm vulnerability allows attackers
to eavesdrop on victims by self-answering FaceTime calls [2]. While there are many automated code
analysis approaches to detect software vulnerabilities, their effectiveness is limited. For these
approaches, the tradeoffs between the time/computing power invested and the accuracy of analysis cannot
satisfy the need for vulnerability discovery in modern-day projects.
To detect vulnerabilities, static analysis tools such as FlawFinder [3] and Cppcheck [4] find
predefined vulnerability patterns in code. While these tools are lightweight, the predictions they make
are inaccurate: not all vulnerabilities match predefined rules exactly. More precise analysis tools such as
CodeSonar [5] add symbolic execution [6] to static analysis. Symbolic execution is a technique that
explores execution paths in programs and determines what an input needs to be for a part of a program
to execute. However, the cost of analysis greatly increases, as it is expensive to track relationships
between all variables that can affect targeted paths in a program. Also, as it is impractical to precisely
analyze all values for every variable, some paths found by symbolic execution are infeasible.
In contrast, dynamic code analysis techniques such as fuzzing find vulnerabilities by triggering them.
Although the triggered vulnerabilities are real, fuzzing can only check executed parts of the code. To
fully analyze a program, fuzzing has to exhaustively execute every single path in the program. This
makes the analysis intractable.
As a result, for commonly used open-source projects and commercial products, vulnerability discovery
still largely depends on manual auditing, in which human developers inspect the source code to check
for vulnerabilities. However, modern software systems usually contain millions of lines of code. This
makes it impractical to protect all parts of a software project through manual code audits. To solve this
problem, we explore the hypothesis that there are common patterns in source code that are indicative
of the presence of security vulnerabilities.
These patterns, although detectable, are not explicitly stated. Thus, in this thesis, we propose
using machine learning to identify these vulnerable code patterns. Machine learning is more precise than
simple pattern matching techniques, as it aims to find vulnerability patterns in code that are not
obviously stated. Machine learning is also more lightweight than approaches using symbolic execution.
Unlike symbolic execution, machine learning simply collects features to represent the code. It does not
have to deeply analyze all relationships between instructions/variables and exhaustively solve for feasible
inputs for every single path in the program.
Predictions made by machine learning classifiers can help human reviewers limit their scope for
vulnerability search. Machine learning can also be used to help dynamic analysis techniques such as
fuzzing. Knowing which part of the code is more likely to be vulnerable, a fuzzer does not have to
explore all paths in the program, and can more thoroughly test more vulnerable parts of the code. Also,
fuzzing can be used to automatically verify vulnerabilities found by machine learning, eliminating false
positives in predictions.
Other than the described benefits, there are also challenges in using machine learning to
detect software vulnerabilities. Vulnerabilities in the real world are diverse and have implicit patterns.
This means an expressive, and thus complex, machine learning model is required to learn the vulnerable
patterns. However, in real-world projects, there are not many samples for each type of vulnerability. A
complex model is likely to overfit its training data, and the patterns it learns may not generalize.
These challenges place conflicting requirements on a machine learning classifier.
As a result, we experiment with two different machine learning techniques: a coarse-grained statistical
model and a fine-grained graph classifier. Each approach addresses one side of the conflicting
requirements. For coarse-grained statistical features, we collect statistics relevant to vulnerabilities as
features, and train classifiers to find vulnerability patterns in those statistical features. Because the
features and models are simple, this approach requires less data to perform well. For fine-grained
features, we build a graph classifier to learn raw features from programs. Information such as the types
of instructions and the relationships between the instructions is used to train the classifier. Although
this model needs more data to work well, its complexity allows it to find subtle patterns in code related
to vulnerabilities.

1.1 Thesis Structure


This thesis is structured as follows. We give background information on the tools and concepts we use in
Chapter 2. In Chapter 3, we discuss dataset preparation and the challenges to consider before training.
We present two strategies for vulnerability prediction: statistic-based function-level features and raw
features with graph representation. In Chapter 4, we describe the coarse-grained statistic-based model,
its evaluation results and its limitations. In Chapter 5, we discuss and evaluate the raw feature
representation. We then discuss how to use prediction results to fuzz programs in Chapter 6. We
present related work in Chapter 7. Finally, we conclude the thesis and present future
work in Chapter 8.
Chapter 2

Background

In this chapter, we present the concepts and tools used later in this thesis. Section 2.1 explains concepts
about the LLVM language such as instructions, basic blocks, and program dependences in the LLVM IR.
These concepts are used during feature extraction for the two machine learning approaches we propose.
Section 2.2 describes the databases we collect vulnerable samples from: they consist of both real-world
vulnerabilities and synthetic vulnerability test cases. Lastly, Section 2.3 introduces the machine learning
models we use and the evaluation criteria we later use to measure their performance.

2.1 LLVM Language Representations


2.1.1 Abstract Syntax Tree (AST)
An AST is a tree-like data structure that represents a program. Nodes in an AST denote elements in code
such as expressions, statements, groups of statements and functions. ASTs are organized hierarchically
to preserve semantic meanings in source code. For example, statements inside a loop and the loop
condition are all kept as child nodes of their parent loop statement. For expressions, the operators are
usually represented as parent nodes, and operands are represented as child nodes. Tokens such as if/else
keywords, variable names/literal values and binary operators are kept in the AST.
In this work, we do not keep syntactic information from the AST, because it contains irrelevant
information such as coding style and variable names. This information is not related to the actual
meaning of the program and exhibits some project-specific patterns. The parts of the AST that convey
semantic meaning can all be expressed using LLVM IR [7], which is discussed in Section
2.1.2. Thus, in order to construct generalizable models, we do not work on the AST.

2.1.2 LLVM IR
The LLVM IR [7] is an intermediate representation (IR) used by the LLVM compiler project. The
IR is a representation used by the compiler after processing the source code, but before generating
platform-dependent machine code: the LLVM compiler first parses the source code into ASTs, and then
outputs LLVM IR from the AST representation. The LLVM IR is usually used for code optimization.
Unlike the AST, while LLVM IR preserves the semantic meaning of the program, it does not contain as
much syntactic information, such as tokens from the source code and code structures like if/else
statements. In addition, the platform-independent nature of the LLVM IR makes it
generalizable to all computer architectures. As a result, we work at the LLVM IR level to ensure our
predictions are not project-specific.
The LLVM IR consists of a sequence of instructions in Static Single Assignment (SSA) form: every value
is assigned at most once, which simplifies code optimization. As IRs are platform-independent, the concept
of a register is not introduced. Rather, operands for instructions may be results from other instructions,
forming def-use relationships. For example, in Listing 2.1, by using instruction %3 as an operand,
instruction %4 adds the loaded result from %3 with the literal value 2. The add instruction %4 is also
used by a store instruction as an operand.
While no values can be repeatedly assigned, different store instructions can store values into the same
variable. In Listing 2.1, for instance, the first two store instructions can both store values (3 and %4)
into the same variable allocated by %1. The stored values can also be loaded back by load instructions.

Listing 2.1: Example LLVM IR


1 %1 = alloca i32, align 4
2 %2 = alloca i32, align 4
3 store i32 3, i32* %1, align 4, !dbg !15
4 %3 = load i32, i32* %1, align 4, !dbg !16
5 %4 = add nsw i32 %3, 2, !dbg !17
6 store i32 %4, i32* %1, align 4, !dbg !18
7 %5 = load i32, i32* %1, align 4, !dbg !19
8 %6 = sub nsw i32 %5, 3, !dbg !20
9 store i32 %6, i32* %2, align 4, !dbg !21
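
The def-use relationships described for Listing 2.1 can be recovered mechanically. The following Python sketch is only an illustration: it matches operands with a regular expression instead of using the LLVM API, it omits the debug metadata from the listing, and the helper name def_use_edges is hypothetical.

```python
import re

# Listing 2.1 without line numbers or debug metadata.
# List index + 1 corresponds to the line number in the listing.
INSTRUCTIONS = [
    "%1 = alloca i32, align 4",
    "%2 = alloca i32, align 4",
    "store i32 3, i32* %1, align 4",
    "%3 = load i32, i32* %1, align 4",
    "%4 = add nsw i32 %3, 2",
    "store i32 %4, i32* %1, align 4",
    "%5 = load i32, i32* %1, align 4",
    "%6 = sub nsw i32 %5, 3",
    "store i32 %6, i32* %2, align 4",
]

def def_use_edges(instructions):
    """Return (used value, line number of the using instruction) pairs."""
    edges = []
    for number, inst in enumerate(instructions, start=1):
        # Drop the defined value, if any ("%4 = add ..." defines %4),
        # so that only operands remain.
        _, _, operands = inst.rpartition(" = ")
        for value in re.findall(r"%\d+", operands):
            edges.append((value, number))
    return edges

edges = def_use_edges(INSTRUCTIONS)
for value, number in edges:
    print(f"line {number} uses {value}")
```

In this sketch, the add on line 5 uses %3 and the store on line 6 uses %4, matching the relationships described above.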

2.1.3 Basic Block & Control Flow Graph


To manage control flow, the LLVM IR divides code into basic blocks. Each basic block
has a single entry point and a single exit point. Execution within a basic block is sequential: there are
no jumps or branches within a basic block. Rather, branch, jump and switch instructions direct execution
from a predecessor basic block to its successor basic block. Basic blocks connected by those jumps form a
control flow graph. To simplify the analysis, a control flow graph may be constructed for each function.
Every function contains an entry basic block, representing the starting point of execution within the
function.
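
As a minimal sketch, a control flow graph can be stored as a map from each basic block to its successors. The block labels below describe a hypothetical function containing a single if/else and are not taken from any project in this thesis; the helper name predecessors is also an assumption.

```python
# Hypothetical control flow graph for a function with one if/else.
# Each key is a basic block label; each value lists its successor blocks.
cfg = {
    "entry": ["then", "else"],  # ends with a conditional branch
    "then": ["exit"],
    "else": ["exit"],
    "exit": [],                 # ends with a return, so no successors
}

def predecessors(graph):
    """Invert the successor map to obtain the predecessors of each block."""
    preds = {block: [] for block in graph}
    for block, successors in graph.items():
        for successor in successors:
            preds[successor].append(block)
    return preds

preds = predecessors(cfg)
print(preds["exit"])   # blocks that can jump to "exit"
print(preds["entry"])  # the entry block has no predecessors
```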

2.1.4 Program Dependences


In a program dependence graph, every node represents an operation (LLVM instruction in our case),
and each edge represents a specific type of program dependence between operations (instructions). If
an instruction B is dependent on instruction A, a forward dependence edge goes from instruction A
to instruction B. Conversely, a backward dependence edge goes from instruction B to instruction A.
Below we describe the 5 types of program dependences used in Section 5.1, Section 5.2.2 and Section
5.3.1.
Control Dependence: If instruction B has a control dependence on instruction A, it means
instruction A is one of the nearest branches that guards a path leading to B.
Data Dependence: Data dependences cover relationships between loads and stores on the same
variable. If a load instruction may load a value written by a store instruction, the load has a data
dependence on that store instruction. For example, in Listing 2.1, the load instruction on line 7 has a
data dependence on the store instruction on line 6, but not on the store instruction on line 3. For
arrays/structs, offsets with literal indices are treated as separate variables. However, array accesses
whose offsets are variables are all treated as accesses to the same variable, as the value of the offset
cannot be easily determined without precise/expensive analysis.
Def-Use Relationship: As shown in Section 2.1.2, some instructions are used by other instructions
as operands. This results in def-use relationships between instructions.
Control Flow Dependence Within Basic Block: For instructions in the same basic block, as
program execution is sequential inside a basic block, instructions have control flow dependences on their
immediate predecessor instructions.
Control Flow Dependence Across Basic Blocks: For basic blocks in the control flow graph,
a successor basic block has control flow dependence on its predecessor block. In terms of instructions,
we consider the first instruction of the successor block to be dependent on the last instruction of the
predecessor block.
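
The data dependence example from Listing 2.1 can be checked with a short Python sketch. It rests on a simplifying assumption: the code is a single basic block with no aliasing, so each load depends only on the most recent store to the same variable. In general a load may depend on several stores. The helper name data_dependences is hypothetical.

```python
import re

# Listing 2.1 without line numbers or debug metadata.
# List index + 1 corresponds to the line number in the listing.
INSTRUCTIONS = [
    "%1 = alloca i32, align 4",
    "%2 = alloca i32, align 4",
    "store i32 3, i32* %1, align 4",
    "%3 = load i32, i32* %1, align 4",
    "%4 = add nsw i32 %3, 2",
    "store i32 %4, i32* %1, align 4",
    "%5 = load i32, i32* %1, align 4",
    "%6 = sub nsw i32 %5, 3",
    "store i32 %6, i32* %2, align 4",
]

def data_dependences(instructions):
    """Return (load line, store line) pairs for straight-line code."""
    last_store = {}  # variable -> line number of the most recent store
    deps = []
    for number, inst in enumerate(instructions, start=1):
        pointer = re.search(r"\* (%\d+)", inst)
        if pointer is None:
            continue  # alloca, add, sub: no memory access
        variable = pointer.group(1)
        if inst.startswith("store"):
            last_store[variable] = number
        elif " = load " in inst and variable in last_store:
            deps.append((number, last_store[variable]))
    return deps

deps = data_dependences(INSTRUCTIONS)
print(deps)
```

Under this assumption, the load on line 7 depends on the store on line 6 and not on the store on line 3, because the later store overwrites the earlier value.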

2.2 Vulnerability Databases


2.2.1 Version Control
A code repository consists of a software project and updates made to it. For open-source code
repositories, all code in the project and changes previously made to the project are visible to the public.
Commit: A commit makes changes to existing source code in a repository. In a commit, code in a
project can be added, deleted or modified. All changes made by a commit can be represented as a diff,
which shows the difference between the code before the commit and the code after the commit is made.
Every commit is associated with a commit number. A commit usually contains a commit message,
explaining what changes are made by the commit.
Parent Commit: As a project is developed, the code in the project may be changed by multiple
commits. One can denote a piece of the code between changes using a commit number (e.g., function
“rb_alloc_aux_page” in commit b16155a0b01ae999add72b2ad2791b9c66285880 is the “rb_alloc_aux_page”
function after commit b16155a0b01ae999add72b2ad2791b9c66285880 is made). A parent commit is the
commit that a certain commit is based on. All code in the parent commit is in the state immediately
before any change in the current commit is made.

2.2.2 CVE/NVD Database


The CVE database [8] consists of a list of vulnerability entries reported from real-world projects. Each
entry contains a CVE number and a text description of the vulnerability. While it also contains references
to public projects, the references usually do not precisely point to the code location where the
vulnerability occurs.
The NVD (National Vulnerability Database) [9] is built upon the CVE database: it adds information
such as severity score, vulnerability type and impact to each CVE entry. We use C/C++ vulnerability
samples from these databases as training and testing data for our models. The part of the information
we use is mostly shared by the CVE and NVD databases. Therefore, we use CVE and NVD interchangeably
most of the time.

2.2.3 SARD Juliet Test Suite


SARD [10] is a dataset of vulnerability test cases collected in order to evaluate code analysis tools. The
Juliet Test Suite [11] in SARD consists of small synthetic C/C++ test cases. There are a large number
of vulnerable and non-vulnerable examples in SARD Juliet under different categories of vulnerabilities.
The test cases are usually much simpler than code in the real world and may not have the same features
as actual vulnerabilities. In Section 5.3.5, we use the C/C++ samples in the SARD Juliet dataset, both
to evaluate our model’s ability to learn simple vulnerability patterns and to examine how these learned
patterns can generalize to vulnerabilities in the real world.

2.3 Machine Learning


2.3.1 Machine Learning Models
Here we introduce a number of models later used for classification.
Bayesian Network: A Bayesian network is a probabilistic graphical model that is able to represent
random variables and their relationships. Each vertex of the Bayesian network denotes a random variable
(probability of an event occurring), and each directed edge denotes the conditional dependence between
two random variables. Given its parents, every vertex is conditionally independent of its non-descendants.
Naive Bayes: Naive Bayes can be considered a simple type of Bayesian network. It models both
the attributes and the class variable. Naive Bayes assumes all attributes are conditionally independent
of each other: every attribute only depends on the class variable.
Logistic Regression: In logistic regression, the input feature vector is multiplied by a set of
coefficients trained by the model. Each coefficient corresponds to one input feature. The weighted sum $x$
is then fed into a sigmoid function $y = \frac{1}{1 + e^{-x}}$ to smooth the outcome and restrict its
range to between 0 and 1. The output of the sigmoid function is the prediction result.
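
A minimal Python sketch of this computation follows. The feature values and coefficients are made up for illustration, and the bias term is an assumption that the description above does not mention (it defaults to zero here).

```python
import math

def sigmoid(x):
    """Map any real number to the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def predict(features, coefficients, bias=0.0):
    """Weighted sum of the features, passed through the sigmoid.

    The bias term is an assumption for illustration.
    """
    weighted_sum = bias + sum(f * c for f, c in zip(features, coefficients))
    return sigmoid(weighted_sum)

# Made-up feature vector and coefficients: the weighted sum is
# 1.0 * 0.5 + 2.0 * (-0.25) = 0, so the prediction is sigmoid(0).
score = predict([1.0, 2.0], [0.5, -0.25])
print(score)
```

A score above a chosen threshold (commonly 0.5) would be classified as the positive class.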
Neural Network: A neural network has several layers of neurons. Each neuron holds a weighted
sum of the outputs of the neurons in its previous layer. Activation functions such as the sigmoid function
can be further applied to the weighted sums of each layer. The last layer of a neural network represents
its prediction result. A neural network is similar to several layers of logistic regression units.
Random Tree: A random tree, similar to a decision tree, has a flowchart-like structure. A random
tree has many internal nodes. Each node tests whether one or several features match certain conditions.
Depending on the result of the test, branches can be taken from the testing nodes to either different
descendant nodes or prediction outcomes. However, instead of using all features for decision making, a
random tree randomly selects a subset of features for use.
Random Forest: A random forest consists of several random trees. The prediction made by a
random forest classifier is obtained by averaging the predictions made by individual decision trees. This
prevents the model from overfitting. To further decrease the correlation between decision trees, each
random tree can also be trained on a different subset of the training set.
Bidirectional LSTM: An LSTM model is a type of recurrent neural network that is able to capture
long term dependencies. An LSTM consists of a sequence of connected cells. Each cell takes a vector as
input, and outputs a vector state. Every LSTM cell forgets some information carried by the previous cell
and adds some new information from its current input. The entire LSTM model thus takes a sequence
of vectors as input.
In a bidirectional LSTM (BiLSTM), information flows both forward and backward. A BiLSTM
consists of two LSTMs: the outputs of the two LSTMs can be merged with operations such as addition,
multiplication and concatenation. The output of an LSTM may be fed into other machine learning models
for tasks such as classification.
Word2Vec Skip-gram Model: The word2vec skip-gram model [12] is commonly used in natural
language processing. It generates an embedding representation for every word in the language. The
skip-gram model first finds neighbors for every word in a corpus of training text. A context size is
specified to limit the number of neighbors before and after each word. For example, a context size of 3
means that only the 3 words before and the 3 words after the target word are neighbors.
Every input word is represented as a one-hot embedding, whose dimension is the same as the size
of the vocabulary. The input embeddings are then fed into a neural network classifier. The output of
the classifier is also a vector, whose size is the same as the size of the input embedding. To train the
skip-gram model, each dimension of the output layer is set to 1 if its corresponding word is a neighbor
of the input word, and set to 0 otherwise. After the loss is minimized, the weight matrix at the hidden
layer holds the final word embeddings. To get the embedding for a certain word, the one-hot embedding
of that word can be fed into the trained neural network. The vector representation obtained at the
hidden layer is the word embedding. In Section 5.2.2 of this thesis, the word2vec skip-gram model is used
to generate embeddings for instructions in programs.
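
The neighbor-finding step can be sketched in a few lines of Python. This is only an illustration of how a context size limits the neighbors of each word. It does not train an embedding, and the function name skipgram_pairs and the example tokens are hypothetical.

```python
def skipgram_pairs(tokens, context_size):
    """Return (target, neighbor) pairs within context_size positions."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - context_size)
        end = min(len(tokens), i + context_size + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own neighbor
                pairs.append((target, tokens[j]))
    return pairs

# Hypothetical token sequence with a context size of 1:
# each word is paired only with its immediate neighbors.
pairs = skipgram_pairs(["alloca", "store", "load"], context_size=1)
for target, neighbor in pairs:
    print(target, "->", neighbor)
```

Each pair then serves as one training example: the one-hot embedding of the target is the input, and the neighbor marks a dimension of the output layer that is set to 1.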

2.3.2 Evaluation Criteria


In this section, we introduce a number of evaluation measures used for our machine learning techniques.
True Positive & False Positive: True positives (TP) for a certain class C are the samples in
class C that are correctly predicted as C. False positives (FP) are samples not in class C
but misclassified as C. Similarly, true negatives (TN) are correctly classified samples not in C, while false
negatives (FN) are samples actually in class C, but misclassified as not in C.
Precision & Recall: Precision for class C is the percentage of samples actually in C over the
predicted class C samples. Recall for class C, on the other hand, is the percentage of all samples in C
correctly found by the classifier. Precision and recall can thus be calculated by the following formulas:
$$\text{precision} = \frac{TP}{TP + FP}$$
$$\text{recall} = \frac{TP}{TP + FN}$$
There is also a precision-recall tradeoff for a specific class C: to get a high recall, one may need to
classify ambiguous results as belonging to class C. However, as more ambiguous samples are classified
as C, there would be more false positives for class C and the precision for C would decrease.
F1 Score: As there is a precision-recall tradeoff, to represent how well a classifier performs in a
specific class, F1 score is developed as a metric for evaluation. F1 score is defined as the harmonic mean
of precision and recall:
precson·rec
F1 = 2 precson+rec
The F1 score thus summarizes precision and recall into a single number for performance evaluation.
As precision and recall are class-specific, F1 score is also associated with a specific class. Due to class
imbalance, F1 score for two different classes could be significantly different.
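As a small worked example of these definitions, the sketch below computes precision, recall and F1 score for one class from confusion counts. The counts are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) for one class from confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts: 70 true positives, 30 false positives, 20 false negatives.
p, r, f1 = precision_recall_f1(tp=70, fp=30, fn=20)
print(f"precision={p:.3f} recall={r:.3f} F1={f1:.3f}")
```

Because F1 is a harmonic mean, it is pulled toward the lower of the two values, so a classifier cannot obtain a high F1 score by maximizing only one of precision or recall.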
PRC AUC: With different thresholds used for classification, one may trade off precision and recall
in a specific class. To show the overall performance, the precision and recall pairs obtained with
different thresholds are connected, forming a PRC curve. The PRC curve plots how one metric decreases
while the other increases. The area under the PRC curve (PRC AUC), as its name suggests, is the area under the
PRC curve and ranges between 0 and 1. PRC AUC is class-specific: a higher PRC AUC means the classifier
generally has better precision and recall for a certain class. When testing on an imbalanced dataset, the
underrepresented class tends to have a lower PRC AUC, while the overrepresented class tends to have a higher
PRC AUC.
True Positive Rate (TPR) & False Positive Rate (FPR): The true positive rate (TPR) is
defined as the recall for the positive class. The false positive rate (FPR) is the proportion of negative
samples wrongly classified as positive over all actual negative samples. The false positive rate (FPR)
can also be considered as 1 minus the recall for the negative class.
\[ \text{TPR} = \frac{TP}{TP + FN} = \text{recall for the positive class} \]
\[ \text{FPR} = \frac{FP}{TN + FP} = 1 - \frac{TN}{TN + FP} \]
ROC AUC: The ROC curve plots how TPR varies with FPR under different thresholds (its x-axis
is the FPR, while its y-axis is the TPR). The area under ROC curve (ROC AUC) is also a number
between 0 and 1. ROC AUC is measured for both classes: one can get the ROC curve for the negative
class by flipping the axes from the ROC curve for the positive class. ROC AUC measures how well
a classifier distinguishes between two classes. As a classifier is able to better separate data from two
classes, it is easier to get higher TPR with low FPR, and thus the ROC AUC is closer to 1. In other
words, with a good ROC AUC, it is easier to get good recall values for both classes.
ROC AUC represents the probability that a model ranks a randomly chosen positive sample above a randomly chosen negative one.
An ROC AUC of 0.5 or below means the classifier does no better than a random classifier. Class imbalance
in test data generally does not affect the ROC AUC much. For example, with more negative samples,
there are both more true negatives and false positives, and thus the FPR on an imbalanced set would
be similar to the FPR on a balanced set. On the other hand, true positives and false negatives are not
affected by the number of negative samples, so TPR is unchanged.
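The insensitivity of ROC AUC to class imbalance can be checked with a small sketch. It uses the pairwise-ranking interpretation of ROC AUC (the fraction of positive/negative pairs in which the positive sample receives the higher score). The scores below are made up for illustration.

```python
def roc_auc(pos_scores, neg_scores):
    """ROC AUC as the probability that a positive outscores a negative.

    Ties between a positive and a negative score count as half a win.
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def rates(pos_scores, neg_scores, threshold):
    """Return (TPR, FPR) when samples scoring >= threshold are predicted positive."""
    tpr = sum(s >= threshold for s in pos_scores) / len(pos_scores)
    fpr = sum(s >= threshold for s in neg_scores) / len(neg_scores)
    return tpr, fpr


# Hypothetical classifier scores for positive and negative samples.
pos = [0.9, 0.8, 0.4]
neg = [0.7, 0.3, 0.2, 0.1]

balanced_auc = roc_auc(pos, neg)
# Replicating the negatives ten times simulates a more imbalanced test set
# drawn from the same score distribution.
imbalanced_auc = roc_auc(pos, neg * 10)
print(balanced_auc, imbalanced_auc)
```

Both TPR and FPR are ratios taken within a single class, so adding more negative samples with the same score distribution leaves every point on the ROC curve, and hence the area under it, unchanged.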
Chapter 3

Preliminaries

Before model training can start, there are a few preparations to be done and a number of challenges
to be considered. In this chapter, we first define what constitutes a vulnerability in this project. Then
we present both the challenges we face and the methods we use to prepare the set of vulnerabilities.

3.1 Definition for Vulnerability


Software vulnerabilities are usually flaws in code that can be exploited by attackers to put programs into
abnormal execution states for their own benefits. In this project, we aim to detect patterns that indicate
such flaws. A vulnerability is thus defined as code patterns that cause abnormal/unexpected
behaviors for a program. For example, a memory corruption vulnerability may contain both an
incorrect check and memory copying code guarded by the check.

3.2 Dataset Preparation


Before training and testing the machine learning models, we need to build a set of known vulnerabilities
with relevant code. As a result, we collect C/C++ vulnerability samples from the CVE/NVD database,
which consists of reported vulnerabilities in real-world projects. While there are tens of thousands
of vulnerability entries in the CVE/NVD database, they mostly only describe vulnerabilities in plain
English. On the other hand, there are many public code repositories on open-source platforms such as
GitHub, but they do not explicitly state which versions of code contain vulnerabilities and which versions
fix vulnerabilities. As a result, we find commits in open-source code repositories or links
connected to commits in the reported CVE entries. Then, using web crawlers, we download diff patches
from various open-source platforms that fix vulnerabilities. These patches indicate lines of code deleted
and added by a commit.
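To make the role of the diff patches concrete, the following sketch extracts the deleted and added line numbers from a unified diff. It is a simplified illustration that assumes a single-file patch; the patch content, the file name `copy.c`, and the helper `changed_lines` are hypothetical and do not come from the crawler used in this work.

```python
import re

# Matches a unified-diff hunk header such as "@@ -10,4 +10,5 @@ optional context".
HUNK_RE = re.compile(r"^@@ -(\d+)(?:,\d+)? \+(\d+)(?:,\d+)? @@")


def changed_lines(patch_text):
    """Return (deleted, added): line numbers in the old and the new file."""
    deleted, added = [], []
    old_no = new_no = None
    for line in patch_text.splitlines():
        match = HUNK_RE.match(line)
        if match:
            old_no, new_no = int(match.group(1)), int(match.group(2))
        elif old_no is None or line.startswith("\\"):
            # File headers before the first hunk, or "\ No newline at end of file".
            continue
        elif line.startswith("-"):
            deleted.append(old_no)
            old_no += 1
        elif line.startswith("+"):
            added.append(new_no)
            new_no += 1
        else:
            # Context line: present in both versions of the file.
            old_no += 1
            new_no += 1
    return deleted, added


# Hypothetical patch that changes one check and adds an early return.
EXAMPLE_PATCH = "\n".join([
    "--- a/copy.c",
    "+++ b/copy.c",
    "@@ -10,4 +10,5 @@ void copy(char *dst, const char *src, size_t n)",
    " {",
    "-    if (n > MAX)",
    "+    if (n >= MAX)",
    "+        return;",
    "     memcpy(dst, src, n);",
    " }",
])
deleted, added = changed_lines(EXAMPLE_PATCH)
print(deleted, added)
```

The deleted line numbers refer to the file before the patch, which is the version that contains the vulnerability.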
After downloading these diff patches, we compile code in open-source projects corresponding to the
diff patches into LLVM bitcode for further feature extraction. Although in CVE/NVD, a large number
of vulnerabilities are described in plain text, there are not as many vulnerabilities that we can obtain
compilable code patches from. In total, we have collected 2177 code samples before patch and 2145 code
samples after patch.

3.3 Challenges
There are a number of challenges that need to be considered before designing a strategy for vulnerability
prediction. While these challenges may not be completely resolved, different techniques we present later
can achieve better performance in some aspects by aiming to solve some of the problems discussed below.
Size and Quality of Dataset: As the proportion of vulnerable and non-vulnerable code is imbal-
anced (there is much more non-vulnerable code than vulnerable code in the real world), compared to
the number of non-vulnerable samples, there are not as many samples for vulnerable code. In fact, in
the CVE database consisting of reported vulnerabilities, vulnerabilities with patches available have more
variance and complexity than the size of the dataset represents. While there are vulnerable samples in
synthetic datasets, they do not accurately reflect the nature of real-world vulnerabilities. This presents
challenges for machine learning: with a limited amount of vulnerable data, it is hard for a classifier to
learn representative patterns for vulnerabilities.
Another challenge is that there is no ground truth for non-vulnerable samples. It could be tempting
to assume that a piece of code in which no vulnerability has been reported is non-vulnerable. However, a
previously undiscovered vulnerability may still occur in there. Expert knowledge and manual inspection
could help sanitize the data, but this would be expensive in terms of time and human effort. In fact,
while expert’s inspection improves the quality of the dataset, it still cannot be guaranteed that a piece of
code is entirely free of vulnerabilities after human inspection. Although the mislabeled non-vulnerable
samples do not add up to a significant proportion, it could affect the performance of the classifier. If
some non-vulnerable samples exhibit certain characteristics of vulnerable samples, some false positive
predictions obtained by the classifier may actually be correctly predicted vulnerable samples.
Location of Vulnerability: One observation is that the difference between vulnerable and non-
vulnerable code is usually very small. Patches for vulnerabilities usually involve changes in only a
few lines of code. The change is typically much less than the size of a function/file, which could consist
of hundreds of lines of code.
As a result, it would be ineffective to simply feed detailed features representing a large chunk of code
into a machine learning classifier. If the representation is not well selected, the small amount of code
indicative of the difference between vulnerable and non-vulnerable code would have little or even no
effect on the final representation of the code: other features such as the functionality of the code may
instead dominate the difference in representation.
In addition, it is also a challenge to accurately identify which parts of the code a known vulnerability
resides in exactly, even after knowing the lines of code changed in patches: the lines changed in a patch
may not contain complete patterns for the vulnerability. For example, a fix for a buffer overflow may be
an added/changed check before memory access: the added/modified check, although related, should not
be considered as vulnerable by itself. By simply looking at the patch, it is not clear at which point unexpected behaviors may
occur. For the buffer overflow example just described, as there could be a
lot of code guarded by the check, it is hard to identify which part is actually related to the vulnerability
without careful analysis.
Some interprocedural vulnerabilities may also have several checks in different functions. For this
type of vulnerability, it may not be clear which function or part of the code a vulnerability is in. One
function could return an incorrectly calculated offset, while another function uses the result without a check.
For simplicity, we currently only deal with intraprocedural vulnerabilities.
Chapter 4

Coarse-grained Statistics as
Function-Level Features

Although the difference between vulnerable and patched code usually involves only a few lines of code,
there are certain patterns at a larger scale of the code that indicate if a function is more likely to have
vulnerabilities. We aim to distinguish these vulnerability patterns with coarse-grained function-level
statistics. For our statistic-based strategy, rather than trying to directly distinguish between functions
with or without vulnerabilities, the goal is thus to predict which functions are more likely to be
vulnerable.
For every function, a feature vector is computed, of which each dimension represents a certain statistic
measure in the function. This model addresses the problem of a limited dataset previously mentioned:
simpler features and models need fewer training points to work well. It is also easier to see which
features are more relevant to the result of classification. However, these statistical features are rather
coarse-grained. This means not all meanings of instructions and relationships in code are preserved. In
addition, as the features are manually picked, the performance of the model is highly dependent on the
types of features selected.

4.1 Statistics for Features


For better performance, we try using features closely related to potential vulnerability patterns, while
excluding information specific to only certain types of functions or projects. Thus, we mostly choose
semantic features from the LLVM IR [7] representation of the functions. We do not use syntactic features
other than the lines of code in a function, mostly because the syntactic features are more relevant to
the coding style of specific groups of programmers and thus limit the performance of classifiers across
projects. In contrast, we would like to preserve semantic features generalizable to all projects, and
therefore these features may be used to find vulnerabilities in unseen projects.

4.1.1 List of Features


We investigate a list of statistics as features for our machine learning model. Table 4.1 lists the features
we experimented with in our statistics-based approach. For features such as load/store/branch instructions and functions
called, the number of certain instructions/function calls is divided by the total number of instructions
in the function. We use the proportion, rather than the number of certain instructions in the function,
because for larger functions that have more lines of code (and thus more instructions), the number of all
types of instructions increases proportionally. As a result, the number of certain instructions/function
calls does not represent the overall density for a specific type of operation. Scaling is performed, because
these features are supposed to be independent of function size. Below we describe the list of features
we use.
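The scaling described above can be sketched as follows. The opcode names, the set of tracked opcodes, and the helper `proportion_features` are assumptions for illustration; the actual features are extracted from LLVM IR as listed in Table 4.1.

```python
from collections import Counter


def proportion_features(opcodes, tracked=("load", "store", "br", "call")):
    """Divide the count of each tracked opcode by the total instruction count."""
    counts = Counter(opcodes)
    total = len(opcodes)
    return {op: (counts[op] / total if total else 0.0) for op in tracked}


# A hypothetical small function and a function ten times its size
# with the same instruction mix.
small = ["load", "add", "store", "br"]
large = small * 10

print(proportion_features(small))
print(proportion_features(large))
```

The raw counts of the two functions differ by a factor of ten, but their proportions are identical, which is the size independence the scaling is intended to provide.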
Size of Function: Number of lines, basic blocks, and instructions are all included as indicators of
the size of a function. Function sizes are included for two reasons: 1) Larger functions have more
code and thus may also have more chances for a vulnerability to occur. 2) Machine learning models can
use these features as normalizers for other features so they do not overemphasize or deemphasize certain
patterns when they occur more often in larger functions.
There are also correlations between these three features. Fewer lines of code with more instructions
mean that every line of code written is more complex. Fewer instructions relative to basic blocks can also
indicate more control flow paths within a function.
Loads/Stores with Pointers: Both the proportion of load/store instructions and loads/stores
involving pointers/arrays in the function are included as features. This is based on the intuition that
memory errors are usually associated with loads/stores with pointers. For example, during a buffer
overflow, the program stores data into a buffer (represented as a pointer in LLVM) without properly
checking if the end of the buffer is hit, and overwrites sensitive data. Additionally, loading values from
pointers pointing to invalid addresses could result in program crashes and denial of service vulnerabilities.
For every load/store with pointers, we also track pointer offsets and load operations to find the
original pointers involved. (For example, loads/stores to array[i], array[1] and *array all correspond
to the same pointer array.) The proportion of pointers loaded/stored is then kept as features. If there
are more load/store instructions but a relatively small number of pointers loaded/stored, this means a
few important pointers are frequently used and the code associated with these pointers would be more
complex. These patterns may be associated with vulnerabilities, as with increased complexity and denser
load/store with pointers, it requires more effort for the programmers to check for proper usage of those
pointers.
Pointer Cast Operations: Cast operations are also tracked: specifically, we take into account
casts with pointers. Both casts to and from pointers are considered. These operations are included,
as improper casts would result in vulnerabilities. For example, incorrectly casting between pointers to
different data structures can lead to misalignments. Accessing fields using the casted pointers may thus
yield unexpected results.
Type Conversions: The proportion of convert instructions is included. Different types of convert
instructions are treated separately: the proportion of truncations, extensions, and conversions between
different literal types are separately calculated. Truncations limit the number of bits in types, while
extensions add more bits to existing types. Conversions between literal types, as the name suggests, convert
one literal type to another (e.g., float to int).
Type conversions without proper checks could lead to vulnerabilities. For example, when converting
a signed 8-bit char to an unsigned integer, if the char is negative, the converted unsigned integer would become
a large positive number. When this large positive number is used later, for instance, as a length variable
for copying data, the program could overwrite sensitive parts of memory outside of the buffer boundary.
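The conversion hazard in this example can be reproduced with plain integer arithmetic. The sketch below assumes a 32-bit unsigned integer as the destination type; the helper name `signed_char_to_uint32` is hypothetical.

```python
def signed_char_to_uint32(byte_value):
    """Interpret a byte as a signed 8-bit char, then convert it to unsigned 32-bit.

    The value is sign-extended first, so a negative char becomes a very large
    positive number after the conversion.
    """
    value = byte_value & 0xFF
    if value & 0x80:          # sign bit set: the char is negative
        value -= 0x100
    return value & 0xFFFFFFFF  # reinterpret as unsigned 32-bit


print(signed_char_to_uint32(0x7F))  # char 127 stays 127
print(signed_char_to_uint32(0xFF))  # char -1 becomes 4294967295
```

If the result were later used as a length for a memory copy, the copy would extend far beyond any reasonably sized buffer.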
# of Lines
# of Basic Blocks
# of Instructions
Proportion of Load Instructions with Pointers
Proportion of Store Instructions with Pointers
Proportion of Pointers Loaded
Proportion of Pointers Stored
Proportion of Cast Instructions
Proportion of Cast Instructions with Pointers
Proportion of Cast to Pointers
Proportion of Cast from Pointers
Proportion of Convert Instructions
Proportion of Extension Instructions
Proportion of Trunc Instructions
Proportion of Conversion Between Literal Types
Proportion of getelementptr Instructions (GEPI)
Proportion of GEPI with first zero index
Proportion of double GEPI (loaded GEPI as operand of GEPI)
Proportion of Cmp Instructions
Proportion of Cmp Instructions with Pointers
Proportion of Cmp Instructions with NULL Pointers
Proportion of Arithmetic Op Instructions
Proportion of Add & Minus
Proportion of Multiply & Divide
Proportion of MOD
Proportion of AND & OR
Proportion of XOR
Proportion of Shift
Proportion of Branch Instructions
Maximum Depth of Nested Loop
# of Top-Level Loops
Proportion of Functions Called
Proportion of Functions Called in the Same File
Proportion of specific LIBC Calls (Multiple Features Grouped)

Table 4.1: List of Function-Level Features

Pointer Index Operation (GEPI): The getelementptr instruction in LLVM calculates a memory
location with relative offsets to a pointer. It takes a pointer operand and can have many indices. The
type pointed to by the pointer operand is typically a data structure such as an array or struct: the data structure
is indexed by the indices that follow the first one. The first index of a getelementptr instruction represents the offset
to the pointer operand. A first index of zero means the calculated memory location is in the element
pointed to by the operand itself.
The proportion of getelementptr instructions is included as a feature, as this indicates the frequency
of pointer operations. We also keep track of the getelementptr instructions whose first offset is zero.
This feature is kept as it is related to direct pointer dereference. We define a double getelementptr as
a getelementptr instruction whose operand contains loads to another getelementptr instruction. This
pattern is associated with many levels of pointer offset calculations and deep data structures. These
features are included, as more frequent and complex getelementptr instructions mean there are complex
pointer operations in code, and can be correlated to vulnerabilities.
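The role of the first index can be illustrated with a simplified address calculation. The sketch below models a pointer to a hypothetical struct with a 4-byte integer field followed by a 16-byte character array (20 bytes in total); the layout and the helper `gep_offset` are assumptions for illustration and do not reproduce LLVM's actual data layout rules.

```python
def gep_offset(struct_size, field_offsets, elem_size, first_index, field, elem_index):
    """Byte offset computed by a three-index getelementptr-style calculation.

    first_index steps over whole structs starting at the pointer operand,
    field selects a struct member, and elem_index selects an array element
    inside that member.
    """
    return first_index * struct_size + field_offsets[field] + elem_index * elem_size


# Hypothetical layout: struct { int32 len; char buf[16]; }
STRUCT_SIZE = 20
FIELD_OFFSETS = [0, 4]   # len at byte 0, buf at byte 4
CHAR_SIZE = 1

# First index 0: buf[3] of the struct the pointer refers to.
inside = gep_offset(STRUCT_SIZE, FIELD_OFFSETS, CHAR_SIZE, 0, 1, 3)
# First index 2: buf[3] of the struct two elements past the pointer.
beyond = gep_offset(STRUCT_SIZE, FIELD_OFFSETS, CHAR_SIZE, 2, 1, 3)
print(inside, beyond)
```

With a first index of zero the computed location stays inside the object pointed to by the operand, whereas a nonzero first index treats the operand as the start of an array of such objects.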
Cmp Instructions: The proportion of cmp instructions is also included, as they are usually necessary
checks in code guarding crucial operations such as memory copying and API calls. Comparisons with
pointers and null values are also tracked. Checks involving pointers and null values would be relevant
to null pointer dereferences. A function with more of these checks usually needs more careful attention,
as a single missing check could cause a vulnerability.
Arithmetic Operations: Arithmetic operations are included as features. Different types of arithmetic
operations are considered separately. Add and subtract instructions are often used as counters in
programs. Multiply and divide may be used to calculate the size or offset of memory. MOD, XOR, AND,
OR and shift operations are not as frequently used and may be associated with specific functionality
of programs such as encryption. These features, when combined with other features such as pointer
operations, may reveal patterns in code that could be related to vulnerabilities.
Branches and Loops: The proportion of branch instructions indicates the complexity of the code:
more branch instructions mean more varied execution paths. The number of loops and the maximum
depth of the loops are also included. The existence of a loop could significantly increase the number
of potential execution paths and complicate the control/data dependence of instructions involved in
the loop. Nested loops bring dependences between loops and further increase the amount of code
analysis that needs to be done in order to guarantee correctness. These features are included as there
is usually a relationship between the complexity of the code and the probability for it to be vulnerable.
The more complex a piece of code is, the more effort human programmers/reviewers need and
the more difficult it is for a program analysis tool to fully analyze the code and ensure its correctness.
This increases the likelihood that a mistake or vulnerability occurs.
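The two loop features can be computed from a loop nesting tree, as in the sketch below. The nested-list representation and the helper `loop_stats` are assumptions for illustration; in practice the loop structure would be obtained from a compiler analysis.

```python
def loop_stats(loop_forest):
    """Return (number of top-level loops, maximum loop nesting depth).

    loop_forest is a list of top-level loops; each loop is represented as
    the list of loops nested directly inside it.
    """
    def depth(loop):
        return 1 + max((depth(child) for child in loop), default=0)

    top_level = len(loop_forest)
    max_depth = max((depth(loop) for loop in loop_forest), default=0)
    return top_level, max_depth


# Hypothetical function with two top-level loops; the first one contains
# a single inner loop, the second one has no inner loops.
example = [[[]], []]
print(loop_stats(example))
```

A function without loops yields zero for both features, and a single loop without inner loops has a depth of one.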
Functions Called: The number of functions called is included as a feature as well. The inappropriate
usage of API functions, or missing checks on their results, could result in vulnerabilities. More function calls also
mean potentially more complicated interactions between functions.
To distinguish between commonly used utility API functions in a project and local helper functions
with specific functionalities, we include the number of called functions in the same file. A set of commonly
used libc functions is also tracked. These functions are related to program input/output, file
reading/writing, memory operations, and string manipulation. Improper usage of these libc functions
could also lead to vulnerabilities.

4.1.2 Unscaled Features


Although the proportion of instructions is normalized to the number of instructions in the function, the
depth and number of loops are not normalized. Although code with more loops tends to form
larger functions, loops are more related to the structure of the code and indicate a function’s complexity
overall. These features cannot be treated as the frequency of specific patterns occurring over a certain
amount of code: a small function with one loop does not have the same complexity as a function twice the
size with two nested loops (a nesting depth of two). As the number of lines, basic blocks and instructions
are already included as features, the relation between the size of the function and features for loops can
be captured by classifiers. Thus, there is no need to scale them with respect to the size of the function.

4.2 Performance Evaluation


In Section 4.1, we extract a vector of selected features from every function. To evaluate our function-
level statistical features, we feed them into machine learning models. We evaluate the performance of
different machine learning classifiers on a set of open-source projects. Cross-validation is used both
for training and evaluation.

4.2.1 Experimental Setup and Evaluation Criteria


Dataset Generation: To obtain a set of vulnerable functions, we use commits patching CVE/NVD
vulnerabilities. As previously mentioned in Section 3.3, the difference between patched and unpatched
code is very small. While the vulnerability in a patched function may be fixed, certain patterns in its
code still indicate that the function is likely to be vulnerable. Therefore, these patched functions should
not be included in the non-vulnerable dataset.
However, these patched functions should not be included in the vulnerable dataset either, although
they may be considered as potentially vulnerable. This is because the patched functions are very similar
to their vulnerable counterparts before the fixing commit (as the difference between patched and un-
patched code is small), and statistical features are thus not able to clearly distinguish between the two.
Including both of them would result in duplicates in the dataset. If one sample is in the training set while
its duplicate is in the test set, it could cause misleadingly high evaluation performance on the test set.
Therefore, we find the lines of code that are deleted/modified in commits patching vulnerabilities,
and treat the corresponding functions in their parent commits as vulnerable. To further sanitize the
vulnerable dataset, only one copy is kept for vulnerable functions changed in multiple commits. For non-
vulnerable samples, we instead use functions in which no vulnerabilities have been previously reported,
and assume they are not vulnerable. While this could lead to inaccuracies in the non-vulnerable dataset
as some of those functions may contain previously unreported vulnerabilities, we still keep these functions
in the dataset because it is impractical to obtain ground truth (Section 3.3).
Dataset Imbalance and Subsampling: Table 4.2 shows the number of vulnerable functions
obtained in the open-source projects we use. These projects are selected because they contain an adequate
number of vulnerable functions we can compile, and thus can be used for cross-project evaluation. The
vulnerable and non-vulnerable samples from each project are generated using the method described
above. We included functions from all compilable code in each project. As shown, there are far
more non-vulnerable functions (functions in which no vulnerability has been discovered) compared to
vulnerable ones.

Project        Number of Vulnerable Samples    Number of Non-Vulnerable Samples
FFmpeg         162                             20,357
Imagemagick    140                             5,907
Linux          916                             62,247
Qemu           229                             64,780
Wireshark      155                             116,028
Total          1,602                           269,319

Table 4.2: Number of Vulnerable Functions in Different Projects

This leads to another issue that we need to deal with: dataset imbalance. With too many non-
vulnerable samples in the training set, classifiers would prefer making predictions in the non-vulnerable
class. In the extreme case, a classifier labeling every function as non-vulnerable could still achieve a low
loss, but this classifier is entirely useless for us.
As a result, for training, the non-vulnerable functions are subsampled: we randomly select only part
of the samples from the non-vulnerable dataset. To reduce the number of non-vulnerable samples, we
set the size of the non-vulnerable set to 1x, 1.5x, 5x, and 10x that of the vulnerable set. More
non-vulnerable samples than vulnerable ones are selected, to take into account the fact that there are more
non-vulnerable functions in the real world.
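The subsampling step can be sketched as follows. The function name `subsample_non_vulnerable`, the fixed random seed, and the placeholder sample identifiers are assumptions for illustration.

```python
import random


def subsample_non_vulnerable(vulnerable, non_vulnerable, ratio, seed=0):
    """Randomly keep ratio * len(vulnerable) non-vulnerable samples."""
    k = min(len(non_vulnerable), int(round(ratio * len(vulnerable))))
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    return rng.sample(non_vulnerable, k)


# Placeholder identifiers standing in for function samples.
vulnerable = list(range(100))
non_vulnerable = list(range(1000, 11000))

for ratio in (1, 1.5, 5, 10):
    subset = subsample_non_vulnerable(vulnerable, non_vulnerable, ratio)
    print(ratio, len(subset))
```

Sampling is done without replacement, so no non-vulnerable function appears twice in the reduced set.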
Evaluation Criteria for Imbalanced Dataset: In addition, to measure performance over the
imbalanced datasets, accuracy and loss alone are not good measures: bad performance in the
underrepresented class may still lead to a good result overall. As a result, along with accuracy, we use
recall, precision and F1 score for both classes in our evaluation. We also incorporate the area under
curve (AUC) for ROC and PRC curves to get a measure for performance under all possible thresholds
for classification.
Class imbalance also limits the precision of the trained classifiers. As there are
many more non-vulnerable functions than vulnerable ones in practice, even a small probability of misclassifying
a non-vulnerable function as vulnerable could greatly increase the number of false positives. On one hand,
measures such as precision and the PRC curve thus may not be as viable, since they would be
significantly affected even if false positive predictions are made with only a small probability. On the other hand, this issue
also matters in practice, since reduced precision would increase the amount of work spent inspecting the
functions predicted as potentially vulnerable in order to find real vulnerabilities. As a result, precision and
recall for the top-k percent of functions in the vulnerable class may be considered, to measure the amount of
inspection work saved by using these machine learning classifiers.
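The effect of class imbalance on precision can be seen with a short calculation. The sketch below uses the sample totals from Table 4.2, while the true positive rate of 0.7 and false positive rate of 0.3 are assumed values chosen only for illustration, not measured results.

```python
def expected_precision(n_pos, n_neg, tpr, fpr):
    """Precision implied by fixed TPR and FPR on a test set of a given makeup."""
    tp = tpr * n_pos
    fp = fpr * n_neg
    return tp / (tp + fp) if (tp + fp) else 0.0


# Assumed classifier behaviour (illustrative values only).
TPR, FPR = 0.7, 0.3

# Balanced test set: as many non-vulnerable as vulnerable samples.
balanced = expected_precision(1602, 1602, TPR, FPR)
# Full test set with the sample totals from Table 4.2.
imbalanced = expected_precision(1602, 269319, TPR, FPR)
print(f"balanced={balanced:.3f} imbalanced={imbalanced:.4f}")
```

With the same TPR and FPR, precision falls from 0.7 on the balanced set to roughly 0.014 on the full set, because the false positives scale with the much larger number of non-vulnerable functions.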
Machine Learning Models: We evaluate our statistical features using a variety of machine learning
models. Table 4.3 below lists the models we use. Abbreviations that we later use to represent their
corresponding classifiers are also included.

Bayes Net (BN)
Naive Bayes (NB)
Neural Net (NN)
Logistic Regression (LR)
Random Forest (RF)
Random Tree (RT)

Table 4.3: List of Machine Learning Models Evaluated

4.2.2 Overall Performance


In this section, we discuss the overall performance for models trained using coarse-grained statistical
features. Different machine learning classifiers described in Section 4.2.1 are used. We evaluate the per-
formance of our proposed strategy using cross-validation on all projects, with different subsampling rates.
We also test how classifiers are able to learn vulnerable patterns inside the same projects. Additionally,
we evaluate the performance of classifiers trained and tested on imbalanced datasets.
Performance On All Projects: To evaluate the performance of classifiers overall, we evaluate
different classifiers on vulnerabilities obtained in all open-source projects we use. The types of machine
learning models and their abbreviations are previously described in Section 4.2.1. To even out randomness
during the evaluation, we use 10-fold cross-validation.
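A minimal sketch of how a 10-fold split can be generated is shown below. The helper `k_fold_indices` and the fixed seed are assumptions for illustration and do not describe the specific tooling used for the experiments.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of nearly equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


splits = list(k_fold_indices(25, k=10))
for train, test in splits[:2]:
    print(len(train), len(test))
```

Every sample is used for testing exactly once and for training in the remaining nine folds, so the reported metrics are averaged over the whole dataset rather than a single random split.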
Table 4.4 shows the performance of the different classifiers over all projects when training
with a balanced dataset. The non-vulnerable set is subsampled, so the number of samples in the
non-vulnerable set is the same as the number of samples in the vulnerable set. Overall, the classifiers
do not differ much in performance. The ROC AUC is
around 0.7-0.8, and the accuracy is around 0.7. This means around 70% of the samples in the balanced
dataset are correctly classified. The classifier has a 70-80% probability of ranking a vulnerable sample as more
likely to be vulnerable than a non-vulnerable one. While this performance is not excellent, the classifier
can learn some vulnerability patterns and make some correct predictions on these real-world projects
we use. Although different classifiers or tuning may cause slight differences in precision and recall, these
two values are almost the same: as the dataset is balanced, there is no need to significantly trade off
one metric for another to get better loss.
For all models, the vulnerable class has slightly better PRC AUC scores than the non-vulnerable class.
This indicates that using the set of features we pick, it would be slightly easier to classify vulnerable
functions than to classify non-vulnerable ones. While this could demonstrate some effectiveness of the
set of features picked, another reason could be that non-vulnerable samples have more diverse functionality
and characteristics, compared to the vulnerable patterns we may capture using statistical features. This
may also be due to the fact that some non-vulnerable functions exhibit certain features of the vulnerable
set: such a function may have a high chance of being vulnerable, but either the code has been
checked carefully so that potential vulnerabilities are eliminated, or the function
is actually vulnerable, with an existing vulnerability that has not yet been found or reported.

Model  Class     Precision  Recall  F1     PRC AUC  ROC AUC  Accuracy
BN     vuln      0.739      0.684   0.710  0.803    0.779    0.721
       not vuln  0.706      0.758   0.731  0.733
NB     vuln      0.706      0.691   0.699  0.779    0.762    0.702
       not vuln  0.698      0.713   0.705  0.708
NN     vuln      0.663      0.734   0.697  0.749    0.748    0.680
       not vuln  0.702      0.626   0.662  0.732
LR     vuln      0.747      0.649   0.695  0.786    0.781    0.715
       not vuln  0.690      0.781   0.733  0.745
RF     vuln      0.752      0.744   0.748  0.846    0.837    0.750
       not vuln  0.747      0.755   0.751  0.820
RT     vuln      0.671      0.684   0.678  0.620    0.670    0.674
       not vuln  0.678      0.664   0.671  0.612
(ROC AUC and Accuracy are reported once per model.)

Table 4.4: 10-Fold Cross-Validation On All Projects with Different Classifiers (Balanced Dataset)

Performance on Individual Projects: As functions in the same project share more common
programming patterns, we also evaluate the performance of different classifiers on individual projects.
We train and test on samples obtained from the same project, to see how well vulnerable patterns
discovered may generalize within projects. The dataset used is balanced. We use F1 score as a measure
for performance: F1 score is used as it considers both the precision and recall of the measured class.
To show the effect of slightly imbalanced datasets, we train the classifiers with non-vulnerable datasets
both 1 and 1.5 times the size of the vulnerable dataset. Note that the number of vulnerable samples for
all projects is previously shown in Table 4.2. The evaluated results are displayed in Table 4.5.
In Table 4.5, the classifiers generally have better F1 scores for individual projects (Linux being a notable exception), compared with general
cross-validation among all 5 projects in Table 4.4. While this indicates that vulnerabilities within one
project are less diverse and are easier to learn and capture, it also presents a challenge for finding
vulnerabilities in new projects using existing data. We will evaluate this problem later in Section 4.2.3.

Project          FFmpeg        Imagemagick   Linux         Qemu          Wireshark
Model  Class     1x     1.5x   1x     1.5x   1x     1.5x   1x     1.5x   1x     1.5x
BN     vuln      0.808  0.739  0.857  0.813  0.664  0.580  0.849  0.822  0.795  0.748
BN     not vuln  0.816  0.821  0.871  0.884  0.699  0.712  0.874  0.894  0.802  0.829
NB     vuln      0.783  0.692  0.890  0.851  0.636  0.561  0.848  0.821  0.695  0.711
NB     not vuln  0.779  0.741  0.888  0.900  0.650  0.673  0.850  0.873  0.547  0.811
NN     vuln      0.796  0.744  0.914  0.909  0.619  0.533  0.809  0.797  0.738  0.736
NN     not vuln  0.783  0.826  0.914  0.937  0.664  0.738  0.802  0.867  0.740  0.824
LR     vuln      0.774  0.696  0.908  0.861  0.646  0.506  0.819  0.806  0.734  0.696
LR     not vuln  0.775  0.809  0.906  0.907  0.685  0.773  0.827  0.877  0.750  0.823
RF     vuln      0.853  0.771  0.943  0.925  0.701  0.609  0.849  0.828  0.818  0.770
RF     not vuln  0.844  0.845  0.943  0.950  0.696  0.779  0.862  0.899  0.818  0.854
RT     vuln      0.783  0.660  0.878  0.867  0.650  0.546  0.797  0.751  0.745  0.733
RT     not vuln  0.785  0.774  0.872  0.908  0.649  0.697  0.788  0.842  0.752  0.819

Table 4.5: F1 Score for 10-Fold Cross-Validation Within Projects

The performance of the classifiers on FFmpeg, Qemu and Wireshark is fair. With more data, Qemu
does better than FFmpeg and Wireshark. Imagemagick has the best performance
across all classifiers, with F1 scores around 0.8-0.9. On the other hand, while Linux has the largest
number of vulnerable samples (more than half of the total vulnerable samples we obtained), it has the
worst intra-project evaluation, with F1 scores of about 0.7 or lower on the balanced dataset. While the set
of obtained vulnerabilities in Linux is large, we believe that the patterns in those vulnerable samples are
too diverse and are not well captured, compared to projects like Imagemagick. We believe that this may
also result from the fact that errors in kernel programming have more catastrophic effects, such as crashes,
so code in Linux has higher quality: while there are certain functions that are more prone to
vulnerabilities, the vulnerabilities in them have already been eliminated.
Difference in Classifiers: Different classifiers have different performance for both cross-validation
and training within individual projects.
The random forest classifier achieves the best performance across almost all measures. While the
above shows results from mostly balanced training, the performance of the random forest is consistently
good across all subsampling rates, even when training with an imbalanced dataset: random
forest’s sampling method makes it perform better on imbalanced datasets. Interestingly, the random
tree classifier, which can be considered a building block of a random forest, performs almost the
worst among all models. For the random tree classifier, the k randomly chosen features for a decision
tree may not cover the most important features that are relevant to vulnerability patterns. A random
forest, in contrast, averages the predictions of different trees and thus takes into account the
most relevant features.
Another reason that random forest performs better overall than classifiers such as the neural
network may be that we have a large number of features but a small vulnerability dataset.
The ability of random forests to handle a large number of features leads to its good overall performance.
(Indeed, in our experiments with fewer selected features and lower performance, random forest does not
perform the best.) With more vulnerable samples collected in the future, random forest may or
may not remain the best algorithm to choose.
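The difference between a single random tree and a forest can be illustrated with a minimal sketch. The one-feature "stumps", thresholds, and feature values below are hypothetical and only mimic trees that each see a different randomly chosen feature; they are not the trees learned in our experiments.

```python
from statistics import mean
from typing import Callable, Dict, List

Sample = Dict[str, float]
Tree = Callable[[Sample], float]  # returns a vote for the vulnerable class


def make_stump(feature: str, threshold: float) -> Tree:
    """Hypothetical one-feature tree: votes 1.0 if the feature exceeds the threshold."""
    return lambda sample: 1.0 if sample[feature] > threshold else 0.0


def forest_probability(trees: List[Tree], sample: Sample) -> float:
    """Average the per-tree votes, as a forest does."""
    return mean(tree(sample) for tree in trees)


# Illustrative feature values and thresholds (assumptions, not measured data).
sample = {"ldPtrProp": 0.30, "numBBs": 40.0, "xorProp": 0.0}
trees = [
    make_stump("ldPtrProp", 0.20),
    make_stump("numBBs", 25.0),
    make_stump("xorProp", 0.05),  # a tree that only saw a weak feature
]
single_tree_vote = trees[2](sample)
forest_vote = forest_probability(trees, sample)
print(single_tree_vote, round(forest_vote, 3))
```

Here the tree restricted to a weak feature votes non-vulnerable on its own, while the averaged vote still leans vulnerable because the other trees cover more relevant features.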
Similarly, the Bayesian network generally performs better than naive Bayes, both in cross-validation and
in intra-project training: the Bayesian network takes into account the dependences between features, while
naive Bayes assumes the features are independent. The ability to learn correlated patterns between the
features we picked results in better performance for the Bayesian network. While the Bayesian network does
not outperform naive Bayes by much, this shows that some small correlations among the
statistical features need to be captured in order to find vulnerable patterns.
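For reference, the two modeling assumptions can be written as follows. These are the standard textbook factorizations rather than formulas specific to our implementation; $c$ denotes the class (vulnerable or not), $x_1, \dots, x_n$ the statistical features, and $\mathrm{pa}(\cdot)$ the parents of a variable in the learned network structure.

```latex
% Naive Bayes: features are conditionally independent given the class
P(c \mid x_1, \dots, x_n) \;\propto\; P(c) \prod_{i=1}^{n} P(x_i \mid c)

% Bayesian network: each variable depends on its parents in the graph
P(c, x_1, \dots, x_n) \;=\; P(c \mid \mathrm{pa}(c)) \prod_{i=1}^{n} P(x_i \mid \mathrm{pa}(x_i))
```

Naive Bayes is the special case in which the class is the only parent of every feature, which is why it cannot represent correlations between features.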
For individual projects, while the neural network works better than the simple logistic regression
classifier in most projects, there is not much difference between the performance of the two classifiers. As
logistic regression can be considered similar to a one-layer neural network, this means the relationship
between the picked statistical features is not too complex, and simple logistic regression can also
work well. The neural network may not work that well overall because
its power comes from its ability to learn complex patterns from a large amount of
data. With a small amount of vulnerable data and coarse-grained statistical features, the neural
network’s performance is limited. In addition, most of the vulnerable samples we collect are from
Linux, and the fact that the neural network does not work as well on Linux affects its overall performance
on all projects.
From Table 4.5, with more non-vulnerable samples used, all classifiers perform better on
the non-vulnerable class and worse on the vulnerable class in most cases. The exception
is naive Bayes: it does not work well with Wireshark on a balanced dataset, but it works
better with a sampling rate of 1.5x for the non-vulnerable dataset. This result may require further
inspection.
Training with Imbalanced Dataset: To demonstrate how the training works with imbalanced
datasets and varied proportions of non-vulnerable samples, we use different subsampling rates for the
non-vulnerable dataset: the size of the non-vulnerable dataset is kept at 1x, 1.5x, 3x, 5x, and 10x the size
of the vulnerable dataset respectively. Both the training set and the test set are imbalanced: the ratio
of non-vulnerable samples to vulnerable samples is kept the same in the training set and the test
set.
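The subsampling step can be sketched as below. This is a minimal illustration assuming uniform random sampling without replacement; the function name, the seed, and the toy identifiers are assumptions and do not come from our implementation.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def subsample_non_vulnerable(
    vulnerable: Sequence[T],
    non_vulnerable: Sequence[T],
    ratio: float,
    seed: int = 0,
) -> List[T]:
    """Randomly keep about ratio * len(vulnerable) non-vulnerable samples."""
    target = min(len(non_vulnerable), round(ratio * len(vulnerable)))
    rng = random.Random(seed)  # fixed seed so the subsample is reproducible
    return rng.sample(list(non_vulnerable), target)


# Toy data: 100 vulnerable and 5000 non-vulnerable function identifiers.
vulnerable = [f"vuln_{i}" for i in range(100)]
non_vulnerable = [f"safe_{i}" for i in range(5000)]
sizes = {
    ratio: len(subsample_non_vulnerable(vulnerable, non_vulnerable, ratio))
    for ratio in (1, 1.5, 3, 5, 10)
}
print(sizes)
```

Applying the same ratio before splitting into folds keeps the class proportions of the training and test sets aligned, as described above.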
Table 4.6 demonstrates the effect of different subsampling ratios on the performance of the random
forest classifier. Random forest is picked as it performs better than other classifiers with varying
subsampling rates. Cross-validation is also applied.
With gradually increasing class imbalance, there is a greater chance of guessing correctly by simply
classifying a function as non-vulnerable when it is unclear how to distinguish between the two classes.
This increases the false positive rate in the non-vulnerable class, and explains the higher recall than
precision in that class. For similar reasons, the random forest classifier trades off the
recall of the vulnerable class for precision: as ambiguous samples are more likely to be classified as
non-vulnerable, the samples classified as vulnerable are more likely to be truly vulnerable (increased precision),
while not all vulnerabilities may be covered (decreased recall).

Sampling  Class     Precision  Recall  ROC AUC  PRC AUC  Accuracy
1x        vuln      0.752      0.744   0.837    0.846    0.750
1x        not vuln  0.747      0.755   0.837    0.820    0.750
1.5x      vuln      0.740      0.630   0.822    0.783    0.764
1.5x      not vuln  0.776      0.853   0.822    0.853    0.764
3x        vuln      0.735      0.480   0.818    0.681    0.827
3x        not vuln  0.845      0.942   0.818    0.916    0.827
5x        vuln      0.759      0.332   0.829    0.577    0.896
5x        not vuln  0.905      0.984   0.829    0.960    0.896
10x       vuln      0.769      0.259   0.821    0.497    0.926
10x       not vuln  0.930      0.992   0.821    0.972    0.926

Table 4.6: Performance of Random Forest Over Different Subsampling Rate

On the other hand, as classifiers gain more non-vulnerable samples, they learn more patterns in
that class. As a result, it is easier to classify non-vulnerable samples than vulnerable ones: accuracy
is thus no longer a good indicator for both classes. This is also reflected by the PRC AUC score:
as the subsampling rate increases, PRC AUC decreases for the vulnerable class and increases for the
non-vulnerable class. The precision and recall of the non-vulnerable class are thus higher.
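One way to see why accuracy becomes less informative is to compare it with the accuracy of a trivial classifier that always answers "not vuln". The sketch below does this for the accuracies in Table 4.6, assuming the test set has the same class ratio as the training set (as stated above); the helper name is an illustrative choice.

```python
def majority_baseline_accuracy(ratio: float) -> float:
    """Accuracy of always answering 'not vuln' when there are `ratio`
    non-vulnerable samples for every vulnerable one."""
    return ratio / (1.0 + ratio)


# Random forest accuracy per subsampling rate, copied from Table 4.6.
reported_accuracy = {1: 0.750, 1.5: 0.764, 3: 0.827, 5: 0.896, 10: 0.926}

margins = {
    ratio: accuracy - majority_baseline_accuracy(ratio)
    for ratio, accuracy in reported_accuracy.items()
}
for ratio, margin in margins.items():
    baseline = majority_baseline_accuracy(ratio)
    print(f"{ratio}x: baseline {baseline:.3f}, margin {margin:+.3f}")
```

The margin over the trivial baseline shrinks from 0.25 at 1x to under 0.02 at 10x, even though the raw accuracy rises, which is why per-class measures are needed.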
While the ROC AUC value is stable across all subsampling rates and shows how well a classifier
distinguishes between the two classes, it represents results for both the vulnerable and non-vulnerable
classes together. ROC AUC does not show how the classifier performs on each class.
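The class-symmetric nature of ROC AUC can be illustrated with its pairwise definition: the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The sketch below uses made-up scores and is a direct quadratic-time implementation for illustration, not the routine used in our experiments.

```python
from typing import Sequence


def roc_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Probability that a random positive outscores a random negative (ties count half)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one sample of each class")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up vulnerability scores; label 1 = vulnerable, 0 = non-vulnerable.
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
labels = [1, 1, 1, 0, 0, 0]

auc_vuln = roc_auc(scores, labels)
# Treat non-vulnerable as the positive class, scored by 1 - score.
auc_not_vuln = roc_auc([1.0 - s for s in scores], [1 - y for y in labels])
print(round(auc_vuln, 3), round(auc_not_vuln, 3))
```

Both calls give the same value, so a single ROC AUC number cannot reveal that one class is handled better than the other.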

4.2.3 Cross-Project Evaluation


To investigate whether our model is able to learn general vulnerable patterns across projects, we further
evaluate it across different projects. This evaluation is important as it simulates how useful our model
is in practical scenarios. When the classifier is used to predict vulnerabilities on unseen projects, the
reviewer is unable to train the model on known vulnerable and non-vulnerable functions within the
same project (in fact, this is what the classifier is supposed to predict). By measuring the cross-project
performance of the model and examining how much effort it saves for a reviewer, we can get an idea of
how useful our model is.
The classifier is tested on all functions in one project, while trained on data from the other projects.
In the test set, all functions in the non-vulnerable set are considered: this includes every function in
the project, whether or not it was included in the subsampling stage done previously. For example, when
we test on FFmpeg, the training set consists of all vulnerable samples and subsampled non-vulnerable
samples from all the other projects we use: Imagemagick, Linux, Qemu and Wireshark. However, the
test set contains every function we can compile for FFmpeg: all instances from the vulnerable and
non-vulnerable classes are included. We also test different subsampling rates for the non-vulnerable dataset
used for training. During training, the size of the non-vulnerable dataset is set to 1x, 1.5x, 3x, 5x, and 10x
the size of the vulnerable dataset respectively.
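The train/test construction described above can be sketched as follows. The data layout (a mapping from project name to labeled functions), the helper name, and the toy sample counts are assumptions made for illustration only.

```python
import random
from typing import Dict, List, Tuple

Function = Tuple[str, int]  # (function name, label): 1 = vulnerable, 0 = non-vulnerable


def cross_project_split(
    projects: Dict[str, List[Function]],
    held_out: str,
    ratio: float,
    seed: int = 0,
) -> Tuple[List[Function], List[Function]]:
    """Train on the other projects (non-vulnerable class subsampled);
    test on every function of the held-out project."""
    pool = [f for name, funcs in projects.items() if name != held_out for f in funcs]
    vulnerable = [f for f in pool if f[1] == 1]
    non_vulnerable = [f for f in pool if f[1] == 0]
    keep = min(len(non_vulnerable), round(ratio * len(vulnerable)))
    sampled = random.Random(seed).sample(non_vulnerable, keep)
    return vulnerable + sampled, list(projects[held_out])


# Toy dataset: 2 vulnerable and 20 non-vulnerable functions per project.
projects = {
    name: [(f"{name}_v{i}", 1) for i in range(2)]
    + [(f"{name}_n{i}", 0) for i in range(20)]
    for name in ("FFmpeg", "Imagemagick", "Linux", "Qemu", "Wireshark")
}
train, test = cross_project_split(projects, held_out="FFmpeg", ratio=1.5)
print(len(train), len(test))
```

Note that only the training side is subsampled; the held-out project is evaluated in full, which is what makes the test set realistic and heavily imbalanced.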
One goal of measuring cross-project performance is to assess how well the picked statistical features
work when classifiers trained on known projects predict vulnerabilities in unseen projects.
Thus, functions are ranked in terms of their probabilities of being vulnerable. We examine the
precision and recall of the vulnerable class covered by the top-K percent of functions that are most likely
to be vulnerable. The top-K percent recall/precision is used to show how much effort one saves when
finding vulnerabilities using the results prioritized by the classifiers: higher recall means one needs to
examine fewer functions to cover most vulnerabilities in the project, while higher precision means
one needs to inspect fewer functions to find one vulnerability. In Figure 4.1, we present top-K precisions
for the vulnerable class across different projects. Similarly, top-K recalls for the vulnerable class are
shown in Figure 4.2. In those figures, the y-axis represents the precision/recall value between 0 and 1 (for
example, in Figure 4.1a, 0.125 on the y-axis means 12.5%).
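The top-K percent measures can be computed as in the following sketch. The function name, the rounding of the cutoff (rounded up, with at least one function inspected), and the toy scores are assumptions; the plotted curves may have been produced with a different cutoff convention.

```python
import math
from typing import Sequence, Tuple


def top_k_percent_metrics(
    scores: Sequence[float], labels: Sequence[int], k_percent: float
) -> Tuple[float, float]:
    """Precision and recall of the vulnerable class among the top k_percent
    of functions, ranked by predicted probability of being vulnerable."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    cutoff = max(1, math.ceil(len(ranked) * k_percent / 100.0))
    found = sum(label for _, label in ranked[:cutoff])
    total = sum(labels)
    precision = found / cutoff
    recall = found / total if total else 0.0
    return precision, recall


# Toy ranking of 10 functions, 3 of which are truly vulnerable (label 1).
scores = [0.95, 0.90, 0.80, 0.60, 0.55, 0.40, 0.30, 0.20, 0.10, 0.05]
labels = [1, 0, 1, 0, 0, 1, 0, 0, 0, 0]

p30, r30 = top_k_percent_metrics(scores, labels, 30)
p60, r60 = top_k_percent_metrics(scores, labels, 60)
print(p30, r30, p60, r60)
```

In this toy ranking, inspecting the top 30% finds two of the three vulnerable functions, and extending to 60% finds all three at the cost of lower precision.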

Figure 4.1: Top-K Percent Precision for Cross-Project Evaluation (Random Forest). Panels: (a) Test on
FFmpeg, (b) Test on Imagemagick, (c) Test on Linux, (d) Test on Qemu, (e) Test on Wireshark.

Machine learning models achieve both higher precision and higher recall than random
selection on all projects. Training with different subsampling rates for the non-vulnerable set
mostly does not lead to a large difference in top-K precision and recall, so a larger non-vulnerable
set could be employed to increase the overall accuracy. For most projects, the top-K percent precision
curve smoothly decreases as more functions are inspected.
Imagemagick: Imagemagick has the best performance overall. Along with a comparable recall
curve, Imagemagick also has a steeper precision curve, with higher precision concentrated in a small
number of the top-ranked predictions. This means most of the top-ranked samples are indeed
vulnerable when evaluating on our Imagemagick dataset. Considering that Imagemagick also has
the best intra-project performance, it seems the vulnerability patterns in Imagemagick are more general
and easier to capture with our statistical features.
FFmpeg and Wireshark: FFmpeg and Wireshark both have decent performance. As Wireshark has a
small percentage of vulnerable samples compared with the other projects, evaluation on Wireshark has lower
precision for the top-K percent of samples. However, compared with random selection, machine learning
still improves the prediction performance on Wireshark by a similar scale.

Figure 4.2: Top-K Percent Recall for Cross-Project Evaluation (Random Forest). Panels: (a) Test on
FFmpeg, (b) Test on Imagemagick, (c) Test on Linux, (d) Test on Qemu, (e) Test on Wireshark.

Qemu: Qemu is an exception. It does not have as good precision and recall
as other projects. Qemu’s first several top-ranked functions do not reach as high precision
as those of other projects do. The smoothly decreasing pattern of the precision curve in other projects is not
as obvious in Qemu. For Qemu, with a sampling rate of 1x and 1.5x, this pattern shows up with 20%
or more top-ranked functions, while with sampling rates higher than 3x, it takes 40-50%
of the ranked functions to get this pattern. This means that while there are some similarities between the
vulnerable patterns in Qemu and in other projects, the most obvious patterns for other projects do not
generalize to Qemu.
With lower performance, the subsampling rate starts to take effect for Qemu. With different
subsampling rates, evaluation on Qemu exhibits different precision and recall. Its precision
curves over the top-ranked samples are not consistent. With a sampling rate over 1.5x and a small
percentage of the top-ranked functions, Qemu has roughly better precision, while its recall is just
above random selection. Sampling with 1x does not work for the first several top-ranked functions, but
its precision and recall start to rise sharply as more functions are included.
Although the vulnerable patterns in Qemu do not generalize, non-vulnerable samples may share some
similar characteristics with functions in other projects. As previously discussed for imbalanced datasets,
with more non-vulnerable samples included, the classifiers are more likely to predict an ambiguous
function as non-vulnerable. This increases the precision and decreases the recall of the vulnerable
class. However, more non-vulnerable samples would decrease the precision in the vulnerable class as
more top-ranked functions are included. One reason might be that the classifier learns patterns in the
non-vulnerable class better than those in the vulnerable class. Thus for Qemu, overall, using a non-vulnerable
dataset 1.5x as large as the vulnerable set generally gives better precision and recall curves than training with
non-vulnerable datasets of other sizes.
Linux: Linux also does not achieve as good a recall curve as other projects. While projects
such as FFmpeg and Wireshark can cover 80% of all vulnerable functions when inspecting 30-40% of
top-ranked samples, Linux can only achieve around 60% recall with 40% of the top-ranked
samples. One reason is that Linux contributes more than half of the vulnerable functions
in our dataset: when testing on Linux, we need to train on other projects and cannot use samples from
Linux. As a result, there is not as much training data as when testing on other
projects. More samples may help the performance of our classifiers.
Another reason is that the types of vulnerabilities are more diverse in Linux. As shown in Table 4.5,
although Linux has more samples within the project, it does not achieve as good performance as other
projects when training using samples from Linux itself. However, compared with cross-project evaluation
on Qemu, Linux still exhibits better precision and recall curves overall. This means that while the types
of vulnerabilities within Linux could be more diverse, they still share similar patterns with vulnerabilities
in other projects. On the other hand, while the patterns in Qemu are relatively easier to capture (Qemu
has better intra-project evaluation), the vulnerability patterns for Qemu do not generalize to the other
projects we use.
Although project types have an effect on cross-project training, the functionalities of the projects do not
dominate the performance of the classifiers. While some vulnerabilities are project-specific, patterns of
vulnerabilities are not limited to the functionalities of the projects.
Practical Usage: While the machine learning classifiers perform better than random selection and
reduce the amount of work needed to check for vulnerabilities in a project, work is still
needed to inspect the functions ranked as most likely to be vulnerable.
1) Although the machine learning classifier eliminates some non-vulnerable samples, the precision
of vulnerability prediction also decreases if the functions inside the project are less likely to
be vulnerable. As the number of vulnerabilities inside a project is usually not high, this could require
inspecting a large number of functions before finding a vulnerable one. On the other hand, to cover
most vulnerabilities in a project, it is still necessary to analyze a sizable percentage of all the functions in
the project. While the classifier saves users’ work, we still expect manual analysis will be required for a
good proportion of the project to eliminate most vulnerabilities.
2) Although the statistic-based features have higher precision and recall for some projects, this does
not mean it is easier to discover real vulnerabilities with machine learning classifiers. As the top-ranked
functions may be both larger and more complex (Section 4.4), it is much harder
both to determine where a vulnerability is within the function and to guarantee that a function is mostly
free of vulnerabilities. This holds both for human reviewers and for program analysis techniques. While
large, complex functions exhibit patterns that are easier for the classifiers to find, this also means more
work to analyze the predicted functions. This issue is further inspected in Section 4.4.
3) A user may also take into account the precision-recall tradeoff when using the machine learning
classifier. To cover more vulnerable functions, one may need to inspect more possibly vulnerable functions
predicted. But as the user inspects more functions, it is less likely for a newly inspected function to be
vulnerable: the precision of the classifier decreases when more lower-ranked functions are checked.

4.3 Importance of Features


Among the features we use, some features may be more important than others. To find out which
features are more crucial for classification, we inspect their effect on logistic regression and random
forest. For both models, the measurements are made by training the models on all data from the open-
source projects we previously mentioned in Section 4.2.1. Different subsampling rates are used: the size
of the non-vulnerable class is set to be 1x or 1.5x as large as the size of the vulnerable class.

Table 4.7 (1x)                   Table 4.8 (1.5x)
Feature             Coefficient  Feature             Coefficient
literalConvertProp  -24.61       cmpPtrProp          -17.93
extProp             22.66        literalConvertProp  -13.54
ptrToPtrProp        19.61        nullCmpProp         11.44
cmpPtrProp          -17.42       extProp             11.07
truncProp           15.54        fromPtrCastProp     9.62
castProp            -14.89       mulDivProp          -9.29
fromPtrCastProp     13.55        ptrToPtrProp        8.06
nullCmpProp         12.73        truncProp           5.69
mulDivProp          -8.90        castProp            -5.66
toPtrCastProp       7.74         andOrProp           5.55
modProp             6.97         modProp             5.41
ptrsStoredProp      -6.94        shiftProp           -4.54
andOrProp           6.11         xorProp             3.61
storePtrProp        5.95         storePtrProp        -3.27
convertProp         -5.03        convertProp         -2.95
shiftProp           -3.59        ptrCastProp         2.80
ptrIndProp          3.52         ptrsStoredProp      2.63
cmpProp             -3.12        cmpProp             -2.42
ptrsLoadedProp      3.04         ptrsLoadedProp      2.31
calleesPropInFile   -2.31        doubleGepiProp      -2.07
ptrCastProp         2.27         ptrIndProp          1.94
branchProp          1.82         branchProp          1.76
gepiProp            -1.77        calleesPropInFile   -1.64
ldPtrProp           1.40         ldPtrProp           1.42
xorProp             1.35         addMinusProp        -1.27
calleeProp          -0.89        calleeProp          -1.12
addMinusProp        -0.62        toPtrCastProp       1.10
arithProp           0.42         gepiProp            0.41
loopMaxDepth        0.33         loopMaxDepth        0.34
topLoops            -0.21        topLoops            -0.22
doubleGepiProp      0.19         arithProp           0.16
libcCallProp        -0.16        libcCallProp        0.091
numLines            0.028        numLines            0.024
numBBs              -0.013       numBBs              -0.015
numInsts            0.0003       numInsts            0.0011

Table 4.7: Ordered Feature Coefficient for Logistic Regression; Subsampling: 1x
Table 4.8: Ordered Feature Coefficient for Logistic Regression; Subsampling: 1.5x

Table 4.9 (1x)                              Table 4.10 (1.5x)
Feature             Mean Decrease Impurity  Feature             Mean Decrease Impurity
ldPtrProp           0.370                   numBBs              0.376
numBBs              0.365                   ldPtrProp           0.360
numLines            0.355                   ptrsLoadedProp      0.346
ptrsLoadedProp      0.353                   storePtrProp        0.333
storePtrProp        0.339                   branchProp          0.326
branchProp          0.317                   ptrsStoredProp      0.325
ptrsStoredProp      0.310                   numLines            0.324
cmpProp             0.306                   topLoops            0.318
calleeProp          0.292                   cmpProp             0.302
calleesPropInFile   0.289                   loopMaxDepth        0.291
castProp            0.286                   calleeProp          0.290
topLoops            0.283                   castProp            0.282
loopMaxDepth        0.282                   calleesPropInFile   0.281
cmpPtrProp          0.281                   toPtrCastProp       0.278
numInsts            0.281                   nullCmpProp         0.278
ptrCastProp         0.277                   numInsts            0.277
convertProp         0.266                   cmpPtrProp          0.276
nullCmpProp         0.265                   extProp             0.270
modProp             0.264                   ptrCastProp         0.269
toPtrCastProp       0.263                   fromPtrCastProp     0.264
fromPtrCastProp     0.257                   convertProp         0.261
extProp             0.257                   arithProp           0.260
ptrToPtrProp        0.253                   addMinusProp        0.257
libcCallProp        0.252                   ptrToPtrProp        0.248
doubleGepiProp      0.251                   truncProp           0.245
ptrIndProp          0.250                   andOrProp           0.241
truncProp           0.247                   doubleGepiProp      0.241
addMinusProp        0.242                   literalConvertProp  0.238
arithProp           0.239                   gepiProp            0.231
gepiProp            0.238                   modProp             0.230
literalConvertProp  0.235                   ptrIndProp          0.229
andOrProp           0.219                   shiftProp           0.218
xorProp             0.201                   mulDivProp          0.199
shiftProp           0.199                   libcCallProp        0.197
mulDivProp          0.183                   xorProp             0.187

Table 4.9: Ordered Mean Decrease Impurity for Random Forest; Subsampling: 1x
Table 4.10: Ordered Mean Decrease Impurity for Random Forest; Subsampling: 1.5x

For logistic regression, the coefficients learned by the model are used to measure feature importance. The
coefficients of logistic regression can be considered as weights for features. A coefficient with a large
absolute value means its corresponding feature is important: a small change in the feature results in a
large change in the weighted sum, and thus in the prediction result. A positive coefficient means its
corresponding feature is positively correlated with the vulnerable class: the larger its absolute value is,
the more likely a function is to be vulnerable. Conversely, if a coefficient is negative, a larger absolute
value means the function is less likely to be vulnerable.
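The role of a coefficient as a weight can be illustrated with a small sketch of the logistic regression prediction. The two coefficients are copied from Table 4.7, while the zero intercept and the feature values are made up for illustration, since the trained intercept is not reported here.

```python
import math
from typing import Dict


def vulnerable_probability(
    features: Dict[str, float],
    coefficients: Dict[str, float],
    intercept: float = 0.0,  # assumed; the trained intercept is not reported
) -> float:
    """Sigmoid of the weighted feature sum, as in logistic regression."""
    weighted_sum = intercept + sum(
        coefficients[name] * value for name, value in features.items()
    )
    return 1.0 / (1.0 + math.exp(-weighted_sum))


# Two coefficients from Table 4.7; the feature values below are hypothetical.
coefficients = {"extProp": 22.66, "numInsts": 0.0003}
base = {"extProp": 0.05, "numInsts": 100.0}
more_ext = {"extProp": 0.06, "numInsts": 100.0}      # extProp raised by 0.01
more_insts = {"extProp": 0.05, "numInsts": 100.01}   # numInsts raised by 0.01

p_base = vulnerable_probability(base, coefficients)
p_ext = vulnerable_probability(more_ext, coefficients)
p_insts = vulnerable_probability(more_insts, coefficients)
print(p_ext - p_base, p_insts - p_base)
```

The same absolute change moves the prediction far more through the large coefficient. Note that comparing coefficient magnitudes in this way is only meaningful when the features are on comparable scales, which is one reason the features are expressed as proportions where possible.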
For random forest, we use mean decrease impurity. The mean decrease impurity for a feature f can
be considered as the gain in performance/purity when feature f is used. A node in a random
tree contains a decision based on one or several features, which splits the input dataset. The impurity
for a node in a random tree is the probability for that node to misclassify a sample:
$\sum_{i \in \text{classes}} \left( P_i \cdot \sum_{k \neq i} P_k \right) = \sum_{i \in \text{classes}} P_i (1 - P_i)$,
where $P_i$ is the proportion of items labeled with class $i$ for that node. In this case, the
decrease in impurity between a parent node and its child nodes is considered. The decrease in impurity for
a feature is the sum of the decrease in impurity over the nodes whose decisions use that feature,
weighted by the probability of reaching each node (which can be estimated by the proportion of training
samples reaching that node). The mean decrease impurity for a feature is the decrease in impurity
averaged over all random trees in the forest.
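The impurity computation for a single split can be sketched as follows, assuming the Gini form of impurity and a binary split. The helper names and toy labels are illustrative; the machine learning toolkit used in our experiments may differ in details such as normalization.

```python
from collections import Counter
from typing import Sequence


def gini_impurity(labels: Sequence[int]) -> float:
    """Probability of mislabeling a random item of the node when the label
    is drawn from the node's own class distribution."""
    total = len(labels)
    if total == 0:
        return 0.0
    return sum((n / total) * (1.0 - n / total) for n in Counter(labels).values())


def impurity_decrease(
    parent: Sequence[int],
    left: Sequence[int],
    right: Sequence[int],
    n_training: int,
) -> float:
    """Impurity decrease of one split, weighted by the fraction of training
    samples that reach the parent node."""
    n = len(parent)
    children = (len(left) / n) * gini_impurity(left) + (len(right) / n) * gini_impurity(right)
    return (n / n_training) * (gini_impurity(parent) - children)


# Toy node: 4 vulnerable (1) and 4 non-vulnerable (0) samples, split in two.
parent = [1, 1, 1, 1, 0, 0, 0, 0]
left, right = [1, 1, 1, 0], [1, 0, 0, 0]
decrease = impurity_decrease(parent, left, right, n_training=len(parent))
print(gini_impurity(parent), decrease)
```

Summing such decreases over all nodes that split on a given feature, and averaging over the trees, gives the per-feature score reported in Tables 4.9 and 4.10.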
Both the coefficients for logistic regression and the mean decrease impurity for random forest are
ranked by their absolute values, as shown in Tables 4.7, 4.8, 4.9 and 4.10.
Logistic Regression: For logistic regression, features such as the maximum depth of loops and the number
of instructions, which are related to function size and complexity, do not affect the prediction
result as much. Loads with pointers make a vulnerability more likely. Contrary to our
assumption, logistic regression treats the proportions of callees as weak features negatively correlated
with vulnerabilities. In addition, the number of loops and the depth of loops are weakly correlated with
vulnerabilities: a vulnerable function tends to have deeper loops, but fewer top-level loops.
Some related features have conflicting implications. Conversion, cast, and comparison with pointers
are all important features, but some subtypes of conversions/casts make a function more likely to be
vulnerable, while others make it less likely to be vulnerable. More extension and truncation indicate
a vulnerability is more likely to occur, but more conversion between literal types means a vulnerability
is less likely to occur. Similarly, comparison with pointers is negatively correlated with vulnerabilities,
but comparisons with null values, which can be considered a type of comparison with pointers,
are positively correlated. More cast operations mean a vulnerability is less likely to occur, but casts
from/to/between pointers (which are part of cast operations) all mean the corresponding function is
more likely to be vulnerable. It is not very clear what these patterns mean, as the coefficients of related
features partially cancel each other out. While our features are coarse-grained, logistic
regression still attempts to capture relationships between the statistical features.
As the model is learning conflicting correlations between related features, some groups of features may have
completely different correlations with vulnerabilities under different subsampling rates (a flip in the
sign of their coefficients). With a subsampling rate of 1x, the proportion of pointers stored (ptrsStoredProp)
is negatively correlated with vulnerabilities, while the proportion of stored pointers (storePtrProp) is
positively correlated. However, with a subsampling rate of 1.5x, the reverse happens: ptrsStoredProp is
positively correlated, while storePtrProp is negatively correlated with vulnerabilities. The same
also holds for the proportion of gepi instructions and double gepi operations.
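Such sign flips can be located mechanically. The sketch below compares a handful of coefficients copied from Tables 4.7 and 4.8; the selection of features and the helper name are for illustration only.

```python
from typing import Dict, List

# Selected coefficients copied from Table 4.7 (1x) and Table 4.8 (1.5x).
coef_1x = {
    "ptrsStoredProp": -6.94, "storePtrProp": 5.95,
    "gepiProp": -1.77, "doubleGepiProp": 0.19,
    "cmpPtrProp": -17.42, "nullCmpProp": 12.73,
}
coef_1_5x = {
    "ptrsStoredProp": 2.63, "storePtrProp": -3.27,
    "gepiProp": 0.41, "doubleGepiProp": -2.07,
    "cmpPtrProp": -17.93, "nullCmpProp": 11.44,
}


def sign_flips(first: Dict[str, float], second: Dict[str, float]) -> List[str]:
    """Features present in both mappings whose coefficients have opposite signs."""
    shared = first.keys() & second.keys()
    return sorted(name for name in shared if first[name] * second[name] < 0)


flipped = sign_flips(coef_1x, coef_1_5x)
print(flipped)
```

The pairs that flip are exactly the closely related features discussed above, whereas strongly weighted features such as cmpPtrProp and nullCmpProp keep their signs.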
Random Forest: For random forest, different subsampling rates do not affect the decrease in
impurity as much. Unlike logistic regression, random forest considers function size an important
type of feature. The number of basic blocks and the number of lines in a function are ranked highly. The
number of instructions does not appear to be as important, possibly because this information can be
approximated using the number of lines and basic blocks in the function. Likewise, complexity-related
features such as the proportion of branches and loop-related features (top-level loops and depth of loops)
have more effect on random forest than on logistic regression.
The random forest classifier also identifies both the proportion of loads/stores and the proportion of
pointers loaded/stored as important features. The proportion of comparisons is also ranked highly. Also,
compared to logistic regression, the proportion of callees has a more significant effect on the random
forest: the proportion of callees in total and in the same file are both ranked highly. While more callees
in the same file mean more callees in total, it seems these features are not interchangeable: this means
random forest learns some patterns that require both features to identify (e.g., more callees in the same
file, but fewer callees in total).
Mean decrease impurity for random forest does not identify pointer casts as significant. However,
this does not necessarily mean these features are not important. This is possibly because there are many
types of pointer cast features: when one such feature is omitted, some of its information can be inferred
from other related pointer cast features.
Unlike logistic regression, the random forest classifier does not rank the proportion of
conversions with literal types as highly. This is possibly because such features have large but conflicting
coefficients in logistic regression, but they are not as significant when considered separately. In addition,
for the random forest, arithmetic-related features are the least important. This is somewhat consistent with
the result from logistic regression: for logistic regression, the proportion of arithmetic operations has a
coefficient with a small absolute value, while the coefficients of related features such as add/minus and
multiply/divide cancel each other out.
Overall: For both logistic regression and random forest, the proportion of libc calls is not important.
This is possibly because some open-source projects use project-specific API functions instead of calling
libc functions. For both models, the proportion of comparisons (with/without pointers) is also relatively
more important compared to other groups of features. Some other features appear to be more important
in one model, but not as important in the other.
While some related features for random forest are ranked similarly, its top-ranked features are more
diverse than those of logistic regression. This is possibly because logistic regression relies on conflicting
coefficients to learn vulnerability patterns.

4.4 Complexity of Predicted Functions


We further analyze some properties of the predicted vulnerable functions to see if there are common
features in the most vulnerable samples that are relevant to their complexity and thus require more effort
for code analysis. Results of the cross-project evaluation are used to estimate what characteristics are shared by
predicted functions in unseen projects. We find that function size is still a dominant factor for vulnerability
prediction, although features are chosen and scaled in such a way that the effect of function size is
reduced.
Figure 4.3 shows the number of instructions in each function, ranked in terms of their likelihood to
be vulnerable. Table 4.11 compares the average number of instructions for the top-50 ranked vulnerable
functions and all functions in the evaluated projects. For both Figure 4.3 and Table 4.11, the
non-vulnerable functions in the training set are randomly sampled to be 1.5 times the number
of vulnerable ones. In Figure 4.3, random forest is used for classification, and in Table 4.11, logistic
regression is evaluated. While the observed patterns generalize to all sampling rates and classifiers,
for simplicity, we only show results for these classifiers and sampling rates. The range of the y-axes in
Figure 4.3 is limited, in order to better show general patterns for all samples: for some functions with a
large number of instructions, their bars extend above the top of the graphs.

Figure 4.3: Number of Instructions in Top-K Vulnerable Predictions for Cross-Project Evaluation
(Random Forest). Panels: (a) Test on FFmpeg, (b) Test on Imagemagick, (c) Test on Linux,
(d) Test on Qemu, (e) Test on Wireshark.

Average Number of Instructions
Function Group     FFmpeg  Imagemagick  Linux  Qemu    Wireshark
Top-50 Vulnerable  5153.0  1739.6       832.0  2125.6  4095.5
All                167.1   99.4         61.3   49.3    45.0

Table 4.11: Average Number of Instructions for Cross Project Evaluation (Logistic Regression)

Across all cross-project evaluations in Figure 4.3, the ranked vulnerable functions have a generally
decreasing number of instructions, as if they were roughly sorted by function size. While some small
functions are highly ranked, most large functions are ranked as more likely to be vulnerable.
Additionally, in Table 4.11, the top-50 most vulnerable functions of every project are far larger than
the project's average function: roughly 14 to 43 times as large for FFmpeg, Imagemagick, Linux and
Qemu. For Wireshark, the top-50 ranked functions have more than 90 times the number of instructions of
an average function. Although function size is the least important group of features for logistic
regression (Section 4.3), it seems logistic regression is also more likely to predict larger functions
as vulnerable.
This means users still need significant effort in order to analyze the predicted vulnerable functions.
Although the total number of inspected functions is reduced by using a machine learning classifier, the
time spent inspecting each function increases. As Table 4.11 shows, top-ranked functions have more
instructions. With more instructions, the program dependences between instructions could
also be more complex. Checking the top-50 ranked vulnerable functions could thus require considerably
more time and effort than checking 50 randomly picked functions. This is a limitation of our
statistic-based approach.

4.5 Summary
In conclusion, with function-level statistical features, some coarse-grained vulnerable patterns can be
captured. While the relationships between statistical features learned by the classifiers are not complex,
the ability to find correlations between features helps classification. With the small amount of vulnerable
data we obtain, some similarities and differences of vulnerable patterns across projects can be roughly
captured by statistical features. Some inaccuracies and inability to learn general patterns may result
from both the quality of the dataset and the coarse nature of the statistical features.
When evaluating the importance of features, we find that the same features differ in importance
between logistic regression and random forest. The two classifiers attempt to learn correlations between
related features. However, more complex functions are more likely to be predicted as vulnerable, for all
models trained with coarse-grained statistical features. As classifiers are more likely to predict large and
complex functions as vulnerable, more work is required to inspect the predicted samples. This presents
a limitation for the statistic-based method and demonstrates a need to locate the vulnerabilities at a
finer scale.
Chapter 5

Raw Feature Representation

To overcome the shortcomings of statistics-based function-level features, we try raw feature
representations generated directly from the code. This eliminates the need for manual feature
engineering and allows classifiers to learn
patterns that are not recognizable by humans. More fine-grained relationships such as control and data
flow can also be represented. However, with raw features, it is not obvious which patterns are more
relevant to classification results, and what the model actually learned. This strategy also requires more
data to prevent overfitting, as the complexity of the model increases. It is also computationally more
expensive.
This project extracts features from the LLVM IR [7]. The LLVM IR, as an intermediate representation
used by the LLVM compiler to represent source code, keeps semantically relevant information. We work
on the IR level instead of the AST, as it removes syntactic features that may not be transferable across
projects. Also, as IR is architecture-independent, working on the IR level can make predictions on all
hardware platforms.

5.1 Program Slicing


Our early experiments showed that, for both program graphs and instruction sequences, classifiers
perform poorly when taking all code from either an entire function or an execution path as input without
discrimination. This is because the difference between vulnerable and non-vulnerable samples is usually
small (the changes are usually a few instructions) compared to the size of the entire function (usually
hundreds, if not thousands, of instructions). The changed instructions alone do not preserve enough
information, yet most of the code in the function may not be related to the vulnerability. As a result,
there must be a method to reduce irrelevant noise in input samples while keeping information crucial
for identifying vulnerabilities.

5.1.1 Slicing Technique


To effectively narrow the input down to relevant code, this project employs program slicing [13], a technique that obtains
a set of instructions either affecting or affected by target points in the program. Our program slicing
technique first builds a program dependence graph with both control and data dependences: every
instruction is represented as a node in the program dependence graph, while both forward and backward
control / data dependences are denoted as edges. The program slicing algorithm then traverses the
program dependence graph using BFS along both control and data dependence edges, starting at the
slicing point. All traversed instructions are added to the program slice. There are two directions for
program slicing: forward slicing and backward slicing. Forward slicing traverses only along forward
dependence edges, while backward slicing traverses only along backward dependence edges.
In this project, forward slicing and backward slicing are performed separately in order to obtain
the context before the target instructions and the code that is affected by the target instructions. The
instructions traversed during both forward and backward slicing are put together to generate a program
slice. The number of instructions included by forward and backward slicing is limited, so the
scope of the included code does not grow arbitrarily large if the function containing the slicing point
is large. For implementation, we use the program slicing tool dg [14], which works on LLVM.
Consider the example code in Listing 5.1. Every line of C code
corresponds to LLVM IR instructions (we use C code here instead of LLVM IR as C code is easier to
read). If we slice on line 5 (b = 4), lines 2, 4, 5, 10, 11, 13 and 14 are included in the slice.
For backward slicing, line 4 is first included, as the if condition (a > 0) guards b = 4, and thus line 5
has a control dependence on line 4. The slicing algorithm then inspects dependences for the instruction
on line 4: line 2 (a = 3) is added, because (a > 0) has a data dependence on a = 3.
For forward slicing, line 10 is added, as (b > 2) has a data dependence on b = 4. Then the slicing
algorithm finds lines 11 and 13, which have a control dependence on line 10. Lastly, line 14 (c = 4 + d)
is added, as it has a data dependence on line 11.

Listing 5.1: Simple Code Example for Program Slicing


 1  int a, b, c, d;
 2  a = 3;
 3  d = 1;
 4  if (a > 0)
 5      b = 4;
 6  else
 7      b = 1;
 8  a = 6;
 9
10  if (b > 2)
11      d = 3;
12  else
13      b = 2;
14  c = 4 + d;
15  a = a - 10;
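
The slicing walk-through above can be reproduced with a small sketch. This is only an illustration, not the dg implementation: the dependence edges are written out by hand for Listing 5.1 (dg derives them from the LLVM IR), and the names `DEPENDENCES`, `bfs`, `program_slice` and the `limit` parameter are hypothetical.

```python
from collections import deque

# Hand-derived dependence edges for Listing 5.1, as (source line, dependent line).
# These are an assumption made for illustration; a real slicer computes them.
DEPENDENCES = [
    (2, 4),    # data: (a > 0) reads a = 3
    (4, 5),    # control: b = 4 is guarded by the if on line 4
    (4, 7),    # control: b = 1 is guarded by the if on line 4
    (5, 10),   # data: (b > 2) reads b = 4
    (7, 10),   # data: (b > 2) reads b = 1
    (10, 11),  # control: d = 3 is guarded by the if on line 10
    (10, 13),  # control: b = 2 is guarded by the if on line 10
    (3, 14),   # data: c = 4 + d reads d = 1
    (11, 14),  # data: c = 4 + d reads d = 3
    (8, 15),   # data: a = a - 10 reads a = 6
]

def bfs(start, adjacency, limit):
    """Breadth-first traversal that stops after collecting `limit` lines."""
    seen, queue, order = {start}, deque([start]), []
    while queue and len(order) < limit:
        node = queue.popleft()
        for nxt in adjacency.get(node, ()):
            if nxt not in seen and len(order) < limit:
                seen.add(nxt)
                order.append(nxt)
                queue.append(nxt)
    return order

def program_slice(edges, point, limit=100):
    """Union of a bounded backward slice and a bounded forward slice."""
    forward, backward = {}, {}
    for src, dst in edges:
        forward.setdefault(src, []).append(dst)
        backward.setdefault(dst, []).append(src)
    lines = {point}
    lines.update(bfs(point, backward, limit))  # context before the slicing point
    lines.update(bfs(point, forward, limit))   # code affected by the slicing point
    return sorted(lines)

SLICE = program_slice(DEPENDENCES, 5)  # [2, 4, 5, 10, 11, 13, 14]
```

Slicing on line 5 yields the same seven lines as the walk-through. Lowering `limit` shows how bounding each direction keeps the slice small inside large functions.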

5.1.2 Slicing Point


In order to initiate program slicing, a starting point needs to be chosen to generate related code. For
samples obtained from the CVE/NVD dataset, we perform program slicing on lines changed in a commit.
For vulnerable samples, we slice on the lines of code that are deleted by a commit, with the assumption
that deleted code contains information that makes the program vulnerable/buggy.
On the other hand, the non-vulnerable samples are sliced on representative instructions inside
functions in which no vulnerabilities have been previously reported. To distinguish potentially
vulnerable code from ordinary code, non-vulnerable samples can be obtained by slicing either on random
lines of code or on instructions whose distribution is similar to that of the slicing points of
vulnerable samples. Some details are discussed further in the evaluation sections.
However, the SARD Juliet dataset [11] used later in Section 5.3.5 is not sliced: its samples are already
small enough code pieces representing vulnerabilities, and it is not clear which parts of the code actually
contain a vulnerability without manual inspection. The implementation of this technique is built upon
the slicing tool dg [14] with some modifications.

5.2 Instruction Representation


To get embedding representation for every instruction, we use ideas presented in the inst2vec tool [15].
Preprocessing is first done for each instruction to transform it into a more general representation. The
purpose of preprocessing is to abstract enough representative semantic information, and get rid of infor-
mation that is not as important for predicting vulnerabilities. Then, the generalized representations are
fed into a word2vec skip-gram model [12], in order to obtain a vector embedding for every corresponding
instruction. The skip-gram model is used, as it infers the semantic meaning of every instruction by its
neighbors and keeps semantically relevant instructions close together in the vector space.

5.2.1 Instruction Preprocessing


During preprocessing, every LLVM instruction is abstracted into a more general instruction representa-
tion. Some irrelevant information in original LLVM representation is eliminated.
Alignment and debug information is not included in the transformed instructions, as this information
is usually irrelevant to the semantic meaning of the program and is thus not useful for predicting
vulnerabilities. Variable names are replaced with the %ID token as they are not semantically relevant.
Similarly, labels are replaced with %LABEL. Integer and float literals are replaced by @INT_VAL
and @FLOAT_VAL respectively, since these literals have wide ranges and it is impossible to learn a
representation for every single value.
For functions within the same file as the sliced code, their names are replaced with %FUNC_NAME.
We keep the names of functions in files other than that of the sliced code, as they are likely to be
utility API functions either implemented within the project itself or in some other frequently used
libraries. The function types consist of abstracted function parameters. Function types are preserved
to distinguish one function from another when the function names are abstracted. As functions with the
same function type are usually used in similar ways, keeping this information could be helpful when
generating instruction embeddings in later stages.
Unlike inst2vec [15], which keeps the content of every struct, we replace structure types
with %STRUCT_TYPE. While we can iteratively find out the content of every struct from its name in
LLVM, this information is not kept. On one hand, we do not keep this information because it is more
relevant to the functionality or style of the program than to vulnerability types. For every struct,
the vulnerabilities are usually related to only a few of its elements that are used, while we do not need
to keep information for the unused elements. As the used elements and their types are included in the
load instructions later in the code, it is unnecessary to keep this information in struct types either. On
the other hand, struct types are usually too diverse to generalize: if we keep a different instruction
embedding whenever there is a slight difference in the type of struct operand used, there are not enough
occurrences of every struct type to learn a good representative embedding for every one of
those instructions.
Table 5.1 below shows some examples for preprocessed instructions:

LLVM Instruction
    Abstracted Representation

store i32 %15, i32* %length_left, align 4, !dbg !182
    store i32 %ID, i32* %ID

br i1 %cmp27, label %if.end30, label %if.then29, !dbg !201
    br i1 %ID, label %LABEL, label %LABEL

%ru6_nets32 = bitcast %union.anon* %rip6un31 to [1 x %struct.netinfo6]*, !dbg !207
    bitcast [ @INT_VAL x %STRUCT_TYPE ]*, %STRUCT_TYPE* %ID

%call99 = call i32 @rip6_entry_print(%struct.netdissect_options* %79, %struct.netinfo6* %80, i32 0), !dbg !268
    call i32 (%STRUCT_TYPE*, %STRUCT_TYPE*, i32) %FUNC_NAME

call void @llvm.memcpy.p0i8.p0i8.i64(i8* %11, i8* %13, i64 4, i32 2, i1 false), !dbg !474
    call void (i8*, i8*, i64, i32, i1) llvm.memcpy.p0i8.p0i8.i64

Table 5.1: Examples of Preprocessed Instructions
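
Part of the abstraction described in Section 5.2.1 can be sketched with a few regular expressions. This is a simplified illustration, not the preprocessing code used in this work: the function name `abstract_instruction` is hypothetical, and the sketch only strips metadata and abstracts names. It reproduces the store and br rows of Table 5.1, but it does not reorder operands or abstract literals and function names, which the bitcast and call rows would require.

```python
import re

def abstract_instruction(inst: str) -> str:
    """Simplified abstraction of one LLVM IR instruction (illustrative only)."""
    inst = re.sub(r",\s*!dbg\s+!\d+", "", inst)     # drop debug metadata
    inst = re.sub(r",\s*align\s+\d+", "", inst)     # drop alignment
    inst = re.sub(r"^\s*%[\w.]+\s*=\s*", "", inst)  # drop the assigned result name
    # Struct and union types collapse into a single token.
    inst = re.sub(r"%(?:struct|union)\.[\w.]+", "%STRUCT_TYPE", inst)
    # Branch targets become %LABEL.
    inst = re.sub(r"label\s+%[\w.]+", "label %LABEL", inst)
    # Every remaining %name is a variable; tokens introduced above are skipped.
    inst = re.sub(r"%(?!STRUCT_TYPE\b|LABEL\b)[\w.]+", "%ID", inst)
    return inst.strip()

STORE = abstract_instruction("store i32 %15, i32* %length_left, align 4, !dbg !182")
BRANCH = abstract_instruction("br i1 %cmp27, label %if.end30, label %if.then29, !dbg !201")
```

Labels are rewritten before variables so that a branch target is never mistaken for an ordinary identifier.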

5.2.2 Training Instruction Embeddings


Similar to words [12], LLVM instructions with similar contexts also have similar semantic meanings [15].
However, as natural language and programs are interpreted in different ways, the definitions of semantic
meaning and context for program instructions also differ from those for words in natural languages.
Program instructions with similar semantic meanings either take similar types of operands or produce
similar execution results on the same program states. For example, a “mul i32, i32 %ID, i32 %ID” is more
similar to a “udiv i32, i32 %ID, i32 %ID” than to a “udiv i64, i64 %ID, i64 %ID”: the first udiv, with
the same operand types as the mul instruction, has a closer semantic meaning. Also, an “add i32, i32 %ID,
i32 %ID” has a semantic meaning more similar to an “add i64, i64 %ID, i64 %ID” instruction than to
a “div i64, i64 %ID, i64 %ID” instruction: instructions performing the same operation should have similar
representations.
While the context of a word consists of the words several steps before and after it, we define the
context of an instruction to be the instructions that have program dependences related to it.
Instructions with program dependences are included because program dependences can track long-range
relationships that may otherwise be lost by simply inspecting several instructions before and after the
target instruction. For example, load instructions are usually related to store instructions: the
presence of store instructions in the context of an instruction may indicate the target instruction is
a load. However, a load instruction may assign to a variable at the beginning of a function, but the
variable may be used only at the end of the function. While this relationship can be included by simply
following the immediate data dependence between the related load and store, it cannot be easily found
by tracking predecessor and successor instructions.
We use the intuition that instructions with similar operands or execution results are used in similar
ways and thus have similar contexts. To obtain meaningful instruction embeddings, we consider both
the similarities and the differences between program instructions and words in natural languages. After
preprocessing, a word2vec skip-gram model [12] is used to train instruction embeddings. For every
instruction, we acquire neighboring instructions up to a certain context size, where neighbors are pairs of
nodes with control, data or use-def dependences. Neighboring instructions with control flow dependencies
are also taken into account. Immediate successor/predecessor instructions are also treated as neighbors.
Both forward and backward dependences are followed when computing neighbors. Currently, we do not
distinguish one edge type from another: all neighbors reachable with data/control/use-def relationships
are treated as the same. We use the default embedding dimension of 200, as presented by inst2vec [15],
and found 5 epochs a good number for training.
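
The neighbor definition above can be sketched as a bounded breadth-first search that emits (target, context) pairs for a skip-gram model. This is a minimal sketch under stated assumptions: `context_pairs` is a hypothetical helper, all edge types are merged into one undirected adjacency (matching the statement that edge types are not distinguished), and the node identifiers stand in for preprocessed instruction tokens.

```python
from collections import deque

def context_pairs(edges, context_size=1):
    """Return (target, context) pairs for nodes within `context_size` hops."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)  # follow forward and backward edges
    pairs = []
    for target in adjacency:
        depth = {target: 0}
        queue = deque([target])
        while queue:
            node = queue.popleft()
            if depth[node] == context_size:
                continue  # do not expand beyond the context size
            for nxt in adjacency[node]:
                if nxt not in depth:
                    depth[nxt] = depth[node] + 1
                    queue.append(nxt)
                    pairs.append((target, nxt))
    return pairs

# Hypothetical three-instruction chain: store -> load -> add.
EDGES = [("store", "load"), ("load", "add")]
ONE_HOP = set(context_pairs(EDGES, context_size=1))
TWO_HOP = set(context_pairs(EDGES, context_size=2))
```

With a context size of 1 the store and the add never appear in each other's context, while a context size of 2 links them through the load.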
To ensure there are enough instruction occurrences to make the learned embeddings representative,
instructions that occur fewer than 300 times are discarded. The embedding of every discarded instruction
is instead calculated as the average of the embeddings of all instructions with the same opcode as the
discarded one. If no instruction with the same opcode has a learned embedding, the instruction
is represented by an embedding of all zeros.
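
The fallback rule for rare instructions can be sketched as follows. The helper names `opcode` and `fallback_embedding` are hypothetical, and the sketch assumes the opcode is the first token of the abstracted instruction; the toy vectors are two-dimensional only to keep the example readable.

```python
def opcode(instruction):
    """Assume the opcode is the first whitespace-separated token."""
    return instruction.split()[0]

def fallback_embedding(instruction, trained, dim):
    """Average the trained embeddings sharing the opcode, or return zeros."""
    same = [vec for known, vec in trained.items() if opcode(known) == opcode(instruction)]
    if not same:
        return [0.0] * dim
    return [sum(column) / len(same) for column in zip(*same)]

# Toy embeddings, invented for illustration.
TRAINED = {
    "add i32, i32 %ID, i32 %ID": [1.0, 3.0],
    "add i64, i64 %ID, i64 %ID": [3.0, 5.0],
    "load i32, i32* %ID": [9.0, 9.0],
}
RARE_ADD = fallback_embedding("add i8, i8 %ID, i8 %ID", TRAINED, dim=2)   # [2.0, 4.0]
UNKNOWN = fallback_embedding("fneg float %ID", TRAINED, dim=2)            # [0.0, 0.0]
```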

5.2.3 Evaluating Instruction Representations


To find a good instruction representation, the instruction embeddings are evaluated to ensure instructions
with similar meanings have similar representations in the embedding space. We use two strategies for
evaluation as used by inst2vec [15]: semantic analogy test and semantic distance test. We train the
instruction representations in code from several open-source projects: Imagemagick, FFmpeg, Linux,
Qemu, Wireshark, PHP, Radare2, and Tcpdump.
1) The semantic analogy tests have intuitions similar to analogy tests in natural language
processing tasks. As the analogical terms have similar relationships, the direction and distance between
embeddings within analogical pairs should be similar (i.e., for analogical embedding pairs (a, b) and
(c, d), a - b should equal c - d). For example, embedding(“king”) - embedding(“man”) + embedding(“woman”)
should yield a result similar to embedding(“queen”). Just as relations between words in natural languages
can be identified by offsets between word embeddings, differences in semantic meaning such as
operand types and opcodes can also be represented by vectors in the embedding space. Thus, the
instruction embeddings can also be evaluated using analogy tests.
The format of an analogy question is as follows: A is to B, as instruction C is to what (A:B; C:?).
An example of an instruction analogy is shown below.
Question :
add i8 , i8 @INT_VAL , i8 @INT_VAL : add i32 , i32 @INT_VAL , i32 @INT_VAL
sub i8 , i8 @INT_VAL , i8 @INT_VAL : ?
Answer :
sub i32 , i32 @INT_VAL , i32 @INT_VAL

For every question in the semantic analogy test (a:b; c:?), our tester program first evaluates the
target analogical embedding: (b - a + c). It then finds the 5 instructions closest to the evaluated
embedding in the instruction set under test. If the target answer instruction is among the 5 instructions
found, the question is counted as correctly answered. The result of an analogy test is the percentage of
correctly answered questions overall, shown in Table 5.2. This technique is also used for evaluating the
inst2vec tool [15].
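
The top-5 analogy check can be sketched in a few lines. This is an illustrative sketch, not the tester program used here: the function names are hypothetical, cosine similarity is assumed as the closeness measure, and the three question instructions are excluded from the candidates (a common convention in word analogy tests that the text does not specify).

```python
import math

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

def answer_analogy(emb, a, b, c, k=5):
    """Return the k instructions closest to emb[b] - emb[a] + emb[c]."""
    target = [bi - ai + ci for ai, bi, ci in zip(emb[a], emb[b], emb[c])]
    candidates = [name for name in emb if name not in (a, b, c)]
    candidates.sort(key=lambda name: cosine(emb[name], target), reverse=True)
    return candidates[:k]

def analogy_accuracy(emb, questions, k=5):
    """Fraction of questions (a, b, c, d) whose answer d is in the top k."""
    hits = sum(d in answer_analogy(emb, a, b, c, k) for a, b, c, d in questions)
    return hits / len(questions)

# Toy two-dimensional embeddings, invented for illustration.
EMB = {
    "add i8": [1.0, 0.0], "add i32": [1.0, 1.0],
    "sub i8": [2.0, 0.0], "sub i32": [2.0, 1.0],
    "load i32": [-1.0, -1.0],
}
QUESTIONS = [("add i8", "add i32", "sub i8", "sub i32")]
ACCURACY = analogy_accuracy(EMB, QUESTIONS, k=1)
```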
In the inst2vec paper [15], there are several types of semantic analogies (data structure, conversion,
and syntactic analogies). For data structure analogies, the instruction pairs use types of different
data structures. Instruction pairs for conversion analogies are instructions converting between
different operand types. Instruction pairs for syntactic analogies have different debug information
that we do not track. An example of a data structure analogy with struct types is shown below (note
that we use %STRUCT_TYPE to represent structs and do not keep specific struct types such as
{double, double} in our instruction representation; this example is thus not used):
Question :
bitcast i8 * %ID to {{ double , double }} * : bitcast i8 * %ID to <2 x double > *
bitcast i64 * %ID to {{ double , double }} * : ?
Answer :
bitcast i64 * %ID to <2 x double > *

Here is an example for the conversion analogies we use:


Question :
inttoptr i64 * , i64 %ID : ptrtoint i64 , i64 * %ID
inttoptr i8 * , i64 %ID : ?
Answer :
ptrtoint i64 , i8 * %ID

We do not use the data structure analogies and syntactic analogies presented in the inst2vec
paper [15]. This is because our representation does not account for features that may not represent
vulnerability patterns: contents inside struct types and syntactic features such as options are not included
(Section 5.2.1). The conversion analogy tests we use are slightly different from the analogy tests used by
inst2vec, because not all instructions provided by the inst2vec test are present in our set
of instructions. We thus use the available tests from inst2vec and generate additional tests ourselves,
specific to the instructions we evaluate.
2) In semantic distance tests, we test whether instructions that perform similar operations (with the
same opcode) are close to each other in the embedding space (e.g., a load instruction should be more
similar to other load instructions than to store instructions). Instructions with the same opcode are
grouped together into the same category. For every instruction, the distances between the evaluated
instruction and instructions both inside and outside its corresponding category are compared (the
distance between embeddings can be calculated using the dot product). For every category, a score tracks
the percentage of instruction pairs inside the current category with a smaller distance than instruction
pairs outside of the category. The overall performance for the semantic test is the score averaged across
all categories. Our semantic test is the same as the semantic tests used by inst2vec [15].
To select the best performing set of instruction embeddings, training is tuned with different
types of edges and different context sizes. We find that the set of embeddings with neighbors reachable by
all edges with a context size of 1 achieves the best performance. This set of instruction embeddings
is thus used for the vulnerability prediction tasks in later sections. During tuning, we found that some
types of instructions work better with certain types of neighbors and context sizes. While one may train
different instructions using different tuning parameters, this is left for future work.

Training     Context Size    Semantic Conversion Analogies       Semantic Distance Test
                             Inst2Vec-Specific    This Work
This Work    1               -                    12%            70%
Inst2Vec     1               6.6%                 -              61%
Inst2Vec     2               8.9%                 -              79%
Inst2Vec     3               3.2%                 -              63%

Table 5.2: Evaluation for Instruction Embeddings

Although some evaluation strategies we use may be similar to strategies used in natural language
processing, learning instruction representations on programming languages is a much harder problem,
and we should not expect instruction embedding training to achieve similar results as word embedding
training. For reference, the evaluation result for our chosen set of instruction embeddings and the result shown
in the inst2vec paper [15] are listed in Table 5.2. Note that our goal is not to compare our performance
with inst2vec. In fact, we train on a different set of projects and use different analogy tests. Rather,
we demonstrate the validity of our training result by showing that our instruction embedding achieves
comparable performance and can be used later for our vulnerability prediction task in Section 5.3.

5.3 Graph Representation and Classification


To obtain an input format that the machine learning classifier can take, there is a need to convert the
sliced pieces of code into suitable representations. One approach is to convert the code into sequences
of instructions, in the order they appear in the LLVM IR format. While the sequential format works
well for natural languages, we believe there could be a better representation for programs: the execution
of the program does not follow the same sequence as in the IR. Although the execution of instructions
within the same basic block is sequential, there are branch instructions jumping to different locations.
The control and data dependences between different LLVM instructions may also be lost.
As a result, we attempt to represent code slices as directed program graphs. In this graph
representation, both the type of individual instructions and the relationships between instructions are
preserved. This allows the classifier to capture semantic information from the input slices.

5.3.1 Graph Representation and Model


Individual instructions are represented by vertices in graphs, and relationships between instructions are
represented as edges. Feature vectors for nodes are the instruction embeddings, trained following
inst2vec [15], described in the previous section. We track 5 types of program dependences between LLVM
instructions, as shown below.
Control Flow Dependence Within Basic Block
Control Flow Dependence Across Basic Blocks
Control Dependence
Data Dependence
Def-Use Relationship
Both forward and backward edges are taken into account, resulting in a 10-dimensional edge
vector. Every dimension of an edge vector contains either a 0 or a 1. A “1” means a forward/backward
dependence is present, while a “0” means there is no such dependence between the two instructions.
Only instruction pairs that have program dependences are connected (there is no edge vector with all
zeros).
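
The 10-dimensional edge vector can be sketched as follows. The ordering of the dimensions is an assumption made for illustration (five forward slots followed by five backward slots); the text does not state the layout, and the names `EDGE_TYPES` and `edge_vector` are hypothetical.

```python
# The five dependence types listed above, in an assumed order.
EDGE_TYPES = [
    "control_flow_within_block",
    "control_flow_across_blocks",
    "control",
    "data",
    "def_use",
]

def edge_vector(forward, backward):
    """Encode the dependences between two instructions as a 0/1 vector."""
    vec = [0] * (2 * len(EDGE_TYPES))
    for name in forward:
        vec[EDGE_TYPES.index(name)] = 1                    # forward slots 0..4
    for name in backward:
        vec[len(EDGE_TYPES) + EDGE_TYPES.index(name)] = 1  # backward slots 5..9
    if not any(vec):
        raise ValueError("unconnected instruction pairs have no edge vector")
    return vec

EXAMPLE = edge_vector(forward={"data"}, backward={"control"})
```

Rejecting the all-zero vector mirrors the rule that only instruction pairs with at least one dependence are connected.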
For classification with the obtained graph representation, we use the structure2vec [16] tool. We
currently use its Loopy Belief Propagation algorithm. This tool is slightly modified to incorporate
directed edges. During every iteration, each edge takes a weighted sum of both its neighbors and its
initial value. This allows each edge vector to incorporate information from its neighboring nodes and
edges. As the number of iterations increases, every edge gradually propagates information from the
nodes and edges a number of edges away from it. Nodes and edges further away are weighted
less, as it takes more iterations and thus more weighted multiplications for their values to reach the
current edge/node.
A final representation of the entire graph is obtained as the sum of all node features. To train the
weights in the structure2vec model, a neural network is used to evaluate the graph representation. The
neural network takes as input the vector representation of the entire graph and outputs whether a sample
is vulnerable or not.

Algorithm 1 Graph Representation with Embedding Loopy Belief Propagation

 1: procedure GraphRepWithEmbeddingLoopyBP(Weight: $W$, Vertex Features: $V^{init}$, Edge Features: $E^{init}$)
 2:     for $i$ in Vertices do                ▷ Map Initial Vertex Features to Edge Dimension
 3:         $X_i = W_1 * V_i^{init}$
 4:     for $(i, j)$ in Edges do              ▷ Get Initial Edge Representation
 5:         $E_{ij}^{(0)} = X_i + W_2 * E_{ij}^{init}$
 6:     for $t$ in $1 \ldots T$ do            ▷ Embedding Loopy BP: Propagate Information To Neighbors
 7:         for $(i, j)$ in Edges do
 8:             $E_{ij}^{(t)} := \sigma(W_3 * \sum_{k \in Neighbor(i) \setminus j} E_{ki}^{(t-1)} + W_4 * X_i)$
 9:     for $i$ in Vertices do                ▷ Sum up Vertex Representation
10:         $V_i^{final} = \sigma(W_5 * \sum_{k \in Neighbor(i)} E_{ki}^{(T)} + W_6 * V_i^{init})$
11:     return $\sum_{i \in Vertices} V_i^{final}$

Figure 5.1: Model Structure
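
The propagation in Algorithm 1 can be sketched in plain Python. This is a reading of the algorithm under stated assumptions, not the modified structure2vec code: ReLU is assumed for the nonlinearity σ, Neighbor(i) is taken to mean the sources of edges entering vertex i, weights are passed in as a dictionary rather than trained, and the helper names are hypothetical.

```python
def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def vsum(vectors, dim):
    """Element-wise sum of a possibly empty list of vectors of length dim."""
    total = [0.0] * dim
    for vec in vectors:
        total = [t + x for t, x in zip(total, vec)]
    return total

def relu(v):
    return [max(0.0, x) for x in v]  # assumed choice for sigma

def graph_representation(W, V_init, E_init, T):
    """W: {1..6: matrix}; V_init: {vertex: features}; E_init: {(i, j): features}."""
    dim = len(W[1])  # edge-embedding dimension
    X = {i: matvec(W[1], v) for i, v in V_init.items()}
    E = {(i, j): vsum([X[i], matvec(W[2], e)], dim) for (i, j), e in E_init.items()}
    incoming = {i: [k for (k, tgt) in E_init if tgt == i] for i in V_init}
    for _ in range(T):
        # The comprehension reads the previous E before the name is rebound.
        E = {
            (i, j): relu(vsum([
                matvec(W[3], vsum([E[(k, i)] for k in incoming[i] if k != j], dim)),
                matvec(W[4], X[i]),
            ], dim))
            for (i, j) in E_init
        }
    out_dim = len(W[5])
    finals = [
        relu(vsum([
            matvec(W[5], vsum([E[(k, i)] for k in incoming[i]], dim)),
            matvec(W[6], V_init[i]),
        ], out_dim))
        for i in V_init
    ]
    return vsum(finals, out_dim)

# Toy chain a -> b -> c with one-dimensional features and identity weights.
W = {n: [[1.0]] for n in range(1, 7)}
V_INIT = {"a": [1.0], "b": [1.0], "c": [1.0]}
E_INIT = {("a", "b"): [1.0], ("b", "c"): [1.0]}
GRAPH_VEC = graph_representation(W, V_INIT, E_INIT, T=1)
```

In the full model the returned vector would feed the neural network classifier, and the six weight matrices would be learned jointly with it.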

5.3.2 Overall Evaluation


To show how our proposed model in Section 5.3.1 works on real-world vulnerabilities, the graph classifier
is trained on samples obtained from the NVD database. We use the same list of open-source projects for
our evaluation as for instruction embedding training: Imagemagick, FFmpeg, Linux, Qemu,
Wireshark, PHP, Radare2, and Tcpdump. The program slicing technique previously discussed in Section
5.1 is used to get code slices relevant to vulnerabilities: we slice on lines deleted by patches to get
vulnerable samples. For non-vulnerable samples, we obtain slices from functions in which no vulnerabilities
have been previously reported. At most one slice is obtained from each function: this is done to
eliminate potential duplicate code. For non-vulnerable samples, we attempt to slice on instructions that
have a similar distribution to the slicing points of the vulnerable functions (e.g., if X% of the vulnerable
samples are obtained by slicing on load instructions, we collect non-vulnerable samples by slicing on
load instructions X% of the time). However, we discard vulnerabilities whose patch commits touch multiple
functions, as we only deal with intraprocedural vulnerabilities. In total, we use 933 pieces of vulnerable
To deal with dataset imbalance, we use an oversampling rate of 5 to get more vulnerable samples
(every vulnerable sample is repeated 5 times in the training set). The non-vulnerable dataset is
subsampled so that it has a similar size to the oversampled vulnerable dataset. This is done in order to
get a roughly balanced dataset for training. Cross-validation is performed to reduce the effect of
randomness when selecting samples. The performance is shown in Table 5.3.
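
The balancing step can be sketched as follows. The function name `balance_training_set`, the fixed seed, and the toy sample names are hypothetical; the sketch only illustrates repeating each vulnerable sample and subsampling the non-vulnerable pool to a matching size.

```python
import random

def balance_training_set(vulnerable, non_vulnerable, oversample=5, seed=0):
    """Repeat vulnerable samples and subsample non-vulnerable ones to match."""
    rng = random.Random(seed)
    vuln = list(vulnerable) * oversample             # each sample repeated `oversample` times
    keep = min(len(non_vulnerable), len(vuln))
    non_vuln = rng.sample(list(non_vulnerable), keep)
    data = [(s, 1) for s in vuln] + [(s, 0) for s in non_vuln]
    rng.shuffle(data)
    return data

# Toy identifiers standing in for code slices.
VULN = ["v1", "v2"]
NON_VULN = [f"n{i}" for i in range(100)]
TRAIN = balance_training_set(VULN, NON_VULN)
```

Applying this only to the training folds keeps duplicated vulnerable samples out of the test folds during cross-validation.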

Dataset    non-vuln/vuln rate    ROC AUC    Vuln PRC AUC    Vuln Prec    Vuln Recall    Accuracy
Train      0.931                 0.812      0.868           0.813        0.860          0.825
Test       1                     0.671      0.760           0.698        0.818          0.691
Test       143.5                 0.688      0.416           0.019        0.819          0.594

Table 5.3: 5-Fold Cross-Validation on NVD Dataset and Open-Source Projects

The graph classifier fits the training set well. On the almost balanced training set, the classifier
achieves slightly more than 80% precision and recall for the vulnerable class. While there may be some
tradeoff between recall and precision, its PRC AUC score for the vulnerable class indicates that the
model can obtain considerable recall and precision with changing thresholds. With more than 80% ROC
AUC, the classifier is also able to distinguish between the vulnerable and non-vulnerable classes.
Its performance on the test set is not as good as on the training set. On a balanced test set, the
PRC AUC and ROC AUC for the classifier are fair. To estimate how it may perform on datasets with more
non-vulnerable data in the real world, we also test the classifier with more non-vulnerable samples:
non-vulnerable slices from all open-source projects we used are added to the test set. While the
classifier has a similar ability to distinguish between the vulnerable and non-vulnerable classes
(similar ROC AUC), with more non-vulnerable samples, mislabeled non-vulnerable samples greatly reduce
the precision in the vulnerable class. As a result, its PRC AUC also drops greatly. On the more
realistic, imbalanced test set, while the graph classifier still gets more than 80% recall, it achieves
less than 2% precision on average for vulnerable samples.
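
The drop in precision follows directly from the class ratio, which a short calculation makes concrete. In this sketch the recall of 0.82 and the ratio of 143.5 are taken from Table 5.3, while the false positive rate of 0.30 is an assumed value chosen only for illustration (it is not reported in the table); `precision_at_ratio` is a hypothetical helper.

```python
def precision_at_ratio(recall, false_positive_rate, ratio):
    """Precision when there are `ratio` non-vulnerable samples per vulnerable one."""
    true_positives = recall                        # per vulnerable sample
    false_positives = false_positive_rate * ratio  # per vulnerable sample
    return true_positives / (true_positives + false_positives)

BALANCED = precision_at_ratio(0.82, 0.30, 1.0)      # roughly 0.73
IMBALANCED = precision_at_ratio(0.82, 0.30, 143.5)  # roughly 0.019
```

With recall and false positive rate held fixed, moving from a 1:1 to a 143.5:1 class ratio alone pushes precision below 2%, consistent with the similar ROC AUC and the much lower precision reported above.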
Overall, the graph classifier achieves fair performance on samples from the NVD database. We believe
this classifier may be improved by obtaining a larger dataset with better quality.

5.3.3 Comparison with Technique using Instruction Sequence


To test whether our intuition for using a graph representation yields a better model, we
evaluate an instruction sequence model on all vulnerabilities from the NVD database. After program
slicing is applied, the instructions present in the target slices are collected. The obtained instructions
are put into a sequence in the order in which they appear in the LLVM IR. The instruction sequences are
then fed into a bidirectional LSTM (BiLSTM) model. A neural network is connected to the output state of
the last LSTM layer and outputs the probability that a piece of code is vulnerable.
The BiLSTM is chosen to get information both forward and backward in the instruction sequence.
BiLSTM models are used in other work for vulnerability detection, such as VulDeePecker [17], and work
by Lin et al. [18]. However, they work on the ASTs and include syntactic features of programs, while
we only use semantic features from the LLVM IR. We thus run experiments using instruction sequences
obtained from the LLVM IR to compare with our graph representation approach. An oversampling rate
of 10 is used for the vulnerable class (every vulnerable sequence is repeated 10 times). Results for the
experiments are shown in Table 5.4.

Dataset    non-vuln/vuln rate    ROC AUC    Vuln PRC AUC    Vuln Prec    Vuln Recall    Accuracy
Train      1.09                  0.797      0.797           0.741        0.807          0.761
Test       1                     0.730      0.705           0.691        0.754          0.709
Test       183.94                0.740      0.017           0.014        0.743          0.716

Table 5.4: Performance of LSTM on NVD Database

To show the difference between the graph classifier and the instruction sequence model, results in
Table 5.4 and Table 5.3 are compared. While the BiLSTM classifier does not have as good performance
on the training set, it has a better ROC curve on the balanced dataset. This means the BiLSTM better
distinguishes vulnerable and non-vulnerable class. Its ability to learn patterns in the non-vulnerable also
results in more consistent accuracy on imbalanced test sets.
On the other hand, the BiLSTM has a slightly worse PRC AUC for the vulnerable class, with slightly
worse precision and recall in that class. Like the graph classifier, the BiLSTM performs worse on the
vulnerable class as more non-vulnerable samples are added to the test set. However, its performance on
the vulnerable class (PRC AUC) decreases more quickly than that of the graph classifier. In comparison,
although the graph classifier does not distinguish between the vulnerable and non-vulnerable classes as
well, it learns slightly better for the vulnerable class.
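
The drop in vulnerable-class precision on imbalanced test sets follows from the class ratio alone. The
sketch below assumes a classifier with a fixed true-positive rate and false-positive rate; the function
name precision_at_ratio and the two rates are illustrative assumptions, not values measured in our
experiments.

```python
def precision_at_ratio(tpr, fpr, ratio):
    """Precision for the vulnerable class when the test set holds `ratio`
    non-vulnerable samples for every vulnerable sample."""
    true_pos = tpr            # expected true positives per vulnerable sample
    false_pos = fpr * ratio   # expected false positives per vulnerable sample
    return true_pos / (true_pos + false_pos)

# Illustrative rates only.
tpr, fpr = 0.75, 0.30
for ratio in (1, 10, 100, 184):
    print(ratio, round(precision_at_ratio(tpr, fpr, ratio), 3))
```

With the rates held constant, precision falls roughly in proportion to the class ratio, which is why a
model with reasonable precision on a balanced test set can still have very low precision on a realistic
one.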
Although the graph classifier performs better on the training set, this does not give it
much of an advantage on the test set. This is because we do not have much data relative to the
diversity of the vulnerabilities. As the BiLSTM does not consider explicit control and data dependences,
it has fewer features and is less likely to overfit. In addition, the instruction sequence approach is
more lightweight. As the BiLSTM does not explicitly track the control and data dependences between
instructions, it takes less time to process the same number of samples. The instruction sequence
representation also requires fewer epochs to converge than the graph classifier.

5.3.4 Evaluation on Specific Vulnerabilities


In an attempt to enhance the performance of the graph classifier, we try to improve the quality of the
dataset. The intuition is that vulnerabilities inside loops are easier to identify, and a
good number of vulnerable program slices involve loops. The vulnerable slices previously obtained
in Section 5.3.2 are thus filtered: we use only the vulnerable samples containing loops. To keep the
classifier from learning only loop patterns rather than vulnerability patterns, all non-vulnerable
samples need to contain loops as well. We similarly attempt to slice on instructions with the same
distribution as the vulnerable samples with loops, and we discard non-vulnerable slices that do not
contain loops. In total, we get 441 vulnerable slices with loops and 4788 non-vulnerable
slices. Around 10% of the vulnerable samples are used for testing. Table 5.5 lists the performance of
the graph classifier on vulnerabilities with loops.

Dataset   non-vuln/vuln rate   ROC AUC   Vuln PRC AUC   Vuln Prec   Vuln Recall   Accuracy
Train     1.06                 0.838     0.861          0.820       0.811         0.822
Test      1                    0.459     0.585          0.455       0.455         0.455
Test      12.6                 0.502     0.281          0.083       0.455         0.623

Table 5.5: Performance on NVD Loops
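
The loop filter applied to the slices in this section can be sketched as a cycle check on the
control-flow graph of a slice. How loops were detected in our pipeline is not detailed here, so the
code below is only one possible approach: the function name has_loop and the dictionary-based graph
encoding are assumptions made for illustration.

```python
def has_loop(cfg, entry):
    """Return True if the control-flow graph reachable from `entry` contains
    a cycle. `cfg` maps a block name to the list of its successor blocks."""
    WHITE, GREY, BLACK = 0, 1, 2
    color = {entry: GREY}
    stack = [(entry, iter(cfg.get(entry, ())))]
    while stack:
        node, succs = stack[-1]
        for nxt in succs:
            state = color.get(nxt, WHITE)
            if state == GREY:       # back edge to a block on the current path
                return True
            if state == WHITE:      # unvisited block: descend into it
                color[nxt] = GREY
                stack.append((nxt, iter(cfg.get(nxt, ()))))
                break
        else:                       # all successors handled: finish this block
            color[node] = BLACK
            stack.pop()
    return False

# Hypothetical slices: one with a loop, one without.
looping = {"entry": ["header"], "header": ["body", "exit"],
           "body": ["header"], "exit": []}
straight = {"entry": ["a", "b"], "a": ["exit"], "b": ["exit"], "exit": []}
kept = [name for name, g in (("looping", looping), ("straight", straight))
        if has_loop(g, "entry")]
print(kept)
```

A back edge found by depth-first search indicates a cycle, so slices for which the check fails would be
discarded from both classes.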

Performance for training on vulnerable loops is comparable to training with NVD vulnerabilities in
general. However, the classifier does not work as well on the test set. On a balanced dataset, the ROC
AUC is even a bit lower than 0.5. This means the classifier does worse than a random guess. The learned
patterns from the training set do not generalize to the test set. Similar patterns can be observed for
predictions on the imbalanced dataset.
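
The reading of ROC AUC as "better or worse than a random guess" comes from its interpretation as a
ranking probability. The sketch below computes ROC AUC directly from that definition; the function name
roc_auc and the toy scores are illustrative assumptions.

```python
def roc_auc(scores, labels):
    """Probability that a randomly chosen vulnerable sample receives a higher
    score than a randomly chosen non-vulnerable one (ties count one half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

labels = [1, 1, 0, 0]
print(roc_auc([0.9, 0.8, 0.2, 0.1], labels))  # perfect ranking
print(roc_auc([0.5, 0.5, 0.5, 0.5], labels))  # uninformative scores
print(roc_auc([0.1, 0.2, 0.8, 0.9], labels))  # inverted ranking
```

A value of 0.5 corresponds to uninformative scores, and a value below 0.5 means vulnerable samples tend
to be ranked below non-vulnerable ones.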
This is possibly because code involving loops has more complex patterns. While the classifier finds
some patterns in vulnerable loops, the learned patterns are not real vulnerability patterns that
generalize to test sets. Also, although a slice of code may involve loops, it is uncertain whether the
vulnerability is loop-carried or only occurs in some part of the code inside a loop. Additionally, while
the classifier is able to fit these patterns in the training set, it is not clear what the classifier
actually learned.

5.3.5 Evaluation on Synthetic Dataset


As previously discussed, one problem we face is the lack of vulnerable examples. While we cannot obtain
as many high-quality labeled samples from real-world projects, there is an abundant amount of
data in synthetic datasets. The SARD Juliet dataset [11] is a set of software assurance test cases.
It contains examples written for many different vulnerability categories. We choose distinguishable
categories from the SARD Juliet dataset. These include memory-corruption-related vulnerabilities such
as buffer overflow/underwrite, memory leak, and double free/use after free. Vulnerabilities such as
format string, integer overflow, and mishandled file descriptors are also added. In total, we use 23899
code samples from SARD Juliet, which contain both vulnerable and non-vulnerable code.
We first train our classifier on the SARD Juliet dataset to evaluate how our models can learn synthetic
vulnerability patterns. Then, we also test the generated model on real vulnerabilities obtained from the
NVD database, in order to examine how the learned patterns from the synthetic SARD Juliet dataset may
generalize to vulnerabilities in real-world projects. To obtain instruction representations generalizable
to all projects, instruction embeddings are trained using samples from both the SARD Juliet dataset
and the open-source projects present in NVD repositories.
Performance on Synthetic Dataset: We first evaluate our classifier on the synthetic SARD Juliet
dataset. For generality, the instruction embeddings are trained on both the Juliet dataset and the
open-source projects we use. The results are shown below in Table 5.6.

Dataset    non-vuln/vuln rate      ROC AUC   Vuln PRC AUC   Vuln Prec   Vuln Recall   Accuracy
Training   0.926 (10341 / 11169)   0.9996    0.9996         0.998       0.998         0.998
Test       0.947 (1162 / 1227)     0.9996    0.9997         1.000       0.995         0.997

Table 5.6: Performance on SARD Juliet Dataset


This model has high performance on the SARD Juliet dataset. Precision and recall for the vulnerable
class are both very high. Its high ROC AUC score also indicates that the model distinguishes between
the vulnerable and non-vulnerable classes quite well. Multiple reasons might contribute to the
good performance on the synthetic dataset.
One reason that may contribute to the good performance is that the vulnerable patterns in the
synthetic dataset are simpler and more obvious. As the samples in SARD Juliet are mostly simple example
code with no real functionality, the structure of the code and relationships between instructions are
easy to understand. Although real-world vulnerabilities are sliced to reduce noise, samples in SARD
Juliet are still smaller and less complex. In fact, in the SARD dataset, vulnerabilities are deliberately
written to represent typical vulnerabilities. In contrast, classifiers working on real-world
vulnerabilities need to find vulnerability patterns among complicated benign relationships between
different instructions. For human reviewers as well as machine learning classifiers, spotting
vulnerabilities in example code would be much easier than finding potentially vulnerable patterns in
large, complex functions in the real world.
The vulnerability examples in SARD Juliet are also written following the criteria of a specific
vulnerability category. These samples are constructed to represent cases of typical vulnerabilities.
However, in the real world, code may use different resources, could be considered as belonging to
different types of vulnerabilities, and may be harder to analyze. While the synthetic data is less
diverse, there are also many more vulnerable samples in the synthetic dataset. With more data and less
varied features, the synthetic SARD Juliet dataset is both easier to learn and less prone to overfitting.
As a result, while this experiment shows that the graph classifier is able to learn patterns in the
synthetic dataset, the vulnerability patterns learned may not generalize to other programs. Good
performance on the synthetic dataset may not be useful. Further evaluation is thus performed to evaluate
how models trained on the synthetic dataset work with vulnerabilities in the real world.
Evaluating Synthetic Patterns on Real World Datasets: In order to examine how the vulnerability
patterns learned in the synthetic dataset may be applied to real-world applications, we evaluate
classifiers trained using the SARD Juliet dataset on vulnerabilities in the NVD database. As previously
mentioned, because the instruction embeddings for the classifier are trained on both the Juliet dataset
and the open-source projects we use, the model works on both the SARD Juliet dataset and open-source
projects containing real vulnerabilities in the CVE/NVD database. The results are recorded in
Table 5.7 and Table 5.8.

Dataset     non-vuln/vuln rate     ROC AUC   Vuln PRC AUC   Vuln Prec   Vuln Recall   Accuracy
NVD All     173.6 (156248 / 900)   0.412     0.208          0.00367     0.539         0.165
NVD Loops   10.8 (4777 / 441)      0.458     0.069          0.086       0.998         0.099

Table 5.7: Evaluation of Synthetic Training on Real Vulnerabilities in NVD Database

The classifier trained on the synthetic dataset is first evaluated on the samples obtained from the
NVD database in Section 5.3.2. The vulnerable samples are vulnerabilities obtained from the NVD
database, while non-vulnerable samples are randomly sliced pieces of code in which no vulnerability has
been reported. Some samples that do not fit into memory are omitted.
The classifier trained on the synthetic dataset can predict some vulnerable patterns in the NVD
database. Considering the highly imbalanced nature of the test set, the classifier obtains fair PRC AUC
for the vulnerable class. It can find a bit more than half of all the vulnerable samples. However, its
ROC AUC is worse than that of a random classifier. This is because the classifier performs badly on the
non-vulnerable class and wrongly predicts many non-vulnerable samples as vulnerable: the precision for
the vulnerable class (0.00367) is much worse than the precision expected from random prediction, which
equals the fraction of vulnerable samples in the test set (900/157148 ≈ 0.00573).
We further evaluate the classifier on NVD vulnerabilities with loops, using the dataset obtained in
Section 5.3.4: both vulnerable and non-vulnerable samples involve loops. However, the trained classifier
does not distinguish loop patterns very well. Although the classifier correctly finds almost all
vulnerable loops, it cannot distinguish vulnerable loops from non-vulnerable ones, and simply considers
almost all loops as vulnerable. This is possibly because there are not many complex patterns such as
loops in the synthetic dataset, and the classifier associates complexity in code with a higher
probability of being vulnerable.
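
The comparison against random prediction on the "NVD All" row of Table 5.7 can be reproduced with a few
lines of arithmetic. The sketch below is an approximation derived only from the reported counts,
recall, and precision; the helper names are hypothetical, and the false-positive count is an estimate
rather than a measured value.

```python
def baseline_precision(n_vuln, n_nonvuln):
    """Expected precision of a classifier that flags samples at random:
    the fraction of vulnerable samples in the test set."""
    return n_vuln / (n_vuln + n_nonvuln)

def implied_false_positives(n_vuln, recall, precision):
    """Approximate false-positive count implied by recall and precision."""
    true_pos = recall * n_vuln
    predicted_pos = true_pos / precision
    return predicted_pos - true_pos

base = baseline_precision(900, 156248)
fp = implied_false_positives(900, 0.539, 0.00367)
print(round(base, 5), round(fp), round(fp / 156248, 2))
```

Under these assumptions, the estimate suggests that the large majority of non-vulnerable slices are
flagged as vulnerable, which is consistent with the low accuracy reported for this dataset.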

Vulnerability Type   Number of Vulnerabilities   Proportion Predicted as Vulnerable
Denial of Service    539                         0.547
Execute Code         42                          0.833
Memory Corruption    54                          0.611
Overflow             209                         0.536

Table 5.8: Evaluation of Synthetic Training on Different Types of Real Vulnerabilities

In addition, in Table 5.8, the classifier trained with synthetic data is evaluated on vulnerable code
slices in different categories of the NVD database. These categories are chosen because they are related
to the vulnerability types present in the synthetic training set. Some categories have overlapping
samples. The classifier flags more than half of the vulnerable slices in every category. Execute code
and memory corruption have the highest detection rates. This may result from the fact that patterns from
these two types of vulnerabilities are simpler and thus easier to identify.
Overall, while the model trained with synthetic data may be able to identify some simple vulnerable
patterns, it does not work very well on vulnerabilities with complex code structures. In particular, it
wrongly classifies many non-vulnerable samples as vulnerable. While we could infer some patterns based
on the results obtained from the experiments, it is also not very clear what the classifier actually
learned.
Chapter 6

Directed Fuzzing on Predicted Vulnerabilities

Fuzzing is a dynamic testing technique commonly used to find defects in programs. During fuzzing,
a fuzzer automatically generates inputs for the target program and observes program states such as
crashes and hangs while it executes. It is able to test the part of the code that is executed during
fuzzing. Fuzzing aims to find test cases in the input space, called “seeds”, that can trigger corner cases
in the program and thus reveal unexpected program states.
While fuzzing directly triggers vulnerabilities and has fewer false positives, vulnerability analysis
using fuzzers has different challenges compared to analysis with human inspection/static code analysis.
A human can simply analyze a piece of suspicious code by inspecting the code itself and its surrounding
context. However, a fuzzer needs to execute a piece of code in order to analyze it. This means a fuzzer
needs to pass certain checks in the program that guard code to be analyzed. There also exists code that
cannot be easily reached, because checks guarding those pieces of code require the fuzzer to generate
inputs that satisfy complex constraints.
To find more vulnerabilities, two strategies can be used by fuzzers: 1) cover more code and 2) more
thoroughly fuzz parts of the code that are more likely to be vulnerable. Two different types of fuzzers
are thus designed based on these strategies. Coverage-guided fuzzers attempt to generate inputs that
cover more code in the program and thus instrument the target program to collect coverage information.
Directed fuzzers attempt to guide execution towards a few target sites inside programs defined by the
user: if users know which parts of the code are more likely to be buggy, directing execution towards
those parts may save time while finding more defects in programs.
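
The coverage-guided strategy can be illustrated with a toy fuzzing loop. This is a deliberately small
sketch and not how AFL is implemented: the target function, the branch names, and the single-byte
mutation operator are hypothetical, and coverage is tracked as a set of branch labels rather than an
instrumented bitmap.

```python
import random

def target(data, trace):
    """Hypothetical program under test; `trace` records the branches hit."""
    trace.add("entry")
    if len(data) > 0 and data[0] == ord("F"):
        trace.add("magic0")
        if len(data) > 1 and data[1] == ord("Z"):
            trace.add("magic1")
            if len(data) > 2 and data[2] > 0x7F:
                trace.add("deep")

def mutate(data, rng):
    """Overwrite one random byte, or append one random byte."""
    buf = bytearray(data)
    if buf and rng.random() < 0.7:
        buf[rng.randrange(len(buf))] = rng.randrange(256)
    else:
        buf.append(rng.randrange(256))
    return bytes(buf)

def fuzz(seeds, iterations, rng):
    """Keep a mutated input only if it covers a branch not seen before."""
    corpus = list(seeds)
    covered = set()
    for s in corpus:
        target(s, covered)
    for _ in range(iterations):
        candidate = mutate(rng.choice(corpus), rng)
        trace = set()
        target(candidate, trace)
        if trace - covered:
            covered |= trace
            corpus.append(candidate)
    return corpus, covered

corpus, covered = fuzz([b"AAAA"], 5000, random.Random(1))
print(sorted(covered), len(corpus))
```

Because inputs that reach a new branch are kept and mutated further, nested checks can be passed one
byte at a time. A directed fuzzer would additionally prefer corpus entries whose execution comes closer
to the chosen target sites.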
In this chapter, we guide directed fuzzers towards vulnerable functions predicted by machine learning
classifiers in Chapter 4. The intuition, similar to the directed fuzzing strategy, is to more thoroughly
test more vulnerable parts of the program code using knowledge from our machine learning models. In
addition, directed fuzzing is used to confirm predicted vulnerabilities: while there are some false positives
in prediction results from machine learning models, fuzzing could eliminate those inaccurately classified
samples by directly triggering real vulnerabilities. We attempt to save fuzzing time by guiding fuzzers
towards predicted vulnerable functions, while increasing precision for machine learning predictions. To
demonstrate how effective the technique is, we measure how much better a directed fuzzer performs by
thoroughly testing the predicted vulnerable functions, compared to a normal coverage-guided fuzzer.


6.1 AFL & AFLGo


In this section, we introduce the fuzzing tools we later use: the coverage-based AFL fuzzer and the
directed AFLGo fuzzer.
AFL [19] is a popular fuzzer that uses a genetic algorithm to generate inputs for tested programs.
While the user needs to provide seeds/initial inputs for the tested programs, with genetic algorithms,
AFL is able to mutate and generate new input test cases from existing test cases. Additionally, AFL
can automatically select interesting input test cases, in an attempt to either cover more code or more
thoroughly test more vulnerable code.
AFLGo [20] is a directed fuzzer built on AFL. It uses an algorithm similar to Dijkstra’s
algorithm to find distances between basic blocks and target basic blocks. The distances for basic blocks
are instrumented into the program binary to calculate how close the corresponding execution path is
to the target points and thus how useful the input test case is. The cost of an execution path, or its
corresponding seed, is the average distance of all basic blocks on the path.
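
The distance computation described above can be sketched as follows. The example is a simplification
under stated assumptions: every control-flow edge has unit weight, so a breadth-first search over
reversed edges gives the same result as Dijkstra's algorithm; only the nearest target is considered;
and the function names and the toy graph are hypothetical. The way AFLGo combines distances to several
targets may differ.

```python
from collections import deque

def block_distances(cfg, targets):
    """Number of CFG edges from each block to the nearest target block,
    computed by breadth-first search over reversed edges."""
    reverse = {}
    for src, succs in cfg.items():
        reverse.setdefault(src, [])
        for dst in succs:
            reverse.setdefault(dst, []).append(src)
    dist = {t: 0 for t in targets}
    queue = deque(targets)
    while queue:
        block = queue.popleft()
        for pred in reverse.get(block, ()):
            if pred not in dist:
                dist[pred] = dist[block] + 1
                queue.append(pred)
    return dist  # blocks that cannot reach any target are absent

def seed_distance(path, dist):
    """Average distance over the executed blocks that can reach a target."""
    known = [dist[b] for b in path if b in dist]
    return sum(known) / len(known) if known else float("inf")

# Hypothetical control-flow graph with a single target block.
cfg = {"entry": ["check", "exit"], "check": ["parse", "exit"],
       "parse": ["target"], "target": ["exit"], "exit": []}
dist = block_distances(cfg, ["target"])
shallow = seed_distance(["entry", "exit"], dist)
deep = seed_distance(["entry", "check", "parse", "target", "exit"], dist)
print(shallow, deep)
```

A seed whose path gets closer to the target obtains a smaller average distance and would therefore be
treated as more useful by a directed fuzzer.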
AFLGo consists of two phases: the exploration phase and the exploitation phase. As the program first
starts, it enters the exploration phase, in which AFLGo generates new random inputs to gain sufficient
coverage that may be used to reach target points in the program. In the exploitation phase, AFLGo
attempts to direct input towards the target basic blocks, in order to exploit vulnerabilities in programs
around the target points. AFLGo uses simulated annealing to select inputs: it gradually assigns more
weight to seeds that are closer to the target points.
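
The shift from exploration to exploitation can be sketched as an annealing schedule over seed weights.
The formula and constants below are assumptions chosen for illustration, using an exponential cooling
schedule; they are not taken from our experiments, and the function name seed_weight is hypothetical.

```python
def seed_weight(norm_distance, elapsed, time_to_exploit, base=20.0):
    """Weight in [0, 1] for a seed. `norm_distance` is the seed distance
    scaled to [0, 1], where 0 means closest to the target points."""
    # Temperature starts at 1 and decays towards 0 as time passes.
    temperature = base ** (-elapsed / time_to_exploit)
    # Hot: every seed gets about 0.5. Cold: weight approaches 1 - distance.
    return (1.0 - norm_distance) * (1.0 - temperature) + 0.5 * temperature

for minutes in (0, 30, 60, 600):
    near = seed_weight(0.1, minutes, 60)
    far = seed_weight(0.9, minutes, 60)
    print(minutes, round(near, 3), round(far, 3))
```

At the start, all seeds receive the same weight regardless of distance, which corresponds to
exploration. As the temperature drops, seeds closer to the targets receive progressively more weight.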

6.2 Evaluation on Directed Fuzzing


For directed fuzzing, we use AFLGo [20] to guide execution towards the target functions in code. While
AFLGo takes basic blocks as target input, the predictions made in Chapter 4 are at function levels.
As we do not know which part of code inside the function is actually vulnerable, we choose entry basic
blocks of the predicted vulnerable functions (the block executed first when its corresponding function is
entered) as target blocks.
Table 6.1 and Table 6.2 show results of directed fuzzing on FFmpeg after fuzzing for 3 days. For
evaluation, we choose the 8 top-ranking vulnerable functions predicted in Chapter 4 as target functions.
Some other predicted vulnerable functions that are likely to be reached along with the top-ranking
functions (e.g., those with a caller-callee relationship) are grouped together as targets, yielding 24
target functions in total. To reduce the effect of randomness, 8 similar fuzzing processes are run for
both vanilla AFL and directed AFLGo. For directed AFLGo, the 8 fuzzing processes use different predicted
vulnerable functions, as some top-ranking functions may be hard to reach or mispredicted.
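
The grouping of targets by caller-callee relationship can be sketched as a lookup in a call graph. This
is only one way to express the idea: the function name group_targets, the call-graph encoding, and all
function names in the example are hypothetical and do not refer to real FFmpeg functions.

```python
def group_targets(top_ranked, predicted, call_graph):
    """Add predicted-vulnerable functions that directly call, or are directly
    called by, a top-ranked function. `call_graph` maps caller -> callees."""
    top = set(top_ranked)
    targets = set(top)
    for fn in predicted:
        calls_top = bool(set(call_graph.get(fn, ())) & top)
        called_by_top = any(fn in call_graph.get(t, ()) for t in top)
        if calls_top or called_by_top:
            targets.add(fn)
    return targets

# Hypothetical call graph and predictions.
call_graph = {"parse_packet": ["read_header", "transform"],
              "probe_input": ["parse_packet"],
              "unrelated_fn": []}
targets = group_targets(["parse_packet"],
                        ["read_header", "probe_input", "unrelated_fn"],
                        call_graph)
print(sorted(targets))
```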
Additionally, to gather coverage information, we use the llvm-cov [21] tool. llvm-cov is a code
profiling tool implemented for LLVM. It can be used to measure how many times a code region has been
executed in a program binary. We slightly modified AFLGo’s binary instrumentation to collect coverage
information for every code region defined by llvm-cov.

Fuzzing Technique   Average Number of Unique Hangs   Average Number of Unique Paths
Vanilla AFL         0.375                            3505.625
Directed AFLGo      0.375                            2123

Table 6.1: General Fuzzing Performance for FFmpeg

Fuzzing Technique   Proportion of Total Target Functions Reached   Average Percent Coverage for Reached Target Functions
Vanilla AFL         0.208 (9/24)                                   0.667
Directed AFLGo      0.208 (9/24)                                   0.661

Table 6.2: Fuzzing Results on Target Functions for FFmpeg

AFL and AFLGo monitor program crashes and hangs. A crash is an execution in which the program
terminates unexpectedly. For example, null pointer dereferences and unhandled exceptions can result in
crashes. A hang is an execution in which the program does not terminate. Causes for hangs may include
infinite loops and deadlocks. Although it is impractical to wait for a program forever, a hang can be
found by observing program inactivity until a timeout. We also measure the number of unique hangs
and paths. Unique paths are paths that explore different basic blocks, while unique hangs are hangs
triggered by executing unique paths.
Although no crash is found in our experiment, hangs are found across different runs. The unique
hangs and paths in Table 6.1 are averaged across the 8 fuzzing processes. Similarly, the percent coverage
for reached functions in Table 6.2 is also averaged across all 8 fuzzing processes. However, the
proportion of target functions reached is accumulated over all runs: if one fuzzing process executes a
target function, we include that function in the final result. This is done because for directed fuzzing,
different fuzzing processes target different functions.
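
The two aggregation rules can be made concrete with a short sketch. The record layout, the function
name summarize_runs, and the numbers are hypothetical; only the rules themselves (averaging hangs and
paths, taking the union of reached targets) follow the description above.

```python
def summarize_runs(runs, targets):
    """`runs` is a list of dicts with keys 'hangs', 'paths' and 'reached'.
    Hangs and paths are averaged; reached targets are unioned over runs."""
    n = len(runs)
    avg_hangs = sum(r["hangs"] for r in runs) / n
    avg_paths = sum(r["paths"] for r in runs) / n
    reached = set().union(*(r["reached"] for r in runs)) & set(targets)
    return avg_hangs, avg_paths, len(reached) / len(targets)

# Hypothetical results from two fuzzing processes.
runs = [{"hangs": 1, "paths": 2000, "reached": {"f1", "f2"}},
        {"hangs": 0, "paths": 3000, "reached": {"f2", "f3"}}]
summary = summarize_runs(runs, ["f1", "f2", "f3", "f4"])
print(summary)
```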
During our experiments, AFLGo has a similar execution speed to vanilla AFL. Compared to vanilla
AFL, directed fuzzing with AFLGo has a smaller number of unique execution paths: as AFLGo puts
more importance on test cases that execute paths close to target functions, the paths it explores are less
diverse. Also, although no crash is found during fuzzing, some hangs (timeouts) are spotted. Vanilla
AFL finds the same number of hangs as AFLGo.
AFLGo does not hit more target functions than vanilla AFL: they both reach 9 out of 24 target
functions spread across 8 runs. Also, compared to vanilla AFL, AFLGo does not have better coverage
inside the target functions either. Although directed AFLGo spends more time on fuzzing the target
functions, it does not result in higher coverage. From this, it seems directed AFLGo does not work
better than vanilla AFL.
While AFLGo puts higher priority on inputs that execute towards the target functions, the input
generation algorithms for AFL/AFLGo are only able to randomly mutate inputs. There are certain
checks in the code that require input to be of a certain format. Those kinds of inputs can be hard for
AFL/AFLGo to generate, as random mutation is not capable of solving for inputs that satisfy strict
constraints. Code regions guarded by those checks may be hard to reach. Some generated inputs may
not even be able to pass several validity checks at the beginning of the function. These inputs are
discarded without further deep execution. As a result, some target functions are not reached and
some code regions in the target functions remain unexplored.
To cover more functions that AFLGo alone is not able to reach, we also tried using static analysis
techniques to generate input seeds that hit those functions. However, this technique also does not
increase the number of crashes or cover more code regions inside the target functions. Generally, there
are a few reasons that the directed fuzzing technique does not work very well with our prediction results
from machine learning:
1) While directed fuzzing may reach some target vulnerable functions, the fuzzer still may not be
able to obtain good coverage inside these functions: some functions contain many sanitizing checks that
stop execution with invalid inputs from progressing further. This problem is related to a drawback of the
coarse-grained machine learning model mentioned in Section 4.4: as predicted vulnerable functions are
larger and more complex, there are more sanitizing checks a fuzzer needs to bypass and thus more regions
that are hard to reach. A machine learning classifier may be further improved by taking function
complexity into account when ranking vulnerable functions.
2) On the other hand, even if we are able to pass those sanitizing checks, the target function may
still be too complex to explore. Although a function is predicted as vulnerable, we are not able to
know which parts of the function are actually related to the vulnerability. As a result, we are not sure
if the parts of the function reachable by the fuzzer are worth exploring or not. As future work, more
fine-grained machine learning analysis with higher accuracy may be used to assist fuzzing.
3) Although directed fuzzing attempts to execute target functions, some deep or infrequently used
functions are still hard to reach. While static analysis attempts to find inputs towards those functions,
the analysis can be expensive depending on the complexity of the program. Additionally, the static
analysis may not be very precise: sometimes static analysis assisted directed fuzzing still may not be
able to reach the target functions.
Chapter 7

Related Work

As software vulnerabilities raise wide concerns, there has been work trying to find software
vulnerability patterns, in order to reduce the amount of effort required from human reviewers. In this
chapter, we present other works that attempt to automatically detect or analyze software
vulnerabilities. Both code pattern analysis and machine learning techniques have been proposed by
previous work. There has also been work using machine learning to find vulnerable test cases to assist
fuzzers.

7.1 Code Pattern Discovery


There have been techniques that find patterns in code that may be related to vulnerabilities or software
defects. These techniques can be used to clearly identify representative vulnerability patterns, in order
to help reviewers find vulnerabilities. They consist of clone detection, AST pattern matching, and
metrics measuring statistics in code. A summary of these types of techniques is given below.

7.1.1 Vulnerable Clone Detection


Existing works such as VUDDY [22] and VulPecker [23] find vulnerabilities by identifying similar code.
Different from our machine learning techniques that attempt to learn patterns generalizable to a group of
vulnerabilities, these techniques find code reuses: they aim to detect occurrences of the same vulnerability
in a large amount of code.
Kim et al. presented VUDDY [22], a technique that finds vulnerable code clones. It performs
syntax analysis on retrieved functions and obtains representations for functions by replacing parameters,
variables or data types depending on the level of abstraction. Semantically irrelevant information such
as whitespace and comments is removed. The length and the hash value of the extracted representation
are used to look up code clones. This strategy is able to detect variants of vulnerable code reuses from
a large code base.
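
The abstraction and lookup steps can be illustrated with a small sketch. It is a loose approximation
written for this summary rather than VUDDY's actual implementation: identifiers are replaced with
regular expressions instead of a real parser, SHA-256 is an arbitrary choice of hash function, and the
placeholder names FPARAM and LVAR as well as the function name fingerprint are assumptions.

```python
import hashlib
import re

def fingerprint(body, params, local_vars):
    """Return (length, hex digest) of an abstracted, normalized function body."""
    # Remove block and line comments.
    text = re.sub(r"/\*.*?\*/|//[^\n]*", "", body, flags=re.S)
    # Abstraction: replace parameter and local variable names by placeholders.
    for name in params:
        text = re.sub(r"\b%s\b" % re.escape(name), "FPARAM", text)
    for name in local_vars:
        text = re.sub(r"\b%s\b" % re.escape(name), "LVAR", text)
    # Normalization: drop all whitespace and lowercase the result.
    text = re.sub(r"\s+", "", text).lower()
    return len(text), hashlib.sha256(text.encode()).hexdigest()

# Hypothetical known-vulnerable function body and a renamed clone of it.
known = "{ int n = len; /* copy */ memcpy(dst, src, n); }"
clone = "{\n  int k = size;\n  memcpy(out, in, k); // copy\n}"
index = {fingerprint(known, ["dst", "src", "len"], ["n"]): "known vulnerable function"}
match = index.get(fingerprint(clone, ["out", "in", "size"], ["k"]))
print(match)
```

Because the lookup key consists of a length and a hash, matching a function against a large set of known
vulnerable functions reduces to a dictionary lookup.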
Li et al. proposed VulPecker [23], a tool to detect vulnerable code clones at various levels of code
granularities. VulPecker first constructs a dataset of diffs from known vulnerabilities including lines of
code changed in patches. To describe vulnerabilities, it uses different code-reuse features at different
code levels such as variable/function modification, if/else/for condition modification, and entire function
additions/deletions. VulPecker represents code patches at different code fragment levels, trying to
minimize the difference between the lines of code included in the code fragment and lines of code
directly related to the vulnerability (changed by a patch).


VulPecker then generates vulnerability signatures to represent vulnerabilities with specific features.
It also evaluates the effectiveness of different code similarity algorithms such as Manhattan distance of
vectors, substring matching, and subgraph isomorphism. A code similarity algorithm selection engine
is used to choose the most suitable algorithms working at the corresponding code fragment level of the
sample code. A piece of code is considered vulnerable if it matches at least one of the vulnerability
signatures. As the code clone patterns are different, different similarity algorithms at different code
fragment levels are good at identifying different types of vulnerabilities.

7.1.2 Matching Vulnerable AST Patterns


Work by Yamaguchi et al. [24] aims at finding vulnerabilities with patterns of API function usage. This
work first identifies possible API function calls, and represents each function as a vector whose
dimension is the total number of API function calls found. Each element of the vector contains the
TF-IDF score of an API call, a weighting capturing the frequency of the API function call in the
represented function. PCA is then performed to find dominant patterns of API calls in the dataset. This
could help security experts to find crucial API calls that are more likely to be associated with
vulnerabilities.
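
The vector representation can be illustrated with a basic TF-IDF computation. The sketch uses one common
TF-IDF variant (relative term frequency times the logarithm of the inverse document frequency), which
may differ from the weighting used in [24]; the function name tfidf_vectors and the toy functions are
hypothetical, and the PCA step is omitted.

```python
import math
from collections import Counter

def tfidf_vectors(functions, vocabulary):
    """`functions` maps a function name to the list of API calls it makes.
    Returns one vector per function with one entry per API in `vocabulary`."""
    n = len(functions)
    doc_freq = Counter()
    for calls in functions.values():
        doc_freq.update(set(calls))   # count functions that use each API
    vectors = {}
    for name, calls in functions.items():
        counts = Counter(calls)
        total = len(calls) or 1
        vectors[name] = [
            (counts[api] / total) * math.log(n / doc_freq[api])
            if doc_freq[api] else 0.0
            for api in vocabulary
        ]
    return vectors

# Hypothetical functions and the API calls they contain.
functions = {"f": ["malloc", "memcpy", "free"],
             "g": ["malloc", "free"],
             "h": ["malloc", "strcpy"]}
vocabulary = ["malloc", "memcpy", "free", "strcpy"]
vectors = tfidf_vectors(functions, vocabulary)
print(vectors["f"])
```

An API call that appears in every function, such as malloc in this toy example, receives a weight of
zero, while calls that are specific to a few functions receive larger weights.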
Yamaguchi et al. also presented another strategy for vulnerability discovery [25]. With this strategy,
programs are represented as graphs. Information from control flow graphs, program dependence graphs,
and abstract syntax trees in the program are combined into a structure called the code property graph.
The code property graphs are then traversed, and the patterns resulting from these traversals are used as
features for vulnerability classification. This strategy was able to identify patterns in memory corruption
and integer overflow vulnerabilities.
While PCA can be performed to help analyze the discovered patterns, these strategies
match specific patterns in known vulnerabilities to new code: machine learning is not used to make
predictions. Unlike our approach, they aim to help human reviewers understand vulnerability patterns,
instead of fully automating vulnerability detection.

7.1.3 Program Metrics for Vulnerabilities


LEOPARD [26] uses program metrics to detect vulnerabilities in code. Its program metrics are separated
into two types: complexity metrics and vulnerability metrics. Complexity metrics contain features
such as function size and number of nested loops, while vulnerability metrics include statistics such as
number of function parameters, pointer arithmetic and control structures. A complexity score and a
vulnerability score are obtained by respectively adding scores from complexity metrics and vulnerability
metrics together. Functions with similar complexities are first grouped into bins using the complexity
score. Then, the vulnerability score is used to identify the top k ranking functions in each bin for further
inspection. LEOPARD’s detection results can be used to help manual auditing. Their result is able
to guide directed fuzzing, successfully finding vulnerabilities after targeting top-500 ranked functions in
PHP.
This result suggests that program metrics can capture features for vulnerabilities. However, different
from our proposed coarse-grained statistical method, LEOPARD does not use machine learning and all
its complexity/vulnerability metrics are weighted equally.
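
The binning and ranking procedure can be sketched as follows. The example assumes one bin per distinct
complexity score and uses made-up metric values; the function name rank_by_bins and the input layout are
hypothetical, and the sketch does not reproduce LEOPARD's actual metrics.

```python
from collections import defaultdict

def rank_by_bins(functions, k):
    """`functions` maps a name to (complexity_metrics, vulnerability_metrics),
    each a list of numbers. Scores are plain sums, so all metrics are weighted
    equally. Returns the top-k names per complexity bin."""
    bins = defaultdict(list)
    for name, (complexity, vulnerability) in functions.items():
        bins[sum(complexity)].append((sum(vulnerability), name))
    selected = {}
    for score, members in bins.items():
        members.sort(key=lambda m: (-m[0], m[1]))   # highest score first
        selected[score] = [name for _, name in members[:k]]
    return selected

# Hypothetical metric values for five functions.
functions = {"a": ([1, 1], [2, 3]), "b": ([2, 0], [4, 5]), "c": ([1, 1], [0, 1]),
             "d": ([3, 4], [1, 2]), "e": ([5, 2], [2, 2])}
top = rank_by_bins(functions, k=2)
print(top)
```

Ranking within bins keeps simple but suspicious functions from being crowded out by large, complex ones.
A learned model would instead fit different weights for the individual metrics.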

7.2 Machine Learning for Vulnerable Code Patterns


The works mentioned above aim to identify dominant code patterns or guidelines for finding vulnerable
code, in order to help experts make their decisions. In this section, we present work using machine
learning techniques that may be directly used to predict whether a piece of code is vulnerable or not.
However, with machine learning, it may not be as clear what the model actually learned and how the
predictions are made.

7.2.1 Automatically Learning Vulnerable Patterns


Below is a list of existing work on learning vulnerable patterns from programs. While these works
either use tokens directly from the source code or the AST, our project works on the LLVM IR to
ensure syntactical information is not included in our features. Similar to our raw feature representation
approach, these works all need little manual feature engineering and aim to learn implicit, unstated
patterns from programs. However, the graph representation for our raw feature approach takes into
account relationships between instructions such as control and data dependencies, while they treat
programs as sequences of tokens and do not capture such information.
Wang et al. [27] selected specific types of AST nodes to capture patterns for file-level vulnerabilities.
AST nodes such as method invocation, declaration, and control-flow are extracted. To generalize the
features learned, some patterns such as calls to specific functions within a project are replaced with AST
node types. Every unique AST node selected is then treated as a token and mapped to an integer. These
processed sequences are then normalized and fed into a deep belief network (DBN) to obtain semantic
feature representations. The semantic feature vectors are used to train classifiers such as Naive Bayes
and Logistic Regression, in order to predict whether a file is vulnerable or not. This technique can detect
defect patterns both within a project and across projects.
Dam et al. [28] incorporated both syntactic and semantic code features to predict file-level
vulnerabilities. In this approach, every function/method is treated as a sequence of code tokens extracted from
the AST. Every code token is then converted into an embedding using a token embedding matrix. Then
the entire sequence is fed into an LSTM classifier: each LSTM unit in the classifier outputs a vector (code
token state) representing the corresponding code token along with its context. To learn project-specific
syntactical features, code token states in the sequence are aggregated together using pooling techniques
to form method features. Similarly, the method features are further aggregated to generate a single
vector for the entire file. The file-level feature vectors are then fed into machine learning classifiers for
vulnerability prediction. This approach is able to learn vulnerability patterns across Java projects.
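
The two-level aggregation described above can be sketched as follows. The small vectors stand in for LSTM token states, and mean pooling is only one possible choice assumed here; the text above states only that pooling techniques are used, so this is not necessarily the pooling applied by Dam et al. [28].

```python
def mean_pool(vectors):
    """Element-wise average of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


# Hypothetical token states (one vector per code token) for two methods in a file.
method_a = [[1.0, 0.0], [3.0, 2.0]]
method_b = [[0.0, 4.0]]

# Token states -> method features -> a single file-level feature vector.
method_features = [mean_pool(states) for states in (method_a, method_b)]
file_feature = mean_pool(method_features)
```

The file-level vector is what a downstream classifier would consume in this sketch.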
Lin et al. [18] tried capturing AST patterns for vulnerable functions. To represent ASTs as vectors,
depth-first traversal is performed on a function’s entire AST. Tokens are put in the order of the traversal.
Each token element in the AST vector is then converted to word embeddings using the word2vec model
[12] in order to capture the semantic meaning of each token. The vector sequence is then fed into a Bi-
LSTM network for representation learning and used by a random forest model for classification. With a
small vulnerable dataset from open-source projects, Lin et al. are able to reduce the number of vulnerable
functions to inspect. Some patterns learned by their model can also be applied across projects.
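
The depth-first serialization of an AST into a token sequence can be sketched as below. Python's built-in ast module is used purely as a stand-in parser, and the function name dfs_tokens is hypothetical; Lin et al. [18] operate on ASTs of functions from their own dataset, not on Python code.

```python
import ast


def dfs_tokens(node):
    """Return AST node type names in depth-first (pre-order) traversal order."""
    tokens = [type(node).__name__]
    for child in ast.iter_child_nodes(node):
        tokens.extend(dfs_tokens(child))
    return tokens


# A small function used only as example input.
source = "def f(buf, n):\n    return buf[n]\n"
func = ast.parse(source).body[0]
tokens = dfs_tokens(func)
```

Each element of tokens could then be mapped to an embedding before being fed to a sequence model.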
Russell et al. [29] learned features directly from source code. Static analyzers such as Cppcheck
[4] and Flawfinder [3] are used to label their dataset, and then a group of researchers assign labeled
vulnerabilities to their respective categories. Duplicate functions are also removed to sanitize their
dataset. The word2vec bag-of-words model [12] is used to train embeddings for every token. Both CNN
and RNN models are tried, in order to extract features from obtained token embeddings. A dense
neural network layer, taking token embeddings as inputs, makes classification decisions. This work
aims at detecting memory corruption vulnerabilities such as buffer overflow, improper pointer usage and
improper restriction of boundary conditions. Their tool is able to achieve good recall, but it is still
challenging to obtain good precision using this technique, as there are far fewer vulnerable samples
than non-vulnerable samples in their dataset. Evaluating samples in different vulnerability
categories shows that this technique works better on some types of vulnerabilities than others.

7.2.2 Program Slicing with Machine Learning For Vulnerability Detection


Techniques that combine program slicing with machine learning vulnerability detection have also been
introduced.
Li et al. proposed VulDeePecker [17]. Before feeding input to a machine learning classifier, this
tool tries grouping semantically related lines of code together. It builds upon the observation that
different types of vulnerabilities have different types of “key points”: for example, buffer overflow errors
are usually associated with pointers, arrays, and library/API calls. Instead of taking an entire function/file
as input, VulDeePecker applies the program slicing technique from a commercial code analysis product
Checkmarx [30], to acquire small pieces of relevant code, named “code gadgets”. It performs slicing on
library function calls which may be indicative of vulnerabilities, and only follows data dependences.
To obtain vector representation for the obtained code gadgets, VulDeePecker uses word2vec [12] to
train an embedding representation for every token in source code. A Bi-LSTM network that takes the
vector sequences as input is used to classify the code gadgets. The training data mainly consists of the
SARD Juliet synthetic dataset [11], while some samples from the CVE database [8] are also included.
With close to 50K pieces of code gadgets, VulDeePecker is able to achieve good precision and recall on
the synthetic SARD Juliet dataset. Although this tool does not perform as well on samples collected
from the CVE database, it can still find new vulnerabilities in real-world projects.
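
To make the slicing idea concrete, the sketch below collects the statements that a chosen library call transitively depends on through data dependences. The dependence edges, the line numbers, and the function name backward_slice are hypothetical, and only a backward traversal is shown; this is a simplified illustration rather than the slicing procedure of VulDeePecker [17] or Checkmarx [30].

```python
def backward_slice(deps, start):
    """Collect every statement that `start` transitively depends on (including itself)."""
    seen = set()
    stack = [start]
    while stack:
        stmt = stack.pop()
        if stmt in seen:
            continue
        seen.add(stmt)
        stack.extend(deps.get(stmt, []))
    return sorted(seen)


# Hypothetical data-dependence edges: statement line -> lines defining the values it uses.
data_deps = {
    5: [2, 4],  # strcpy(buf, src) uses buf (line 2) and src (line 4)
    4: [3],     # src = argv[idx] uses idx (line 3)
    6: [1],     # a statement unrelated to the strcpy call
}

# Slice on the library call at line 5 to obtain a small group of related statements.
gadget = backward_slice(data_deps, start=5)
```

Statements outside the slice, such as line 6 here, are excluded from the resulting piece of code.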
To improve VulDeePecker, SySeVR [31] further expands key points for slicing to include other vulner-
ability rules from the program analysis product Checkmarx [30]. Rules such as array/pointer usage, API
calls and arithmetic expressions are included as slicing points. SySeVR also adds control dependences
on backward slicing along with the data dependences when constructing code gadgets, and thus obtains
more effective classifiers. In addition, its performance is also evaluated with different types of machine
learning models such as GRU, CNN, and DBN. This work covers more types of vulnerabilities and finds
that this strategy works better on some types of vulnerabilities than others.
This method is further evaluated [32] with imbalanced datasets and different types of classifiers on
more types of vulnerabilities. Labels on code gadgets in the obtained dataset are manually checked to
ensure their correctness. With k-fold cross-validation, the study shows that bidirectional RNNs are more effective
than unidirectional RNNs and CNNs. Also, in their experiments, on a dataset (both synthetic and
real-world) whose non-vulnerable samples are 4-6 times as numerous as its vulnerable samples, techniques
with imbalanced data processing perform worse.
While we also use program slicing in our raw feature representation approach, our method is different
from their techniques in several ways. Our graph representation attempts to learn features about the
data and control flow relationships between instructions, rather than treating them as a sequence of
tokens. Also, to capture general features of vulnerabilities irrelevant to specific projects/programming
styles, our project works on the LLVM IR [7] instead of the AST. While our approach also works on the
synthetic SARD Juliet dataset [11], we aim to learn patterns in more realistic scenarios, training only
on real-world vulnerabilities from the CVE database [8].

7.3 Vulnerability Discovery with Machine Learning to Guide Fuzzing

To enhance the performance of fuzzers, Grieco et al. presented VDiscover [33], a tool that uses machine
learning to help choose test cases that trigger vulnerabilities. To catch up with the speed of the fuzzing
tool, its code analysis and machine learning parts need to work on binary programs and achieve good
speed/scalability. VDiscover represents both the entire program structure and state of execution using
sequences of library calls. Both static and dynamic features are fed into the machine learning classifier.
For static features, VDiscover constructs possible sequences of library calls obtained from random
walks. For dynamic features, it uses both the concretely executed sequences of library calls with types
of function parameters and the final state of the program after execution. VDiscover processes every
sequence of library calls as a document: it uses bag-of-words and word2vec models to generate vector
representation for every function call. Vectors in each sequence are concatenated into a single
vector for classification. VDiscover tries different classifiers such as random forest, neural network, and
logistic regression. Although the performance of the classifier is fair due to a lack of data, VDiscover
can still potentially improve the fuzzing speed.
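
A bag-of-words representation over library-call sequences can be sketched as follows. The call traces and the helper name bag_of_words are hypothetical, and the sketch produces one count vector per trace as a simplification; it does not reproduce the actual feature pipeline of VDiscover [33].

```python
from collections import Counter


def bag_of_words(sequence, vocabulary):
    """Count how often each vocabulary entry occurs in one call sequence."""
    counts = Counter(sequence)
    return [counts[name] for name in vocabulary]


# Hypothetical sequences of library calls, each treated as a document.
traces = [
    ["malloc", "strcpy", "free"],
    ["malloc", "memcpy", "memcpy", "free"],
]

# A fixed, sorted vocabulary keeps vector positions consistent across traces.
vocabulary = sorted({call for trace in traces for call in trace})
vectors = [bag_of_words(trace, vocabulary) for trace in traces]
```

Vectors of this kind discard call order, which is one reason an embedding model such as word2vec may be used alongside them.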
Instead of capturing features relevant to the entire program like VDiscover does, our techniques
detect vulnerabilities at function level or in program slices. In our work, it is easier for human reviewers
to find where the vulnerability is. Although we use only static features, our feature sets are more diverse
in both of our approaches. However, VDiscover is able to work on binary programs without access to
source code, and better assist fuzzing.
Chapter 8

Conclusion and Future Work

In this thesis, we use machine learning algorithms to learn vulnerability patterns in code, and thus predict
vulnerable code for further analysis. We experiment with two approaches: coarse-grained statistical
features and raw feature representation for sliced code. Coarse-grained statistical features trade
the expressiveness of the model for a smaller amount of required data, while the raw feature representation
overfits on the training set due to increased complexity/expressiveness of the model. However, the
quality and size of the vulnerability dataset still limit the performance for both techniques we propose.
For the function-level statistical features, while some vulnerability patterns may be learned, the
size/quality of the vulnerability dataset and the coarse-grained nature for the obtained features limit
the performance of the model. Additionally, the models are more likely to predict large and complex
functions as vulnerable. While the predictions may be correct, this behavior causes problems for its
practical usage: it takes more time to analyze the predicted vulnerable functions than randomly selected
functions in general, and thus automatic tools such as fuzzers may not combine very well with the
prediction result from the models. More precision within large functions is needed for this strategy to
be useful.
To learn more fine-grained patterns in programs, we use graph-based classifiers with program slicing.
While experiments show that information such as control/data dependencies between instructions is
useful, the performance on real-world projects is generally fair for graph classifiers trained using raw
feature representations. This is due to the lack of a high-quality vulnerability dataset: the amount of
useful training data is not able to cover the diversity in vulnerability patterns and the complexity of the
machine learning model. When training on loop samples with less data and more complex patterns, the
classifier is likely to overfit. On the other hand, while training on the synthetic dataset with more data
and simpler code structures, the graph classifier is able to obtain good performance. However, synthetic
samples are usually simple and cannot capture complex patterns in real-world vulnerabilities.

8.1 Future Work


In order to better solve the attempted problems in this thesis, here we discuss a number of approaches
that we may explore as future work.
1) While we do not have many labeled samples, normal code samples are abundant.
Machine learning models such as autoencoders can thus be used to learn representations of typical
pieces of code. The vulnerability samples would hopefully have representations different from normal,
non-vulnerable samples. Similar types of vulnerabilities may also share similarities in their code rep-
resentations. As a result, anomaly detection strategies that detect outliers in program representations
may be used to find vulnerabilities in programs.
2) We may also better integrate prediction results from machine learning with fuzzing techniques. To
inspect prediction results with directed fuzzing, we may use tools such as afl-unicorn [34] to fuzz sliced
pieces of code or functions separately without starting at the entry point of the program. Although
this may take more manual work to both analyze program input and filter out false positives, using this
technique we could improve coverage for the target function.
3) To achieve better fuzzing coverage within the targeted function, we can better combine directed
fuzzing with static analysis: future work may be done to iteratively fuzz the program and generate inputs
for parts of the code that are hard to reach. Through fuzzing, we can find uncovered basic blocks that
are hard to reach by the test cases used. New inputs may thus be generated with further static analysis
and constraint solving in order to lead execution paths towards the uncovered basic blocks.
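
As a minimal sketch of the outlier-based idea in item 1) above, the code below flags code representations that lie far from the centroid of known normal samples. A simple distance-to-centroid rule is swapped in here for the autoencoder mentioned in the text; the vectors, the threshold, and the function names are hypothetical and serve only to illustrate the principle.

```python
import math


def centroid(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def find_outliers(normal, candidates, threshold):
    """Return indices of candidates farther than `threshold` from the normal centroid."""
    center = centroid(normal)
    return [i for i, vec in enumerate(candidates) if distance(vec, center) > threshold]


# Hypothetical learned representations of normal code and of code to be checked.
normal = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]
candidates = [[1.0, 1.5], [6.0, 9.0]]
outliers = find_outliers(normal, candidates, threshold=3.0)
```

In practice the threshold would need to be calibrated on held-out data, and the flagged samples would still require further analysis.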
Bibliography

[1] J. Berr, ““WannaCry” ransomware attack losses could reach $4 billion.”
https://www.cbsnews.com/news/wannacry-ransomware-attacks-wannacry-virus-losses, 2017.

[2] D. Goodin, “Apple pushes fix for facepalm, possibly its creepiest vulnerability ever.”
https://arstechnica.com/information-technology/2019/02/apple-pushes-fix-for-facepalm-possibly-its-creepiest-vulnerability-ever, 2019.

[3] D. Wheeler, “Flawfinder.” http://www.dwheeler.com/flawfinder, 2006.

[4] D. Marjamäki, “Cppcheck: a tool for static c/c++ code analysis,” 2013.

[5] “Static analysis with codesonar.” https://www.grammatech.com/products/source-code-analysis, 2020.

[6] J. C. King, “Symbolic execution and program testing,” Communications of the ACM, vol. 19, no. 7,
pp. 385–394, 1976.

[7] “LLVM language reference manual.” https://llvm.org/docs/LangRef.html, 2020.

[8] “Common vulnerabilities and exposures.” https://cve.mitre.org, 2020.

[9] “National vulnerability database.” https://nvd.nist.gov, 2020.

[10] “Software assurance reference dataset - nist.” https://samate.nist.gov/SARD, 2020.

[11] P. E. Black, Juliet 1.3 Test Suite: Changes From 1.2. US Department of Commerce,
National Institute of Standards and Technology, 2018.

[12] T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient estimation of word representations in
vector space,” arXiv preprint arXiv:1301.3781, 2013.

[13] M. Weiser, “Program slicing,” IEEE Transactions on software engineering, no. 4, pp. 352–357, 1984.

[14] “LLVM static slicer: Dependence graph for programs.” https://github.com/mchalupa/dg, 2019.

[15] T. Ben-Nun, A. S. Jakobovits, and T. Hoefler, “Neural code comprehension: a learnable represen-
tation of code semantics,” in Advances in Neural Information Processing Systems, pp. 3585–3597,
2018.

[16] H. Dai, B. Dai, and L. Song, “Discriminative embeddings of latent variable models for structured
data,” in International conference on machine learning, pp. 2702–2711, 2016.

[17] Z. Li, D. Zou, S. Xu, X. Ou, H. Jin, S. Wang, Z. Deng, and Y. Zhong, “Vuldeepecker: A deep
learning-based system for vulnerability detection,” in Proceedings of the 25th Annual Network and
Distributed System Security Symposium, San Diego, California, USA, pp. 1–15, 2018.

[18] G. Lin, J. Zhang, W. Luo, L. Pan, Y. Xiang, O. De Vel, and P. Montague, “Cross-project transfer
representation learning for vulnerable function discovery,” IEEE Transactions on Industrial Infor-
matics, vol. 14, no. 7, pp. 3289–3297, 2018.

[19] M. Zalewski, “American fuzzy lop.” http://lcamtuf.coredump.cx/afl, 2017.

[20] M. Böhme, V.-T. Pham, M.-D. Nguyen, and A. Roychoudhury, “Directed greybox fuzzing,” in
Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security,
pp. 2329–2344, ACM, 2017.

[21] “llvm-cov - emit coverage information.” http://llvm.org/docs/CommandGuide/llvm-cov.html, 2020.

[22] S. Kim, S. Woo, H. Lee, and H. Oh, “Vuddy: A scalable approach for vulnerable code clone
discovery,” in 2017 IEEE Symposium on Security and Privacy (SP), pp. 595–614, IEEE, 2017.

[23] Z. Li, D. Zou, S. Xu, H. Jin, H. Qi, and J. Hu, “Vulpecker: an automated vulnerability detec-
tion system based on code similarity analysis,” in Proceedings of the 32nd Annual Conference on
Computer Security Applications, pp. 201–213, ACM, 2016.

[24] F. Yamaguchi, M. Lottmann, and K. Rieck, “Generalized vulnerability extrapolation using ab-
stract syntax trees,” in Proceedings of the 28th Annual Computer Security Applications Conference,
pp. 359–368, ACM, 2012.

[25] F. Yamaguchi, N. Golde, D. Arp, and K. Rieck, “Modeling and discovering vulnerabilities with code
property graphs,” in 2014 IEEE Symposium on Security and Privacy, pp. 590–604, IEEE, 2014.

[26] X. Du, B. Chen, Y. Li, J. Guo, Y. Zhou, Y. Liu, and Y. Jiang, “Leopard: Identifying vulnerable
code for vulnerability assessment through program metrics,” in Proceedings of the 41st International
Conference on Software Engineering, pp. 60–71, IEEE Press, 2019.

[27] S. Wang, T. Liu, and L. Tan, “Automatically learning semantic features for defect prediction,”
in 2016 IEEE/ACM 38th International Conference on Software Engineering (ICSE), pp. 297–308,
IEEE, 2016.

[28] H. K. Dam, T. Tran, T. Pham, S. W. Ng, J. Grundy, and A. Ghose, “Automatic feature learning
for vulnerability prediction,” arXiv preprint arXiv:1708.02368, 2017.

[29] R. Russell, L. Kim, L. Hamilton, T. Lazovich, J. Harer, O. Ozdemir, P. Ellingwood, and M. Mc-
Conley, “Automated vulnerability detection in source code using deep representation learning,”
in 2018 17th IEEE International Conference on Machine Learning and Applications (ICMLA),
pp. 757–762, IEEE, 2018.

[30] “Checkmarx.” https://www.checkmarx.com, 2020.

[31] Z. Li, D. Zou, S. Xu, H. Jin, Y. Zhu, Z. Chen, S. Wang, and J. Wang, “SySeVR: a framework for
using deep learning to detect software vulnerabilities,” arXiv preprint arXiv:1807.06756, 2018.

[32] Z. Li, D. Zou, J. Tang, Z. Zhang, M. Sun, and H. Jin, “A comparative study of deep learning-based
vulnerability detection system,” IEEE Access, vol. 7, pp. 103184–103197, 2019.

[33] G. Grieco, G. L. Grinblat, L. Uzal, S. Rawat, J. Feist, and L. Mounier, “Toward large-scale vul-
nerability discovery using machine learning,” in Proceedings of the Sixth ACM Conference on Data
and Application Security and Privacy, pp. 85–96, ACM, 2016.

[34] “afl-unicorn: Fuzzing arbitrary binary code.”
https://hackernoon.com/afl-unicorn-fuzzing-arbitrary-binary-code-563ca28936bf, 2017.
