[orcid=0000-0002-6747-8151]

Explaining the Contributing Factors for Vulnerability Detection in Machine Learning

Esma Mouine e_mouine@encs.concordia.ca Yan Liu yan.liu@concordia.ca Lu Xiao lxiao6@stevens.edu Rick Kazman kazman@hawaii.edu Xiao Wang xwang97@stevens.edu Concordia University, 1455 De Maisonneuve Blvd. W. Montreal, Qc, Canada, H3G 1M8 Stevens Institute of Technology, Castle Point Terrace, Hoboken, NJ 07030, United States University of Hawaii, 2500 Campus Rd, Honolulu, HI 96822, United States

Abstract

There is an increasing trend to mine vulnerabilities from software repositories and use machine learning techniques to automatically detect software vulnerabilities. A fundamental but unresolved research question is: how do different factors in the mining and learning process impact the accuracy of identifying vulnerabilities in software projects of varying characteristics? Substantial research has been dedicated to this area, including source code static analysis, software repository mining, and NLP-based machine learning. However, practitioners lack experience regarding the key factors for building a state-of-the-art baseline model. In addition, there is a lack of experience regarding the transferability of vulnerability signatures from project to project. This study investigates how the combination of different vulnerability features and three representative machine learning models impacts the accuracy of vulnerability detection in 17 real-world projects. We examine two types of vulnerability representations: 1) code features extracted through NLP with varying tokenization strategies and three different embedding techniques (bag-of-words, word2vec, and fastText) and 2) a set of eight architectural metrics that capture the abstract design of the software systems. The three machine learning algorithms are a random forest model, a support vector machine model, and a residual neural network model. Overall, 95% of the learning metrics (precision, recall, F1 score, etc.) are above 0.77 across 10 hypothesis tests and 408 experiments. Further analysis shows that a recommended baseline model with signatures extracted through bag-of-words embedding, combined with the random forest, consistently increases the detection accuracy by about 4% compared to other combinations in all 17 projects. Furthermore, our experiments reveal the limitations of transferring vulnerability signatures across domains.

keywords:

software vulnerability, machine learning, natural language processing, architectural metrics

1 INTRODUCTION

The National Institute of Standards and Technology (NIST) defines security vulnerability as a weakness in an information system, system security procedures, internal controls, or implementation that could be exploited or triggered by a threat source [1]. Software vulnerability management is the practice of identifying, classifying, remediating, and mitigating vulnerabilities. Early detection of vulnerable code reduces the risks of run-time errors, faults, threats, and the collapse of a system. As software scales expand, vulnerability detection with sufficient accuracy and efficiency remains a challenge from both research [2, 3, 4, 5] and industrial perspectives [6, 7]. The goal is to learn from representations of vulnerable features and to automate the discovery of vulnerabilities in source code.

In industrial practice, security flaws are regularly reported to the Common Vulnerabilities and Exposures (CVE) database [8]. This database is used to collect and share publicly disclosed information about security vulnerabilities. Likewise, Common Weakness Enumeration (CWE) is a community-developed list of common software and hardware security weaknesses [9]. The Open Web Application Security Project (OWASP) Benchmark is a Java test suite that contains thousands of exploitable test cases where each one maps to a specific CWE. NIST’s Software Assurance Reference Dataset (SARD) [10] provides a set of known security flaws for researchers and software security assurance developers. Within SARD, a set of test suites exist including the Juliet Tests for Java and C++ [11], mobile apps and Web apps [12]. These sources of information are used to search for known vulnerabilities to identify potential exploits as part of a forensics process.

In software engineering, static code analysis helps to identify bugs or flaws in software. Code analysis techniques are embedded in security scanners and raise alerts when vulnerabilities are detected [13, 14, 15, 16, 17, 18, 19]. The identified vulnerabilities are confirmed by security engineers. One technique of static code analysis is pattern matching [17, 18, 19] that searches based on a set of rules. These rules, usually defined by security experts, enumerate known vulnerabilities. One limitation of scanners based on static analysis is the high false-positive rate [20]. For example, one case study [20] performed using a static analysis tool on Java source files showed that 45.7% of discovered vulnerabilities were false positives.

To improve the precision and recall of detecting vulnerabilities, research has been conducted to build a feature engineering methodology. Dam et al. [2] used a Long Short Term Memory (LSTM) model to capture relationships between code elements. Likewise, Russel et al. [6] developed a fast and scalable vulnerability detection tool for C and C++ based on deep feature representation learning that interprets source code. Hovsepyan et al. [21] analyzed Java source code using bag-of-words and support vector machines to classify vulnerabilities.

Recent research has focused on machine learning models to mine feature representations from software repositories [22]. In addition to machine learning, other methods based on natural language processing (NLP) have emerged. To extract features, these techniques treat the source code as a form of text. Software repositories contain code that forms the corpus upon which feature representations can be learned. The concept of a corpus, originating in linguistics, is a collection of text in one or more languages. In NLP, the corpus is used to train learning models. For example, in the classic Word2Vec [23] model, a corpus is used to produce the embedding of tokens that forms the relations of these tokens to each other in a multi-dimensional space. Zhou and Sharma [24] use commit messages and bug reports from repositories to identify software flaws.

The representation of software code as tokens does not capture code dependencies and structural complexity. Software architecture metrics measure the complexity of software entities [25, 26, 27, 28, 29, 30]. For example, Fan-In and Fan-Out of source files and classes are shown to impact the propagation of software quality issues through the inter-dependencies among software entities [29]. Due to the intrinsic connections between software architecture and security, prior studies have investigated how software architecture impacts the security of a system [25, 26, 27, 28]. In this paper, we examine whether software architecture metrics are a dominating contributor to vulnerability classification when combined with token embeddings that carry no structural representation.

Finally, it remains unclear how well the signatures identified from one set of projects transfer to detecting vulnerabilities in other projects. To test the vulnerability signatures, we need a baseline model to output the vulnerability classification. Such a model helps to establish a base for investigating feature representation techniques, learning models, and factors such as code structure and complexity in learning vulnerability patterns. Baseline models are commonly used in the artificial intelligence community [31, 32]. A baseline model serves as a reference point to compare the performance of other models that are usually more complex. A baseline model relies on the understanding of the key factors contributing to the discovery of vulnerability signatures through a combination of techniques and machine learning models.

In this paper, we assume tokens in a software repository form the corpus used to learn vulnerability patterns. These tokens are further embedded as numerical features for learning vulnerability classification. The ultimate goal is to develop a learning method that takes code embeddings from project repositories as input so that the learning is transferable to other projects. Hence, the core research question is:

What are the contributing factors in learning processes that impact the accuracy of identifying vulnerabilities across software projects?

Our paper concentrates on four aspects of the learning process: (A1) the tokenization of the source code, (A2) the generation of embeddings, (A3) architectural metrics, and (A4) machine learning models. For tokenization, we experiment with two tokenization approaches: with and without symbols and comments. For embeddings, we investigate three types of embedding methods, namely bag-of-words [33], word2vec [23], and fastText [34]. For architectural metrics, we explore eight file-based metrics to augment vulnerability representations. These metrics are introduced in detail in Section 3.4. They measure how source files are connected to each other in a system. We aim to examine whether adding architectural metrics helps to improve detection accuracy. For machine learning models, we build three machine learning algorithms: a weak learner-based model (random forest [35]), a kernel vector-based model (support vector machines [36]), and a neural network model (residual neural network [37]).

We evaluate the combined effects of the above four aspects on the accuracy of vulnerability classification over 17 Java projects. These combinations lead to a total of 408 experiments, whose results are used to evaluate the ten derived hypotheses. The accuracy of the learning results is measured by six metrics: precision, recall, f-measure, false-positive rate, the area under the precision-recall curve, and the area under the receiver operating characteristic curve.

Our results indicate that $95\%$ of the learning metrics are above $0.77$ over all experiments. Further analysis shows that feature representations derived from the source code including all tokens, using bag-of-words embedding and combined with the random forest model, consistently increase the detection accuracy by about $4\%$ compared to other combinations in all 17 projects listed in Table 3. This learning model is further evaluated in a transfer learning context on 5 out of 15 Android projects in Table 3, with both precision and recall above 0.8. Comparing token embeddings with architecture metrics, token embeddings contribute more to the vulnerability classification. We observe that the combination of token pre-processing, conventional NLP embedding, and a random forest model is sufficient for building a baseline for learning vulnerabilities, with performance comparable to deep learning models such as ResNet or LSTM [2]. Such a baseline provides a reference point to quantify the minimal expected performance that a new vulnerability learning model should achieve.

2 RELATED WORK

2.1 Machine Learning and Natural Language Processing for Detecting Vulnerabilities

Our research aims to identify the factors contributing to the learning of software vulnerabilities from source code repositories. Several approaches have been developed aiming to improve the detection of vulnerabilities (e.g., [38, 2, 21, 39, 40]). One example is applying pattern recognition techniques to detect malware [41]. This technique [41] consists of visualizing malware binary gray-scale images and classifying these images according to observations that show that malware from the same families appears to be very similar in layout and texture.

To build a vulnerability prediction model, the selection of features is essential. The features most frequently used in previous work are software metrics [42, 43, 44] and developer activity [5]. As early as 1996, Basili et al. [43] used source code metrics for the binary classification of vulnerabilities in C++ code. Nagappan et al. [44] used complexity metrics, such as module metrics that consist of the number of classes, functions, and variables in a module M, in addition to per-function and per-class metrics. They applied those metrics to several Microsoft systems to identify faulty components. Perl et al. [22] considered metrics from developer activities by analyzing whether commits were related to a vulnerability. Their methodology combines a support vector machine (SVM) classifier with code metrics gathered from repository metadata.

Recent work treats code as a form of text and uses natural language processing based methods for code analysis. Zhou and Sharma [24] used commit messages and bug reports from repositories to identify software flaws using NLP techniques such as word2vec to create the embeddings used as features and machine learning classifiers. Hovsepyan et al.[21] analyzed Java source code from Android applications using a bag-of-words representation and SVM for vulnerability prediction.

Pang et al. [39] further included n-grams in the feature vectors and used SVM for classification. Jackson and Bennett [45] used the Python Natural Language Toolkit (NLTK) to develop a machine learning agent that applies NLP techniques to convert the code to a matrix and identifies a specific flaw, SQL injection, in Java byte code using decision trees and random forests for classification.

Other works focus on deep learning techniques. Russel et al. [6] identify vulnerabilities in C and C++ source code at the function level based on deep feature representation learning that directly interprets lexed source code. Dam et al. [2] present a deep learning approach that uses an LSTM model to automatically learn both semantic and syntactic features of code.

Apart from the work of Hovsepyan et al. [21], most of these approaches focus on the feature engineering part; for example, Russel et al. [6] use a convolutional neural network to build the feature vectors.

A recent survey [46] summarizes the techniques, datasets, and results obtained from vulnerability detection research that uses machine learning. According to its categories, our work falls in the text-based category, since we use a convolutional neural network (ResNet).

Our approach also focuses on detecting vulnerabilities in source code using machine learning and natural language processing techniques. However, we use general NLP-based techniques (bag-of-words, word2vec, fastText, and code tokenization) combined with different machine learning models to identify the key factors contributing to the learning of software flaws from code.

2.2 Software Architecture and Security

Software architecture is the high-level abstraction of a software system. Poor software architectural decisions are responsible for various software quality problems. Numerous previous studies have underscored the impact of software architecture on security.

Software architecture is the most important determinant to systematically achieve quality attributes in a software system, including software security [47]. Software security is, for many systems, the most important quality attribute driving the design.

Due to the intrinsic connections between software architecture and security, prior studies have investigated how software architecture impacts the security of a system [25, 26, 27, 28]. However, little work has investigated how to leverage software architecture characteristics and metrics in machine learning processes to discover vulnerabilities. Researchers in software architecture have developed some measures to capture the complexity of software architecture entities [25, 26, 27, 28, 29, 30]. For example, Fan-In and Fan-Out of source files and classes are shown to impact the propagation of software quality issues through the inter-dependencies among software entities [29]. What remains unclear is whether and how different architecture metrics can be used as vulnerability representations for machine learning models to detect software vulnerabilities.

Previous research mostly focused on security assessment and evaluation from an architectural perspective. For example, Feng et al. found that software vulnerabilities are highly correlated with flawed architectural connections among source files [25]. Sohr and Berger found that software architecture analysis helps to concentrate on security-critical software modules and detect certain security flaws at the architectural level, such as the circumvention of APIs or incomplete enforcement of access control [48]. Brian and Issarny showed how software architecture benefits security by encapsulating security-related requirements at design-time [49]. Antonino et al. [50] evaluated the security of existing service-oriented systems on the architectural level. Their method is based on recovering security-relevant facts about the system and interactive security analysis at the structural level. Alkussayer and Allen [51] proposed a security risk evaluation approach by leveraging the architectural model of a system, assuming that components propagate their security risks to higher-level components in the architecture model. Alkussayer and Allen [52] assessed the level of security supported by a given architecture and qualitatively compared multiple architectures with respect to their security support.

Despite the high recognition of an architecture’s impact on security, there is little focus on using architectural metrics as vulnerability signatures for machine learning models [53, 54]. Alshammari et al. [55] is one of the few studies that investigated security metrics based on the composition, coupling, extensibility, inheritance, and design size of an object-oriented project. However, these metrics have not been compared with other vulnerability signatures, such as code features extracted using NLP. In addition, these metrics are tightly tied to object-oriented concepts and may not be easy to transfer to other programming paradigms. Motivated by the work of Feng et al., our study focuses on eight architectural metrics that capture how software elements, i.e., source files, depend on each other [25]. These metrics are generally applicable to software projects of different characteristics, such as the programming language used. In addition, although they are measured at the file level in this work, they can easily be rolled up to the component level or down to the method level following the same rationale, to detect vulnerabilities at different granularities in future studies. Most importantly, to the best of our knowledge, we are the first to compare architectural metrics with code features extracted through NLP as vulnerability representations.

3 RESEARCH METHODOLOGY

The research method treats the learning task as a classification problem over vulnerability signatures. We consider four aspects relevant to the learning process: (A1) tokenization: with regard to how tokens are extracted from software, we consider questions such as whether code comments as tokens impact detection results; (A2) embedding: tokens are transformed into numerical values, and the effects of different embedding techniques are investigated; (A3) architectural metrics: we focus on architectural metrics that measure the complexity of the inter-dependencies among fine-grained software architecture elements at the file level, considering eight architecture metrics that are detailed in Section 3.4; and (A4) machine learning algorithms for classification.

We consider software as a corpus to develop the feature representation through token encoding. The tokens are the terms from the software code, separated according to spaces and special characters. For the corpus, we use software code from open repositories. The encodings are then used as input to machine learning models for vulnerability detection on datasets such as the OWASP benchmark, the Juliet test suite for Java, and Android Study. The architectural metrics are used as additional feature representations, along with code-based representations. Based on the above rationale, we propose to answer the following research questions:

RQ1 How does the filtering of tokens affect source code vulnerability detection?

When using NLP techniques to extract features, an essential preprocessing step is the tokenization of the source code. This step involves separating the code into tokens before creating the embeddings. In NLP practice, special symbols (including , . ; : [ ] ) ( + - = | & ! ? * ^ < > @ " ' # %) are generally filtered out from the source code before separating it into tokens. Another question is: do the comments contain meaningful features and affect the feature representations? To answer this question, we compare the performance of vulnerability detection with and without comments.

RQ2 Does a specific embedding technique perform better across software projects?

Embedding is the process that maps each token to a vector, whose values are learned using techniques such as bag-of-words [33], word2vec [23], and fastText [56]. This research question evaluates whether a particular embedding technique consistently improves the performance of vulnerability detection across all 17 software projects.

RQ3 Can architectural metrics that measure the structural complexity of software improve vulnerability detection?

We answer this question in two ways. First, we compare the learning performance using the NLP-based token embedding and using the architectural metric representation separately. Next, we merge these representations into the learning process to observe if the combination improves vulnerability detection compared to using either of them alone.

RQ4 Which machine learning model performs better across different projects?

We compare three kinds of machine learning models, namely, decision tree based (Random Forests), kernel-based (Support Vector Machines), and deep neural networks (Residual Neural Networks). Each model is combined with the feature representation extracted through different techniques of tokenization, embedding techniques, and architectural metrics. The goal is to discover whether a particular machine learning model performs best in terms of vulnerability detection in different settings and across software projects.

RQ5 How transferable is the learning in predicting vulnerabilities of projects in cross-validation?

For this last research question, we aim to evaluate the transferability of the learned features in vulnerability prediction. The learning model is fine-tuned on the training projects in cross-validation. We define different sets of experiments in which we train our model on one project and predict the vulnerabilities of other projects.

3.1 The Vulnerability Detection Process

The process of vulnerable code detection, as shown in Figure 1, starts from a software repository, which provides the corpus for developing a vocabulary. Any project (even without vulnerable code labels) can be used for this purpose. Such a vocabulary is used to build the embedding of software tokens. The embedding is pretrained on the corpus using word2vec and fastText. The tokens are then converted to numerical representations by running the embedding. In addition, architecture metrics can be extracted from projects with tags. Next, architectural metrics and embeddings of code tokens are the input features to a supervised classification model. We consider vulnerability detection under two sources of vulnerability code labels: (1) the labels are from code within the same domain as the target software for vulnerability detection (Tables 9, 10 and 11); and (2) the models are trained with labeled software code in one domain and used to classify software code in a different domain. For example, a model is trained on the Juliet dataset, then used to predict the vulnerability of source code in Android projects (Tables 7 and 12).

Figure 1: Learning software vulnerability as a classification task

3.2 Tokenization

Tokenization is a common pre-processing step in natural language processing that transforms the raw input text into a format that is more easily processed. The raw code contains special symbols, including 1) punctuation characters (such as . , : ; ? ) ( ] [ ' " } {), 2) mathematical and logical operators (such as + - / = * & ! % | < >), and 3) others (such as # @ ^). In NLP, special characters are generally considered to add no value to text understanding and can introduce noise into algorithms. The raw code also contains meta text that usually appears as code comments: any text that starts with two forward slashes (//) and any text between /* and */. To determine whether special characters and comments are important for vulnerability prediction in source code, we use regular expressions to remove the code comments and special symbols.
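The filtering described above can be sketched with two regular expressions. This is a minimal illustration rather than the exact expressions used in the study: the function name `tokenize`, its two flags, and the choice to keep each symbol as a separate token when symbols are retained are all assumptions, and comment markers that appear inside string literals are not protected.

```python
import re

# Block comments (/* ... */) and line comments (// ...), as described above.
# Simplification: comment markers inside string literals are not protected.
COMMENT_RE = re.compile(r"/\*.*?\*/|//[^\n]*", re.DOTALL)
# Anything that is not a letter, digit, underscore, or whitespace.
SYMBOL_RE = re.compile(r"[^\w\s]")

def tokenize(code, keep_comments=False, keep_symbols=False):
    """Split source code into tokens, optionally dropping comments and symbols."""
    if not keep_comments:
        code = COMMENT_RE.sub(" ", code)
    if keep_symbols:
        # Assumption: pad each symbol with spaces so it becomes its own token.
        code = SYMBOL_RE.sub(lambda m: " " + m.group(0) + " ", code)
    else:
        code = SYMBOL_RE.sub(" ", code)
    return code.split()

java = """
// read user input
String q = "SELECT * FROM t WHERE id=" + id; /* unsafe */
"""
print(tokenize(java))
# ['String', 'q', 'SELECT', 'FROM', 't', 'WHERE', 'id', 'id']
```

Toggling the two flags yields the token variants (with or without comments, with or without symbols) that the tokenization experiments compare.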

3.3 Token Embeddings

Token embeddings are learned numerical representations of text in which tokens or words with similar meanings have similar representations. In the domain of NLP, a corpus is a collection of texts. All tokens or words in the corpora form a high-dimensional space. A learning model on this space calibrates the position of each word or token according to its relations with all other tokens. Finally, each token has a numerical vector representation called an embedding.

In this work, the corpus is formed from software projects selected from GitHub. We trained the following three models to create the numerical vector representation of the source code tokens:

3.3.1 Bag-of-words (BOW)

Bag-of-words is a representation of text [33]. It represents the text as a vector in which each element corresponds to a token of the vocabulary and holds that token's frequency in the text. Hence, the resulting vector has the same length as the number of unique tokens in the vocabulary. The BOW vectors are therefore limited to the vocabulary of the text used for training.
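As a dependency-free illustration of this representation (the study itself uses scikit-learn for this step), the sketch below builds a vocabulary from training token lists and counts token frequencies. The helper names `build_vocabulary` and `bow_vector` and the toy tokens are hypothetical.

```python
from collections import Counter

def build_vocabulary(token_lists):
    """Map each unique token of the training files to a fixed vector index."""
    vocab = sorted({tok for tokens in token_lists for tok in tokens})
    return {tok: i for i, tok in enumerate(vocab)}

def bow_vector(tokens, vocab):
    """Count how often each vocabulary token occurs; unknown tokens are dropped."""
    counts = Counter(tokens)
    vec = [0] * len(vocab)
    for tok, n in counts.items():
        if tok in vocab:
            vec[vocab[tok]] = n
    return vec

train = [["int", "x", "x"], ["String", "x", "query"]]
vocab = build_vocabulary(train)
print(vocab)   # {'String': 0, 'int': 1, 'query': 2, 'x': 3}
print(bow_vector(["x", "query", "query", "unseen"], vocab))   # [0, 0, 2, 1]
```

The token `unseen` is silently ignored, which shows the limitation noted above: the vector can only describe tokens that occurred in the training text.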

3.3.2 Word2vec

Word2vec is a method for creating word embeddings, a concept that has been around since 2003 [23]. The algorithm trains a neural network on a large corpus of text. Word2vec can use either skip-gram or continuous bag-of-words (CBOW) to learn the representations of tokens. CBOW resembles BOW in that it treats the context as an unordered set of words, but instead of sparse vectors (vectors with many zeros) it uses dense vectors; given a context, it predicts the probability of the target word. Skip-gram does the reverse: given a target word, it predicts the surrounding context words. We use skip-gram since we want to maintain the context of the token. The model takes a target term and learns its numerical vector from the surrounding terms.
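To make the skip-gram objective concrete, the following sketch enumerates the (target, context) training pairs that a fixed window produces. It only illustrates how the pairs are formed; it does not train a network, and real implementations typically also subsample frequent tokens and vary the effective window size. The function name and the example tokens are hypothetical.

```python
def skipgram_pairs(tokens, window=5):
    """List (target, context) pairs for all tokens within `window` positions."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

tokens = ["stmt", "executeQuery", "sql"]
pairs = skipgram_pairs(tokens, window=1)
print(pairs)
# [('stmt', 'executeQuery'), ('executeQuery', 'stmt'),
#  ('executeQuery', 'sql'), ('sql', 'executeQuery')]
```

Each pair asks the model to predict the context token from the target token, so tokens that occur in similar surroundings end up with similar vectors.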

3.3.3 FastText

FastText is a library for learning word embeddings and text classification created by Facebook’s AI research lab [34]. Akin to word2vec, fastText supports CBOW and skip-gram. Instead of feeding only individual tokens into the neural network, fastText exploits subterm information: each token is represented as a bag of character n-grams in addition to the token itself. This takes the internal structure of tokens into account and allows the handling of unknown tokens that were not seen during training.
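The subterm idea can be illustrated by listing the character n-grams of a token. The sketch below is a simplified illustration, not fastText's internal code: the boundary markers `<` and `>` and the n-gram range of 3 to 6 are assumptions made here, since the n-gram lengths used in the study are not stated.

```python
def char_ngrams(token, min_n=3, max_n=6):
    """Character n-grams of a token wrapped in boundary markers '<' and '>'."""
    wrapped = "<" + token + ">"
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams

print(char_ngrams("exec", min_n=3, max_n=3))
# ['<ex', 'exe', 'xec', 'ec>']

# An unseen token still shares n-grams with tokens seen during training.
shared = set(char_ngrams("executeQuery")) & set(char_ngrams("executeUpdate"))
print(sorted(shared)[:3])
```

Because `executeQuery` and `executeUpdate` share many n-grams, a vector for one can be approximated from subterms learned from the other, which is how unseen tokens are handled.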

Word2vec and fastText use the same parameters. We use a dimensionality of 300 for the feature vectors and a window size of 5; words with a total frequency lower than two are ignored. To obtain the source code embedding of a file, we average the token vectors of all the terms of the file using tf-idf (Term Frequency - Inverse Document Frequency) weighting. These embeddings are calculated by multiplying each vector by the tf-idf weight of the corresponding term before calculating the average.
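The weighted averaging can be sketched as follows. Several details are assumptions rather than the study's exact procedure: the smoothed idf formula is one common variant, the average is taken over the distinct terms of the file, and the two-dimensional toy vectors stand in for the 300-dimensional embeddings. The names `idf_weights` and `file_embedding` are hypothetical.

```python
import math
from collections import Counter

def idf_weights(documents):
    """Smoothed inverse document frequency per token (one common variant)."""
    n_docs = len(documents)
    df = Counter(tok for doc in documents for tok in set(doc))
    return {tok: math.log((1 + n_docs) / (1 + d)) + 1.0 for tok, d in df.items()}

def file_embedding(tokens, vectors, idf):
    """Average of the file's term vectors, each scaled by its tf-idf weight."""
    tf = Counter(t for t in tokens if t in vectors and t in idf)
    if not tf:
        return None
    dim = len(next(iter(vectors.values())))
    total = [0.0] * dim
    for tok, count in tf.items():
        weight = count * idf[tok]          # tf-idf weight of this term
        for k in range(dim):
            total[k] += weight * vectors[tok][k]
    # Assumption: average over the distinct terms of the file.
    return [v / len(tf) for v in total]

vectors = {"a": [1.0, 0.0], "b": [0.0, 1.0]}   # toy 2-d token vectors
idf = idf_weights([["a", "b"], ["a"]])
emb = file_embedding(["a", "a", "b"], vectors, idf)
print(emb)
```

Terms that are frequent in the file but rare across the corpus receive larger weights and therefore dominate the file-level vector.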

For bag-of-words, the resulting vector has the same length as the vocabulary. Our feature extractor uses the Python scikit-learn [57] library to generate the bag-of-words vector and the Gensim [58] library for the word2vec and fastText models. To train these models, we use source code from large repositories to learn the similarities between the source code tokens. The vocabulary is created from the source code of three large projects: (1) the IntelliJ community project [59], (2) the Android repository [60], and (3) the Android framework project. These repositories contain more than 70,000 Java files.

3.4 Architectural Metrics

Software architecture refers to software elements, their relationships, and the properties of both [61]. As discussed in Section 2.2, prior research has revealed the significant impact of architecture design decisions on software security. In particular, the study in [25] reported that complicated architectural connections among source files in a project contribute positively to the propagation of software vulnerability issues. Hence we are motivated to investigate whether metrics that measure the complexity of architectural connections at the file level contribute positively to the detection of software vulnerabilities using machine learning models.

We model a set of architectural connections as a graph, namely $G=\{F,D\}$, where $F$ is the set of source files in the system, and $D$ is the set of structural dependencies among the source files. The graph $G$ of a software system can be reverse-engineered using existing tools, such as SciTools Understand (https://scitools.com/).

For each source file $f\in F$, we capture eight metrics to measure the file’s connections with the rest of the system $G$. We treat these metrics as additional feature representations for learning vulnerabilities. These eight metrics cover three different but related aspects of software architecture:

First, we measure the Fan-in and Fan-out of a file $f$, which count the number of direct dependencies involving $f$ and are commonly used in various analyses:

1. Fan-in: The number of source files in $G$ that directly depend on $f$.

2. Fan-out: The number of source files in $G$ that $f$ directly depends on.
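A minimal sketch of these two counts, assuming the dependency set $D$ is available as (source, target) pairs in which the source file depends on the target file. The function name and the file names are hypothetical.

```python
def fan_metrics(files, deps):
    """deps holds (source, target) pairs meaning `source` depends on `target`."""
    fan_in = {f: 0 for f in files}
    fan_out = {f: 0 for f in files}
    for src, dst in set(deps):
        fan_out[src] += 1   # src depends on one more file
        fan_in[dst] += 1    # one more file depends on dst
    return fan_in, fan_out

F = ["A.java", "B.java", "C.java"]
D = {("A.java", "C.java"), ("B.java", "C.java"), ("C.java", "B.java")}
fan_in, fan_out = fan_metrics(F, D)
print(fan_in)    # {'A.java': 0, 'B.java': 1, 'C.java': 2}
print(fan_out)   # {'A.java': 1, 'B.java': 1, 'C.java': 1}
```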

Next, we measure the position of $f$ in the entire dependency hierarchy of $G$ . Cai et al. proposed an algorithm to cluster source files into hierarchical dependency layers based on their structural dependencies in $G$ [62]. The key features of the layers are: 1) source files in the same layers form independent modules, and 2) source files in a lower layer structurally depend on the upper layer, but not vice-versa. This layered structure is called the Architectural Design Rule Hierarchy (ArchDRH). The rationale is that source files in a higher layer structurally impact source files in the lower layers. Therefore, the higher the layer of $f$ , the more influential it is for the rest of the system.

3. Design Rule Hierarchy Layer: the layer number of $f$ in the ArchDRH clustering.

Finally, we measure the complexity of the transitive connections to each $f$ in $G$. For any $f\in F$, we define the $Butterfly\_Space_{f}=\{f,UpperWing,LowerWing\}$, where $f$ is the center of the space. $UpperWing$ is the set of source files that directly or transitively depend on $f$. Similarly, $LowerWing$ is the set of source files that $f$ directly or transitively depends on. For any $f\in F$, we calculate five metrics based on the $Butterfly\_Space$ notions:

4. Space Size: the total number of source files in $Butterfly\_Space_{f}$ . This measures the total number of source files that $f$ is connected to directly and transitively. The higher this value, the more strongly $f$ is connected to the rest of the system.
5. Upper Width: the width of the $UpperWing$ . This measures the maximal number of branches that depend on $f$ .
6. Upper Depth: the length of the longest path in the $UpperWing$ . This measures the most far-reaching transitive dependency on $f$ .
7. Lower Width: the width of the $LowerWing$ . This measures the maximal number of branches that $f$ depends on.
8. Lower Depth: the length of the longest path in the $LowerWing$ . This measures the most far-reaching transitive dependency from $f$ .
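As a rough illustration of how several of these metrics can be derived from $G$ , the following Python sketch computes Fan-in, Fan-out, Space Size, Upper Depth and Lower Depth for one file. It assumes the dependency graph is acyclic and is given as a plain dictionary; the function name `architecture_metrics`, the toy graph and the file names are hypothetical and not taken from the study. The ArchDRH layer and the two width metrics are left out because they depend on the clustering algorithm of [62] and on a precise definition of width.

```python
from functools import lru_cache


def architecture_metrics(deps, f):
    """deps maps each file to the set of files it directly depends on."""
    files = set(deps) | {t for targets in deps.values() for t in targets}
    # Reverse edges: for each file, the files that directly depend on it.
    rdeps = {x: set() for x in files}
    for src, targets in deps.items():
        for t in targets:
            rdeps[t].add(src)

    def reachable(start, edges):
        # All files reachable from start by following edges (start excluded).
        seen, stack = set(), [start]
        while stack:
            node = stack.pop()
            for nxt in edges.get(node, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        seen.discard(start)
        return seen

    def depth(start, edges):
        # Length of the longest path from start; only valid for acyclic graphs.
        @lru_cache(maxsize=None)
        def longest(node):
            return max((1 + longest(n) for n in edges.get(node, ())), default=0)
        return longest(start)

    upper_wing = reachable(f, rdeps)  # files that depend on f
    lower_wing = reachable(f, deps)   # files that f depends on
    return {
        "fan_in": len(rdeps.get(f, ())),
        "fan_out": len(deps.get(f, ())),
        "space_size": 1 + len(upper_wing | lower_wing),
        "upper_depth": depth(f, rdeps),
        "lower_depth": depth(f, deps),
    }


# Hypothetical toy system: A depends on B and C, both depend on D, D depends on E.
toy_deps = {"A": {"B", "C"}, "B": {"D"}, "C": {"D"}, "D": {"E"}}
metrics_d = architecture_metrics(toy_deps, "D")
```

For the toy graph, file D has two direct dependents, one direct dependency, and a butterfly space that covers all five files. Real dependency graphs may contain cycles, in which case the depth computation would need a cycle-aware definition.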

In this study, we investigate whether and to what extent these metrics contribute to the learning of software vulnerabilities.

3.5 Machine Learning Models

We perform a classification task to predict whether a file is vulnerable or not. Our objective is to observe the effects of the machine learning model on this task. Since the model is part of the decision process of the classification task, we also consider how transparent each model is about its classification decisions. A random forest model offers one form of transparency through the importance of each feature to the classification performance. A kernel-based Support Vector Machine is useful for data with an irregular or unknown distribution. The residual neural network (ResNet) model has been used to examine explainability methods [63]. Since the LSTM model has already been studied in the literature, we compare three other kinds of machine learning models: decision tree-based random forests, kernel-based SVMs, and deep neural networks in the form of a ResNet.

3.5.1 Random Forest

Random forest (RF) is an ensemble learning method for supervised classification [35]. The model is constructed from multiple randomized decision trees, each trained on a bootstrap sample of the training data. The trees vote on how to classify a given input instance, and the random forest aggregates those votes, which reduces overfitting compared to a single tree.
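The two ensemble ingredients mentioned above, bootstrap sampling and majority voting, can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the "trees" are hypothetical hand-written rules rather than trained decision trees, and the feature names `exec_calls` and `sql_strings` are invented for the example.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement (one training set per tree)."""
    return [rng.choice(rows) for _ in rows]


def majority_vote(trees, x):
    """Return the label predicted by the largest number of trees for input x."""
    votes = Counter(tree(x) for tree in trees)
    return votes.most_common(1)[0][0]


# Hypothetical stand-ins for trained trees: each maps a feature dict to a label.
trees = [
    lambda x: x["exec_calls"] > 0,
    lambda x: x["sql_strings"] > 2,
    lambda x: x["exec_calls"] > 1,
]
sample_file = {"exec_calls": 1, "sql_strings": 0}
label = majority_vote(trees, sample_file)  # votes: True, False, False

resampled = bootstrap_sample(list(range(10)), random.Random(0))
```

In a real random forest each tree would be fitted on its own bootstrap sample, usually with a random subset of features considered at each split.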

3.5.2 Support Vector Machines

Support Vector Machines (SVM) uses a kernel function to perform both linear and non-linear classifications [36]. The SVM algorithm creates a hyper-plane in a high-dimensional space that can separate the instances in the training set according to their class labels. SVM is one of the widely used machine learning algorithms for sentiment analysis in NLP.

3.5.3 Residual Neural Network

Residual Neural Network (ResNet) is a deep neural network model whose residual blocks carry their input forward through shortcut connections between layers. In our case, we construct a ResNet model composed of one convolutional layer, one dense layer and 7 ResNet blocks. Each ResNet block is composed of $16$ layers. The detailed ResNet structure is depicted in Figure 2. We apply a residual block with the following structure:

$\mathbf{x}_{l+1}=h\left(\mathbf{x}_{l}\right)+\mathcal{F}\left(\hat{f}\left(\mathbf{x}_{l}\right),\mathcal{W}_{l}\right)$ (1)

where $\mathbf{x}_{l}$ is the input to the $l$-th residual block and $\mathbf{x}_{l+1}$ is its output. $\hat{f}$ is the activation function, for which we use ReLU. $\mathcal{F}$ is the residual function, which contains two $1\times 3$ convolutional layers, and $\mathcal{W}_{l}$ denotes the corresponding parameters. We define the shortcut $h$ as one $1\times 1$ convolutional layer if the dimensions of $\mathbf{x}_{l}$ and $\mathbf{x}_{l+1}$ do not match; otherwise $h$ is the identity:

$h\left(\mathbf{x}_{l}\right)=\mathbf{x}_{l}$ (2)
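To make Eq. (1) and Eq. (2) concrete, the sketch below evaluates one residual block on a single-channel sequence in plain Python. It is a simplification and not the model used in the experiments: it assumes one channel, no bias terms, zero padding, no activation between the two convolutions, and the identity shortcut of Eq. (2). The function names and kernels are hypothetical.

```python
def relu(xs):
    """Element-wise ReLU, playing the role of the activation in Eq. (1)."""
    return [max(0.0, v) for v in xs]


def conv1x3(xs, kernel):
    """1x3 convolution with zero padding, so the output keeps the input length."""
    padded = [0.0] + list(xs) + [0.0]
    return [
        sum(k * padded[i + j] for j, k in enumerate(kernel))
        for i in range(len(xs))
    ]


def residual_block(x, w1, w2):
    """x_{l+1} = h(x_l) + F(f(x_l), W_l) with identity shortcut h."""
    activated = relu(x)                              # f(x_l)
    residual = conv1x3(conv1x3(activated, w1), w2)   # F: two 1x3 convolutions
    return [a + b for a, b in zip(x, residual)]      # h(x_l) = x_l


x = [1.0, -2.0, 3.0]
pass_through = [0.0, 1.0, 0.0]  # kernel that copies its input
zero_kernel = [0.0, 0.0, 0.0]   # kernel that outputs zeros

out_pass = residual_block(x, pass_through, pass_through)
out_zero = residual_block(x, zero_kernel, zero_kernel)
```

With pass-through kernels the block returns the input plus its rectified copy, and with zero kernels the residual vanishes and the block reduces to the identity shortcut, which is the property that makes deep stacks of such blocks easier to train.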

Figure 2: The data flow of the feature engineering and learning. The feature engineering is consistent for all the classification models. The modeling part illustrates the structure of the revised ResNet model that consists of one convolutional layer, one dense layer and 7 ResNet blocks. Each ResNet block is composed of 16 layers.

4 EXPERIMENTS AND RESULTS

Our evaluations are based on the tasks of 1) defining hypotheses to answer each research question; 2) preparing appropriate datasets; 3) defining metrics to evaluate the learning effects, and 4) running experiments and collecting results to test the hypotheses. The hypotheses test if tokenization techniques, embedding techniques, architectural metrics and machine learning models have significant effects on the ability to learn software vulnerabilities.

4.1 Datasets

We prepare three datasets with labelled vulnerabilities: the OWASP Benchmark project [64], the Juliet test suite for Java [11] and 15 Android applications from a previous Android study [65]. The vulnerability types of OWASP and Juliet are available online. For the Android study, we follow the labels published in [65].

4.1.1 OWASP Benchmark Project

The OWASP Benchmark is a free test suite designed to evaluate automated software vulnerability detection tools. It contains 2740 test cases with 1415 vulnerable files (52%) and 1325 non-vulnerable files (48%). Table 1 enumerates the different types of vulnerabilities found in the OWASP project.

Table 1: OWASP vulnerability Types

Vulnerability Area	CWE	# of files
Command Injection	78	251
Weak Cryptography	327	246
Weak Hashing	328	236
LDAP Injection	90	59
Path Traversal	22	268
Secure Cookie Flag	614	67
SQL Injection	89	504
Trust Boundary Violation	501	126
Weak Randomness	330	493
XPath Injection	643	35
XSS (Cross-Site Scripting)	79	455

4.1.2 Juliet Test Suite for Java

This test suite contains 217 vulnerable files (58%) and 297 non-vulnerable files (42%). There are 112 different vulnerabilities and errors such as buffer overflow, OS injection, hard-coded password, absolute path traversal, NULL pointer dereference, uncaught exception, deadlock, missing releases of resource and others listed in Table 2.

Table 2: Juliet Test Suite Vulnerability Types

Vulnerability Area	CWE	# of files
Integer Overflow or Wraparound	190	115
Integer Underflow	191	92
Improper Validation of Array Index	129	72
SQL Injection	89	60
Divide By Zero	369	50
Uncontrolled Memory Allocation	789	42
Uncontrolled Resource Consumption	400	39
HTTP Response Splitting	113	36
Numeric Truncation Error	197	33
Basic Cross-site scripting	80	18
Use of Externally-Controlled Format String	134	18
XPath Injection	643	12
Assignment to Variable without Use	563	12
Unchecked Input for Loop Condition	606	12
OS Command Injection	78	12
Relative Path Traversal	23	12
Unsafe Reflection	470	12
LDAP Injection	90	12
Absolute Path Traversal	36	12
Configuration Setting	15	12
Others		67

As shown in Table 1 and Table 2, the vulnerability types shared by all three datasets are SQL injection (CWE 89), command injection (CWE 78) and cross-site scripting (CWE 79 and 80). OWASP and Juliet additionally have LDAP injection (CWE 90) and XPath injection (CWE 643) in common.

4.1.3 Android Study

The Android Study is a public dataset that contains 20 different Java applications covering a variety of domains. This dataset is used in the work of Scandariato et al. [66]. According to [66], the source code was scanned using the Fortify Source Code Analyzer, a security scanning tool, to mark the vulnerable files. In total, the Android Study contains $2321$ vulnerable files, with vulnerabilities such as cross-site scripting, SQL injection, header manipulation, privacy violation and command injection. The label is binary (vulnerable or not) and does not give the exact type of vulnerability for each file. We collect the application names, the versions, and the path of each file together with its vulnerability label. Using these references, we develop scripts to retrieve 15 projects for our evaluation. Table 3 shows the 17 projects we use in this study (the 15 Android applications together with OWASP and Juliet) and the vulnerability rate of the labelled source code for each. Since Fortify itself may produce errors in the vulnerability scanning, the quality of the labelling is not fully evaluated. This is a potential threat to validity.

Table 3: Dataset Vulnerability Statistics

Projects		Vulnerability rate	Number of files	# of tokens
1	QuickSearchBox	23%	654	4301
2	FBReader	30%	3450	6589
3	Contacts	31%	787	13438
4	Browser	37%	433	9561
5	Mms	37%	865	7965
6	Camera	38%	475	7851
7	KeePassDroid	39%	1580	2872
8	Calendar	44%	307	8003
9	ConnectBot	46%	104	4109
10	Crosswords	46%	842	4223
11	K9	47%	2660	13175
12	Deskclock	47%	127	2163
13	Coolreader	49%	423	5424
14	OWASP	52%	2740	6154
15	Email	54%	840	15454
16	Juliet	58%	514	1268
17	AnkiDroid	59%	275	8408

Table 4: Common vulnerabilities in the three datasets

Vulnerability Area	CWE	OWASP	Juliet	Android
Cross-Site Scripting	79	X	X	X
SQL injection	89	X	X	X
Command Injection	78			X
XPath Injection	643	X	X
OS Command Injection	78	X	X
LDAP Injection	90	X	X

4.2 Analysis of Tokens

Each line of code is parsed to produce tokens including variables, reserved keywords, operators, symbols and separators.

First, we analyze the token statistics to observe whether the tokens show any significant characteristics. For each dataset (OWASP, Juliet, and the Android projects), we separate the tokens in vulnerable files from the tokens in non-vulnerable files and plot the token frequency distributions in Figure 3, Figure 4, and Figure 5 respectively.

For the OWASP project (shown in Figure 3), tokens are mostly grouped in the counts of occurrence that are less than 20. Beyond 20 occurrences, the counts of tokens are significantly smaller. In Juliet source code (shown in Figure 4), the distribution of the token frequency has more peaks than the OWASP token distribution. In all the Android source code (shown in Figure 5), the tokens are mostly grouped with occurrences less than 30.

The charts show two facts: (1) the token frequency distribution varies from project to project, which could be a factor in degrading the accuracy of cross-domain learning; (2) the high-frequency tokens, such as “main” and “string_builder”, are neutral, which indicates that the feature representation is learnt mainly from less frequent tokens. These two facts relate to the experimental results presented in section 4.4.

Table 5: Number of tokens in each dataset according to the vulnerability of the files, and number of tokens common to the vulnerable and non-vulnerable files.

Dataset	# of tokens in vulnerable files	# of tokens in non-vulnerable files	# of common tokens
OWASP	2982	3599	605
Juliet	678	764	460
Android	54196	28698	18339
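The kind of counting behind Table 5 and the frequency plots can be sketched as follows. This is a minimal example and not the tokenizer used in the study: the regular expression is a naive assumption that splits identifiers, numbers and single symbols (it does not treat string literals or comments specially), and the two Java lines are invented for illustration.

```python
import re
from collections import Counter

# Naive token pattern (assumption): identifiers/keywords, integers, single symbols.
TOKEN_PATTERN = re.compile(r"[A-Za-z_]\w*|\d+|[^\s\w]")


def tokenize(line):
    """Split one line of source code into tokens."""
    return TOKEN_PATTERN.findall(line)


def token_counts(lines):
    """Count how often each token occurs in a collection of lines."""
    counts = Counter()
    for line in lines:
        counts.update(tokenize(line))
    return counts


# Hypothetical one-line "files" standing in for vulnerable and non-vulnerable code.
vulnerable_lines = ['String q = "SELECT * FROM t WHERE id=" + id;']
safe_lines = ['String q = "SELECT * FROM t WHERE id=?";']

vulnerable_counts = token_counts(vulnerable_lines)
safe_counts = token_counts(safe_lines)
common_tokens = set(vulnerable_counts) & set(safe_counts)
only_vulnerable = set(vulnerable_counts) - set(safe_counts)
```

In this toy case almost all tokens are shared and only the concatenation operator distinguishes the vulnerable line, which mirrors the observation that frequent tokens tend to be neutral.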

4.3 Evaluation Metrics

For each experiment, we evaluate the performance of vulnerability detection using traditional Information Retrieval metrics. In the context of this study, true positives (TP) are vulnerable source files correctly identified as vulnerable. True negatives (TN) are non-vulnerable source files correctly identified as non-vulnerable. False positives (FP) are non-vulnerable source files incorrectly identified as vulnerable. False negatives (FN) are vulnerable source files incorrectly identified as non-vulnerable. Based on these counts, we define the following metrics to measure the performance of vulnerability detection:

• The precision (P) is the probability that a file classified as vulnerable is truly vulnerable.

$P=\frac{TP}{(TP+FP)}$ (3)

• The recall (R) is the probability that a vulnerable sample of code is classified as vulnerable.

$R=\frac{TP}{(TP+FN)}$ (4)

• The f-measure (F1) is the harmonic mean of precision and recall.

$F1=2\cdot\frac{1}{\frac{1}{P}+\frac{1}{R}}$ (5)

• The false positive rate (FPR) is the proportion of negative cases incorrectly identified as positive cases.

$FPR=\frac{FP}{(FP+TN)}$ (6)

• The area under the precision-recall curve (PR AUC) summarizes the information in the precision-recall curve.

• The area under the receiver operating characteristic curve (ROC AUC) shows the capability of the model to distinguish between classes.
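The count-based metrics in Eq. (3) to Eq. (6) can be computed directly from the four confusion counts, as in the short sketch below. The function name and the example counts are hypothetical, and the sketch does not guard against empty denominators. The two AUC metrics are omitted because they require ranked prediction scores rather than counts.

```python
def detection_metrics(tp, fp, tn, fn):
    """Precision, recall, F1 and false positive rate from confusion counts."""
    precision = tp / (tp + fp)                   # Eq. (3)
    recall = tp / (tp + fn)                      # Eq. (4)
    f1 = 2.0 / (1.0 / precision + 1.0 / recall)  # Eq. (5), harmonic mean
    fpr = fp / (fp + tn)                         # Eq. (6)
    return {"P": precision, "R": recall, "F1": f1, "FPR": fpr}


# Illustrative counts only: 90 vulnerable and 110 non-vulnerable files.
example = detection_metrics(tp=80, fp=20, tn=90, fn=10)
```

For these counts the precision is 0.8, the recall is about 0.89, the F1 score is about 0.84 and the false positive rate is about 0.18.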

To aggregate the above metrics when comparing different hypotheses, we define the aggregation below. We first sum all the metrics defined above using Eq. (7), over the set $S=\{P,R,F1,1-FPR,\mathrm{ROC\,AUC},\mathrm{PR\,AUC}\}$ . We then normalize the sum using Eq. (8), where $N$ is the number of comparison cases.

$x_{i}=\sum_{s\in S}s_{i}$ (7)

$z_{i}=\frac{x_{i}-\min_{k\in[1,N]}(x_{k})}{\max_{k\in[1,N]}(x_{k})-\min_{k\in[1,N]}(x_{k})}$ (8)
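A small sketch of Eq. (7) and Eq. (8) follows. The three score rows are copied from Table 8 (ConnectBot, K9 and QuickSearchBox); the function names and dictionary keys are hypothetical. Because the minimum and maximum are taken over only these three rows here, the resulting $z$ values differ from those printed in Table 8, where $N$ covers more comparison cases.

```python
def aggregate(scores):
    """Eq. (7): sum P, R, F1, 1 - FPR, ROC AUC and PR AUC for one experiment."""
    return (
        scores["P"] + scores["R"] + scores["F1"]
        + (1.0 - scores["FPR"]) + scores["ROC_AUC"] + scores["PR_AUC"]
    )


def normalize(xs):
    """Eq. (8): min-max normalize the aggregated scores over the N cases."""
    lo, hi = min(xs), max(xs)
    if hi == lo:
        # Assumption: when all cases tie, map every case to 0.
        return [0.0 for _ in xs]
    return [(x - lo) / (hi - lo) for x in xs]


rows = [
    {"P": 0.90, "R": 0.86, "F1": 0.88, "FPR": 0.08, "ROC_AUC": 0.89, "PR_AUC": 0.84},
    {"P": 0.94, "R": 0.60, "F1": 0.74, "FPR": 0.05, "ROC_AUC": 0.78, "PR_AUC": 0.78},
    {"P": 0.45, "R": 0.93, "F1": 0.60, "FPR": 0.46, "ROC_AUC": 0.74, "PR_AUC": 0.44},
]
x_values = [aggregate(r) for r in rows]  # approximately 5.29, 4.79, 3.70
z_values = normalize(x_values)
```

The best of the three rows maps to 1, the worst to 0, and the middle row to roughly 0.69.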

4.4 Experiment Design and Hypotheses

To answer the research questions, we first design experiments for each project and then develop experiments for cross-project validation. In the first set of experiments, each project’s source code is divided into training and testing partitions. The vulnerability detection models are trained per project and tested on the same project. Due to space limitations, we include the result tables in the appendix.

We use the $z_{i}$ values to compute the p-value from a paired t test for each comparison. When the p-value is below the significance level $\alpha=0.05$ , the hypothesis of a statistically significant difference is accepted. Otherwise, the hypothesis is rejected. Table 6 shows the p-values obtained for each hypothesis. A detailed analysis is presented in the following sections.

Table 6: p-value obtained from the t test for the 10 hypotheses

	Tokenization	Embeddings			Architectural metrics			Models
Hypothesis	1	2	3	4	5	6	7	8	9	10
p-value	0.01	$1.02\mathrm{e}{-10}$	$1.08\mathrm{e}{-06}$	0.69	$1.4\mathrm{e}{-16}$	$0.56$	$7\mathrm{e}{-15}$	$3.8\mathrm{e}{-06}$	$3.8\mathrm{e}{-12}$	$3\mathrm{e}{-05}$
Conclusion	Reject	Accept	Accept	Reject	Accept	Reject	Accept	Accept	Accept	Accept
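For readers unfamiliar with the paired t test, the sketch below computes the test statistic for two paired lists of $z$ values using only the Python standard library. The four pairs are the random forest $z$ scores of the first four projects in Table 9 (bag-of-words vs. word2vec) and serve purely as an illustration; the tests reported in Table 6 use far more data points. Converting the statistic into a p-value requires the Student t distribution with $n-1$ degrees of freedom, which a statistics package such as SciPy (`scipy.stats.ttest_rel`) provides.

```python
import math
import statistics


def paired_t_statistic(a, b):
    """Return the paired t statistic and its degrees of freedom."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    return mean_diff / (sd_diff / math.sqrt(n)), n - 1


# Subset of Table 9 (RF rows for OWASP, Juliet, AnkiDroid, Browser).
z_bag_of_words = [0.95, 0.98, 0.76, 0.95]
z_word2vec = [0.60, 0.02, 0.85, 0.92]

t_stat, dof = paired_t_statistic(z_bag_of_words, z_word2vec)
```

With only four pairs the statistic is about 1.33 on 3 degrees of freedom, which illustrates why a large number of paired experiments is needed before small p-values such as those in Table 6 can be obtained.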

4.5 Experiment Results for Tokenization (RQ1)

Observation: We examine whether tokenization that removes code comments and/or special symbols improves the detection or introduces noise. We run experiments with tokens that include the comments and symbols (Table 9 in Appendix) and compare them to tokens without them (Table 10). Each table shows the learning scores of 153 experiments (= (3 models x 3 embeddings) per project x 17 projects). Each experiment produces six scores, from which the value of $z$ is computed. In total, 306 data points of $z$ are used to compute the p-value from the t test. The p-value is compared to the significance level $\alpha$ .

Hypothesis Analysis: Hypothesis (1) is defined as follows:

1. There is a statistically significant difference between the results obtained from using all tokens vs. using tokens without comments and symbols.

According to Table 6, the p-value obtained for hypothesis (1) is less than the significance level 0.05. This hypothesis is then rejected, which means there is no statistically significant difference between the two tokenization strategies.

Conclusion: We conclude that comments and symbols do not affect the learning of software vulnerabilities from source code in our experiments.

4.6 Experiment Results for Feature Extraction (RQ2)

Observation: Feature extraction techniques convert tokens into a vector of features. In this experiment, we observe the effects of three feature extraction techniques: (1) bag-of-words, (2) word2vec embedding and (3) fastText. We run experiments with features obtained from each of the three techniques (Table 9 and Table 10). For each embedding technique we have 102 experiments (= (3 models x 2 tokenization methods) per project x 17 projects). Each experiment produces six scores, from which the value of $z$ is computed. In total, 204 data points of $z$ are used to compute the p-value from the t test. The p-value is compared to the significance level $\alpha$ .

Hypothesis Analysis: To compare those three vector representation techniques, we consider the following hypotheses:

2. There is a statistically significant difference between the results obtained from using bag-of-words and word2vec as embeddings.
3. There is a statistically significant difference between the results obtained from using bag-of-words and fastText as embeddings.
4. There is a statistically significant difference between the results obtained from using word2vec and fastText as embeddings.

According to Table 6, hypotheses (2) and (3) are accepted based on the p-values obtained from the t test, but hypothesis (4) is rejected. This means there is a statistically significant difference between bag-of-words and word2vec and also between bag-of-words and fastText. However, there is no statistically significant difference between word2vec and fastText. Additionally, the classification results show that, on average, the precision and recall of the experiments with bag-of-words are 6% higher than those of the other embedding methods. This indicates that bag-of-words is better than the other two techniques for learning vulnerabilities in our experiments.

Conclusion: We choose bag-of-words as the best way to generate embeddings for the remainder of the experiments.

4.7 Experiment Results using Architectural Metrics (RQ3)

Observation: In addition to the NLP-based method, we use architectural metrics as structural features to learn vulnerability patterns from the software repositories. We compare these features with code tokens only: we run experiments using the architectural metrics only and compare them to experiments using the architectural metrics together with the bag-of-words features. Table 11 in the Appendix shows the learning scores of 51 (= 3 models x 17 projects) experiments for each feature set we use. In total, 102 data points of $z$ are used to compute the p-value from the t test. The p-value is compared to the significance level $\alpha$ .

Hypothesis Analysis: The effects of using architecture metrics, extracted from the structures of the project, are explored via three hypotheses:

5. There is a statistically significant difference between 1) the results obtained from using tokens, vs. 2) the results obtained from using the architectural metrics.
6. There is a statistically significant difference between 1) the results obtained from only using tokens, vs. 2) the results obtained from using the combination of architectural metrics and tokens.
7. There is a statistically significant difference between 1) the results obtained from only using the architectural metrics, vs. 2) the results obtained from using the combination of tokens and architectural metrics.

As shown in Table 6, the p-values indicate that hypotheses (5) and (7) are accepted, while hypothesis (6) is rejected. This means there is no significant difference between using tokens as input features with or without architectural metrics. Hypotheses (5) and (7) further indicate that the tokens have a stronger influence on the learning performance than the architectural metrics.

Conclusion: We choose to use tokens without architectural metrics for the remainder of the experiments.

4.8 Experiment Results on Classification models (RQ4)

Observation: We aim to identify whether a certain machine learning model produces better vulnerability detection. We run experiments with each of the three models to compare them (Table 9 and Table 10 in Appendix). For each model, the tables show 102 experiments (= (2 tokenization methods x 3 embeddings) per project x 17 projects). Each experiment produces six scores, from which the value of $z$ is computed. In total, 204 data points of $z$ are used to compute the p-value from the t test. The p-value is compared to the significance level $\alpha$ .

Hypothesis Analysis: To compare the models, we define these three hypotheses:

8. There is a statistically significant difference between the performance of the random forest model and the SVM.
9. There is a statistically significant difference between the performance of the random forest model and the ResNet.
10. There is a statistically significant difference between the performance of the SVM and the ResNet.

According to Table 6, the p-values of the three hypotheses (8), (9), and (10) are less than the significance level, $\alpha$ . All three hypotheses are accepted. Overall the random forest model performs better than the SVM and ResNet in most of the experiments with a precision and recall higher by an average of 8%.

Conclusion: We decide to use the random forest as the model to learn the patterns of vulnerabilities in the cross-project validation experiments.
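The experiment counts quoted in Sections 4.5 to 4.8 follow from the size of the experimental grid, which the short sketch below enumerates. The variable names are hypothetical; the factor of two for the architectural experiments assumes that Table 11 covers two feature sets (architectural metrics alone and combined with bag-of-words), each with 51 experiments as stated in Section 4.7.

```python
from itertools import product

models = ["RF", "SVM", "ResNet"]
embeddings = ["bag-of-words", "word2vec", "fastText"]
tokenizations = ["with comments and symbols", "without comments and symbols"]
n_projects = 17

# Experiments per tokenization strategy (Section 4.5): 3 x 3 x 17 = 153.
per_tokenization = len(list(product(models, embeddings))) * n_projects
# Experiments per embedding technique (Section 4.6): 3 x 2 x 17 = 102.
per_embedding = len(list(product(models, tokenizations))) * n_projects
# Experiments per model (Section 4.8): 2 x 3 x 17 = 102.
per_model = len(list(product(tokenizations, embeddings))) * n_projects

# All token-based experiments: 3 x 3 x 2 x 17 = 306.
token_experiments = len(list(product(models, embeddings, tokenizations))) * n_projects
# Architectural experiments (assumption: two feature sets of 51 each).
architecture_experiments = 2 * len(models) * n_projects

total_experiments = token_experiments + architecture_experiments
```

Under these assumptions the total comes to 408, which agrees with the number of experiments reported in the abstract.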

5 CROSS VALIDATION (RQ5)

In this evaluation, we explore the answer to the question "How transferable is the learning method in predicting vulnerabilities in new projects?". We define two sets of experiments to investigate this question.

Train-One-Predict-Multiple: In this test, a learning model is trained with source code from a single project and then tested on the other projects. We compare the learning performance with existing work [2]. The 15 projects used in this experiment overlap with those used in [2]. We use the same score as in [2] to evaluate the learning performance: we count the number of projects for which both precision and recall exceed a certain threshold. Table 7 reports the comparison between our work and the learning with LSTM models. With a threshold value of 0.7, our results are comparable to the results in [2]. With a threshold of 0.8, our results degrade to an average of 1.4 projects with both precision and recall equal to or greater than 80%.

Table 7: Training-One-Predicting-Multiple compared with the LSTM model in [2] under different threshold settings. The LSTM experiments in [2] were performed using 18 Android projects; in our case we retrieved only the 15 that are still available.

Projects		Random Forest (precision $>$ 70%, recall $>$ 70%)	Random Forest (precision $>$ 80%, recall $>$ 80%)	LSTM [2] (precision $>$ 80%, recall $>$ 80%)
1	Camera	7	4	6
2	FBReader	6	3	6
3	Mms	6	2	6
4	Contacts	6	2	2
5	KeePassDroid	6	2	4
6	ConnectBot	6	2	5
7	AnkiDroid	5	1	5
8	Email	5	0	4
9	Crosswords	4	1	1
10	Browser	4	1	1
11	Coolreader	4	1	6
12	Calendar	4	0	5
13	K9	3	2	8
14	DeskClock	0	0	1
15	QuickSearchBox	0	0	3

Train-Multiple-Predict-One: To further improve the learning performance, we conduct 15-fold cross-validation in which 14 projects from the same domain, the Android projects, are chosen for training and the remaining project is reserved for testing. Table 8 contains the cross-validation results, ordered by precision and recall values. 5 out of the 15 experiments have both precision and recall values equal to or greater than 80%; 10 out of 15 experiments have both precision and recall equal to or greater than 70%. Referring to Table 3, the 5 experiments with precision and recall below 70% have a ratio of vulnerable files below 40%.

Referring to Table 8, cross-project validation improves the learning performance under the 80% threshold to 5 out of 15 projects. This transfer learning approach, which combines the features from the Android project repositories to tune the random forest model, achieves learning performance comparable to the deep learning models ResNet and LSTM [2] (4.2 out of 15 projects).

Table 8: The cross project validation from 15 Android projects, with 5 projects having both precision and recall higher than 80% (ConnectBot, Email, Coolreader, Crosswords, AnkiDroid)

Projects		P	R	F1	FPR	ROC AUC	PR AUC	z
1	ConnectBot	0.90	0.86	0.88	0.08	0.89	0.84	1.00
2	Email	0.90	0.81	0.85	0.10	0.86	0.83	0.95
3	Coolreader	0.88	0.82	0.85	0.11	0.86	0.81	0.93
4	Crosswords	0.81	0.87	0.84	0.14	0.86	0.76	0.89
5	K9	0.94	0.60	0.74	0.05	0.78	0.78	0.81
6	AnkiDroid	0.81	0.86	0.83	0.29	0.78	0.78	0.80
7	Calendar	0.75	0.88	0.81	0.24	0.82	0.71	0.79
8	Camera	0.74	0.87	0.80	0.32	0.77	0.71	0.72
9	FBReader	0.73	0.71	0.72	0.11	0.80	0.61	0.68
10	Contacts	0.69	0.92	0.79	0.39	0.77	0.67	0.67
11	KeePassDroid	0.64	0.90	0.75	0.34	0.78	0.62	0.63
12	Deskclock	0.64	0.88	0.74	0.33	0.77	0.61	0.61
13	Browser	0.70	0.70	0.70	0.17	0.76	0.60	0.61
14	Mms	0.68	0.70	0.70	0.20	0.76	0.59	0.59
15	QuickSearchBox	0.45	0.93	0.60	0.46	0.74	0.44	0.38
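The threshold-based score used in this section, the number of projects whose precision and recall both reach a given threshold, can be reproduced from Table 8 with the sketch below. The precision and recall pairs are copied from Table 8; the function and variable names are hypothetical.

```python
# (precision, recall) per held-out project, as listed in Table 8.
table8 = {
    "ConnectBot": (0.90, 0.86),
    "Email": (0.90, 0.81),
    "Coolreader": (0.88, 0.82),
    "Crosswords": (0.81, 0.87),
    "K9": (0.94, 0.60),
    "AnkiDroid": (0.81, 0.86),
    "Calendar": (0.75, 0.88),
    "Camera": (0.74, 0.87),
    "FBReader": (0.73, 0.71),
    "Contacts": (0.69, 0.92),
    "KeePassDroid": (0.64, 0.90),
    "Deskclock": (0.64, 0.88),
    "Browser": (0.70, 0.70),
    "Mms": (0.68, 0.70),
    "QuickSearchBox": (0.45, 0.93),
}


def projects_at_or_above(results, threshold):
    """Projects whose precision and recall are both >= threshold."""
    return [
        name for name, (precision, recall) in results.items()
        if precision >= threshold and recall >= threshold
    ]


above_80 = projects_at_or_above(table8, 0.80)
```

At the 80% threshold this returns the five projects named in the caption of Table 8. K9 is excluded despite its high precision because its recall is only 0.60, which shows that the score rewards balanced detectors.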

6 DISCUSSION

Our approach consists of comparing the different factors that contribute to the detection of vulnerabilities in source code. The result tables contain the metrics for the different experiments that we have performed. In each experiment we use a Java project with a combination of the aspects explained in the previous sections of this paper. The vulnerability detection models are trained and tested with the same project. Each dataset is separated into a training set and a test set.

Table 9: Singular project vulnerability detection with tokenization with comments across embeddings and machine learning models.

			Bag-of-words							Word2vec							FastText
Project		Classifier	P	R	F1	FPR	ROC AUC	PR AUC	z	P	R	F1	FPR	ROC AUC	PR AUC	z	P	R	F1	FPR	ROC AUC	PR AUC	z
1	OWASP	RF	1.00	1.00	1.00	0.21	1.00	1.00	0.95	0.68	0.76	0.72	0.21	0.70	0.63	0.60	1.00	1.00	1.00	0.21	1.00	1.00	0.95
		ResNet	0.99	0.92	0.95	0.10	0.96	0.95	0.92	0.95	0.93	0.94	0.04	0.94	0.92	0.92	0.76	0.96	0.84	0.04	0.85	0.87	0.82
		SVM	0.99	0.99	0.99	0.24	0.99	0.99	0.93	0.91	0.93	0.92	0.19	0.93	0.89	0.86	0.82	0.90	0.86	0.08	0.93	0.92	0.85
2	Juliet	RF	1.00	1.00	1.00	0.08	1.00	1.00	0.98	0.07	0.05	0.05	0.06	0.29	0.41	0.02	0.12	0.09	0.10	0.04	0.30	0.40	0.05
		ResNet	1.00	0.73	0.84	0.13	0.86	0.84	0.80	0.21	0.18	0.20	0.02	0.34	0.38	0.13	0.38	0.68	0.49	0.11	0.44	0.64	0.42
		SVM	1.00	1.00	1.00	0.07	1.00	1.00	0.98	0.17	0.05	0.07	0.05	0.50	0.42	0.10	0.42	0.23	0.29	0.84	0.50	0.42	0.07
3	AnkiDroid	RF	0.80	0.89	0.84	0.15	0.84	0.76	0.76	0.85	0.97	0.91	0.10	0.89	0.84	0.85	0.85	0.97	0.91	0.10	0.89	0.84	0.85
		ResNet	0.79	0.85	0.82	0.21	0.82	0.75	0.72	0.88	0.50	0.64	0.08	0.71	0.71	0.62	0.80	0.13	0.23	0.06	0.55	0.57	0.35
		SVM	0.80	0.89	0.84	0.13	0.86	0.78	0.78	0.80	0.93	0.86	0.34	0.83	0.78	0.73	0.82	0.93	0.87	0.63	0.85	0.80	0.69
4	Browser	RF	0.97	0.93	0.95	0.08	1.00	1.00	0.95	0.97	0.94	0.95	0.06	0.96	0.93	0.92	0.89	1.00	0.94	0.06	0.97	0.92	0.92
		ResNet	0.94	0.88	0.91	0.09	0.92	0.87	0.87	0.38	0.97	0.55	0.10	0.56	0.38	0.47	0.83	0.91	0.87	0.02	0.90	0.88	0.85
		SVM	0.91	0.94	0.93	0.10	0.94	0.88	0.88	0.82	0.90	0.86	0.17	0.90	0.78	0.79	0.88	0.72	0.79	0.46	0.90	0.85	0.69
5	Calendar	RF	0.87	0.87	0.89	0.00	0.92	0.82	0.85	0.89	0.86	0.88	0.00	0.88	0.84	0.85	1.00	0.86	0.93	0.00	0.92	0.95	0.92
		ResNet	0.85	1.00	0.92	0.00	0.95	0.85	0.90	0.58	0.97	0.73	0.00	0.67	0.58	0.66	0.88	0.48	0.62	0.00	0.71	0.80	0.65
		SVM	0.88	0.95	0.91	0.00	0.94	0.85	0.89	0.86	0.86	0.86	0.00	0.87	0.81	0.83	0.84	0.72	0.78	0.00	0.88	0.91	0.80
6	Camera	RF	0.94	0.91	0.93	0.08	0.94	0.89	0.89	0.89	0.83	0.86	0.08	0.89	0.80	0.81	0.82	0.75	0.78	0.11	0.95	0.84	0.77
		ResNet	0.91	0.91	0.91	0.10	0.93	0.87	0.87	0.55	1.00	0.71	0.05	0.77	0.55	0.65	0.92	0.46	0.61	0.10	0.72	0.76	0.62
		SVM	0.91	0.94	0.93	0.04	0.94	0.88	0.90	0.82	0.77	0.79	0.06	0.80	0.67	0.72	0.72	0.75	0.73	0.37	0.90	0.82	0.66
7	ConnectBot	RF	1.00	1.00	1.00	0.00	1.00	1.00	1.00	1.00	0.80	0.89	0.04	0.90	0.90	0.87	1.00	0.80	0.89	0.00	0.90	0.90	0.88
		ResNet	1.00	1.00	1.00	0.00	1.00	0.90	0.98	1.00	0.07	0.13	0.12	0.53	0.55	0.33	1.00	0.80	0.89	0.12	0.90	0.90	0.85
		SVM	1.00	1.00	1.00	0.02	1.00	1.00	0.99	1.00	0.80	0.89	0.06	0.90	0.90	0.87	1.00	0.80	0.89	0.12	0.90	0.90	0.85
8	Contacts	RF	0.90	0.96	0.93	0.11	0.96	0.88	0.89	0.83	0.86	0.85	0.06	0.89	0.76	0.80	0.89	1.00	0.94	0.06	0.99	0.97	0.94
		ResNet	0.78	0.90	0.83	0.08	0.89	0.73	0.78	0.56	1.00	0.72	0.15	0.81	0.56	0.65	0.79	0.67	0.73	1.00	0.79	0.79	0.48
		SVM	0.90	0.96	0.93	0.21	0.96	0.88	0.86	0.80	0.78	0.79	0.36	0.87	0.77	0.69	0.85	0.80	0.82	1.00	0.93	0.82	0.58
9	Coolreader	RF	1.00	0.98	0.99	0.09	1.00	1.00	0.97	1.00	1.00	1.00	0.03	1.00	1.00	0.99	1.00	1.00	1.00	0.05	1.00	1.00	0.99
		ResNet	1.00	0.93	0.96	0.12	0.96	0.98	0.93	0.80	0.73	0.76	0.08	0.80	0.69	0.70	0.83	0.91	0.87	0.30	0.89	0.79	0.76
		SVM	0.97	0.97	0.97	0.03	0.96	0.93	0.95	1.00	1.00	1.00	0.11	1.00	1.00	0.97	1.00	1.00	1.00	0.39	1.00	1.00	0.91
10	Deskclock	RF	0.89	1.00	0.94	0.02	0.97	0.89	0.92	0.86	1.00	0.92	0.02	0.93	0.86	0.89	0.90	1.00	0.95	0.02	0.99	0.98	0.95
		ResNet	0.88	0.88	0.88	0.03	0.91	0.80	0.84	0.46	1.00	0.63	0.05	0.50	0.46	0.53	0.35	1.00	0.51	0.01	0.50	0.67	0.54
		SVM	0.89	1.00	0.94	0.02	0.97	0.89	0.92	0.80	1.00	0.89	0.04	0.93	0.86	0.87	0.82	1.00	0.90	0.02	1.00	0.99	0.93
11	Email	RF	0.97	0.98	0.97	0.30	0.99	0.99	0.91	0.98	0.98	0.98	0.00	0.99	1.00	0.98	0.91	0.95	0.93	0.00	0.98	0.97	0.94
		ResNet	0.96	0.90	0.93	0.23	0.93	0.96	0.87	0.74	0.88	0.81	0.33	0.75	0.84	0.69	0.76	0.91	0.83	0.56	0.80	0.86	0.67
		SVM	0.95	0.82	0.88	0.23	0.94	0.95	0.84	0.79	0.84	0.81	0.27	0.88	0.89	0.75	0.80	0.92	0.86	0.27	0.91	0.90	0.79
12	FBReader	RF	0.96	0.93	0.94	0.01	0.98	0.98	0.95	0.96	0.95	0.95	0.01	0.99	0.99	0.96	0.97	0.94	0.95	0.06	0.99	0.99	0.95
		ResNet	0.95	0.96	0.96	0.10	0.97	0.93	0.92	0.96	0.94	0.95	0.00	0.96	0.96	0.94	0.97	0.90	0.93	0.01	0.94	0.95	0.93
		SVM	0.95	0.97	0.96	0.00	0.97	0.93	0.95	0.76	0.72	0.74	0.00	0.91	0.83	0.75	0.84	0.91	0.87	0.05	0.95	0.92	0.87
13	K9	RF	0.97	0.99	0.98	0.00	1.00	1.00	0.99	0.99	1.00	0.99	0.00	1.00	1.00	0.99	0.99	1.00	1.00	0.01	1.00	1.00	0.99
		ResNet	0.94	1.00	0.97	0.00	0.97	0.97	0.96	0.94	0.85	0.89	0.01	0.90	0.94	0.88	0.99	1.00	0.99	0.01	0.99	1.00	0.99
		SVM	0.99	1.00	0.99	0.01	0.99	0.99	0.99	0.83	0.88	0.85	0.00	0.92	0.92	0.86	0.98	0.98	0.98	0.01	0.99	0.99	0.98
14	KeePassDroid	RF	0.99	1.00	1.00	0.01	1.00	1.00	1.00	1.00	0.99	1.00	0.01	1.00	1.00	0.99	0.98	0.99	0.98	0.01	1.00	1.00	0.99
		ResNet	0.99	0.99	0.99	0.05	0.99	0.98	0.97	0.97	0.84	0.90	0.03	0.91	0.94	0.89	0.99	1.00	0.99	0.08	1.00	0.99	0.97
		SVM	0.99	0.99	0.99	0.02	0.99	0.98	0.98	0.90	0.82	0.86	0.07	0.95	0.92	0.85	0.98	1.00	0.99	0.42	1.00	0.99	0.89
15	Mms	RF	0.98	0.97	0.98	0.22	1.00	0.99	0.93	0.98	0.97	0.98	0.01	0.98	0.96	0.97	0.94	1.00	0.97	0.01	0.98	0.93	0.96
		ResNet	0.98	0.93	0.96	0.44	0.96	0.94	0.84	0.57	0.98	0.72	0.17	0.78	0.56	0.64	0.86	0.95	0.90	0.32	0.94	0.91	0.82
		SVM	0.98	0.97	0.97	0.12	0.97	0.95	0.93	0.98	0.95	0.97	0.08	0.97	0.95	0.94	0.89	0.93	0.91	0.07	0.98	0.95	0.90
16	Crosswords	RF	1.00	1.00	1.00	0.00	1.00	1.00	1.00	1.00	0.97	0.99	0.00	0.99	0.99	0.98	0.99	1.00	0.99	0.00	1.00	0.99	0.99
		ResNet	1.00	1.00	1.00	0.01	1.00	1.00	1.00	0.94	0.91	0.92	0.11	0.93	0.89	0.88	0.88	0.89	0.88	0.09	0.90	0.82	0.83
		SVM	1.00	0.99	0.99	0.15	0.99	0.99	0.96	0.97	0.92	0.95	0.00	0.97	0.95	0.95	0.95	1.00	0.97	0.05	0.98	0.95	0.95
17	QuickSearchBox	RF	0.95	0.87	0.91	0.01	0.93	0.85	0.96	0.77	0.89	0.83	0.07	0.91	0.71	0.85	0.94	0.94	0.94	0.03	0.96	0.90	1.00
		ResNet	1.00	0.78	0.88	0.00	0.89	0.82	0.93	0.58	0.96	0.72	0.18	0.89	0.56	0.72	0.85	0.79	0.82	0.06	0.87	0.74	0.84
		SVM	0.95	0.87	0.91	0.01	0.93	0.85	0.96	0.74	0.63	0.68	0.06	0.79	0.54	0.66	0.86	0.86	0.90	0.06	0.94	0.84	0.92

Table 10: Singular project vulnerability detection with tokenization without comments and symbols across embeddings and machine learning models.

			Bag-of-Words							Word2vec							FastText
Project		Classifier	P	R	F1	FPR	ROC AUC	PR AUC	z	P	R	F1	FPR	ROC AUC	PR AUC	z	P	R	F1	FPR	ROC AUC	PR AUC	z
1	OWASP	RF	0.99	1.00	0.99	0.21	0.99	0.99	0.94	0.71	0.73	0.72	0.21	0.72	0.65	0.60	0.75	0.83	0.79	0.21	0.89	0.89	0.75
		Resnet	0.82	1.00	0.90	0.17	0.89	0.82	0.83	0.79	0.93	0.86	0.19	0.85	0.77	0.77	0.57	0.58	0.58	0.19	0.57	0.68	0.48
		SVM	0.99	0.99	0.99	0.24	0.99	0.99	0.93	0.88	0.90	0.89	0.31	0.89	0.84	0.78	0.60	0.79	0.68	0.19	0.68	0.67	0.58
2	Juliet	RF	0.03	0.02	0.03	0.04	0.26	0.42	0.00	0.12	0.09	0.10	0.04	0.30	0.40	0.05	0.23	0.18	0.20	0.02	0.52	0.36	0.17
		Resnet	0.33	0.05	0.08	0.04	0.35	0.42	0.11	0.22	0.09	0.13	0.03	0.43	0.40	0.12	0.47	0.34	0.37	0.07	0.54	0.54	0.34
		SVM	0.41	0.02	0.03	0.04	0.68	0.54	0.21	0.20	0.09	0.13	0.03	0.47	0.42	0.13	0.42	0.16	0.23	0.02	0.62	0.46	0.27
3	Anki-Android	RF	0.81	0.96	0.88	0.08	0.88	0.80	0.83	0.78	0.93	0.85	0.08	0.84	0.76	0.78	0.84	0.96	0.90	0.08	0.90	0.83	0.85
		Resnet	0.82	1.00	0.90	0.11	0.90	0.82	0.84	0.78	0.93	0.85	0.05	0.84	0.76	0.79	0.82	0.52	0.64	0.00	0.71	0.66	0.61
		SVM	0.81	0.96	0.88	0.16	0.90	0.82	0.82	0.80	0.89	0.84	0.13	0.84	0.76	0.77	0.77	0.85	0.81	0.09	0.65	0.57	0.66
4	Browser	RF	0.94	0.91	0.93	0.04	0.94	0.89	0.90	0.94	0.94	0.94	0.04	0.95	0.90	0.91	0.93	0.84	0.89	0.04	0.96	0.87	0.87
		Resnet	0.88	0.85	0.87	0.03	0.89	0.81	0.83	0.88	0.91	0.90	0.03	0.92	0.84	0.86	0.81	0.91	0.86	0.07	0.89	0.77	0.81
		SVM	0.91	0.91	0.91	0.05	0.94	0.88	0.89	0.89	1.00	0.94	0.07	0.96	0.89	0.91	0.90	0.82	0.86	0.06	0.88	0.80	0.81
5	Calendar	RF	0.88	0.95	0.91	0.00	0.94	0.85	0.89	0.73	0.70	0.71	0.00	0.77	0.62	0.65	0.81	0.74	0.77	0.00	0.82	0.70	0.73
		Resnet	0.78	0.95	0.86	0.00	0.90	0.76	0.82	0.76	0.70	0.73	0.00	0.78	0.64	0.68	0.78	0.89	0.83	0.00	0.84	0.75	0.79
		SVM	0.88	0.95	0.91	0.00	0.94	0.85	0.89	0.71	0.74	0.72	0.00	0.78	0.62	0.67	0.77	0.74	0.76	0.00	0.80	0.67	0.71
6	Camera	RF	0.87	0.91	0.93	0.07	0.94	0.89	0.88	0.87	0.83	0.85	0.05	0.89	0.77	0.81	0.92	0.88	0.90	0.05	0.97	0.94	0.90
		Resnet	0.88	0.88	0.88	0.92	0.90	0.83	0.64	0.78	0.88	0.82	0.05	0.89	0.72	0.77	0.81	0.85	0.83	0.07	0.88	0.85	0.80
		SVM	0.91	0.91	0.91	0.04	0.94	0.88	0.89	0.88	0.92	0.90	0.07	0.93	0.93	0.88	0.81	0.81	0.81	0.08	0.89	0.74	0.76
7	Connectbot	RF	1.00	1.00	1.00	0.00	1.00	1.00	1.00	1.00	1.00	1.00	0.00	1.00	1.00	1.00	1.00	1.00	1.00	0.00	1.00	1.00	1.00
		Resnet	1.00	1.00	1.00	0.00	1.00	1.00	1.00	1.00	1.00	1.00	0.00	1.00	1.00	1.00	1.00	0.82	0.90	0.00	0.91	0.93	0.90
		SVM	1.00	1.00	1.00	0.00	1.00	1.00	1.00	1.00	1.00	1.00	0.00	1.00	1.00	1.00	1.00	1.00	1.00	0.00	1.00	1.00	1.00
8	Contacts	RF	0.85	0.96	0.90	0.06	0.98	0.95	0.90	0.93	0.91	0.92	0.05	0.94	0.88	0.89	0.96	0.89	0.92	0.06	0.93	0.89	0.89
		Resnet	0.83	0.92	0.87	0.08	0.92	0.89	0.85	0.90	0.67	0.77	0.15	0.81	0.72	0.70	0.81	0.87	0.84	0.06	0.88	0.74	0.78
		SVM	0.80	0.67	0.73	0.07	0.90	0.84	0.73	0.86	0.77	0.81	0.14	0.85	0.76	0.75	0.85	0.83	0.84	0.14	0.74	0.62	0.70
9	CoolReader	RF	1.00	0.97	0.99	0.04	0.99	0.99	0.97	1.00	1.00	1.00	0.01	1.00	1.00	1.00	1.00	0.98	0.99	0.04	0.99	0.99	0.98
		Resnet	1.00	0.97	0.99	0.04	0.99	0.99	0.97	0.98	1.00	0.99	0.01	0.99	0.98	0.98	1.00	1.00	1.00	0.10	1.00	1.00	0.98
		SVM	0.97	0.97	0.97	0.09	0.96	0.93	0.94	0.98	1.00	0.99	0.00	0.99	0.98	0.98	1.00	0.98	0.99	0.03	0.99	0.99	0.98
10	DeskClock	RF	0.89	1.00	0.94	0.02	0.97	0.89	0.92	0.91	0.83	0.87	0.02	0.88	0.83	0.84	0.92	0.79	0.85	0.02	0.93	0.87	0.84
		Resnet	0.98	1.00	0.89	0.01	0.94	0.80	0.91	0.75	0.75	0.75	0.01	0.77	0.68	0.69	0.50	0.07	0.13	0.01	0.49	0.54	0.23
		SVM	0.89	1.00	0.94	0.01	0.97	0.89	0.93	0.83	0.83	0.83	0.02	0.85	0.77	0.79	0.80	0.57	0.67	0.02	0.86	0.88	0.71
11	Email	RF	0.97	0.98	0.97	0.50	1.00	1.00	0.86	0.93	0.99	0.96	0.00	0.98	0.97	0.96	0.97	0.98	0.97	0.00	0.97	0.96	0.97
		Resnet	0.93	1.00	0.96	0.41	0.95	0.97	0.86	0.97	0.75	0.85	0.47	0.86	0.87	0.72	0.91	0.99	0.95	0.45	0.93	0.91	0.82
		SVM	0.96	0.85	0.90	0.50	0.96	0.97	0.80	0.97	0.98	0.97	0.45	0.97	0.96	0.86	0.94	0.95	0.94	0.45	0.93	0.92	0.82
12	FBReaderJ	RF	0.96	0.96	0.97	0.01	0.98	0.95	0.96	0.98	0.95	0.97	0.02	0.97	0.95	0.95	0.97	0.95	0.96	0.02	1.00	0.99	0.96
		Resnet	0.96	0.96	0.96	0.01	0.97	0.93	0.95	0.95	0.90	0.93	0.00	0.94	0.89	0.91	0.92	0.82	0.87	0.01	0.90	0.90	0.86
		SVM	0.95	0.97	0.96	0.02	0.97	0.93	0.94	0.93	0.85	0.89	0.01	0.91	0.83	0.86	0.75	0.50	0.60	0.01	0.86	0.69	0.62
13	K9Mail	RF	0.99	1.00	0.99	0.00	0.99	0.99	0.99	0.98	0.99	0.98	0.00	1.00	1.00	0.99	0.99	0.99	0.99	0.00	1.00	1.00	0.99
		Resnet	0.99	0.99	0.99	0.00	0.99	0.99	0.99	1.00	0.94	0.97	0.00	0.97	0.97	0.96	0.92	0.91	0.91	0.01	0.91	0.94	0.90
		SVM	0.99	1.00	1.00	0.01	0.99	0.99	0.99	0.98	1.00	0.99	0.00	0.98	0.97	0.98	0.77	0.88	0.82	0.00	0.85	0.80	0.79
14	KeePassAndroid	RF	1.00	1.00	1.00	0.01	1.00	1.00	1.00	0.99	1.00	1.00	0.01	1.00	0.99	0.99	0.99	1.00	1.00	0.01	1.00	1.00	1.00
		Resnet	1.00	1.00	1.00	0.03	1.00	1.00	0.99	0.99	1.00	1.00	0.03	1.00	0.99	0.99	1.00	0.99	0.99	0.03	0.99	1.00	0.98
		SVM	0.98	0.97	0.97	0.03	1.00	1.00	0.97	0.99	1.00	1.00	0.03	1.00	0.99	0.99	0.83	0.86	0.84	0.01	0.92	0.84	0.83
15	MMS	RF	0.98	0.97	0.97	0.01	0.98	0.96	0.96	0.96	0.98	0.97	0.01	0.98	0.95	0.96	0.96	0.98	0.97	0.01	0.98	0.95	0.96
		Resnet	0.98	0.91	0.94	0.28	0.95	0.96	0.87	0.96	0.67	0.79	0.00	0.82	0.76	0.76	0.91	0.95	0.93	0.00	0.95	0.89	0.92
		SVM	0.98	0.97	0.97	0.12	0.98	0.96	0.94	0.96	0.98	0.97	0.04	0.97	0.94	0.95	0.96	0.98	0.97	0.09	0.98	0.95	0.94
16	Xwords	RF	1.00	1.00	1.00	0.00	1.00	1.00	1.00	0.99	1.00	0.99	0.00	0.99	0.99	0.99	1.00	1.00	1.00	0.00	1.00	1.00	1.00
		Resnet	1.00	0.99	0.99	0.00	0.99	0.99	0.99	0.86	1.00	0.92	0.00	0.92	0.86	0.90	0.95	0.25	0.40	0.01	0.62	0.62	0.49
		SVM	1.00	0.99	0.99	0.01	0.99	0.99	0.99	1.00	0.99	0.99	0.00	0.99	0.98	0.99	1.00	1.00	1.00	0.00	1.00	1.00	1.00
17	QuickSearchBox	RF	0.95	0.87	0.91	0.01	0.93	0.85	0.96	0.88	0.81	0.85	0.03	0.89	0.76	0.87	0.93	0.95	0.94	0.03	0.96	0.90	1.00
		Resnet	0.95	0.91	0.93	0.01	0.95	0.89	0.99	0.77	0.89	0.83	0.07	0.91	0.71	0.85	0.86	0.99	0.92	0.07	0.96	0.86	0.97
		SVM	0.95	0.87	0.91	0.01	0.93	0.85	0.96	0.89	0.89	0.89	0.03	0.90	0.82	0.92	0.85	0.88	0.86	0.07	0.89	0.77	0.88

Table 11: Singular project vulnerability detection with bag-of-words and architectural metrics.

			Metrics only							Metrics + bag-of-words
Project		Classifier	P	R	F1	FPR	ROC AUC	PR AUC	z	P	R	F1	FPR	ROC AUC	PR AUC	z
1	OWASP	RF	0.66	0.78	0.71	0.38	0.70	0.62	0.56	0.82	0.85	0.84	0.17	0.95	0.96	0.73
		Resnet	0.48	1.00	0.65	1.00	0.50	0.48	0.32	0.70	0.89	0.79	0.34	0.77	0.82	0.51
		SVM	0.57	0.93	0.70	0.66	0.64	0.56	0.48	0.67	0.74	0.70	0.74	0.82	0.85	0.30
2	Juliet	RF	0.50	0.41	0.45	0.23	0.59	0.42	0.33	1.00	0.88	0.93	0.00	0.94	0.92	0.88
		Resnet	0.35	0.97	0.52	1.00	0.48	0.35	0.21	1.00	0.81	0.90	0.00	0.91	0.88	0.82
		SVM	0.00	0.00	0.00	0.00	0.50	0.36	0.01	1.00	0.84	0.92	0.00	0.94	0.92	0.86
3	Anki-Android	RF	0.62	0.73	0.67	0.26	0.74	0.55	0.55	0.87	0.91	0.89	0.08	0.92	0.82	0.76
		Resnet	0.36	1.00	0.53	1.00	0.50	0.36	0.23	0.83	0.86	0.84	0.10	0.88	0.76	0.67
		SVM	0.71	0.23	0.34	0.05	0.50	0.36	0.32	0.88	0.95	0.91	0.08	0.94	0.85	0.81
4	Browser	RF	0.72	0.62	0.67	0.16	0.73	0.60	0.59	0.94	0.91	0.93	0.04	0.94	0.89	0.84
		Resnet	0.00	0.00	0.00	0.00	0.50	0.40	0.02	0.91	0.91	0.91	0.06	0.93	0.87	0.81
		SVM	0.00	0.00	0.00	0.00	0.50	0.40	0.02	0.89	0.94	0.93	0.06	0.94	0.88	0.83
5	Calendar	RF	1.00	0.94	0.97	0.00	0.97	0.98	1.00	1.00	1.00	1.00	0.00	1.00	1.00	1.00
		Resnet	0.90	0.53	0.67	0.08	0.72	0.75	0.66	0.75	0.88	0.81	0.42	0.73	0.85	0.50
		SVM	0.90	0.53	0.67	0.08	0.72	0.75	0.66	1.00	0.88	0.94	0.00	1.00	1.00	0.94
6	Camera	RF	0.57	0.60	0.59	0.20	0.70	0.46	0.47	0.90	0.96	0.93	0.05	0.96	0.88	0.85
		Resnet	0.39	0.56	0.46	0.39	0.59	0.35	0.28	0.84	0.88	0.86	0.07	0.90	0.77	0.71
		SVM	0.00	0.00	0.00	0.00	0.50	0.30	0.00	0.90	0.96	0.93	0.05	0.96	0.88	0.85
7	Connectbot	RF	0.80	0.76	0.78	0.15	0.80	0.71	0.71	1.00	1.00	1.00	0.00	1.00	1.00	1.00
		Resnet	0.45	0.46	0.45	0.46	0.50	0.45	0.26	1.00	0.97	0.99	0.00	0.99	0.99	0.98
		SVM	0.65	0.54	0.61	0.47	0.43	0.04	0.25	0.92	0.93	0.93	0.06	0.98	0.97	0.88
8	Contacts	RF	0.89	1.00	0.94	0.06	0.97	0.89	0.95	0.89	1.00	0.94	0.06	0.89	0.92	0.85
		Resnet	0.31	1.00	0.47	1.00	0.50	0.31	0.19	0.80	1.00	0.89	0.11	0.94	0.80	0.76
		SVM	0.54	0.88	0.67	0.33	0.77	0.51	0.55	0.89	1.00	0.94	0.06	0.89	0.92	0.85
9	CoolReader	RF	0.83	0.86	0.84	0.20	0.83	0.79	0.77	0.99	0.96	0.97	0.01	0.97	0.97	0.94
		Resnet	0.61	0.84	0.71	0.62	0.61	0.60	0.48	0.98	0.96	0.97	0.03	0.97	0.96	0.93
		SVM	0.63	0.77	0.69	0.52	0.62	0.61	0.49	0.98	0.96	0.97	0.03	0.97	0.96	0.93
10	DeskClock	RF	0.83	0.68	0.75	0.06	0.81	0.67	0.71	0.98	0.96	0.97	0.01	0.97	0.95	0.94
		Resnet	0.46	0.53	0.49	0.29	0.62	0.39	0.35	0.98	0.91	0.94	0.01	0.95	0.92	0.89
		SVM	0.00	0.00	0.00	0.00	0.50	0.32	0.00	0.97	0.97	0.97	0.01	0.97	0.94	0.93
11	Email	RF	0.00	0.00	0.00	0.00	0.50	0.42	0.03	1.00	1.00	1.00	0.00	1.00	1.00	1.00
		Resnet	0.41	1.00	0.60	1.00	0.50	0.42	0.28	1.00	0.98	0.99	0.00	0.99	0.99	0.98
		SVM	0.00	0.00	0.00	0.00	0.50	0.42	0.03	1.00	1.00	1.00	0.00	1.00	1.00	1.00
12	FBReaderJ	RF	0.80	0.82	0.81	0.20	0.81	0.74	0.73	1.00	1.00	1.00	0.00	1.00	1.00	1.00
		Resnet	0.47	0.38	0.42	0.41	0.48	0.48	0.25	1.00	0.98	0.99	0.00	0.99	0.99	0.98
		SVM	0.64	0.86	0.74	0.47	0.70	0.62	0.56	0.99	1.00	0.99	0.01	0.99	0.99	0.98
13	K9Mail	RF	0.90	0.83	0.86	0.07	0.88	0.82	0.84	0.99	0.99	0.99	0.01	0.99	0.98	0.98
		Resnet	0.71	0.27	0.39	0.08	0.59	0.51	0.39	0.46	1.00	0.63	0.90	0.55	0.46	0.00
		SVM	0.00	0.00	0.00	0.00	0.50	0.43	0.03	0.99	0.99	0.99	0.01	0.99	0.98	0.98
14	KeePassAndroid	RF	0.86	0.83	0.84	0.07	0.88	0.77	0.81	0.98	0.97	0.97	0.01	0.98	0.96	0.95
		Resnet	0.43	0.71	0.54	0.45	0.63	0.40	0.36	0.95	0.95	0.96	0.01	0.97	0.95	0.92
		SVM	0.57	0.55	0.56	0.20	0.68	0.68	0.50	0.98	0.97	0.97	0.01	0.97	0.95	0.94
15	MMS	RF	0.52	0.89	0.65	0.82	0.54	0.51	0.37	1.00	1.00	1.00	0.00	1.00	1.00	1.00
		Resnet	0.51	0.49	0.50	0.46	0.51	0.50	0.31	1.00	1.00	1.00	0.00	1.00	1.00	1.00
		SVM	0.50	1.00	0.66	1.00	0.50	0.50	0.33	0.99	0.99	0.99	0.01	0.99	0.99	0.98
16	Xwords	RF	0.86	0.87	0.87	0.16	0.86	0.82	0.82	1.00	1.00	1.00	0.00	1.00	1.00	1.00
		Resnet	0.61	0.92	0.74	0.65	0.64	0.61	0.51	0.98	0.95	0.96	0.03	0.96	0.98	0.93
		SVM	0.75	0.67	0.70	0.25	0.71	0.68	0.60	1.00	0.93	0.96	0.00	0.99	0.99	0.95
17	QuickSearchBox	RF	0.65	0.74	0.69	0.08	0.83	0.53	0.67	0.95	0.87	0.91	0.01	0.93	0.85	0.96
		Resnet	0.50	0.04	0.08	0.01	0.52	0.19	0.16	0.95	0.87	0.91	0.01	0.93	0.85	0.96
		SVM	0.00	0.00	0.00	0.00	0.50	0.18	0.00	0.95	0.87	0.91	0.01	0.93	0.85	0.96

Tables 9, 10, and 11 report the results of the 408 experiments performed to evaluate classification on the three source code domains. Table 9 contains the results obtained using all tokens of the source code files. Table 10 contains the results obtained using the source code files with comments and symbols removed. Table 11 contains the results of using the architectural metrics as features, alone and combined with bag-of-words.
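The P, R, F1, and FPR columns in these tables follow the standard confusion-matrix definitions. The sketch below is illustrative only: the function names and the convention of returning 0.0 for an undefined ratio are assumptions, not taken from the authors' code.

```python
def classification_metrics(tp, fp, tn, fn):
    """Return precision, recall, F1 and false positive rate from confusion counts.

    Assumption: an undefined ratio (zero denominator) is reported as 0.0.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return {"P": precision, "R": recall,
            "F1": f1_from_pr(precision, recall), "FPR": fpr}


def f1_from_pr(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

For example, `f1_from_pr(0.71, 0.73)` rounds to 0.72, which matches the OWASP random forest row under Word2vec in Table 10. Under the zero-denominator convention assumed here, a classifier that never predicts the vulnerable class gets P = R = F1 = FPR = 0, which is consistent with the all-zero rows in the "Metrics only" half of Table 11.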

These tables and the statistical test performed in Section 4.4 show that 95% of the learning metrics are above 0.77 across more than 400 experiments. The tokenization choice, that is, whether or not comments and symbols are removed, does not affect how well the models learn the vulnerabilities. Using the architectural metrics as features yields no significant improvement in learning the vulnerabilities in this case. One reason is that the complexity of the code and its dependencies is not captured by the tokens alone. As a result, a baseline emerges that uses feature representations extracted through bag-of-words embedding together with the random forest model. This baseline increases the accuracy by about 4% compared to other combinations of the examined factors.
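The two tokenization strategies and the bag-of-words representation can be sketched in a few lines. This is a minimal illustration under stated assumptions: the regular expressions, the `tokenize` and `bag_of_words` names, and the handling of numbers are hypothetical, and string literals that contain comment markers are not treated specially. The resulting count vectors would then be passed to a classifier such as a random forest.

```python
import re
from collections import Counter

# Assumed patterns for Java comments and tokens (illustrative, not the authors').
BLOCK_COMMENT = re.compile(r"/\*.*?\*/", re.DOTALL)
LINE_COMMENT = re.compile(r"//[^\n]*")
WORD = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")
ANY_TOKEN = re.compile(r"[A-Za-z_][A-Za-z0-9_]*|\d+|[^\sA-Za-z0-9_]")


def tokenize(source, keep_comments_and_symbols=True):
    """Split source code into tokens under one of the two strategies."""
    if keep_comments_and_symbols:
        # All tokens: identifiers, numbers, and every symbol character.
        return ANY_TOKEN.findall(source)
    # Without comments and symbols: drop comments, keep identifiers/keywords.
    stripped = BLOCK_COMMENT.sub(" ", source)
    stripped = LINE_COMMENT.sub(" ", stripped)
    return WORD.findall(stripped)


def bag_of_words(tokens, vocabulary):
    """Count how often each vocabulary term occurs in the token list."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]
```

With the input `'int x = 1; // set x\n/* note */ return x;'`, the comment-free strategy keeps only `int`, `x`, `return`, `x`, so the word `set` from the comment never reaches the feature vector.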

We further conducted cross-validation to observe how transferable the vulnerability signatures are across domains. When training on a single project and predicting on multiple projects, an average of 4.4 projects reach a precision and recall higher than 70%. With 15-fold cross-validation, training on multiple projects and predicting on one project, the baseline model slightly outperforms the LSTM model with a proprietary embedding method [2].
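The two cross-project protocols can be outlined as follows. This is a sketch under assumptions: the helper names are hypothetical, each fold holds out exactly one project, and a project counts as successfully predicted when both precision and recall exceed the threshold.

```python
def leave_one_project_out(projects):
    """Yield (training_projects, held_out_project), one fold per project."""
    for held_out in projects:
        training = [p for p in projects if p != held_out]
        yield training, held_out


def count_transferable(scores, threshold=0.7):
    """Count target projects whose precision and recall both exceed threshold.

    `scores` maps a target project name to a (precision, recall) pair.
    """
    return sum(1 for p, r in scores.values() if p > threshold and r > threshold)
```

With 15 projects, `leave_one_project_out` produces 15 folds, each training on 14 projects and predicting on the remaining one. Averaging `count_transferable` over every choice of single training project gives a figure of the same kind as the 4.4 projects reported above.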

7 THREATS TO VALIDITY

We summarize the threats to validity, including (1) the dataset and (2) the limited architectural metrics, relative to internal validity; and (3) the domains of the experiments, relative to external validity.

Dataset. We chose to use publicly available datasets that were previously labelled. The OWASP dataset and the Juliet dataset contain both source code files and vulnerability labels. The Android Study dataset includes only information on the tagged files, without the source code files. We retrieved the source code according to the file names and project versions. We only used Java source code because much of the C++ source code lacks data labels. The number of projects examined for transferability is still too limited to reach a statistically significant conclusion.

The vulnerable code labels for the Android Study projects follow the data in [66]. The labels were determined by Fortify [67]. It has been recognized that static code analysis tools may produce false positive labels. In the literature [22, 24], path and commit data have been mined to identify vulnerable code. In the security development and operation process, this was addressed by manual correction. In this work, we focus on the factors that contribute to the baseline, and thus assume that the labels are of stable quality.

Architectural Metrics. The token-based feature representation is a flattened structure, which we combine with aggregated architectural metrics. The architectural metrics have not contributed significantly to the learning, which indicates either that the current learning representation does not embed the architectural metrics optimally or that other kinds of learning models should be applied to them. This remains a topic for further research.

Table 12: Cross domain comparison to observe how transferable the vulnerability signature is

	Juliet	OWASP	Android
Juliet	Table 9	P: 0.54, R: 0.77	P: 0.44, R: 0.53
OWASP	P: 0.4, R: 0.8	Table 9	1 out of 15 projects (precision and recall greater than 0.7); P: 0.74, R: 0.39
Android	P: 0.4, R: 0.46	P: 0.49, R: 0.64	Table 9

Cross Domain Validation. Cross-domain validation means training a model on datasets from one domain and predicting vulnerabilities on datasets from another domain. The three datasets presented in this paper (OWASP, Juliet, and Android) are from different domains. The previous discussion of the vulnerable files and types in Tables 1, 2, and 3 shows this heterogeneity. Table 12 shows that the learning performance degrades across domains. A key contributing factor is the disparity between vulnerability signatures. Our cross-domain validation is also limited because we could assess only three different domains.

8 CONCLUSION

This paper sets out to reveal the factors that contribute most to detecting software vulnerabilities. The observations from 17 Java projects and over 400 experiments lead to a baseline model that guides the choice of tokenization techniques, embedding methods, and machine learning models. Trained with cross-validation within the same project domain, the baseline model achieves learning performance comparable to, and slightly better than, that of models using deep learning networks. This provides a reference for the minimum learning performance that a future vulnerability detection approach should achieve. We observe that cross-domain learning is limited by the extent to which vulnerability signatures differ between domains. We envision a promising research direction that integrates transfer learning techniques into a software DevOps process and feeds target-domain inputs to augment the training from the source domain.

9 ACKNOWLEDGMENTS

We thank our colleague Jincheng Sun for providing the ResNet model.

References

National Institute of Standards and Technology [n.d.] National Institute of Standards and Technology, Vulnerability definition, Computer Security Resource Center. URL: https://csrc.nist.gov/glossary/term/vulnerability.
Dam et al. [2019] H. Dam, T. Tran, T. Pham, S. Ng, J. Grundy, A. Ghose, Automatic feature learning for predicting vulnerable software components, IEEE Transactions on Software Engineering (2019).
Li et al. [2018] Z. Li, D. Zou, S. Xu, X. Ou, H. Jin, S. Wang, Z. Deng, Y. Zhong, Vuldeepecker: A deep learning-based system for vulnerability detection, arXiv preprint arXiv:1801.01681 (2018).
Ghaffarian and Shahriari [2017] S. M. Ghaffarian, H. R. Shahriari, Software vulnerability analysis and discovery using machine-learning and data-mining techniques: A survey, ACM Comput. Surv. 50 (2017) 56:1–56:36.
Shin et al. [2010] Y. Shin, A. Meneely, L. Williams, J. A. Osborne, Evaluating complexity, code churn, and developer activity metrics as indicators of software vulnerabilities, IEEE transactions on software engineering 37 (2010) 772–787.
Russell et al. [2018] R. Russell, L. Kim, L. Hamilton, T. Lazovich, J. Harer, O. Ozdemir, P. Ellingwood, M. McConley, Automated vulnerability detection in source code using deep representation learning, in: 2018 17th IEEE International Conference on Machine Learning and Applications (ICMLA), IEEE, 2018, pp. 757–762.
Zimmermann et al. [2010] T. Zimmermann, N. Nagappan, L. Williams, Searching for a needle in a haystack: Predicting security vulnerabilities for windows vista, in: Proceedings of the 2010 Third International Conference on Software Testing, Verification and Validation, ICST ’10, IEEE Computer Society, Washington, DC, USA, 2010, pp. 421–428. URL: http://dx.doi.org/10.1109/ICST.2010.32. doi:10.1109/ICST.2010.32.
National Institute of Standards and Technology [2018] National Institute of Standards and Technology, Common vulnerabilities and exposures, 2018. URL: cve.mitre.org.
MITRE Corporation [2006] MITRE Corporation, The common weakness enumeration community, 2006. URL: cwe.mitre.org/community/.
National Institute of Standards and Technology [n.d.] National Institute of Standards and Technology, Software assurance reference dataset. URL: samate.nist.gov/SARD/.
National Institute of Standards and Technology [2017] National Institute of Standards and Technology, Juliet test suite for java v1.3, 2017. URL: https://samate.nist.gov/SRD/testsuite.php.
National Institute of Standards and Technology [2015] National Institute of Standards and Technology, Applications, 2015. URL: https://samate.nist.gov/SRD/testsuite.php#applications.
Kals et al. [2006] S. Kals, E. Kirda, C. Krügel, N. Jovanovic, Secubat: a web vulnerability scanner, 2006, pp. 247–256. doi:10.1145/1135777.1135817.
PortSwigger [n.d.] PortSwigger, Burp suite web vulnerability scanner. URL: portswigger.net/burp/.
Acunetix [n.d.] Acunetix, Acunetix web vulnerability scanner. URL: www.acunetix.com/vulnerability-scanner/.
Netsparker [n.d.] Netsparker, Netsparker web vulnerability scanner. URL: www.netsparker.com/web-vulnerability-scanner/.
Wheeler [n.d.] D. A. Wheeler, Flawfinder. URL: https://www.dwheeler.com/flawfinder/.
Checkmarx [n.d.] Checkmarx, Checkmarx software security platform. URL: www.checkmarx.com/.
Secure Software, Inc. [n.d.] Secure Software, Inc., Rough audit tool for security. URL: code.google.com/archive/p/rough-auditing-tool-for-security/.
Nadeem et al. [2012] M. Nadeem, B. J. Williams, E. B. Allen, High false positive detection of security vulnerabilities: A case study, in: Proceedings of the 50th Annual Southeast Regional Conference, ACM-SE ’12, Association for Computing Machinery, New York, NY, USA, 2012, p. 359–360. URL: https://doi.org/10.1145/2184512.2184604. doi:10.1145/2184512.2184604.
Hovsepyan et al. [2012] A. Hovsepyan, R. Scandariato, W. Joosen, J. Walden, Software vulnerability prediction using text analysis techniques, in: Proceedings of the 4th international workshop on Security measurements and metrics, ACM, 2012, pp. 7–10.
Perl et al. [2015] H. Perl, S. Dechand, M. Smith, D. Arp, F. Yamaguchi, K. Rieck, S. Fahl, Y. Acar, Vccfinder: Finding potential vulnerabilities in open-source projects to assist code audits, in: Proceedings of the 22Nd ACM SIGSAC Conference on Computer and Communications Security, CCS ’15, ACM, New York, NY, USA, 2015, pp. 426–437. URL: http://doi.acm.org/10.1145/2810103.2813604. doi:10.1145/2810103.2813604.
Mikolov et al. [2013] T. Mikolov, K. Chen, G. Corrado, J. Dean, Efficient estimation of word representations in vector space, arXiv preprint arXiv:1301.3781 (2013).
Zhou and Sharma [2017] Y. Zhou, A. Sharma, Automated identification of security issues from commit messages and bug reports, in: Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering, ESEC/FSE 2017, ACM, New York, NY, USA, 2017, pp. 914–919. URL: http://doi.acm.org/10.1145/3106237.3117771. doi:10.1145/3106237.3117771.
Feng et al. [2016] Q. Feng, R. Kazman, Y. Cai, R. Mo, L. Xiao, Towards an architecture-centric approach to security analysis, in: 2016 13th Working IEEE/IFIP Conference on Software Architecture (WICSA), IEEE, 2016, pp. 221–230.
Sachitano et al. [2004] A. Sachitano, R. O. Chapman, J. Hamilton, Security in software architecture: a case study, in: Proceedings from the Fifth Annual IEEE SMC Information Assurance Workshop, 2004., IEEE, 2004, pp. 370–376.
Sohr and Berger [2010] K. Sohr, B. Berger, Idea: Towards architecture-centric security analysis of software, in: International Symposium on Engineering Secure Software and Systems, Springer, 2010, pp. 70–78.
Almorsy et al. [2013] M. Almorsy, J. Grundy, A. S. Ibrahim, Automated software architecture security risk analysis using formalized signatures, in: 2013 35th International Conference on Software Engineering (ICSE), IEEE, 2013, pp. 662–671.
Schwanke et al. [2013] R. Schwanke, L. Xiao, Y. Cai, Measuring architecture quality by structure plus history analysis, in: 2013 35th International Conference on Software Engineering (ICSE), IEEE, 2013, pp. 891–900.
Mo et al. [2016] R. Mo, Y. Cai, R. Kazman, L. Xiao, Q. Feng, Decoupling level: A new metric for architectural maintenance complexity, in: 2016 International Conference on Software Engineering (ICSE), IEEE, 2016.
Wang and Manning [2012] S. Wang, C. Manning, Baselines and bigrams: Simple, good sentiment and topic classification, 2012, pp. 90–94.
Google [n.d.] Google, Machine learning glossary. URL: https://developers.google.com/machine-learning/glossary#baseline.
Zhang et al. [2010] Y. Zhang, R. Jin, Z.-H. Zhou, Understanding bag-of-words model: A statistical framework, International Journal of Machine Learning and Cybernetics 1 (2010) 43–52.
Joulin et al. [2016] A. Joulin, E. Grave, P. Bojanowski, T. Mikolov, Bag of tricks for efficient text classification, CoRR abs/1607.01759 (2016).
Tin Kam Ho [1995] Tin Kam Ho, Random decision forests, in: Proceedings of 3rd International Conference on Document Analysis and Recognition, volume 1, 1995, pp. 278–282 vol.1.
Cortes and Vapnik [1995] C. Cortes, V. Vapnik, Support-vector networks, Mach. Learn. 20 (1995) 273–297.
He et al. [2015] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, 2015. arXiv:1512.03385.
Yamaguchi et al. [2011] F. Yamaguchi, F. Lindner, K. Rieck, Vulnerability extrapolation: Assisted discovery of vulnerabilities using machine learning, in: Proceedings of the 5th USENIX Conference on Offensive Technologies, WOOT’11, USENIX Association, Berkeley, CA, USA, 2011, pp. 13–13. URL: http://dl.acm.org/citation.cfm?id=2028052.2028065.
Pang et al. [2015] Y. Pang, X. Xue, A. S. Namin, Predicting vulnerable software components through n-gram analysis and statistical feature selection, in: 2015 IEEE 14th International Conference on Machine Learning and Applications (ICMLA), 2015, pp. 543–548. doi:10.1109/ICMLA.2015.99.
Milosevic et al. [2017] N. Milosevic, A. Dehghantanha, K.-K. R. Choo, Machine learning aided android malware classification, Computers & Electrical Engineering 61 (2017) 266–274.
Nataraj et al. [2011] L. Nataraj, S. Karthikeyan, G. Jacob, B. Manjunath, Malware images: Visualization and automatic classification (2011).
Radjenović et al. [2013] D. Radjenović, M. Heričko, R. Torkar, A. Živkovič, Software fault prediction metrics: A systematic literature review, Information and software technology 55 (2013) 1397–1418.
Basili et al. [1996] V. R. Basili, L. C. Briand, W. L. Melo, A validation of object-oriented design metrics as quality indicators, IEEE Transactions on Software Engineering 22 (1996) 751–761.
Nagappan et al. [2006] N. Nagappan, T. Ball, A. Zeller, Mining metrics to predict component failures, in: Proceedings of the 28th International Conference on Software Engineering, ICSE ’06, ACM, New York, NY, USA, 2006, pp. 452–461. URL: http://doi.acm.org/10.1145/1134285.1134349. doi:10.1145/1134285.1134349.
Jackson and Bennett [2018] K. A. Jackson, B. T. Bennett, Locating sql injection vulnerabilities in java byte code using natural language techniques, in: SoutheastCon 2018, 2018, pp. 1–5. doi:10.1109/SECON.2018.8478870.
Lin et al. [2020] G. Lin, S. Wen, Q.-L. Han, J. Zhang, Y. Xiang, Software vulnerability detection using deep neural networks: A survey, Proceedings of the IEEE 108 (2020) 1825–1848.
Brosig et al. [2011] F. Brosig, N. Huber, S. Kounev, Automated extraction of architecture-level performance models of distributed component-based systems, in: 2011 26th IEEE/ACM International Conference on Automated Software Engineering (ASE 2011), IEEE, 2011, pp. 183–192.
Sohr and Berger [2010] K. Sohr, B. Berger, Idea: Towards architecture-centric security analysis of software, in: F. Massacci, D. Wallach, N. Zannone (Eds.), Engineering Secure Software and Systems, Springer Berlin Heidelberg, Berlin, Heidelberg, 2010, pp. 70–78.
Bidan and Issarny [1997] C. Bidan, V. Issarny, Security benefits from software architecture, in: D. Garlan, D. Le Métayer (Eds.), Coordination Languages and Models, Springer Berlin Heidelberg, Berlin, Heidelberg, 1997, pp. 64–80.
Oliveira Antonino et al. [2010] P. Oliveira Antonino, S. Duszynski, C. Jung, M. Rudolph, Indicator-based architecture-level security evaluation in a service-oriented environment, 2010, pp. 221–228. doi:10.1145/1842752.1842795.
Alkussayer and Allen [2011] A. Alkussayer, W. H. Allen, Security risk analysis of software architecture based on ahp, in: 7th International Conference on Networked Computing, 2011, pp. 60–67.
Alkussayer and Allen [2010] A. Alkussayer, W. H. Allen, A scenario-based framework for the security evaluation of software architecture, in: 2010 3rd International Conference on Computer Science and Information Technology, volume 5, 2010, pp. 687–695. doi:10.1109/ICCSIT.2010.5564015.
Mellado et al. [2010] D. Mellado, E. Fernández-Medina, M. Piattini, A comparison of software design security metrics, in: ECSA ’10, 2010.
Jain and Ingle [2011] S. Jain, M. Ingle, A review of security metrics in software development process, 2011.
Alshammari et al. [2010] B. Alshammari, C. Fidge, D. Corney, Security metrics for object-oriented designs, in: 2010 21st Australian Software Engineering Conference, 2010, pp. 55–64. doi:10.1109/ASWEC.2010.34.
Bojanowski et al. [2016] P. Bojanowski, E. Grave, A. Joulin, T. Mikolov, Enriching word vectors with subword information, arXiv preprint arXiv:1607.04606 (2016).
Pedregosa et al. [2011] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, E. Duchesnay, Scikit-learn: Machine learning in Python, Journal of Machine Learning Research 12 (2011) 2825–2830.
Řehůřek and Sojka [2010] R. Řehůřek, P. Sojka, Gensim: Software Framework for Topic Modelling with Large Corpora, in: Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, ELRA, Valletta, Malta, 2010, pp. 45–50. URL: http://is.muni.cz/publication/884893/en.
JetBrains [2019] JetBrains, Jetbrains/intellij-community, 2019. URL: https://github.com/JetBrains/intellij-community.
Android [n.d.] Android, Android development platform - git repository. URL: https://android.googlesource.com/platform/development.git.
Bass et al. [2012] L. Bass, P. Clements, R. Kazman, Software architecture in practice, 3 ed., Addison-Wesley Professional, 2012.
Cai et al. [2013] Y. Cai, H. Wang, S. Wong, L. Wang, Leveraging design rules to improve software architecture recovery, in: Proceedings of the 9th international ACM Sigsoft conference on Quality of software architectures, ACM, 2013, pp. 133–142.
Gilpin et al. [2019] L. H. Gilpin, D. Bau, B. Z. Yuan, A. Bajwa, M. Specter, L. Kagal, Explaining explanations: An overview of interpretability of machine learning, 2019. arXiv:1806.00069.
OWASP [2018] OWASP, Owasp benchmark project, 2018. URL: https://www.owasp.org/index.php/Benchmark.
Scandariato et al. [2014] R. Scandariato, J. Walden, A. Hovsepyan, W. Joosen, Android study, 2014. URL: https://sites.google.com/site/textminingandroid/.
Scandariato et al. [2014] R. Scandariato, J. Walden, A. Hovsepyan, W. Joosen, Predicting vulnerable software components via text mining, IEEE Transactions on Software Engineering 40 (2014) 993–1006.
Fortify [n.d.] Fortify, Fortify. URL: https://www.joinfortify.com/.