Software Vulnerability Prediction Using Text Analysis Techniques
base. The disadvantage of this approach is that the learning may fail to create any meaningful features. In the following section, we present the proposed approach in detail.

3. OUR APPROACH

As Java is a language, we looked at Java files as text. The starting point for our approach is the source code of a software system that consists of a number of Java files. Each file is transformed into a feature vector where every word (also called a “monogram” in text processing) within that file is treated as a feature.

Before splitting the source code of a file into a set of words representing the features, we run a preprocessing step. Certain blocks in the source code files are likely to pollute the prediction model. Such blocks are, for instance, the comments. Indeed, we believe that it is rather unlikely that comments could have an impact on the vulnerability of a file. Hence, in a preprocessing step we filter out all the comments from the source. For the same reasons, we also filter out all strings and numerical values.

In order to transform the preprocessed source code into a feature vector, we need to tokenize the textual representation of the source into a set of monograms. As delimiters we have chosen to use not only white spaces, but also the Java “punctuation” characters (such as “. , ; ) ( } { ] [”) as well as mathematical and logical operators (such as “+ - / * ^ | || & && !”). In a feature vector each monogram (i.e., feature) must also have an assigned value. We use the count of a given monogram in the source code of a given file as its value.

Consider Figure 1, which depicts the HelloWorld.java file.

Figure 1: Hello World Java File

In order to transform this file into a corresponding feature vector we filter out all the comments from this file as well as the “Hello World!” string. What remains from the source of this file is tokenized into a feature vector that treats each monogram as a feature. Hence, the feature vector of the HelloWorld.java file becomes:

class:1, HelloWorldApp:1, public:1, static:1, void:1, main:1, String:1, args:1, System:1, out:1, println:1

where each of the monograms is followed by a count (in this case 1). Note that in this example we do not follow any particular (e.g., SVM) notation.

During the learning phase each file represented as a feature vector also has a vulnerability label assigned to it. We use this training set to build a prediction model. Throughout this paper we consider a binary classification scheme where a file is either classified as vulnerable or clean. Once the prediction model is created from the training set, we can use this prediction model to predict the vulnerability of arbitrary files, each represented as a feature vector.

We have leveraged the concept of the support vector machine (SVM) for both the training phase, where a prediction model is built from a set of training examples, and the prediction phase, where a feature vector is classified based on the previously built prediction model. In our initial exploration, we have used a radial basis function kernel with a set of parameters (cost and gamma) that are selected by running a grid search algorithm. The precise details of the training algorithm are out of the scope of this paper.

4. PRELIMINARY RESULTS

We have performed an initial exploration of the presented approach using a concrete application. In this section, we briefly present the preliminary results of our investigation.

4.1 Application

Market analysis has shown that consumers have been purchasing more smart phones than PCs since the last quarter of 2010 [1]. Hence, a potential vulnerability in any mobile application may affect a huge number of users. Most of these mobile applications are running on the Android platform [7]. This is why we have chosen to investigate the vulnerabilities of mobile applications developed for the Android platform. Repositories containing a large version history of open source mobile applications for the Android platform are readily available and represent an ideal testbed for our approach. For the purposes of our initial exploration, we have selected 19 versions of the K9 mail client application spread over a period of 22 months. The timespan between consecutive versions is approximately one month. We have used the first version in order to build the prediction model and we have predicted the vulnerabilities of the files of each subsequent version using this prediction model.

In order to assign the vulnerability labels we have leveraged the state-of-the-practice Fortify tool [3] that analyzes the source code for various known types of software security vulnerabilities. Fortify not only spots a vulnerability, but also assigns a severity to each vulnerability found. In our exploratory work, we have treated a file as vulnerable if Fortify has assigned any type of vulnerability to it and as clean otherwise. By using Fortify we rely on vulnerabilities that are extracted during a static analysis of the source (based on common vulnerabilities and exposures) rather than reported vulnerabilities. Systematic studies have shown strong correlations between such static analysis metrics and the quantity of subsequently reported vulnerabilities [6]. Nevertheless, this issue is rather controversial as commercial tools are said to produce a high number of false positives [2].

4.2 Results

We have used the version k9-2.504 to build the prediction model. We assessed the model performance (in terms of prediction power) by means of three indicators:

• Accuracy is the percentage of correct results.

• Precision is the probability that a file classified as vulnerable is indeed vulnerable.

• Recall is the probability that a vulnerable file is classified as such.

Figures 2 and 3 illustrate the initial results that we have obtained. The main observation is that the prediction model scores very high (above 80%) for all three indicators. Figure 2 also shows the positive rate of the application, i.e., the percentage of vulnerable files, which is between 40% and 60%. Therefore, a “naive” classifier that classifies all files as vulnerable (or alternatively as clean) would achieve an accuracy in the range of 40% to 60% as well. This range is a baseline for the accuracy indicator and our approach performs substantially better compared to the baseline.

In the future, we plan to further investigate the presented approach by looking at various alternatives in building the feature vector. We also plan to investigate the possibility of building a vulnerability prediction model that uses the six-class classification supported by Fortify (i.e., non-vulnerable, vulnerable with severity 1 to 5). Finally, we believe that our approach is complementary to existing techniques that use, e.g., internal metrics for building a prediction model. Hence, an even more interesting research track would be to expand our approach to use a feature vector that consists both of the complete source code treated as text and a list of code metrics.
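
As an illustration of the preprocessing and tokenization steps described in Section 3, the following is a minimal sketch in Python. It is not the implementation used in this work: the names STRIP, NUMBERS, DELIMITERS, HELLO_WORLD and monogram_counts are hypothetical, the regular expressions are simplifying assumptions that do not cover every Java lexical corner case, the delimiter set contains only the characters listed in Section 3, and the sample source is an assumed Hello World file rather than a copy of Figure 1.

```python
import re
from collections import Counter

# Assumption: one pass that removes string literals, character literals,
# block comments and line comments, whichever starts first in the text.
STRIP = re.compile(
    r'"(?:\\.|[^"\\])*"'      # string literals
    r"|'(?:\\.|[^'\\])*'"     # character literals
    r"|/\*.*?\*/"             # block comments
    r"|//[^\n]*",             # line comments
    re.DOTALL,
)
# Assumption: a numerical value is a token that starts with a digit.
NUMBERS = re.compile(r"\b\d[\w.]*\b")
# White space plus the punctuation and operator characters named in Section 3.
DELIMITERS = re.compile(r"[\s.,;(){}\[\]+\-/*^|&!]+")

# Hypothetical sample input (not taken from Figure 1).
HELLO_WORLD = r'''
/**
 * The HelloWorldApp class prints "Hello World!" to standard output.
 */
class HelloWorldApp {
    public static void main(String[] args) {
        System.out.println("Hello World!"); // Display the string.
    }
}
'''


def monogram_counts(java_source: str) -> Counter:
    """Map each monogram in the source to its number of occurrences."""
    text = STRIP.sub(" ", java_source)   # drop comments and strings
    text = NUMBERS.sub(" ", text)        # drop numerical values
    tokens = [t for t in DELIMITERS.split(text) if t]
    return Counter(tokens)


if __name__ == "__main__":
    for monogram, count in monogram_counts(HELLO_WORLD).items():
        print(f"{monogram}:{count}")
```

Applied to the assumed sample file, the sketch yields the eleven monograms listed in Section 3, each with a count of 1. Characters outside the listed delimiter set (for example "=" or "<") would remain part of the tokens under this assumption.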
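
The three indicators of Section 4.2 and the naive baseline can be made concrete with a small sketch in Python. The function name indicators and the toy labels below are illustrative assumptions, not data from this study. With a positive rate of 40%, a classifier that labels every file as vulnerable reaches an accuracy of 40%, and one that labels every file as clean reaches 60%, which is why the positive rate bounds the accuracy baseline.

```python
def indicators(actual, predicted):
    """Return (accuracy, precision, recall) for binary labels.

    True means "vulnerable", False means "clean". Precision or recall
    is None when its denominator is zero.
    """
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a and p)          # vulnerable, flagged
    fp = sum(1 for a, p in pairs if not a and p)      # clean, flagged
    fn = sum(1 for a, p in pairs if a and not p)      # vulnerable, missed
    tn = sum(1 for a, p in pairs if not a and not p)  # clean, not flagged
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else None
    recall = tp / (tp + fn) if tp + fn else None
    return accuracy, precision, recall


if __name__ == "__main__":
    # Toy data (assumption): 10 files, 4 of them vulnerable.
    actual = [True] * 4 + [False] * 6
    print(indicators(actual, [True] * 10))   # naive "all vulnerable"
    print(indicators(actual, [False] * 10))  # naive "all clean"
```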
Figure 2: Accuracy vs % of vulnerable files identified by Fortify

Figure 3: Precision vs recall

6. REFERENCES
[1] Android rises, Symbian and Windows Phone 7 launch as worldwide smartphone shipments increase 87.2% year over year, according to IDC (2011), http://www.idc.com/
[2] Austin, A., Williams, L.: One technique is not enough: A comparison of vulnerability discovery techniques. In: ESEM. pp. 97–106 (2011)
[3] Fortify: Fortify. https://www.fortify.com/ (2011)
[4] Neuhaus, S., Zimmermann, T., Holler, C., Zeller, A.: Predicting vulnerable software components. In: Proceedings of the 14th ACM Conference on Computer and Communications Security (October 2007)
[5] Shin, Y., Meneely, A., Williams, L., Osborne, J.A.: Evaluating complexity, code churn, and developer activity metrics as indicators of software vulnerabilities. IEEE Trans. Software Eng. 37(6), 772–787 (2011)
[6] Walden, J., Doyle, M.: Savi: Static analysis vulnerability indicator. IEEE Security and Privacy (to appear) (2012)
[7] Zeman, E.: Android, iOS crush BlackBerry market share (2011), http://www.informationweek.com
[8] Zimmermann, T., Nagappan, N., Williams, L.: Searching for a needle in a haystack: Predicting security vulnerabilities for Windows Vista. In: Proceedings of the 3rd International Conference on Software Testing, Verification and Validation (April 2010)