Ahts04 Sandia National Laboratories: Multimodal Deep Learning For Flaw Detection in Software Programs
Abstract
We explore the use of multiple deep learning models for detecting flaws in software pro-
grams. Current, standard approaches for flaw detection rely on a single representation of
a software program (e.g., source code or a program binary). We illustrate that, by us-
ing techniques from multimodal deep learning, we can simultaneously leverage multiple
representations of software programs to improve flaw detection over single representation
analyses. Specifically, we adapt three deep learning models from the multimodal learning
literature for use in flaw detection and demonstrate how these models outperform tradi-
tional deep learning models. We present results on detecting software flaws using the Juliet
Test Suite and Linux Kernel.
Keywords: multimodal deep learning, software flaw detection
1. Introduction
Efficient, reliable, hardened software plays a critical role in cybersecurity. Auditing software
for flaws is still largely a manual process. Manual review of software becomes increasingly
intractable as software grows in size and is constantly updated through increasingly rapid
version releases. Thus, there is a need to automate the identification of flawed code where
possible, enabling auditors to make the most efficient use of their scant time and attention.
Most current approaches for flaw detection rely on analysis of a single representation
of a software program (e.g., source code or program binary compiled in a specific way for
a specific hardware architecture). This is often the case when analyzing commercial soft-
ware or embedded system software developed by external groups, where program binaries
are available without associated source code or build environments. However, different pro-
gram representations can 1) contain unique information, and 2) provide specific information
more conveniently than others. For instance, source code may include variable names and
comments that are removed through compilation, and program binaries may omit irrele-
vant code detected through compiler optimization processes. Thus, there can be an added
benefit in understanding software using multiple program representations.
In this paper, we explore the use of multiple program representations (i.e., source code
and program binary) to develop machine learning models capable of flaw detection, even
when only a single representation is available for analysis (i.e., a program binary). Our
contributions include:
• The first application of multimodal learning for software flaw prediction (as far as we
are aware);
• A comparative study of three deep learning architectures for multimodal learning
applied to software flaw prediction; and
• A data set of software flaws with alignment across source code and binary function
instances that can be used by the multimodal learning and software analysis research
communities for benchmarking new methods.
2. Related Work
Multimodal learning is a general approach for relating multiple representations of data
instances. For example, in speech recognition applications, Ngiam et al. demonstrated
that audio and video inputs could be used in a multimodal learning framework to 1) create
improved speech classifiers over using single modality inputs, and 2) reconstruct input
modalities that were missing at model evaluation time (also known as crossmodal learning)
[17].
Much of the work in multimodal and crossmodal learning has focused on recognition
and prediction problems associated with two or more representations of transcript, audio,
video, or image data [4, 17, 22, 5, 23, 20]. We aim to leverage those methods and results to
develop an improved software flaw detector that learns from both source code and binary
representations of software programs.
Recent research has demonstrated the utility of using deep learning to identify flaws
in source code by using statistical features captured from source, sequences of source code
tokens, and program graphs [7, 1]. Other research efforts have demonstrated the utility of
using deep learning to identify flaws in software binaries [24, 14, 21, 15, 2]. In all of this
previous work, there is clear evidence that using deep learning over traditional machine
learning methods helps improve flaw prediction. More recently, Harer, et
al. [11] demonstrated improved flaw modeling by combining information from both the
source and binary representations of code. Although not presented as such, the latter work
can be considered an instance of multimodal learning using early fusion. Our work differs
from Harer, et al. [11] in that we employ models that learn joint representations of the
input modalities, which have been demonstrated to outperform early fusion models for
many applications [4, 17, 22, 5, 23, 20].
Gibert et al. [8] describe recent advances in malware detection and classification in
a survey covering multimodal learning approaches. They argue that deep learning and
multimodal learning approaches provide several benefits over traditional machine learn-
ing methods for such problems. Although the survey covers similar multimodal learning
methods to those we investigate here (i.e., intermediate fusion methods), those methods
for malware analysis leverage sequences of assembly instructions and Portable Executable
metadata/import information as features, whereas in our work we leverage source code and
static binary features.
3. Data
We use two data sets, described below, to assess the performance of our methods in detecting
flaws. Flaws are labeled in source code at the function level; i.e., if one or more flaws appear
in a function, we label that function as flawed, otherwise it is labeled as not flawed. We
compile all source code in these data sets using the GCC 7.4.0 compiler [gcc.gnu.org] with
debug flags, which enables mapping of functions between source code and binaries. Since
we focus on multimodal deep learning in this work, we use only the functions that we can
map one-to-one between the source code and binary representations in our experiments.
The Juliet Test Suite CWE categories used in our experiments, with the number of flawed and not flawed functions in each, are listed below.

CWE  Flaw Description               # Flawed   # Not Flawed
121  Stack Based Buffer Overflow        6346          16868
190  Integer Overflow                   3296          12422
369  Divide by Zero                     1100           4142
377  Insecure Temporary File             146            554
416  Use After Free                      152            779
476  NULL Pointer Dereference            398           1517
590  Free Memory Not on Heap             956           2450
680  Integer to Buffer Overflow          368            938
789  Uncontrolled Mem Alloc              612           2302
78   OS Command Injection               6102          15602
For the flaw-injected Linux Kernel data, we assume that 1) the original code is not flawed, and 2) by changing the code we have injected a flaw. We expect each of these assumptions to generally hold for stable software products. There are more than 530,000 functions in the Linux Kernel 5.1 source code. We corrupt 12,791 calls to memcpy in 8,273 of these functions. After compiling using the default configuration, these flaws appear in 1,011 functions identified in the binaries.
4. Methods
4.1 Multimodal Deep Learning Models
We investigate three multimodal deep learning architectures for detecting flaws in software
programs. Figure 1 presents high level schematics of the different architectures, highlighting
the main differences between them. These models represent both the current diversity of
architecture types in joint representation multimodal learning as well as the best performing
models across those types [3, 23, 5, 22].
Figure 1: The general architectures for the three approaches examined in this work: (a) CorrNet,
(b) JAE, and (c) BiDNN.
The Correlational Neural Network (CorrNet) model is a deep autoencoder architecture that has shown promise in several applications [23]. The main distinguishing feature of CorrNet models is the use of a loss function term taken from Canonical Correlation Analysis (CCA) models, where the goal is to find a reduced-dimension vector space in which linear projections of two data sets are maximally correlated [13]. The contribution of this CCA loss term in CorrNet is weighted, using a scalar term denoted as λ, to balance the impact of the CCA loss with the other autoencoder loss function terms. The implementation of the CorrNet model used in our experiments follows Wang, et al. [23].
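As an illustration, the following is a minimal sketch (not the authors' implementation) of how such a CCA-style correlation loss term can be computed in PyTorch for a mini-batch; the tensor names h_src and h_bin, the eps stabilizer, and the lambda_corr weight mentioned in the comments are illustrative assumptions.

```python
# Minimal sketch of a CorrNet-style correlation loss term; h_src and h_bin are
# assumed to be the hidden representations of the source-code and binary
# modalities for one mini-batch (shape: batch_size x hidden_dim).
import torch

def correlation_loss(h_src, h_bin, eps=1e-8):
    # Center each hidden dimension across the batch.
    h_src = h_src - h_src.mean(dim=0, keepdim=True)
    h_bin = h_bin - h_bin.mean(dim=0, keepdim=True)
    # Per-dimension correlation between the two projected views.
    cov = (h_src * h_bin).sum(dim=0)
    norm = torch.sqrt(((h_src ** 2).sum(dim=0) * (h_bin ** 2).sum(dim=0)) + eps)
    corr = cov / norm
    # CorrNet maximizes correlation, so the loss contribution is its negative sum,
    # weighted by the scalar lambda discussed above:
    #   loss = reconstruction_terms + lambda_corr * (-corr.sum())
    return -corr.sum()
```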
4.2 Experimental Design
Our experiments are designed to answer the following questions:
• Can multimodal deep learning models improve flaw prediction accuracy over other deep learning approaches?
• How sensitive is the performance of multimodal models with respect to model parameter choices as applied to software flaw prediction?
• Are there differences in flaw prediction performance across the various deep learning models explored?
The data is first normalized, per feature, to have a sample mean of 0 and a sample standard deviation of 1, and then split into standard training (80%), validation (10%), and testing (10%) sets. In the cases where a feature is constant across all training set instances, the sample standard deviation is set to 1 so that the feature does not adversely impact the network weight assignments in the validation and testing phases. We use 5-fold cross validation to fit and evaluate instances of each model to assess how well our approach generalizes.
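A minimal sketch of this preprocessing follows, assuming each modality's features are held in a NumPy array of shape (instances, features); the function names and split indices are illustrative.

```python
# Minimal sketch of per-feature standardization with constant-feature handling.
import numpy as np

def fit_standardizer(X_train):
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    # A constant feature has zero sample standard deviation; use 1 instead so the
    # feature maps to 0 on the training data and cannot blow up on validation/test data.
    std[std == 0.0] = 1.0
    return mean, std

def apply_standardizer(X, mean, std):
    return (X - mean) / std

# Example 80/10/10 split; the standardizer is fit on the training portion only.
# X_train, X_val, X_test = np.split(X, [int(0.8 * len(X)), int(0.9 * len(X))])
# mean, std = fit_standardizer(X_train)
# X_train, X_val, X_test = [apply_standardizer(s, mean, std) for s in (X_train, X_val, X_test)]
```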
We implement each architecture in PyTorch v1.5.1 using Linear layers with LeakyReLU activations on every layer except the final one. In all experiments, we use PyTorch's Adam optimizer with default parameters, 100 training epochs, and, except in the Initialization experiment, PyTorch's default Kaiming initialization with LeakyReLU gain adjustment. To regularize the models, we use the best performing parameters as evaluated on each fold's validation set. In all experiments except the Architecture Size experiment, each of the Encoder, Mixing, and Decoder blocks contains a single layer of 50 nodes. For all decisions not explicitly stated or varied within an experiment, we rely on PyTorch's default behavior.
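The following sketch illustrates this general layout for a joint-representation model with single-layer, 50-node Encoder, Mixing, and Decoder blocks; the class and attribute names are illustrative assumptions, not the authors' exact implementation.

```python
# Minimal sketch of a two-modality autoencoder with Encoder, Mixing, and Decoder
# blocks built from Linear layers and LeakyReLU activations (no activation on the
# final layers), as described above.
import torch
import torch.nn as nn

class MultimodalAutoencoder(nn.Module):
    def __init__(self, src_dim, bin_dim, hidden=50):
        super().__init__()
        self.enc_src = nn.Sequential(nn.Linear(src_dim, hidden), nn.LeakyReLU())
        self.enc_bin = nn.Sequential(nn.Linear(bin_dim, hidden), nn.LeakyReLU())
        self.mixing = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.LeakyReLU())
        self.dec_src = nn.Linear(hidden, src_dim)  # final layers: no activation
        self.dec_bin = nn.Linear(hidden, bin_dim)

    def forward(self, x_src, x_bin):
        h_src, h_bin = self.enc_src(x_src), self.enc_bin(x_bin)
        joint = self.mixing(torch.cat([h_src, h_bin], dim=1))
        return self.dec_src(joint), self.dec_bin(joint), joint
```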
We construct a deep neural network with the same number of parameters as the multi-
modal deep learning models and use it as the baseline classifier in our experiments. These
baseline classifier models are composed of the Encoder and Mixing layers (see Figure 1)
followed by two Linear layers. This approach is an instance of early fusion multimodal deep learning, which was demonstrated in [11] to improve over single modality deep learning models for flaw prediction.
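A minimal sketch of such an early-fusion baseline, under the same assumptions as the autoencoder sketch above; the head width and number of classes are illustrative.

```python
# Minimal sketch of the early-fusion baseline: Encoder and Mixing layers followed
# by two Linear layers that produce class logits.
import torch
import torch.nn as nn

class EarlyFusionBaseline(nn.Module):
    def __init__(self, src_dim, bin_dim, hidden=50, n_classes=2):
        super().__init__()
        self.enc_src = nn.Sequential(nn.Linear(src_dim, hidden), nn.LeakyReLU())
        self.enc_bin = nn.Sequential(nn.Linear(bin_dim, hidden), nn.LeakyReLU())
        self.mixing = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.LeakyReLU())
        self.head = nn.Sequential(nn.Linear(hidden, hidden), nn.LeakyReLU(),
                                  nn.Linear(hidden, n_classes))

    def forward(self, x_src, x_bin):
        fused = torch.cat([self.enc_src(x_src), self.enc_bin(x_bin)], dim=1)
        return self.head(self.mixing(fused))
```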
Our experiments consist of predicting flaws on several Juliet Test Suite CWE test cases
and the flaw-injected Linux Kernel data, both described in Section 3. We also perform
several experiments using the multimodal deep learning models on the flaw-injected Linux
Kernel data to study the impact of the construction and training of these models on the
performance of predicting flaws. Specifically, we assess the impacts of 1) the size and
shape of the neural networks, 2) the method used for setting initial neural network weights,
3) using both single and multimodal inputs versus multimodal inputs alone, and 4) the
amount of correlation that is used in training the CorrNet models. We include the results
of these additional experiments to provide insight into the robustness of the multimodal
deep learning models in predicting flaws.
5. Results
In this section we present the results of our experiments of using multimodal deep learning
models for flaw prediction. In stable software products, the number of functions containing
flaws is much smaller than the number of functions that do not contain flaws. This class
imbalance of flawed and not flawed functions is also reflected in the Juliet Test Suite and
flaw-injected Linux Kernel data that we use for our experiments. Thus, when reporting
model performance, we report accuracy weighted by the inverse of the size of the class to
control for any bias.
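Concretely, the weighted accuracy we report can be computed as in the following minimal sketch, assuming y_true and y_pred are integer label arrays; for two classes this reduces to balanced accuracy.

```python
# Minimal sketch of accuracy weighted by the inverse of each class's size.
import numpy as np

def class_weighted_accuracy(y_true, y_pred):
    classes = np.unique(y_true)
    # Weighting each instance by the inverse of its class size is equivalent to
    # averaging the per-class accuracies (recalls).
    per_class = [np.mean(y_pred[y_true == c] == c) for c in classes]
    return float(np.mean(per_class))

# Example: three "not flawed" (0) instances and one "flawed" (1) instance.
# class_weighted_accuracy(np.array([0, 0, 0, 1]), np.array([0, 0, 1, 1]))  # ~0.83
```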
Figure 2: Flaw Detection Results on Juliet Test Suite (CWE) and Linux Kernel Data Sets. [Bar chart omitted: weighted accuracy (y-axis, approximately 0.40 to 1.00) for the Baseline, GCN [Arakelyan et al., 2020], CorrNet, JAE, and BiDNN models across the Juliet CWE categories and the Linux Kernel data.]
As noted in Section 3, there are numerous published results on using the Juliet Test Suite
in assessing deep learning approaches to flaw prediction. However, existing results cover a wide variety of flaws across the various CWE categories, and there is no single set of CWEs used in all assessments. To enable comparison of the multimodal deep learning models with published results, we also include in Figure 2 recent results from a graph convolutional network (GCN) deep learning model for flaw prediction [2] for the CWEs that overlap between our results and theirs.
In all experiments, multimodal deep learning models perform significantly better than
the baseline deep learning models. Moreover, in all but one experiment, the multimodal
deep learning models perform better (and often significantly better) than the published
results for the GCN deep learning models. In that one exception, CWE416, the multimodal
deep learning models perform the worst across data explored in our research presented here.
We hypothesize that the diminished performance of our multimodal methods is partly due
to the small size of the data (there are fewer than 1000 total functions in CWE416), as deep learning models often require a lot of training data for good performance. However, it
could also be due to the specific type of flaw (Use After Free) in that category. Determining
the sources of such differences between published results and the results we present here is
left as future work.
Compared with other published results on predicting flaws in the Juliet Test Suite, the
multimodal deep learning models perform as well as (and often better than) existing machine
learning approaches. Although not as many specific, direct comparisons can be made as
with the GCN results discussed above, we present here several comparisons with published
results. Instruction2vec [14] uses convolutional neural networks on assembly instruction
input to achieve 0.97 accuracy on CWE121 compared to the multimodal deep learning
model results of greater than 0.99 accuracy shown in Figure 2. VulDeeLocator [15] uses
bidirectional recurrent neural networks on source code and intermediate representations
from the LLVM compiler to achieve accuracies of 0.77 and 0.97, respectively, across a
collection of Juliet Test Suite CWE categories; this is comparable to our results where
on average across all Juliet CWEs tested the multimodal deep learning models achieve an
average accuracy of 0.95 (leaving out the anomalous results for CWE416 as discussed above).
BVDetector [21] uses bidirectional neural networks on graph features from binaries to
achieve at most 0.91 accuracy across collections of the Juliet Test Suite associated with
memory corruption and numerical issues, where the multimodal deep learning models on
average achieve 0.96 accuracy on data related to those flaw types (i.e., CWE121, CWE190,
CWE369, CWE590, CWE680, and CWE789).
5.2.2 Model Weights Initialization
The weights of the deep learning models explored in this work can be initialized using a
variety of approaches. For example, the authors of the original JAE and BiDNN models
use Xavier initialization [9]. He et al. demonstrate that when using LeakyReLU activation,
Kaiming initialization may lead to improved model performance [12]. Furthermore, LSUV
initialization has recently gained popularity [16]. In this section, we present the impact of
weight initialization schemes on model performance, baselining with constant initialization;
bias values are set using random initialization.
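A minimal sketch of applying these schemes to the Linear layers of a PyTorch model is shown below; the constant value, bias range, and negative slope are illustrative assumptions, and the data-dependent LSUV scheme is omitted.

```python
# Minimal sketch of the weight initialization schemes compared in this experiment.
import torch.nn as nn

def init_weights(model, scheme="kaiming"):
    for m in model.modules():
        if isinstance(m, nn.Linear):
            if scheme == "kaiming":
                # Kaiming initialization with LeakyReLU gain adjustment.
                nn.init.kaiming_uniform_(m.weight, a=0.01, nonlinearity="leaky_relu")
            elif scheme == "xavier":
                nn.init.xavier_uniform_(m.weight)
            elif scheme == "constant":
                nn.init.constant_(m.weight, 0.01)
            # Bias values are set using random initialization, as noted above.
            nn.init.uniform_(m.bias, -0.01, 0.01)
```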
Table 3 presents the results of using the different initialization methods with the dif-
ferent multimodal deep learning models. As in the previous experiments, the best model
performances per architecture per performance measure are highlighted in bold. From these
results, we see that there is not significant variation across the models using the different
types of initialization. However, since the best initialization method varies across the dif-
ferent models, we recommend comparing these methods when applying multimodal deep
learning models in practice.
The only unique parameter across the deep learning models explored in this work is CorrNet's λ value, which balances the correlation loss term with the autoencoder loss terms. We vary the value of λ to determine its effect on model performance; results are shown in Table 4. We include several λ values in the range [0, 10] and an empirical λ value ("auto") that equalizes the magnitudes of the correlation loss term and the autoencoder terms for a sample of the training data (as recommended by Chandar, et al. [4]). Although a small correlation weight (λ = 0.1) performs best, there appears to be no significant difference from including the correlation loss term as long as it is not unduly weighted.
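A minimal sketch of the "auto" setting, reusing the correlation_loss and MultimodalAutoencoder sketches above; the use of mean-squared-error reconstruction loss and the small stabilizer are illustrative assumptions.

```python
# Minimal sketch: choose lambda so the correlation term and the reconstruction
# terms have comparable magnitude on a sample of training data.
import torch
import torch.nn.functional as F

def auto_lambda(model, x_src, x_bin):
    with torch.no_grad():
        rec_src, rec_bin, _ = model(x_src, x_bin)
        recon = F.mse_loss(rec_src, x_src) + F.mse_loss(rec_bin, x_bin)
        corr = correlation_loss(model.enc_src(x_src), model.enc_bin(x_bin))
    # Equalize the magnitudes of the two loss contributions.
    return float(recon.abs() / (corr.abs() + 1e-8))
```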
To assess the impact of using both single and multimodal inputs for training, we first augment our dataset by adding instances that have
one modality zeroed out, resulting in a dataset with three times as many instances as the
original. Otherwise, model training is conducted similarly to the previous experiments.
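A minimal sketch of this augmentation, assuming aligned feature arrays X_src and X_bin with labels y; the array and function names are illustrative.

```python
# Minimal sketch: add, for every aligned (source, binary) instance, two extra
# instances with one modality zeroed out, tripling the number of instances.
import numpy as np

def augment_single_modality(X_src, X_bin, y):
    X_src_aug = np.concatenate([X_src, X_src, np.zeros_like(X_src)])
    X_bin_aug = np.concatenate([X_bin, np.zeros_like(X_bin), X_bin])
    y_aug = np.concatenate([y, y, y])
    return X_src_aug, X_bin_aug, y_aug
```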
As shown in Table 5, the addition of single modality training data hinders model per-
formance. This is a surprising result, especially given the current recommendations in the
multimodal deep learning literature. Furthermore, as single modality training is part of
the default CorrNet and BiDNN models, we suggest future research focus on this potential
discrepancy when applying these models to flaw prediction problems.
6. Conclusions
As discussed in Section 4.2, we set out to answer the following questions: 1) can multimodal
models improve flaw prediction accuracy, 2) how sensitive is the performance of multimodal
models with respect to model parameter choices as applied to software flaw prediction, and
3) does one of the evaluated multimodal models outperform the others?
In Section 5.1 we demonstrated that the multimodal deep learning models—CorrNet,
JAE, and BiDNN—can significantly improve performance over other deep learning methods
in predicting flaws in software programs. In Section 5.2, we addressed the second question
of parameter sensitivities associated with multimodal deep learning models, illustrating
the relative robustness of these methods across various model sizes, model initializations,
and model training approaches. We see across all of the results presented in Section 5
that amongst the three multimodal deep learning models we studied in this work, no one
model is clearly better than the others in predicting flaws in software programs across all
flaw types. Deeper examination of the individual flaw predictions could provide a better
understanding of the differences between these three models and identify which model may
be best for different flaw types encountered by auditors.
In the case where sufficient training data is available, performance of the multimodal
deep learning models was much higher on the Juliet Test Suite compared to that of the
flaw-injected Linux Kernel data (see Figure 2). A main difference between these data sets
(as described in Section 3), is that the Linux Kernel data is comprised of functions from
a modern, complex code base, whereas the Juliet Test Suite is comprised of independent
examples of flaws designed to enumerate various contexts in which those flaws may arise
within single functions. We believe the Linux Kernel data, inspired by real flaws logged
within the National Vulnerability Database, reflects more realistic software flaws, and for
this reason, presents more difficult flaw prediction problems than the Juliet Test Suite.
Thus, we share this data set for use by the machine learning and software analysis commu-
nities as a potential benchmark. However, this data is limited to a single flaw type, and
software programs often contain several different types of flaws concurrently. Future work
should address this by providing larger data sets with sufficient complexity and flaw type
variability.
References
[1] Miltiadis Allamanis, Earl T. Barr, Premkumar T. Devanbu, and Charles A. Sutton.
A survey of machine learning for big code and naturalness. ACM Computing Surveys,
51:81:1-81:37, 2017.
[2] Shushan Arakelyan, Christophe Hauser, Erik Kline, and Aram Galstyan. Towards
learning representations of binary executable files for security tasks. In Proc. Inter-
national Conference on Artificial Intelligence and Computer Science, pages 364-368,
2020.
[5] Baruch Epstein, Ron Meir, and Tomer Michaeli. Joint autoencoders: A flexible meta-
learning framework. In Proc. European Conference on Machine Learning (ECML-
PKDD), pages 494-509, 2018.
[6] Sebastian Eschweiler, Khaled Yakdan, and Elmar Gerhards-Padilla. discovRE: Efficient
cross-architecture identification of bugs in binary code. In NDSS, 2016.
[7] Seyed Mohammad Ghaffarian and Hamid Reza Shahriari. Software vulnerability anal-
ysis and discovery using machine-learning and data-mining techniques: A survey. ACM
Computing Surveys, 50(4):56, 2017.
[8] Daniel Gibert, Carles Mateu, and Jordi Planes. The rise of machine learning for
detection and classification of malware: Research developments, trends and challenges.
Journal of Network and Computer Applications, 153:102526, 2020.
[9] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feed-
forward neural networks. In Proc. Artificial Intelligence and Statistics, pages 249-256,
2010.
[10] Jacob Harer, Onur Ozdemir, Tomo Lazovich, Christopher Reale, Rebecca Russell,
Louis Kim, and Peter Chin. Learning to repair software vulnerabilities with generative
adversarial networks. In Proc. Advances in Neural Information Processing Systems,
pages 7933-7943. 2018.
[11] Jacob A. Harer, Louis Y. Kim, Rebecca L. Russell, Onur Ozdemir, Leonard R.
Kosta, Akshay Rangamani, Lei H. Hamilton, Gabriel I. Centeno, Jonathan R. Key,
Paul M. Ellingwood, Marc W. McConley, Jeffrey M. Opper, Sang Peter Chin, and
Tomo Lazovich. Automated software vulnerability detection with machine learning.
arXiv:1803.04497, 2018.
[12] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into recti-
fiers: Surpassing human-level performance on imagenet classification. In Proc. IEEE
International Conference on Computer Vision, pages 1026-1034, 2015.
[13] Harold Hotelling. Relations between two sets of variates. Biometrika, 28(3/4):321-377,
1936.
[14] Yongjun Lee, Hyun Kwon, Sang-Hoon Choi, Seung-Ho Lim, Sung Hoon Baek, and
Ki-Woong Park. Instruction2vec: Efficient preprocessor of assembly code to detect
software weakness with CNN. Applied Sciences, 9(19), 2019.
[15] Zhen Li, Deqing Zou, Shouhuai Xu, Zhaoxuan Chen, Yawei Zhu, and Hai Jin. VulDee-
Locator: A deep learning-based fine-grained vulnerability detector. arXiv:2001.02350,
2020.
[16] Dmytro Mishkin and Jiri Matas. All you need is a good init. arXiv:1511.06422, 2015.
[17] Jiquan Ngiam, Aditya Khosla, Mingyu Kim, Juhan Nam, Honglak Lee, and Andrew Y.
Ng. Multimodal deep learning. In Proc. International Conference on Machine Learning,
pages 689-696, 2011.
[20] Richard Socher, Milind Ganjoo, Christopher D Manning, and Andrew Ng. Zero-shot
learning through cross-modal transfer. In Advances in Neural Information Processing
Systems 26, pages 935-943. 2013.
[21] Junfeng Tian, Wenjing Xing, and Zhen Li. BVDetector: A program slice-based binary
code vulnerability intelligent detection system. Information and Software Technology,
123:106289, 2020.
[22] Vedran Vukotić, Christian Raymond, and Guillaume Gravier. Bidirectional joint rep-
resentation learning with symmetrical deep neural networks for multimodal and cross-
modal applications. In Proc. Multimedia Retrieval, pages 343-346, 2016.
[23] Weiran Wang, Raman Arora, Karen Livescu, and Jeff Bilmes. On deep multi-view
representation learning. In Proc. International Conference on Machine Learning, pages
1083-1092, 2015.
[24] H. Xue, S. Sun, G. Venkataramani, and T. Lan. Machine learning-based analysis of
program binaries: A comprehensive study. IEEE Access, 7:65889-65912, 2019.
[25] Fabian Yamaguchi, Markus Lottmann, and Konrad Rieck. Generalized vulnerability
extrapolation using abstract syntax trees. In Proc. Computer Security Applications,
pages 359-368, 2012.