Buffer Overflow
Mst Shapna Akter1, Hossain Shahriar2, Juan Rodriguez Cardenas3, Sheikh Iqbal Ahamed4, and Alfredo Cuzzocrea5
1 Department of Computer Science, Kennesaw State University, USA
2 Department of Information Technology, Kennesaw State University, USA
3 Department of Information Technology, Kennesaw State University, USA
4 Department of Computer Science, Marquette University, USA
5 iDEA Lab, University of Calabria, Rende, Italy
Abstract—One of the most significant challenges in the field of software code auditing is the presence of vulnerabilities in software source code. Every year, more and more software flaws are discovered, either internally in proprietary code or publicly disclosed. These flaws are highly likely to be exploited and can lead to system compromise, data leakage, or denial of service. To create a large-scale machine learning system for function-level vulnerability identification, we utilized a sizable dataset of C and C++ open-source code containing millions of functions with potential buffer overflow exploits. We have developed an efficient and scalable vulnerability detection method based on neural network models that learn features extracted from the source code. The source code is first converted into an intermediate representation to remove unnecessary components and shorten dependencies. We maintain the semantic and syntactic information using state-of-the-art word embedding algorithms such as GloVe and fastText. The embedded vectors are subsequently fed into neural networks such as LSTM, BiLSTM, LSTM-Autoencoder, word2vec, BERT, and GPT-2 to classify the possible vulnerabilities. Furthermore, we have proposed a neural network model that can overcome issues associated with traditional neural networks. We have used evaluation metrics such as F1 score, precision, recall, accuracy, and total execution time to measure the performance. We have conducted a comparative analysis between results derived from features containing a minimal text representation and those containing semantic and syntactic information. We have found that all neural network models provide higher accuracy when we use semantic and syntactic information as features. However, this approach requires more execution time due to the added complexity of the word embedding algorithm. Moreover, our proposed model provides higher accuracy than the LSTM, BiLSTM, LSTM-Autoencoder, word2vec, and BERT models, and the same accuracy as the GPT-2 model with greater efficiency.

Keywords: Cyber Security; Vulnerability Detection; Neural Networks; Feature Extraction

I. INTRODUCTION

Security in the digital realm is becoming increasingly important, but there is a significant threat to cyberspace from intrusion. Attackers can breach systems and applications due to security vulnerabilities caused by hidden software defects. Internally, proprietary programming contains thousands of these flaws each year [1]. For example, the WannaCry ransomware swept the globe by exploiting a flaw in the Windows Server Message Block protocol [2]. According to the Microsoft Security Response Center, there was an industry-wide surge in high-severity vulnerabilities of 41.7% in the first half of 2015. This figure represents the greatest proportion of software vulnerabilities in at least three years, accounting for 41.8% [3]. Furthermore, according to a Frost and Sullivan analysis released in 2018, severe and high-severity vulnerabilities increased from 693 in 2016 to 929 in 2017, with Google Project Zero coming in second place in terms of disclosing such flaws. On August 14, 2019, Intel issued a warning about a high-severity vulnerability in the software it uses to identify the specifications of Intel processors in Windows PCs [4]. The report claims that these defects, including information leakage and denial-of-service attacks, could substantially affect software systems. Although the company issued an update to remedy the problems, attackers can still use these vulnerabilities to escalate their privileges on a machine that has already been compromised. In June 2021, a vulnerability in the Windows Print Spooler service was discovered that allowed attackers to execute code remotely. The vulnerability, known as PrintNightmare, was caused by a buffer overflow and affected multiple versions of Windows in 2021 [5]. Microsoft released a patch to address the issue, but reports later emerged that the patch was incomplete and still left systems vulnerable.

To reduce losses, early vulnerability detection is an effective technique. The proliferation of open-source software and code reuse makes these vulnerabilities susceptible to rapid propagation. Source code analysis tools are already available; however, they often only identify a small subset of potential problems based on pre-established rules. Software vulnerabilities can be found using a technique called vulnerability detection. Conventional vulnerability detection employs static and dynamic techniques [6]. Static approaches evaluate source code or executable code without launching any programs, using methods such as data flow analysis, symbolic execution [7], and theorem proving [8]. Static approaches can be used early in software development and have excellent coverage rates, but they suffer from a significant false positive rate. By executing the program, dynamic approaches such as fuzz testing and dynamic symbolic execution can confirm or ascertain the behavior of the software. Dynamic methods depend on the coverage of test cases, which results in a low recall despite their low false positive rate and ease of implementation. The advancement of machine learning technology introduces new approaches to address the limitations of conventional techniques. One of the key research directions is to develop
intelligent source code-based vulnerability detection systems. This research can be divided into three categories: using software engineering metrics, anomaly detection, and weak pattern learning [9]. Initially, software engineering measures, including software complexity [10], developer activity [11], and code commits [12], were investigated to train machine learning models. This strategy was motivated by the idea that software becomes more susceptible as it becomes more complicated, but accuracy and recall need to be improved. Allamanis et al. [13] have shown that the syntactic and semantic information in the code increases the detection accuracy in anomaly detection. Moreover, one work has shown the detection of anomalies using fully-fledged code [14]. It reveals previously unidentified weaknesses, but false positive and false negative rates are high. Another work has shown an approach that learns vulnerable patterns from clean and vulnerable samples [15]. This method performs very well but relies on the quality of the dataset. In our work, we propose a solution for detecting software buffer overflow vulnerabilities using neural networks such as Simple RNN, LSTM, BiLSTM, word2vec, BERT, GPT-2, and LSTM-Autoencoder. We first transform source code samples into minimal intermediate representations through a tokenizer provided by the Keras library. Later, we extract semantic features using word embedding algorithms such as GloVe and fastText. After finishing the data preprocessing stage, we feed the input representation to the neural networks for classification. Moreover, we develop a neural network that performs best among all the models. All the models have been evaluated using evaluation metrics such as F1 score, precision, recall, accuracy, and total execution time. The following is a summary of our contributions:

1. Extracting semantic and syntactic features using GloVe and fastText.
2. Vulnerability detection in source code using LSTM, BiLSTM, LSTM-Autoencoder, word2vec, BERT, and GPT-2 with a minimal intermediate feature representation of the texts.
3. Vulnerability detection in source code using LSTM, BiLSTM, LSTM-Autoencoder, word2vec, BERT, and GPT-2 with semantic and syntactic features.
4. Proposal of a neural network that outperforms the results derived from existing models, and a comparison between results derived from neural networks trained with a minimal intermediate feature representation of the texts and those trained with semantic and syntactic features.

The rest of the paper is organized as follows: we provide a brief background study on software vulnerability detection in Section 2. Then we explain the methods we followed for our experimental research in Section 3. The results derived from the experiment are demonstrated in Section 4. Finally, Section 5 concludes the paper.

II. LITERATURE REVIEW

Researchers are interested in recently developed machine learning strategies for identifying and preventing software and cybersecurity vulnerabilities in order to address the shortcomings of conventional static and dynamic code analysis techniques. Various machine learning techniques, including naive Bayes, logistic regression, recurrent neural networks (RNNs), decision trees, and support vector machines, have been successfully used for classifying software security activities such as malware, ransomware, and network intrusion detection. We have examined machine learning-related papers that have been applied to the software security domain. Previously, Zeng et al. [16] reviewed software vulnerability analysis and discovery using deep learning techniques. They found four game-changing methods that contributed most to software vulnerability detection using deep learning techniques. These concepts are automatic semantic feature extraction using deep learning models, end-to-end solutions for detecting buffer overflow vulnerabilities, applying a bidirectional Long Short-Term Memory (BiLSTM) model for vulnerability detection, and deep learning-based vulnerability detectors for binary code. Zhou et al. [17] proposed a method called graph neural network for vulnerability identification with function-level granularity to address the issue of information loss during the representation learning process. They transformed the samples into a code property graph format. Then, a graph neural network made up of a convolutional layer and a gated graph recurrent layer learned the vulnerable programming pattern. This method improves the detection of intra-procedural vulnerabilities. However, they did not address inter-procedural vulnerabilities. Iorga et al. [18] demonstrated a process for early detection of cyber vulnerabilities from Twitter, building a corpus of 650 annotated tweets related to cybersecurity articles. They used the BERT model and transfer learning for identifying cyber vulnerabilities from the articles. The BERT model shows 91% accuracy, which they found adequate for identifying relevant posts or news articles. Sauerwein et al. [19] presented an approach for the automated classification of attackers' tactics, techniques, and procedures (TTPs) extracted from unstructured text by combining NLP with ML techniques. They assessed all potential combinations of the specified NLP and ML approaches with 156 processing pipelines and an automatically generated training set. They found that tokenization, POS tagging, IoC replacement, lemmatization, one-hot encoding, binary relevance, and support vector machines performed best for the classification of techniques and tactics. Harer et al. [20] created a dataset composed of millions of open-source functions annotated with results from static analysis. The performance of source-based models was then compared against approaches applied to artifacts extracted from the build process, with source-based methods coming out on top. The best performance was found when combining characteristics learned by deep models with tree-based models. They evaluated the use of deep neural network models alongside more conventional models like random forests. Finally, their best model achieved an area under the ROC curve of 0.87 and an area under the precision-recall curve of 0.49. Pistoia et al. [21] surveyed static analysis methods for identifying security vulnerabilities in software systems. They discussed three topics that have been linked to security vulnerability sources: application programming interface conformance, information flow, and access control. They addressed static analysis methods for stack-based access control and role-based access control separately, since access control systems can be divided into these two main
types. They reviewed some effective static analysis techniques, including the Mandatory Access Rights Certification of Objects (MARCO) algorithm, the Enterprise Security Policy Evaluation (ESPE) algorithm, the Static Analysis for Validation of Enterprise Security (SAVES) algorithm, and Hammer, Krinke, and Snelting's algorithm. However, static analysis produces false positive results and relies on predefined rules. For new errors, the static analysis method is unsuitable, as it cannot recognize and detect them.

III. METHODOLOGY

From the standpoint of source code, the majority of flaws originate in critical operations that pose security risks, such as functions, assignments, or control statements. Adversaries can directly or indirectly affect these crucial operations by manipulating factors or circumstances. To successfully learn patterns of security vulnerabilities from code, neural network models must be trained on a large number of instances. In this study, we analyze the lowest level of code in software package functions, capable of capturing vulnerable flows. We utilized a sizable dataset containing millions of function-level examples of C and C++ code from the SATE IV Juliet Test Suite, the Debian Linux distribution, and open-source Git repositories on GitHub, as mentioned in Russell's work [22]. Our project employs the CWE-119 vulnerability feature, which indicates issues related to buffer overflow vulnerability. Buffer overflow occurs when data written to a buffer exceeds its length, overwriting storage units outside the buffer. According to a 2019 Common Weakness Enumeration report, buffer overflow has become the most adversely impactful vulnerability. Although we focus on buffer overflow, our method can identify other vulnerabilities. Figure 1 illustrates an intra-procedural buffer overflow vulnerability. Our dataset is divided into three subfolders—train, validation, and test—each containing a CSV file with 100,000, 20,000, and 10,000 data instances, respectively. The CSV files store text data and corresponding labels, allowing systematic evaluation of the model's performance and adaptability throughout the learning process.

Fig. 1: An example of buffer overflow vulnerability.
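For illustration, the snippet below loads the three CSV splits described above with pandas. The file names and column names (functionSource for the code text, label for the CWE-119 flag) are assumptions made for this sketch, as the paper does not specify them.

```python
# Minimal sketch of loading the dataset splits described above.
# File and column names are assumptions, not taken from the paper.
import pandas as pd

train_df = pd.read_csv("train.csv")       # 100,000 instances
valid_df = pd.read_csv("validation.csv")  # 20,000 instances
test_df = pd.read_csv("test.csv")         # 10,000 instances

# Each row pairs a C/C++ function body with a binary CWE-119 label, e.g.
# functionSource: "void f(char *s) { char buf[8]; strcpy(buf, s); }", label: 1
X_train, y_train = train_df["functionSource"], train_df["label"]
X_valid, y_valid = valid_df["functionSource"], valid_df["label"]
X_test, y_test = test_df["functionSource"], test_df["label"]
```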
We analyzed the dataset and found some common words (shown in Table I) with their corresponding counts. The visualization of common words in the dataset provides a preliminary understanding of what kind of important features the dataset might have.

TABLE I: Most common words and their frequencies

Index   Common word   Count
0       =             505570
1       if            151663
2       {\n           113301
3       ==            92654
4       return        77438
5       *             71897
6       the           71595
7       }\n           63182
9       int           53673
10      /*            51910
11      <             43703
12      */\n          43591
13      +             41855
14      to            39072
15      &&            36180
16      for           35849
17      }\n\n         34017
18      char          33334
19      else          31358
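Frequency counts like those in Table I can be recovered directly from a fitted tokenizer's vocabulary statistics. A minimal sketch, assuming the function strings are already loaded into a list named code_samples:

```python
# Sketch: recovering Table I-style token frequencies with the Keras tokenizer.
# filters='' keeps operators such as '==' and '&&' intact instead of stripping them.
from tensorflow.keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer(filters='', lower=False)
tokenizer.fit_on_texts(code_samples)  # code_samples: list of C/C++ function strings

# word_counts maps each token to its corpus frequency.
top_tokens = sorted(tokenizer.word_counts.items(), key=lambda kv: kv[1], reverse=True)
for token, count in top_tokens[:20]:
    print(f"{token!r}: {count}")
```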
1) Data Preprocessing: In this study, we conducted a series of data preprocessing techniques to prepare our dataset for the neural networks. The data preprocessing steps we employed include tokenization, stop word removal, stemming, lemmatization, and the use of pre-trained embeddings. Initially, we performed tokenization, which is the process of breaking down the source code into smaller units called tokens. Tokens represent the basic units of analysis for computational purposes in natural language processing tasks. For this process, we utilized the Keras tokenizer, which provides methods for splitting plain text into word tokens and for mapping tokens back to text [23]. Following tokenization, we applied stop word removal, stemming, and lemmatization techniques to further preprocess the tokens. Stop word removal eliminates common words that do not provide significant information, while stemming and lemmatization normalize the tokens by reducing them to their root form. These techniques help in reducing noise and improving the efficiency of the neural networks.

We first converted the tokens into numerical representations using a minimal intermediate representation with the Keras tokenizer. The Keras tokenizer assigns a unique integer index to each token in the vocabulary and represents the source code as a sequence of these integer indices. This representation is more efficient than one-hot encoding, as it does not involve creating large, sparse vectors. However, it still lacks semantic information about the tokens.
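A minimal sketch of this conversion with the Keras tokenizer; the vocabulary cap and sequence length are illustrative assumptions, not values reported in the paper:

```python
# Sketch: minimal intermediate representation via the Keras tokenizer.
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer

MAX_VOCAB = 10000   # assumed vocabulary cap
MAX_LEN = 500       # assumed maximum tokens per function

tokenizer = Tokenizer(num_words=MAX_VOCAB, filters='', lower=False)
tokenizer.fit_on_texts(X_train)  # learn the token -> integer index mapping

# Each function becomes a fixed-length sequence of integer indices.
train_seq = pad_sequences(tokenizer.texts_to_sequences(X_train), maxlen=MAX_LEN)
valid_seq = pad_sequences(tokenizer.texts_to_sequences(X_valid), maxlen=MAX_LEN)
test_seq = pad_sequences(tokenizer.texts_to_sequences(X_test), maxlen=MAX_LEN)
```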
To further enhance the representation of the source code tokens and better capture semantic and syntactic information, we utilized pre-trained embeddings, namely GloVe and fastText. We stacked GloVe and fastText embeddings together for extracting the semantic and syntactic information from the source code. Both of these embeddings have demonstrated strong performance in various NLP tasks and can effectively capture the relationships between words in the source code. GloVe is an unsupervised learning algorithm that generates vector representations of words based on global word-word co-occurrence statistics from a corpus [24]. FastText, an extension of the skip-gram method, generates character n-grams of varying lengths for each word and learns weights for each n-gram, as well as the entire word token, allowing the model to capture the meaning of suffixes, prefixes, and short words [25]. We separately fed the minimal intermediate representation produced by the Keras tokenizer and the semantic and syntactic representations derived from GloVe and fastText into our neural network models. This approach allowed us to compare the performance of the models when using different input representations, helping us identify the most effective method for detecting security vulnerabilities in the source code.
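To make the stacking step concrete, the sketch below concatenates GloVe and fastText vectors into a single embedding matrix for a downstream Keras Embedding layer. The per-embedding dimensions (100 and 300) and the lookup dictionaries are assumptions for illustration:

```python
# Sketch: stacking GloVe and fastText vectors into one embedding matrix.
# glove_vecs and fasttext_vecs are assumed dicts: token -> numpy vector.
import numpy as np

GLOVE_DIM, FASTTEXT_DIM = 100, 300  # assumed dimensions
vocab_size = min(MAX_VOCAB, len(tokenizer.word_index) + 1)

embedding_matrix = np.zeros((vocab_size, GLOVE_DIM + FASTTEXT_DIM))
for token, idx in tokenizer.word_index.items():
    if idx >= vocab_size:
        continue
    g = glove_vecs.get(token, np.zeros(GLOVE_DIM))
    f = fasttext_vecs.get(token, np.zeros(FASTTEXT_DIM))
    embedding_matrix[idx] = np.concatenate([g, f])  # stacked representation

# embedding_matrix then initializes the Embedding layer of each model below.
```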
A. Classification Models

In this section, we discuss the various classification models utilized in our study. These models include Simple RNN, LSTM, BiLSTM, LSTM-Autoencoder, word2vec, BERT, and GPT-2. These models are designed to work with different types of data, such as text, time series, and sequences, and have been widely employed in natural language processing and other related tasks.

B. Simple Recurrent Neural Network (RNN)

The Simple Recurrent Neural Network (RNN) is a type of artificial neural network that can model sequential data by utilizing a directed graph and temporally dynamic behavior. RNNs consist of an input layer, a hidden layer, and an output layer [26]. These networks have a memory state added to each neuron, allowing them to capture temporal dependencies in the data. The dimensionality of the input layer in our Simple RNN model is determined by the input data features. The hidden layer consists of 256 units, which use memory states to capture temporal dependencies in the data. We use the hyperbolic tangent (tanh) activation function in the hidden layer to introduce non-linearity into the model. We chose this activation function due to its ability to handle vanishing gradients more effectively compared to other activation functions such as sigmoid. The output layer of the Simple RNN model generates predictions based on the processed input data. The number of units in the output layer corresponds to the number of classes, which is two. We use an appropriate activation function, such as sigmoid for binary classification, in the output layer to generate probability scores for each class. To optimize the model, we choose the binary cross-entropy loss function and employ the Adam optimization algorithm. We set hyperparameters such as the learning rate to 0.001, the batch size to 32, and the number of training epochs to 50.
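A minimal Keras sketch of this architecture, reusing the stacked embedding matrix and padded sequences from the preprocessing stage. Since the paper describes two sigmoid output units with binary cross-entropy, the labels are assumed to be one-hot encoded here:

```python
# Sketch of the Simple RNN classifier described above (assumptions noted inline).
from tensorflow.keras.initializers import Constant
from tensorflow.keras.layers import Dense, Embedding, SimpleRNN
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import Adam

model = Sequential([
    # Embedding layer initialized with the stacked GloVe+fastText matrix.
    Embedding(vocab_size, GLOVE_DIM + FASTTEXT_DIM,
              embeddings_initializer=Constant(embedding_matrix)),
    SimpleRNN(256, activation="tanh"),  # 256 hidden units with tanh
    Dense(2, activation="sigmoid"),     # two output units, one per class
])
model.compile(optimizer=Adam(learning_rate=0.001),
              loss="binary_crossentropy", metrics=["accuracy"])
# Two sigmoid outputs with binary cross-entropy imply one-hot labels here.
# model.fit(train_seq, y_train_onehot, batch_size=32, epochs=50,
#           validation_data=(valid_seq, y_valid_onehot))
```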
C. Long Short-Term Memory (LSTM)

The Long Short-Term Memory (LSTM) is a recurrent neural network architecture designed to address the vanishing gradient problem of traditional RNNs. It was first proposed by Hochreiter and Schmidhuber [27]. Using this model for sequential datasets is effective, as it can handle single data points. It follows the Simple RNN model's design and is an extended version of that model [28, 29]. Our LSTM model consists of an input layer that determines the dimensionality of the input data features. We incorporated three hidden layers, each containing 128 memory cells that can capture long-term dependencies in the input sequence. The output of each LSTM layer is fed into a dropout layer with a dropout rate of 0.2 to prevent overfitting. The final output of the last LSTM layer is fed into a dense layer with two units and a sigmoid activation function to produce the final binary classification output. The LSTM cell comprises three gates: the input gate, forget gate, and output gate, which regulate the flow of information into and out of the cell. To introduce non-linearity into the model, we use the hyperbolic tangent (tanh) activation function in the LSTM cell. Furthermore, we utilize the Rectified Linear Unit (ReLU) activation function in the output layer to generate non-negative predictions. We optimize the LSTM model using the binary cross-entropy loss function and the Adam optimization algorithm. The model's hyperparameters include a learning rate of 0.001, a batch size of 32, and 50 training epochs.
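A sketch of this stack under the same assumptions as the Simple RNN example; since the prose mentions both a sigmoid dense head and a ReLU at the output, the sketch keeps the sigmoid head that produces the class probabilities:

```python
# Sketch of the three-layer LSTM classifier described above.
from tensorflow.keras.initializers import Constant
from tensorflow.keras.layers import LSTM, Dense, Dropout, Embedding
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import Adam

model = Sequential([
    Embedding(vocab_size, GLOVE_DIM + FASTTEXT_DIM,
              embeddings_initializer=Constant(embedding_matrix)),
    LSTM(128, return_sequences=True),  # hidden layer 1
    Dropout(0.2),
    LSTM(128, return_sequences=True),  # hidden layer 2
    Dropout(0.2),
    LSTM(128),                         # hidden layer 3
    Dropout(0.2),
    Dense(2, activation="sigmoid"),    # binary classification head
])
model.compile(optimizer=Adam(learning_rate=0.001),
              loss="binary_crossentropy", metrics=["accuracy"])
```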
D. Bidirectional Long Short-Term Memory (BiLSTM)

The Bidirectional Long Short-Term Memory (BiLSTM) is a type of recurrent neural network that enhances the capabilities of the traditional LSTM by introducing bidirectional processing of the input sequence. It was first proposed by Graves [30]. This idea sets it apart from the LSTM model, which can only learn patterns from the past to the future [31]. Our BiLSTM model comprises an input layer that determines the dimensionality of the input data features. We have incorporated three hidden layers, each containing 128 memory cells that can capture long-term dependencies in the input sequence. The output of each BiLSTM layer is fed into a dropout layer with a dropout rate of 0.2 to prevent overfitting. The final output of the last BiLSTM layer is fed into a dense layer with two units and a sigmoid activation function to produce the final binary classification output. The BiLSTM cell has two sets of three gates, namely the input gate, forget gate, and output gate: one set processes the input sequence in the forward direction, and another set processes it in the backward direction. This bidirectional processing allows the model to capture dependencies on both the past and future context of the input sequence. To introduce non-linearity into the model, we use the hyperbolic tangent (tanh) activation function in the BiLSTM cell. Furthermore, we utilize the Rectified Linear Unit (ReLU) activation function in the output layer to generate non-negative predictions. We optimize the BiLSTM model using the binary cross-entropy loss function and the Adam optimization algorithm. The model's hyperparameters include a learning rate of 0.001, a batch size of 32, and 50 training epochs.
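In Keras terms, the only structural change relative to the LSTM sketch above is wrapping each recurrent layer in a Bidirectional wrapper (same assumed inputs and labels):

```python
# Sketch: the BiLSTM variant wraps each recurrent layer in Bidirectional.
from tensorflow.keras.initializers import Constant
from tensorflow.keras.layers import LSTM, Bidirectional, Dense, Dropout, Embedding
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import Adam

model = Sequential([
    Embedding(vocab_size, GLOVE_DIM + FASTTEXT_DIM,
              embeddings_initializer=Constant(embedding_matrix)),
    Bidirectional(LSTM(128, return_sequences=True)),  # forward + backward pass
    Dropout(0.2),
    Bidirectional(LSTM(128, return_sequences=True)),
    Dropout(0.2),
    Bidirectional(LSTM(128)),
    Dropout(0.2),
    Dense(2, activation="sigmoid"),
])
model.compile(optimizer=Adam(learning_rate=0.001),
              loss="binary_crossentropy", metrics=["accuracy"])
```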
$$ y_i^{\mathrm{final}} = \frac{1}{1 + e^{-p_i^{\mathrm{weighted}}}} $$
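Read in context, this appears to be a sigmoid applied to a weighted combination of the base models' predictions to produce the proposed model's final output for sample i. A minimal numeric sketch, with weights and scores invented purely for illustration:

```python
# Sketch: final prediction as a sigmoid over weighted base-model scores.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

p = np.array([0.8, 1.2, 0.5])  # hypothetical raw scores from base models
w = np.array([0.2, 0.3, 0.5])  # assumed combination weights
p_weighted = np.dot(w, p)      # p_i^weighted for one sample i
y_final = sigmoid(p_weighted)  # y_i^final as in the equation above
print(round(float(y_final), 4))
```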
TABLE III: Vulnerable source code classification results using different neural network models with embedding algorithms (GloVe + fastText)

Model              Accuracy   Precision   Recall   F1 Score   Execution time
Simple RNN         0.92       0.93        0.93     0.97       42min 8s
LSTM               0.92       0.93        0.95     0.97       33min 13s
BiLSTM             0.93       0.96        0.96     0.99       45min 3s
Word2vec           0.94       1.00        0.98     0.99       42min 56s
LSTM-Autoencoder   0.90       0.93        0.94     0.95       59min 53s
BERT               0.94       0.95        0.95     0.99       5h 16min
GPT-2              0.95       0.97        0.98     0.99       8h 33min
Proposed Model     0.95       0.97        0.98     0.99       2h 46min
Our proposed model achieves the highest accuracy of 95% compared to the previous works. Moreover, we contribute to efficient measurement analysis and perform an in-depth analysis of the features that were not considered in previous studies. This comprehensive approach allows us to better understand the factors influencing the performance of vulnerability detection models and develop more effective methods for detecting security vulnerabilities in source code.
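The reported metrics can be reproduced with standard library calls. A generic sketch (not code from the paper), assuming integer test labels and the padded test sequences from the preprocessing stage:

```python
# Sketch: computing the reported evaluation metrics and execution time.
import time
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

start = time.time()
y_pred = model.predict(test_seq).argmax(axis=1)  # predicted class per sample
elapsed = time.time() - start

print("accuracy :", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred))
print("recall   :", recall_score(y_test, y_pred))
print("F1 score :", f1_score(y_test, y_pred))
print(f"prediction time: {elapsed:.1f}s")
```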
Fig. 3: Performance Metrics for Different Neural Network Models on Vulnerable Source Code without Word Embedding Algorithms
VI. CONCLUSION

Our research aims to detect implementation vulnerabilities early in the development cycle by leveraging the power of neural networks. We have collected a large dataset of open-source C and C++ code and developed a scalable and efficient vulnerability detection method based on various neural network models. We compared the performance of different models, including Simple RNN, LSTM, BiLSTM, LSTM-Autoencoder, Word2Vec, BERT, and GPT-2, and found that models with semantic and syntactic information extracted using state-of-the-art word embedding algorithms such as GloVe and fastText outperform those with a minimal text representation. Our proposed neural network model has been shown to provide higher accuracy with greater efficiency than the other models evaluated. We have also analyzed the execution time of the models and characterized a trade-off between accuracy and efficiency. Overall, our research contributes to the development of large-scale machine learning systems for function-level vulnerability identification in source code auditing.

ACKNOWLEDGEMENT

The work is supported by the National Science Foundation under NSF Awards #2209638, #2100115, #2209637, #2100134, and #1663350. Any opinions, findings, and recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.

REFERENCES

[1] T. D. LaToza, G. Venolia, and R. DeLine, "Maintaining mental models: a study of developer work habits," in Proceedings of the 28th International Conference on Software Engineering, pp. 492–501, 2006.
[2] T. Manikandan, B. Balamurugan, C. Senthilkumar, R. R. A. Harinarayan, and R. R. Subramanian, "Cyberwar is coming," Cyber Security in Parallel and Distributed Computing: Concepts, Techniques, Applications and Case Studies, pp. 79–89, 2019.
[3] A. Arora and R. Telang, "Economics of software vulnerability disclosure," IEEE Security & Privacy, vol. 3, no. 1, pp. 20–25, 2005.
[4] K. Jochem, "It security matters,"
[5] "cisa," https://www.cisa.gov/news-events/alerts/2021/06/30/printnightmare-critical-windows-print-spooler-vulnerability, 2022. Accessed April 26, 2023.
[6] T. N. Brooks, "Survey of automated vulnerability detection and exploit generation techniques in cyber reasoning systems," in Science and Information Conference, pp. 1083–1102, Springer, 2018.
[7] C. Cadar, D. Dunbar, D. R. Engler, et al., "KLEE: unassisted and automatic generation of high-coverage tests for complex systems programs," in OSDI, vol. 8, pp. 209–

Fig. 4: Performance Metrics for Different Neural Network Models on Vulnerable Source Code without Word Embedding Algorithms