1. Introduction
The size and complexity of modern software systems are constantly increasing. At the same time, the number of software vulnerabilities has grown significantly, raising the security concerns expressed both by end users and by software enterprises. The exploitation of software vulnerabilities can lead to severe financial damage, which makes it essential for decision makers to assess the security level of software products. They need to determine (i) whether software systems along with their respective software components (e.g., packages, classes, methods) are vulnerable or not, (ii) the impact of potential vulnerability exploits, and (iii) the likelihood that a specific number of vulnerabilities will be reported in a certain period of time.
To deal with this, many researchers have proposed models capable of discovering vulnerabilities. A lot of effort has been placed on the prediction of vulnerable software components using software attributes extracted from the source code. In these studies, researchers commonly train machine learning models based on either software metrics (e.g., cohesion, coupling, and complexity metrics) [1,2,3] or text features [4,5,6,7]. They aim to identify patterns in the source code that indicate whether a file or a function is vulnerable. However, these approaches do not predict the number of vulnerabilities in future versions. Although they assess whether a component contains vulnerabilities or not, they do not provide any indication of how the number of its vulnerabilities evolves over time.
An indication of the expected number of vulnerabilities and the trends of their occurrences can be a very useful tool for decision makers, enabling them to prioritize their valuable time and limited resources for testing an existing software project and patching its reported vulnerabilities. For this purpose, there is a need for forecasting models that can predict the trend and the number of vulnerabilities that are expected to be discovered in a specific time horizon for a given software project. Studies that propose mechanisms to model the evolution of the number of vulnerabilities over time [8,9] aim not to detect vulnerabilities, but to forecast the number of vulnerabilities that are likely to be identified in the future. These studies utilize either statistical or machine learning algorithms in order to estimate the expected number of vulnerabilities based on the vulnerabilities that have already been reported (e.g., in the National Vulnerability Database [10]). The majority of these algorithms are time series models that keep track of all the vulnerabilities in terms of calendar time and treat that time as an independent variable [11]. Statistical models such as Autoregressive Integrated Moving Average (ARIMA), Croston's method, logistic regression, and exponential smoothing models have attracted the interest of researchers in the field [12,13]. Machine Learning (ML) models have been considered as well. Jabeen et al. conducted a comparative analysis evaluating different statistical models against various ML models such as Support Vector Machines and Bagging [14].
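To make the statistical side concrete, the following sketch fits two of the models named above, ARIMA and exponential smoothing, to a monthly series of vulnerability counts using the statsmodels library. This is a minimal illustration only: the series values and the ARIMA order are placeholder assumptions, not the configuration used in any of the cited studies.

```python
# Minimal sketch: fitting two statistical forecasters to a monthly
# vulnerability-count series. All values below are made-up placeholders.
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.holtwinters import ExponentialSmoothing

# Hypothetical monthly counts of reported vulnerabilities
counts = pd.Series(
    [3, 0, 5, 2, 7, 1, 4, 6, 2, 3, 0, 8, 5, 4, 2, 6, 3, 1, 7, 2, 4, 5, 3, 6],
    index=pd.period_range("2020-01", periods=24, freq="M"),
)

# ARIMA(p, d, q); in practice the order would be chosen via AIC or a grid search
arima_fit = ARIMA(counts, order=(1, 1, 1)).fit()
print(arima_fit.forecast(steps=6))   # 6-months-ahead forecast

# Exponential smoothing as an alternative statistical baseline
es_fit = ExponentialSmoothing(counts, trend="add").fit()
print(es_fit.forecast(6))
```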
Deep learning (DL) has emerged as a promising solution for time series forecasting as well. Karasu et al. utilized a long short-term memory (LSTM) network to predict crude oil prices [15]. Altan et al. likewise used DL, and more specifically a hybrid model including an LSTM network, for time series forecasting [16]; in their study, they focused on predicting digital currency prices over time. Recent studies have introduced DL models as predictors capable of modelling the evolution of the number of vulnerabilities over time [12,17]. Despite these attempts, the literature still lacks a thorough analysis of DL for vulnerability forecasting. In fact, although existing studies [12,14,18] use neural networks to forecast the number of vulnerabilities, their predictive capacity has not been thoroughly studied. Gencer et al. [17] recently conducted a more in-depth analysis, considering several DL algorithms; however, they focused only on Android systems. In addition, the authors did not follow a project-specific approach, i.e., they did not build models to predict the future number of vulnerabilities for each Android application, but instead aggregated all the vulnerabilities related to Android applications and attempted to provide forecasts for this aggregated value. As a result, their findings cannot be generalised to the task of predicting the number of vulnerabilities of a software project in a future version.
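For the DL side, a forecaster of the kind discussed above could look like the following minimal sketch: an LSTM network that maps a sliding window of the last 12 monthly counts to the next month's count. The architecture, window length, and dummy data are illustrative assumptions, not the models evaluated in the cited studies or in this paper.

```python
# Illustrative univariate LSTM forecaster (assumed architecture).
import numpy as np
from tensorflow import keras

def make_windows(series, window=12):
    """Turn a 1-D series into (samples, window, 1) inputs and next-step targets."""
    X, y = [], []
    for i in range(len(series) - window):
        X.append(series[i:i + window])
        y.append(series[i + window])
    return np.array(X)[..., np.newaxis], np.array(y)

series = np.random.poisson(lam=4, size=120).astype("float32")  # dummy monthly counts
X, y = make_windows(series)

model = keras.Sequential([
    keras.layers.LSTM(32, input_shape=(X.shape[1], 1)),
    keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=50, batch_size=16, verbose=0)

# One-step-ahead forecast from the last observed window
next_month = model.predict(series[-12:].reshape(1, 12, 1), verbose=0)
print(float(next_month[0, 0]))
```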
To this end, in the present paper we empirically examine the capacity of statistical and DL models in forecasting the number of vulnerabilities that a software product may exhibit in the future, and we compare their predictive performance. For this purpose, we utilize data provided by the National Vulnerability Database (NVD), which publishes files with the reported vulnerabilities of several software products. We gathered data about the vulnerabilities of five popular software applications reported over the last two decades (i.e., from 2002 to 2022) and, based on these data, we built several statistical and DL models for each of the five projects, providing vulnerability forecasts for a horizon of 24 months ahead. The produced models are evaluated and compared based on their goodness-of-fit, as well as on their predictive performance on unseen data. To the best of our knowledge, this is the first study that thoroughly evaluates the capacity of DL models in vulnerability forecasting and compares their predictive performance with traditional statistical models, in an attempt to establish DL models as adequate predictors of the future number of software vulnerability occurrences.
The contribution of our work is summarized in the following points:
An in-depth examination of DL models as a method for forecasting the number of vulnerabilities that a software project will contain in the future.
A comprehensive comparison between DL and statistical models in vulnerability forecasting.
A methodology for forecasting the number of vulnerabilities in an individual future timestep, instead of the cumulative number of vulnerabilities up to that timestep.
To sum up, in comparison with the state-of-the-art approaches, our contribution is primarily that we conducted the first thorough investigation of the capacity of DL in forecasting the number of vulnerabilities for a specific project in a future timestep, and subsequently that we performed an in-depth comparison between statistical and DL methods for the task of vulnerability forecasting. Most studies in the field examine traditional statistical time-series models [8,9,11,13], while few of them also examine ML models [12,14] or merely include a single type of neural network in their analysis, without proceeding with an in-depth analysis of DL models [12]. The only known work closely related to ours that attempts to address the aforementioned issues is the one conducted by Gencer et al. [17]. However, contrary to Gencer et al. [17], who focused on the vulnerabilities of Android applications and built DL models that forecast the future evolution of the number of vulnerabilities in Android applications in general (by accumulating the vulnerability reports of all the studied Android applications), we provide a project-specific approach in which we build a different forecasting model for each of the studied software projects. In addition, as opposed to Gencer et al. [17], we do not focus on a specific programming language. Finally, contrary to the approaches of [11,14], we do not predict the cumulative number of vulnerabilities up to a timestep, but instead try to predict the exact number of vulnerabilities that will occur in that specific future timestep.
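As a small numeric illustration of this distinction (with made-up counts), per-month targets can be recovered from a cumulative series by differencing:

```python
# Illustration of the forecasting target: vulnerabilities reported in each
# individual month, not the running (cumulative) total. Placeholder values.
import pandas as pd

monthly = pd.Series([3, 0, 5, 2], index=pd.period_range("2022-01", periods=4, freq="M"))
cumulative = monthly.cumsum()                              # 3, 3, 8, 10
per_month = cumulative.diff().fillna(cumulative.iloc[0])   # recovers 3, 0, 5, 2
print(per_month.astype(int).tolist())
```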
The rest of the paper is structured as follows. Section 2 presents the related work in the field of vulnerability forecasting in software systems. Section 3 provides information about the proposed models, the overall methodology and the experimental setup. Section 4 thoroughly discusses the obtained results of our analysis, while Section 5 sums up the paper, provides the overall conclusions and discusses potential future research directions.
2. Related Work
Examination of previous studies in the literature regarding vulnerability prediction shows that approaches based on statistics, mathematical modeling, and ML have been used. Code attribute-based models and time series-based models are the two primary categories of these approaches. The models based on code attributes concentrate on identifying the relationship between code attributes and the existence of vulnerabilities. On the other hand, time series-based models focus on predicting the number of vulnerabilities in a future time step based on historical data.
Regarding the code-based models, Automated Static Analysis (ASA) is often used for the early identification of security issues in the source code of software projects [19,20]. ASA manages to identify potential security threats residing in the source code by applying pre-defined rules and identifying violations of these rules. Based on ASA, more advanced models have also been proposed [21]. Siavvas et al. proposed a model that combines low-level indicators (i.e., security-relevant static analysis alerts and software metrics) in order to generate a high-level security score that reflects the internal security level of the examined software [21].
Apart from ASA, for the prediction of vulnerabilities in the source code the research community has widely employed ML and DL algorithms that try to identify code patterns or code attributes related to the existence of vulnerabilities. Shin and Williams [1,2] examined how well software metrics, in particular complexity metrics, can predict software vulnerabilities, developing several regression models to differentiate between vulnerable and non-vulnerable functions. The effectiveness of feeding artificial neural networks with software metrics to anticipate cross-project vulnerabilities was examined by the authors in [22], where several ML models were developed, evaluated, and compared using a dataset of popular PHP products.
Neuhaus et al. proposed Vulture [6], a vulnerability prediction model that identified vulnerabilities based on import statements and function calls that are more frequent in vulnerable components. In VulDeePecker [5], Li et al. proposed a DL model for vulnerability identification. They separated the original code into several lines of code that were semantically connected, and then used word2vec embeddings [23,24] to turn those lines of code into vectors. Subsequently, a neural network was trained to identify library/API function calls associated with known defects. In [25], the authors compared text mining and software metrics approaches using several machine and DL models for vulnerability prediction and then attempted to combine the two approaches as well. The aforementioned code-based models are categorized in Table 1.
Regarding the time series-based models, Alhazmi et al. proposed a time-based model [11]. Their approach is based on the observation that interest in newly released software rises at the beginning, peaks after a while, and then drops as new competitive versions are introduced. Yasasin et al. examined the issue of estimating the number of software security flaws in operating systems, browsers, and office applications [18]. They retrieved their data from the NVD and mainly used statistical models such as ARIMA and exponential smoothing; they also investigated the suitability of the Mean Absolute Error (MAE) and the Root Mean Square Error (RMSE) for measuring vulnerability forecasting accuracy. Furthermore, Jabeen et al. conducted an empirical analysis in which they compared different statistical algorithms with ML techniques, showing that many of the ML models provide better results [14]. Shukla et al. [26] attempted to include in their model the so-called change point, in order to capture the time point at which changes in the code size, popularity or known vulnerabilities can cause changes in the time series distribution. Wang et al. attempted to include the effort factor (i.e., any factor that can depict environmental changes of the software) in their cumulative vulnerability discovery model in order to predict trends in the vulnerability time series data [27]. Table 2 lists the categories of the time series-based models.
In this study, we propose an approach to predict the evolution of vulnerabilities over a horizon of two years (i.e., 24 steps) ahead using both statistical and DL models, and we compare these two kinds of time series models. We follow a univariate approach, considering only the number of vulnerabilities already reported in the NVD for two operating systems, two browsers and one Office product (see Section 3.1). To the best of our knowledge, this is the first study that thoroughly examines the capacity of DL in forecasting the evolution of software vulnerabilities. While Gencer et al. [17] also compared ARIMA with several DL models, their work focused solely on Android vulnerabilities by considering Android as a whole. In contrast, we follow a project-specific approach (i.e., specific browsers, operating systems) in order to be in line with a real-world scenario in which a decision maker would want to know the expected number of vulnerabilities for his/her product. We also differ from the approaches of [11,14], as we attempt to predict the exact number of vulnerabilities in a specific month instead of the cumulative number of vulnerabilities up to that month.
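For illustration, the sketch below shows one plausible way to turn per-vulnerability publication dates into the kind of univariate monthly series we use, holding out the last 24 months as the forecasting horizon. The dates are random placeholders rather than an actual NVD export; the exact data collection steps are described in Section 3.1.

```python
# Hedged sketch: building a monthly vulnerability-count series for one project
# and splitting off the last 24 months as the test horizon. Placeholder data.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
dates = pd.to_datetime(
    rng.choice(pd.date_range("2002-01-01", "2022-12-31").values, size=500)
)

# Monthly number of reported vulnerabilities; months without reports count as zero
monthly = pd.Series(1, index=dates).sort_index().resample("MS").sum()

# The last 24 months are held out for evaluation; the rest are used for fitting
train, test = monthly.iloc[:-24], monthly.iloc[-24:]
print(len(train), len(test))
```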
4. Results and Discussion
In this section, the evaluation results that were obtained by fitting the models (i.e., goodness-of-fit level) on the five selected software projects (see Section 3.3), as well as the results on the held-out (i.e., testing) data, are presented and discussed. All experiments with deep neural networks were conducted on the CUDA [57] platform running on an NVIDIA GeForce GTX 1660 GPU. For the statistical models, we used an i5-9600K CPU at 3.70 GHz with 16 GB RAM.
Table 9 provides the performance metrics that were obtained while examining the goodness-of-fit (i.e., descriptive power) of the investigated models described above, as well as their predictive power on unknown data, for each vulnerability dataset covered in our study. As a reminder, the first three metrics (i.e., fit metrics) reflect the fitting performance of the models on the first n-24 observations used for model training, whereas the remaining two metrics (i.e., test metrics) reflect the models' performance on the last 24 observations (i.e., 24 months). In Table 9, the best values of each metric for each software project are shown in bold.
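For clarity, the sketch below shows how such fit and test metrics could be computed with scikit-learn; the `train`, `fitted`, `test`, and `forecast` arrays are placeholder values, not results from our models.

```python
# Minimal sketch of the metric computation: R2, MAE and RMSE on the fitted
# (training) values, MAE and RMSE on the held-out forecasts. Placeholder data.
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

train, fitted = np.array([3, 0, 5, 2, 7]), np.array([2.5, 1.0, 4.2, 2.8, 6.1])
test, forecast = np.array([4, 6, 2]), np.array([3.5, 5.0, 2.7])

fit_r2 = r2_score(train, fitted)
fit_mae = mean_absolute_error(train, fitted)
fit_rmse = np.sqrt(mean_squared_error(train, fitted))

test_mae = mean_absolute_error(test, forecast)
test_rmse = np.sqrt(mean_squared_error(test, forecast))
print(fit_r2, fit_mae, fit_rmse, test_mae, test_rmse)
```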
By inspecting Table 9, the following observations can be made. First of all, none of the models acts as a "silver bullet", meaning that there is a different optimal model for each particular dataset (i.e., software project). As a matter of fact, the optimal models vary even when compared per category, since there is no statistical or DL approach that performs better than its competitors within the same class. On the other hand, we can clearly observe that the statistical models present a better goodness-of-fit level, whereas the DL models provide the lowest errors on the test data of each project.
Regarding the goodness-of-fit level, we notice that for each covered dataset (i.e., software project) there is always a statistical model providing a higher R² and lower MAE and RMSE scores than the DL approaches. While in most of the cases the R² cannot be considered high, possibly creating doubts about how well the models fit the data, both MAE and RMSE are low enough (at least for the cases of Internet Explorer, Ubuntu Linux and Microsoft Office) to show that the models' predictions are really close to the real values. To provide a visual inspection of the models' fit capabilities, Figure 3 shows the ARIMA model (in red colour) fitted to the Google Chrome vulnerability dataset (in blue colour). Based on Table 9, ARIMA demonstrated the best fitting performance on this particular dataset. As can be seen by inspecting the plot, the ARIMA model has managed to learn the peculiarities (e.g., level, trend) of Google Chrome's vulnerability evolution patterns to a quite satisfactory extent, with the only exception being a couple of random spikes where the number of reported vulnerabilities was unusually high. However, it should be noted that, although the model cannot accurately estimate the exact value of the spikes, the predicted values at those points are higher than the mean value of the predictions, meaning that the model can indeed capture this trend, i.e., an expected sudden rise in the number of vulnerabilities. This is important, as the purpose of vulnerability forecasting models is to facilitate decision making during the development of software products by detecting potential future trends in the number of reported vulnerabilities.
For brevity, we do not provide the respective plots for the rest of the vulnerability datasets; however, fitting lines very close to the ground truth (i.e., similarly promising performance) were also observed in the rest of the examined cases (i.e., software projects).
As regards the most important part of model evaluation, i.e., their predictive power on unseen data, by inspecting the results presented in Table 9 we can argue that, as far as the cases of Internet Explorer, Ubuntu Linux and Microsoft Office are concerned, the MAE and RMSE values indicate that both the statistical and DL models are quite efficient in producing 24-steps-ahead forecasts. On the other hand, in the cases of Google Chrome and Apple MacOS the models provide forecasts that are quite far from the "ground truth" (i.e., the real values). To provide a visual inspection of the models' predictive capabilities, Figure 4, Figure 5, Figure 6, Figure 7 and Figure 8 show the forecasted values (in red colour) of the last 24 months for each of the five examined software projects (ground truth in blue colour), generated by the best-performing model in each particular case. As can be seen by inspecting these plots, in most cases the models have managed to learn the peculiarities (e.g., levels, trends) of the projects' vulnerability evolution patterns to a quite satisfactory extent, with the exception of the Apple MacOS case, where they struggle to follow the random spikes that reflect unusually high numbers of reported vulnerabilities.
Furthermore, from Table 9, one can observe that for each covered project both the lowest MAE and the lowest RMSE values are obtained by DL models. However, it can be noticed that the DL superiority is very slight, since the differences with the statistical models, regarding MAE and RMSE, are very small. The only exception is the case of Internet Explorer, where there is a clear advantage of DL. To complement our analysis, the bar charts in Figure 9 and Figure 10 depict the slight lead of the DL models in a clear manner: they plot the lowest MAE and RMSE values per project, respectively.
By inspecting Figure 9 and Figure 10, it can be argued that, although deep learning seems to be a promising option for vulnerability forecasting and, as regards the five examined projects, a slightly more accurate option than the statistical models, the two types of modelling approaches are very similar in terms of predictive performance, since their MAE and RMSE deviate only slightly. In order to verify this observation (i.e., to investigate whether there is a significant difference in the models' predictive power), we conducted the Wilcoxon signed-rank test [58], which determines whether there is a statistically significant difference between two paired samples (i.e., the errors produced by the statistical models and the errors produced by the DL models). For the needs of the Wilcoxon test, we computed the raw errors between the actual and the predicted values for each of the 24 timesteps of the testing period. Table 10 presents the results (i.e., p-values) of the Wilcoxon analysis. By inspecting the results, it is clear that a statistically significant difference between the two samples (i.e., errors) can be observed only in the case of Internet Explorer, since a difference is considered statistically significant only if the p-value is lower than the 0.05 threshold. Hence, we cannot safely conclude that the DL models are better than the statistical models in vulnerability forecasting, since, despite the fact that the former demonstrated lower errors compared to the latter, this difference was not found to be statistically significant.
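For reference, the paired Wilcoxon test described above can be reproduced with scipy as in the following sketch; the two error arrays are placeholders standing in for the 24 monthly errors of the best statistical and the best DL model.

```python
# Sketch of the paired Wilcoxon signed-rank test on forecast errors.
# The error arrays are placeholders, not our experimental results.
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(1)
stat_errors = rng.normal(0, 2.0, size=24)  # placeholder: errors of the best statistical model
dl_errors = rng.normal(0, 1.8, size=24)    # placeholder: errors of the best DL model

statistic, p_value = wilcoxon(stat_errors, dl_errors)
# A p-value below 0.05 would indicate a statistically significant difference
print(p_value, p_value < 0.05)
```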
To sum up, our experimental analysis has shown that the produced forecasting models, either statistical or DL, can be deemed efficient for predicting the evolution of vulnerabilities up to a period of 24 months ahead for three of our examined datasets. More specifically, the models provide satisfactorily accurate forecasts for the cases of Internet Explorer, Ubuntu Linux, and Microsoft Office, whereas they have difficulties in following the unusual spikes and the outliers of Google Chrome and Apple MacOS. In these two cases the forecasts are not so close to the actual values due to the unusual behavior of their data with respect to the reported vulnerabilities. Contrary to Internet Explorer, Ubuntu Linux, and Microsoft Office, where both the statistical and the DL models generate sufficiently accurate forecasts, in Google Chrome and Apple MacOS neither model type seems sufficient. This observation leads us to the conclusion that the vulnerability forecasting task in the Google Chrome and Apple MacOS cases is challenging, not because of the models' incapacity per se, but because of the inherently noisy nature of their data.
At this point, a crucial comment about the obtained RMSE and MAE values has to be made. The RMSE estimation is more sensitive to extreme cases and outliers, whereas the MAE estimation is more robust to outliers. In cases where the vulnerability evolution is characterized by many unusual spikes, the RMSE can be inflated even though the model captures the trend of the vulnerability evolution. In Figure 7, we can see that the LSTM manages to predict the trend of the vulnerabilities in Ubuntu Linux, even if it cannot predict their exact values at the spikes. It also manages to predict the ordinary values, which leads to a low MAE. In vulnerability forecasting, we believe that MAE is more critical for the evaluation of a model, as it depicts the model's ability to predict the number of vulnerabilities in the majority of timesteps, whereas the exact values of the outliers and the spikes are not so important, since the models capture the trend. In other words, the models can identify the spikes and any sharp increase/decrease, which is more important than the actual values at the spikes for decision making support.
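A toy calculation (with invented numbers) illustrates this asymmetry: a single badly missed spike inflates RMSE far more than MAE, even though the forecast tracks the ordinary months well.

```python
# Toy illustration of MAE vs. RMSE sensitivity to a spike. Invented numbers.
import numpy as np

actual = np.array([2, 3, 2, 4, 3, 20])    # last month is an unusual spike
forecast = np.array([2, 3, 3, 4, 3, 8])   # spike under-predicted, trend captured

mae = np.abs(actual - forecast).mean()                   # (0+0+1+0+0+12)/6 ≈ 2.17
rmse = np.sqrt(((actual - forecast) ** 2).mean())        # ≈ 4.92
print(mae, rmse)
```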
Regarding the comparison of these two model types, which is the main subject of the present study, we found that, although the statistical models achieved a better goodness-of-fit level with a higher R² on the training dataset, the DL models predicted the held-out test data more accurately, providing lower MAE and RMSE scores. Despite their marginal superiority, the DL models' results indicate that they can be considered a promising technique in the field of software vulnerability forecasting, especially in the near future, when more data about reported vulnerabilities are expected to become available.
However, based on the specific models that we applied and the specific datasets that we utilized, none of the examined models managed to demonstrate consistently good results across all the studied projects. Different models demonstrated better results on different software projects. An interesting observation, though, was that the model type did not seem to affect the predictive capability of the final forecasting models, since in the cases of Internet Explorer, Ubuntu Linux, and Microsoft Office both statistical and DL models were able to provide sufficient predictions with highly similar predictive performance, whereas in the other two studied software products both of them failed to provide good forecasts, again with highly similar performance. This leads to the conclusion that the choice between statistical and DL models is still project-specific and associated with the project's particular vulnerability characteristics (e.g., unusual spikes, outliers, and zero-inflated time series).
Hence, the main results of the current work can be summarized as follows:
The statistical and the DL models had a similar performance regarding the forecasting of the number of vulnerabilities in software systems.
The statistical models achieved a better goodness-of-fit level (i.e., on the training dataset), whereas the DL models had a slight (but not statistically significant) superiority in terms of predictive power on unseen data.
The model selection process depends on the specific project under examination.
The vulnerability forecasting task is significantly affected by the nature of the data. When there were unusual spikes and many zero values in data, all the examined models had difficulty in predicting the number of vulnerabilities.
5. Conclusions and Future Work
In this paper, we compared the capacity of statistical and Deep Learning (DL) models in forecasting the future number of vulnerabilities (i.e., 24 months ahead) in software projects. For this purpose, we exploited the NVD database and gathered data about the number of reported vulnerabilities for five popular software projects. We then developed several models and evaluated them both in terms of goodness-of-fit and predictive power. We showed that DL-based models are competent enough to forecast vulnerabilities and that their performance is comparable to the traditional statistical models that are commonly used in vulnerability forecasting. In fact, the DL models were found to have slightly better predictive power than the statistical models, but the observed difference in their predictive performance was not found to be statistically significant.
Furthermore, we noticed that the selection of the optimal forecasting model is a project-specific task, as it depends on the special characteristics of each dataset. There were software projects where a DL model had a clear advantage (e.g., Internet Explorer), whereas in other projects the best statistical and DL predictors were really close to each other (e.g., Ubuntu Linux). There were also some projects (e.g., Apple MacOS) where the two-years-ahead forecast appeared to be a really challenging task for both statistical and DL models.
There are several potential directions for future work. First of all, an interesting direction would be to explore whether there are patterns inside the source code that can be related to the evolution of the number of vulnerabilities and whether these patterns can be identified by natural language processing techniques or by software metrics-based models. In other words, we are planning to examine whether multi-variate forecasting models could lead to better results in vulnerability forecasting, by incorporating features retrieved from the source code of the software products. We also aim to utilize more of the information that exists in the NVD about the reported vulnerabilities, such as the severity and the impact score of the vulnerabilities, in order to build multi-variate forecasting models.