An ML model is a mathematical model that generates predictions by finding relationships between patterns of the input features and labels in the data [
137]. Thus, when using machine learning for any task, it is common to test different types of models and fine-tune them to find the one that best suits the application [
20]. In cybersecurity, due to the dynamic scenarios presented in many tasks, streaming-data models are strongly recommended to achieve good performance, given that the data come from non-stationary distributions, new data are produced all the time, and such models can be easily updated or adapted as new data arrive [
20]. As a consequence, it is important to understand how to effectively use and, sometimes, implement an ML model in these scenarios, given that models may present drawbacks that make them infeasible in a real application.
6.1 Concept Drift and Evolution
Concept drift is the situation in which the relation between the input data and the target variable (the variable to be predicted, such as a class or regression variable) changes over time [
60]. It usually happens when there are changes in a hidden context, which makes it challenging, since this problem spans different research fields [
156]. In cybersecurity, these changes are caused by the arms race between attackers and defenders, since attackers are constantly changing their attack vectors when trying to bypass defenders’ solutions [
34]. In addition, concept evolution is another problem related to this challenge, which refers to the process of defining and refining concepts, resulting in new labels according to the underlying concepts [
92]. Thus, both problems (drift and evolution) might be correlated in cybersecurity, given that new concepts may result in new labels, such as new types of attacks produced by attackers. As shown in Figure
11, there are four types of concept drift according to the literature: (i) sudden drift, when a concept is suddenly replaced by a new one; (ii) recurring concepts, when a previous active concept reappears after some time; (iii) gradual drift, when the probability of finding the previous concept decreases and the new one increases until it is completely replaced; and (iv) incremental drift, when the difference between the old concept and the new one is very small and the difference is only noticed when looking at a longer period [
95]. In security contexts, a sudden drift occurs when an attacker creates a totally new attack; a gradual drift occurs when new types of attacks are created and progressively replace previous ones; a recurring concept occurs when an old type of attack starts to appear again after a given time; and an incremental drift occurs when attackers make small modifications to their attacks, such that the concept only changes noticeably over a long period.
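To make these drift types concrete, the sketch below (assuming scikit-multiflow's synthetic stream generators) builds one stream with a sudden drift and one with a gradual drift between two concepts; the width parameter controls how abrupt the transition is, and an incremental drift would correspond to a very large width between closely related concepts.

```python
# Sketch: simulating sudden and gradual concept drift with scikit-multiflow.
from skmultiflow.data import ConceptDriftStream, SEAGenerator

# Two different "concepts" (decision functions) of the SEA synthetic generator.
old_concept = SEAGenerator(classification_function=0, random_state=1)
new_concept = SEAGenerator(classification_function=2, random_state=1)

# Sudden drift: the new concept replaces the old one almost instantly (width=1).
sudden = ConceptDriftStream(stream=old_concept, drift_stream=new_concept,
                            position=5000, width=1, random_state=1)

# Gradual drift: the probability of drawing from the new concept grows over
# 2000 samples until it completely replaces the old one.
gradual = ConceptDriftStream(stream=old_concept, drift_stream=new_concept,
                             position=5000, width=2000, random_state=1)

X, y = sudden.next_sample(10)   # draw a small batch from the drifting stream
```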
Despite being considered a challenge in cybersecurity [
62], few works in the literature have addressed both problems. For instance, Masud et al. were, to the best of our knowledge, the first to treat malware detection as a data stream classification problem and to mention concept drift. The authors proposed an ensemble of classifiers trained from consecutive chunks of data using
v-fold partitioning of the data, reducing classification error compared to other ensembles and making it more resistant to changes when classifying real botnet traffic data and real malicious executables [
102]. Singh et al. proposed two measures to track concept drift in static features of malware families: relative temporal similarity and meta-features [
143]. The former is based on the similarity score (cosine similarity or Jaccard index) between two time-ordered pairs of samples and can be used to infer the direction of the drift. The latter summarizes information from a large number of features, which is an easier task than monitoring each feature individually. Narayanan et al. presented an online ML-based framework named DroidOL to handle concept drift and detect malware [
110]. To do so, they use inter-procedural control-flow sub-graph features in an online passive-aggressive classifier, which adapts to malware drift and evolution by updating the model more aggressively when the error is large and less aggressively when it is small. They also propose a variable feature-set regimen that adds new features to samples, using their values when present and ignoring them when absent (i.e., treating their values as zero). Deo et al. proposed the use of Venn-Abers predictors to measure the quality of binary classification tasks and identify antiquated models, resulting in a framework capable of identifying when models tend to become obsolete [
46]. Jordaney et al. presented Transcend, a framework to identify concept drift in classification models, which compares the samples used to train the models with those seen during deployment [
82]. To do so, their framework uses a conformal evaluator to compute algorithm credibility and confidence, capturing the quality of the produced results in a way that may help to detect concept drift. Anderson et al. showed that, by using reinforcement learning to generate adversarial samples, it is possible to retrain a model and make these attacks less effective, also protecting it against possible concept drift, given that it hardens the machine learning model against worst-case inputs [
4]. Xu et al. proposed DroidEvolver, an Android malware detection system that can be automatically updated without any human involvement, requiring neither retraining nor true labels to update itself [
157]. The authors use online learning techniques with evolving feature sets and pseudo labels, keeping a pool of different detection models and calculating a juvenilization indicator, which determines when to update the feature set and each detection model. Finally, Ceschin et al. compared a set of Windows malware detection classifiers that use batch machine-learning models with ones that take concept drift into account using data streams, emphasizing the need to update the decision model immediately after a drift is flagged by a concept drift detector, a family of state-of-the-art techniques from the data stream learning literature [
37]. The authors also show that the malware concept drift is strictly related to their concept evolution, i.e., due to the appearance of new malware families.
In contrast, the data stream learning literature has already proposed several approaches to deal with concept drift and evolution, called concept drift detectors, which, to the best of our knowledge, have not been fully explored by cybersecurity researchers. There are supervised drift detectors, which take the ground-truth label into account to make a decision, and unsupervised ones, which do not.
DDM (Drift Detection Method) [
59],
EDDM (Early Drift Detection Method) [
14], and
ADWIN (ADaptive WINdowing) [
21] are examples of supervised approaches. Both DDM and EDDM are online supervised methods based on sequential (prequential) error monitoring, where each incoming example is processed separately to estimate the prequential error rate. They assume that an increase in consecutive error rates suggests the occurrence of concept drift. DDM directly uses the error rate, while EDDM uses the distance error rate, which measures the number of examples between two classification errors [
14]. These errors trigger two levels: warning and drift. The warning level suggests that the concept is starting to drift, and an alternative classifier is updated using the examples that arrive while this level is active. The drift level indicates that the concept drift has occurred, and the alternative classifier built during the warning level replaces the current classifier. ADWIN keeps statistics from sliding windows of variable size, which are used to compute the average change observed by cutting these windows at different points. If the difference between two windows is greater than a predefined threshold, it considers that a concept drift has happened, and the data from the older window is discarded [
21]. Different from the other two methods, ADWIN has no warning level: once a change occurs, the data that falls outside the window is discarded and the remaining data is used to retrain the classifier. Unsupervised drift detectors, such as the ones proposed by Žliobaité et al., may be useful when label delays are expected, given that they do not rely on the true labels of the samples, which supervised methods require and which, in cybersecurity, are often not available in practice [
161]. These unsupervised strategies consist of comparing detection windows of fixed length, using statistical tests over the data themselves, over the classifier's output labels, or over its estimations (which may contain errors), to detect whether both windows come from the same source. In addition, active learning may complement these unsupervised methods by requiring the labels of only a subset of the unlabeled samples, which could improve drift detection and overall classification performance.
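As an illustration, the sketch below (a minimal example, assuming scikit-multiflow's DDM and ADWIN implementations) feeds a stream of prediction errors (0 = correct, 1 = wrong) to both detectors; in a real pipeline, the error stream would come from comparing a deployed model's predictions against the (possibly delayed) ground truth, and the warning and drift signals would trigger the update strategies described above.

```python
# Sketch: monitoring a stream of prediction errors with DDM and ADWIN.
import numpy as np
from skmultiflow.drift_detection import ADWIN, DDM

rng = np.random.default_rng(42)
# Simulated error stream: 10% error rate at first, 40% after sample 1000 (a drift).
errors = np.concatenate([rng.binomial(1, 0.1, 1000), rng.binomial(1, 0.4, 1000)])

ddm, adwin = DDM(), ADWIN()
for i, err in enumerate(errors):
    ddm.add_element(err)
    adwin.add_element(err)
    if ddm.detected_warning_zone():
        pass                                           # start training an alternative classifier
    if ddm.detected_change():
        print(f"DDM: drift detected at sample {i}")    # swap in the alternative classifier
    if adwin.detected_change():
        print(f"ADWIN: drift detected at sample {i}")  # retrain on the data kept in the window
```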
Some authors have also created classification models and strategies that deal with both concept drift and concept evolution. Shao et al. proposed SyncStream, a classification model for evolving data streams that uses a prototype-based data representation, a P-Tree data structure, and only a small set of short- and long-term samples based on error-driven representativeness learning (instead of using base classifiers or windows of data) [
139]. ZareMoodi et al. created a new supervised chunk-based method for novel class detection using ensemble learners, local patterns, and connected components of neighborhood graphs [
158]. The same authors also proposed a new way to detect evolving concepts by optimizing an objective function using a fuzzy agglomerative clustering method [
159]. Hosseini et al. created
SPASC (Semi-supervised Pool and Accuracy-based Stream Classification), an ensemble of classifiers where each classifier holds a specific concept, and new samples are used to add new classifiers to the ensemble or to update the existing ones according to their similarity to the concepts [
74]. Dehghan et al. proposed an ensemble-based method that detects concept drift by monitoring the distribution of the ensemble's error, training a new classifier on the new concept to keep the model updated [
45]. Ahmadi et al. created GraphPool, a classification framework that deals with recurrent concepts by looking at the correlation among features, using a statistical multivariate likelihood test, and maintaining the transition among concepts via a first-order Markov chain [
2]. Gomes et al. presented the
Adaptive Random Forest (ARF) algorithm, an adaptation of the classical random forest algorithm with dynamic update methods to deal with evolving data streams. The ARF also contains an adaptive strategy that uses a concept drift detector in each tree to track possible changes and to train new trees in the background [
63]. Finally, Siahroudi et al. proposed a method using multiple kernel learning to detect novel classes in non-stationary data streams [
142]. They do so by computing each new instance's distance to the previously known classes in the feature space and updating the model based on the true labels.
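Among these approaches, the Adaptive Random Forest is readily available in streaming libraries; the sketch below (assuming scikit-multiflow's AdaptiveRandomForestClassifier and a synthetic SEA stream as a stand-in for a real security feed) evaluates it prequentially, i.e., each sample is first used for testing and then for training.

```python
# Sketch: prequential (test-then-train) evaluation of Adaptive Random Forest.
from skmultiflow.data import SEAGenerator
from skmultiflow.meta import AdaptiveRandomForestClassifier

stream = SEAGenerator(random_state=1)                 # stand-in for a real security stream
arf = AdaptiveRandomForestClassifier(n_estimators=10, random_state=1)

# Warm up the ensemble on a small initial batch before evaluating.
X_init, y_init = stream.next_sample(200)
arf.partial_fit(X_init, y_init, classes=[0, 1])

correct, total = 0, 5000
for _ in range(total):
    X, y = stream.next_sample()
    if arf.predict(X)[0] == y[0]:                     # test on the incoming sample first...
        correct += 1
    arf.partial_fit(X, y)                             # ...then train on its true label
print(f"Prequential accuracy: {correct / total:.3f}")
```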
We advocate for more collaboration between data stream learning and cybersecurity: the majority of the cybersecurity works presented in this section do not use data stream approaches (including concept drift detectors), even though the two fields share many practical problems and may benefit each other. For instance, data stream learning could benefit from real cybersecurity datasets that could be used to build real-world ML security solutions, resulting in higher-quality research that may also be useful in other ML research fields. Finally, it is important to develop new drift detection algorithms and to test their effectiveness in different cybersecurity scenarios and with different ML models.
6.2 Adversarial Attacks
In most cybersecurity solutions that use Machine Learning, models are prone to adversarial attacks, in which attackers modify their malicious vectors so that they are not detected [
34]. These techniques were proven effective in both malware and intrusion scenarios [
101], for instance. We already mentioned this problem related to feature robustness in Section
5.2, but ML models are also subject to adversaries. These adversarial attacks may have several consequences, such as allowing the execution of malicious software, poisoning an ML model or drift detector that uses new unknown samples to update its definitions (without ground truth from other sources) and, as a consequence, producing concept drift and evolution. Thus, when developing cybersecurity solutions using ML, both features and models must be robust against adversaries.
Aside from using adversarial features, attackers may also directly attack ML models. There are two types of attacks: white-box attacks, where the adversary has full access to the model, and black-box attacks, where the adversary has access only to the output produced by the model, without directly accessing it [
13]. A good example of white-box attacks is gradient-based adversarial attacks, which consist of using the weights of a neural network to obtain perturbation vectors that, combined with an original instance, can generate an adversarial one that may be classified by the model as being from another class [
66]. Many strategies use neural network weights to produce these perturbations [
13], which affect not only neural networks but also a wide variety of models [
66]. Other simpler white-box attacks, such as analyzing the model itself (for instance, the nodes of a decision tree or the support vectors used by an SVM), could be used to manually craft adversarial vectors by changing the original characteristics of a given sample in a way that affects its output label. In contrast, black-box attacks tend to be more challenging and more realistic for adversaries, given that they usually do not have access to the implementations of cybersecurity solutions or ML models, i.e., they have no knowledge about which features and classifiers a given solution uses and usually only know the raw input and the output. Thus, black-box attacks rely on simply creating random perturbations and testing them in the input data [
71], changing the characteristics of samples based on instances from all classes [
13], or trying to mimic the original model by creating a local model trained with samples submitted to the original one, using the labels returned by it, and then analyzing or using this new model to create an adversarial sample [
118].
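For illustration, the snippet below sketches the classic fast gradient sign method (FGSM), a gradient-based white-box attack, against a generic PyTorch classifier; the model, inputs, and labels are placeholders, and in malware settings the perturbed features would additionally have to map back to a still-functional sample, which this sketch does not address.

```python
# Sketch of a gradient-based (white-box) evasion attack in the FGSM style.
import torch
import torch.nn.functional as F

def fgsm_perturb(model, x, y_true, epsilon=0.05):
    """Return x plus a small perturbation that increases the loss for y_true."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y_true)
    loss.backward()
    # Step in the direction that most increases the loss with respect to the input.
    return (x_adv + epsilon * x_adv.grad.sign()).detach()

# Usage (placeholders): `model` is any differentiable classifier over feature
# vectors, `x` a batch of inputs, and `y_true` their labels.
# x_adv = fgsm_perturb(model, x, y_true)
# print(model(x_adv).argmax(dim=1))   # predictions may now differ from y_true
```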
In response to adversarial attacks, defenders may try different strategies to overcome them, searching for more robust models that make this task harder for adversaries. One response to these attacks is
Generative Adversarial Networks (GANs), which are two-part, coupled deep learning systems in which one part is trained to classify the inputs generated by the other. The two parts simultaneously try to maximize their performance, improving the generation of adversarial samples, which are first used to defeat the classifier and then used to improve its detection by training the classifier with them [
65,
76]. Another valid strategy is to create an algorithm that, given a malign sample, automatically generates adversarial samples by inserting benign characteristics into it, similar to data augmentation or oversampling techniques; these samples are then used to train or update a model. This way, the model learns not only the normal concept of a sample but also the concept of its adversarial versions, which makes it more resistant to attacks [
70,
119]. Instead of using the hard class labels, Apruzzese et al. propose using the probability labels to make random forest-based models more resilient to adversarial perturbations, achieving comparable or even superior results even in the absence of attacks [
7].
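A simple instance of this augmentation idea is sketched below; the adversarial_variants generator is a hypothetical placeholder (here it just adds benign-looking noise), whereas a real generator would have to respect feature semantics so that the perturbed vectors still correspond to working samples.

```python
# Sketch: adversarial augmentation of the training set (placeholder generator).
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

def adversarial_variants(x, n=3, noise=0.1):
    """Hypothetical generator: perturbs a malicious feature vector with
    benign-looking noise; a real generator must keep the sample functional."""
    return [np.clip(x + rng.normal(0.0, noise, x.shape), 0.0, None) for _ in range(n)]

def augment_with_adversaries(X, y, malicious_label=1):
    """Append adversarial variants of every malicious sample to the training set."""
    extra = [v for x, label in zip(X, y) if label == malicious_label
             for v in adversarial_variants(x)]
    return np.vstack([X, *extra]), np.concatenate([y, [malicious_label] * len(extra)])

# Usage with hypothetical training data X_train (features) and y_train (labels):
# X_aug, y_aug = augment_with_adversaries(X_train, y_train)
# model = RandomForestClassifier().fit(X_aug, y_aug)  # also learns the adversarial variants
```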
Some approaches also tried to fix limitations of already developed models, such as
MalConv [
125], an end-to-end deep learning model, which takes as input raw bytes of a file to determine its maliciousness.
Non-Negative MalConv proposes an improvement to
MalConv, with an identical structure but only non-negative weights, which forces the model to look only for malicious evidence rather than for both malicious and benign evidence, making it less prone to adversaries that try to mimic benign behavior [
54]. Despite that, even
Non-Negative MalConv has weaknesses that can be exploited by attackers [
34], which makes this topic an open problem to be solved by future research. We advocate for more work and competitions, such as the
Machine Learning Security Evasion Competition (MLSEC) [
16], that encourage the implementation of new defense solutions that minimize the effects of adversarial attacks.
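Returning to the non-negativity idea behind Non-Negative MalConv, the constraint itself is simple to express; the sketch below is a generic PyTorch illustration (not the actual Non-Negative MalConv implementation) that clamps a model's weights after each optimizer step, so the classifier can only accumulate evidence toward the malicious class.

```python
# Generic sketch of a non-negative weight constraint (not the actual
# Non-Negative MalConv code): clamp the weights after every optimizer step.
import torch

def enforce_non_negative(model: torch.nn.Module) -> None:
    """Project all weight matrices back onto the non-negative orthant."""
    with torch.no_grad():
        for name, param in model.named_parameters():
            if "weight" in name:
                param.clamp_(min=0.0)

# Training-loop fragment (model, optimizer, loss_fn, and batches are placeholders):
# for x, y in batches:
#     optimizer.zero_grad()
#     loss_fn(model(x), y).backward()
#     optimizer.step()
#     enforce_non_negative(model)   # keep the model's weights non-negative
```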
6.3 Class Imbalance
Class imbalance is a problem already mentioned in this work, but on the dataset side (Section
3.3). In this section, we are going to discuss the effects of class imbalance in the ML model and present some possible mitigation techniques that rely on improving the learning process (cost-sensitive learning), using ensemble learning (algorithms that combine the results of a set of classifiers to make a decision) or anomaly detection (or one-class) models [
64,
86]. This way, when using cost-sensitive learning approaches, the generalization made by most algorithms, which tends to ignore minority classes, is adapted to give each class the same importance, reducing the negative impact caused by class imbalance. Usually, cost-sensitive learning approaches increase the cost of incorrect predictions on minority classes, biasing the model in their favor and resulting in better overall classification results [
86]. Such techniques are harder to implement than the sampling methods presented in Section 3.3, but tend to be much faster, given that they only adapt the learning process without generating any artificial data [
64].
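In practice, many libraries expose cost-sensitive learning through class weights; the sketch below (scikit-learn, on a synthetic imbalanced problem) simply makes errors on the minority class cost more during training.

```python
# Sketch: cost-sensitive learning via class weights in scikit-learn.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic, heavily imbalanced binary problem (about 1% positives).
X, y = make_classification(n_samples=10_000, weights=[0.99, 0.01], random_state=0)

# 'balanced' weighs each class inversely to its frequency, so mistakes on the
# minority class cost more during training.
clf = RandomForestClassifier(class_weight="balanced", random_state=0).fit(X, y)

# An explicit cost can also be given, e.g., a 100x penalty on the minority class.
clf_costly = RandomForestClassifier(class_weight={0: 1, 1: 100}, random_state=0).fit(X, y)
```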
In addition, ensemble learning methods that rely on bagging [
29] or boosting techniques (such as AdaBoost [
55]) present good results with imbalanced data [
58], which is one of the reasons that random forest performs well in many cybersecurity tasks with class imbalance problems, such as malware detection [
37]. Bagging consists of training the classifiers of an ensemble on different subsets of the training dataset (sampled with replacement), introducing diversity into the ensemble and improving overall classification performance [
29,
58]. The AdaBoost technique consists of training each classifier of the ensemble on the whole training dataset over successive iterations. After each iteration, the algorithm gives more importance to difficult samples, reweighting the samples that were incorrectly classified so that later classifiers focus on them; this is very similar to what cost-sensitive learning does, but without using an explicit cost to update the weights [
55,
58].
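For reference, the sketch below (scikit-learn, on hypothetical imbalanced data) shows the two ensemble flavors side by side: a bagging ensemble trained on bootstrap samples and an AdaBoost ensemble that reweights hard samples at each iteration.

```python
# Sketch: bagging vs. boosting ensembles with scikit-learn on imbalanced data.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier

# Synthetic imbalanced problem standing in for real (hypothetical) security data.
X, y = make_classification(n_samples=5_000, weights=[0.95, 0.05], random_state=0)

# Bagging: each base tree is trained on a different bootstrap sample of the data.
bagging = BaggingClassifier(n_estimators=50, random_state=0).fit(X, y)

# Boosting: each iteration reweights the samples the previous learners got wrong.
boosting = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X, y)

print(bagging.score(X, y), boosting.score(X, y))  # training accuracy, for illustration only
```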
Even though all the methods presented so far are valid strategies to handle imbalanced datasets, sometimes the class distribution is so skewed that it is not viable to use any of them, given that the majority of the data will be discarded (undersampling), poor data will be generated (oversampling), or the model will not be able to learn the concept of the minority class [64]. In these cases, anomaly detection algorithms are strongly recommended, given that they are trained on the majority class only and the remaining samples (the minority classes) are considered anomalous instances [
64,
131]. Two great examples of anomaly detection models are isolation forest [
98] and one-class SVM [
135]. Both of them try to fit the regions where the training data is most concentrated, creating a decision boundary that defines what is normal and what is an anomaly.
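The sketch below (scikit-learn, with synthetic data standing in for the "normal" majority class) trains both models on normal data only and flags everything outside the learned region as anomalous.

```python
# Sketch: anomaly detection models trained only on the majority ("normal") class.
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_normal = rng.normal(0, 1, size=(5_000, 10))            # majority class only
X_test = np.vstack([rng.normal(0, 1, size=(10, 10)),     # normal-looking samples
                    rng.normal(6, 1, size=(10, 10))])    # far from the training region

iso = IsolationForest(random_state=0).fit(X_normal)
ocsvm = OneClassSVM(nu=0.05).fit(X_normal)

# Both return +1 for inliers ("normal") and -1 for anomalies (the minority class).
print(iso.predict(X_test))
print(ocsvm.predict(X_test))
```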
Finally, when building an ML solution over imbalanced data, besides testing several classifiers and feature extractors, it is also important to consider the approaches presented here on both the dataset and the model sides. It is also possible to combine more than one method, for instance, generating a set of artificial data and using cost-sensitive learning strategies, which could increase classification performance in some cases. We strongly recommend that cybersecurity researchers include some of these strategies in their work, given that it is difficult to find solutions that actually consider class imbalance.
6.4 Transfer Learning
Transfer learning is the process of learning a given task by transferring knowledge from a related task that has already been learned. It has been shown to be very effective in many ML applications [
150], such as image classification [
79,
126] and natural language processing problems [
75,
124]. Recently, Microsoft and Intel researchers proposed the use of transfer learning from computer vision to static malware detection [
40], representing binaries as grayscale images and using inception-v1 [
147] as the base model to transfer knowledge [
41]. The results presented by the authors show a recall of 87.05%, with only a 0.1% false positive rate, indicating that transfer learning may help to improve malware classification without the need to search for optimal hyperparameters and architectures, reducing training time and the use of resources.
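A minimal sketch of this kind of transfer learning is shown below; it assumes torchvision's GoogLeNet (Inception v1) pretrained on ImageNet as the base model and a hypothetical loader of binaries rendered as grayscale images, and it is not the exact pipeline proposed by the cited authors. The convolutional feature extractor is frozen and only a new classification head is trained.

```python
# Sketch: transfer learning from a pretrained Inception v1 (GoogLeNet) to a
# binary malware classifier over grayscale "binary images" (hypothetical data).
import torch
import torch.nn as nn
from torchvision import models

base = models.googlenet(weights="IMAGENET1K_V1")     # pretrained base model
for param in base.parameters():
    param.requires_grad = False                      # freeze the feature extractor

base.fc = nn.Linear(base.fc.in_features, 2)          # new head: benign vs. malicious

optimizer = torch.optim.Adam(base.fc.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Training-loop fragment; `loader` (hypothetical) would yield (image, label) batches,
# with each grayscale binary image replicated to 3 channels to match the base input.
# for images, labels in loader:
#     optimizer.zero_grad()
#     loss_fn(base(images), labels).backward()
#     optimizer.step()
```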
In addition, if the network used as the base model is robust, then it probably contains robust feature extractors. Consequently, by reusing these feature extractors, the new model inherits their robustness, producing solutions that are also robust to adversarial attacks and that achieve high classification performance without much data and without the heavy resource usage of some adversarial training approaches [
138]. While transfer learning can be an advantage, it may also become a problem depending on the base model used: these base models are usually publicly available, which means that potential attackers might have access to them and produce an adversarial vector that affects both models, the base and the new one [
127]. Thus, it is important to consider the robustness of the base model when using it for transfer learning, in order to produce a solution without security weaknesses. Finally, despite presenting promising results, the transfer learning-based malware detection model cited at the beginning of this subsection [
41] may be affected by adversarial attacks, given that its base model is affected by them, as already shown in the literature [
32,
66].
6.5 Implementation
Building a good Machine Learning model is not the only challenge in deploying ML approaches in practice: the implementation of these approaches might also be challenging [
113]. The existing frameworks, such as scikit-learn [
120] and Weka [
72], usually rely on batch learning algorithms, which may not be useful in dynamic scenarios where new data are available all the time (as a stream), requiring the model to be updated frequently with them [
64]. In these cases, ML implementations for streaming data, such as Scikit-Multiflow [
108],
Massive Online Analysis (MOA) [
22], River [
107], and Spark [
103,
144], are highly recommended, since they provide ML algorithms that can be easily used in real cybersecurity applications. Also, adversarial machine learning frameworks, such as CleverHans [
116] and SecML [
104], are important to test and evaluate the security of proposed ML solutions. Thus, contributing to streaming data and adversarial machine learning projects is as important as contributing to well-known ML libraries, and we advocate for that to bring research closer to real-world applications. Note that we are not talking only about contributing new models: preprocessing and evaluation algorithms that are currently designed only for batch learning could also be good contributions to streaming learning libraries. We believe that more contributions to these projects would benefit both industry and academia with higher-quality solutions and research, given the high number of research works using only batch learning algorithms nowadays, even for cybersecurity problems.
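As an example of how naturally these libraries fit the streaming setting, the sketch below uses River's small built-in Phishing dataset and an incremental pipeline in the usual test-then-train loop; any of the stream learners and drift detectors discussed earlier could be dropped in instead.

```python
# Sketch: an incremental (test-then-train) pipeline with River.
from river import compose, datasets, linear_model, metrics, preprocessing

model = compose.Pipeline(
    preprocessing.StandardScaler(),     # updates its statistics on the fly
    linear_model.LogisticRegression(),
)
metric = metrics.Accuracy()

for x, y in datasets.Phishing():        # small phishing-website stream shipped with River
    y_pred = model.predict_one(x)       # test on the new sample first...
    metric.update(y, y_pred)
    model.learn_one(x, y)               # ...then learn from it

print(metric)                           # prequential accuracy over the stream
```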
In addition, multi-language codebases may be a serious challenge when implementing a solution, since different components may be written in different languages, not being completely compatible, becoming incompatible with new releases, or being too slow depending on the implementation and language used [
134]. Thus, it is common to see ML implementations being optimized in C and C++ under the hood, given that they are much faster and more efficient than Python and Java, for instance. Although such optimizations are needed to make many solutions feasible in real-world settings, they are not always performed, given that (i) researchers create their solutions as prototypes that only simulate the real world, not requiring optimizations, and (ii) optimizations require knowledge about code optimization techniques that are very specific or may be limited to a given type of hardware, such as GPUs [
134]. Also, implementing data stream algorithms is a hard task, given that the whole pipeline needs to run continuously: if any component of this pipeline fails, the whole system may fail [
64].
Another challenge is to ensure a good performance for the proposed algorithms and models [
42]. Good performance is essential to deploy ML in the security context, because most detection solutions operate at runtime to detect attacks as early as possible, and slow models will result in a significant slowdown of the whole system operation. To overcome the performance barriers of software implementations, many security solutions opt to outsource the processing of ML algorithms to third-party components. A frequent approach in the security context is to propose hardware devices to perform critical tasks, among which is the processing of ML algorithms [
28]. Alternatively, security solutions might also outsource scanning procedures to the cloud. Many research works proposed cloud-based AVs [
47,
80], which have the potential to include ML-based scans among their detection capabilities and to bring such checks to the market at scale. We understand that these scenarios should be considered in the proposal of new ML-based detection solutions.