1. Introduction
Recently, the Industrial Internet of Things (IIoT) has witnessed one of the most devastating ransomware attacks, which infected major companies such as Colonial Pipeline, disrupting the pipeline’s operations and causing a serious fuel crisis in the Southeastern United States on 7 May 2021. This incident was one of the fastest-spreading attacks carried out by ransomware, which encrypts and/or hijacks data, making them inaccessible to the users [
1,
2,
3]. After encrypting the victim’s assets, the ransomware author demands a ransom for the restoration of the assets (user’s data) into their original states [
4,
5,
6]. If the victim pays the ransom to the attacker through an anonymous currency mechanism such as Bitcoin [
1,
3], the access to the encrypted assets is made available again [
7,
8]. Crypto-ransomware is distributed through a wide range of infection vectors, including application and browser vulnerabilities, extraction of ZIP files, malicious payloads (e.g., CryptoWall), JRE vulnerabilities (e.g., DMA Locker), and exploit kits such as Neutrino, EternalBlue, and EternalRomance.
Ransomware analysis is commonly divided into static and dynamic approaches. The most common detection method used by antivirus software is signature-based detection, which relies on the signatures of known malware stored in its repository. When a new malicious file is discovered, the antivirus vendors need to capture its binary signature by analyzing the instructions of the executable code [
9,
10,
11]. This signature technique is becoming more difficult to apply, since recent malicious applications employ stealth techniques to evade detection [
9,
11]. Ransomware writers employ obfuscation techniques that automatically modify the code into previously unseen versions to evade antivirus software detection [
4]. As a result, signature-based detection is less efficient and reliable against new, unseen ransomware, which can easily escape detection through simple measures such as code obfuscation [
6,
11]. Many researchers employed the analysis of the ransomware executable binary files using a dynamic approach [
12]. To extract the real behavior of the malicious file, samples are executed in a controlled environment such as a sandbox [
13,
14,
15]. In this approach, changes to the system, such as registry modification, file deletion, and encryption, are monitored.
Many researchers are concerned about the detection of executable binary files using machine learning algorithms to test different methods [
14]. The authors in [
15] presented a method for detecting previously unseen polymorphic computer viruses. Their model was based on Support Vector Machine (SVM) algorithm using the system API to detect malicious codes. Although the sample dataset size was small, they showed the model’s detection accuracy and performance by comparing the result of SVM with other learning algorithms. Similarly, Kolter and Maloof [
16] introduced a text classification method that detected and identified malicious binary executable files. They compared machine learning classifiers, such as naive Bayes, decision tree, boosted decision tree and SVM, and the results showed that the boosted decision tree had the best performance. In another work, Singhal and Raul [
17] proposed an advanced machine learning-based method for malware detection. Their module extracted the API calls made by various normal and harmful executables and was implemented at the enterprise gateway level to supplement the antivirus software present on end-user computers.
Due to resource limitations, most ransomware detection solutions proposed to protect the IIoT focus on reducing the data dimensionality and extracting a compact representation of attack patterns [
18,
19,
20,
21]. The common characteristic of these solutions is the reliance on the autoencoder to derive a reduced set of features for model training. The autoencoder transforms the high-dimensional data into another set with a lower dimension. However, this approach focuses on capturing more information regardless of its relevancy. Although several solutions have used variants of the minimum Redundancy Maximum Relevancy (mRMR) for ransomware early detection, they are not suitable for the IIoT environment because they assume that the APIs are noise-free. This assumption does not hold in the IIoT environment, where heterogeneous components are co-located and many of their APIs are not compatible with each other. Therefore, there is a need for models that can be trained on low-dimensional yet informative features.
To enhance the early detection of ransomware in the Industrial IoT, this paper examines several machine learning techniques for classifying benign and malicious executable programs using API call features, and then compares their effectiveness with existing methods in a controlled environment. In summary, this paper presents three main contributions:
A filtering method for API call noise reduction that reduces the size of the API call logs based on the indication of a malicious file.
A Weighted Enhanced maximum-Relevance and minimum-Redundancy (WEmRmR) technique that selects the most informative API calls with a small number of computations. Unlike the original mRmR, which does not rank the features based on their importance in the collection, the proposed WEmRmR applies the term frequency-inverse document frequency (TF-IDF) method to weight the selected features.
An evaluation of the proposed technique on different datasets with several feature sets, together with a comparison of the proposed WEmRmR method with the original mRmR in terms of computational complexity and number of evaluations.
The remainder of this paper is organized as follows. Section 2 discusses the related works. Section 3 presents the proposed methodology. Section 4 presents the experimental results of the research. Finally, in Section 5, the conclusion of the research is presented.
2. Related Work
Recently, supervised and unsupervised machine learning methods were suggested to detect and classify the ransomware and benign application. Ahmed et al. [
13] proposed a highly survivable ransomware (HSR) early detection approach with supervised machine learning classifiers. They employed behavioral-based analysis of HSR to extract the integrated features through Term Frequency-Inverse Document Frequency (TF-IDF). Their experimental results achieved high accuracy and a lower false-positive rate for detecting HSR in the early phases of the attack. The authors employed seven features to distinguish the ransomware from the benign executable files. However, some features are shared by malicious and benign applications, which can lead to poor characterization of the ransomware behavior and cause more false-positive alarms. Dynamic behavior-based crypto-ransomware detection was introduced by Sgandurra et al. [
8], who proposed the EldeRan framework to capture the characteristics of ransomware. The authors applied a machine learning algorithm, the Regularized Logistic Regression classifier, which achieved a 96.3% detection rate with an area under the ROC curve of 0.995. However, identifying a fixed time for ransomware detection is not appropriate for all ransomware samples, since some variants display their malicious activities only after human interaction [
22,
23]. Iglesias and Zseby [
24] proposed mRmR, WMR, SAM, and LASSO for a multi-stage feature reduction with 41 traffic features. In their work, they reduced the feature set to the 16 most important features. For classification, different algorithms, such as Decision Tree (DT), k-Nearest Neighbor (kNN), Naïve Bayes (NB), Least Absolute Shrinkage and Selection Operator and the Least Angle Regression (LASSO-LAR), Artificial Neural Network (ANN) and Support Vector Machine (SVM), were employed with five-fold cross-validation. However, the complexity and the number of evaluations of the mRmR-generated features are not clearly examined in this work.
Similar to traditional systems, IIoT systems are targeted by ransomware attacks, whose effect is mostly catastrophic as they disrupt critical infrastructures and industrial operations. In their study, [
18] investigated the likelihood that targeted ransomware attacks disturb the edge layer of IIoT. The attack vectors have been explored and both dynamic and static analysis were carried out. The study pinpointed the importance of the inclusion of kernel-related parameters in detecting the ransomware. However, no solution has been proposed for mitigating such attacks. In the study conducted by [
19], the authors proposed a deep learning-based technique to extract the latent representation of ransomware attack patterns in IIoT. The solution relies on a deep autoencoder to reduce data dimensionality. However, one of the major limitations of the autoencoder is that it learns to capture as much information as possible rather than as much relevant information as possible. Consequently, the ability of the model to perceive the attack patterns is negatively affected. The high data dimensionality was also investigated [
20], and a solution based on stacked variational autoencoder (VAE) was proposed. The data were augmented using the VAE and new artificial observations have been generated. However, the autoencoder sacrifices the relevancy of the extracted features. The same approach has also been used by [
21] as they built the detection model based on the Constructive Denoising Auto-Encoder (CDAE) coupled with the Convolutional Neural Network (DNN). The model, however, tries to preserve less information at the cost of losing the representativeness.
An enhanced mRMR feature selection technique was proposed by Al-rimy et al. [
23], which relies on the Redundancy Coefficient Gradual Upweighting (RCGU) technique to capture the discriminative features from ransomware pre-encryption data. By evaluating each feature individually, the RCGU overcomes data insufficiency and provides robust features. Regarding the feature selection approach, filter methods are proposed for detection of a ransomware sample by selecting the informative runtime characteristics of ransomware. Huda et al. [
1] employed a non-signature-based framework with API calls using a hybrid of a support vector machine wrapper and a filter-based approach. The authors combined filters with the wrapper approach to identify the selective features. In the experimental results, the mRmR was used with the API calls’ scores to feed an SVM-based wrapper heuristic algorithm that reached an accuracy of 94.362% with 291 APIs.
The supervised learning was also used to build ransomware detection models. These models are trained using data collected during ransomware analysis [
25,
26,
27]. Such an analysis takes place either statically or dynamically [
28]. For ransomware analysis, feature selection is a main step that many existing research works employ to introspect the latent characteristics of the malicious program [
29,
30,
31,
32]. For static analysis, ransomware’s Portable Executable (PE) file is unpacked, and the source code is introspected to extract the malicious patterns. Several studies followed this approach [
33,
34,
35]. Although static analysis is fast, safe and accurate in identifying previously known ransomware samples, this approach suffers from several flaws [
36,
37,
38]. Particularly, static analysis is unable to deal with evasive strains that leverage obfuscation and packing techniques to change their structures and/or protect the code from being analyzed [
38,
39,
40,
41,
42,
43]. In addition, the static analysis is not suitable for crypto-ransomware early detection as the detection relies on runtime data which cannot be provided by the static analysis.
On the other hand, the dynamic analysis captures runtime data generated during the execution of ransomware samples [
28]. Such data represent the behavioral aspect of the malicious software, hence can be used to detect the attacks [
34,
44]. Like static data, dynamic data can be introspected, and malicious patterns can be extracted [
45]. Due to its efficacy for countering the sophisticated ransomware families that employ polymorphic techniques to deceive detection, the dynamic analysis gained popularity in the research community [
23,
41,
42,
46,
47,
48,
49,
50]. During the dynamic analysis, several types of data are collected, including API calls, PE contents, and file system, memory, CPU and I/O statistics [
2]. The collected data are used to extract several attack patterns by which detection models are built. However, the main drawback of ransomware dynamic analysis is the inclusion of much irrelevant data and noise due to the overlap between the runtime data generated by the ransomware process and those generated by other applications running simultaneously in the system. As such, it is important to distinguish between the patterns pertaining to ransomware behavior and those of other applications.
3. Methodology
In this section, we present the research methodology of the proposed technique, which detects ransomware in its early phases using supervised machine learning. We discuss the general architecture, which contains four main steps: data acquisition, data analysis, feature extraction, and selection of the informative features using the Enhanced maximum-Relevance and minimum-Redundancy (EmRmR) technique ranked with the TF-IDF algorithm. Finally, supervised machine learning algorithms are trained on the resulting set of prominent features.
3.1. Dataset Collection, Pre-Processing and Analysis
In this study, the dataset that contains benign and ransomware samples is employed.
Figure 1 shows the workflow of the dynamic sample execution environment. It consists of a dataset that combines benign and ransomware samples, the ground truth provided by VirusTotal, the Cuckoo Sandbox, and the API corpus. Similar to [
19,
21], the sandbox simulates the edge gateway for the IIoT system. All dataset instances are Windows Portable Executable (PE) binaries. We collected a total of 1500 unique samples, comprising both benign and malicious files. The 450 benign samples are executable files collected from the “System32” directory of a fresh installation of a 64-bit Windows environment. We also acquired a total of 1050 ransomware executable files from publicly available computer virus websites, such as VirusShare, Maltrieve and VirusTotal. These samples are representative of real-world crypto-ransomware and were gathered from 16 different families, such as Petya, Kovter, WannaCry, Cerber, Citroni, Reveton, Kollah, Torrent Locker, Dirty Decrypt, Crypt-Locker, Crypto-Wall, Trojan-Ransom, Tesla-Crypt, and Pgpcoder.
To determine the maliciousness of the files, we double-checked the MD5 hash values of the samples using the VirusTotal service, which compares the obtained hashes against 57 common antivirus engines. The collected samples contained duplicate MD5 hashes listed under different ransomware names, which may lead to poor detection by the model and increase the false-positive rate. To remove these redundancies, we rechecked each sample’s hash through the online VirusTotal service. For ransomware family categorization, we employed the antivirus vendors’ labelling scheme to classify the samples’ names. Once the data acquisition and labelling processes are finished, the pre-processing tasks, such as file type identification and removal of duplicate files, begin to clean the dataset.
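The duplicate-removal step can be illustrated with a minimal sketch that hashes every file in a folder and keeps one file per distinct MD5 value. The function names and the idea of a single flat sample folder are assumptions made for illustration; the lookup against VirusTotal is not shown.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def deduplicate_samples(sample_dir):
    """Map each distinct MD5 hash to the first file (in sorted order) that has it."""
    unique = {}
    for path in sorted(Path(sample_dir).iterdir()):
        if path.is_file():
            # setdefault keeps the first path seen for a hash and ignores later copies
            unique.setdefault(md5_of_file(path), path)
    return unique
```

The returned dictionary holds one representative path per hash, so its length gives the number of unique samples regardless of how many names the same binary was stored under.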
Since ransomware writers disguise themselves by employing evasion techniques, such as obfuscation, compression, and encryption to subvert the static analysis approach, we used a dynamic analysis approach to gain valuable information related to ransomware behavior. In this method, malware instances are executed inside a guest virtual machine with a host-sandbox to expose the dynamic malicious activities of the samples. To capture the runtime characteristics of samples, we employed Cuckoo Sandbox, a free open-source tool for automation of malware analysis [
25]. Cuckoo monitored and recorded information in terms of the API calls, network traffic, changes of files and folders, processes and memory dumps.
We installed Cuckoo Sandbox on a fully updated Ubuntu 16.04 LTS Desktop as the host operating system. The guest machine ran Windows XP Service Pack 3 (32-bit), as it has fewer security protections and therefore exposes more ransomware activity. We executed every sample in the sandbox for 4 to 9 min to exhibit its malicious behavior; however, modern ransomware may pause its execution until human interaction occurs, such as a mouse event or a key press. Therefore, we employed a Python script that runs in the sandbox to perform normal user activities, such as clicking and creating and deleting documents and folders on the desktop. Once the sample execution is completed, the sandbox outputs a human-readable file in JavaScript Object Notation (JSON) format.
The framework for the ransomware detection-based supervised-approach with WEmRmR is shown in
Figure 2. The framework is composed of five components, namely data collection, sample execution, API gathering, API refinement, and the detection engine. During data collection, the ransomware and benign samples are collected to build the dataset, and labels are added for each sample based on the decision of VirusTotal. During sample execution, both benign and ransomware samples are submitted to the sandbox for analysis. The APIs are captured and stored in trace files. Then, the refinement is conducted, where the APIs are purified and the failed ones are identified. During this phase, the features are extracted using the N-gram technique and then selected using the proposed EmRmR. The selected features are then used in the detection engine to train the model.
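As an illustration of the N-gram step, the sketch below counts contiguous n-grams of API call names in a single trace. The function name, the default n = 2, and the example API names are assumptions for illustration; the paper does not state which value of n was used.

```python
from collections import Counter


def api_ngrams(api_sequence, n=2):
    """Count contiguous n-grams of API call names in one trace."""
    # Zipping n shifted copies of the sequence yields every window of length n.
    grams = zip(*(api_sequence[i:] for i in range(n)))
    return Counter(grams)


# Hypothetical trace used only to show the shape of the output.
trace = ["NtOpenFile", "NtReadFile", "NtWriteFile", "NtReadFile", "NtWriteFile"]
bigrams = api_ngrams(trace, n=2)
```

Each key of the resulting counter is a tuple of n consecutive API names, and its value is the number of times that sequence occurred, which can serve as one entry of a feature vector.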
3.2. Run-Time Feature Extraction
The behavioral JSON log files generated by the sandbox are then passed to the extraction phase to obtain the relevant API calls. Table 1 shows samples of runtime API calls used by the ransomware during the execution. Although this output log file contains several groups of analyzed malware activities, we focused only on the behavioral part that defines the runtime characteristics of the malicious samples. API call features are generated from the malicious log files and fed into machine learning algorithms for detection purposes. API calls are considered to be among the most promising features for the detection and classification of ransomware, since they expose the behavior of suspicious activities and the attack patterns [
16]. To expose the malicious activities, a ransomware file requires a service from the operating system through an application program interface (API) call that represents the essential behavior of the malware [
27].
The collected log files contain a huge number of runtime features that occupy hundreds of MBs of memory, so retrieving the most important elements from these JSON files manually is practically infeasible. Therefore, we employed a parsing algorithm to convert the JSON-formatted strings into objects suitable for machine learning, as shown in Algorithm 1. The parsing algorithm reads the content of the output JSON file of every process and extracts the entire trace, which consists of the API calls with their parameters and values. To reduce the size of the traces, the parameters and the return values of the API calls were eliminated from the log files. The names of the related API calls were extracted to create a feature vector based on the process timestamp.
Algorithm 1: API Calls Feature Extraction
Input: Set path that contains a static feature
Output: Extracted files
Function GetFeature(json_data, states)
    for process in json_data[‘behavior’][‘processes’] do
        if json_data_process is equal to states then
            set FST = process_first_seen
            if FS is greater than FST or FS is equal to zero then
                set FS = FST
            for features F in json_data_process do
                if F not in Dict then
                    set Dict[F][timestamps] = F_time
                    set Dict[F][count] = 1
                else
                    append F_time to Dict[F][timestamps]
                    set Dict[F][count] = Dict[F][count] + 1
    return FS, Dict
Function ParsedJson_Files(JR, PR)
    for (JR, PR) in G_file() do
        for i, name in enumerate(all_files) do
            if name ends with (‘json’) then
                initialize files that match (name)
                open the JSON data file
                load the JSON data
                GetFeature(JSON data, true)
    Print(PR)
In the main function
    Input JR ← parse_directory
    Input PR ← F
    store the user’s input in the (JR, PR) variables
    ParsedJson_Files(JR, PR)
    if name is equal to main then terminate
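The extraction described above can be sketched in Python as follows. This is a minimal illustration and not the authors' implementation: it assumes a report layout in which `report["behavior"]["processes"]` is a list of processes, each with a `first_seen` value and a `calls` list whose entries carry `api` and `time` keys. These key names follow the layout commonly seen in Cuckoo reports but may differ between sandbox versions, and the function names are hypothetical.

```python
import json
from pathlib import Path


def extract_api_calls(report):
    """Collect API names with their timestamps and counts from one parsed report.

    Arguments and return values of the calls are deliberately ignored, so only
    the API names and the times at which they were invoked are kept.
    """
    first_seen = None
    features = {}
    for process in report.get("behavior", {}).get("processes", []):
        seen = process.get("first_seen")
        # Track the earliest process start across the whole report.
        if seen is not None and (first_seen is None or seen < first_seen):
            first_seen = seen
        for call in process.get("calls", []):
            entry = features.setdefault(call["api"], {"timestamps": [], "count": 0})
            entry["timestamps"].append(call.get("time"))
            entry["count"] += 1
    return first_seen, features


def parse_directory(log_dir):
    """Apply extract_api_calls to every *.json file directly under log_dir."""
    results = {}
    for path in sorted(Path(log_dir).glob("*.json")):
        with open(path, encoding="utf-8") as handle:
            results[path.name] = extract_api_calls(json.load(handle))
    return results
```

Keeping a per-API list of timestamps alongside the count preserves the order of the calls, which is what later steps need to build time-ordered feature vectors.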
3.3. Weighted Enhanced Maximum Relevance and Minimum Redundancy (WEmRmR)
The output of the JSON files includes a huge number of unrelated runtime API calls. Ransomware authors inject many API calls into the execution flow of the program, which leads to a noisy feature vector. The detection results of machine learning become poor because the real runtime ransomware activities are hard to monitor and analyze. Therefore, many researchers have employed the Maximum Relevance and Minimum Redundancy (mRmR) algorithm that was suggested by Peng [
31]. The mRmR is a popular filter method that selects informative features with high relevance to the target class and reduces the redundancy among the selected features [
29]. The implementation of mRmR has been effectively employed in many different fields, such as video processing, microarray gene expression and data analysis [
30]. In the mRmR approach, the maximum relevance (mR) selects the dynamic runtime features correlated to the characteristics of ransomware types without considering the relationships among features. If $S$ is a set of ransomware features and $I(f_i; c)$ is the mutual information between the feature $f_i$ and the malicious target class $c$, the mR is calculated as:
$$\max D(S, c), \quad D = \frac{1}{|S|} \sum_{f_i \in S} I(f_i; c)$$
The maximum relevance (mR) describes the runtime characteristics related to the malicious file, but mR causes redundancies [
18]. To address this replication issue in mR, the minimum Redundancy (mR) criterion is employed and illustrated as:
$$\min R(S), \quad R = \frac{1}{|S|^2} \sum_{f_i, f_j \in S} I(f_i; f_j)$$
Therefore, the combination of $D$ and $R$ into one function is described by mRMR using the mutual information difference $\Phi$ explained as follows:
$$\max \Phi(D, R), \quad \Phi = D - R$$
The disadvantage of mRmR is the redundant calculation of mutual information (MI) among pairs of features; therefore, we propose a slightly modified version, the Enhanced mRmR (EmRmR) method, which starts with an empty set of selected features. This approach identifies the most informative features from the original feature set and eliminates the noise. The repetition does not finish until the selected subset reaches the required number of features. Because of this repetition, the original mRmR algorithm recalculates the MI value of each feature at every iteration.
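To make the idea of avoiding repeated MI calculations concrete, the following sketch implements a standard greedy mRMR selection under the mutual information difference criterion and caches the summed redundancy of each candidate, so that every (candidate, selected) pair is evaluated only once. It is one plausible reading of the enhancement under stated assumptions, not the authors' EmRmR code; the function names and the dictionary-of-columns input format are hypothetical.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information (in nats) between two equal-length discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    # p(x, y) * log(p(x, y) / (p(x) * p(y))), written with raw counts
    return sum(
        (c / n) * math.log(c * n / (px[x] * py[y]))
        for (x, y), c in joint.items()
    )


def greedy_mrmr(columns, labels, q):
    """Pick q feature names by relevance minus mean redundancy.

    columns maps a feature name to its list of discretised values.
    """
    relevance = {f: mutual_information(v, labels) for f, v in columns.items()}
    redundancy = {f: 0.0 for f in columns}  # cached sum over selected features
    selected = [max(relevance, key=relevance.get)]
    while len(selected) < min(q, len(columns)):
        newest = selected[-1]
        best, best_score = None, -math.inf
        for f in columns:
            if f in selected:
                continue
            # Only the MI with the most recently selected feature is new.
            redundancy[f] += mutual_information(columns[f], columns[newest])
            score = relevance[f] - redundancy[f] / len(selected)
            if score > best_score:
                best, best_score = f, score
        selected.append(best)
    return selected
```

With the cache, selecting q out of F features costs on the order of q × F mutual-information evaluations, instead of re-evaluating every pair between candidates and already-selected features at each iteration.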
The features generated by the EmRmR are not ranked; therefore, we need to weight the selected features based on their importance in the document collection. We employed the term frequency-inverse document frequency (TF-IDF) method to weigh the features as follows:
$$w_{t,d} = \mathrm{TF}(t, d) \times \mathrm{IDF}(t)$$
where $w_{t,d}$ is the score of word $t$ in the feature pool $d \in D$, TF($t$, $d$) is the frequency of term $t$ in document $d$, and IDF($t$) is the inverse document frequency of $t$. Algorithm 2 presents the pseudocode of the proposed Weighted Enhanced maximum Relevance and minimum Redundancy (WEmRmR).
Algorithm 2: Weighted API Call Extracted using Enhanced maximum Relevance and minimum Redundancy (WEmRmR)
Input: Discretised data d, number of features in d is F, subset features S, class c, number of features to select q
Output: selected feature subset S
Initialization
S = ∅
for each feature f in d do
    Relevance(f) = Mutual_Info(f, c)
    Aggregate_Redun(f) = 0
end for
f* = Max(Relevance)
S = [f*] // To store the highest scorer
for t = 1 : q − 1 do
    size = len(S)
    while ν++ < size do
        for each feature f not in S do
            Aggregate_Redun(f) = Aggregate_Redun(f) + Mutual_Info(f, S[ν])
        end for
    end while
    add to S the feature f that maximizes Relevance(f) − Aggregate_Redun(f)
end for
return selected feature subset S
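The weighting step can be sketched as follows, treating each sample's API trace as one document and ranking the already-selected API calls by their summed TF-IDF weight. The IDF form log(N / df), the summation over documents, and the function name are assumptions of this sketch, since the text does not specify them.

```python
import math
from collections import Counter


def tfidf_rank(traces, selected):
    """Rank the selected API calls by their summed TF-IDF weight over all traces.

    traces is a list of API-name lists, one per analysed sample.
    """
    n_docs = len(traces)
    counts = [Counter(trace) for trace in traces]
    scores = {}
    for api in selected:
        df = sum(1 for c in counts if api in c)  # documents containing the call
        idf = math.log(n_docs / df) if df else 0.0
        # Term frequency is the call's share of each non-empty trace.
        scores[api] = sum(c[api] / len(t) * idf for c, t in zip(counts, traces) if t)
    return sorted(selected, key=lambda api: scores[api], reverse=True)
```

Under this weighting, a call that appears in every trace receives an IDF of zero and falls to the bottom of the ranking, while calls concentrated in a few traces are promoted.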
4. Experimental Results
In this section, we present the experimental results of several machine learning algorithms, namely Logistic Regression (LR), k-Nearest Neighbor (kNN), Random Forest (RF), Decision Tree (DT), AdaBoost (AD) and Support Vector Machine (SVM), on various feature dimensions generated by the EmRmR and by the weighted EmRmR (WEmRmR). To evaluate the proposed detection scheme, the accuracy, F-measure, precision, recall, and Area Under the Curve of the models are reported. The definitions of these performance metrics are given in
Table 2. To reduce the dimensionality of the features, we selected the Top-k feature set: 30, 60, 90, 120 and 150.
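For reference, the sketch below computes these metrics from the four confusion-matrix counts using their standard definitions, which Table 2 is assumed to follow. The function name and the treatment of a zero denominator as 0.0 are choices made for this illustration.

```python
def classification_metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall, F-measure and false-positive rate from counts."""
    def ratio(num, den):
        # Guard against empty classes instead of raising ZeroDivisionError.
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "f_measure": ratio(2 * precision * recall, precision + recall),
        "fpr": ratio(fp, fp + tn),
    }
```

Here the ransomware class is taken as the positive class, so a false positive is a benign file flagged as ransomware.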
4.1. Accuracy
The accuracy of the classifiers describes the performance of the trained and the tested models. To evaluate the accuracy, we experimented with the sub-selected k features from the EmRmR and present the results in
Table 3 and
Table 4. The accuracy of the classifiers ranges from 57.61% to 98.63% for the top-k feature sets of 30, 60, 90, 120, 150 and 180, with 10-fold cross-validation.
Table 3 illustrates the classifiers’ accuracy on the selected k features without ranking. The lowest accuracy (75.10–89.92%) among the selected k dimensions was shown by the classifiers when k = 30 was employed to train and evaluate the detection model. This poor detection accuracy is probably because such a short set of behavioral characteristics is insufficient to distinguish the pattern of the malicious files from that of normal files. Moreover, some ransomware families tend to carry out the entire attack within a short time to inflict the maximum damage before detection takes place. Such behavior is challenging as it does not give enough time to capture the data needed for accurate detection.
The highest accuracy (76.93–98.64%) was achieved by the classifiers when the top k = 90 feature set was employed. However, when we expanded the feature dimensionality to k = 150 and k = 180, the classifiers’ detection accuracy declined dramatically (78.23–89.14%). This is because of overfitting on high-dimensional data, which adversely affects the detection accuracy. In particular, adding more features feeds the model with less relevant and noisy input, which confuses the model and makes it less sensitive to the actual attack patterns.
On the other hand, we also evaluated the accuracy of the classifiers based on the WEmRmR method on the same
k feature dimensionality as shown in
Table 4. Overall, the results of this experiment are similar to those of the previous one and are satisfactory. Interestingly, the classifiers significantly outperformed those in the previous experiment when the
,
and
is used. This is attributed to the ability of WEmRmR to filter out the noisy, irrelevant, and redundant features, which improves the discriminative capability of the detection model.
4.2. F-Measure
In this section, we evaluate the F-measure to convey the balance between precision and recall on the selected feature subsets. This overcomes the problem of bias towards the majority class in imbalanced datasets, as is the case for the ransomware dataset used in this study.
Figure 3 compares the F-measure of each classifier trained and evaluated with different
dimensions without the ranking scheme. The highest F-measure values were achieved by the SVM (0.988) and DT (0.986) on a feature subset of
, followed by the LR (0.964), RF (0.917) on
. The AD and kNN classifiers showed a fairly good F-measure of 0.903 and 0.842 with
and
feature dimensions, respectively. We observed the lowest F-measure presented by the AD (0.554) when
of the feature set was employed to train and test the detection model. This is due to the lack of diverse features that AdaBoost needs to build the base estimators that introspect the data from different perspectives.
Figure 4 shows that the classifiers on the weighted feature dimensions have a better F-measure than those on the previous unranked feature dimensions. From the result in
Figure 4, it is obvious that the classifiers on
achieved the highest F-measure, ranging from 0.727 to 0.989. More specifically, the LR classifier showed the highest F-measure of 0.989 on
, but presented a poor F-measure on the weighted subset feature of
. This is again the result of overfitting that high-dimensional data normally cause.
Likewise, when the size of the weighted k features was enlarged to 150, the classifiers showed the lowest F-measure; this reveals that a larger set of weighted features does not better indicate the runtime characteristics of the malicious file.
4.3. Area under the Curve
To measure the performance of the classifiers, we evaluated the Area Under the Curve (AUC), precision and recall on the sub-selected
features.
Table 5 and
Table 6 demonstrate the AUC values, precision and recall of different classifiers on both weighted and unweighted
feature sets. From
Table 5 (A,B), the classifiers on the k_90 and k_120 feature sets achieved the highest AUC values of all the k-feature dimensions, ranging from 0.724 to 0.996, with a lower false-positive rate of 0.016. The lowest AUC values were obtained by the classifiers when the k_30 feature sets were employed. This demonstrates that fewer selected attributes do not sufficiently capture the dynamic characteristics of the malicious file. Moreover, the AUC values of the detection model increased when the size of k was set to 120. This is attributed to the ability of the proposed feature selection technique to pick discriminative features and filter out the irrelevant and redundant features.
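As a reminder of what the AUC measures, the following sketch computes it directly from classifier scores as the probability that a randomly chosen ransomware sample receives a higher score than a randomly chosen benign sample, with ties counted as one half. The function name and the convention that label 1 denotes ransomware are assumptions of this sketch; the pairwise formulation is chosen for clarity rather than speed.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    # Compare every positive score with every negative score.
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))
```

Because this value depends only on how the samples are ranked, it is unaffected by the class imbalance between ransomware and benign samples.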
On the other hand, we evaluated the AUC values of the classifiers on the weighted k feature dimensions, ranked by their relevance using the TF-IDF-EmRmR. Overall, the AUC values, precision and recall improved significantly compared with the preceding experiment. According to Table 6 (A,B), the highest AUC values, ranging from 0.732 to 0.998, were presented by the classifiers with a low false-positive rate of 0.017 when the k_120 TF-IDF-EmRmR-based features were employed to train the detection model. When we increased the number of selected features to k_150, the classifiers presented the lowest AUC values. This is due to the fact that a large feature dimension does not differentiate the characteristics of the malicious file from those of the benign file. It is interesting to note that the k_30 feature set generated by the TF-IDF-EmRmR has shown fairly better AUC values compared to the EmRmR features.
5. Comparing with Previous Works
In this section, as shown in
Table 7, the proposed method is compared with earlier similar works against the detection performance using different kinds of ransomware. The aim is to present the effectiveness of the detection model of the supervised classifiers based on the performance criteria. Vinod et al. [
32] proposed a dynamic-based method for detection of the obfuscated malicious files. This approach applied a hybrid method, such as Principal Component Analysis (PCA) and Minimum Redundancy and Maximum Relevance (mRmR) filter with various mnemonic
n–grams, to produce informative dynamic behaviors. The generated features were then fed into several classification algorithms, which showed a detection accuracy of 94.1% with a low false-positive rate. Similar work has been conducted by Iglesias and Zseby [
24] who used a combined feature reduction approach for the network traffic. They employed
SAM, LASSO, WMR, and mRmR selection techniques to reduce 41 traffic features into 16 predominant features. In their work, the classification algorithms presented an accuracy (0.27–95.48) with mRmR-generated features. This is attributed to the ability of the proposed feature selection technique to make the appropriate trade-off between relevancy and redundancy. Thus, the proposed WEmRmR can select the most distinctive features while maintaining the low redundancy between candidate and already-selected features. This diversifies the detection model’s input, which helps look into data from many perspectives and understand the data more precisely.