Ransomware Detection and Classification Using Ensemble Learning: A Random Forest Tree Approach
Abstract—Viruses significantly threaten computer systems, potentially causing extensive damage
and data loss. All users must prioritize cyber security by installing effective antivirus
software to safeguard their PCs against potential harm. Although there are many different kinds
of malware, ransomware is particularly dangerous because it prevents victims from accessing
their vital data or locks files permanently unless they pay a ransom to the attackers. Recent
ransomware strains must be categorized promptly. Data for the present investigation was
gathered from a variety of web resources, including Kaggle and ransomware.re: Kaggle was used
to acquire benign samples, and ransomware.re was used to retrieve ransomware samples. Several
preprocessing methods, such as normalisation and imputation, are used to clean our datasets.
The most recent additions to the dataset were classified using the Random Forest Tree
classifier, with a final accuracy of 99.9%. Random Forest Tree fared exceptionally well
compared to the KNN and SVM algorithms. We also highlight that additional preprocessing
methods can enhance outcomes for SVM and KNN.
Index Terms—ransomware, viruses, Random Forest Tree, Support Vector Machine, K Nearest Neighbors

This work is funded by FCT/MEC through national funds and co-funded by FEDER—PT2020
partnership agreement under the project UIDB/50008/2020.

I. INTRODUCTION
Antivirus software is a basic need of a computer system to protect it from attacks by
viruses. As technology advances, cyber threats are also increasing, requiring more computer
system protection. Ransomware is the latest and improved version of the threat. In this
threat, the victim loses access to his files. This threat is also called crypto-locker because
it locks all the files on your system and demands money to decrypt them again [1]. If you are
a victim of this threat, you have only two choices: pay the money demanded by the hacker, or
erase all data from the drive and reinstall an operating system, e.g., Windows or any other.
But is it simple to lose or erase all your important data?
First, a ransomware .exe file gets downloaded from some unknown resource, a phishing attack,
or a downloadable file sent through email. Once the user downloads it, it appears as a .exe
file; when the user installs it, it takes over the system. It uses different parameters, such
as the name of the computer and information related to the processor, to generate a key that
is unique to each PC; this hashed data is used to uniquely identify victims [2]. Second, it
encrypts files and folders, locks the entire system, or prevents normal usage. Third, it locks
every important file and displays a ransom message with a deadline and the amount the victim
must pay to unlock his files or computer. Targeted files: once this virus enters your
computer, it looks for files with extensions .txt, .doc, .rtf, .chm, .ppt, .cpp, .db, .zip,
.jpg, .mdb, .asm, .key, .pdf, .pgp, etc. [3]. Encryption method: when the target files are
found, it encrypts them with the RSA + AES algorithm to prevent the owner from accessing them
without paying the attacker. Ransomware limits a user’s ability to use their computer: it
encrypts all files and folders on a
laptop or locks the whole computer with a password, and the attacker demands money in exchange
for the key that decrypts the data or the password that restores full access to the computer.
It is a widely spreading attack. Every 13 seconds, a computer falls victim to a ransomware
attack, and different companies suffer million-dollar losses [4] due to this threat. This
research investigates the previously developed techniques to detect ransomware attacks and how
people or businesses can protect their data and privacy to avoid them. Moreover, the proposed
model is able to detect and classify ransomware families by using an ensemble learning approach
called Random Forest Tree. Results are compared with the other three state-of-the-art
classification models to validate the proposed model’s accuracy, precision, recall, and
f1-score.

A. Key Contributions of this work are summarized below
1) Preparing the data for the impending ransomware attack.
2) Normalization of the dataset to prepare it for the model to classify ransomware better.
3) Implementation of the proposed model for the classification of ransomware and comparison of
results with other classification models to validate the performance of the proposed model.

The rest of the paper proceeds as follows: Section II presents state-of-the-art (related)
work. In Section III, the proposed model is discussed. Section IV presents the results obtained
from this research, and finally, Section V concludes this work and presents future research
directions.

II. RELATED WORK
J. Hwang et al. [5] presented a two-stage mixed method for detecting ransomware. The proposed
method combined a Markov model and a Random Forest model to capture the characteristics of
ransomware. The first stage focused on Windows API call sequence patterns and used a Markov
model to extract the features of ransomware. The second stage used a Random Forest machine
learning model on the remaining data to control false positive and false negative error rates.
The authors reported that the method achieved an overall accuracy of 97.3%, with a false
positive rate of 4.8% and a false negative rate of 1.5%. This approach was presented as a
promising solution that could improve current methods of ransomware detection and could be
developed further for practical use.
Arvind Mahindru and A. L. Sangal [6] presented a framework named MLDroid for identifying
Android malware using machine learning methods. The authors contend that, due to malware
authors’ increased ability to elude detection by changing their programs, existing
signature-based malware detection techniques are losing effectiveness. To solve this issue,
the authors suggested a model that detects malware using machine learning methods. They
gathered a dataset of good and bad Android applications to train and test multiple
machine-learning models for malware detection. They also suggested using approaches such as
permission-based feature extraction and opcode n-grams to increase the detection rate.
According to the researchers, their system used various machine learning models to obtain high
detection rates of almost 97%. The framework also recorded minimal false positive rates,
showing it can successfully differentiate between legitimate and malicious apps. The authors
also discussed the framework’s shortcomings and upcoming work to increase accuracy. In
conclusion, the authors offer a viable method for identifying Android malware using machine
learning techniques. The quality of the data used to train the model and the reliability of
the feature extraction technique will impact how well any machine learning-based malware
detection system performs.
H. Rathore et al. [7] concentrated on creating a malware detection system that effectively
utilizes Machine Learning and Deep Learning techniques, because conventional signature-based
approaches can no longer keep up with quickly changing malware. The research’s technique
involved training and evaluating multiple Machine Learning and Deep Learning models on a
dataset of good and bad Windows executable files. In the authors’ new feature extraction
technique, the assembly code of the executable file is used as input to the models. The
study’s findings revealed that, with an average detection rate of 97.32 percent, deep learning
models outperformed machine learning models in terms of detection rate. However, this
strategy’s drawback was the high computational cost and memory requirement of deep learning
models, which the authors suggested as a potential area of improvement in future work.
F. Khan et al. [8] presented a novel method for identifying ransomware utilizing digital DNA
sequencing and machine learning methods. According to the authors, there is a need for a more
sophisticated strategy because conventional signature-based ransomware detection methods are
insufficient to identify new strains of ransomware. The authors put forth a new feature
extraction technique that extracts features from ransomware samples using digital DNA
sequencing and applies a machine learning-based classifier to determine whether the samples
are malicious or benign. They gathered a dataset of benign and ransomware files and utilized
it to test and refine their suggested methodology. According to the authors, their approach
had minimal false positive rates and high detection rates of around 98 percent. They admitted,
nevertheless, that the dataset they utilized was small and not representative of the entire
ransomware ecosystem. They also suggested that future work should include a much more diverse
and extensive dataset to increase the method’s robustness.
H. Zhang et al. [9] proposed N-grams of opcodes as a new machine-learning technique for
categorizing various ransomware families. The authors want to solve the issue of correctly
recognizing and classifying different kinds of ransomware. The study takes ransomware samples
and extracts the opcodes, representing them as N-grams. The samples are then classified using
machine learning models that have been trained on these N-grams. The study’s findings show that
the suggested method can classify several ransomware families with an accuracy of up to 97.44
percent. However, the authors also acknowledge the modest size of the dataset employed and the
need for additional research to boost performance. The authors recommend evaluating the
strategy further and enlarging the dataset in future work.
Udayakumar N. et al. [10] presented a study on classifying malware samples using machine
learning methods. The paper investigated the application of several machine-learning
techniques for precisely classifying and identifying malware samples. As part of the study’s
approach, features were extracted from malware samples, and these features were used to train
a variety of machine learning algorithms, including Support Vector Machines, Random Forests,
and Neural Networks. The algorithms’ performance was then assessed using various criteria,
including accuracy, precision, recall, and F1-score. According to the study’s findings, when
compared to the SVM and neural network algorithms, Random Forest had the best accuracy, at
98.5%. However, the study has significant drawbacks, including the small dataset size and lack
of coverage of all malware strains.
Sreelaja N.K. [11] presented a study using the Ant Colony Optimisation (ACO) algorithm to
boost signature matching’s effectiveness in filtering ransomware. The research suggested a
novel strategy for leveraging the ACO algorithm to enhance the effectiveness of
signature-based ransomware detection. The study’s methodology utilized the ACO algorithm to
speed up the signature-matching step that identifies ransomware. The ACO algorithm determines
which ransomware sample and signature match best. The proposed strategy is then compared to
the conventional binary search method and assessed using a variety of metrics, including false
positive and true positive rates. The study’s findings show that the suggested method
employing the ACO algorithm had a higher true positive rate of 99.5% and a lower false
positive rate of 0.1%. The study does, however, acknowledge significant limitations, including
that it was only tested on a small dataset of well-known ransomware strains and could not
identify unknown variants.
S.H. Kok et al. [12] presented a study on creating measures for evaluating machine
learning-based methods for detecting crypto-ransomware. The paper aims to suggest a new
evaluation metric for gauging the effectiveness of crypto-ransomware detection systems based
on machine learning. The study’s approach formulates a new evaluation metric, based on the
mean average precision (MAP) and the area under the receiver operating characteristic curve
(AUC-ROC), that considers both the detection and false positive rates. After that, the
performance of several machine learning-based crypto-ransomware detection systems is assessed
using the proposed evaluation metric.
The study’s findings suggest that the proposed assessment metric is superior to conventional
metrics like accuracy (86.5%), precision (85.2%), recall (90.7%), and F1-score (87.9%) for
assessing the effectiveness of crypto-ransomware detection systems. The study does, however,
admit significant limitations, including using a small dataset for testing and excluding
specific scenario possibilities. The author advises testing the performance of the suggested
evaluation metric using more challenging and realistic settings and a larger and more varied
dataset.
E. Berrueta et al. [13] offered a method based on file-sharing traffic analysis to find and
stop crypto-ransomware activities. The paper’s primary goal is to address the severe threat
ransomware poses to individuals and businesses, particularly in corporate settings where one
infected computer can lock access to all shared files to which it has access. The suggested
method keeps track of all data sent between clients and file servers, and it uses machine
learning to look for patterns in the data that reveal ransomware operations while reading and
overwriting files. It is the first proposal intended to function with both encrypted
file-sharing protocols and clear-text protocols. The article aims to distinguish ransomware
activity from the high activity of innocuous programs by extracting features from network data
that describe the activity of accessing, closing, and changing files. More than 2,400 hours of
’not infected’ traffic from actual users and more than 70 ransomware files from 33 different
strains were used to train and test the proposed technique.
The methodology utilized in the paper includes employing a network probe to acquire and
analyze network traffic and machine learning methods to examine the captured data. The number
of TCP connections, bytes exchanged, the order of messages between the client and server,
packet sizes, inter-packet timings, inactivity times, connection durations, and combinations
of any of these are among the features utilized in the training and testing of the model. A
neural network with three hidden layers of neurons was found to be the most effective model.
The validation findings demonstrate that the suggested tool can detect all ransomware binaries
listed, even those not utilized during the training phase. With more than 2,400 hours of real
user traffic, the tool has a false positive rate of 0.004%. It can identify all ransomware
binaries used during the training phase in an average of 30.2 seconds. Losing only an average
of 99 MB of user data before discovery, it detects 100% of a batch of 10 crypto-ransomware
binaries not utilized in the training phase. The study also identifies various tool
limitations, such as the tool’s focus on just Microsoft Windows operating systems and its
relevance only to cases where essential files are kept on a file server. The writers mention
that the suggested approach is static and that future work will concentrate on developing
better adaptive training methodologies so that new ransomware strains can be added to the
model and assessed for improvement or decrement in results.

III. METHODOLOGY
This section describes the approach used to investigate the problem of ransomware detection
using the ensemble learning method Random Forest Tree (RFT). We look into the possibility of
using RFT techniques to improve the detection of ransomware attacks in a distributed network.
The proposed method is divided into three phases: data preprocessing, ransomware detection
using RFT, and classification of ransomware families and their variants.
A. Pre-Processing
In the preprocessing phase, data for benign and
ransomware files is collected from different websites. After
collecting the dataset, we converted all the data into an Excel
file and defined its features based on the requirements. After
fine-tuning, the dataset is converted into useful preprocessed
data. Columns include type of file, hash code, benign or
virus, and ransomware family. The dataset consists of 20,000
samples of ransomware and almost 30,000 benign files. We
also divided our dataset into train and test data with a ratio of
80:20.
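
The preprocessing steps named in this paper (imputation, normalisation, and the 80:20 train/test split) can be sketched as follows. This is a minimal illustration and not the authors' pipeline: mean imputation, min-max scaling, the fixed shuffle seed, and all function names are assumptions, since the paper does not state which variants were used.

```python
import random

def impute_mean(rows):
    """Replace None entries in each numeric column with that column's mean (assumed strategy)."""
    n_cols = len(rows[0])
    means = []
    for c in range(n_cols):
        present = [r[c] for r in rows if r[c] is not None]
        means.append(sum(present) / len(present) if present else 0.0)
    return [[means[c] if r[c] is None else r[c] for c in range(n_cols)] for r in rows]

def min_max_normalize(rows):
    """Scale each column to [0, 1] (assumed strategy); constant columns map to 0.0."""
    n_cols = len(rows[0])
    lows = [min(r[c] for r in rows) for c in range(n_cols)]
    highs = [max(r[c] for r in rows) for c in range(n_cols)]
    return [
        [(r[c] - lows[c]) / (highs[c] - lows[c]) if highs[c] > lows[c] else 0.0
         for c in range(n_cols)]
        for r in rows
    ]

def train_test_split_80_20(rows, labels, seed=0):
    """Shuffle the sample indices reproducibly and split them 80:20 into train and test."""
    idx = list(range(len(rows)))
    random.Random(seed).shuffle(idx)
    cut = len(idx) * 4 // 5  # first 80% of the shuffled indices go to training
    train, test = idx[:cut], idx[cut:]
    return ([rows[i] for i in train], [labels[i] for i in train],
            [rows[i] for i in test], [labels[i] for i in test])
```

For simplicity the sketch scales the whole table at once; in practice the scaling statistics are usually computed on the training split only, so that no information from the test data leaks into training.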
$\ldots,\ \mathrm{Count}(f_k(X)))$ (2)
IV. RESULTS AND DISCUSSION
We have used the argMax() function to find the class label
index with the highest count, which will be our predicted class.
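
The majority-vote step described here can be sketched as follows: each tree in the forest predicts a class label for the input X, the votes are counted per class, and an argmax over the counts gives the predicted class. The tree stand-ins (plain callables), the name `predict_majority`, and the tie-breaking rule (lowest class index wins) are assumptions made for illustration and are not taken from the paper.

```python
def predict_majority(trees, x, n_classes):
    """Return the class index with the highest vote count across all trees."""
    counts = [0] * n_classes
    for tree in trees:
        counts[tree(x)] += 1  # one vote per tree for the class it predicts
    # argmax over the per-class counts; ties resolve to the lowest class index
    return max(range(n_classes), key=lambda c: counts[c])

# Hypothetical stand-ins for trained decision trees:
# each maps a feature vector to a class index (0 = benign, 1 = ransomware).
toy_trees = [
    lambda x: 1 if x[0] > 0.5 else 0,
    lambda x: 1 if x[1] > 0.5 else 0,
    lambda x: 1 if x[0] + x[1] > 0.8 else 0,
]
```

With these toy trees, the input `[0.9, 0.1]` receives two votes for class 1 and one vote for class 0, so the function returns class 1.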