Malware Detection Using Machine Learning

Volume 9, Issue 4, April – 2024 International Journal of Innovative Science and Research Technology
ISSN No:-2456-2165 https://doi.org/10.38124/ijisrt/IJISRT24APR1102
Malware Detection using Machine Learning

Dilip Dalgade1 (Professor), Srushti Patyane2, Anushka Matey3, Saloni Singh4, Amey Godbole5
Department of Computer Engineering
Rajiv Gandhi Institute of Technology,
Mumbai, India
Abstract:- As malware and viruses are on the rise, effective detection systems are crucial. Malware is a modern-day threat that has troubled major companies worldwide. This article explores in depth two powerful machine learning tools for the detection of malware, Random Forest and Support Vector Machines. Our study showed that Random Forest can reach a detection accuracy of 98% on a dataset of various malware samples. The feature selection process and the model improvements that we adopted substantially improved our approach to malware detection, which is highly relevant for organizations fighting evolving cyber threats. The results of the present research support ongoing efforts to strengthen cybersecurity, providing valuable information for proactive defense mechanisms against malicious software and attacks.

Keywords:- Malware, Machine Learning, Random Forest, Support Vector Machines (SVM), Detection Accuracy, Cybersecurity, Feature Selection, Model Optimization.

I. INTRODUCTION

In recent years, malware attacks have become a major problem that has slowed down cybersecurity efforts and even disrupted the operations of organizations. Crimeware keeps evolving as technologies grow more sophisticated, and conventional signature-based detection methods are therefore of little use against it. Novel techniques are increasingly being recognized as the most effective way to recognize these threats and neutralize them. Machine learning algorithms have recently shown great potential as a weapon in the fight against malware. Through the use of computational algorithms and big data analysis, machine learning is instrumental in developing preventive defense mechanisms able to spot and stop malware variants in real time. This paper compares two machine learning algorithms that are known to be among the best at malware detection, namely Random Forest and Support Vector Machines (SVM). Random Forest is an ensemble method that builds many decision trees in order to boost the precision of the classifier. SVM, on the other hand, is an efficient supervised learning method that specializes in recognizing patterns in complex datasets. The comparison is carried out on a structured dataset containing samples of malware.

Furthermore, the study examines how crucial factors such as feature selection and model tuning affect the algorithms' performance. By meeting these goals, the research contributes to efforts to strengthen cybersecurity safeguards and minimize the risks that come with evolving cyber threats. The results, which are practical in nature, can help cybersecurity specialists build strong detection frameworks and secure networks against malware attacks.

II. LITERATURE SURVEY

Tamás et al. [1] concentrated on the use of malware hashes for effective detection. They relied on both behavioral analysis and pattern recognition as the foundations of their technology, generating automatic alerts in a manner that allows preventive threat mitigation. While the results demonstrated the effectiveness of their approach, the study faced challenges related to false positives and false negatives that could affect the accuracy of the detection outcome. Even so, their system was highly accurate.
IJISRT24APR1102 www.ijisrt.com 1949

SIMBIoTA achieved a true positive detection rate of 97-98% alongside other capabilities, and the results proved its competence in discriminating malicious software threats.

S. Agarkar et al. [2] showed that a combination of machine learning methods, namely Light GBM, Decision Tree, and Random Forest, produced more robust and efficient results. Their methodology placed special emphasis on reducing false negatives and on improving security. The researchers also began experimenting with larger datasets and additional features to improve the discriminability of their classifiers. The accuracy rates they obtained were impressive, with Decision Tree yielding 99.14%, Random Forest 99.47%, and Light GBM 99.50%, confirming that their strategy helps the system recognize rogue software file types while attaining high accuracy.

S. A. Roseline et al. [3] developed a hybrid multilayered random forest ensemble technique for the detection of malware. Their approach sought a trade-off between computation and accuracy, treating efficiency as the basis of the whole process while still achieving good detection performance. However, the limited size of the testing set used in the study may affect how well the results generalize. Despite this, the proposed technique reached an accuracy rate of 98.91% while using only modest computational resources.

K. Sethi et al. [4] applied a set of classical machine learning techniques, including k-NN, SVM and Random Forest, to classify malware into families. They classified the malware under study by determining its characteristics, functions, and exploits. One restriction of their research was that they used their own hand-created data source, which makes it challenging to achieve real diversity and representativeness of modern malware samples. Given this drawback, it is all the more notable that the Decision Tree algorithm performed very well, maintaining a precision rate of 99.11%. This shows the strength of their solution in precisely classifying malware families despite the constraints of the associated dataset.

Similarly, N. A. Anuar et al. [5] explored the capabilities of machine learning classifiers, including Naïve Bayes, Support Vector Machine, Random Forest, and Decision Tree, in identifying malware threats. While their focus was primarily on dynamic analysis, their findings underscored the resilience of SVM in accurately predicting malware behavior, achieving an accuracy score of 95.4%. Despite the inherent limitations of dynamic analysis, the study emphasized the pivotal role of SVM in bolstering malware detection capabilities.

In contrast, P. Priyadarshan et al. [6] adopted a hybrid approach combining dynamic and static analysis techniques, utilizing k-NN, Logistic Regression, and Random Forest algorithms for malware detection. Despite the absence of feature selection and reduction techniques, their approach achieved a high accuracy rate of 99.1% with Random Forest, highlighting the benefits of combining different analysis methods to enhance detection capabilities.

Furthermore, M. Masum et al. [7] conducted a comprehensive assessment of machine learning classifiers, including Decision Tree, Random Forest, Bayesian, Logistic Regression, and Neural Network, to identify the most effective malware identifier. Their findings underscored the superiority of the Random Forest algorithm, which achieved a classification rate of 99%, highlighting the efficacy of ensemble learning techniques in improving detection efficiency.

Moreover, SH Kok et al. [8] conducted a comparative analysis of various ransomware detection techniques, encompassing state-of-the-art methods such as Bayesian Decision Tree, Dimension Reduction, Instance-Based, Clustering, Deep Learning, Ensemble, Neural Network, and Regression. While their study outlined the breadth of techniques available for ransomware detection, detailed information regarding their hybrid algorithm configuration was lacking. Nevertheless, their research contributed to advancing the field of ransomware detection, paving the way for more comprehensive and informative detection strategies.

Lastly, Ham, Hyo-Sik et al. [9] focused on Android malware detection using a Linear Support Vector Machine (SVM) classifier, aiming to ensure reliable Internet of Things (IoT) services. Despite potential biases in the dataset, their findings demonstrated the superior performance of the Linear SVM classifier in accurately identifying malicious software instances, highlighting its effectiveness in enhancing malware detection efficiency, particularly in the context of IoT security.

III. PROPOSED SYSTEM

To accomplish the task of discovering and preventing malware, our system combines feature selection methods with several machine learning methods, namely Random Forest (RF), Support Vector Machine (SVM), and cross validation. These techniques are used to select an informative set of attributes from the data. Low-quality and highly correlated features are detected and eliminated using variance inflation factors (VIF), a popular feature selection technique, which helps ensure the effectiveness of our system.
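The variance inflation factor screening mentioned for the proposed system can be illustrated with a short sketch. The paper does not give its implementation, so everything below is an assumption made for illustration: the function names, the use of NumPy least squares, and the cutoff of 10 (a common rule of thumb, not a value reported by the authors). For each feature j, the sketch regresses that feature on the others and computes VIF_j = 1 / (1 - R_j^2).

```python
import numpy as np


def variance_inflation_factors(X):
    """Return the VIF of each column of X (rows are samples)."""
    X = np.asarray(X, dtype=float)
    vifs = []
    for j in range(X.shape[1]):
        y = X[:, j]
        ss_tot = float(((y - y.mean()) ** 2).sum())
        if ss_tot == 0.0:
            # A constant column has no variance to explain; flag it as infinite.
            vifs.append(np.inf)
            continue
        # Regress column j on the remaining columns plus an intercept.
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(len(y)), others])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        residual = y - A @ coef
        r_squared = 1.0 - float(residual @ residual) / ss_tot
        vifs.append(np.inf if r_squared >= 1.0 else 1.0 / (1.0 - r_squared))
    return np.array(vifs)


def drop_high_vif(X, threshold=10.0):
    """Repeatedly drop the column with the largest VIF above `threshold`.

    Returns the indices of the columns that are kept. The threshold is an
    illustrative assumption, not a value taken from the paper.
    """
    X = np.asarray(X, dtype=float)
    keep = list(range(X.shape[1]))
    while len(keep) > 1:
        vifs = variance_inflation_factors(X[:, keep])
        worst = int(np.argmax(vifs))
        if vifs[worst] <= threshold:
            break
        keep.pop(worst)
    return keep


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    a = rng.normal(size=200)
    b = rng.normal(size=200)
    c = a + b + 0.01 * rng.normal(size=200)  # nearly a linear mix of a and b
    features = np.column_stack([a, b, c])
    print("VIFs:", variance_inflation_factors(features))
    print("kept columns:", drop_high_vif(features))
```

Dropping one feature at a time and recomputing is a deliberate choice here: removing a single collinear column often brings the remaining VIFs back down, so removing every high-VIF column at once could discard more features than necessary.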

Fig 1 System Architecture

• For the proposed system, we adapt the steps outlined in the recommended approach for malware detection:

• Data Collection:
At the initial stage, we gather a dataset composed of both ransomware samples and legitimate software files. The dataset must be large and varied enough for the machine learning model to draw accurate conclusions. The system relies on a dataset comprising 138,047 labeled instances, in which ransomware samples make up the bulk and the remaining 30% are regular (legitimate) observations.

• Feature Extraction:
With the data collected, features are then extracted and statistical models are developed that can combine historical data with the collected data. The metadata features include file size, file type, entropy and, most significantly, System API calls, which are commonly relevant to both economic and industrial espionage methods. System API calls are the main source of information that reveals software behavior and should be the main criterion the detection system uses to distinguish harmless from malicious software.

• Machine Learning Framework:
The extracted features form the basis of the machine learning framework. The model is built using Random Forest, SVM and cross validation, depending on which technique gives the best performance. It is crucial to set up a balanced dataset containing both clean and malicious patterns, so that dataset bias does not skew the model.

• Evaluation:
Next, an assessment is carried out to measure the accuracy of the model after training. This evaluation may include metrics such as precision, recall, and accuracy. These indicators show how well the model separates ransomware from legitimate software.

• Implementation:
Once the model has been trained and evaluated, it is deployed in a real-time environment for malware detection. Round-the-clock monitoring is necessary to detect suspicious activity that may indicate malware. When such activity is found, the behavior considered an IT security violation can be reported to users or system administrators so that the threat can be remediated.

• Dataset Splitting:
Feature extraction is carried out first, before the dataset is split into training and testing data. The split ratio is 80-20: 80% of the data is set aside to train the model and 20% is used to test its performance. This ensures that the model learns from a large amount of training data while a subset of unseen data is kept for evaluation, so that overfitting can be detected.

• Algorithm Selection:
The Random Forest and SVM algorithms are selected with regard to the accuracy of the model. In this work, these algorithms show a strong ability to cope with high-dimensional data, which makes them a natural choice for the malware detection task, which is a classification problem.

• Random Forest uses multiple decision trees. Here are some basic formulas and concepts related to decision trees:

• Entropy (H(S)):
Entropy is a measure of impurity or disorder in a dataset. For a binary classification problem (two classes, typically 0 and 1), the entropy of a dataset S is calculated as:

H(S) = -p(0) * log2(p(0)) - p(1) * log2(p(1))

Where:
• p(0) is the proportion of class 0 instances in S.
• p(1) is the proportion of class 1 instances in S.

The goal is to minimize entropy by splitting the dataset into subsets that are as pure as possible.

• Information Gain (IG):
Information gain measures the reduction in entropy achieved by splitting a dataset based on a particular feature. For a feature F and a dataset S, the information gain is calculated as:

IG(S, F) = H(S) - Σ((|S_v| / |S|) * H(S_v))

Where:
• H(S) is the entropy of the original dataset S.
• S_v represents the subset of S when the feature F has value v.
• |S| is the size of the dataset S.

A higher information gain indicates a better feature for splitting the dataset.
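As a worked illustration of the entropy and information gain formulas, the following minimal sketch computes both from plain Python lists. The function names and the toy feature values are assumptions chosen for the example; they are not taken from the paper's dataset or code.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """H(S): sum over classes of -p * log2(p), using observed proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return sum(-(k / n) * log2(k / n) for k in Counter(labels).values())


def information_gain(feature_values, labels):
    """IG(S, F) = H(S) - sum over v of (|S_v| / |S|) * H(S_v)."""
    n = len(labels)
    # Group the labels by the value the feature takes for each sample.
    subsets = {}
    for value, label in zip(feature_values, labels):
        subsets.setdefault(value, []).append(label)
    weighted = sum(len(s) / n * entropy(s) for s in subsets.values())
    return entropy(labels) - weighted


if __name__ == "__main__":
    # Toy data: 1 = malware, 0 = legitimate (hypothetical samples).
    labels = [1, 1, 0, 0]
    separating_feature = ["a", "a", "b", "b"]    # splits the classes cleanly
    uninformative_feature = ["a", "b", "a", "b"]  # mixes the classes evenly
    print(entropy(labels))
    print(information_gain(separating_feature, labels))
    print(information_gain(uninformative_feature, labels))
```

In the toy data, the first feature separates the two classes completely, so its information gain equals the full entropy of the labels, while the second feature leaves each subset as mixed as the original set and gains nothing.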
• Gini Impurity (Gini(S)):
Gini impurity is another measure of impurity used in decision trees. For a dataset S with multiple classes, the Gini impurity is calculated as:

Gini(S) = 1 - Σ(p_i^2)

Where:
• p_i is the proportion of instances belonging to class i in S.

IV. RESULTS

Fig 2 Confusion Matrix for RF
Fig 3 Confusion Matrix for SVM
Fig 4 Confusion Matrix for Cross Validation
Fig 5 Cross Validation Scores Graph

V. SCOPE

The future prospects of implementing Support Vector Machine (SVM), Random Forest, and cross-validation methodologies in cybersecurity present a promising and diverse landscape. These sophisticated algorithms offer heightened capabilities in detecting intricate malware and ransomware threats, thereby ensuring more resilient security measures. Their integration into existing frameworks holds the promise of fortifying defenses against evolving cyber threats, while their scalability and adaptability ensure efficient performance in managing extensive data processing demands. Collaborative research endeavors and educational initiatives are poised to benefit from these techniques, fostering innovation and knowledge dissemination within the cybersecurity domain. In summary, the utilization of SVM, Random Forest, and cross-validation methodologies holds significant potential for advancing cybersecurity resilience and staying ahead of emerging threats.
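For readers who want to reproduce the kind of confusion matrices and cross-validation scores shown in the figures, the sketch below strings together the 80-20 split, Random Forest, SVM, and cross-validation using scikit-learn. It is not the authors' code. The synthetic data from `make_classification` stands in for the malware dataset, and the RBF kernel, feature scaling, 5 folds, and random seeds are all assumptions. Only the three Random Forest parameter values are taken from the paper's conclusion.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the feature table (assumption): about 70% of the
# samples are labeled 1 ("malware") and about 30% are labeled 0 ("legitimate").
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=8,
    weights=[0.3, 0.7], random_state=0,
)

# 80-20 split, as in the Dataset Splitting step.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0,
)

models = {
    # Parameter values as reported in the paper's conclusion.
    "Random Forest": RandomForestClassifier(
        n_estimators=50, max_depth=20, min_samples_split=5, random_state=0,
    ),
    # Kernel and scaling are assumptions; the paper does not specify them.
    "SVM": make_pipeline(StandardScaler(), SVC(kernel="rbf")),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    results[name] = {
        "accuracy": accuracy_score(y_test, y_pred),
        "confusion_matrix": confusion_matrix(y_test, y_pred),
        # 5-fold cross-validation on the full data (fold count is an assumption).
        "cv_scores": cross_val_score(model, X, y, cv=5),
    }
    print(name, "test accuracy:", results[name]["accuracy"])
    print(results[name]["confusion_matrix"])
```

Because the data here is synthetic, the printed accuracies say nothing about the 98.20% figure reported in the paper; the sketch only shows how the pieces fit together.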

VI. CONCLUSION

Our findings reveal that the best set of parameters for our dataset consisted of 'max_depth' = 20, 'min_samples_split' = 5, and 'n_estimators' = 50. Employing these optimized parameters resulted in a significantly improved accuracy of 98.20% on our classification task. This highlights the importance of parameter tuning in maximizing the performance of Random Forest models. Furthermore, we discussed the implications of our findings and emphasized the significance of hyperparameter optimization in machine learning model development. By fine-tuning the parameters of Random Forest, practitioners can achieve superior performance in classification tasks across various domains.

In conclusion, this research underscores the effectiveness of optimizing Random Forest parameters for enhancing classification accuracy. Our results provide valuable insights for researchers and practitioners seeking to leverage Random Forest efficiently in their machine learning applications.

REFERENCES

[1]. Tamás, Csongor, Dorottya Papp and Levente Buttyán. “SIMBIoTA: Similarity-based Malware Detection on IoT Devices.” International Conference on Internet of Things, Big Data and Security, doi:10.5220/0010441500580069
[2]. S. Agarkar and S. Ghosh, "Malware Detection & Classification using Machine Learning," 2020 IEEE International Symposium on Sustainable Energy, Signal Processing and Cyber Security (iSSSC), Gunupur Odisha, India, 2020, pp. 1-6, doi: 10.1109/iSSSC50941.2020.9358835.
[3]. S. A. Roseline, A. D. Sasisri, S. Geetha and C. Balasubramanian, "Towards Efficient Malware Detection and Classification using Multilayered Random Forest Ensemble Technique," 2019 International Carnahan Conference on Security Technology (ICCST), Chennai, India, 2019, pp. 1-6, doi: 10.1109/CCST.2019.8888406.
[4]. K. Sethi, R. Kumar, L. Sethi, P. Bera and P. K. Patra, "A Novel Machine Learning Based Malware Detection and Classification Framework," 2019 International Conference on Cyber Security and Protection of Digital Services (Cyber Security), Oxford, UK, 2019, pp. 1-4, doi: 10.1109/CyberSecPODS.2019.8885196.
[5]. N. A. Anuar, M. Zaki Mas’ud, N. Bahaman and N. A. Mat Ariff, "Analysis of Machine Learning Classifier in Android Malware Detection Through Opcode," 2020 IEEE Conference on Application, Information and Network Security (AINS), Kota Kinabalu, Malaysia, 2020, pp. 7-11, doi: 10.1109/AINS50155.2020.9315060.
[6]. P. Priyadarshan, P. Sarangi, A. Rath and G. Panda, "Machine Learning Based Improved Malware Detection Schemes," 2021 11th International Conference on Cloud Computing, Data Science & Engineering (Confluence), Noida, India, 2021, pp. 925-931, doi: 10.1109/Confluence51648.2021.9377123.
[7]. M. Masum, M. J. Hossain Faruk, H. Shahriar, K. Qian, D. Lo and M. I. Adnan, "Ransomware Classification and Detection with Machine Learning Algorithms," 2022 IEEE 12th Annual Computing and Communication Workshop and Conference (CCWC), Las Vegas, NV, USA, 2022, pp. 0316-0322, doi: 10.1109/CCWC54503.2022.9720869.
[8]. SH Kok, Azween Abdullah, NZ Jhanjhi and Mahadevan Supramaniam, "Ransomware, Threat and Detection Techniques: A Review" 2019 International Journal of Computer Science and Network Security, VOL.19 No.2, February 2019
[9]. Ham, Hyo-Sik & Kim, Hwan-Hee & Kim, Myung-Sup & Choi, Mi-Jung. (2014). Linear SVM-Based Android Malware Detection for Reliable IoT Services. Journal of Applied Mathematics. 2014. 1-10. 10.1155/2014/594501.
