Malware Detection Using ML
ABSTRACT
Cyberattacks and the use of malware are increasingly widespread. Targets range from states to
publicly traded companies. Malware analysis has become an essential activity in the
management of computer security incidents. Organisations are often faced with suspicious
files captured through their antivirus and security monitoring systems, or during forensic
analysis. Most solutions filter out suspicious files through multiple tactics that correlate
static and dynamic techniques in order to detect malware. However, these mechanisms have
many practical limitations, which has given rise to a new research track. Our demonstration
results illustrate that malware can be analyzed with several machine learning (ML)
algorithms, and compare those algorithms.
Chapter 1
INTRODUCTION
The aim of this project is to study the use of machine learning algorithms to analyze malware
and to show how data science is used to detect it, including training systems to find attack
campaigns. This study shows that many models can be employed and their detection ability
evaluated.
The report describes the state of the art in machine learning malware detection for Android,
covering the supervised, unsupervised, and deep learning techniques that have been used to
identify malware. It discusses the difficulties and limitations of the methods currently in
use as well as the opportunities for improvement, and offers some insights into how machine
learning can be used to improve Android malware detection.
Malware analysis (Afianian et al., 2018) is the discipline of studying malware (virus,
worm, Trojan horse, rootkit, backdoor, APT, etc.) to determine the potential impact of an
infection (Filiol, 2006). Malware analysis is divided into two parts: static analysis and
dynamic analysis (Sikorski and Honig, 2012)(Ligh et al., 2010). In-depth analyses rely on a
mix of both.
Malware has disrupted several industries and nations in recent years (Moubarak et al.,
2017). New techniques are leveraged to enable more sophisticated behaviors and greater stealth
(Saad et al., 2019)(Moubarak et al., 2018)(Moubarak et al., 2019). A new breed of malicious
software can leverage artificial intelligence (AI) to conceal its payload and trigger it only
when machine learning algorithms identify the target using patterns related to face and voice
recognition combined with geolocation (Stoecklin, 2018).
The promise of machine learning (ML) in detecting malware lies in learning the features of
malicious software in order to differentiate between benign and malicious binaries.
Several steps are needed for that purpose: malicious and benign binaries are collected, and
malware-specific features are extracted (Saxe and Sanders, 2018) in order to build an
appropriate inference model.
Multiple studies in this field have analyzed malware based on APIs
(Fan et al., 2015), system calls (Nikolopoulos and Polenakis, 2017) and network inspection
(Boukhtouta et al., 2016), and have addressed Android malware detection (Wu et al., 2016). In
this paper, several ML algorithms are tested and used to analyze input PE (Portable
Executable) files to establish whether they are malicious or harmless. The datasets were
tested on several models including Random Forest, Logistic Regression, Naive Bayes, Support
Vector Machines, K-nearest neighbors and Neural Networks. Finally, multiple tests are run on
real data to measure the accuracy of the models.
Android malware detection is a critical aspect of mobile security. As the Android ecosystem
continues to grow, so does the number of malicious applications that can compromise user
data and device integrity. To address this, various machine learning algorithms have been
employed to detect and classify malware effectively. This study aims to compare the
performance of different machine learning algorithms in detecting Android malware. The
algorithms under consideration are Support Vector Machine (SVM), Neural Networks (NN),
Support Vector Machine with Genetic Algorithm (SVM-GA), and Neural Networks with Genetic
Algorithm (NN-GA).
Chapter 2
SYSTEM ANALYSIS
Proposed System
With technology advancing at a fast pace, the digital world faces alarming security
threats and challenges in the form of malware capable of bringing down organizations and
governments. Countermeasures have grown stronger, with antivirus companies regularly updating
and enlarging their signature databases, but these measures are not fully efficient and fail
against polymorphic malware. In this project we present an alternative approach for
detecting malicious files using machine learning algorithms, namely Support Vector Machine,
Neural Networks, Support Vector Machine with Genetic Algorithm, and Neural Networks with
Genetic Algorithm, and compare their results to determine the most suitable algorithm for our
dataset.
Existing System
Several conventional and well-known systems are utilized for Android malware detection,
employing a range of techniques from signature-based detection to behavior analysis and
machine learning. Among these, VirusTotal stands out as a widely-used online service that
analyzes files and URLs for malware. It aggregates results from various antivirus engines and
provides a comprehensive report on potential threats. For Android malware detection,
VirusTotal scans APK files and cross-references them with multiple antivirus databases,
offering a swift method to identify known threats based on signatures and heuristic rules.
Another prominent tool is AppScan by IBM, which includes static and dynamic analysis for
mobile applications. AppScan helps identify vulnerabilities and potential threats in Android
applications through source code analysis, binary analysis, and behavior assessment. This
multifaceted approach aids in detecting security issues and malware in apps.
Malwarebytes is a popular antivirus and anti-malware tool designed for Android devices. It
employs both signature-based detection and heuristic analysis to identify and remove
malware. Malwarebytes scans installed apps, files, and URLs to protect against various
threats including trojans, adware, and spyware, ensuring robust security for Android users.
Dr.Web Anti-virus is another established tool that combines signature-based methods with
heuristic analysis to detect malware on Android devices. It features real-time scanning of
apps and files, and includes additional functions like anti-spam and anti-theft protection to
safeguard against a broad spectrum of threats.
Trend Micro Mobile Security provides protection through a mix of signature-based detection,
behavior-based analysis, and machine learning techniques. The tool includes features such as
app privacy scans, safe browsing, and ransomware protection, which contribute to a
comprehensive defense against malware and other security threats.
Overall, these conventional tools and systems for Android malware detection typically utilize
a blend of signature-based methods, heuristic and behavior analysis, and cloud-based
intelligence. They offer various features such as real-time scanning, vulnerability detection,
and app analysis to effectively safeguard Android devices against a wide range of malware
and security threats.
2.2 Functional Requirements
Historically, the beginnings of AI date back to Alan Turing in the 1950s (Moor, 2003). In the
popular imagination, artificial intelligence is a program that can perform human tasks,
learning by itself. However, AI as defined in industry is rather a set of more or less
advanced algorithms that imitate human actions. Subfields of AI include ML, NLP (Natural
Language Processing), Planning, Vision and Robotics. ML is the subfield of artificial
intelligence that focuses on creating machines that behave and operate intelligently or
simulate that intelligence. ML is very effective in situations where insights must be
discovered from large and diverse datasets. ML algorithms are grouped into several major
classes, which correspond to different types of learning (Russell and Norvig, 2016). Three of
them are described here:
Supervised learning: the algorithm is given a certain number of examples (inputs) to learn
from, and these examples are labeled, that is, each is associated with a desired result
(output). The algorithm's task is then to find the rule that maps the inputs to the
outputs. The aim is to estimate the best function f(x) able to connect the input (x) to the
output (y). Through supervised learning, two major types of problems can be solved:
classification problems and regression problems.
Unsupervised learning: no label is provided to the algorithm, which discovers the
characteristic structure of the input without human assistance. The algorithm builds its own
representation, which a human may have difficulty understanding. Common patterns are
identified in order to form homogeneous groups from the observations. Unsupervised
learning also splits into two subcategories: clustering and association. The idea behind
clustering is to find similarities within the data in order to form clusters. Slightly
differently, association algorithms look for rules in the data. These rules can
take the form of "If conditions X and Y are met then event Z may occur".
Transfer learning: this type of learning optimizes and improves a learning model that is
already in place. The idea is to apply the knowledge acquired on one task to a second,
related task.
Several ML algorithms are incorporated depending on their relevance. This study focuses on
the following algorithms:
Random Forest: This algorithm belongs to the family of model aggregation methods; it is a
special case of bagging (bootstrap aggregating). In addition, random forests add randomness at
the variable level. For each tree, a bootstrap sample is selected, and at each stage the
construction of a node of the tree is done on a randomly drawn subset of variables.
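The two sources of randomness described above can be sketched as follows. This is a minimal, dependency-free illustration and not the project's implementation; the function names and the toy records are hypothetical.

```python
import random

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement (the bagging step)."""
    return [rng.choice(rows) for _ in rows]

def random_feature_subset(n_features, k, rng):
    """Pick k distinct feature indices to consider when splitting one node."""
    return sorted(rng.sample(range(n_features), k))

rng = random.Random(0)  # fixed seed so the sketch is repeatable
rows = [(0.1, 3, 1), (0.7, 1, 0), (0.4, 8, 1), (0.9, 2, 0)]  # hypothetical toy records
sample = bootstrap_sample(rows, rng)                  # one tree's training sample
features = random_feature_subset(n_features=3, k=2, rng=rng)  # one node's candidates
print(sample, features)
```

Each tree of the forest would receive its own bootstrap sample, and each node its own feature subset, before the individual tree predictions are aggregated.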
Logistic Regression: It is a supervised classification algorithm in which the target variable
(Y) takes only two possible values (negative or positive). This algorithm measures the
association between the occurrence of an event and the factors likely to influence it.
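As a hedged illustration of the idea, the sketch below fits a one-feature logistic regression by plain gradient descent on the log loss. The data, learning rate and epoch count are assumptions chosen for the example; the project itself relied on sklearn rather than on code like this.

```python
import math

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(xs, ys, lr=0.3, epochs=5000):
    """Fit weight w and bias b for one feature by gradient descent on log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        w -= lr * sum(e * x for e, x in zip(errors, xs)) / n
        b -= lr * sum(errors) / n
    return w, b

# Hypothetical toy data: small feature values are negative (0), large are positive (1).
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [0, 0, 0, 1, 1, 1]
w, b = train_logistic(xs, ys)

def predict(x):
    """Return 1 (positive) when the estimated probability is at least 0.5."""
    return 1 if sigmoid(w * x + b) >= 0.5 else 0

print([predict(x) for x in xs])
```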
Naive Bayes: The naive Bayes classification method is a supervised machine learning
algorithm that classifies a set of observations according to rules determined by the algorithm
itself. This classification tool must first be trained on a set of learning data that shows the
class expected for each input. The method is based on Bayes' theorem and conditional
probabilities. It assumes that the descriptors (Xi) are pairwise independent, conditionally on
the value of the variable to predict (Y).
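Under the conditional independence assumption stated above, the posterior probability of a class factorizes, which gives the usual naive Bayes decision rule. The notation (n descriptors, predicted class written as y-hat) is introduced here for illustration:

```latex
P(Y = y \mid X_1, \dots, X_n) \;\propto\; P(Y = y) \prod_{i=1}^{n} P(X_i \mid Y = y),
\qquad
\hat{y} = \arg\max_{y} \; P(Y = y) \prod_{i=1}^{n} P(X_i \mid Y = y)
```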
K-nearest neighbors (KNN): This algorithm is a supervised learning method that can be used
for both regression and classification. To make a prediction, the KNN algorithm relies on the
entire dataset. For an observation that is not part of the dataset, the algorithm looks for
the K instances of the dataset closest to the observation. Then, the algorithm uses the
output variables (y) of these K neighbors to calculate the value of the variable (Y) of the
observation that needs to be predicted.
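A minimal sketch of the classification case follows, assuming Euclidean distance and a majority vote among the K neighbors. Both choices, as well as the toy points and labels, are assumptions made for the example.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    by_distance = sorted(
        zip(train_X, train_y),
        key=lambda pair: math.dist(pair[0], query),  # Euclidean distance
    )
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical two-feature observations forming two well-separated groups.
train_X = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
train_y = ["benign", "benign", "benign", "malware", "malware", "malware"]

print(knn_predict(train_X, train_y, (1.1, 1.0)))
print(knn_predict(train_X, train_y, (5.1, 5.0)))
```

For regression, the vote would be replaced by an average of the neighbors' output values.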
Neural Networks: A neural network is inspired by how the human brain learns. It is
based on a large number of processing units operating in parallel and organized in layers. The
first layer receives raw information inputs, much like the human optic nerves when dealing
with visual cues. Subsequently, each layer receives the information output by the previous
layer. The same process is found in humans when neurons receive signals from neurons
close to the optic nerve. The last layer, on the other hand, produces the results of the
system. In general, neural networks are categorized by the number of layers that separate the
data input from the output of the result, by the number of hidden nodes in the model, or by
the number of inputs and outputs of each node.
4.1.2 Result Analysis
This section describes how machine learning algorithms are applied to detect malware. The
evaluation of the algorithms considered multiple malware features including PE headers,
instructions, calls, strings, compression and the Import Address Table. The implementation
was based on Python and sklearn.
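As an illustration of the PE header features mentioned above, the sketch below reads a few fields of the COFF file header using only the standard library. The text does not say which tool was used for extraction, so this is an assumption-laden example: `parse_pe_header` and the dictionary keys are hypothetical names, and the sample bytes are a synthetic header built in memory rather than a real executable.

```python
import struct

def parse_pe_header(data):
    """Return a few COFF file-header fields from the raw bytes of a PE file."""
    if data[:2] != b"MZ":
        raise ValueError("missing DOS 'MZ' signature")
    # Offset 0x3C of the DOS header stores the file offset of the PE signature.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if data[pe_offset:pe_offset + 4] != b"PE\x00\x00":
        raise ValueError("missing PE signature")
    # The 20-byte COFF file header follows the 4-byte signature.
    (machine, n_sections, timestamp, _sym_ptr, _n_syms,
     opt_header_size, characteristics) = struct.unpack_from(
        "<HHIIIHH", data, pe_offset + 4)
    return {
        "machine": machine,
        "number_of_sections": n_sections,
        "timestamp": timestamp,
        "size_of_optional_header": opt_header_size,
        "characteristics": characteristics,
    }

# Synthetic header for demonstration only: 64-byte DOS stub, signature, COFF header.
dos_header = bytearray(64)
dos_header[:2] = b"MZ"
struct.pack_into("<I", dos_header, 0x3C, 64)
coff_header = struct.pack("<HHIIIHH", 0x14C, 3, 0, 0, 0, 224, 0x0102)
sample = bytes(dos_header) + b"PE\x00\x00" + coff_header

features = parse_pe_header(sample)
print(features)
```

Numeric fields such as these can be placed directly in a feature vector, alongside features derived from strings or imports.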
Support Vector Machine Classifier: The support vector machine classifier draws a
hyperplane that separates malware from benignware in the training dataset. Whether a file is
judged clean or suspicious depends on which side of the hyperplane it falls (Chebbi, 2018).
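The decision step can be sketched as follows, assuming a linear kernel so that the hyperplane is defined by a weight vector and a bias. The weights and feature values are hypothetical and are not learned from the project's data.

```python
def svm_decision(weights, bias, features):
    """Return the label and signed score of a sample relative to the hyperplane."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    label = "malware" if score >= 0 else "benign"
    return label, score

# Hypothetical hyperplane over two features.
weights = [0.8, -0.5]
bias = -0.2

print(svm_decision(weights, bias, [1.0, 0.2]))  # positive side of the hyperplane
print(svm_decision(weights, bias, [0.1, 0.9]))  # negative side of the hyperplane
```

The sign of the score gives the class, and its magnitude indicates how far the sample lies from the boundary.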
The neural network consists of an input layer, a middle layer and an output layer that
generates the final result. The middle layer is formed of 512 neurons that use ReLU
(Rectified Linear Unit) (Saxe and Sanders, 2018) as the activation function. The last layer
uses a sigmoid function (Gan et al., 2015) and comprises one neuron.
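A forward pass through such a network can be sketched without any framework. The 512 ReLU units and the single sigmoid output follow the description above; the input size of 8 and the random, untrained weights are assumptions made only so that the example runs.

```python
import math
import random

def relu(z):
    """Return z when it is positive, otherwise zero."""
    return z if z > 0 else 0.0

def sigmoid(z):
    """Squash a real-valued score into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, hidden_w, hidden_b, out_w, out_b):
    """Dense ReLU layer followed by one sigmoid output neuron."""
    hidden = [relu(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(hidden_w, hidden_b)]
    return sigmoid(sum(w * h for w, h in zip(out_w, hidden)) + out_b)

rng = random.Random(0)
n_inputs, n_hidden = 8, 512  # 512 as in the text; 8 inputs is an assumption
hidden_w = [[rng.uniform(-0.1, 0.1) for _ in range(n_inputs)] for _ in range(n_hidden)]
hidden_b = [0.0] * n_hidden
out_w = [rng.uniform(-0.1, 0.1) for _ in range(n_hidden)]
out_b = 0.0

x = [rng.random() for _ in range(n_inputs)]  # hypothetical feature vector
score = forward(x, hidden_w, hidden_b, out_w, out_b)
print(score)
```

The output can be read as the estimated probability that the input is malicious; training would adjust the weights so that this probability matches the labels.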
Each neuron in the middle layer (dense layer) has access to all input values. The ReLU
activation function outputs its input when that input is positive, and zero otherwise. The
feature extraction function was applied to the input (which includes clean and malicious
files); the hash of each token is then taken and the hashed values are fed into this layer.
Furthermore, the training was done in batches, using a feature generator each time:
model.fit_generator(generator=training_generator, steps_per_epoch=num_obs_per_epoch /
batch_size, epochs=10, verbose=1)
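The feature extraction function and the training generator are not shown in the text. The sketch below is a hypothetical stand-in that illustrates the two ideas: hashing tokens into a fixed-length vector, and yielding batches indefinitely in the way a Keras-style generator is expected to. The names `hash_features` and `batch_generator`, the bucket count and the use of MD5 as a stable hash are all assumptions.

```python
import hashlib

def hash_features(tokens, n_buckets=1024):
    """Count tokens into a fixed number of buckets chosen by a stable hash."""
    vector = [0] * n_buckets
    for token in tokens:
        digest = hashlib.md5(token.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:4], "little") % n_buckets
        vector[bucket] += 1
    return vector

def batch_generator(samples, labels, batch_size):
    """Yield (features, labels) batches forever, cycling over the data."""
    while True:
        for start in range(0, len(samples), batch_size):
            chunk = samples[start:start + batch_size]
            yield ([hash_features(tokens) for tokens in chunk],
                   labels[start:start + batch_size])

# Hypothetical token lists, for example strings or imports extracted from files.
samples = [["kernel32.dll", "CreateFileA"],
           ["ws2_32.dll", "connect", "send"],
           ["user32.dll"]]
labels = [0, 1, 0]

training_generator = batch_generator(samples, labels, batch_size=2)
batch_features, batch_labels = next(training_generator)
print(len(batch_features), batch_labels)
```

A stable hash is used because Python's built-in `hash` is randomized per process for strings, which would place the same token in different buckets from one run to the next.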
OUTPUT SCREENS
In the above screen, click the ‘Upload Android Malware Dataset’ button to upload the dataset.
In the above screen, the ‘AndroidDataset.csv’ file is being uploaded; after the upload, the
screen below is displayed.
Now click the ‘Generate Train & Test Model’ button to split the dataset into training and
test parts. All machine learning algorithms use 80% of the dataset for training and 20% to
test the accuracy of the trained model. Clicking the button produces the train and test sets.
The above screen shows that there are 3799 Android app records in total, of which the
application uses 3039 for training and 760 for testing. With both sets available, click the
‘Run SVM Algorithm’ button to build the SVM model on the training set and obtain its accuracy
on the test set.
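The 80/20 split and the reported record counts can be reproduced with a short sketch. Rounding the test size up is an assumption, chosen because it yields the 3039 and 760 records reported above; the seed and the function name are hypothetical.

```python
import math
import random

def train_test_split_indices(n_records, test_fraction=0.2, seed=42):
    """Shuffle record indices and split them into train and test lists."""
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(n_records * test_fraction)  # assumption: round test size up
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = train_test_split_indices(3799)
print(len(train_idx), len(test_idx))
```

Shuffling before splitting avoids a test set made up only of the last rows of the file, which matters when the dataset is ordered by class.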
In the above screen, SVM achieved 98% accuracy. Now click the ‘Run SVM with Genetic
Algorithm’ button to select an optimized subset of features and then run SVM on those
features to obtain its accuracy.
In the above screen, SVM with Genetic Algorithm achieved 93% accuracy. Its accuracy is lower
than that of plain SVM, but its execution time is also lower, as the comparison graph shows.
(Note: when the genetic algorithm runs, four empty windows open; close all four and let the
main window continue running.)
The above console shows that the genetic algorithm chose 40 features from all the dataset features.
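The project's genetic algorithm code is not shown, so the following is only a minimal sketch of how a genetic algorithm can select a feature subset. Each individual is a 0/1 mask over the features. The truncation selection, single-point crossover, bit-flip mutation and elitism used here are assumptions, and the fitness function is a toy stand-in; in practice it would be the accuracy of a classifier trained on the selected columns.

```python
import random

def genetic_feature_selection(n_features, fitness, pop_size=20, generations=30,
                              mutation_rate=0.05, seed=0):
    """Evolve 0/1 feature masks and return the fittest mask found."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_features)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[:pop_size // 2]         # truncation selection
        children = [ranked[0][:]]                # elitism: keep the best mask
        while len(children) < pop_size:
            mother, father = rng.sample(parents, 2)
            cut = rng.randrange(1, n_features)   # single-point crossover
            child = mother[:cut] + father[cut:]
            child = [1 - bit if rng.random() < mutation_rate else bit
                     for bit in child]           # bit-flip mutation
            children.append(child)
        population = children
    return max(population, key=fitness)

# Toy fitness (hypothetical): reward three "useful" features, penalize mask size.
USEFUL = (0, 3, 5)

def toy_fitness(mask):
    return sum(mask[i] for i in USEFUL) - 0.1 * sum(mask)

best_mask = genetic_feature_selection(n_features=8, fitness=toy_fitness)
selected = [i for i, bit in enumerate(best_mask) if bit]
print(best_mask, selected)
```

Training on fewer selected columns is one plausible reason for the shorter model-building times reported for the genetic variants.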
Now click the ‘Run Neural Network Algorithm’ button to test the neural network's accuracy.
In the above screen, the neural network achieved 98.64% accuracy. Now click the ‘Run Neural
Network with Genetic Algorithm’ button to obtain the NN accuracy with the genetic algorithm.
In the above screen, NN with Genetic Algorithm achieved 98.02% accuracy. Now click the
‘Accuracy Graph’ button to see the accuracy of all algorithms in a graph.
In the above graph, the x-axis represents the algorithm name and the y-axis represents
accuracy; overall, SVM achieved high accuracy. Now click the ‘Execution Time Graph’ button to
see the execution time of all algorithms.
In the above graph, the x-axis represents the algorithm name and the y-axis represents
execution time. From this graph we can conclude that, with the genetic algorithm, the machine
learning algorithms take less time to build the model.
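The text does not state how execution time was measured. One simple way, shown here as an assumption, is to wrap each model-building call with a wall-clock timer; `timed` is a hypothetical helper and the summed range stands in for a training call.

```python
import time

def timed(fn, *args, **kwargs):
    """Run fn and return its result together with the elapsed seconds."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

# Stand-in workload; in the application this would be a model-training call.
result, seconds = timed(sum, range(100000))
print(result, seconds)
```

Collecting one such duration per algorithm gives the values plotted on the y-axis of an execution time graph.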
CONCLUSION AND FUTURE ENHANCEMENT
The information age has recently discovered the value of big data and of the information
that can hide in large, disparate data sources. The current interest in data has also spread
across multiple applications that detect and prevent attacks. New technologies now permit an
advanced analytics approach that leverages big data. In cybersecurity, machine learning
algorithms can be used to detect external intrusions, for example by identifying patterns in
the behavior of attackers performing reconnaissance, but also to detect internal risks. The
analysis aims to provide visualization so that human judgment can be applied to draw
conclusions. By combining data from system log files, historical data on IP addresses,
honeypots, system and user behaviors, etc., a more comprehensive picture of a normal
situation is built. The key is to analyze multiple sources and patterns in order to flag
unwanted behavior.
Furthermore, machine learning is used for attack detection and attribution, and several
use cases of machine learning exist in penetration testing. The work done in this
paper shows that different approaches can be leveraged to detect malware using machine
learning. Several algorithms have been implemented, trained and tested. For each algorithm,
the methodology for detecting malware has been described in detail.
Our future plans consist of studying and enhancing the detection of malware using hybrid
training models and ensemble learning. These algorithms can also be built using other
parameters and training data. In addition, as a next step we plan to combine multiple
analysis techniques: for a complete detection mechanism, static, dynamic and machine learning
techniques will be used together to analyse malware.