Phase 1 Report Group ID CSE19-G58 Malware Detection Using ML
ON
the degree of
BACHELOR OF TECHNOLOGY
In
COMPUTER SCIENCE AND ENGINEERING
Submitted to:
Dr. Atul Kumar Srivastava
Assistant Professor
Submitted by:
Anagh Sharma, CSE A
Kartikey Sharma, CSE A
Vishal Bora, CSE ML
Harendra Singh Bisht, CSE A
SCHOOL OF COMPUTING
DIT UNIVERSITY, DEHRADUN
(State Private University through State Legislature Act No. 10 of 2013 of Uttarakhand and approved by UGC)
We hereby certify that the work, which is being presented in the Phase 1 report, entitled
Malware Detection Using Machine Learning, in partial fulfilment of the requirement for the
award of the Degree of Bachelor of Technology and submitted to the DIT University is an
authentic record of our work carried out under the guidance of Dr. Atul Kumar Srivastava.
Date: 23-10-2021
Signature of the Candidates
1. Anagh Sharma 2. Kartikey Sharma 3. Vishal Bora 4. Harendra Singh Bisht
Signature of Guide
ACKNOWLEDGEMENT
This project is a culmination of invaluable guidance and encouragement from various people at
DIT University.
We would first like to thank our guide, Dr. Atul Kumar Srivastava, for his encouragement and
guidance throughout the project. We are wholeheartedly thankful to him for giving us his valuable
time and attention and for providing regular feedback that helped us progress the project
on time. We would also like to thank our friends and family for their support.
ABSTRACT
TABLE OF CONTENTS
Declaration
Acknowledgement
Abstract
Chapter 1 – Introduction
1.1. Malicious software
1.2. Machine learning
1.3. Malware detection using machine learning
Chapter 2 – Phase 1: Description
2.1. Study of datasets
2.2. Portable Executable file format
2.3. Feature engineering
Chapter 3 – Tools and Technologies
Chapter 4 – Summary: Phase 1
Chapter 5 – Project Pathway
Bibliography
Annexure – Implementation and Code
List of Figures
Chapter 1 – Introduction
In recent years, the increased usage and widespread understanding of computer systems has led to
a growth in attempts to exploit these systems for data and money, or simply with the intent of
vandalism. According to Cybersecurity Ventures, ransomware attacks alone are projected to cost
over twenty billion US dollars in 2021 [1]. This threat has driven growth in the malware analysis
services sector, which offers solutions that help classify software as malicious or benign.
A study by MarketsandMarkets Research Pvt. Ltd. concluded that between 2019 and 2024, the market
for malware analysis would grow from three to twelve billion US dollars [2].
1.1. Malicious software
Malicious software, or malware for short, is a program that is designed to compromise a digital
system. Malware specimens were initially made for experimentation purposes to study the
vulnerabilities in computer system architecture and programming. Malware has rapidly evolved
from targeting personal computers to almost every device used today, from mobile phones to
ATMs. The diversity of malware and the complexity with which it is launched are expected to
keep increasing.
4. Spyware: Malware that installs itself on a computer to monitor the user's behaviour and to
steal sensitive information. Spyware can also be used to grant attackers remote access to the
compromised system.
5. Adware: This type of software is designed to help companies generate more revenue by
automatically displaying advertisement banners and pop-ups while another program is
running.
6. Ransomware: A program which threatens to perpetually block or publish personal user data
unless a ransom is paid. The attacker typically uses a disguised link to trick the user into
downloading the malware file, which then encrypts the user data. The data can then only be
unlocked with a secret key, which is usually promised to be revealed upon payment.
1.2. Machine learning
Machine learning is a part of the broader subject of artificial intelligence which enables
a machine to use real-world data in order to solve a problem. It combines statistics, applied
mathematics, and computer science. Machine learning allows computers to improve their
performance automatically from experience, without any additional programming to make those
improvements. A training algorithm is used to generate a model which can generalise well to
unseen data. Machine learning approaches are commonly grouped as follows:
1. Supervised Learning: This methodology uses real-world datasets which consist of training
features and associated labels to generate an inferred function which can then be used to
label unseen and unlabelled data.
2. Unsupervised Learning: In the unsupervised learning approach, an unlabelled and
uncategorized dataset is used to train a machine learning model, which must discover
patterns and structure in the data by itself, without any labels to guide it.
3. Reinforcement Learning: This approach involves training through reward and punishment
of the behaviour of an intelligent agent, which takes actions with the goal of maximizing
cumulative reward.
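As an illustration of the supervised case, the sketch below fits a nearest-centroid classifier on a handful of labelled rows and then labels an unseen row. It is only a minimal example: the two features, their values and the function names are hypothetical and are not taken from this project's datasets.

```python
from statistics import mean


def fit_centroids(features, labels):
    """Return {label: centroid} computed from labelled training rows."""
    grouped = {}
    for row, label in zip(features, labels):
        grouped.setdefault(label, []).append(row)
    return {
        label: [mean(column) for column in zip(*rows)]
        for label, rows in grouped.items()
    }


def predict(centroids, row):
    """Label an unseen row with the class of the closest centroid."""
    def squared_distance(centroid):
        return sum((x - c) ** 2 for x, c in zip(row, centroid))
    return min(centroids, key=lambda label: squared_distance(centroids[label]))


# Hypothetical toy data: [file size in MB, fraction of high-entropy content]
train_x = [[0.2, 0.9], [0.3, 0.8], [5.0, 0.1], [4.5, 0.2]]
train_y = ["malicious", "malicious", "benign", "benign"]

model = fit_centroids(train_x, train_y)      # the "inferred function"
print(predict(model, [0.25, 0.85]))          # malicious
```

The dictionary of centroids plays the role of the inferred function: it is built only from the training features and labels, and is then applied to data that carries no label.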
Deep learning is a subfield of machine learning wherein features are extracted from raw data
progressively, from lower to higher degrees of abstraction. Deep learning models are composed
of multiple layers arranged in a hierarchical manner of increasing complexity.
1.3. Malware detection using machine learning
Earlier, malware detection relied on manual configuration of malware fingerprints by analysts
and security experts. This signature-based malware detection was used extensively across the
industry and involved detection through manually configured and regularly updated
pre-execution rules. These rules were based on features of files, such as fragments of code
and several other file properties.
This methodology eventually became impractical: the substantial increase in the volume of
malware produced made manual configuration of fingerprints unrealistic, and even a small
change in a file’s properties could render these rules ineffective.
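The fragility of fingerprints can be seen in a small sketch. It assumes the simplest possible signature, a SHA-256 hash of the whole file, which is narrower than the rule sets described above; the sample bytes and function names are hypothetical.

```python
import hashlib


def fingerprint(data: bytes) -> str:
    """Whole-file SHA-256 digest, used here as a stand-in for a signature."""
    return hashlib.sha256(data).hexdigest()


# Hypothetical "known malware" bytes; a real database would hold many digests.
known_sample = b"MZ\x90\x00 hypothetical malicious payload"
signatures = {fingerprint(known_sample)}


def is_known_malware(data: bytes) -> bool:
    """Flag a file only if its digest matches a stored signature exactly."""
    return fingerprint(data) in signatures


variant = known_sample + b"\x00"          # the same payload with one byte appended

print(is_known_malware(known_sample))     # True
print(is_known_malware(variant))          # False
```

A single appended byte is enough to evade the exact-match rule, which is why each new variant would need its own manually added signature.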
This has led to the exploration of machine learning and deep learning-based malware detection
approaches which are able to use intelligent solutions in order to keep up with the evolution of
malware. The following methods are used in malware detection:
1. Static analysis: Features of a program are extracted and used to predict its nature without
executing the code. Such features include, but are not limited to, the file format, the binary
data of the file, and text strings. This method is the least computationally expensive but
could fail to detect malware that uses effective code obfuscation.
2. Dynamic analysis: This method takes a behaviour-based approach to examining an
executable file. The executable file is run and its behaviour is studied on an air-gapped
machine, a virtual machine or a sandbox. The behaviour includes API calls, memory writes,
registry changes, etc. This method is typically used after static methods have been
exhausted.
3. Hybrid malware detection: This method combines both static and dynamic methodologies.
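To make the static case concrete, the sketch below derives two simple features from a file's raw bytes without executing it: a normalised byte histogram and the printable text strings. The function names and the sample bytes are illustrative assumptions, not the feature set used in this project.

```python
import re
from collections import Counter


def byte_histogram(data: bytes) -> list:
    """Relative frequency of each of the 256 possible byte values."""
    counts = Counter(data)
    total = len(data) or 1                     # avoid dividing by zero on empty input
    return [counts.get(value, 0) / total for value in range(256)]


def printable_strings(data: bytes, min_length: int = 4) -> list:
    """Runs of printable ASCII characters that are at least min_length long."""
    pattern = rb"[\x20-\x7e]{%d,}" % min_length
    return [match.decode("ascii") for match in re.findall(pattern, data)]


# Hypothetical bytes standing in for the content of an executable file.
sample = b"\x00\x01kernel32.dll\x00\xffhi\x00CreateFileA"

print(printable_strings(sample))               # ['kernel32.dll', 'CreateFileA']
print(len(byte_histogram(sample)))             # 256
```

Both features are computed from the bytes alone, so the file is never run; an obfuscated or packed file could hide its strings and flatten its histogram, which is the weakness noted for static analysis.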
Deep learning-based approaches do not require expertly selected feature configurations based on
domain knowledge. Instead, deep learning involves approaches that include:
Chapter 2 – Phase 1: Description
Following the machine learning pathway, the first step in the first phase of this project is to explore
various datasets suitable for this study. The datasets explored in this project include executable
files, binary representations of those files, and metadata information. The executable files are
studied without their execution as this study focuses on static malware detection.
This is followed by an examination of features from a dataset to assess which features could be
most useful to train a generalized machine learning model. At this stage, this is done by learning
from the results of related work in this field. In the further stages, features will be selected based
upon usefulness measured by training models and observing their performance on unseen data.
2.1. Study of datasets
This is a fundamental stage of the machine learning pathway and involves identifying appropriate
data sources and available datasets. The following are some of the attributes of a dataset which are
used to assess the usefulness of a dataset:
The following datasets were studied in this project:
1. Microsoft Malware Classification Challenge (BIG 2015) [3]: This dataset was published
by Microsoft as part of a malware classification challenge hosted on Kaggle in 2015. When
uncompressed, this dataset contains nearly half a terabyte of data. The dataset consists of
bytecode and disassembly code from over twenty thousand malware files.
The malware samples which are represented in this dataset belong to nine malware
families. These malware families, or classes, are shown in Fig 1.
Fig 1: Malware families in the MMCC dataset (Ramnit, Lollipop, Kelihos_ver3, Vundo, Simda,
Tracur, Kelihos_ver1, Obfuscator.ACY, Gatak)
Each malware sample has two files associated with it which are described in Fig 2.
1. Hexadecimal representation of a sample’s binary content: The PE header is removed to
ensure sterility.
2. Metadata information of the sample: Includes function calls, strings, etc.
Fig 2: Malware sample files
The SoReL-20M dataset contains the following data for each malware sample:
a. Features of the samples that are derived in accordance with the format of the EMBER
2.0 dataset.
b. Labels for each sample which are obtained by using both external as well as internal
Sophos sources.
c. PE metadata of malware files obtained using the pefile library.
d. Binary files of malware samples.
3. Malimg (Nataraj et al., 2011) dataset [5] [6]: This dataset contains PNG image
representations of over nine thousand malware files. 25 malware families are
represented in this dataset.
The malware families represented in this dataset are shown in Fig 3.
2.2. Portable Executable file format
The Portable Executable (PE) format is used for executable and DLL files in Windows
environments and was first used in the Windows NT operating system. The contents of the file
are laid out in a linear manner. Features are extracted from this file to train a machine
learning model which is used in static malware analysis.
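Because the PE layout is linear, a few header fields can be read with fixed offsets. The sketch below is a minimal illustration using only the Python standard library, not the extraction method used in this project: it checks the "MZ" signature, follows the 4-byte offset stored at 0x3C to the "PE\0\0" signature, and unpacks the 20-byte COFF file header that follows. The helper that builds a synthetic header exists only so that the example can run without a real executable.

```python
import struct


def parse_pe_header(data: bytes) -> dict:
    """Read a few COFF file header fields from the start of a PE file."""
    if data[:2] != b"MZ":
        raise ValueError("missing MZ signature")
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)    # e_lfanew field
    if data[pe_offset:pe_offset + 4] != b"PE\x00\x00":
        raise ValueError("missing PE signature")
    (machine, n_sections, timestamp, _symbols_ptr, _n_symbols,
     optional_size, characteristics) = struct.unpack_from(
        "<HHIIIHH", data, pe_offset + 4)
    return {
        "machine": machine,
        "number_of_sections": n_sections,
        "timestamp": timestamp,
        "size_of_optional_header": optional_size,
        "characteristics": characteristics,
    }


def build_minimal_header(pe_offset=0x40, machine=0x14C, n_sections=3) -> bytes:
    """Synthetic header for demonstration only; it is not a runnable executable."""
    dos = bytearray(pe_offset)
    dos[0:2] = b"MZ"
    struct.pack_into("<I", dos, 0x3C, pe_offset)
    coff = struct.pack("<HHIIIHH", machine, n_sections, 0, 0, 0, 0xE0, 0x0102)
    return bytes(dos) + b"PE\x00\x00" + coff


info = parse_pe_header(build_minimal_header())
print(info["machine"] == 0x14C, info["number_of_sections"])   # True 3
```

Fields such as the number of sections or the characteristics flags are examples of values that can be turned into features for static analysis.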
Fig 4: PE file structure [7]
2.3. Feature engineering
At this stage, the raw data of the samples is transformed into features which can be used to
effectively train a machine learning model.
Feature Engineering -> Fitting -> Inference
Fig 5: Feature engineering in machine learning pathway
In this instance of feature engineering, the hexadecimal content of the malware file samples is
transformed to produce a PNG image representation of the malware file. These PNG files will be
used in the later phases of this project to train a deep learning model.
Malware Binary Data -> Hexadecimal Representation -> PNG Image
Fig 6: Binary file to PNG representation
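The first half of this transformation can be sketched as follows. The example assumes that each line of the hexadecimal file starts with an address column followed by two-digit hex values, and that unknown bytes are written as "??"; both are assumptions about the file layout, and the function names are hypothetical. The result is a grid of pixel intensities (0 to 255) that an imaging library could then save as a greyscale PNG.

```python
def parse_hex_dump(lines):
    """Convert hex-dump lines into a flat list of byte values (0-255)."""
    values = []
    for line in lines:
        tokens = line.split()[1:]              # drop the leading address column
        values.extend(int(token, 16) for token in tokens if token != "??")
    return values


def to_image_rows(values, width):
    """Arrange byte values into rows of fixed width, dropping the incomplete tail."""
    usable = len(values) - len(values) % width
    return [values[i:i + width] for i in range(0, usable, width)]


# Two hypothetical dump lines; real samples contain many thousands of lines.
dump = [
    "00401000 56 8D 44 24",
    "00401004 08 50 ?? FF",
]

pixels = to_image_rows(parse_hex_dump(dump), width=3)
print(pixels)                                  # [[86, 141, 68], [36, 8, 80]]
```

Each byte becomes one pixel, so the image width chosen here determines how byte patterns line up vertically in the resulting picture.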
Feature engineering is a continuous process in the machine learning pathway and newer features
will be explored throughout the later stages as this project evolves. The image representation of
malware files is only an example of possible features.
The PNG images cannot be fed directly to the model since the sizes of malware images vary
greatly. These images must first be processed so that they share a common size and scale,
which helps a deep learning model to converge faster.
A sample of normalized tensor image data created using the TensorFlow Python library is shown
in Fig 8. This data is used to train a deep learning model.
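The two processing steps can be sketched in plain Python, assuming nearest-neighbour resizing and division by 255 as the scaling rule. This is a simplified stand-in for the library routines used in the project, and the function names are hypothetical.

```python
def resize_nearest(image, height, width):
    """Resize a 2D list of pixels to (height, width) by nearest-neighbour sampling."""
    src_h, src_w = len(image), len(image[0])
    return [
        [image[r * src_h // height][c * src_w // width] for c in range(width)]
        for r in range(height)
    ]


def normalise(image):
    """Scale 8-bit pixel values from the range 0-255 into the range 0.0-1.0."""
    return [[pixel / 255.0 for pixel in row] for row in image]


# A hypothetical 2x2 greyscale image brought to a common 4x4 size.
tiny = [[0, 255],
        [128, 64]]

common = normalise(resize_nearest(tiny, 4, 4))
print(common[0])                               # [0.0, 0.0, 1.0, 1.0]
```

After this step every image has the same shape and every value lies between 0 and 1, so a batch of images can be stacked into a single tensor.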
Feature engineering is the most time-consuming stage of the machine learning pathway and
involves deep statistical analysis of the data. An exhaustive study of the features on the basis of
domain knowledge and with respect to specific machine learning models is required.
Finally, after feature engineering, the resultant data is split into the following two randomly
generated subsets using the train_test_split() method of the scikit-learn Python library:
1. Training set: Used to train a machine learning model to help it understand the relation
between the features and labels.
2. Testing set: The performance of a model is assessed using this subset by the use of several
evaluation metrics.
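The idea behind the random split can be illustrated with the standard library alone. The function below is a simplified stand-in written for this explanation, not scikit-learn's implementation; the 0.3 test fraction mirrors the value used in the annexure, and the seed parameter is an added assumption that makes the shuffle repeatable.

```python
import random


def split_train_test(samples, labels, test_size=0.3, seed=0):
    """Shuffle indices once, then cut them into a testing part and a training part."""
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(indices) * test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(sequence, chosen):
        return [sequence[i] for i in chosen]

    return (pick(samples, train_idx), pick(samples, test_idx),
            pick(labels, train_idx), pick(labels, test_idx))


# Hypothetical data: ten samples whose label is simply twice the sample value.
samples = list(range(10))
labels = [2 * s for s in samples]

x_train, x_test, y_train, y_test = split_train_test(samples, labels)
print(len(x_train), len(x_test))               # 7 3
```

Shuffling the indices rather than the data keeps every sample paired with its own label, and the testing subset is never seen during training.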
Chapter 3 – Tools and Technologies
Hardware Requirements:
1. CPU : Intel Core i5 8th generation or better
2. GPU : Preferred
3. RAM : 8 GB or better
Software Requirements:
1. Anaconda distribution, including all essential machine learning tools.
Tools Used:
1. Anaconda: A Python and R distribution which is primarily used for data science and
machine learning applications. The primary objective of this tool is to make package
management simple. It includes data-science packages adapted to Windows, Linux and
macOS.
2. Keras: An open-source Python library that works as an interface for the TensorFlow Python
library. Keras provides high-level APIs for machine learning.
3. JupyterLab: An interactive environment which helps in integrating code, data and notebooks
in a single interface.
4. Amazon S3: Amazon Simple Storage Service, or Amazon S3, is a cloud service for object
storage provided by Amazon Web Services. A practically unlimited amount of data can be
stored and accessed from anywhere through Amazon S3.
Chapter 4 – Summary: Phase 1
The first phase of this project follows the machine learning procedure as described in Fig 20.
Chapter 5 – Project Pathway
The project on malware detection using machine learning is divided into three phases as described
in Fig 21.
Bibliography
[1] D. Braue, "Global Ransomware Damage Costs Predicted To Exceed $265 Billion By 2031," 3 June
2021. [Online]. Available:
https://cybersecurityventures.com/global-ransomware-damage-costs-predicted-to-reach-250-billion-usd-by-2031/.
[2] businesswire, "Global Malware Analysis Market Expected to Grow with a CAGR of 31% During the
Forecast Period, 2019-2024 - ResearchAndMarkets.com," businesswire, 13 December 2019.
[Online]. Available:
https://www.businesswire.com/news/home/20191213005123/en/Global-Malware-Analysis-Market-Expected-to-Grow-with-a-CAGR-of-31-During-the-Forecast-Period-2019-2024---ResearchAndMarkets.com.
[3] Microsoft, "Microsoft Malware Classification Challenge (BIG 2015)," Microsoft, 2015. [Online].
Available: http://arxiv.org/abs/1802.10135.
[5] L. Nataraj, S. Karthikeyan, G. Jacob and B. S. Manjunath, "Malware Images: Visualization and
Automatic Classification," 2011.
[6] H. Mallet, "Malware Classification using Convolutional Neural Networks — Step by Step Tutorial,"
27 May 2020. [Online]. Available:
https://towardsdatascience.com/malware-classification-using-convolutional-neural-networks-step-by-step-tutorial-a3e8d97122f.
Annexure – Implementation and Code
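import os
from math import log

import numpy as np
from PIL import Image

# `root` is the dataset folder, defined in an earlier cell, with one sub-folder per class.
for subdir, dirs, files in os.walk(root):
    for d in dirs:
        print(f'Iterating through folder of class: {d}')

        def convert(array, name):
            # Turn an (n, 16) array of byte values into a greyscale PNG in the class folder.
            print('Converting: ' + name)
            b = int((array.shape[0] * 16) ** 0.5)  # side of a square holding all bytes
            b = 2 ** (int(log(b) / log(2)) + 1)    # width: next power of two above that side
            a = int(array.shape[0] * 16 / b)       # height: number of complete rows
            array = array[:a * b // 16, :]         # drop trailing bytes that do not fill a row
            array = np.reshape(array, (a, b))
            img = Image.fromarray(np.uint8(array))
            img.save(os.path.join(subdir, d, name + '.png'), "PNG")
            return img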
Converting: 05Kps4iFw8mOLJZQrb1H.bytes
Converting: 065EZhxgbLRSHsB87uIF.bytes
Converting: 06aLOj8EUXMByS423sum.bytes
Iterating through folder of class: 3_Kelihos_ver3
Converting: 04BfoQRA6XEshiNuI7pF.bytes
Converting: 04cvLCVPqBMs6yn5xGlE.bytes
Converting: 04QzZ3DVdPsEp9elLR65.bytes
Converting: 04sJnMaORYc1SV5pKjrP.bytes
Converting: 06arUi9q3wHS2C8RZxeB.bytes
Converting: 06KfrF7ltESna2ZHPVp5.bytes
Converting: 06osXqPUVM1HbvBGNncT.bytes
Converting: 07nrG1cLKUPxjOlWMFiV.bytes
Converting: 09bfacpUzuBN5W3S8KTo.bytes
Converting: 0aSTGBVRXeJhx5OcpsgC.bytes
Iterating through folder of class: 4_Vundo
Converting: 0qPGt4cRVk9NoiJgubf2.bytes
Converting: 1bL4yiwCUvSOg7tBJudf.bytes
Converting: 1eOaAY4fpV38LIdhxl95.bytes
Converting: 1FacC02JPfxSdXeD7MEw.bytes
Converting: 1gx83bLB4PSsYIKCTlZt.bytes
Converting: 1PQFYMSBLAO9TmKk2Zhj.bytes
Converting: 1S9ui2XqltCJAOGUPw7v.bytes
Converting: 1yC7BzWHgtI2FibhQ0km.bytes
Converting: 2CfJMa5HIn6D1d9EXbpe.bytes
Converting: 2g4C03AeqoZR6ctiF1Qr.bytes
Iterating through folder of class: 5_Simda
Converting: 0qjuDC7Rhx9rHkLlItAp.bytes
Converting: 1IpWLz6eyhVxDAfQMKEd.bytes
Converting: 1KB3Z7gd5aN4Xmx8W0sf.bytes
Converting: 2aHfrLhcPTj5GnFZXUCN.bytes
Converting: 2pwjzv6eGEb8QmHPfxSc.bytes
Converting: 2qAtoGOuMQZdmH3y7bEY.bytes
Converting: 3m8kb5ILPrHcMC1o9Nht.bytes
Converting: 3zZpqyclD9B2v5Qas18m.bytes
Converting: 40KRbGeQZ8PwcUgt5joa.bytes
Converting: 4UTMdcZkxzLvwygO8EuK.bytes
Iterating through folder of class: 6_Tracur
Converting: 02IOCvYEy8mjiuAQHax3.bytes
Converting: 02mlBLHZTDFXGa7Nt6cr.bytes
Converting: 03nJaQV6K2ObICUmyWoR.bytes
Converting: 05LHG8fR3iPn6agIo9z7.bytes
Converting: 08BX5Slp2I1FraZWbc6j.bytes
Converting: 09CPNMYyQjSguFrE8UOf.bytes
Converting: 09sXMJUHwQWVanrhzAoT.bytes
Converting: 0BZQIJak6Pu2tyAXfrzR.bytes
Converting: 0Cq4wfhLrKBJiut1lYAZ.bytes
Converting: 0df4cbsTBCn1VGW8lQRv.bytes
Iterating through folder of class: 7_Kelihos_ver1
Converting: 09LXtWxm1EbK5uVqcQS3.bytes
Converting: 0ACDbR5M3ZhBJajygTuf.bytes
Converting: 0b5LqcWix3J4fGIEhXQu.bytes
Converting: 0BIdbVDEgmPwjYF4xzir.bytes
Converting: 0eN9lyQfwmTVk7C2ZoYp.bytes
Converting: 0hZEqJ5eMVjU21HAG7Ii.bytes
Converting: 0KgE6ksUeytoHfl2cT4r.bytes
Converting: 0LAXajqhQy7po16dw8Tx.bytes
Converting: 0M7aSiE9csDzkmfKheVt.bytes
Converting: 0PlfqyKM1JtYZx2me5FU.bytes
Iterating through folder of class: 8_Obfuscator.ACY
Converting: 01SuzwMJEIXsK7A8dQbl.bytes
Converting: 04hSzLv5s2TDYPlcgpHB.bytes
Converting: 0aKlH1MRxLmv34QGhEJP.bytes
Converting: 0aVNj3qFgEZI6Akf4Kuv.bytes
Converting: 0aVxkvmflEizUBG2rMT4.bytes
Converting: 0BFIPv1rO83whtpMYyAs.bytes
Converting: 0BY2iPso3bEmudlUzpfq.bytes
Converting: 0C4aVbN58O1nAigFJt9z.bytes
Converting: 0cTu2bkefOAJqIhYUWFK.bytes
Converting: 0fhnXI9ESr4jgWmkiaTe.bytes
Iterating through folder of class: 9_Gatak
Converting: 01azqd4InC7m9JpocGv5.bytes
Converting: 01jsnpXSAlgw6aPeDxrU.bytes
Converting: 04mcPSei852tgIKUwTJr.bytes
Converting: 07ECKjDTyQLnabNoxrIH.bytes
Converting: 0AV6MPlrTWG4fYI7NBtQ.bytes
Converting: 0bjN3Kgw5OATSreRmEdi.bytes
Converting: 0co46B8IkPt2UN3HSaw7.bytes
Converting: 0CPaAXtyswrBq83D6VEg.bytes
Converting: 0dauMIK4ATfybzqUgNLc.bytes
Converting: 0dhL8Jvcswa7U1qHiDS5.bytes
# Generating DataSet
from tensorflow.keras.preprocessing.image import ImageDataGenerator

the_data = ImageDataGenerator().flow_from_directory(directory=root,
                                                    target_size=(512, 512), batch_size=100)
Train-Test Split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(imgs / 255., labels,
                                                    test_size=0.3)
print(f'Shape of the array holding image data: {imgs.shape}')
print(f'Shape of the array holding labels of the image data: {labels.shape}')