MULUNGUSHI UNIVERSITY

Pursuing the frontiers of Knowledge

SCHOOL OF SCIENCE, ENGINEERING AND TECHNOLOGY

Project Title: HOST-BASED RANSOMWARE DETECTION MODEL WITH MACHINE LEARNING

Submitted by: Yotam Mkandawire

Student Number: 201905063

REPORT SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE AWARD OF
BACHELOR OF SCIENCE DEGREE IN COMPUTER SCIENCE FOR THE 2021/2022 ACADEMIC YEAR.

SUPERVISOR: DR. A. ZIMBA


TABLE OF CONTENTS
ACKNOWLEDGEMENTS
ABSTRACT
LIST OF FIGURES
LIST OF TABLES
ACRONYMS AND ABBREVIATIONS
CHAPTER 1 - INTRODUCTION
1.1 INTRODUCTION
1.2 PROBLEM STATEMENT
1.3 AIM
1.4 OBJECTIVES
1.5 PROJECT SCOPE
1.6 PROJECT JUSTIFICATION
1.7 SUMMARY
CHAPTER 2 - LITERATURE REVIEW
2.1 INTRODUCTION
2.2 RELATED LITERATURE
2.2.1 EVOLUTION OF ATTACK TECHNIQUES
2.3 REVIEW OF EXISTING MODELS
2.3.1 BRIEF ON MALWARE ANALYSIS
2.3.2 DETECTION TECHNIQUES
2.4 COMPARISON OF MODELS
2.5 PROPOSED MODEL
2.5.1 SYSTEM ARCHITECTURE
2.6 SELECTED METHODOLOGY
2.6.1 SELECTED METHODOLOGY (INCREMENTAL)
2.6.2 WATERFALL MODEL FOR THE DEVELOPMENT OF THE SOFTWARE
2.7 TECHNOLOGIES AND FRAMEWORK TO BE USED
2.7.1 PYTHON PROGRAMMING LANGUAGE
2.7.2 PyCHARM IDE
2.7.3 STREAMLIT
2.7.4 JUPYTER NOTEBOOK
2.7.5 SCIKIT-LEARN
2.7.6 CLASSIFIERS
2.7.7 CUCKOO SANDBOX
2.7.8 DATASET
2.8 SUMMARY
CHAPTER 3 – SYSTEM ANALYSIS AND DESIGN
3.1 INTRODUCTION
3.2 SYSTEM ANALYSIS
3.2.1 REQUIREMENTS GATHERING AND ANALYSIS
3.2.1.1 HOST BASED RANSOMWARE DETECTION WITH ML
3.3 SYSTEM DESIGN
3.3.1 GENERAL OVERVIEW
3.3.2 MODEL TRAINING
3.3.2 WEB APPLICATION DESIGN
3.4 CONCLUSION
CHAPTER 4 – RESULTS ANALYSIS
4.1 INTRODUCTION
4.2 ENVIRONMENT DESCRIPTION
4.3 UNIT TESTING
Test case 1: Model Training module
Test case 2: Report Processing module
Test case 3: Detection module
4.3.1 TEST PLANNING
4.3.2 RESULTS OF MODEL TRAINING
4.4 SYSTEM TESTING
4.4.1 TEST PLANNING
4.4.2 USER INTERFACE TESTING
4.5 SUMMARY
CHAPTER 5 – PROJECT MANAGEMENT
5.1 INTRODUCTION
5.2 RISK AND QUALITY MANAGEMENT
5.3 EFFORT COSTING MODEL
5.4 EFFORT CALCULATIONS FOR PROJECT
5.7 SCHEDULING AND WORK PLAN
5.8 SUMMARY
CHAPTER 6 – CRITICAL EVALUATION
6.1 INTRODUCTION
6.2 REASON FOR UNDERTAKING THE PROJECT
6.3 MAIN LEARNING OUTCOME
6.4 CHALLENGES ENCOUNTERED
6.5 FUTURE WORK
6.6 CONCLUSION
CHAPTER 7 – CONCLUSION
7.1 INTRODUCTION
7.2 RESEARCH CONTRIBUTIONS
REFERENCES
APPENDICES

I, Yotam Mkandawire, student number 201905063, hereby declare that this is my
original work and that it has never been submitted to any university for any award.

SIGN:

DATE: 5th June, 2022.

ACKNOWLEDGEMENTS

First and foremost, I would like to acknowledge my God for his grace; in times when I was
faced with impossible deadlines and personal challenges, God strengthened me. Secondly, I
extend high acknowledgement to my family and friends for the love and support rendered to
me throughout the life of this project; it really does take a village to raise a child. Thirdly,
this academic work would not have come to fruition without the vigilant guidance of
Dr. Aaron Zimba; his faith in my capabilities and his patience in seeing this project come
together are inspiring. Last but not least, I would also like to acknowledge my lecturers for
always being there to support and guide me whenever they could.

ABSTRACT
Ransomware has become a common headline, and the impact of this class of malware has
expanded rapidly, leaving a trail of severe losses in its wake. Individuals and businesses
alike have fallen victim to ransomware, paying millions of dollars in ransom. Victims have
also suffered data losses as a result of failing to pay the ransom or failing to decrypt the
encrypted data. The chaos caused by ransomware has inspired research in the field.
Mitigation measures are also on the rise, but the majority of them focus on network-level
detection and prevention. This leaves a significant research gap in the field of host-based
ransomware mitigation approaches. The goal of this research is therefore to create a
host-based ransomware detection model/framework that is capable of detecting more recent
ransomware variants. Incremental and spiral software development methodologies were used
to develop the two main modules of this research. Much focus was dedicated to appropriately
labeling the dataset and extracting optimum features using conventional machine learning
classifiers. Multiple classifiers were tested and the classifier with the best accuracy was
selected. Features were extracted from the sandbox report and a prediction was made. The
results of the experiments showed an impressive success rate. The findings are useful as
cardinal points to consider in the field of ransomware detection and prevention.

Keywords: Ransomware, CryptoLocker, Crypto API, IDS and Machine Learning

LIST OF FIGURES
Figure 1: Evolution of ransomware attack techniques (Zimba and Chishimba, 2019)
Figure 2: Novel attack model (Zimba and Chishimba, 2019)
Figure 3: Recovery-prevention techniques (Kharraz et al., 2015)
Figure 4: I/O access monitor in UNVEIL (Kharraz et al., 2015)
Figure 5: Shows the workflow of the API monitoring program (Honda et al., 2018)
Figure 6: Flow chart of the proposed framework.
Figure 7: Incremental model (Sommerville, 2011).
Figure 8: Waterfall Model (Sommerville, 2011)
Figure 9: Overview of dataset.
Figure 10: System overview.
Figure 11: Initial step in model training.
Figure 12: Optimize input set with Extra Tree Classifier.
Figure 13: Split dataset to train and test set.
Figure 14: Model training.
Figure 15: Use case diagram.
Figure 16: Activity diagram.
Figure 17: Unit testing of Model training module.
Figure 18: Unit testing for Report Processing module.
Figure 19: Unit testing of the Detection module.
Figure 20: Classification report of the trained and tested Logistic Regression model.
Figure 21: Confusion matrix.
Figure 22: ROC curve and AUC.
Figure 23: Host Based Ransomware Detection Web Application.
Figure 24: Upload directory of the web application.
Figure 25: Submitted sample being processed.
Figure 26: Results of submitted sample.
Figure 27: Triple Constraint model (Van Wyngaard, Pretorius and Pretorius, 2012)
Figure 28: Shows the count-total for the proposed model
Figure 29: Shows the Complexity of Weighting Factors for the proposed model.
Figure 30: Shows the LOC for the proposed model.
Figure 31: Shows the effort and duration for the proposed model.
Figure 32: Gantt chart view of software development plan.
Figure 33: Schedule for software development.
Figure 34: Model training code snippet.
Figure 35: Report processing module code snippet.

LIST OF TABLES
Table 1: Comparison of Systems
Table 2: Technologies Used.
Table 3: Test planning
Table 4: COCOMO Constants

ACRONYMS AND ABBREVIATIONS

AES - Advanced Encryption Standard

API - Application Programming Interface

AUC - Area under the Curve

DLL - Dynamic-link library

FN - False Negative

FP - False Positive

IDS - Intrusion Detection System

I/O - Input/output

OS - Operating System

PE - Portable Executable

ROC - Receiver Operating Characteristics

RSA - Rivest–Shamir–Adleman

SRS - Software Requirements Specification

TF-IDF - Term Frequency - Inverse Document Frequency

TN - True Negative

TP - True Positive

CHAPTER 1 - INTRODUCTION

1.1 INTRODUCTION
Devising defense mechanisms against ransomware is an impossible task without an
insightful understanding of the paradigm. This chapter gives a background of ransomware,
followed by the problem statement, project aim, project objectives, project scope, project
justification, and a summary.

1.1.1 BACKGROUND OF THE STUDY

The AIDS Trojan Horse, infamously known as PC-CYBORG, first appeared in 1989 and was
the first known instance of ransomware. The malware demanded a $189 ransom from its
victims (Hernandez-Castro, Cartwright and Cartwright, 2020). This ransomware not only
proved the concept but also combined it with several attack strategies that are still in use
today. "To fool the recipients, the Trojan was placed in a socially engineered package with
a floppy disk. The attacker mass-mailed the item by surface mail, addressing it to a mailing
list to which the attacker had subscribed. The creator of this spyware was arrested on blackmail
charges." (Geri, Jota and Avert, 2006).

Adam Young and Moti Yung introduced the notion of crypto-virology to the academic
literature in 1996 (Young and Yung, 1996). A major feature of the Young and Yung technique
is the use of public-key cryptography for extortion, with the requirement that the
cryptographic scheme should not be open to compromise via key reverse-engineering. In other
words, once victims have been infected, they have no choice but to communicate with the
attackers and possibly pay a ransom in order to recover their files (Young and Yung, 1996).

Over the years, ransomware has been divided into two types: locker ransomware and
crypto-ransomware (O’Kane, Seer and Carlin, 2018). Locker ransomware corrupts or disrupts
basic computer functionality while leaving the victim's data intact; it typically locks
computing devices or user interfaces and demands a ransom payment to unlock them.
Crypto-ransomware, on the other hand, encrypts the victim's files on a computer or network
and demands a ransom to decrypt them. It is worth noting that crypto-ransomware attacks do
not encrypt the entire hard disk, but rather look for important file extensions that have the
greatest impact on victims (Human et al., 2021).

At first, ransomware was mainly a problem for the Windows platform. However, Linux,
Mac and Android systems have since all fallen prey to ransomware attacks. Ransomware
evolution has been observed to closely track technological advancement. For example, newer
devices such as smart watches are already targeted by ransomware, as is the smart thermostat
attacked by ransomware that researchers wrote in 2016 (Casen, Li and Williams, 2021). “If
researchers can do it, so can ransomware” (Savage, Cogan and Lau, 2015).

In this research, interest is directed towards host-based detection techniques for the
most recent ransomware attack techniques as of the time of writing.

1.2 PROBLEM STATEMENT


Ransomware is a modern threat because of the ever-increasing technical improvements
in crypto-virology and crypto-currency. The following are some of the problems brought about
by ransomware:

• Data loss: some data can never be recovered once encrypted, regardless of the
availability of a decryption key.
• Data insecurity: paying a ransom does not guarantee that the encrypted data will be
restored, nor does it guarantee data integrity and confidentiality.
• Data corruption: once encrypted, the integrity of some of the data is compromised.
• Denial of service: service in this regard means access to data, a computer or a network.
• Extortion: ransomware is fundamentally a crime.
• Abuse of crypto-currency: though the ransom payment method is left to the discretion
of the attacker, most ransomware uses crypto-currency as the mode of payment.
• Investigation challenges: due to the incorporation of crypto-currencies in newer
ransomware attack techniques, tracking and investigation by the authorities have been
very limited and challenging.

1.3 AIM
The main aim of this project is to design a host-based ransomware detection framework
with the aid of machine learning.

1.4 OBJECTIVES
The following are the objectives of the project.

i. To formulate a ransomware detection framework based on the review of previous
attack techniques.
ii. To validate the proposed novel framework.
iii. To build and validate a system based on the proposed detection framework.
iv. To make recommendations that will be useful in addressing challenges faced by the
novel host-level detection frameworks.

1.5 PROJECT SCOPE


The project will focus only on ransomware detection and not prevention. Detection is the
first, and an equally vital, stage in mitigating problems caused by ransomware.

The project will be limited to host-level detection only; as such, the study will not
look at network behavior or network-based attack and mitigation techniques.

The project will focus on the Windows operating system. This is because Windows, as of
the time of writing, remains the most used operating system globally compared to other
operating systems such as Linux, Mac, FreeBSD etc. (Computer operating systems market
share 2012-2021 | Statista, no date).

The project will not cover mobile device ransomware detection and mitigation
frameworks, and as such will not cover mobile applications and operating systems such as
Android, iOS, Solaris etc.

1.6 PROJECT JUSTIFICATION


Since its first introduction, ransomware has grown to become one of the biggest threats
to both individuals and enterprises globally. Over the years, the size of the ransom demanded
by attackers has grown exponentially. An annual ransomware report by Sophos (2021) shows
that the average cost of rectifying the consequences of the most recent ransomware attacks
was US$1.85 million (factoring in people and downtime, network and device cost, opportunity
cost, ransom paid, etc.). This is more than double the US$761,106 cost reported in the 2020
report. This rapid growth in the cost of ransomware consequences warrants research into
ransomware detection and prevention techniques.

In the fight against ransomware, the growth of ransomware attack tactics is a major
source of concern. Modern ransomware attack techniques have proven to be resilient due to
their encryption and recovery-prevention techniques. The newer variants of ransomware use
hybrid cryptosystems in which the malware generates asymmetric sub-keys (RSA) and symmetric
keys (AES) on the host. The AES keys are used to encrypt the data, the sub-RSA key is used
to encrypt the AES keys, and the key embedded in the malware is used to encrypt the sub-RSA
key (Zimba and Chishimba, 2019). Furthermore, newer ransomware strains have evolved to the
extent of including recovery-prevention tactics such as erasing volume shadow copies or
overwriting original target files after encryption (Zimba and Chishimba, 2019). Most
mitigation frameworks have become obsolete as a result of these novel strategies, which have
sparked research attention.

Ransomware is a branch of crypto-virology that is constantly evolving. The mitigation
techniques used last year may be rendered redundant when they come face-to-face with newer
ransomware variants that employ newer and more sophisticated attack techniques. As such,
there is a need to devise a detection framework that addresses the attack techniques used in
the more recent ransomware variants.

The losses in data and finances have continued to rise, and the need to devise a
detection framework that can keep up with the newer strains of ransomware cannot be
overemphasized.

Because of the stealth and evasion strategies utilized in the latest strains of ransomware,
this study focuses on host-level detection. Network-level detection tools cannot directly
observe the actions of malicious software and must rely on the traffic that it generates.
Host-based malware detection techniques have the advantage of being able to view the entire
set of actions that a malware program performs, allowing harmful code to be identified
before it is run at all (Kolbitsch et al., 2009).

1.7 SUMMARY
This chapter introduced the project, titled “Host-based ransomware detection model with
machine learning”, and presented the background of the study. The problem statement,
project aim and objectives were explained, as well as the project justification.

CHAPTER 2 - LITERATURE REVIEW

2.1 INTRODUCTION
This chapter reviews ransomware detection models proposed by past researchers,
compares them, and outlines a proposed model. The literature was reviewed in
accordance with the objectives of the project.

2.2 RELATED LITERATURE


2.2.1 EVOLUTION OF ATTACK TECHNIQUES
The initial spike in ransomware development occurred in 2006-07, owing to the arrival
of GPCode variants (Hampton and Baig, 2015). The GPCode.ak variant was known to write
the encrypted file contents to a new location on the user's drive, destroying the unencrypted
user files in the process. Partial recovery of user data without paying the ransom to the
attacker was feasible with the use of an 'undeletion' tool. A detailed examination of the
evolution of multiple ransomware releases revealed that they were mainly copies of earlier
versions' code; the flaws in one version were therefore carried over into the next (Hampton
and Baig, 2015). The Reveton ransomware (Luo and Liao, 2009) operated in a peculiar
manner: it was discovered in 2015 to simply freeze the operating system's boot process
without encrypting user data.

Zimba and Chishimba (2019) proposed a novel categorization of ransomware evolution
based on the robust nature of the attack techniques. The authors also proposed an attack
model that demonstrates the third-generation ransomware attack technique. Both are
illustrated in the figures below.

Figure 1: Evolution of ransomware attack techniques (Zimba and Chishimba, 2019)

Figure 2: Novel attack model (Zimba and Chishimba, 2019)

Recovery prevention is a feature embraced by all ransomware families as they advance.
This is accomplished by either deleting important files, overwriting crucial files with random
data, or combining the two (Kharraz et al., 2015).

Figure 3: Recovery-prevention techniques (Kharraz et al., 2015)

2.3 REVIEW OF EXISTING MODELS
2.3.1 BRIEF ON MALWARE ANALYSIS
There are two types of malware analysis, namely static and dynamic analysis. Static
analysis examines a malware file without actually running the program, while dynamic
analysis involves executing the malware and examining its behavior on a particular device.
Static analysis for detection on a Windows operating system relies on extracting anomalies
in the code and resources embedded in a PE file structure (analogous to signature-based
detection). Although it is the safest way to analyze malware, this method is hampered by
obfuscation. Obfuscation is a technique that makes programs harder to understand: it
transforms a program into a different version that remains functionally equivalent to the
original. This technology was originally aimed at protecting the intellectual property of
software developers, but it has been broadly used by malware authors to elude detection (You
and Yim, 2010).

Dynamic analysis for detection involves extracting artifacts from the behavior of a
malware sample as it is executed. Detection techniques that build on the two types of
malware analysis fall broadly into two categories: deception-based methods and
behavior-based methods. Deception-based methods use decoy files to detect ransomware or
other malicious activity. Behavior-based methods monitor file-related operations to
determine whether a process is behaving abnormally (Canfora et al., 2014).

Our proposed work focuses on behavior-based detection techniques that utilize
machine learning operating on features derived from dynamic analysis. We also
consider approaches that machine learning-based detection systems have employed in recent
works.

2.3.2 DETECTION TECHNIQUES


A study by Kharaz et al. (2016) gave rise to the "UNVEIL" ransomware detection
framework. The report presented the findings of a long-term analysis of ransomware attacks
seen in the wild between 2006 and 2014. The study examined 1,359 malware samples from 15
different ransomware families. UNVEIL first creates a sandbox-type analysis environment
with bait files for ransomware to target. API hooking is then used to observe system
activity in this environment. The ransomware is deployed into the environment and monitored
for three conditions: several I/O requests linked to writing or deleting the bait files, a
considerable increase in the randomness (entropy) between read and write data buffers, or
the generation of new files with a high entropy signature.

Figure 4: I/O access monitor in UNVEIL (Kharraz et al., 2015)
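
The entropy condition can be illustrated with a short calculation. The sketch below is not UNVEIL's implementation; it only computes Shannon entropy over byte buffers and flags a write whose entropy exceeds that of the data read by some margin. The function names and the margin of 2.0 bits per byte are assumptions chosen for illustration.

```python
import math
import os
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Return the Shannon entropy of a buffer in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def entropy_jump(read_buf: bytes, write_buf: bytes, threshold: float = 2.0) -> bool:
    """Flag a write whose entropy exceeds the read entropy by `threshold` bits.

    The threshold is a hypothetical value used for illustration only.
    """
    return shannon_entropy(write_buf) - shannon_entropy(read_buf) > threshold


# Plain text stands in for a bait file; random bytes stand in for encrypted output.
plain = b"Quarterly report: revenue grew in all regions. " * 100
random_like = os.urandom(len(plain))

print(round(shannon_entropy(plain), 2))
print(round(shannon_entropy(random_like), 2))
print(entropy_jump(plain, random_like))
```

Natural-language text uses a small alphabet and therefore has low entropy per byte, while well-encrypted data is close to the 8-bit maximum, which is why a sharp rise between the read and write buffers is a useful signal.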

A study by Takanari Shigeta et al. (2016) found that the Locky and CryptoWall
ransomware families use Microsoft CryptoAPI or OpenSSL. This finding led the authors to
monitor API calls that relate to encryption as a method of attack detection. In this
technique, encryptors are detected when they attempt to start file encryption, and
prevention is achieved by having the operating system (OS) halt the API execution as soon
as detection occurs. To monitor API calls from the target software, the detection program
injects a DLL (dynamic-link library) file into each software process.

Figure 5: Shows the workflow of the API monitoring program (Honda et al., 2018)
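
The detection rule behind this approach can be sketched without any hooking machinery. The example below is an assumption-laden illustration, not the authors' program: it post-processes a recorded trace rather than intercepting calls through an injected DLL. The watched names are real CryptoAPI and OpenSSL functions, but the choice of which ones to watch, the trace format, and the helper name `first_crypto_call` are hypothetical.

```python
from typing import Iterable, Optional, Tuple

# Encryption-related functions to watch. The names are real CryptoAPI and
# OpenSSL functions, but this particular selection is an assumption.
WATCHED_APIS = frozenset({
    "CryptEncrypt", "CryptGenKey", "CryptImportKey",
    "EVP_EncryptInit_ex", "EVP_EncryptUpdate",
})


def first_crypto_call(trace: Iterable[Tuple[int, str]]) -> Optional[Tuple[int, str]]:
    """Return the first (pid, api_name) event that hits a watched API, if any."""
    for pid, api_name in trace:
        if api_name in WATCHED_APIS:
            return pid, api_name
    return None


# Hypothetical recorded trace of (process id, API name) events.
trace = [
    (1204, "CreateFileW"),
    (1204, "ReadFile"),
    (1204, "CryptGenKey"),
    (1204, "CryptEncrypt"),
    (1204, "WriteFile"),
]

hit = first_crypto_call(trace)
if hit is not None:
    print(f"process {hit[0]} called {hit[1]}: halt and alert")
```

A live monitor would apply the same membership test at the moment of the call, which is what allows the call to be halted before any file content is encrypted.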
Takeuchi, Sakai and Fukumoto (2018) proposed a ransomware detection scheme for
Microsoft Windows computers based on support vector machines (SVM). Using Cuckoo
Sandbox, they dynamically extracted features from ransomware API call sequences. The
sequences were represented as 2-gram count vectors, and when tested the framework achieved
a detection accuracy of 97%.
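
A 2-gram count vector records how often each pair of consecutive API calls occurs in a sample. The following is a minimal sketch of that representation only; the API sequences are invented, the helper names are hypothetical, and no SVM is trained here.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def ngram_counts(calls: Sequence[str], n: int = 2) -> Counter:
    """Count every run of n consecutive API calls in a sequence."""
    return Counter(tuple(calls[i:i + n]) for i in range(len(calls) - n + 1))


def to_vector(counts: Counter, vocabulary: List[Tuple[str, ...]]) -> List[int]:
    """Project n-gram counts onto a fixed, ordered vocabulary."""
    return [counts.get(gram, 0) for gram in vocabulary]


# Hypothetical API call sequences for two samples.
sample_a = ["FindFirstFileW", "CreateFileW", "ReadFile", "CryptEncrypt", "WriteFile",
            "CreateFileW", "ReadFile", "CryptEncrypt", "WriteFile"]
sample_b = ["CreateFileW", "ReadFile", "CloseHandle"]

counts_a = ngram_counts(sample_a)
counts_b = ngram_counts(sample_b)

# The vocabulary must be shared so that every sample maps to the same dimensions.
vocabulary = sorted(set(counts_a) | set(counts_b))
vector_a = to_vector(counts_a, vocabulary)
vector_b = to_vector(counts_b, vocabulary)

print(vector_a)
print(vector_b)
```

Vectors of this form, one per sample, are what a classifier such as an SVM would take as input.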

Kim (2018) used document classification techniques to assess the performance of a
machine learning strategy in malware detection. Batch script files were used to extract
Windows system calls as features, and 8-grams, 9-grams, and 10-grams were used to extract
feature information. Weights were assigned using TF-IDF (Term Frequency - Inverse Document
Frequency), and vectors were created using Euclidean normalization. Finally, tests were run
using SVM (Support Vector Machine) and SGD (Stochastic Gradient Descent), which achieved
an accuracy of 96%.
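
The weighting and normalization steps can be sketched as follows. This is an illustrative sketch rather than the cited author's code: it uses one common smoothed IDF variant, ln((1 + N) / (1 + df)) + 1, which is an assumption, and for brevity it treats single system calls as terms instead of 8- to 10-grams. The traces and the function name `tfidf_l2` are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def tfidf_l2(documents: List[List[str]]) -> List[Dict[str, float]]:
    """Return one L2-normalized TF-IDF weight mapping per document."""
    n_docs = len(documents)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    vectors = []
    for doc in documents:
        term_freq = Counter(doc)
        weights = {
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in term_freq.items()
        }
        # Euclidean (L2) normalization so that every vector has unit length.
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()} if norm else weights)
    return vectors


# Hypothetical system-call traces, one token list per sample.
docs = [
    ["NtCreateFile", "NtReadFile", "NtWriteFile", "NtWriteFile"],
    ["NtCreateFile", "NtReadFile", "NtClose"],
    ["NtOpenKey", "NtQueryValueKey", "NtClose"],
]

for vector in tfidf_l2(docs):
    print({term: round(weight, 3) for term, weight in vector.items()})
```

Terms that are frequent in one sample but rare across the collection receive the largest weights, and the normalization keeps long traces from dominating short ones.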

Sgandurra et al. (2016) created the EldeRan model: a novel real-time analysis and
detection system for ransomware. EldeRan runs ransomware in a sandbox environment while
monitoring registry and system functions. It then selects features using a mutual
information approach and uses a Regularized Logistic Regression classifier to separate
ransomware from benign files.

Alazab et al. (2010) proposed a malware detection method that extracts Windows APIs
from the assembly code of executable files. Using a Mutual Information (MI)–based
Maximum Relevance (MR) filter, their method selects important information from the
extracted Windows APIs. The selected information was applied to various machine learning
algorithms such as Naïve Bayes, Sequential Minimal Optimization (SMO), and K-Nearest
Neighbors (KNN), and their experimental results showed accuracies ranging from the mid
to the high nineties (percent).

In this paper, we made use of the dataset produced by the EldeRan research paper. The
proposed model focuses on API statistics; registry key delete, open, read, and write operations;
file delete, open, read, and write operations; directory created and enumerated operations; and
strings. Features such as file extensions, dropped files, and dropped file extensions were
excluded from our research scope due to insufficient information on them in the Cuckoo
Sandbox report. Our experimental samples are restricted to ransomware and benign samples.

2.4 COMPARISON OF MODELS
Table 1: Comparison of Systems

Framework    API Monitoring    Registry Key Monitoring    Directory Monitoring    File rwx Monitoring    Web Application
Unveil     
Takeuchi et al. , 2018     
Honda et al. , 2018     
Takanari et al. , 2016     
Kim, 2018     
Proposed model     

2.5 PROPOSED MODEL


The proposed model will address the deficiencies of the reviewed techniques, except for
user document classification, input-output monitoring and portability, since our
scope is the Microsoft Windows operating system. The system will capitalize on the benefits
of Cuckoo Sandbox, which provides API monitoring, network dumping, registry key operations
monitoring, and file and directory operation monitoring. The Cuckoo reports will be parsed for
features, which are then passed to the classifier to make the detection.

2.5.1 SYSTEM ARCHITECTURE


The figure below shows a flow chart that graphically illustrates the overview of the
proposed system.

Figure 6: Flow chart of the proposed framework.

2.6 SELECTED METHODOLOGY
Considering that the project can be broadly segmented into two parts (the machine
learning framework and the web application), two software development methodologies were
applied appropriately.

2.6.1 SELECTED METHODOLOGY (INCREMENTAL)


The incremental software methodology was used for the development of the system.
Incremental development depends on building an initial implementation, exposing it to
user comment, and evolving it through several versions until a satisfactory system
is built (Sommerville, 2011). The figure below shows the phases of incremental development.
Requirements are divided into modules, and each module goes through every stage of
the software development life cycle, i.e. design and development, testing, and implementation
(Alshamrani and Bahattab, 2015).

Figure 7: Incremental model (Sommerville, 2011).

2.6.1.1 PHASES OF INCREMENTAL MODEL


● Outline Description: This is where the main objectives of the system are outlined.
● Specification: In this phase, requirements such as an appropriately labelled dataset, new
ransomware variants, newer benign samples and other specifications are identified.
● Development: At this point, the design and development of the system functionality
will be done. This phase also involves the actual programming of the components,
guided by the design principles set out in the previous phase.
● Validation: In this phase, the performance of the existing and additional
functionalities will be checked to ensure that each functionality works as required.

2.6.1.2 JUSTIFICATION OF THE INCREMENTAL SOFTWARE DEVELOPMENT PROCESS

Because far less analysis and documentation has to be redone in the incremental method than in
the waterfall model, the cost of accommodating changes in requirements is greatly reduced. Once
development work on an increment has been completed, the incremental approach allows software
engineers to gather input from key stakeholders, and allows users of the system to assess how
much of the requirements have been implemented. The model also allows for rapid delivery of
useful software to the user, so the user benefits from the software sooner than would be
possible with a waterfall process (Sommerville, 2011).

2.6.2 WATERFALL MODEL FOR THE DEVELOPMENT OF THE SOFTWARE


The waterfall model, which was derived from a general system engineering procedure,
gained its name from its cascading structure (Sommerville, 2011). It is an example of a
plan-driven process in that every activity must be planned and scheduled before
embarking on the project (Sommerville, 2011). The stages of the waterfall model are represented
in the figure below.

Figure 8: Waterfall Model (Sommerville, 2011)

The waterfall approach should be utilized only when the requirements are thoroughly
comprehended and are unlikely to drastically change during the development process
(Sommerville, 2011). Below are some of the other conditions where the waterfall model is
recommended:

● Product definition is stable.
● Technology is understood.
● There are no ambiguous requirements.
● Sufficient resources with the required expertise are available.
● The project is short.

Waterfall Model Phases


● Requirement definition: The first phase is understanding what is to be developed,
i.e., input specifications, limitations, and the final product are thoroughly specified
through interaction with end users. The software requirements specification (SRS)
document is the end result of this process.
● System and software design: The SRS document is used as input in this phase to
accurately specify hardware and system requirements. During this phase, a clear system
architecture is established. This phase will also include coding.
● Unit testing and implementation: Taking the system designs as input, a divide-and-conquer
approach is used to produce small programs known as units. Each unit is checked to
ensure that it functions according to requirements.
● Integration and system testing: All the units developed in the implementation phase are
integrated and tested as a complete system to ensure that all requirements have been
met. After testing, the software system is delivered to the customer.
● Operation and maintenance: The system is installed and put into practical use.
Maintenance refers to correcting errors that were not discovered in earlier stages of
the life cycle.

The following are the advantages of the waterfall methodology (Kramer, 2018):

● The methodology is simple to understand and use.
● Works well for smaller projects where requirements are clearly defined and very well
understood.

The following are the disadvantages of the waterfall methodology (Kramer, 2018):

● Poor fault tolerance. The waterfall model does not facilitate backtracking: when an
error occurs in any stage, the process has to start all over again from requirement
specification.

2.7 TECHNOLOGIES AND FRAMEWORK TO BE USED


2.7.1 PYTHON PROGRAMMING LANGUAGE
Python is a high-level programming language that was
conceived in the late 1980s by Guido van Rossum in the Netherlands as a successor to the ABC
programming language (itself inspired by SETL), capable of exception handling and
interfacing with the Amoeba operating system (Van Rossum and others, 2007). Python is a
dynamically typed, interpreted language, whereas Java is statically typed and
compiled, which makes Java faster at runtime and easier to debug than Python (Khoirom et al., 2020).
However, Python's wealth of libraries has made it our choice of programming language for our
proposed framework.

2.7.2 PyCHARM IDE


PyCharm is an integrated development environment (IDE) for the Python programming
language created by JetBrains. The IDE provides code analysis, an integrated unit tester, a
graphical debugger, as well as support for web development with Django (PyCharm: the Python
IDE for Professional Developers by JetBrains, no date).

2.7.3 STREAMLIT
Streamlit is an open-source app framework for machine learning and data science. It
was founded by three industry veterans: a Zoox VP of engineering and founder of Eterna and
FoldIt; a tech lead manager for Google Hangouts web and a Google X AI project; and a Stanford
MBA who directed product and operations for numerous secretive Google X initiatives
(Streamlit).

2.7.4 JUPYTER NOTEBOOK


Jupyter Notebook is a web-based interactive development environment for notebooks,
code, and data. It provides a flexible interface that allows users to configure and arrange
workflows in data science, scientific computing, computational journalism, and machine
learning (Horton, 2020).

2.7.5 SCIKIT-LEARN
Scikit-learn is a free software machine learning library for the Python programming
language. It features various classification, regression and clustering algorithms such as
support-vector machines, random forests, gradient boosting etc. It is designed to work with the
Python numerical and scientific libraries such as NumPy and SciPy (scikit-learn: machine
learning in Python — scikit-learn 1.1.1 documentation, no date).

2.7.6 CLASSIFIERS
The proposed system will be developed based on six classification algorithms and the
best performing algorithm will be selected as our classifier model. The six classification
algorithms are Decision Tree Classifier, Random Forest Classifier, Gradient Boosting
Classifier, Ada Boost Classifier, Gaussian Naïve Bayes and Logistic Regression.

The selection of the classifier will be based on how well each performs on the following
metrics, where TP, FP and FN denote the numbers of true positives, false positives and
false negatives respectively:

● Accuracy: Accuracy is the proportion of correct results among the total number of cases
examined. This metric is suitable for binary as well as multi-class classification
problems. Formally, accuracy has the following formula:

$$\text{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}} \tag{1.0}$$

● Precision: also called the positive predictive value, is the fraction of relevant instances
among the retrieved instances.

$$\text{Precision} = \frac{TP}{TP + FP} \tag{1.1}$$

● Recall: also known as sensitivity, is the fraction of relevant instances that have been
retrieved over the total amount of relevant instances.

$$\text{Recall} = \frac{TP}{TP + FN} \tag{1.2}$$
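
As a worked illustration of these three metrics, the sketch below computes them from confusion counts using the standard definitions (recall divides by TP + FN, the number of actual positives). The counts are hypothetical and are not results from this project.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute accuracy, precision and recall from confusion-matrix counts.

    Assumes every denominator is non-zero.
    """
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,   # correct predictions over all predictions
        "precision": tp / (tp + fp),     # how many flagged samples were truly positive
        "recall": tp / (tp + fn),        # how many positive samples were flagged
    }


# Hypothetical counts: 100 ransomware and 100 benign samples.
metrics = classification_metrics(tp=90, fp=5, fn=10, tn=95)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```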

2.7.7 CUCKOO SANDBOX
Cuckoo Sandbox is an open-source tool that is used to launch malware in a secure and
isolated environment. The idea is to fool the malware into thinking it has infected a genuine
host, record the activity of the malware, and then generate a report on what the malware
attempted to do while in this secure environment (Jamalpur et al., 2018).
2.7.8 DATASET
The proposed model makes use of the EldeRan dataset that was created and made
accessible by Sgandurra et al. (2016). The dataset consists of 582 working samples of
ransomware belonging to 11 different classes and 942 samples of benign applications. The
dataset features include registry key operations, API statistics, strings, file extensions, file
operations, directory operations and dropped file extensions.

Figure 9: Overview of dataset.

Table 2: Technologies Used.

SYSTEM COMPONENTS | TECHNOLOGY USED
System Implementation | Python
Machine Learning Environment | Jupyter Notebook
Application Server | Streamlit
Design and Modeling | Visual Paradigm
IDE | PyCharm
Machine Learning Library | Scikit-Learn
Analysis Environment | Cuckoo Sandbox

2.8 SUMMARY
This chapter reviewed related literature, analyzed and evaluated existing host-based
ransomware detection frameworks, compared the existing frameworks, and
outlined in detail the technologies to be used in the proposed framework. The selected
development methodologies and the dataset used were also described.

CHAPTER 3 – SYSTEM ANALYSIS AND DESIGN

3.1 INTRODUCTION
This chapter introduces the design and analysis considerations of the developed
framework. System design and analysis are important phases in development as they provide
an avenue for solutions in the system through the various tasks involved in doing the analysis
as well as the design (Phillips and Nagle, 1985). This chapter additionally presents
graphical blueprints of the developed system that form the basis of the system's
structure.

3.2 SYSTEM ANALYSIS


System analysis is the process of gathering and evaluating information obtained from
the different tasks carried out by the system as well as the internal connections of the system
(Satzinger, Jackson and Burd, 2015). It is during this phase that research in a specific field or
area is done, existing systems in the same area are explored, and improvements that could
be made to existing systems are documented to form the basis for the development of new
concepts for new and improved systems.

3.2.1 REQUIREMENTS GATHERING AND ANALYSIS


Requirement gathering and analysis is a process in which software engineers
work hand in hand with the end users of the system in order to analyze, refine, and
scrutinize the gathered requirements so as to come up with consistent and unambiguous
requirements (Sommerville, 2011). This phase reviews all requirements and may provide a
graphical view of the entire system.

3.2.1.1 HOST BASED RANSOMWARE DETECTION WITH ML


FUNCTIONAL REQUIREMENTS
● The system takes a PE file as input.
● The system submits the file to Cuckoo Sandbox for artifact extraction.
● The system parses the JSON report from Cuckoo Sandbox for features.
● The system matches the extracted features against the model's features.
● The system performs detection of the preprocessed sample against the model.
● The system maps the detection result to the corresponding label.
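
The report parsing, feature matching and label mapping requirements can be sketched as follows. This is a simplified illustration rather than the project's code: the report layout (`behavior.apistats` and `behavior.summary`), the `API:`-style feature names, the binary encoding and the 0/1 label mapping are all assumptions made for the example.

```python
import json

LABELS = {0: "benign", 1: "ransomware"}  # assumed label encoding


def extract_features(report: dict, model_features: list) -> list:
    """Build a 0/1 vector aligned to model_features from a Cuckoo-style report."""
    behavior = report.get("behavior", {})
    observed = set()
    # API names called by any monitored process (keyed by process id).
    for calls in behavior.get("apistats", {}).values():
        observed.update(f"API:{name}" for name in calls)
    # Registry, file and directory operations from the behavior summary.
    for operation, targets in behavior.get("summary", {}).items():
        observed.update(f"{operation}:{target}" for target in targets)
    # Features the model expects but the sample did not show are set to 0.
    return [1 if feature in observed else 0 for feature in model_features]


# Hypothetical, heavily trimmed report; a real one would be read with json.load().
report = json.loads(
    '{"behavior": {"apistats": {"2404": {"NtWriteFile": 12, "CryptEncrypt": 4}},'
    ' "summary": {"file_deleted": ["doc.txt"], "regkey_opened": ["Software/X"]}}}'
)
model_features = ["API:CryptEncrypt", "API:NtWriteFile",
                  "API:CreateWindowExW", "file_deleted:doc.txt"]

vector = extract_features(report, model_features)
print(vector, LABELS[1])
```

Aligning the vector to the saved feature list keeps the column order of a new sample identical to the order the classifier was trained on.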

NON-FUNCTIONAL REQUIREMENTS
● Performance: The system's performance is efficient when all dependencies are met.
● Reliability: The system is dependable when provided with the right data.
● Usability: The system is easy to use and does not need any programming skills.

3.3 SYSTEM DESIGN


This phase involves describing the requirements defined in the system analysis phase
logically and graphically. We use Object-Oriented design methodologies
to develop system designs that depict the overall system architecture in this phase (Giger and
Gall, 1996).

3.3.1 GENERAL OVERVIEW


The proposed system will have a web-based user interface which will be launched by
the user. Once it is running, the user uploads a file for detection. The file is immediately
submitted to Cuckoo Sandbox, which runs it and extracts artifacts from its processes. The Cuckoo
report is then parsed for sample behavior results, which are extracted as features. The parsed
sample is then passed to the classifier, which predicts whether the submitted file is
ransomware or benign. The flow chart below gives a graphical overview of the system.

Figure 10: System overview.

3.3.2 MODEL TRAINING
Determine feature and label sets
The figure below shows the first step in the model training process which involves
loading the dataset into a pandas data frame and splitting the data frame into feature-set and
output / label set.

Figure 11: Initial step in model training.

Optimizing feature-set
The figure below shows the steps taken to address the problem of high dimensionality. We
use the Extra Trees classifier as a dimensionality reduction technique (Cho and Kurup, 2011).
This reduces the number of features from 30,967 to x features, where x > 2,000. The product of
this process is an optimized feature-set and a features list, which is saved in a pickle file.

Figure 12: Optimize input set with Extra Tree Classifier.
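
A minimal sketch of this feature optimization step is given below, assuming scikit-learn's `ExtraTreesClassifier` combined with `SelectFromModel`. The data is synthetic, and the mean-importance threshold, the feature names and the `features.pkl` file name are assumptions; the report does not state the exact settings used.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel

rng = np.random.default_rng(0)
# Synthetic stand-in for the dataset: 200 samples, 50 binary features.
X = rng.integers(0, 2, size=(200, 50))
y = X[:, 0] | X[:, 1]                      # label depends on the first two features only
feature_names = [f"f{i}" for i in range(50)]

# Rank features by importance, then keep those above the mean importance.
forest = ExtraTreesClassifier(n_estimators=100, random_state=0).fit(X, y)
selector = SelectFromModel(forest, prefit=True)
X_reduced = selector.transform(X)
selected = [name for name, keep in zip(feature_names, selector.get_support()) if keep]

# Save the list of retained feature names for use at detection time.
features_path = os.path.join(tempfile.gettempdir(), "features.pkl")  # hypothetical path
with open(features_path, "wb") as fh:
    pickle.dump(selected, fh)

print(X.shape, "->", X_reduced.shape)
```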

Dataset Splitting
The figure below shows the dataset splitting phase in model training. The new feature
set is split in two: the train set and the test set. 80% of the data was used for training while the
remaining 20% was dedicated to testing the model.

Figure 13: Split dataset to train and test set.
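
An 80/20 split of this kind can be sketched with scikit-learn's `train_test_split`. The arrays below are placeholders, and the use of `stratify` and a fixed `random_state` are assumptions added so that both classes keep their proportions and the split is reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 100 samples with 2 features, 60 benign (0) and 40 ransomware (1).
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 60 + [1] * 40)

# test_size=0.2 reserves 20% of the samples for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(len(X_train), len(X_test))
```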

Train and save model


The figure below shows the final stage in the model training process. Each classifier is
trained using the training set and tested with the test set, and its accuracy score is saved in an
array. The process is repeated for all six classifiers, and the classifier with the highest accuracy
score is selected as the classifier for our model.

Figure 14: Model training.
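
The selection loop can be sketched as below. It uses a synthetic dataset from `make_classification` and default hyperparameters (apart from `max_iter` and `random_state`), which are assumptions for illustration and not the settings used in this project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the optimized feature-set.
X, y = make_classification(n_samples=300, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

classifiers = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
    "AdaBoost": AdaBoostClassifier(random_state=0),
    "Gaussian Naive Bayes": GaussianNB(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}

# Train each classifier, score it on the held-out test set, and keep the scores.
scores = {}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, clf.predict(X_test))

best_name = max(scores, key=scores.get)
best_model = classifiers[best_name]
print(best_name, scores[best_name])
```

Selecting on test-set accuracy alone mirrors the procedure described above; a separate validation set or cross-validation would give a less optimistic estimate of the chosen model's performance.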

3.3.3 WEB APPLICATION DESIGN
USE CASE DIAGRAM
This style of diagram depicts use cases, actors, and the interactions between them in
the form of an action and reaction behavior of the system from the user's perspective (Kumar
and Gupta, 2011). The diagram below shows the use case diagram, with the user as the primary
actor. The user first submits or uploads a file, and the extension of the file is verified before
the file is submitted to the sandbox for dynamic analysis. The user views updates from the
background processes, then finally views the detection results.

Figure 15: Use case diagram.

ACTIVITY DIAGRAM
This diagram graphically depicts the sequential flow of activities in a business process
or a use case, and it can also be used to describe actions that will be taken once an operation is
completed, as well as the outcomes of those actions.

Figure 16: Activity diagram.

3.4 CONCLUSION
The chapter presented an overview of the design and analytical considerations of the
developed system. The chapter additionally presented graphical blueprints of the developed
system that form the basis of the system's structure.

CHAPTER 4 – RESULTS ANALYSIS

4.1 INTRODUCTION
In order to determine the success of any research project, a series of tests have to be
done to determine its viability based on some performance measures. This chapter examines
the performance of the Host Based Ransomware Detection Model with Machine Learning on
five ransomware samples collected from https://github.com/ytisf/theZoo/tree/master/malware/Source/Original

4.2 ENVIRONMENT DESCRIPTION


The model was implemented using the Python programming language. The machine
learning model was trained and tested in Jupyter Notebook. PyCharm Community 2021.3.3 was
used as the IDE, from which the Streamlit server was launched to host the web application on a
computer with a 2.5 GHz Intel Core i5 CPU and 12 GB of RAM, running 64-bit Windows 10.
A minimum of 3 runs were performed for each sample. After testing the model, the results
showed that the framework was 80% successful in detecting the ransomware samples and 100%
successful in detecting benign samples. The following screenshots reveal the details:

4.3 UNIT TESTING


Unit testing is a common practice in which software developers write tests alongside
normal code (Daka and Fraser, 2014). Different programming languages use different
unit test frameworks; for example, Java uses JUnit and C# uses NUnit. The Python programming
language has an in-built unit testing framework called ‘unittest’ (unittest — Unit testing
framework — Python 3.10.4 documentation, no date).
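
For illustration, a minimal `unittest` test case might look like the sketch below. The `map_label` function is a hypothetical stand-in for one of the project's modules and is not taken from the actual code base.

```python
import unittest

LABELS = {0: "benign", 1: "ransomware"}  # assumed label encoding


def map_label(prediction: int) -> str:
    """Hypothetical unit under test: map a classifier output to its label."""
    return LABELS[prediction]


class TestLabelMapping(unittest.TestCase):
    def test_known_labels(self):
        self.assertEqual(map_label(0), "benign")
        self.assertEqual(map_label(1), "ransomware")

    def test_unknown_label(self):
        # An output outside the known classes should fail loudly.
        with self.assertRaises(KeyError):
            map_label(2)


# Run the test case programmatically and collect the result.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestLabelMapping)
result = unittest.TextTestRunner(verbosity=0).run(suite)
print(result.wasSuccessful())
```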

Test case 1: Model Training module


Expected output: Saved optimum features in pickle file, trained model

Figure 17: Unit testing of Model training module.

As seen in Figure 17 above, the test of the model training module took 146.75 seconds
and passed. At the end of the test we had a features pickle
file with the optimum features that were extracted from the dataset using the Extra Trees
classifier, and a trained and tested model with a full classification report.

Test case 2: Report Processing module


Expected output: Prepped csv of sample.

Figure 18: Unit testing for Report Processing module.

As seen from the test results shown in the figure above, the time it takes for the report
to be processed and a fully prepped sample to be generated is 161.145 seconds (approximately
2.7 minutes). This processing time is quite high and puts the whole process at a disadvantage
with regard to time.

Test case 3: Detection module


Expected output: Detection results.

Figure 19: Unit testing of the Detection module.

As seen from the test results shown in the figure above, the time it takes for the actual
detection is 0.195 seconds which is very fast and efficient.

A unit test for the web application module was not undertaken because the Streamlit
framework, which is used as our server, did not support performance testing at the
time of writing. Load testing was not considered because the system is not designed to be
accessed via the internet, and the only traffic received by the application
comes from Cuckoo Sandbox.

4.3.1 TEST PLANNING


Table 3: Test planning

Test Scenario | Steps | Expected Outcome | Actual Outcome | Pass / Fail
Train Model | Process dataset and train model | Trained model and features file | Trained model and features file | Pass
Report Processing | Process report from cuckoo as per dataset features | Prepped csv file of sample | Prepped csv file of sample | Pass
Perform detection | Process sample and make prediction with model | Detection result either ransomware or benign | Detection result: ransomware or benign | Pass

4.3.2 RESULTS OF MODEL TRAINING


Classification Report
The figure below shows the classification report of the trained model. The report shows
the accuracy scores of all the tested classifiers, precision, f1-score, recall and support of the
classifier with the highest accuracy score.

Figure 20: Classification report of the trained and tested Logistic Regression model.
CONFUSION MATRIX
The figure below shows the confusion matrix of our selected classifier. A confusion
matrix, also known as an error matrix, is a special table structure that permits visualization of
the performance of an algorithm in the field of machine learning, specifically the problem of
statistical classification (Luque et al., 2019).

Figure 21: Confusion matrix.

ROC Curve and AUC


The figure below shows the ROC (Receiver Operating Characteristic) curve and the AUC (Area
Under the Curve). ROC is a probability curve and AUC represents the degree of separability.
It tells how capable the model is of distinguishing between classes. The higher the AUC,
the better the model is at predicting 0 classes as 0 and 1 classes as 1 (Narkhede, 2021).

Figure 22: ROC curve and AUC.
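
The confusion matrix and AUC discussed above can be computed with scikit-learn as in the following sketch. The labels and scores are made-up values for illustration, not this project's results, and the 0.5 decision threshold is an assumption.

```python
from sklearn.metrics import confusion_matrix, roc_auc_score, roc_curve

# Hypothetical ground truth (1 = ransomware) and predicted probabilities.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8, 0.45, 0.7, 0.9, 0.6]
y_pred = [1 if s >= 0.5 else 0 for s in y_score]  # assumed 0.5 decision threshold

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# The ROC curve sweeps the threshold; AUC summarizes it as a single number.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
auc = roc_auc_score(y_true, y_score)

print(f"TN={tn} FP={fp} FN={fn} TP={tp} AUC={auc:.4f}")
```

An AUC of 0.5 corresponds to random guessing and 1.0 to perfect separation of the two classes.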


4.4 SYSTEM TESTING
System testing refers to the process of assessing the product with the goal of
discovering faults in it (Sawant, Bari and Chawan, 2012). It is a strategy aimed at
assessing the characteristics of a system and confirming that it meets quality requirements.

4.4.1 TEST PLANNING


● An .exe, .msi or .JSON file is uploaded through the web application input.
● The uploaded file is submitted, and analysis and detection commence.
● The detection results are viewed on the web interface.

4.4.2 USER INTERFACE TESTING


The figure below shows the home page for the host based ransomware detection
framework with machine learning system.

Figure 23: Host Based Ransomware Detection Web Application.

Figure 24 below shows the upload button working: an upload directory is opened
and a file is selected to be submitted. Figure 25 shows the size of the uploaded file and the
progress bar, which helps the user visualize how the process is going and how long the
background processes are taking to complete.

Figure 24: Upload directory of the web application.

Figure 25: Submitted sample being processed.

Figure 26 below shows the detection results on the web interface. Samples are
either benign or ransomware, so only one outcome is expected. As seen, the submitted
sample in the figure below is ransomware.

Figure 26: Results of submitted sample.

4.5 SUMMARY
This chapter outlined the necessary steps which were explored in the development and
implementation of the host based ransomware detection framework. The system was designed,
developed and deployed successfully meeting the aim and objectives set beforehand.

CHAPTER 5 – PROJECT MANAGEMENT

5.1 INTRODUCTION
A risk is defined as exposure to specific elements that pose a danger to accomplishing
a project's desired outcomes (Schwalbe, 2015). On this premise, risk is typically described in
software projects as the probability-weighted impact of an incident on a project. The process
of identifying and analyzing potential issues that could have a negative impact on significant
business endeavors or crucial projects in order to assist businesses in avoiding or mitigating
those risks is known as risk analysis (Schwalbe, 2015). The technique of predicting the most
realistic amount of effort (expressed in terms of person-hours) required to develop or sustain
software based on incomplete, ambiguous, and noisy data is known as effort costing in
software development (Schwalbe, 2015).

This chapter will present concepts of risk analysis and project management concerning
the proposed approach. Later on, the risk register will be presented. Calculations involving
effort costing will also be outlined. A clear structure of the development work schedule for the
proposed model will also be presented.

5.2 RISK AND QUALITY MANAGEMENT


As software projects are high-risk activities with variable performance results
(Bannerman, 2008), the need for risk management cannot be overstated. The success of
software projects is measured using the triple constraint model of project management depicted
in Figure 27 below (Van Wyngaard, Pretorius and Pretorius, 2012), with all three constraints
subject to risk. Risks must first be identified in order to be handled (Refsdal, Solhaug and
Stølen, 2015).

Figure 27: Triple Constraint model (Van Wyngaard, Pretorius and Pretorius, 2012)

Below are six main processes involved in software risk management (Schwalbe, 2015)
namely:

● Risk Management Planning: determining how to carry out risk management operations
for a project.
● Identifying Risks: identifying and recording the risks that may harm the project.
● Conducting Qualitative Risk Analysis: prioritizing risks for further
investigation by analyzing and combining their likelihood of occurrence and impact.
● Conducting Quantitative Risk Analysis: examining the impact of identified
risks on overall project objectives.
● Developing Risk Responses: choosing options and activities to improve opportunities and
mitigate threats to project objectives.
● Risk Monitoring and Control: implementing risk response plans, tracking
recognized risks, recognizing new risks, and assessing risk process efficacy throughout
the project.

5.3 EFFORT COSTING MODEL


The Constructive Cost Model (COCOMO) was established by Barry W. Boehm in
1981 and is a well-documented and widely recognized algorithmic model for effort estimation
(Boehm et al., 1995). The constructive cost model comes in three modes: organic, semi-detached,
and embedded. The size of the project is the primary parameter considered in
calculating effort. The project's size is expressed in terms of lines of code (LOC) or thousands
of lines of code (KLOC).

$$E = a(\mathrm{KLOC})^{b} \tag{1}$$

Where 'KLOC' is the size of the code (kilo-lines of code), 'E' is the software effort
computed in person-months and 'a' and 'b' are the COCOMO model parameters. The values of 'a'
and 'b' depend on the mode of the software project (Boehm et al., 1995). The three COCOMO
modes are described further below.

● Organic (2-50 KLOC): A project can be treated as an organic type if the project deals
with developing a well-understood program, the size of the development team is
reasonably small, and the team members are experienced in developing using
frameworks familiar to all team members (Boehm et al., 1995).
● Semi-detached (50-300 KLOC): A project can be considered as a semi-detached type
if the development team has a mixture of experienced and inexperienced staff (Boehm et
al., 1995).
● Embedded (> 300 KLOC): A development project can be regarded as an embedded
type if the system being developed is complex or the regulations on the operational
method exist and are stringent (Boehm et al., 1995).

5.4 EFFORT CALCULATIONS FOR PROJECT


The effort is defined as the time spent by workers on activities that contribute to the
development of the software product (Trendowicz and Jeffery, 2014). To complete a project,
it is necessary to establish how much staff time is required to build software products and
deliverables.

The effort is calculated using equation (1), and the duration of the project is calculated
using equation (2) below.

$$\text{duration in months} = c(E)^{d} \tag{2}$$

Where 'E' is the effort in person-months and 'c' and 'd' are COCOMO constants. The proposed
model falls under the organic type of project, because the expected size is less
than 50 KLOC. Table 4 shows the values of the constants a, b, c and d.

Table 4: COCOMO Constants

Mode | a | b | c | d
Organic | 2.4 | 1.05 | 2.5 | 0.38
Semi-detached | 3.0 | 1.12 | 2.5 | 0.35
Embedded | 3.6 | 1.20 | 2.5 | 0.32

The project program size is estimated at 2000 lines of code based on expert judgement.
Converting this into KLOC gives KLOC = 2000/1000 = 2. The effort is
then calculated as:

$$E = 2.4(2)^{1.05} = 4.97 \text{ person-months}$$

Project duration: Using equation (2), we can estimate the project duration.

$$\text{duration in months} = 2.5(4.97)^{0.38} = 4.6 \text{ months}$$
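
The calculation can be checked with a few lines of Python. This is only a sketch: the function name and dictionary layout are illustrative, and the constants are the basic COCOMO values listed in Table 4.

```python
# Basic COCOMO constants (a, b, c, d) per project mode, as listed in Table 4.
COCOMO = {
    "organic": (2.4, 1.05, 2.5, 0.38),
    "semi-detached": (3.0, 1.12, 2.5, 0.35),
    "embedded": (3.6, 1.20, 2.5, 0.32),
}


def basic_cocomo(kloc: float, mode: str = "organic") -> tuple:
    """Return (effort in person-months, duration in months) for a project size in KLOC."""
    a, b, c, d = COCOMO[mode]
    effort = a * kloc ** b        # effort = a * KLOC^b
    duration = c * effort ** d    # duration = c * effort^d
    return effort, duration


effort, duration = basic_cocomo(2000 / 1000)   # 2000 LOC = 2 KLOC
print(f"{effort:.2f} person-months, {duration:.1f} months")
```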

The figures below show the cost of the project based on COCOMO.

Figure 28: Shows the count-total for the proposed model

Figure 29: Shows the Complexity of Weighting Factors for the proposed model.

Figure 30: Shows the LOC for the proposed model.

Figure 31: Shows the effort and duration for the proposed model.

5.7 SCHEDULING AND WORK PLAN
The figures below show the Gantt chart of the software development plan. It graphically
illustrates the schedule of the different components in the software development phase of our
proposed model.

Figure 32: Gantt chart view of software development plan.

Figure 33: Schedule for software development.


5.8 SUMMARY
The chapter presented project risk and quality management. Effort costing for the
proposed model was calculated using basic COCOMO. Finally, the schedule
of works was presented using a Gantt chart.

CHAPTER 6 – CRITICAL EVALUATION

6.1 INTRODUCTION
This chapter outlines the reason for undertaking the project and the lessons learnt throughout
the development of the software. The chapter goes on to describe the challenges encountered
during the development of the system, as well as future work.

6.2 REASON FOR UNDERTAKING THE PROJECT


Ransomware was once an issue that was only seen in movies and fiction media, but not
anymore. Not only corporations and large enterprises have fallen victim to ransomware attacks
but regular individuals as well. Freely available protection tools have proved futile as the
frequency of ransomware attacks keeps rising with time. Ransomware groups have become
one of today's greatest security threats globally, targeting institutions in different industries
such as health, finance and government. This project was undertaken to contribute to the fight
against this global security threat. Just as the manufacturers of shields consistently strive
to provide protection against the strongest ammunition, so does this project aim to provide
accurate and timely ransomware detection so as to help end users and the computer security
community efficiently guard against ransomware attacks.

6.3 MAIN LEARNING OUTCOME


The development of the system has taught me various lessons. The lessons learnt are listed
below:

• I have gained significant knowledge of computer security concepts.
• I have gained significant knowledge of crypto-virology.
• I have gained significant knowledge of conventional machine learning concepts and
techniques.
• I have increased my knowledge of the Python programming language.
• I have learnt that machine learning can be used to improve malware detection
mechanisms.
• I have learnt that the most vital element of conventional supervised machine learning
is the availability of a dataset that captures the features of interest.
• I have learnt the relevance of software development concepts and practices.
• I have understood the relevance of a development community and supervisor guidance
with regard to projects.

6.4 CHALLENGES ENCOUNTERED


During the development of the system, several challenges were faced. Before the
commencement of this project, I had no knowledge of machine learning; the paradigm was
new and daunting to me, and progress was slow because I found the learning curve quite
steep. The greatest challenge was finding an up-to-date open source dataset that captures
ransomware behavior on a Microsoft Windows system. The search was unsuccessful: the only
viable dataset found was the one produced for the EldeRan research paper, which covers
ransomware only up to 2016. Understanding the JSON file structure in order to perform
feature extraction with Python modules such as Scikit-Learn, pandas and numpy was also very
challenging; the complexity of the code was high and optimizing it proved difficult.
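To make the JSON feature extraction step more concrete, the following is a minimal sketch that counts API calls in a dynamic analysis report and turns them into a fixed-length feature vector. The nested layout (behavior, processes, calls, api), the function names and the sample data are all assumptions for illustration, loosely modelled on sandbox-style behavior reports; this is not the code of the report processing module.

```python
import json
from collections import Counter


def api_call_counts(report):
    """Count how often each API name appears across all processes in a report."""
    counts = Counter()
    # Assumed layout: report["behavior"]["processes"][i]["calls"][j]["api"]
    for process in report.get("behavior", {}).get("processes", []):
        for call in process.get("calls", []):
            api = call.get("api")
            if api:
                counts[api] += 1
    return counts


def to_feature_vector(counts, vocabulary):
    """Order the counts by a fixed vocabulary so every file yields the same columns."""
    return [counts.get(name, 0) for name in vocabulary]


# Tiny hand-written report standing in for real sandbox output (hypothetical data).
sample = json.loads("""
{"behavior": {"processes": [
    {"calls": [{"api": "CreateFileW"}, {"api": "WriteFile"}, {"api": "WriteFile"}]},
    {"calls": [{"api": "CryptEncrypt"}]}
]}}
""")

# Hypothetical vocabulary of API names used as feature columns.
vocabulary = ["CreateFileW", "WriteFile", "CryptEncrypt", "RegSetValueExW"]
vector = to_feature_vector(api_call_counts(sample), vocabulary)
print(vector)  # [1, 2, 1, 0]
```

Using `.get()` with defaults at every level keeps the extraction from failing on reports that lack a section, which is one simple way to reduce the amount of special-case code.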

6.5 FUTURE WORK


A future version of this desktop application should accept more file extensions, such as
zip, iso and bat. It should also analyze a file faster than the current version, which takes
2 minutes. Future work on the framework should address portability by including features
from different operating systems. Future versions should also explore features such as user
document classification and I/O access monitoring. The current system can only distinguish
ransomware from benign software; future versions should implement multiclass classification
to determine whether a file is ransomware, benign or a virus. A future version of this
framework should also display the confidence score of the result, i.e. how confident it is that
a file is benign, ransomware or a virus.
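As a rough sketch of the proposed confidence display, the example below picks the most probable of three classes and formats the result for the user. The function name, class labels and probabilities are hypothetical. In scikit-learn, classifiers that implement `predict_proba` return one probability per class, ordered as in the estimator's `classes_` attribute; whether the trained model in this project exposes that method is an assumption.

```python
def describe_prediction(labels, probabilities):
    """Return the most likely label and a message that includes its confidence."""
    if len(labels) != len(probabilities):
        raise ValueError("labels and probabilities must have the same length")
    # Index of the class with the highest probability.
    best = max(range(len(labels)), key=lambda i: probabilities[i])
    message = f"{labels[best]} ({probabilities[best]:.1%} confidence)"
    return labels[best], message


# Hypothetical class order and probabilities for a single analyzed file.
labels = ["benign", "ransomware", "virus"]
probabilities = [0.08, 0.87, 0.05]

label, message = describe_prediction(labels, probabilities)
print(message)  # ransomware (87.0% confidence)
```

Showing the probability alongside the label lets the user treat a borderline result differently from a near-certain one.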

6.6 CONCLUSION
This chapter presented the reasons why the project was undertaken and the results from
the developer's perspective. It outlined the difficulties that were faced during development
and proposed future work on the system.

CHAPTER 7 – CONCLUSION

7.1 INTRODUCTION
Chapter one gave a brief introduction to ransomware. The chapter elaborated on the
need for a more efficient ransomware detection framework for the Microsoft Windows
operating system. Additionally, the chapter covered the problem statement, aim, objectives,
scope, and justification of the proposed project.

Chapter two presented a review of literature in the field of the proposed system; brief
descriptions of concepts as well as past related works were brought forward. The chapter
outlined in detail the selected development methodologies. The incremental development
methodology was selected because it can accommodate changes in the requirements while the
development of the system is in progress. Furthermore, incremental models provide the
capability to test and debug during model iterations. The waterfall model was chosen for the
development of the web application because the user requirements were very simple and
readily understood. The chapter went on to look at the technologies and frameworks to be
used for the development of the proposed system. PyCharm Community Edition was chosen
as the environment in which to develop the proposed system, and Jupyter Notebook was used
as our machine learning model training environment.

Chapter three gave an overview of system analysis and design and stated the functional
and nonfunctional requirements. The chapter also included the UML diagrams of the two
modules of this project.

Chapter four showed the results of the development process. Unit tests and system tests
were conducted and their results shown. The chapter also outlined the results of our machine
learning model training.

Chapter five began by defining risk management, which involves identifying and
analyzing risk factors. The various risks that could possibly influence the proposed system
were likewise analyzed, and these include: failure to complete the system as expected,
ambiguity in requirements, loss of information, inability to implement nonfunctional
requirements, and unrealistic duration estimates. The chapter additionally showed effort
costing computations using an online COCOMO calculator and the scheduling for the
proposed system.

Chapter six presented the reasons why the project was undertaken and the results from
the developer's perspective. The chapter also brought to light the challenges that were faced
during the life of this project. Future work on the system was also proposed.

7.2 RESEARCH CONTRIBUTIONS


This research brought to light a critical requirement for malware detection on the
Microsoft Windows operating system: an up-to-date open source dataset that captures
behavioral features. The findings from this research show that machine learning is indeed a
capable tool for ransomware detection and prevention. This research also highlighted the
high complexity of detection techniques based on dynamic features.

REFERENCES

Alazab, M. et al. (2010) ‘Zero-day malware detection based on supervised learning
algorithms of API call signatures’.

Alshamrani, A. and Bahattab, A. (2015) ‘A comparison between three SDLC models waterfall
model, spiral model, and Incremental/Iterative model’, International Journal of Computer
Science Issues (IJCSI), 12(1), p. 106.

Bannerman, P.L. (2008) ‘Risk and risk management in software projects: A reassessment’,
Journal of systems and software, 81(12), pp. 2118–2133.

Boehm, B. et al. (1995) ‘Cost models for future software life cycle processes: COCOMO 2.0’,
Annals of software engineering, 1(1), pp. 57–94.

Canfora, G. et al. (2014) ‘Metamorphic malware detection using code metrics’, Information
Security Journal: A Global Perspective, 23(3), pp. 57–67.

Cho, J.H. and Kurup, P.U. (2011) ‘Decision tree approach for classification and dimensionality
reduction of electronic nose data’, Sensors and Actuators B: Chemical, 160(1), pp. 542–548.
doi:10.1016/j.snb.2011.08.027.

Casen, M., Li, F. and Williams, D. (2021) ‘Friend or Foe: An Investigation into Recipient
Identification of SMS-Based Phishing’, in Furnell, S. and Clarke, N. (eds) Human Aspects of
Information Security and Assurance. Cham: Springer International Publishing (IFIP Advances
in Information and Communication Technology), pp. 148–163. doi:10.1007/978-3-030-
81111-2_13.

Computer operating systems market share 2012-2021 | Statista (no date). Available at:
https://www.statista.com/statistics/268237/global-market-share-held-by-operating-systems-
since-2009/ (Accessed: 25 November 2021).

Daka, E. and Fraser, G. (2014) ‘A survey on unit testing practices and problems’, in 2014 IEEE
25th International Symposium on Software Reliability Engineering. IEEE, pp. 201–211.

Giger, E. and Gall, H. (1996) ‘Object-oriented design heuristics’.

Geri, B.N., Jota, N. and Avert, M. (2006) ‘The emergence of ransomware’, AVAR, Auckland
[Preprint].

Hampton, N. and Baig, Z.A. (2015) ‘Ransomware: Emergence of the cyber-extortion menace’.

Hernandez-Castro, J., Cartwright, A. and Cartwright, E. (2020) ‘An economic analysis of
ransomware and its welfare consequences’, Royal Society open science, 7(3), p. 190023.

Horton, W. (2020) A Brief History of Jupyter Notebooks. EuroPython.

Human, M. et al. (2021) ‘Internet of things and ransomware: Evolution, mitigation and
prevention’, Egyptian Informatics Journal, 22(1), pp. 105–117. doi:10.1016/j.eij.2020.05.003.

Jamalpur, S. et al. (2018) ‘Dynamic malware analysis using cuckoo sandbox’, in 2018 Second
international conference on inventive communication and computational technologies
(ICICCT). IEEE, pp. 1056–1060.

Kharaz, A. et al. (2016) ‘UNVEIL: A large-scale, automated approach to detecting
ransomware’, in 25th USENIX Security Symposium (USENIX Security 16), pp. 757–772.

Kharraz, A. et al. (2015) ‘Cutting the gordian knot: A look under the hood of ransomware
attacks’, in International Conference on Detection of Intrusions and Malware, and
Vulnerability Assessment. Springer, pp. 3–24.

Khoirom, S. et al. (2020) ‘Comparative analysis of Python and Java for beginners’, Int. Res. J.
Eng. Technol, 7(8), pp. 4384–4407.

Kim, C.W. (2018) ‘Ntmaldetect: A machine learning approach to malware detection using
native api system calls’, arXiv preprint arXiv:1802.05412 [Preprint].

Kolbitsch, C. et al. (2009) ‘Effective and efficient malware detection at the end host.’, in
USENIX security symposium, pp. 351–366.

Kramer, M. (2018) ‘Best practices in systems development lifecycle: An analyses based on the
waterfall model’, Review of Business & Finance Studies, 9(1), pp. 77–84.

Kumar, R. and Gupta, D. (2011) ‘Object oriented design heuristics’, International Journal of
Engineering Science and Technology (IJEST), 3(1), pp. 459–463.

Luo, X. and Liao, Q. (2009) ‘Ransomware: A new cyber hijacking threat to enterprises’.

Luque, A. et al. (2019) ‘The impact of class imbalance in classification performance metrics
based on the binary confusion matrix’, Pattern Recognition, 91, pp. 216–231.

Narkhede, S. (2021) Understanding AUC - ROC Curve, Medium. Available at:
https://towardsdatascience.com/understanding-auc-roc-curve-68b2303cc9c5 (Accessed: 4
June 2022).

O’Kane, P., Seer, S. and Carlin, D. (2018) ‘Evolution of ransomware’, IET Networks, 7(5), pp.
321–327.

Phillips, C.R. and Nagle, N.T. (1985) ‘Digital control system analysis and design’, IEEE
Transactions on Systems, Man, and Cybernetics, SMC-15(3), pp. 452–453.
doi:10.1109/TSMC.1985.6313385.

PyCharm: the Python IDE for Professional Developers by JetBrains (no date). Available at:
https://www.jetbrains.com/pycharm/ (Accessed: 2 June 2022).

Ransomware Attackers Get Short Shrift From Zambian Central Bank - Bloomberg (no date).
Available at: https://www.bloomberg.com/news/articles/2022-05-18/ransomware-attackers-
get-short-shrift-from-zambian-central-bank (Accessed: 4 June 2022).

Refsdal, A., Solhaug, B. and Stølen, K. (2015) ‘Cyber-risk management’, in Cyber-Risk
Management. Springer, pp. 33–47.

Satzinger, J.W., Jackson, R.B. and Burd, S.D. (2015) Systems analysis and design in a
changing world. Cengage learning.

Savage, K., Cogan, P. and Lau, H. (2015) ‘The evolution of ransomware’, Symantec, Mountain
View [Preprint].

Sawant, A.A., Bari, P.H. and Chawan, P. (2012) ‘Software testing techniques and strategies’,
International Journal of Engineering Research and Applications (IJERA), 2(3), pp. 980–986.

Schwalbe, K. (2015) Information technology project management. Cengage Learning.

scikit-learn: machine learning in Python — scikit-learn 1.1.1 documentation (no date).
Available at: https://scikit-learn.org/stable/ (Accessed: 2 June 2022).

Sgandurra, D. et al. (2016) ‘Automated Dynamic Analysis of Ransomware: Benefits,
Limitations and use for Detection’, arXiv preprint arXiv:1609.03020 [Preprint].

Sommerville, I. (2011) ‘Software engineering 9th Edition’, ISBN-10, 137035152, p. 18.

Sophos (2021) ‘State of ransomware’. Available at:
https://secure2.sophos.com/en-us/medialibrary/pdfs/whitepaper/sophos-state-of-ransomware-2021-wp.pdf.

Streamlit (no date) Triplebyte. Available at: https://triplebyte.com/company/public/streamlit
(Accessed: 2 June 2022).

Takanari Shigeta et al. (2016) ‘Encryption Processing of Ransomware’.

Takeuchi, Y., Sakai, K. and Fukumoto, S. (2018) ‘Detecting ransomware using support vector
machines’, in Proceedings of the 47th International Conference on Parallel Processing
Companion, pp. 1–6.

Trendowicz, A. and Jeffery, R. (2014) ‘Software project effort estimation’, Foundations and
Best Practice Guidelines for Success, Constructive Cost Model–COCOMO pags, 12, pp. 277–
293.

unittest — Unit testing framework — Python 3.10.4 documentation (no date). Available at:
https://docs.python.org/3/library/unittest.html (Accessed: 4 June 2022).

Van Rossum, G. and others (2007) ‘Python Programming language.’, in USENIX annual
technical conference, pp. 1–36.

Van Wyngaard, C.J., Pretorius, J.-H.C. and Pretorius, L. (2012) ‘Theory of the triple
constraint—A conceptual review’, in 2012 IEEE International Conference on Industrial
Engineering and Engineering Management. IEEE, pp. 1991–1997.

You, I. and Yim, K. (2010) ‘Malware obfuscation techniques: A brief survey’, in 2010
International conference on broadband, wireless computing, communication and applications.
IEEE, pp. 297–300.

Young, A. and Yung, M. (1996) ‘Cryptovirology: Extortion-based security threats and
countermeasures’, in Proceedings 1996 IEEE Symposium on Security and Privacy. IEEE, pp.
129–140.

Zimba, A. and Chishimba, M. (2019) ‘Understanding the evolution of ransomware: paradigm
shifts in attack structures’, International Journal of computer network and information
security, 11(1), p. 26.

APPENDICES
1. Model Training

Figure 34: Model training code snippet.

2. Report Processing

Figure 35: Report processing module code snippet.
