
MACHINE LEARNING ALGORITHMS-BASED

PREDICTION OF BOTNET ATTACKS FROM IOT DEVICES

ABSTRACT

The number of Internet of Things (IoT) devices connected to networks is increasing rapidly these days, and with the advancement of technology, security threats and cyberattacks such as botnets are emerging and evolving into high-risk attacks. The IoT-based botnet attack is one of the most prevalent attacks; it spreads faster and creates more impact than other attacks. In recent years, several works have been conducted to detect and avoid this kind of attack using novel approaches. Hence, a plethora of relevant models and methods have been introduced over the past few years, with a reasonable number of studies reported in the research domain. Many studies attempt to protect the IoT environment against these botnet attacks. However, many gaps remain in developing an effective detection mechanism. These attacks disrupt the IoT transition by disrupting networks and services for IoT devices. Many recent studies have proposed machine learning (ML) and deep learning (DL) techniques for detecting and classifying botnet attacks in the IoT environment. This work proposes machine learning methods for binary classification, i.e., benign traffic versus TCP attack. A complete machine learning pipeline is proposed, including exploratory data analysis, which provides detailed insights into the data, followed by preprocessing, during which the data passes through several fundamental steps. Random forest, k-nearest neighbour, support vector machine, and logistic regression models are proposed, trained, tested, and evaluated on the dataset. In addition to model accuracy, the F1-score, recall, and precision are also considered.
CHAPTER 1

INTRODUCTION

1.1 Overview

The general idea of the Internet of Things (IoT) is to allow communication between humans and things or between things themselves. Things denote sensors or devices, whilst a human or an object is an entity that can request or deliver a service [1]. The interconnection amongst these entities is always complex. IoT is broadly accepted and implemented in various domains, such as healthcare, smart homes, and agriculture. However, IoT operates in resource-constrained and heterogeneous environments, with low computational power and memory. These constraints create problems in providing and implementing security solutions on IoT devices, and they further escalate the existing challenges for the IoT environment. Therefore, various kinds of attacks are possible due to the vulnerability of IoT devices.

The IoT-based botnet attack is one of the most prevalent attacks; it spreads faster and creates more impact than other attacks. In recent years, several works have been conducted to detect and avoid this kind of attack [2]–[3] using novel approaches. Hence, a plethora of relevant models and methods have been introduced over the past few years, with a reasonable number of studies reported in the research domain.

Many studies attempt to protect the IoT environment against these botnet attacks. However, many gaps remain in developing an effective detection mechanism. An intrusion detection system (IDS) is one of the most efficient ways to deal with attacks. However, traditional IDSs often cannot be deployed in IoT environments due to the resource constraints of these devices. Complex cryptographic mechanisms cannot be embedded in many IoT devices for the same reason. There are mainly two kinds of IDSs: anomaly-based and misuse-based approaches. The misuse-based approach, also called the signature-based approach, relies on attack signatures and is found in most public IDSs, notably Suricata [4]. However, an attacker can easily circumvent signature-based approaches, and these mechanisms cannot guarantee detection of unknown attacks or variants of known attacks. Anomaly-based systems are built on normal data and can help identify unknown attacks. However, the heterogeneous nature of IoT devices makes it difficult to collect common normal data. Machine learning-based detection can detect not only known attacks but also their variants. Therefore, we propose a machine learning-based botnet attack detection architecture. We also adopt a feature selection method to reduce the processing resources required to run the detection system on resource-constrained devices. The experimental results indicate that the detection accuracy of our proposed system is high enough to detect botnet attacks. Moreover, the system can be extended to detect new, distinct kinds of attacks.

1.2. Challenging Issues

Traditional attack detection systems cannot simply be relocated to IoT environments because of the different nature of IoT devices and because the architecture of the underlying network differs from that of conventional networks. Additionally, the possible attacks can be distinct from the attacks found on traditional network devices. Heavyweight encryption methods cannot be deployed on these resource-constrained devices. On the other hand, IoT devices have become very cheap to set up for personal use, such as in small businesses and smart home appliances. Attackers launch attacks on victim nodes after infecting these devices with bots. They can also circumvent formal rule-based detection systems. Although a machine learning-based system can detect variants of many kinds of attacks, new and distinct kinds of attacks can still be launched. Additionally, the complex processing of ML classifiers makes it challenging to implement a lightweight attack detection system on resource-constrained devices.

1.3. Our Contributions

In this study, our main contributions are as follows.

(1) A botnet attack detection framework with a sequential architecture based on machine learning (ML) algorithms is proposed for dealing with attacks in IoT environments.

(2) A correlated-feature selection approach is adopted to reduce irrelevant features, which makes the system lightweight.

(3) In our proposal, classifiers based on different ML algorithms may be applied in different attack detection sub-engines, which leads to better detection performance, shorter processing times, and a lightweight implementation.
CHAPTER 2

LITERATURE SURVEY
Soe et al. [5] adopted a lightweight detection system with high performance. The overall detection performance reaches around 99% for botnet attack detection using three different ML algorithms: artificial neural network (ANN), J48 decision tree, and Naïve Bayes. The experimental results indicated that the proposed architecture can effectively detect botnet-based attacks and can also be extended with corresponding sub-engines for new kinds of attacks.

Ali et al. [6] outlined the contributions, datasets, network forensic methods, and research focus of the selected primary studies, along with their demographic characteristics. The review revealed that research in this domain gained momentum, particularly in the last three years (2018-2020). Nine key contributions were also identified, with Evaluation, System, and Model being the most common.

Irfan et al. [7] classified incoming IoT data as containing malware or not. Because the datasets contain imbalanced classes, this work undersampled the data and then classified the samples using Random Forest, with Naive Bayes, K-Nearest Neighbor, and Decision Tree used for comparison. The dataset used in this research comes from the UCI Machine Learning Repository website and captures traffic from IoT devices both in normal conditions and when attacked by Mirai or Bashlite.

Shah et al. [8] presented a concept called 'login puzzle' to prevent large-scale capture of IoT devices. The login puzzle is a variant of the client puzzle, which presents a puzzle to the remote device during the login process to prevent unrestricted login attempts. A login puzzle is a set of multiple mini puzzles of variable complexity that the remote device is required to solve before logging into any IoT device. Every unsuccessful login attempt increases the complexity of solving the login puzzle for the next attempt. This paper introduced a novel mechanism to change the complexity of the puzzle after every unsuccessful login attempt. If each IoT device had used the login puzzle, the Mirai attack would have required almost two months to acquire devices, whereas it actually acquired them in 20 hours.

Tzagkarakis et al. [9] presented an IoT botnet attack detection method based on a sparse representation framework, using a reconstruction-error thresholding rule to identify malicious network traffic at the IoT edge coming from compromised IoT devices. Botnet attack detection is performed using only small-sized benign IoT network traffic data, with no prior knowledge about malicious IoT traffic. The results, reported on a real IoT-based network dataset, show the efficacy of the technique against a reconstruction-error-based autoencoder approach.

Meidan et al. [10] proposed a novel network-based anomaly detection method for the IoT, called N-BaIoT, that extracts behavior snapshots of the network and uses deep autoencoders to detect anomalous network traffic from compromised IoT devices. To evaluate the method, the authors infected nine commercial IoT devices in their lab with two widely known IoT-based botnets, Mirai and BASHLITE. The evaluation results demonstrated the proposed method's ability to detect the attacks accurately and instantly as they were being launched from the compromised IoT devices that were part of a botnet.

Popoola et al. [11] proposed the federated DL (FDL) method for zero-day botnet attack
detection to avoid data privacy leakage in IoT-edge devices. In this method, an optimal deep
neural network (DNN) architecture is employed for network traffic classification. A model
parameter server remotely coordinates the independent training of the DNN models in
multiple IoT-edge devices, while the federated averaging (FedAvg) algorithm is used to
aggregate local model updates. A global DNN model is produced after several
communication rounds between the model parameter server and the IoT-edge devices. The
zero-day botnet attack scenarios in IoT-edge devices are simulated with the Bot-IoT and N-
BaIoT data sets.

Hussain et al. [12] produced a generic scanning and DDoS attack dataset by generating 33 types of scans and 60 types of DDoS attacks. In addition, this work partially integrated the scan and DDoS attack samples from three publicly available datasets for maximum attack coverage, to better train the machine learning algorithms. Afterwards, this work proposed a two-fold machine learning approach to prevent and detect IoT botnet attacks: in the first fold, a state-of-the-art deep learning model, ResNet-18, was trained to detect scanning activity in the premature attack stage and thus prevent IoT botnet attacks, while in the second fold another ResNet-18 model was trained for DDoS attack identification to detect IoT botnet attacks.
Abu et al. [13] proposed an ensemble learning model for botnet attack detection in IoT networks, called ELBA-IoT, that profiles the behavior features of IoT networks and uses ensemble learning to identify anomalous network traffic from compromised IoT devices. This IoT-based botnet detection approach also evaluates three different machine learning techniques belonging to the decision tree family (AdaBoosted, RUSBoosted, and bagged trees). To evaluate ELBA-IoT, the authors used the N-BaIoT-2021 dataset, which comprises records of both normal IoT network traffic and botnet attack traffic from infected IoT devices.

Alharbi et al. [14] proposed using a Gaussian distribution in population initialization. Furthermore, the local search mechanism was guided by the Gaussian density function and a local-global best function to achieve better exploration during each generation. The enhanced BA was further employed for neural network hyperparameter tuning and weight optimization to classify ten different botnet attacks plus one additional benign target class. The proposed LGBA-NN algorithm was tested on the N-BaIoT dataset, which contains extensive real traffic data with benign and malicious target classes. The performance of LGBA-NN was compared with several recent advanced approaches, such as weight optimization using Particle Swarm Optimization (PSO-NN) and BA-NN.

Ahmed et al. [15] proposed a model for detecting botnets using deep learning to identify zero-day botnet attacks in real time. The proposed model is trained and evaluated on the CTU-13 dataset with multiple neural network designs and hidden layers. The results demonstrated that the deep-learning artificial neural network model can accurately and efficiently identify botnets.
CHAPTER 3

EXISTING SYSTEM

3.1 KNN Classification

K-Nearest Neighbour (K-NN) is one of the simplest machine learning algorithms, based on the supervised learning technique. The K-NN algorithm assumes similarity between the new case/data and the available cases and puts the new case into the category that is most similar to the available categories. It stores all the available data and classifies a new data point based on similarity. This means that when new data appears, it can easily be classified into a well-suited category using the K-NN algorithm. K-NN can be used for regression as well as classification, but it is mostly used for classification problems. It is a non-parametric algorithm, which means it makes no assumptions about the underlying data. It is also called a lazy learner algorithm because it does not learn from the training set immediately; instead, it stores the dataset and performs an action on it at classification time. At the training phase, the KNN algorithm just stores the dataset, and when it gets new data, it classifies that data into the category that is most similar to the new data.

Why do we need a K-NN Algorithm?

Suppose there are two categories, i.e., Category A and Category B, and we have a new data point x1. In which of these categories will this data point lie? To solve this type of problem, we need a K-NN algorithm. With the help of K-NN, we can easily identify the category or class of a particular data point. Consider the diagram below:
Fig. 3.1: KNN on dataset.

How does K-NN work?

The K-NN working can be explained based on the below algorithm:

Step-1: Select the number K of neighbors.

Step-2: Calculate the Euclidean distance between the new data point and the existing data points.

Step-3: Take the K nearest neighbors as per the calculated Euclidean distance.

Step-4: Among these K neighbors, count the number of data points in each category.

Step-5: Assign the new data point to the category for which the number of neighbors is maximum.

Step-6: The model is ready.

Suppose we have a new data point, and we need to put it in the required category. Consider
the below image:
Fig. 3.2: Considering new data point.

Firstly, we will choose the number of neighbors; here we choose k = 5.

Next, we will calculate the Euclidean distance between the data points. The Euclidean distance is the distance between two points, which we have already studied in geometry. It can be calculated as:
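For two points (x₁, y₁) and (x₂, y₂):

d = √((x₂ − x₁)² + (y₂ − y₁)²)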

Fig. 3.3: Measuring of Euclidean distance.

By calculating the Euclidean distance, we get the nearest neighbors: three nearest neighbors in category A and two nearest neighbors in category B. Consider the image below:
Fig. 3.4: Assigning data point to category A.

As we can see, the three nearest neighbors are from category A; hence this new data point must belong to category A.

How to select the value of K in the K-NN Algorithm?

Below are some points to remember while selecting the value of K in the K-NN algorithm:

 There is no particular way to determine the best value for "K", so we need to try several values to find the best among them. The most preferred value for K is 5.
 A very low value for K, such as K=1 or K=2, can be noisy and expose the model to the effects of outliers.
 Larger values for K reduce the effect of noise, but the model may then find it difficult to distinguish between the categories.

Disadvantages of KNN Algorithm

 It always requires determining the value of K, which can sometimes be complex.
 The computation cost is high because the distance to every training sample must be calculated.
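As a minimal illustration (not part of the project code; the synthetic data and settings are assumptions for demonstration), the K-NN workflow described above can be sketched with scikit-learn:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Synthetic two-class data standing in for a real dataset
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5)   # K = 5, the commonly preferred value
knn.fit(X_train, y_train)                   # "training" merely stores the data (lazy learner)
y_pred = knn.predict(X_test)                # distances are computed at prediction time
print("KNN accuracy:", accuracy_score(y_test, y_pred))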

3.2 Support Vector Machine Algorithm

Support Vector Machine, or SVM, is one of the most popular supervised learning algorithms, used for classification as well as regression problems. However, it is primarily used for classification problems in machine learning. The goal of the SVM algorithm is to create the best line or decision boundary that can segregate n-dimensional space into classes, so that we can easily put new data points into the correct category in the future. This best decision boundary is called a hyperplane.

SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme cases are called support vectors, and hence the algorithm is termed a Support Vector Machine. Consider the diagram below, in which two different categories are classified using a decision boundary or hyperplane:

Fig. 3.5: Analysis of SVM.

Types of SVM: SVM can be of two types

Linear SVM: A linear SVM is used for linearly separable data; if a dataset can be classified into two classes using a single straight line, such data is termed linearly separable, and the classifier used is called a linear SVM classifier.

Non-linear SVM: A non-linear SVM is used for non-linearly separable data; if a dataset cannot be classified using a straight line, such data is termed non-linear data, and the classifier used is called a non-linear SVM classifier.

3.2.1 Working

Linear SVM: The working of the SVM algorithm can be understood using an example. Suppose we have a dataset with two tags (green and blue) and two features, x1 and x2. We want a classifier that can classify the pair (x1, x2) of coordinates as either green or blue. Consider the image below:
Fig. 3.6: Linear SVM.

Since this is a 2-D space, we can easily separate these two classes using a straight line. But there can be multiple lines that separate these classes. Consider the image below:

Fig. 3.7: Test-Vector in SVM.

Hence, the SVM algorithm helps to find the best line or decision boundary; this best boundary or region is called a hyperplane. The SVM algorithm finds the closest points of the lines from both classes. These points are called support vectors. The distance between the vectors and the hyperplane is called the margin, and the goal of SVM is to maximize this margin. The hyperplane with the maximum margin is called the optimal hyperplane.
Fig. 3.8: Classification in SVM.

Non-Linear SVM: If data is linearly arranged, we can separate it with a straight line, but for non-linear data, we cannot draw a single straight line. Consider the image below:

Fig. 3.9: Non-Linear SVM.

So, to separate these data points, we need to add one more dimension. For linear data we have used two dimensions, x and y, so for non-linear data we will add a third dimension, z. It can be calculated as:

z = x² + y²

By adding the third dimension, the sample space will look like the image below:
Fig. 3.10: Non-Linear SVM data separation.

So now, SVM will divide the datasets into classes in the following way. Consider the below
image:

Fig. 3.11: Non-Linear SVM best hyperplane.

Since we are in 3-D space, the boundary looks like a plane parallel to the x-axis. If we convert it back to 2-D space with z = 1, it becomes:
Fig. 3.12: Non-Linear SVM with ROC.

Hence, we get a circumference of radius 1 in the case of non-linear data.
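A minimal sketch of this idea, assuming scikit-learn and synthetic circular data (not from the project), is shown below; the RBF kernel implicitly performs the kind of dimension-adding transformation described above:

from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric circles: not separable by any straight line in 2-D
X, y = make_circles(n_samples=300, factor=0.5, noise=0.05, random_state=0)

linear_svm = SVC(kernel="linear").fit(X, y)
rbf_svm = SVC(kernel="rbf").fit(X, y)   # kernel trick adds dimensions such as z = x^2 + y^2

print("Linear SVM training accuracy:", linear_svm.score(X, y))  # close to chance level
print("RBF SVM training accuracy:", rbf_svm.score(X, y))        # close to 1.0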

3.2.2 Advantages of SVM

 Support vector machines work comparably well when there is a clear margin of separation between classes.
 They are effective in high-dimensional spaces, which is common in applications such as image and text classification.
 They remain effective in cases where the number of dimensions is larger than the number of samples.
 They are comparably memory efficient, since the decision function uses only a subset of the training points (the support vectors).
CHAPTER 4

PROPOSED SYSTEM

4.1 Overview

The main goal of this project is to develop a machine learning-based system capable of
identifying botnet attacks within IoT device data. Botnets are networks of compromised
devices controlled by malicious actors, and detecting their activities is crucial for network
security.
Fig. 4.1: Block diagram of proposed system.

Here's an overview of the project's objectives and components:

Data: The project relies on two datasets:

 "benign_traffic.csv": contains data representing normal, non-malicious IoT device behavior.
 "junk.csv": contains data representing potentially malicious IoT device behavior.

Data Preprocessing: The data is preprocessed by:

 Adding labels to indicate whether each data point represents benign (0) or malicious (1) activity.
 Normalizing the data using Z-score normalization, which standardizes the features to have a mean of 0 and a standard deviation of 1 (see the sketch below).
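A hedged sketch of these two steps, reusing the project's dataset file names (the exact column handling may differ from the project code):

import pandas as pd

benign = pd.read_csv("benign_traffic.csv")
malicious = pd.read_csv("junk.csv")

benign["output"] = 0       # label benign traffic as 0
malicious["output"] = 1    # label malicious traffic as 1

data = pd.concat([benign, malicious], axis=0)
features = data.drop(columns="output")

# Z-score normalization: zero mean, unit standard deviation per feature
features_norm = (features - features.mean()) / features.std()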

Data Exploration and Description: Descriptive statistics are computed and displayed for both
datasets to provide insights into the data's characteristics.

Model Building: Four different machine learning models are defined and used in this project:

 Logistic Regression (LR)
 Support Vector Machine (SVM)
 Random Forest (RF)
 K-Nearest Neighbors (KNN)

Model Training and Evaluation: The project uses a standard approach to evaluate the performance of these models:

 The dataset is split into training and testing sets (80-20 split).
 Each model is trained on the training data.
 Predictions are made on the test data.
 Accuracy scores are calculated for each model.
 Classification reports are generated, providing additional evaluation metrics such as precision, recall, and F1-score (defined below).
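For reference, these metrics are computed from the counts of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN):

accuracy = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)
recall = TP / (TP + FN)
F1-score = 2 × precision × recall / (precision + recall)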

Confusion Matrix: A confusion matrix is created for each model to visualize its performance. Confusion matrices show true positives, true negatives, false positives, and false negatives, helping to understand each model's ability to classify benign and malicious data points; a minimal example follows.
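A minimal illustration with made-up labels (0 = benign, 1 = malicious):

from sklearn.metrics import confusion_matrix

y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]
print(confusion_matrix(y_true, y_pred))
# Rows are actual classes, columns are predicted classes:
# [[TN FP]
#  [FN TP]]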

4.2 Pre-processing

Data pre-processing is the process of preparing raw data and making it suitable for a machine learning model. It is the first and most crucial step in creating a machine learning model. When working on a project, we do not always come across clean, formatted data, and before doing any operation on data it is mandatory to clean it and put it in a formatted form. For this, we use data pre-processing.

Why do we need Data Pre-processing?

Real-world data generally contains noise and missing values, and it may be in an unusable format that cannot be directly used for machine learning models. Data pre-processing is the required task of cleaning the data and making it suitable for a machine learning model, which also increases the accuracy and efficiency of the model. It involves the following steps:

 Getting the dataset
 Importing libraries
 Importing datasets
 Finding missing data
 Encoding categorical data
 Splitting the dataset into training and test sets
 Feature scaling

4.2.1 Splitting the Dataset into the Training Set and Test Set

In machine learning data pre-processing, we divide the dataset into a training set and a test set. This is one of the crucial steps of data pre-processing, as doing this enhances the performance of our machine learning model. Suppose we train our machine learning model on one dataset and then test it on a completely different dataset; the model will then have difficulty understanding the correlations between them.

If we train our model very well and its training accuracy is very high, but we then provide a new dataset to it, performance will decrease. So we always try to build a machine learning model that performs well on the training set and also on the test dataset. Here, we can define these datasets as follows:

Fig. 4.2: Dataset splitting.

Training set: a subset of the dataset used to train the machine learning model; we already know the output.

Test set: a subset of the dataset used to test the machine learning model; the model predicts the output for the test set.
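A minimal sketch of such an 80-20 split with scikit-learn (the placeholder arrays and random_state are illustrative):

import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(100, 5)           # placeholder feature matrix
y = np.random.randint(0, 2, 100)     # placeholder binary labels

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=3)
print("Training samples:", len(X_train))   # 80
print("Testing samples:", len(X_test))     # 20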

4.3 Random Forest Algorithm

Random Forest is a popular machine learning algorithm that belongs to the supervised learning technique. It can be used for both classification and regression problems in ML. It is based on the concept of ensemble learning, which is the process of combining multiple classifiers to solve a complex problem and improve the performance of the model. As the name suggests, a random forest is a classifier that contains a number of decision trees built on various subsets of the given dataset and combines them to improve the predictive accuracy on that dataset. Instead of relying on one decision tree, the random forest takes the prediction from each tree and, based on the majority vote of predictions, predicts the final output. A greater number of trees in the forest leads to higher accuracy and prevents the problem of overfitting.
Fig. 4.3: Random Forest algorithm.

Random Forest algorithm

Step 1: In a Random Forest, n random records are taken from a data set having k records.

Step 2: An individual decision tree is constructed for each sample.

Step 3: Each decision tree generates an output.

Step 4: The final output is based on majority voting (for classification) or averaging (for regression). A scikit-learn sketch of this procedure is given below.
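A hedged sketch of these four steps (the synthetic data is illustrative; n_estimators = 1500 mirrors the value used later in this project's source code):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, random_state=0)

rf = RandomForestClassifier(n_estimators=1500, random_state=0)  # 1500 bootstrapped trees
rf.fit(X, y)                 # each tree is trained on a random sample of rows/features
print(rf.predict(X[:5]))     # final class = majority vote across the trees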

Important Features of Random Forest

 Diversity: not all attributes/variables/features are considered while making an individual tree; each tree is different.
 Immunity to the curse of dimensionality: since each tree does not consider all the features, the feature space is reduced.
 Parallelization: each tree is created independently from different data and attributes, which means we can make full use of the CPU to build random forests.
 Train-test split: in a random forest we don't have to segregate the data for train and test, as roughly a third of the data (the out-of-bag samples) is never seen by a given decision tree.
 Stability: stability arises because the result is based on majority voting/averaging.
Assumptions for Random Forest

Since the random forest combines multiple trees to predict the class of the dataset, it is possible that some decision trees predict the correct output while others do not. But together, all the trees predict the correct output. Therefore, below are two assumptions for a better random forest classifier:

 There should be some actual values in the feature variables of the dataset, so that the classifier can predict accurate results rather than guessed results.
 The predictions from each tree must have very low correlation.

Below are some points that explain why we should use the Random Forest algorithm:

 It takes less training time compared to other algorithms.
 It predicts output with high accuracy; even for a large dataset, it runs efficiently.
 It can also maintain accuracy when a large proportion of data is missing.

Types of Ensembles

Before understanding the working of the random forest, we must look into the ensemble
technique. Ensemble simply means combining multiple models. Thus, a collection of models
is used to make predictions rather than an individual model. Ensemble uses two types of
methods:

Bagging– It creates different training subsets from the sample training data with replacement, and the final output is based on majority voting; Random Forest is an example. Bagging, also known as Bootstrap Aggregation, is the ensemble technique used by random forest. Bagging chooses random samples from the data set; each model is generated from the samples (bootstrap samples) provided by the original data with replacement, known as row sampling. This step of row sampling with replacement is called bootstrapping. Each model is then trained independently and generates a result. The final output is based on majority voting after combining the results of all models; this step of combining all the results and generating an output based on majority voting is known as aggregation.
Fig. 4.4: RF classifier analysis.
Boosting– It combines weak learners into strong learners by creating sequential models such that the final model has the highest accuracy. Examples include AdaBoost and XGBoost. A sketch contrasting bagging and boosting follows the figure below.

Fig. 4.5: Boosting RF classifier.
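A minimal sketch contrasting the two ensemble styles (scikit-learn classes; the data and settings are illustrative, not from the project):

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier

X, y = make_classification(n_samples=400, random_state=0)

bagging = BaggingClassifier(n_estimators=50, random_state=0)    # parallel trees on bootstrap samples
boosting = AdaBoostClassifier(n_estimators=50, random_state=0)  # sequential weak learners

print("Bagging accuracy:", bagging.fit(X, y).score(X, y))
print("Boosting accuracy:", boosting.fit(X, y).score(X, y))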


CHAPTER 5

UML DIAGRAMS
UML stands for Unified Modeling Language. UML is a standardized general-purpose modeling language in the field of object-oriented software engineering. The standard is managed, and was created, by the Object Management Group. The goal is for UML to become a common language for creating models of object-oriented computer software. In its current form, UML comprises two major components: a meta-model and a notation. In the future, some form of method or process may also be added to, or associated with, UML.

The Unified Modeling Language is a standard language for specifying, visualizing, constructing, and documenting the artifacts of software systems, as well as for business modeling and other non-software systems. The UML represents a collection of best engineering practices that have proven successful in the modeling of large and complex systems. The UML is a very important part of developing object-oriented software and the software development process. The UML uses mostly graphical notations to express the design of software projects.

GOALS: The Primary goals in the design of the UML are as follows:

 Provide users with a ready-to-use, expressive visual modeling language so that they can develop and exchange meaningful models.
 Provide extensibility and specialization mechanisms to extend the core concepts.
 Be independent of particular programming languages and development processes.
 Provide a formal basis for understanding the modeling language.
 Encourage the growth of the OO tools market.
 Support higher-level development concepts such as collaborations, frameworks, patterns, and components.
 Integrate best practices.

Class Diagram

The class diagram is used to refine the use case diagram and define a detailed design of the
system. The class diagram classifies the actors defined in the use case diagram into a set of
interrelated classes. The relationship or association between the classes can be either an "is-a"
or "has-a" relationship. Each class in the class diagram may be capable of providing certain
functionalities. These functionalities provided by the class are termed "methods" of the class.
Apart from this, each class may have certain "attributes" that uniquely identify the class.

Sequence Diagram

A sequence diagram represents the objects participating in an interaction horizontally and time vertically. A use case is a kind of behavioral classifier that represents a declaration of an offered behavior. Each use case specifies some behavior, possibly including variants, that the subject can perform in collaboration with one or more actors. Use cases define the offered behavior of the subject without reference to its internal structure. These behaviors, involving interactions between the actor and the subject, may result in changes to the state of the subject and communications with its environment. A use case can include possible variations of its basic behavior, including exceptional behavior and error handling.
Activity diagram

Activity diagram is another important diagram in UML to describe the dynamic aspects of the
system.
Deployment diagram

The deployment diagram visualizes the physical hardware on which the software will be
deployed.

Use Case Diagram

A use case diagram in the Unified Modeling Language (UML) is a type of behavioural
diagram defined by and created from a Use-case analysis. Its purpose is to present a graphical
overview of the functionality provided by a system in terms of actors, their goals (represented
as use cases), and any dependencies between those use cases. The main purpose of a use case
diagram is to show what system functions are performed for which actor. Roles of the actors
in the system can be depicted.
Component diagram

Component diagram describes the organization and wiring of the physical components in a
system.
CHAPTER 6

SOFTWARE ENVIRONMENT

What is Python?

Below are some facts about Python.

 Python is currently one of the most widely used multi-purpose, high-level programming languages.
 Python allows programming in object-oriented and procedural paradigms. Python programs are generally smaller than those written in other programming languages like Java.
 Programmers have to type relatively less, and the language's indentation requirement makes programs readable all the time.
 Python is used by almost all tech-giant companies like Google, Amazon, Facebook, Instagram, Dropbox, Uber, etc.

The biggest strength of Python is its huge collection of standard libraries, which can be used for the following:

 Machine Learning
 GUI Applications (like Kivy, Tkinter, PyQt etc. )
 Web frameworks like Django (used by YouTube, Instagram, Dropbox)
 Image processing (like Opencv, Pillow)
 Web scraping (like Scrapy, BeautifulSoup, Selenium)
 Test frameworks
 Multimedia

Advantages of Python

Let’s see how Python dominates over other languages.

1. Extensive Libraries

Python ships with an extensive library containing code for various purposes like regular expressions, documentation generation, unit testing, web browsers, threading, databases, CGI, email, image manipulation, and more. So, we don't have to write the complete code for these manually.
2. Extensible

As we have seen earlier, Python can be extended with other languages. You can write some of your code in languages like C++ or C. This comes in handy, especially in performance-critical projects.

3. Embeddable

Complimentary to extensibility, Python is embeddable as well. You can put your Python code
in your source code of a different language, like C++. This lets us add scripting capabilities to
our code in the other language.

4. Improved Productivity

The language’s simplicity and extensive libraries render programmers more productive than
languages like Java and C++ do. Also, the fact that you need to write less and get more things
done.

5. IOT Opportunities

Since Python forms the basis of new platforms like Raspberry Pi, it finds the future bright for
the Internet Of Things. This is a way to connect the language with the real world.

6. Simple and Easy

When working with Java, you may have to create a class to print 'Hello World'. But in Python, just a print statement will do. It is also quite easy to learn, understand, and code. This is why, when people pick up Python, they can have a hard time adjusting to other, more verbose languages like Java.
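For example, the entire program is a single line:

print("Hello World")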

7. Readable

Because it is not such a verbose language, reading Python is much like reading English. This
is the reason why it is so easy to learn, understand, and code. It also does not need curly
braces to define blocks, and indentation is mandatory. This further aids the readability of the
code.

8. Object-Oriented

This language supports both the procedural and object-oriented programming paradigms.
While functions help us with code reusability, classes and objects let us model the real world.
A class allows the encapsulation of data and functions into one.
9. Free and Open-Source

Like we said earlier, Python is freely available. But not only can you download Python for free, you can also download its source code, make changes to it, and even distribute it. Python comes with an extensive collection of libraries to help you with your tasks.

10. Portable

When you code your project in a language like C++, you may need to make some changes to
it if you want to run it on another platform. But it isn’t the same with Python. Here, you need
to code only once, and you can run it anywhere. This is called Write Once Run Anywhere
(WORA). However, you need to be careful enough not to include any system-dependent
features.

11. Interpreted

Lastly, we will say that it is an interpreted language. Since statements are executed one by
one, debugging is easier than in compiled languages.


Advantages of Python Over Other Languages

1. Less Coding

Almost all tasks done in Python require less coding than the same tasks done in other languages. Python also has awesome standard library support, so you don't have to search for third-party libraries to get your job done. This is the reason many people suggest learning Python to beginners.

2. Affordable

Python is free, therefore individuals, small companies, and big organizations can leverage the freely available resources to build applications. Python is popular and widely used, so it gives you better community support.

The 2019 GitHub annual survey showed that Python had overtaken Java in the most popular programming language category.
3. Python is for Everyone

Python code can run on any machine whether it is Linux, Mac or Windows. Programmers
need to learn different languages for different jobs but with Python, you can professionally
build web apps, perform data analysis and machine learning, automate things, do web
scraping and also build games and powerful visualizations. It is an all-rounder programming
language.

Disadvantages of Python

So far, we’ve seen why Python is a great choice for your project. But if you choose it, you
should be aware of its consequences as well. Let’s now see the downsides of choosing Python
over another language.

1. Speed Limitations

We have seen that Python code is executed line by line. But since Python is interpreted, this often results in slow execution. This, however, isn't a problem unless speed is a focal point for the project. In other words, unless high speed is a requirement, the benefits offered by Python are enough to outweigh its speed limitations.

2. Weak in Mobile Computing and Browsers

While it serves as an excellent server-side language, Python is much more rarely seen on the client side. Besides that, it is rarely used to implement smartphone-based applications; one such application is called Carbonnelle. The reason it is not so common on the client side, despite the existence of Brython, is that Brython isn't that secure.

3. Design Restrictions

As you know, Python is dynamically typed, meaning you don't need to declare the type of a variable while writing the code. It uses duck typing. But wait, what's that? Well, it just means that if it looks like a duck, it must be a duck. While this makes coding easy for programmers, it can raise run-time errors.

4. Underdeveloped Database Access Layers


Compared to more widely used technologies like JDBC (Java DataBase Connectivity) and
ODBC (Open DataBase Connectivity), Python’s database access layers are a bit
underdeveloped. Consequently, it is less often applied in huge enterprises.

5. Simple

No, we’re not kidding. Python’s simplicity can indeed be a problem. Take my example. I
don’t do Java, I’m more of a Python person. To me, its syntax is so simple that the verbosity
of Java code seems unnecessary.

This was all about the Advantages and Disadvantages of Python Programming Language.

Modules Used in Project

NumPy

NumPy is a general-purpose array-processing package. It provides a high-performance multidimensional array object and tools for working with these arrays.

It is the fundamental package for scientific computing with Python. It contains various
features including these important ones:

 A powerful N-dimensional array object
 Sophisticated (broadcasting) functions
 Tools for integrating C/C++ and Fortran code
 Useful linear algebra, Fourier transform, and random number capabilities

Besides its obvious scientific uses, NumPy can also be used as an efficient multi-dimensional
container of generic data. Arbitrary datatypes can be defined using NumPy which allows
NumPy to seamlessly and speedily integrate with a wide variety of databases.
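A tiny illustration of the array object and broadcasting:

import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])   # 2-D array object
print(a * 10)                          # broadcasting a scalar across the array
print(a.mean(axis=0))                  # column-wise means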

Pandas

Pandas is an open-source Python library providing high-performance data manipulation and analysis tools through its powerful data structures. Python had previously been used mainly for data munging and preparation, with very little contribution to data analysis; Pandas solved this problem. Using Pandas, we can accomplish five typical steps in the processing and analysis of data, regardless of the origin of the data: load, prepare, manipulate, model, and analyze. Python with Pandas is used in a wide range of fields, including academic and commercial domains such as finance, economics, statistics, and analytics.
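A tiny illustration of the load/prepare/analyze steps mentioned above (the DataFrame contents are made up):

import pandas as pd

df = pd.DataFrame({"feature": [1.0, 2.0, 3.0, 4.0], "label": [0, 1, 0, 1]})
print(df.describe())               # quick statistical summary
print(df.groupby("label").mean())  # average feature value per class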
Matplotlib

Matplotlib is a Python 2D plotting library which produces publication-quality figures in a variety of hardcopy formats and interactive environments across platforms. Matplotlib can be used in Python scripts, the Python and IPython shells, the Jupyter Notebook, web application servers, and four graphical user interface toolkits. Matplotlib tries to make easy things easy and hard things possible. You can generate plots, histograms, power spectra, bar charts, error charts, scatter plots, etc., with just a few lines of code.

For simple plotting, the pyplot module provides a MATLAB-like interface, particularly when combined with IPython. For the power user, you have full control of line styles, font properties, axes properties, etc., via an object-oriented interface or via a set of functions familiar to MATLAB users.
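For instance, a labeled plot takes only a few lines:

import matplotlib.pyplot as plt

plt.plot([1, 2, 3, 4], [1, 4, 9, 16], marker="o")
plt.xlabel("x")
plt.ylabel("x squared")
plt.title("A simple matplotlib plot")
plt.show()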

Scikit – learn

Scikit-learn provides a range of supervised and unsupervised learning algorithms via a consistent interface in Python. It is licensed under a permissive simplified BSD license and is distributed through many Linux distributions, encouraging academic and commercial use.
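The consistent interface means every estimator is used the same way, as in this small sketch (toy data for illustration):

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

X = [[0, 0], [1, 1], [0, 1], [1, 0]]   # toy feature matrix
y = [0, 1, 0, 1]                        # toy labels

for model in (LogisticRegression(), KNeighborsClassifier(n_neighbors=1)):
    model.fit(X, y)                               # same fit/predict API for every estimator
    print(type(model).__name__, model.predict([[1, 1]]))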
Python

Install Python Step-by-Step in Windows and Mac

Python, a versatile programming language, doesn't come pre-installed on your computer. It was first released in 1991, and today it remains a very popular high-level programming language. Its design philosophy emphasizes code readability, notably through its use of significant whitespace.

The object-oriented approach and language constructs provided by Python enable programmers to write clear, logical code for projects. This software does not come pre-packaged with Windows.

How to Install Python on Windows and Mac

There have been several updates to Python over the years. The question is: how do you install Python? It might be confusing for a beginner who wants to start learning Python, but this tutorial will solve your query. The latest version of Python at the time of writing is 3.7.4; in other words, it is Python 3.

Note: Python version 3.7.4 cannot be used on Windows XP or earlier.

Before you start with the installation process, you first need to know your system requirements. Based on your system type, i.e., operating system and processor, you must download the matching Python version. My system type is a Windows 64-bit operating system, so the steps below install Python version 3.7.4 on a Windows device. The steps for installing Python on Windows 10, 8, and 7 are divided into four parts to help you understand better.

Download the Correct version into the system

Step 1: Go to the official site to download and install python using Google Chrome or any
other web browser. OR Click on the following link: https://www.python.org

Now, check for the latest and the correct version for your operating system.

Step 2: Click on the Download Tab.


Step 3: You can either select the yellow "Download Python 3.7.4" button, or scroll further down and click the download link corresponding to your required version. Here, we are downloading the most recent Python version for Windows, 3.7.4.

Step 4: Scroll down the page until you find the Files option.

Step 5: Here you see a different version of python along with the operating system.
 To download Windows 32-bit python, you can select any one from the three options:
Windows x86 embeddable zip file, Windows x86 executable installer or Windows x86
web-based installer.
 To download Windows 64-bit python, you can select any one from the three options:
Windows x86-64 embeddable zip file, Windows x86-64 executable installer or
Windows x86-64 web-based installer.

Here, we will use the Windows x86-64 web-based installer. This completes the first part, choosing which version of Python to download. Now we move on to the second part: installation.

Note: To know the changes or updates that are made in the version you can click on the
Release Note Option.

Installation of Python

Step 1: Go to Downloads and open the downloaded Python installer to carry out the installation process.

Step 2: Before you click on Install Now, make sure to tick "Add Python 3.7 to PATH".

Step 3: Click on Install Now. After the installation is successful, click on Close.
With these three steps, you have successfully and correctly installed Python. Now it is time to verify the installation.

Note: The installation process might take a couple of minutes.

Verify the Python Installation

Step 1: Click on Start

Step 2: In the Windows Run Command, type “cmd”.

Step 3: Open the Command prompt option.

Step 4: Let us test whether Python is correctly installed. Type python -V and press Enter.

Step 5: You will get the answer as Python 3.7.4.


Note: If you have any earlier version of Python already installed, you must first uninstall it and then install the new one.

Check how the Python IDLE works

Step 1: Click on Start

Step 2: In the Windows Run command, type “python idle”.

Step 3: Click on IDLE (Python 3.7 64-bit) and launch the program

Step 4: To go ahead with working in IDLE you must first save the file. Click on File > Click
on Save

Step 5: Name the file and save as type should be Python files. Click on SAVE. Here I have
named the files as Hey World.

Step 6: Now, for example, enter print("Hey World") and press Enter.
You will see that the command is executed. With this, we end our tutorial on how to install Python. You have learned how to download and install Python for Windows on your operating system.

Note: Unlike Java, Python does not need semicolons at the end of statements.

CHAPTER 7

SYSTEM REQUIREMENTS

Software Requirements

The functional requirements or the overall description documents include the product
perspective and features, operating system and operating environment, graphics requirements,
design constraints and user documentation.

The appropriation of requirements and implementation constraints gives the general overview
of the project in regard to what the areas of strength and deficit are and how to tackle them.

 Python IDLE 3.7 (or)
 Anaconda 3.7 (or)
 Jupyter Notebook (or)
 Google Colab
Hardware Requirements

Minimum hardware requirements are very dependent on the particular software being
developed by a given Enthought Python / Canopy / VS Code user. Applications that need to
store large arrays/objects in memory will require more RAM, whereas applications that need
to perform numerous calculations or tasks more quickly will require a faster processor.

 Operating system: Windows, Linux
 Processor: minimum Intel i3
 RAM: minimum 4 GB
 Hard disk: minimum 250 GB

CHAPTER 8

FUNCTIONAL REQUIREMENTS

Output Design

Outputs from computer systems are required primarily to communicate the results of processing to users. They are also used to provide a permanent copy of the results for later consultation. The various types of outputs in general are:

 External outputs, whose destination is outside the organization.
 Internal outputs, whose destination is within the organization; they are the user's main interface with the computer.
 Operational outputs, whose use is purely within the computer department.
 Interface outputs, which involve the user in communicating directly.

Output Definition
The outputs should be defined in terms of the following points:

 Type of the output


 Content of the output
 Format of the output
 Location of the output
 Frequency of the output
 Volume of the output
 Sequence of the output

It is not always desirable to print or display data exactly as it is held on a computer. It should be decided which form of output is the most suitable.

Input Design

Input design is a part of overall system design. The main objectives during input design are as given below:

 To produce a cost-effective method of input.
 To achieve the highest possible level of accuracy.
 To ensure that the input is acceptable to and understood by the user.

Input Stages

The main input stages can be listed as below:

 Data recording
 Data transcription
 Data conversion
 Data verification
 Data control
 Data transmission
 Data validation
 Data correction

Input Types

It is necessary to determine the various types of inputs. Inputs can be categorized as follows:
 External inputs, which are prime inputs for the system.
 Internal inputs, which are user communications with the system.
 Operational inputs, which are the computer department's communications to the system.
 Interactive inputs, which are inputs entered during a dialogue.

Input Media

At this stage, a choice has to be made about the input media. To decide on the input media, consideration has to be given to:

 Type of input
 Flexibility of format
 Speed
 Accuracy
 Verification methods
 Rejection rates
 Ease of correction
 Storage and handling requirements
 Security
 Easy to use
 Portability

Keeping in view the above description of the input types and input media, it can be said that most of the inputs are of the internal and interactive form. As input data is to be keyed in directly by the user, the keyboard can be considered the most suitable input device.

Error Avoidance

At this stage, care is taken to ensure that input data remains accurate from the stage at which it is recorded up to the stage at which it is accepted by the system. This can be achieved only by careful control each time the data is handled.

Error Detection
Even though every effort is made to avoid the occurrence of errors, a small proportion of errors is always likely to occur. These types of errors can be discovered by using validations to check the input data.

Data Validation

Procedures are designed to detect errors in data at a lower level of detail. Data validations have been included in the system in almost every area where there is a possibility for the user to commit errors. The system will not accept invalid data. Whenever invalid data is keyed in, the system immediately prompts the user, who has to key in the data again; the system will accept the data only if it is correct. Validations have been included where necessary.

The system is designed to be a user friendly one. In other words the system has been
designed to communicate effectively with the user. The system has been designed with
popup menus.

User Interface Design

It is essential to consult the system users and discuss their needs while designing the user
interface:

User interface systems can be broadly classified as:

 User-initiated interfaces, where the user is in charge, controlling the progress of the user/computer dialogue.
 Computer-initiated interfaces, where the computer selects the next stage in the interaction.

In computer-initiated interfaces, the computer guides the progress of the user/computer dialogue. Information is displayed, and based on the user's response the computer takes action or displays further information.

User Initiated Interfaces

User initiated interfaces fall into two approximate classes:

 Command driven interfaces: In this type of interface the user inputs commands or
queries which are interpreted by the computer.
 Forms oriented interface: The user calls up an image of the form to his/her screen and
fills in the form. The forms-oriented interface is chosen because it is the best choice.

Computer-Initiated Interfaces

The following computer – initiated interfaces were used:

 The menu system: the user is presented with a list of alternatives and chooses one of them.
 Question-answer type dialog system: the computer asks a question and takes action based on the user's reply.

Right from the start, the system is menu driven: the opening menu displays the available options. Choosing one option brings up another popup menu with more options. In this way, every option leads the user to a data-entry form where the user can key in the data.

Error Message Design

The design of error messages is an important part of user interface design. As the user is bound to commit some errors while using a system, the system should be designed to be helpful by providing the user with information regarding the error he/she has committed.

This application must be able to produce output at different modules for different inputs.

Performance Requirements

Performance is measured in terms of the output provided by the application. Requirement specification plays an important part in the analysis of a system. Only when the requirement specifications are properly given is it possible to design a system that will fit into the required environment. It rests largely with the users of the existing system to give the requirement specifications, because they are the people who will finally use the system. This is because the requirements have to be known during the initial stages so that the system can be designed according to those requirements. It is very difficult to change the system once it has been designed; on the other hand, designing a system that does not cater to the requirements of the user is of no use.

The requirement specification for any system can be broadly stated as given below:

 The system should be able to interface with the existing system.
 The system should be accurate.
 The system should be better than the existing system.
 The existing system is completely dependent on the user to perform all the duties.

CHAPTER 9

SOURCE CODE
# # Ensemble Model for the Detection of Botnet Attacks from IoT Devices

# ### Importing all the required libraries

import pandas as pd

import numpy as np

import itertools

import matplotlib.pyplot as plt



from sklearn.preprocessing import MinMaxScaler

from IPython.display import display

from sklearn.model_selection import train_test_split

from sklearn.linear_model import LogisticRegression

from sklearn.svm import SVC

from sklearn.ensemble import RandomForestClassifier

from sklearn.neighbors import KNeighborsClassifier

from sklearn.metrics import accuracy_score, roc_curve, auc, classification_report, confusion_matrix

import warnings

warnings.filterwarnings("ignore",category=DeprecationWarning)

# ### Loading dataset

pureData = pd.read_csv("./benign_traffic.csv")

print("Non-malicious Dataset")

display(pureData.head())

maliciousDataset = pd.read_csv("./junk.csv")

print("Malicious Dataset")

display(maliciousDataset.head())
# ### Description of dataset

print("There are %d records with %d features in non-malicious dataset"%


(pureData.shape[0],pureData.shape[1]))

display(pureData.describe())

print("There are %d records with %d features in malicious dataset"%


(maliciousDataset.shape[0],maliciousDataset.shape[1]))

display(maliciousDataset.describe())

# ### Adding label to the dataset

print("Adding output column in the datasets with all 0 in pureData dataset and all 1 in
maliciousDataset dataset")

pureData["output"] = 0

maliciousDataset["output"] = 1

# ### Combining Non-malicious and Malicious datasets

dataset = pd.concat([pureData, maliciousDataset], axis=0)

print("There are %d records with %d features in combined dataset"%


(dataset.shape[0],dataset.shape[1]))

# ### Splitting input and output

Output = dataset.output

Input = dataset.loc[:,"MI_dir_L5_weight":"HpHp_L0.01_pcc"]

print("Output shape :-",Output.shape)

print("Intput shape :-",Input.shape)


# ### Preprocessing

Output=np.array(Output).flatten()

# ### Normalization

print("Calculation Z-score normalization which converts all indicators to a common scale


with an average of zero and standard deviation of one. The average of zero means that it
avoids introducing aggregation distortions stemming from differences in indicators' means.")

datasetNormalised=(dataset-dataset.mean())/(dataset.std())

datasetNormalised_array=np.array(datasetNormalised)

print("Data after normalisation")

display(datasetNormalised.head())

# ### Split dataset (80-20)

X_train, X_test, y_train, y_test = train_test_split(datasetNormalised_array, Output, test_size=0.2, random_state=3)

print ("Training set has {} samples.".format(X_train.shape[0]))

print ("Testing set has {} samples.".format(X_test.shape[0]))

# ### Model definition

LR = LogisticRegression(C=0.01, penalty='l1',solver='liblinear')

SVM = SVC(probability=True)

RF = RandomForestClassifier(n_estimators=1500)

KNN = KNeighborsClassifier()
# ### Model training and prediction

def train_predict(model, X_train, y_train, X_test, y_test):
    # Fit the given model on the training split and evaluate it on the test split
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    acc_test = accuracy_score(y_test, y_pred)
    report = classification_report(y_test, y_pred)
    return acc_test, report

# ### Prediction and accuracy score

LR_acc,LR_report = train_predict(LR,X_train,y_train,X_test,y_test)

print("The accuracy score of Logistic Regression is %f"%LR_acc)

print("Classification report :-")

print(LR_report)

SVM_acc,SVM_report = train_predict(SVM,X_train,y_train,X_test,y_test)

print("The accuracy score of SVM is %f"%SVM_acc)

print("Classification report :-")

print(SVM_report)

RF_acc,RF_report = train_predict(RF,X_train,y_train,X_test,y_test)

print("The accuracy score of Random Forest Classifier is %f"%RF_acc)

print("Classification report :-")

print(RF_report)
KNN_acc, KNN_report = train_predict(KNN, X_train, y_train, X_test, y_test)

print("The accuracy score of K-nearest neighbour is %f"%KNN_acc)

print("Classification report :-")

print(KNN_report)

# ### Confusion matrix

def plot_confusion_matrix(cm, classes, normalize=False, title='Confusion matrix', cmap=plt.cm.Blues):
    # Render the confusion matrix as an annotated heat map
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes)
    plt.yticks(tick_marks, classes)
    # Annotate each cell, switching text colour for readability
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, cm[i, j], horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")
    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

cnf_matrix = confusion_matrix(y_test, LR.predict(X_test))

np.set_printoptions(precision=2)

plt.figure()
plot_confusion_matrix(cnf_matrix, classes=[1,0],title='Confusion matrix')

plt.show()

cnf_matrix = confusion_matrix(y_test, SVM.predict(X_test))

np.set_printoptions(precision=2)

plt.figure()

plot_confusion_matrix(cnf_matrix, classes=[1,0],title='Confusion matrix')

plt.show()

cnf_matrix = confusion_matrix(y_test, RF.predict(X_test))

np.set_printoptions(precision=2)

plt.figure()

plot_confusion_matrix(cnf_matrix, classes=[1,0],title='Confusion matrix')

plt.show()
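# ### Confusion matrix for the KNN classifier
# The KNN model is trained and evaluated above but, unlike LR, SVM, and RF,
# its confusion matrix is not plotted; this block mirrors the plots above.

cnf_matrix = confusion_matrix(y_test, KNN.predict(X_test))

np.set_printoptions(precision=2)

plt.figure()

plot_confusion_matrix(cnf_matrix, classes=[1,0], title='Confusion matrix')

plt.show()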

CHAPTER 10

RESULTS AND DISCUSSION


Implementation description

This project implements an ensemble model to detect botnet attacks from IoT device data. It loads and combines benign and malicious IoT device data, prepares the data by adding labels and normalizing it, and then splits the data into training and testing sets. Afterwards, it defines and trains four machine learning models and evaluates them using accuracy scores, classification reports, and confusion matrices. Here is the step-by-step explanation:
1. Importing Libraries: It starts by importing the necessary Python libraries, including pandas, numpy, matplotlib, and various machine learning classes from scikit-learn (e.g., LogisticRegression, SVC, RandomForestClassifier, KNeighborsClassifier).

2. Loading Datasets

 Two datasets are loaded: "benign_traffic.csv" (non-malicious IoT device data) and
"junk.csv" (malicious IoT device data).
 The data from these CSV files is read into Pandas DataFrames.
 The first few rows of both datasets are displayed.

3. Dataset Description: Descriptive statistics (e.g., mean, standard deviation) are displayed for
both datasets to provide an overview of the data.

4. Adding Labels: A binary label is added to both datasets. Non-malicious data is labeled as 0,
and malicious data is labeled as 1.

5. Combining Datasets: The two datasets are concatenated into a single dataset called
"dataset" by stacking them vertically.

6. Splitting Input and Output

 The "output" column is extracted as the target variable (y or output), and the rest of
the columns are considered as input features (X or Input).
 The shapes of the output and input are displayed.

7. Preprocessing

 The output array is flattened using NumPy.


 Z-score normalization is applied to the input features, standardizing them to have a mean of 0 and a standard deviation of 1 (a minimal sketch follows this list).
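
As a minimal sketch of this step (assuming X is a pandas DataFrame holding only the numeric feature columns), z-score normalization is simply:

import pandas as pd

def z_score(X: pd.DataFrame) -> pd.DataFrame:
    # Subtract each column's mean and divide by its per-column standard deviation
    return (X - X.mean()) / X.std()

Normalizing only the feature columns keeps the class label out of the scaled inputs.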

8. Splitting the Dataset

 The dataset is split into training and testing sets in an 80-20 ratio using the
train_test_split function from scikit-learn.
 The number of samples in the training and testing sets is displayed.

9. Model Definition: Four machine learning models are defined

 Logistic Regression (LR)


 Support Vector Machine (SVM)
 Random Forest (RF)
 K-Nearest Neighbors (KNN)

10. Model Training and Prediction

 A function named train_predict is defined to train a given model and make predictions
on the test set.
 Each of the four models is trained on the training data, and predictions are made on
the test data.
 The accuracy score and classification report (including precision, recall, F1-score,
etc.) are printed for each model.

11. Confusion Matrix

 A function named plot_confusion_matrix is defined to plot a confusion matrix.


 Confusion matrices are computed for each model using the confusion_matrix function
from scikit-learn.
 The confusion matrices are then displayed using Matplotlib, showing true positives, true negatives, false positives, and false negatives (see the brief illustration below).
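
As a brief illustration (assuming binary labels {0, 1} and predictions y_pred from any of the trained models), the four cell counts, together with precision and recall, can be recovered from the matrix as follows:

from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, y_pred)   # 2x2 array laid out as [[TN, FP], [FN, TP]]
tn, fp, fn, tp = cm.ravel()             # unpack the cells in row-major order
precision = tp / (tp + fp)              # how many predicted attacks were real
recall = tp / (tp + fn)                 # how many real attacks were caught
print("TP=%d TN=%d FP=%d FN=%d" % (tp, tn, fp, fn))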

Dataset description

The dataset is a collection of features extracted from a network traffic stream, with various
statistics calculated over different time frames. Each feature is related to a specific type of
stream aggregation and time frame. Let's break down the attribute information and the
dataset:

Stream Aggregation:

 H: Stats summarizing the recent traffic from this packet's host (IP)
 MI: Stats summarizing the recent traffic from this packet's host (IP + MAC)
 HH: Stats summarizing the recent traffic going from this packet's host (IP) to the
packet's destination host.
 HH_jit: Stats summarizing the jitter of the traffic going from this packet's host (IP) to
the packet's destination host.
 HpHp: Stats summarizing the recent traffic going from this packet's host+port (IP) to
the packet's destination host+port.
Timeframe (Decay Factor):

 L5: Decay factor used in the damped window, capturing recent history up to 5 time units.
 L3: Decay factor capturing recent history up to 3 time units.
 L1: Decay factor capturing recent history up to 1 time unit.
 L0.1: Decay factor capturing recent history up to 0.1 time units.
 L0.01: Decay factor capturing recent history up to 0.01 time units (a sketch of how such damped statistics can be maintained follows this list).
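
As a rough sketch of how such damped-window statistics can be maintained incrementally (the exact decay rule below is an illustrative assumption, not necessarily the extractor used to generate this dataset, and the class name is hypothetical):

class DampedMean:
    # Incrementally maintained mean whose history decays with elapsed time;
    # lam behaves like a half-life, so larger values retain a longer history
    def __init__(self, lam):
        self.lam = lam        # e.g. 5, 3, 1, 0.1, or 0.01 time units
        self.weight = 0.0     # effective number of recently observed items
        self.total = 0.0      # decayed running sum of observed values
        self.last_t = None    # timestamp of the previous update
    def update(self, value, t):
        if self.last_t is not None:
            decay = 2.0 ** (-(t - self.last_t) / self.lam)
            self.weight *= decay
            self.total *= decay
        self.last_t = t
        self.weight += 1.0
        self.total += value
        return self.total / self.weight

Under this rule, DampedMean(5) still remembers packets several time units old, while DampedMean(0.01) reflects only the most recent traffic, mirroring the L5 and L0.01 columns described above.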

Statistics Extracted:

For each combination of stream aggregation, time frame, and statistic, the dataset contains the
following features:

 weight: The weight of the stream, which can be viewed as the number of items
observed in recent history.
 mean: The mean of the stream's values.
 std: The standard deviation of the stream's values.
 radius: The root squared sum of the two streams' variances.
 magnitude: The root squared sum of the two streams' means.
 cov: An approximated covariance between two streams.
 pcc: An approximated correlation coefficient between two streams.

The dataset itself consists of multiple columns, each corresponding to a specific combination
of stream aggregation, time frame, and statistic. The naming convention follows this pattern:
<Stream_Aggregation>_<Time_Frame>_<Statistic>

For example, one column is named MI_dir_L5_weight, which indicates the weight of the stream aggregated by source MAC-IP with a decay factor capturing recent history up to 5 time units.

In this dataset, each row likely represents a sample or instance of network traffic, and the
values in the columns represent the calculated statistics for various combinations of stream
aggregation, time frame, and statistic.
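
As a small illustrative helper (a sketch; parse_feature_name is a hypothetical function that assumes the naming pattern above), a column name can be decomposed into its three parts:

def parse_feature_name(name):
    # e.g. "MI_dir_L5_weight" -> ("MI_dir", "L5", "weight")
    head, statistic = name.rsplit("_", 1)           # split off the statistic
    aggregation, timeframe = head.rsplit("_", 1)    # split off the timeframe
    return aggregation, timeframe, statistic

print(parse_feature_name("MI_dir_L5_weight"))   # ('MI_dir', 'L5', 'weight')
print(parse_feature_name("HpHp_L0.01_pcc"))     # ('HpHp', 'L0.01', 'pcc')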

Results description
Figure 1 shows a visual representation of data collected from Internet of Things (IoT) devices
that are operating normally and not exhibiting any malicious behavior. It displays various
features or attributes of the data points collected from these devices, such as network traffic
patterns, communication protocols, and other relevant parameters. Each data point in this
figure represents a non-malicious activity. Figure 2 illustrates the data collected from IoT
devices that are exhibiting malicious or abnormal behavior. The visual representation
highlights the anomalies or suspicious patterns in the data that indicate potential attacks or
unauthorized activities. Each data point in this figure represents a malicious activity.

Figure 1: Illustration of sample non-malicious dataset obtained from IoT device.


Figure 2: Illustration of sample malicious dataset obtained from IoT device.
Figure 3: Illustration of dataset after combining both malicious and non-malicious data with
labelling as benign = 0, and attack = 1.

Figure 3 depicts the dataset created by combining the non-malicious data (from Figure 1) and
the malicious data (from Figure 2). This shows how the data points have been labeled, with
the "benign" data points labeled as 0 and the "attack" data points labeled as 1. This labeled
dataset is essential for training machine learning models to distinguish between normal and
malicious activities.

Figure 4: Classification report of LR model, and SVM classifier for detection of botnet attack
in IoT network traffic data.
Figure 5: Classification report of RF, and KNN classifiers for detection of botnet attack in
IoT network traffic data.

Figure 4 and Figure 5 present the classification performance reports of four machine learning models, namely the LR model, SVM classifier, RF model, and KNN classifier, for the specific task of detecting botnet attacks in IoT network traffic data. The reports include metrics such as accuracy, precision, recall, and F1-score, showing how well each model performed in identifying botnet attacks. Figure 6 demonstrates the confusion matrices obtained using the various ML models. Confusion matrices are used to assess the performance of classification algorithms: each matrix shows the number of true positives, true negatives, false positives, and false negatives for a given model's predictions. By comparing these matrices, we can evaluate which model performs better at correctly identifying normal and malicious activities.
Figure 6: Confusion matrices obtained using various ML models.

Table 1. Performance comparison of various ML models for predicting the botnet attack in
IoT device data.

Model/Metric     LR model   SVM classifier   RF model   KNN classifier
Accuracy (%)     100        99.98            100        99.99
Precision (%)    100        100              100        100
Recall (%)       100        100              100        100
F1-score (%)     100        100              100        100

Table 1 presents a comprehensive performance comparison of the different machine learning (ML) models employed for the prediction of botnet attacks in data obtained from IoT devices. The table covers four ML models: the LR model, SVM classifier, RF model, and KNN classifier. The metrics assessed for each model are accuracy, precision, recall, and F1-score.

The LR model achieved exceptional results across all metrics. It displayed a remarkable
accuracy rate of 100%, indicating that it accurately classified all instances, both botnet
attacks and benign activities. This suggests that the LR model has a comprehensive
understanding of the underlying patterns in the data, enabling it to make precise predictions.
Additionally, the LR model demonstrated 100% precision, recall, and F1-score. This implies
that the model not only made accurate positive predictions (precision) but also correctly
identified all true positive instances (recall), leading to a harmonic balance between precision
and recall, as represented by the F1-score. The SVM classifier performed impressively as
well, with an accuracy of 99.98%. This indicates that the model almost perfectly classified
instances into their respective categories. Similar to the LR model, the SVM classifier
attained 100% precision, recall, and F1-score, signifying its strong capability to make
accurate predictions and correctly identify botnet attacks. The RF model achieved a perfect
accuracy of 100%, matching the performance of the LR model. This suggests that the RF
model was able to effectively capture the complexities and variations in the data. The
precision, recall, and F1-score also reached 100%, indicating consistent and reliable
performance across these evaluation metrics. The KNN classifier, with an accuracy of 99.99%, was marginally behind the LR and RF models yet still demonstrated outstanding predictive capability. Like the other models, the KNN classifier achieved perfect precision, recall, and F1-score, showcasing its ability to make accurate and consistent predictions.
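
For reference, the F1-score reported throughout Table 1 is the harmonic mean of precision and recall; a minimal sketch with illustrative values:

def f1(precision, recall):
    # Harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)

print(f1(1.00, 1.00))   # 1.0 -> perfect precision and recall give a perfect F1
print(f1(0.90, 0.50))   # ~0.643 -> the harmonic mean penalizes imbalance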

Finally, Table 1 showcases the remarkable performance of all the evaluated ML models for
predicting botnet attacks in IoT device data. Each model exhibited near-perfect accuracy,
precision, recall, and F1-score, implying their proficiency in accurately identifying both
botnet attacks and benign activities. These findings suggest that the models are well-suited
for detecting and preventing malicious activities within IoT networks, enhancing the overall
security and reliability of IoT devices and systems.

CHAPTER 11

CONCLUSION AND FUTURE SCOPE

Cyber-attacks involving botnets are multi-stage attacks and primarily occur in IoT
environments; they begin with scanning activity and conclude with distributed denial of
service (DDoS). Most existing studies concern detecting botnet attacks after IoT devices
become compromised and start performing DDoS attacks. Furthermore, most machine
learning-based botnet detection models are limited to a specific dataset on which they are
trained. Consequently, these solutions do not perform well on other datasets due to the
diversity of attack patterns. In this work, real traffic data is used for experimentation. EDA
(Exploratory Data Analysis) is the statistical analysis phase through which the whole dataset
is analyzed. In the future, the model can be trained on larger datasets, and deep learning models such as ResNet50 and LSTM can be employed for run-time botnet detection. Besides being integrated with front-end web applications, the model can also be used with back-end web applications.

REFERENCES
[1] S. Dange and M. Chatterjee, "IoT botnet: The largest threat to the IoT network," in Data Communication and Networks, Cham, Switzerland: Springer, 2020, pp. 137-157.
[2] J. Ceron, K. Steding-Jessen, C. Hoepers, L. Granville and C. Margi, "Improving IoT
botnet investigation using an adaptive network layer", Sensors, vol. 19, no. 3, pp. 727,
Feb. 2019.
[3] Y. Meidan, M. Bohadana, Y. Mathov, Y. Mirsky, A. Shabtai, D. Breitenbacher, et al., "N-BaIoT: Network-based detection of IoT botnet attacks using deep autoencoders," IEEE Pervasive Comput., vol. 17, no. 3, pp. 12-22, 2018.
[4] Shah, S.A.R.; Issac, B. Performance comparison of intrusion detection systems and
application of machine learning to Snort system. Futur. Gener. Comput. Syst. 2018, 80,
157–170.
[5] Soe YN, Feng Y, Santosa PI, Hartanto R, Sakurai K. Machine Learning-Based IoT-
Botnet Attack Detection with Sequential Architecture. Sensors. 2020; 20(16):4372.
https://doi.org/10.3390/s20164372
[6] I. Ali et al., "Systematic Literature Review on IoT-Based Botnet Attack," in IEEE
Access, vol. 8, pp. 212220-212232, 2020, doi: 10.1109/ACCESS.2020.3039985.
[7] Irfan, I. M. Wildani and I. N. Yulita, "Classifying botnet attack on Internet of Things
device using random forest", IOP Conf. Ser. Earth Environ. Sci., vol. 248, Apr. 2019.
[8] Shah, T., Venkatesan, S. (2019). A Method to Secure IoT Devices Against Botnet
Attacks. In: Issarny, V., Palanisamy, B., Zhang, LJ. (eds) Internet of Things – ICIOT
2019. ICIOT 2019. Lecture Notes in Computer Science(), vol 11519. Springer, Cham.
https://doi.org/10.1007/978-3-030-23357-0_3
[9] C. Tzagkarakis, N. Petroulakis and S. Ioannidis, "Botnet Attack Detection at the IoT
Edge Based on Sparse Representation," 2019 Global IoT Summit (GIoTS), Aarhus,
Denmark, 2019, pp. 1-6, doi: 10.1109/GIOTS.2019.8766388.
[10] Y. Meidan et al., "N-BaIoT—Network-Based Detection of IoT Botnet Attacks Using
Deep Autoencoders," in IEEE Pervasive Computing, vol. 17, no. 3, pp. 12-22, Jul.-Sep.
2018, doi: 10.1109/MPRV.2018.03367731.
[11] S. I. Popoola, R. Ande, B. Adebisi, G. Gui, M. Hammoudeh and O. Jogunola,
"Federated Deep Learning for Zero-Day Botnet Attack Detection in IoT-Edge Devices,"
in IEEE Internet of Things Journal, vol. 9, no. 5, pp. 3930-3944, 1 March1, 2022, doi:
10.1109/JIOT.2021.3100755.
[12] F. Hussain et al., "A Two-Fold Machine Learning Approach to Prevent and Detect IoT
Botnet Attacks," in IEEE Access, vol. 9, pp. 163412-163430, 2021, doi:
10.1109/ACCESS.2021.3131014.
[13] Abu Al-Haija Q, Al-Dala’ien M. ELBA-IoT: An Ensemble Learning Model for Botnet
Attack Detection in IoT Networks. Journal of Sensor and Actuator Networks. 2022;
11(1):18. https://doi.org/10.3390/jsan11010018.
[14] Alharbi A, Alosaimi W, Alyami H, Rauf HT, Damaševičius R. Botnet Attack
Detection Using Local Global Best Bat Algorithm for Industrial Internet of Things.
Electronics. 2021; 10(11):1341. https://doi.org/10.3390/electronics10111341
[15] Ahmed, A.A., Jabbar, W.A., Sadiq, A.S. et al. Deep learning-based classification
model for botnet attack detection. J Ambient Intell Human Comput 13, 3457–3466
(2022). https://doi.org/10.1007/s12652-020-01848-9.
