Ensemble-Based Botnet Attack Detection and Classification Using Machine Learning Algorithms on NBaIoT Dataset
Abstract—Through botnet assaults, Mirai and BASHLITE present serious risks in the context of the Internet of Things (IoT). A reliable detection and classification model is the need of the hour. This article presents a solution to the problem by suggesting an ensemble-based algorithm to detect and classify different categories of Mirai and BASHLITE attacks. We use NBaIoT, a heterogeneous dataset collected from several purposefully compromised devices, which exposes subtle attack categories. Interestingly, certain devices resisted Mirai infection, which enriches our analysis. By utilizing important data types and various time-based window sizes, our approach seeks to create an effective model for identifying and classifying IoT botnet assaults to strengthen the IoT ecosystem against new attacks.
Index Terms—Internet of Things, Intrusion Detection, Botnet, Mirai, BASHLITE

I. INTRODUCTION
Understanding IoT requires envisioning a vast interconnected web of devices, ranging from smart thermostats to security cameras, collectively contributing to a digitally woven tapestry. However, this interconnectedness brings forth vulnerabilities, as cyber threats adapt and evolve to exploit potential weaknesses in this expansive ecosystem [1]. In the vast landscape of IoT, a lurking threat has emerged: the IoT botnet attack. These attacks involve the manipulation of IoT devices to form a network, commonly referred to as a botnet [2]. This web of compromised devices becomes a powerful tool for cyber attackers, particularly in executing distributed denial-of-service (DDoS) attacks [3], [4].
Our research delves into the intricacies of these attacks, focusing on the notorious Mirai and BASHLITE botnets, with the ultimate goal of developing a robust model for the detection and classification of these attacks. BASHLITE, also known as Gafgyt, is malicious software that specifically targets Linux systems. Its primary aim is to infect these systems and launch DDoS attacks, which can be as formidable as 400 Gbps [5]. In 2014, BASHLITE made its debut by exploiting a vulnerability in the bash shell, known as the Shellshock software bug. Its source code was leaked in 2015, leading to a proliferation of variants, with infections reaching one million devices by 2016. The primary targets of BASHLITE are IoT devices, with a staggering 96 percent of identifiable devices in botnets being IoT devices, including cameras and DVRs [6].
Mirai, another infamous botnet, has a distinct modus operandi. It targets smart devices running on ARC processors, transforming them into remotely controlled bots that together form a botnet [7]. This botnet, in turn, is employed to launch DDoS attacks, harnessing the combined power of the infected devices. Mirai scans the Internet for vulnerable IoT devices, exploiting default username and password combinations to gain unauthorized access. Once inside, it infects the device, adding it to its legion of controlled devices [8].
The major contributions of the paper are:
• To thoroughly analyze the NBaIoT (Network Based Anomaly detection in IoT) dataset and compare the performance of various machine learning (ML) and Deep Learning (DL) techniques on the given dataset.
• To use various ML and DL paradigms to contribute towards forming a reliable detection model using a variation of the Stacking Ensemble technique to create a resource-efficient detection model to operate on IoT devices.
• To improve upon potentially outdated models applied to

This project work was Sponsored by the SERB SURE funding agency, India. Application Number: SUR/2022/001939.
979-8-3503-6486-6/24/ $31.00 ©2024 IEEE
Authorized licensed use limited to: Netaji Subhas University of Technology New Delhi. Downloaded on December 12,2024 at 07:10:36 UTC from IEEE Xplore. Restrictions apply.
In another work, several experiments were conducted using a Deep Neural Network (DNN) to detect IoT attacks. Widely used datasets such as KDD-Cup’99, NSL-KDD, and UNSW-NB15 were used to assess the performance of the model. The DNN-based model achieved more than 90% accuracy on all of these datasets [13].
Different ML approaches, including Multi-Layer Perceptron Artificial Neural Network (MLP ANN), K-Nearest Neighbour (KNN), and Naive Bayes, were used to detect DDoS attacks. The dataset used for the evaluation is BoT-IoT. The algorithms were run on two versions of the dataset: the original dataset and a class-balanced dataset created by applying the SMOTE technique to the original. Applying SMOTE improved accuracy, precision, and recall. Across the complete set of experiments, KNN performed best [14].
The authors of [15] proposed using both supervised and unsupervised ML algorithms for botnet detection in IoT devices.
XGBoost                       RF
Parameter          Value      Parameter          Value
alpha              0.06       n estimators       100
gamma              0.3        max features       0.4
max depth          4          min sample leaf    10
min child weight   5          max depth          9
n estimators       80
TABLE I: Values assigned to Different Parameters of XGBoost and RF Model for Binary Classification

                      ANN                 RNN           Final Model
Features              L1    L2    L3      L1    L2      L1    L2    L3
Activation Function   sig   relu  sig     sig   sig     relu  relu  sig
No. of Perceptrons    32    24    1       48    1       16    8     1
TABLE II: Value of Different Parameters of ANN, RNN and Final model for Binary Classification
Note* L stands for Layer
the packet’s host + port to the packet’s destination host + port (HpHp) [17]. These discerning data types collectively serve as the keystones for unraveling the complexities embedded within the network traffic, setting the stage for the subsequent phases of our research endeavor.
To capture the essence of network traffic statistics, we considered different window sizes based on time, denoted by Lambda (L). Ranging from shorter time frames like L5 and L3 to longer time frames like L0.01, these windows provide a nuanced understanding, detecting both short-lived and prolonged attacks [17]. Noteworthy is the distinctive status of devices 3 and 7. Mirai did not infect these devices, so no Mirai-related data exists for them. This peculiarity enriches the dataset with a nuanced diversity, fostering a conducive environment for an in-depth analysis of the malevolent intricacies pervasive within the IoT ecosystem.

B. Data Analysis and Pre-processing
The raw dataset comprises 7062606 instances with 115 features each, i.e., 7062606 × 115. The input data is purely numerical for every input column. There are no missing values in the raw data. The raw data contains 4579930 duplicates and is highly imbalanced across categories. Outliers were also present, but we decided not to remove them, as this is a model for anomaly detection and outliers may point towards anomalies. The steps taken to pre-process the raw data are shown in Algorithm 1.

Algorithm 1 Algorithm to Pre-process Dataset
Require: Raw Data
Ensure: Separate .csv files for various data frames.
1: Accumulate data from all the different files into a combined data frame.
2: Add the labels for each instance.
3: Remove Duplicate data.
4: Balance the dataset by reducing the attack instances.
5: Scaling.
6: Reduce Dimensions using PCA.
7: Divide the data into train, test, and valid sets.
8: Save each data frame into separate .csv files to avoid mixing up.

C. Binary Classification Model
The binary classification model is trained to classify between the benign state and the botnet attack state. Different models were trained on the same data of 496614 × 116, which were finally used for ensemble learning to make a final prediction. We have used the following models.
1) Random Forests: RF is a well-known ensemble algorithm that can be leveraged for both classification and regression problems [18]. Parameters are decided through hyperparameter tuning using Randomized Search CV. Table I shows the final hyperparameters.
2) Artificial Neural Network (ANN): ANN is an interconnected network of neurons [19]. It has three types of layers: an input layer, multiple hidden layers, and an output layer, which receive input, perform computation, and produce output, respectively. Hyperparameters are devised using Keras-tuner. It is a feed-forward neural network. All the layers are 'Dense' and the model is 'Sequential'. The total number of layers used in the model is 3. The final hyperparameters are shown in Table II.
3) Recurrent Neural Network (RNN): RNN is a type of NN that is designed specially to handle sequential data such as time series data [20]. It makes use of feedback loops to remember the previous input. Hyperparameters such as the number of hidden layers, type of activation function, and size of each layer are devised using Keras-tuner. Table II shows the final hyperparameters.
4) XGBoost: eXtreme Gradient Boosting is a supervised ML algorithm that provides an efficient and scalable implementation of gradient boosting [21]. It is capable of dealing with large datasets and complicated models due to its parallel processing capabilities. Parameters for the model are decided through hyperparameter tuning using Randomized Search cross Validation (CV). The final hyperparameters are shown in Table I.
However, combining the above models raises two problems. First, not every model is equally good, so a weight must be associated with each model's outcome when predicting the final outcome. Second, it is unclear how to decide the values of these weights. The solution proposed in this paper is to obtain the probability associated with the predictions of 5 different models and then concatenate these output prediction probabilities into a new data frame, which is used to train a final neural network that learns the weight associated with each model. The result is a neural network of models; it can also be viewed as a form of feature engineering, since new features are created to fit the final neural network.
5) Final model: In the proposed model we have produced the confidence of prediction of each model for both
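The stacking idea described in this subsection can be illustrated with a minimal sketch. It is not the authors' implementation: the base-model probabilities below are made-up values, only three hypothetical base models are shown, and the final neural network is replaced by a single sigmoid unit trained with plain gradient descent so that the example stays dependency-free. All function and variable names are assumptions introduced for illustration.

```python
import math


def sigmoid(z):
    """Logistic function used by the single-unit meta-learner."""
    return 1.0 / (1.0 + math.exp(-z))


def stack_probabilities(per_model_probs):
    """Concatenate per-model P(attack) lists into one feature row per sample."""
    return [list(row) for row in zip(*per_model_probs)]


def train_meta_learner(rows, labels, epochs=500, lr=0.5):
    """Fit one sigmoid unit on the stacked probabilities (full-batch gradient descent).

    The learned weights play the role of the per-model weights discussed in the text.
    """
    n, d = len(rows), len(rows[0])
    weights, bias = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, y in zip(rows, labels):
            err = sigmoid(sum(w * v for w, v in zip(weights, x)) + bias) - y
            grad_w = [g + err * v for g, v in zip(grad_w, x)]
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def predict_attack_probability(weights, bias, x):
    """Final ensemble confidence that a sample is an attack."""
    return sigmoid(sum(w * v for w, v in zip(weights, x)) + bias)


# Hypothetical P(attack) outputs of three base models on eight samples.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
model_a = [0.1, 0.2, 0.3, 0.1, 0.9, 0.8, 0.7, 0.9]  # strong base model
model_b = [0.4, 0.6, 0.5, 0.3, 0.6, 0.4, 0.7, 0.5]  # weak, noisy base model
model_c = [0.2, 0.1, 0.4, 0.3, 0.6, 0.9, 0.8, 0.7]  # fairly strong base model

meta_rows = stack_probabilities([model_a, model_b, model_c])
weights, bias = train_meta_learner(meta_rows, labels)
predictions = [
    int(predict_attack_probability(weights, bias, x) >= 0.5) for x in meta_rows
]
```

In a realistic setting the meta-learner would be trained on probabilities produced for held-out data rather than on the base models' own training samples, so that it does not simply learn to trust overfitted base models.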
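The pre-processing steps listed in Algorithm 1 could be sketched roughly as follows. This is a simplified illustration under several assumptions that the paper does not state: data is held in memory instead of being read from and written to .csv files, balancing undersamples the attack rows down to the benign count, scaling is min-max, the split is 70/15/15, and the PCA step is omitted. The function name `preprocess` and the toy values are hypothetical.

```python
import random


def preprocess(frames, seed=0, split_percent=(70, 15, 15)):
    """Combine, label, de-duplicate, balance, scale, and split labelled rows.

    frames maps a label (e.g. "benign", "mirai") to a list of feature tuples.
    Returns (train, valid, test), each a list of (features, label) pairs.
    """
    rng = random.Random(seed)

    # Steps 1-2: accumulate all sources into one list and attach labels.
    rows = [(tuple(f), label) for label, feats in frames.items() for f in feats]

    # Step 3: remove exact duplicates while preserving order.
    rows = list(dict.fromkeys(rows))

    # Step 4: balance by undersampling attack rows (assumed strategy).
    benign = [r for r in rows if r[1] == "benign"]
    attack = [r for r in rows if r[1] != "benign"]
    if len(attack) > len(benign):
        attack = rng.sample(attack, len(benign))
    rows = benign + attack
    rng.shuffle(rows)

    # Step 5: min-max scale each feature column to [0, 1] (assumed scaler).
    # Scaling happens before the split, following the order in Algorithm 1.
    columns = list(zip(*(f for f, _ in rows)))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    rows = [
        (tuple((v - lo) / (hi - lo) if hi > lo else 0.0
               for v, lo, hi in zip(f, lows, highs)), y)
        for f, y in rows
    ]

    # Step 6 (PCA) is omitted in this sketch.

    # Step 7: split into train / valid / test using integer percentages.
    n = len(rows)
    n_train = n * split_percent[0] // 100
    n_valid = n * split_percent[1] // 100
    return rows[:n_train], rows[n_train:n_train + n_valid], rows[n_train + n_valid:]


# Toy stand-in for the per-device files; values are invented.
frames = {
    "benign": [(1.0, 10.0), (2.0, 12.0), (3.0, 11.0), (4.0, 13.0), (5.0, 14.0), (1.0, 10.0)],
    "mirai": [(50.0, 90.0), (52.0, 95.0), (55.0, 91.0), (58.0, 99.0), (50.0, 90.0)],
    "gafgyt": [(30.0, 60.0), (33.0, 62.0), (35.0, 65.0), (38.0, 61.0)],
}
train, valid, test = preprocess(frames)
```

Step 8 would then write `train`, `valid`, and `test` to separate files, for example with the standard `csv` module.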
Fig. 3: Accuracies of Different Models for Binary Classification
Fig. 4: Confusion Matrix for XGBOOST
TABLE IV: Values assigned to Different Parameters of XGBoost and RF Model for Multi-Class Classification
2) ANN: Hyperparameters are devised using Keras-tuner. It is a feed-forward neural network with a total of five layers. All the layers are 'Dense' and the model is 'Sequential'. Table V shows the activation function and the number of perceptrons used in each layer.
3) RNN: Hyperparameters such as the number of hidden layers, type of activation function, and size of each layer are devised using Keras-tuner. It is a recurrent neural network whose first layer is of type LSTM. The parameters used are shown in Table V.
4) XGBoost: Parameters are decided through hyperparameter tuning using Randomized Search CV. The best model found by the randomized search over the parameter grid shown in Table IV was used further.
Fig. 6: Accuracies of Different Models for Multi-class Classification
                      ANN                             RNN                             Final Model
Features              L1    L2    L3    L4    L5      L1    L2    L3    L4    L5      L1    L2    L3    L4
Activation Function   tanh  relu  relu  relu  soft    tanh  sig   tanh  relu  soft    sig   sig   sig   soft
No. of Perceptrons    72    56    56    72    11      48    16    8     8     11      40    40    40    11
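Each model in the table above ends in a layer of 11 units marked "soft", which presumably abbreviates softmax; the 11 units plausibly correspond to the 10 attack types plus the benign class, although this mapping is an assumption. The following minimal sketch shows how such a layer turns 11 raw scores into class probabilities. The scores are invented for illustration.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1.

    Subtracting the maximum first keeps math.exp from overflowing.
    """
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical raw outputs of the 11-unit final layer for one sample.
logits = [0.0] * 11
logits[4] = 3.0

probs = softmax(logits)
# The predicted class is the index with the highest probability.
predicted_class = max(range(len(probs)), key=probs.__getitem__)
```

The per-class probabilities, rather than only `predicted_class`, are what a stacking ensemble would pass on to its final model.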
Fig. 11: Recall and Precision of Different Models for Multi-class Classification

Multi-class classification: The slight differences in accuracies among the models for differentiating between 10 types of IoT botnet attacks highlight their varied strengths in handling complex classification tasks. RNNs and RF offer solid performance (as shown by the confusion matrices in Fig. 7 and Fig. ??) but fall slightly behind due to their inherent limitations with static and high-dimensional data, respectively. In contrast, ANNs (Fig. 8) excel in capturing intricate patterns and relationships within the data, making them particularly effective for classifying multiple botnet types compared to a simpler binary classification. XGBoost (Fig. 9) also stands out with its advanced boosting techniques, providing exceptional accuracy through iterative error correction and fine-tuning. The ensemble model (Fig. 10), combining the strengths of ANN and XGBoost along with other models, harnesses these advantages to achieve the highest overall accuracy, demonstrating the effectiveness of ensemble methods in complex multi-class classification scenarios.

V. CONCLUSION AND FUTURE WORK
The article presents a reliable detection and classification method for various types of botnet attacks using an ensemble algorithm that combines the prediction probabilities of different models. Using prediction confidence, rather than the predictions themselves, can be more precise and accurate, as the confidence of a prediction can be a better input than the prediction itself for producing the final output. This was evident on this dataset, where the final model outperformed the individual models even though the individual models already had good accuracies. Although the proposed model gives good results in terms of accuracy, recall, precision, and other parameters, it does not consider the privacy of users' sensitive data, and it is computationally intensive.
Future work includes applying this variation of ensemble stacking to other datasets, with different individual models and types of datasets, to find conclusive evidence for this idea. Future work may also include using techniques such as federated learning and pruning to overcome the identified limitations.

REFERENCES
[1] G. Singal, V. Laxmi, M. S. Gaur, S. Todi, V. Rao, M. Tripathi, and R. Kushwaha, “Multi-constraints link stable multicast routing protocol in manets,” Ad Hoc Networks, vol. 63, pp. 115–128, 2017.
[2] M. Feily, A. Shahrestani, and S. Ramadass, “A survey of botnet and botnet detection,” in 2009 Third International Conference on Emerging Security Information, Systems and Technologies. IEEE, 2009, pp. 268–273.
[3] N. Ahuja, G. Singal, and D. Mukhopadhyay, “Dlsdn: Deep learning for ddos attack detection in software defined networking,” in 2021 11th International Conference on Cloud Computing, Data Science & Engineering (Confluence). IEEE, 2021, pp. 683–688.
[4] N. Ahuja, G. Singal, D. Mukhopadhyay, and N. Kumar, “Automated ddos attack detection in software defined networking,” Journal of Network and Computer Applications, vol. 187, p. 103108, 2021.
[5] A. Marzano, D. Alexander, O. Fonseca, E. Fazzion, C. Hoepers, K. Steding-Jessen, M. H. Chaves, Í. Cunha, D. Guedes, and W. Meira, “The evolution of bashlite and mirai iot botnets,” in 2018 IEEE Symposium on Computers and Communications (ISCC). IEEE, 2018, pp. 00813–00818.
[6] G. Bastos, A. Marzano, O. Fonseca, E. Fazzion, C. Hoepers, K. Steding-Jessen, Í. Cunha, D. Guedes, and W. Meira, “Identifying and characterizing bashlite and mirai c&c servers,” in 2019 IEEE Symposium on Computers and Communications (ISCC). IEEE, 2019, pp. 1–6.
[7] M. Antonakakis, T. April, M. Bailey, M. Bernhard, E. Bursztein, J. Cochran, Z. Durumeric, J. A. Halderman, L. Invernizzi, M. Kallitsis et al., “Understanding the mirai botnet,” in 26th USENIX security symposium (USENIX Security 17), 2017, pp. 1093–1110.
[8] W. T. Strayer, D. Lapsely, R. Walsh, and C. Livadas, “Botnet detection based on network behavior,” Botnet Detection: Countering the Largest Security Threat, pp. 1–24, 2008.
[9] A. Husain, A. Salem, C. Jim, and G. Dimitoglou, “Development of an efficient network intrusion detection model using extreme gradient boosting (xgboost) on the unsw-nb15 dataset,” in 2019 IEEE International Symposium on Signal Processing and Information Technology (ISSPIT). IEEE, 2019, pp. 1–7.
[10] R. Kumar, M. Swarnkar, G. Singal, and N. Kumar, “Iot network traffic classification using machine learning algorithms: An experimental analysis,” IEEE Internet of Things Journal, vol. 9, no. 2, pp. 989–1008, 2021.
[11] A. Alharbi and K. Alsubhi, “Botnet detection approach using graph-based machine learning,” IEEE Access, vol. 9, pp. 99166–99180, 2021.
[12] D. Hemanth et al., “Intrusion detection system using convolutional neural network on unsw nb15 data-set,” Advances in Parallel Computing Technologies and Applications, vol. 40, pp. 1–8, 2021.
[13] S. Choudhary and N. Kesswani, “Analysis of kdd-cup’99, nsl-kdd and unsw-nb15 datasets using deep learning in iot,” Procedia Computer Science, vol. 167, pp. 1561–1573, 2020.
[14] S. Pokhrel, R. Abbas, and B. Aryal, “Iot security: botnet detection in iot using machine learning,” arXiv preprint arXiv:2104.02231, 2021.
[15] M. G. Desai, Y. Shi, and K. Suo, “A hybrid approach for iot botnet attack detection,” in 2021 IEEE 12th Annual Information Technology, Electronics and Mobile Communication Conference (IEMCON). IEEE, 2021, pp. 0590–0592.
[16] Y. Meidan, M. Bohadana, Y. Mathov, Y. Mirsky, A. Shabtai, D. Breitenbacher, and Y. Elovici, “N-baiot—network-based detection of iot botnet attacks using deep autoencoders,” IEEE Pervasive Computing, vol. 17, no. 3, pp. 12–22, 2018.
[17] H. Hamid, R. M. Noor, S. N. Omar, I. Ahmedy, S. S. Anjum, S. A. A. Shah, S. Kaur, F. Othman, and E. M. Tamil, “Iot-based botnet attacks systematic mapping study of literature,” Scientometrics, vol. 126, pp. 2759–2800, 2021.
[18] L. Breiman, “Random forests,” Machine learning, vol. 45, pp. 5–32, 2001.
[19] B. Yegnanarayana, Artificial neural networks. PHI Learning Pvt. Ltd., 2009.
[20] L. R. Medsker, L. Jain et al., “Recurrent neural networks,” Design and Applications, vol. 5, no. 64-67, p. 2, 2001.
[21] T. Chen and C. Guestrin, “Xgboost: A scalable tree boosting system,” in Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining, 2016, pp. 785–794.