Prevention and detection of DDOS attack in virtual cloud computing environment using Naïve Bayes algorithm of machine learning
Measurement: Sensors
Yongqiang Shang
Xinyang Agriculture and Forestry University, Department of Information Engineering, Xinyang, Henan, 464000, China
ARTICLE INFO

Keywords: Machine learning; Cyber attack; Virtual cloud computing environment; Cloud computing; Naive Bayes

ABSTRACT

The popularity of cloud computing, with its scalability and accessibility, has ushered in a new era of innovation. Consumers who subscribe to a cloud-based service and use the associated pay-as-you-go features have unlimited access to these applications and technologies. In addition to lowering prices, this model has also increased the reliability and accessibility of the offerings. One of the most crucial aspects of cloud technology is on-demand access to personal services, which is also one of its most significant advantages. Cloud-based apps are available on demand from anywhere in the world at a reduced cost. Although safety concerns trouble its users, cloud computing thrives because of its instantaneous services. There are various kinds of attack, but they all accomplish something similar: taking the systems offline. Distributed denial of service attacks are among the most harmful forms of online assault. For fast and accurate DDoS (Distributed Denial of Service) attack detection, this research introduces the DDoS attack and a method to defend against it, making the system more resistant to such attacks. In this scenario, numerous hosts are used to carry out a distributed denial of service assault against cloud-based web pages, sending possibly millions or even trillions of packets. An OS like ParrotSec is used to pave the way for the attack and make it possible. In the last phase, the most effective algorithms, such as Naive Bayes and Random Forest, are used for detection and mitigation. Another major topic is the study of the many cyber attacks that can be launched against cloud computing.
Y. Shang Measurement: Sensors 31 (2024) 100991
horizontal scaling, which refers to adding more data centres or clouds, are two of the most crucial computing features for

consequently, one’s return on investment (ROI) can be accomplished through multi-tenancy. On the same physical server, a single user can want to run multiple instances of the same program or entirely distinct ones using different virtual machines.
3. Materials and methods

3.1. Naive Bayes algorithm

The premise that the most straightforward answers often turn out to be the most enlightening is evident in Naive Bayes and may be demonstrated in practice in daily situations. Machine learning has come a long way in recent years, but its continued development shows that it can still be kept very straightforward without compromising efficiency, accuracy, or dependability. Naive Bayes serves many functions and has particular strength in resolving problems associated with natural language processing (NLP). In machine learning, the naive Bayes technique is a standard statistical methodology used to solve classification problems based on Bayes’ Theorem. The following paragraphs explain the Naive Bayes algorithm and its core concepts. The speed with which an NB model may be built makes it particularly useful when dealing with vast amounts of data. The Naive Bayes approach has been widely used because of its simplicity and its ability, in some settings, to outperform more complex classification techniques. The foundation of a Bayesian classification is the assumption that indicators can be treated separately. A Naive Bayes classifier assumes that the presence of one feature in a class does not influence the presence of any other feature, which simplifies things.

The Naive Bayes classifier is a popular supervised machine learning approach in applications like text classification. Since it models the distribution of inputs for a given class or category, it belongs to the group of learning algorithms known as generative learning approaches. To be successful, it relies on the assumption that the input data’s attributes are conditionally independent given the class. This allows the system to generate predictions quickly and accurately.

Naive Bayes classifiers, which implement Bayes’ statistical theorem, are often thought of as being used for more fundamental probabilistic categorization tasks. This theorem incorporates empirical evidence and supplementary context when determining a hypothesis’s credibility. In order to function, the naive Bayes classifier relies on the assumption that the input data’s attributes are unrelated to one another. Real-world scenarios, by contrast, usually play out differently. Although based on an unduly naive premise, the Naive Bayes classifier sees widespread application. This is because it serves its purpose well and has proven highly efficient in several practical settings.

One of the simplest Bayesian network models, naive Bayes classifiers, can achieve high levels of reliability when used in conjunction with kernel density estimation. Despite their simplicity, they are used less than other Bayesian network models. When the distribution pattern of the input data is not given, using a kernel function to approximate the probability density function of the input data can help the classifier operate better. The purpose of developing this strategy was to raise efficiency. This makes the naive Bayes classifier an effective machine learning technique for various purposes, including but not limited to text categorization, spam filtering, and sentiment analysis. Thomas Bayes is credited with developing the method for predicting a probability given a set of known probabilities, currently known as Bayes’ Theorem. Fig. 4 is the layout of Naive Bayes.

Fig. 4. Procedure for Naive Bayes.

3.2. Understanding Naive Bayes and machine learning

Machine learning has two main branches: supervised learning and unsupervised learning. Classification and regression are two subsets of supervised learning that can be distinguished here. Classification is where the Naive Bayes method excels. The naive Bayes method has been used for face recognition: people’s faces and other features, like their noses, mouths, eyes, etc., can be recognized using this classification method. In meteorology, it can be used to foretell whether the following weather will be pleasant or unpleasant. Doctors can make diagnoses with the help of the classifier, for example by assessing a patient’s likelihood of developing cancer, cardiovascular disease, or other disorders. Using a Naive Bayes classifier, Google News can decide whether a news piece is about politics, the world, or any other topic. The Naive Bayes classifier has the advantages of being simple, easily implemented, and requiring little training data. Both continuous and discrete data types are manageable using this method. It is stable even when exposed to many predictors and data points. It is fast, can be used to make real-time predictions, and is insensitive to irrelevant features.

4. Proposed method

Gathering relevant data should be the initial step. By collecting relevant data, we can locate and exploit several security holes in the victim’s computers in our attack. All available information regarding running services, open and closed ports, and other security holes is compiled during the information-gathering phase. Here, the attacker has a better chance of learning the weak spots of the victim, making further attacks much simpler. Each service offered by the cloud service provider listens on its own port number: FTP typically uses port 21 for control (port 990 is used for implicit FTPS), and HTTP uses port 80. Ports 20 through 23 are assigned to FTP data, FTP control, SSH, and Telnet, respectively.

In conclusion, gathering information is a procedure that provides an attacker with all the necessary data to launch a successful attack on any target system. In order to learn more about a network, we can employ the Nmap scanner. It simply needs the target machine’s IP address; it will then perform a full system scan, revealing the targeted system’s activity, services, open ports, and so on. This implies that once an exposed connection is found, whatever is occurring on it at that moment may be shown, regardless of what OS the other system is using. We would then come up with an attack plan, and that plan would involve a distributed denial of service attack using methods like the "ping of death." A distributed denial of service (DDoS) assault is one of the most damaging types of cyberattacks since it disrupts the entire system. Due to the flood of packets caused by the DDoS assault, all services are
either momentarily or completely inaccessible. ParrotSec, like Kali and Ubuntu, can be managed via a command-line interface, with the shell or terminal serving as the primary interface for entering these instructions. From the ParrotSec console, a command of the form "PING IP" can be typed and it will be carried out. Since the victim site would receive over 65 thousand packets, all services would be taken down. This is how an assault could be generated.

The subsequent stage is detection. In this case, the target is a website hosted in the cloud, and Nmap is used to scan the entire site in order to locate any security flaws. This would lead to the exposure of any underlying problems. After the exposed ports have been identified, a Python script implementing a distributed denial of service attack is created and run.

Wireshark thoroughly analyzes each incoming packet. After finishing the thorough packet analysis, a large data set was produced, on which a classifier can be trained. The experimental setting demonstrates that both the random forest and the naive Bayes classifier, both of which are well-known, produce excellent results. While various other classifiers may be used for detection (support vector machines, k-nearest neighbors, k-means, etc.), Naive Bayes was the most effective in this setting.

In this work, naive Bayes is applied to the problem of predicting application-layer packets during distributed denial-of-service attacks.

Notwithstanding its apparent simplicity, the Naive Bayes algorithm can make precise forecasts using the current data. The data set under consideration was used to train naive Bayes, and then a fresh information set was built using the cross-validation technique with 65 folds. This was done so that we could figure out where the files were coming from and where they were going. The true positive rate, false alarm rate, false negative rate, and many more are just some of the metrics that may be derived from this fresh information set. Naive Bayes, a technique for making predictions, produces a mix of correct and incorrect results. A false negative is considered an alarm for the benefit of internet consumers. Naive Bayes and random forest both correctly identified the true positives as ordinary packets, whereas the false negatives were classified as DDoS attacks.

5. Experimentation & results

5.1. Data pre-processing

In data mining, preliminary data processing is an essential step. It streamlines complex information into something everyone can understand. Because real-time data is often unreliable and vague, it must be pre-processed before it yields valuable information. Weka includes numerous options for preprocessing filters. A single filter, such as normalization, is chosen from the available options. Data standardization, or "making data un-redundant," refers to removing superfluous or identical information from a dataset.

5.2. Training data set

Collecting training information is part of constructing a machine-learning model. A learning algorithm requires data to train it. The training data is a subset of the entire dataset; the remainder is used for evaluation. Separating the dataset into training and testing sets is an essential first step when developing a machine learning-based model. The trained model is then needed to generate further forecasts against newly acquired data.

5.3. Prediction algorithm

Following the development and validation of the information set, various algorithms have been developed through this process to anticipate several of the issues. In this particular scenario, the task is to identify whether messages are harmful DDoS traffic or not.

5.4. Prediction of naive Bayes

The percentages of true positives and false positives are displayed in this figure. The percentage of false positives is seen as an indicator of a distributed denial of service (DDoS) attack or of fake data packets. In contrast, the proportion of true positives corresponds to normal traffic. In this case, the mean for actual packets is 0.973, while the mean for fraudulent transmissions is approximately 0.05.

5.5. Proposed formula for naive Bayes

$$P(x \mid y) = \frac{P(y \mid x)\,P(x)}{P(y)}$$

where P(x|y) is the posterior probability of class x given predictor y, P(y|x) is the conditional probability of y given x (the likelihood), P(x) is the prior probability of the class, and P(y) is the prior probability of the predictor.

5.6. Basic theory

5.6.1. Three-way handshake

The between-machine communication paradigm is depicted in Fig. 2, and it must be adhered to for the communication to succeed. This protocol is called the three-way handshake. In the scenario considered here, the exchange takes place between the server and the attacker, who acts as the client. When establishing a standard TCP connection, the client contacts the server by sending a SYN packet. In response, the server allocates a buffer for the connection and sends back a SYN-ACK packet, that is, an ACK for the client’s SYN together with its own SYN. At this stage, the connection is in a state referred to as "half-open," and the server waits for an ACK from the client in order to complete the connection setup. Once that final ACK arrives, the connection is established and the three-way handshake is complete.
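The server-side view of the three-way handshake can be illustrated with a small state tracker. This is only a minimal sketch in Python that models the states in memory and uses no real sockets; the names `HandshakeTracker`, `on_syn`, `on_ack`, and `demo`, as well as the client addresses, are hypothetical and chosen for illustration, not taken from the paper.

```python
from enum import Enum


class State(Enum):
    SYN_RECEIVED = "half-open"     # server replied SYN-ACK, waiting for the final ACK
    ESTABLISHED = "established"    # final ACK received, handshake complete


class HandshakeTracker:
    """Toy server-side model of the TCP three-way handshake (no real sockets)."""

    def __init__(self):
        self.connections = {}  # (client_ip, client_port) -> State

    def on_syn(self, client):
        # Steps 1 and 2: the client sends SYN; the server allocates state
        # for the connection and answers with SYN-ACK.
        self.connections[client] = State.SYN_RECEIVED
        return "SYN-ACK"

    def on_ack(self, client):
        # Step 3: the final ACK completes the handshake, but only if a SYN
        # from the same client was seen first.
        if self.connections.get(client) is State.SYN_RECEIVED:
            self.connections[client] = State.ESTABLISHED
            return True
        return False

    def count(self, state):
        return sum(1 for s in self.connections.values() if s is state)


def demo():
    tracker = HandshakeTracker()
    clients = [("10.0.0.1", 40001), ("10.0.0.2", 40002), ("10.0.0.3", 40003)]
    for client in clients:
        tracker.on_syn(client)
    tracker.on_ack(clients[0])  # only the first client completes the handshake
    return tracker.count(State.SYN_RECEIVED), tracker.count(State.ESTABLISHED)


if __name__ == "__main__":
    half_open, established = demo()
    print(f"half-open: {half_open}, established: {established}")
```

In this toy model every SYN that is never followed by an ACK leaves an entry in the half-open state, which is the per-connection resource the server keeps reserved while it waits.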
On the other hand, TCP SYN Flood attacks are intended to exploit this three-way handshake by saturating the server with an excessive number of SYN requests. The denial of service attack, of which TCP SYN Flood is a prominent example, falls within the DoS category. For a packet capture program to identify a TCP SYN Flood, it must observe the connection over a prolonged period and monitor a copy of the server’s traffic. The appearance of a new incoming IP address at the server typically coincides with the manifestation of TCP SYN Flood properties. IP addresses that continually show up at the server are counted within a predetermined period, and these counts are used as characteristics of a DDoS attack.

5.6.2. Naive Bayes algorithm

Bayes’ theorem, on which Naive Bayes is based, is a simple computational approach for calculating conditional probabilities. A conditional probability quantifies the likelihood of one event given the presumption, premise, declaration, or reality that a second event has already occurred. An example would be the chance of something happening after something else has happened. The posterior probability can be computed using the formula below.

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$

Here P(A|B) represents the conditional probability of A given that B is true, and P(B|A) is the conditional probability of B given that A is true. P(A) stands for the probability of event A, and P(B) stands for the probability of event B. The packet-capturing software provides the computational input, namely the IP address and the packet length obtained. We did the maths using the Naive Bayes method and the Gaussian distribution. After the computation, the outcomes are displayed on a two-dimensional plot. The Gaussian Naive Bayes approach, which requires the calculation of the mean and standard deviation for analysis, is applied once the quantitative input has been gathered. Table 1 is about the dataset format sample.

5.6.3. Classification in MATLAB using the Naive Bayes algorithm

MATLAB is the application we employ for classification because it is not only user-friendly but also effective at producing clear visual output. For data analysis, a tool built into MATLAB allows users to perform Naive Bayes classification. Using this method, we can also classify network traffic as either K, L, or Q to gain further insight into the type of data transmitted over an internet connection. The MATLAB script for the Naive Bayes classification and the parameters that go along with it are displayed in the following figure. The results of classifying the information obtained from the system are shown in the figure. The nonlinear blue line marks the boundary of the normal class, of which the green circles are members. The other class is shown as red squares and represents a threat. Fig. 5 shows the DDoS attack detection using MATLAB.

6. Conclusion

The key goals of this study are to learn how to recognize and prevent attacks involving distributed denial-of-service. The first and most crucial step is determining which ports can be exploited. Nevertheless, this approach is not risk-free because susceptible ports are more likely to be exploited. Given ParrotSec’s track record for stability and performance, we decided it would be the ideal choice for our company’s computer system. Since a DDoS attack involves sending one million separate packets toward the target, an internet-facing website was the best starting point. The targeted website was taken offline after it became clear that an assault had happened. Machine learning is also useful in this detection process. Using this data, a model is trained in Weka, a popular and accessible tool, employing pre-processing techniques and the "discretize" filter to achieve the desired effect. The following phase is therefore useful for both forecasting and detecting. We employed both methods and compared the findings on the same platform, and we found that the naive Bayes method provides the most trustworthy conclusions. PCA selected 21 features from the possible 42 features, while LVQ selected only 20 features. The results suggest that LVQ-based feature selection in the DT model may be more accurate than other methods in identifying attacks. As mentioned earlier, the model also outperformed the previous models in terms of accuracy, recall, specificity, and F-score. It was shown that the naive Bayes model had significantly better predictive power than the random forest model and produced considerably more accurate forecasts. There is a chance that a false positive warning will be triggered for packet transmissions within a network. It was also demonstrated that the Naive Bayes algorithm outperformed the random forest technique in identifying the false and actual rate of transmissions. Detection is not carried out in real time. Although attacks can be detected, real-time alarms cannot be realized in a high-security cluster environment, so the feasibility of real-time monitoring on the Hadoop platform should be studied further.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
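The Gaussian Naive Bayes computation described in Section 5.6.2 (per-class mean and standard deviation, combined through Bayes' theorem) can be sketched in a few lines of Python. This is an illustrative sketch, not the MATLAB script used in the paper: the function names, the two features (packet length and packets per source IP in a time window), and the toy training values are all invented for the example.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(rows, labels):
    """Estimate per-class priors and per-feature (mean, std) from training rows."""
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)
    model = {}
    for label, members in by_class.items():
        n = len(members)
        stats = []
        for column in zip(*members):
            mean = sum(column) / n
            var = sum((v - mean) ** 2 for v in column) / n
            stats.append((mean, math.sqrt(var) + 1e-9))  # epsilon avoids a zero std
        model[label] = (n / len(rows), stats)
    return model


def log_gaussian(x, mean, std):
    """Log of the normal density N(x; mean, std^2)."""
    return -0.5 * math.log(2 * math.pi * std ** 2) - (x - mean) ** 2 / (2 * std ** 2)


def predict(model, row):
    """Return the class with the largest log posterior.

    The denominator P(B) is the same for every class, so it is omitted.
    """
    scores = {}
    for label, (prior, stats) in model.items():
        score = math.log(prior)
        # Naive assumption: features are independent given the class,
        # so their log-likelihoods simply add up.
        for x, (mean, std) in zip(row, stats):
            score += log_gaussian(x, mean, std)
        scores[label] = score
    return max(scores, key=scores.get)


# Invented toy data: (packet length in bytes, packets from the source IP per window).
TRAIN_X = [(520, 3), (610, 5), (480, 2), (700, 4),         # normal traffic
           (60, 900), (64, 1200), (58, 1000), (62, 1100)]  # flood traffic
TRAIN_Y = ["normal"] * 4 + ["ddos"] * 4

if __name__ == "__main__":
    nb = fit_gaussian_nb(TRAIN_X, TRAIN_Y)
    print(predict(nb, (600, 4)))
    print(predict(nb, (61, 1050)))
```

Working in log space avoids numerical underflow when many small likelihoods are multiplied. A library implementation would normally be preferred in practice; the hand-written version is shown only to make the mean and standard deviation step explicit.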