To construct detection and defense methods that meet the needs of the big data environment, we face two main issues. The first is availability. As the number of devices attached to the network grows rapidly, the device population is no longer dominated by regular computers such as terminals, servers, or routers; instead, network nodes are increasingly IoT devices such as civilian cameras, vending machines, and even advertisement display boards. These devices now make up the largest share of the nodes in a network, and attackers can exploit them easily and at much lower cost than the conventional approach of using zombie machines. For the method to fit this situation and be deployable on every kind of node, we should pick features that are as basic as possible. The second issue is time complexity. Given the high velocity and volume of data flowing through the network and waiting to be processed, the detection algorithm must be efficient enough to handle it.
4.1. Extract Features at Different Layers
Given a network flow whose sample packets are to be detected, we define each packet as a standard packet at the third layer of the OSI reference model, i.e., an IP packet. Normally, the payload carried by the IP layer is either a TCP or a UDP packet, which in turn carries an application-layer protocol such as DNS or NTP, as shown in Figure 2.
We define each IP packet as a five-tuple consisting of the source IP, the destination IP, the source port, the destination port, and the payload, i.e., the application-layer packet carried by the IP packet. We also take the sampling time as a parameter, and we refer to a vulnerable service for DRDoS (for example, DNS, NTP, UPnP, or BT-DHT) as a VSD.
In each sample interval, we merge all source IPs and destination IPs into a single set $M$. For the $i$-th IP in $M$, we extract a six-tuple feature, HDTI, with the following components: the number of vulnerable-service request packets sent by the $i$-th source IP through the node in a fixed period of time; the volume per unit time of these request packets; the number of unique source port numbers of these request packets; the number of response packets sent to the $i$-th destination IP through the current node in a fixed period of time; the volume per unit time of these response packets; and the number of unique destination port numbers of the response packets sent to that destination IP.
These features are calculated for each IP in each sample interval, as follows.
When an attacker initiates a DRDoS attack through some vulnerable service, there is a large number of request packets sent to the reflectors and of response packets returned by them. We therefore count the number of request packets and response packets for each source IP and destination IP, respectively. For request packets sent to the vulnerable service, we use a dictionary whose keys are source IPs and whose values are the corresponding numbers of request packets declaring that source IP. For response packets from the vulnerable service, we use a second dictionary whose keys are destination IPs and whose values are the corresponding numbers of response packets sent to that destination IP.
At the end of each sample interval, we calculate the number of request packets and of response packets for each IP.
For request and response packets, we calculate the volume per unit time of the packets with the same source IP and with the same destination IP separately. We take the length of each packet and, for request packets, use a dictionary with the source IP as its key and the total length of the packets from that source IP as its value. For response packets, a dictionary is defined analogously with the destination IP as its key.
Then we can calculate the volume per unit time for each IP in $M$ as in Equations (5) and (6).
An abnormally large request volume per unit time indicates a possible DRDoS attack, because some vulnerable services require larger request packets to obtain more amplification of the response flow from the reflectors; we therefore extract this basic feature from the request packets sent to the vulnerable service. The response volume per unit time is clearly the key indicator of a DRDoS attack: if it is abnormally large, the corresponding IP is likely under DRDoS attack.
Because each IP packet occupies one source port or one destination port of a machine at a time, we also take the number of ports into consideration. Likewise, we use a dictionary with the source IP as its key, whose corresponding value is the set of unique source ports used by that source IP. For response packets, a dictionary is defined similarly, keyed by the destination IP.
Then we can calculate the two port-count features as in Equations (8) and (9).
We use the numbers of unique request ports and unique response ports as two further basic features in HDTI. To make a DRDoS attack effective, the attacker sends as many request packets to the vulnerable service as possible, so many request packets are sent at the same time. Each of them requires a unique source port number, so the number of unique request ports becomes abnormally large. Moreover, following the principles of TCP/IP, a response packet's destination port number equals the source port number of the corresponding request packet, which suggests that the number of unique response ports is abnormally large as well if the destination IP is under DRDoS attack.
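The six components described above can be computed in a single pass over the packets of one sample interval. The following is a minimal Python sketch under stated assumptions: the `Packet` record, its field names, the `is_request` flag, and the function `extract_hdti` are hypothetical names introduced for illustration, and the volume per unit time is assumed to be the total packet length divided by the sampling time.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    """Hypothetical record for one vulnerable-service packet."""
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    length: int        # packet length in bytes
    is_request: bool   # True for a request, False for a response


def extract_hdti(packets, sample_time):
    """Return {ip: 6-tuple} for one sample interval.

    Assumed tuple order: request count, request volume per unit time,
    unique request source ports, response count, response volume per
    unit time, unique response destination ports.
    """
    req_count, resp_count = defaultdict(int), defaultdict(int)
    req_len, resp_len = defaultdict(int), defaultdict(int)
    req_ports, resp_ports = defaultdict(set), defaultdict(set)

    for p in packets:
        if p.is_request:
            # Request-side dictionaries are keyed by the declared source IP.
            req_count[p.src_ip] += 1
            req_len[p.src_ip] += p.length
            req_ports[p.src_ip].add(p.src_port)
        else:
            # Response-side dictionaries are keyed by the destination IP.
            resp_count[p.dst_ip] += 1
            resp_len[p.dst_ip] += p.length
            resp_ports[p.dst_ip].add(p.dst_port)

    # Merged set of request source IPs and response destination IPs.
    all_ips = set(req_count) | set(resp_count)
    return {
        ip: (
            req_count.get(ip, 0),
            req_len.get(ip, 0) / sample_time,
            len(req_ports.get(ip, ())),
            resp_count.get(ip, 0),
            resp_len.get(ip, 0) / sample_time,
            len(resp_ports.get(ip, ())),
        )
        for ip in all_ips
    }
```

Keeping only counters, length sums, and port sets per IP keeps the per-packet work constant, which is in line with the requirement that the features be as basic as possible.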
4.2. Analysis of the Feature Value
We characterize each component of the proposed six-tuple feature using real-world observations to explain why it is effective for both detection and defense.
In a destructive DRDoS attack, when the attacker launches the initial packets toward the reflectors, an obvious growth in the number of request packets can be observed. Also, because requests are sent to many reflectors within a fixed sampling time, the number of occupied ports on the victim's source IP expands quickly. The request volume per unit time may not grow explosively, but the anomaly is still visible when compared with normal data flow.
On the path from the reflectors to the victim, a surge in the number and in the volume per unit time of response packets can be seen. Because of the request packets, the destination ports of the response packets sent to the intended victim show a uniform distribution. Along the attack path, the attack flow accumulates, which leads to rapid growth in each component. Based on the phenomena at the different stages of a DRDoS attack, the validity of the feature components can be discussed in three situations.
Attack Source. A relatively abnormal growth in the three request-side components can be observed. By applying the features to the deep forest classifier, we can detect the upstream of the attack flow. With the classifier's result, we can use differentiated service to drop those upstream packets before they reach the reflectors, so as to reduce the number of abnormal packets sent to the reflectors.
Intended Victim. Abnormally large values of the three response-side components can be observed. Moreover, the closer a node is to the intended victim, the larger the components extracted at that node are, because the attack flow converges from the reflectors onto the intended victim. In response, the detection mechanism using random forest deployed on the intended victim's side can raise an alert and activate defensive measures that eliminate the downstream of the attack flow toward the intended victim.
Internal Nodes in the Internet. Internal nodes can both receive the upstream of the attack flow and forward its downstream, which means that both streams can be observed and their features extracted. We call a flow with these features mixed upstream and downstream (MUD). When the attack flow lies in the MUD, we can still recognize the threat by distinguishing it from normal flow with random forest, and then initiate differentiated service to drop the attack packets, so that the attack flow is reduced and the network load is relieved.
Given the considerations and assumptions above, we can classify the proposed feature into 4 classes, as illustrated in Figure 3. We define 0 as a relatively low value and 1 as a relatively large value in the corresponding position of the 6-tuple feature HDTI.
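To make the 0/1 encoding concrete, the sketch below maps a binarized 6-tuple to one of the four classes. It is only an illustration under assumptions: the function name `node_class`, the ordering of the tuple (request-side components first), and the rule that a side counts as high when any of its three flags is 1 are not taken from Figure 3, and in the proposed method the decision is made by the trained classifier rather than by a fixed rule.

```python
def node_class(bits):
    """Map a 6-bit HDTI pattern to one of the four classes.

    bits: six 0/1 flags, request-side components first (assumed order).
    Assumption: a side is "high" when any of its three flags is 1.
    """
    if len(bits) != 6:
        raise ValueError("expected a 6-tuple of 0/1 flags")
    request_high = any(bits[:3])
    response_high = any(bits[3:])
    if request_high and response_high:
        return "MUD"          # mixed upstream and downstream
    if request_high:
        return "upstream"     # attack-source side
    if response_high:
        return "downstream"   # intended-victim side
    return "normal"
```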
With the definitions above, our proposed detection method can reveal the status of any node in a network that is under potential threat of a DRDoS attack, and an efficient defense method can be deployed on any node in the Internet.
4.3. Deep Forest Based DRDoS Detection and Defense Method
With features gathered from the network flow by the method proposed above, a deep forest model can be trained on HDTI to determine whether a certain IP is under a DRDoS attack. If the model classifies the IP as under threat, i.e., as downstream, upstream, or MUD type, the differentiated service procedure is activated to eliminate the DRDoS attack flow in the early, middle, and post stages.
Detection Model. To implement this, we first gathered the 6-tuple feature, HDTI, by online sampling from normal network flow and from a DRDoS simulation. The normal network flow contains packets of the vulnerable services. Because there is no publicly available dataset of DRDoS attacks, we simulated the DRDoS attack ourselves. We then took 30 s of normal network flow and 30 s of DRDoS attack flow to form a 60-s training set for the deep forest model. Our deep forest model contains 5 estimators: an XGBoost classifier, 2 random forest classifiers, and 2 completely-random tree forest classifiers, as shown in Figure 4.
The XGBoost classifier can be described as follows. The basic component of a boosted tree is the regression tree, also known as classification and regression trees (CART). A CART assigns each instance to a leaf according to its attributes, and a real-valued score is associated with that leaf. However, a single CART is usually not strong enough for effective prediction, so a stronger model, the tree ensemble, was proposed; it can be written as Equation (10).
where each $f_k$ belongs to the function space $\mathcal{F}$, and $\mathcal{F}$ is the set of all regression trees. The objective function can be written as Equation (11).
For the additive training of the XGBoost tree, at each step $t$ we choose a function $f_t$ that minimizes the value of the objective function.
The regularization term is $\Omega(f) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^2$, where $T$ is the number of leaves and $w_j$ is the weight of the $j$-th leaf. We can then regroup the objective by leaf; the result is the sum of $T$ independent quadratic functions.
Assuming the structure of the tree is fixed, we can solve for the best leaf weight $w_j^*$ and the corresponding best achievable value of the objective.
XGBoost defines the gain of a split as Equation (14). Essentially, it is the score of the left child plus the score of the right child, minus the score obtained if we do not split, and finally minus the complexity cost of introducing an additional leaf. We can then perform a left-to-right linear scan over the sorted instances to obtain the best split along the feature.
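The linear scan can be sketched in a few lines of Python. This is an illustration rather than the library's implementation, and it assumes that the gain takes the standard XGBoost form $\frac{1}{2}\left[\frac{G_L^2}{H_L+\lambda} + \frac{G_R^2}{H_R+\lambda} - \frac{(G_L+G_R)^2}{H_L+H_R+\lambda}\right] - \gamma$, where $G$ and $H$ are the sums of first- and second-order gradients in a node. The names `leaf_score` and `best_split` are hypothetical.

```python
def leaf_score(g, h, lam):
    """Structure score G^2 / (H + lambda) of a node."""
    return g * g / (h + lam)


def best_split(values, grads, hessians, lam=1.0, gamma=0.0):
    """Left-to-right scan over instances sorted by feature value.

    Returns (best_gain, threshold); threshold is None when no split
    has a positive gain.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    g_total, h_total = sum(grads), sum(hessians)
    g_left = h_left = 0.0
    best_gain, best_threshold = 0.0, None
    for pos in range(len(order) - 1):
        i, nxt = order[pos], order[pos + 1]
        # Move instance i from the right child to the left child.
        g_left += grads[i]
        h_left += hessians[i]
        if values[i] == values[nxt]:
            continue  # cannot split between equal feature values
        g_right, h_right = g_total - g_left, h_total - h_left
        # Left score + right score - unsplit score, minus the leaf cost.
        gain = 0.5 * (
            leaf_score(g_left, h_left, lam)
            + leaf_score(g_right, h_right, lam)
            - leaf_score(g_total, h_total, lam)
        ) - gamma
        if gain > best_gain:
            best_gain = gain
            best_threshold = (values[i] + values[nxt]) / 2
    return best_gain, best_threshold
```

Because the gradient sums of the left child are accumulated incrementally, every candidate split is evaluated in constant time once the instances are sorted. Raising `gamma` makes splits harder to accept, which is the complexity cost mentioned above.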
As for the random forest, it also uses CART as the weak learner, but it modifies the basic decision tree: at each node it randomly selects a subset of the features, whose size does not exceed the total number of features, and then chooses the best split among the selected features.
The input of a random forest is the training set together with $T$, the number of iteration rounds of the weak classifier. The output of a random forest is a strong classifier. For each round $t = 1, \dots, T$, the random forest algorithm will
Sample the training set $m$ times to obtain a sampling set consisting of $m$ samples.
Train the $t$-th decision tree model on the sampling set, randomly selecting features at each node.
The class that receives the most votes from the $T$ weak learners is selected as the final prediction.
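The steps above can be sketched as follows. This is a deliberately simplified illustration, not the model used in this paper: a one-split stump on a single randomly chosen feature stands in for a full CART, and the names `train_stump`, `train_forest`, and `predict` are hypothetical.

```python
import random
from collections import Counter


def train_stump(rows, labels, feature):
    """Weak learner (assumption: a one-split stump stands in for CART)."""
    # Split at the mean of the chosen feature; each side predicts the
    # majority label of the samples that fall on it.
    threshold = sum(r[feature] for r in rows) / len(rows)
    sides = {True: Counter(), False: Counter()}
    for r, y in zip(rows, labels):
        sides[r[feature] > threshold][y] += 1
    default = Counter(labels).most_common(1)[0][0]
    prediction = {
        side: (votes.most_common(1)[0][0] if votes else default)
        for side, votes in sides.items()
    }
    return lambda r: prediction[r[feature] > threshold]


def train_forest(rows, labels, n_trees, rng):
    """Train n_trees weak learners on bootstrap samples."""
    m = len(rows)
    forest = []
    for _ in range(n_trees):
        # Bootstrap: draw m indices with replacement.
        idx = [rng.randrange(m) for _ in range(m)]
        # Random feature selection (here a subset of size one).
        feature = rng.randrange(len(rows[0]))
        forest.append(
            train_stump([rows[i] for i in idx], [labels[i] for i in idx], feature)
        )
    return forest


def predict(forest, row):
    """Majority vote over all weak learners."""
    votes = Counter(tree(row) for tree in forest)
    return votes.most_common(1)[0][0]
```

Individual stumps trained on an unlucky bootstrap sample can be wrong, but the majority vote over many of them is what makes the ensemble a strong classifier.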
Extra trees are a variant of random forest, and there are only two minor differences between them. First, random forest uses bootstrap sampling of the training set, whereas extra trees use the original training set as input. Second, after randomly selecting features, random forest chooses the best split based on information gain, the Gini index, or mean squared error, whereas extra trees are far more radical and randomly select a value at which to split each feature. Although the randomly selected split values tend to increase the size of the trees, the generalization ability of extra trees is enhanced.
The input feature value HDTI is a 6-dimensional feature vector. When it is fed into the first layer, each estimator outputs an initial classification result, which is a 4-dimensional class vector. The 5 estimators in the first layer thus produce a 20-dimensional feature vector, which we concatenate with the 6-dimensional input feature vector to form a 26-dimensional augmented feature vector. This 26-dimensional augmented feature vector is used as the input to the second layer, and the same training procedure is repeated until there is no significant performance gain. The number of layers is therefore chosen automatically, which enhances the adaptivity of the model, making it applicable to different scales of data and deployable at any node in the network in a big data environment.
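The layer-by-layer augmentation and the automatic choice of depth can be sketched as below. The functions `augment` and `grow_cascade`, the callbacks `fit_layer` and `score_layer`, and the stopping tolerance `min_gain` are hypothetical names introduced for illustration; how a layer is trained and how performance is estimated are left to the caller.

```python
def augment(features, class_vectors):
    """Concatenate the estimators' class vectors with the input features."""
    augmented = []
    for vec in class_vectors:
        augmented.extend(vec)      # one class vector per estimator
    augmented.extend(features)     # original HDTI feature vector
    return augmented


def grow_cascade(fit_layer, score_layer, max_layers=10, min_gain=1e-3):
    """Add layers until the score stops improving significantly.

    fit_layer(depth) returns a trained layer; score_layer(layers)
    returns a performance estimate (both are caller-supplied).
    """
    layers, best = [], float("-inf")
    for depth in range(max_layers):
        candidate = layers + [fit_layer(depth)]
        score = score_layer(candidate)
        if score - best < min_gain:
            break                  # no significant gain: stop growing
        layers, best = candidate, score
    return layers
```

With 5 estimators that each emit a class vector over the 4 classes, `augment` turns a 6-dimensional HDTI vector into a 26-dimensional input for the next layer.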
Defense Model. With the trained deep forest model, we can identify the type of each IP in the network flow under detection and handle the different attack packets accordingly by differentiated service. The basic idea is as follows. If an IP address is identified as normal, we let all corresponding packets through. If the IP address is identified as upstream, we filter abnormal vulnerable-service request packets whose declared source IP is that address, which achieves early-stage DRDoS attack elimination. If the IP address is classified as downstream, we filter all related abnormal vulnerable-service response packets sent to that IP, for post-stage DRDoS attack elimination. If the IP address is identified as MUD, we filter both abnormal request and abnormal response packets of the corresponding vulnerable service for that IP.
In an actual deployment of the defense method, to make the differentiated service smart enough, in other words to eliminate attacks while letting normal network flow through, a set of applicable thresholds should be applied. We define request and response packets that exceed the thresholds as abnormal packets. If an IP is classified as MUD, we tag it as both upstream and downstream. The differentiated service drops an abnormal packet when one of the following conditions is met.
If the source IP of an abnormal request packet is identified as upstream, the differentiated service drops the packet.
If the destination IP of an abnormal response packet is identified as downstream, the packet is filtered as well.
To obtain a set of applicable thresholds, we can apply statistical methods to the normal, legitimate request and response packets separately. In practice, experts can also turn their experience into empirical rules for identifying whether a packet is abnormal or not.
In this paper, the packet length is used as one of the rules in the threshold set. From the dataset, we calculated the observed maximum and minimum lengths of legitimate request packets and, correspondingly, of legitimate response packets. From these observed values we calculate an upper bound on the request packet length and an upper bound on the response packet length. Then we can define the rules for request and response packets as below.
With the rule set defined, we can transform its rules into conjunctive normal form (CNF). A CNF is the conjunction of many disjunctive expressions. We define the following atomic propositions.
Thus the final CNF of the defined rule set can be represented as Equation (18).
Therefore the formalization of the defense method can be described as Equation (19).
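As a rough illustration of a length-based rule, the sketch below learns bounds from legitimate packet lengths and flags packets that fall outside them. It does not reproduce the bound used in this paper: the `slack` factor applied to the observed maximum, the rule that any length outside the learned range is abnormal, and the names `learn_length_bounds` and `is_abnormal` are assumptions made for the example.

```python
def learn_length_bounds(legit_lengths, slack=1.0):
    """Return (min_len, upper_bound) learned from legitimate packet lengths.

    slack is a hypothetical safety factor applied to the observed maximum.
    """
    if not legit_lengths:
        raise ValueError("need at least one legitimate packet length")
    return min(legit_lengths), max(legit_lengths) * slack


def is_abnormal(length, bounds):
    """Assumed rule: a packet is abnormal when its length is out of bounds."""
    low, high = bounds
    return not (low <= length <= high)
```

Separate bounds would be learned for request packets and for response packets, since the two directions have very different legitimate sizes.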
The procedure of the deep forest based detection and defense method is shown as pseudocode in Algorithm 1.
Algorithm 1. Deep Forest based DRDoS Detection and Defense |
Input: | Training network flow, network flow to be detected, rule set |
1: | Extract HDTI features from the training network flow with Equations (2), (4) and (6) |
2: | Train the deep DRDoS detection and defense forest model with the extracted HDTI features |
3: | Convert the rule set to its CNF |
4: | |
5: | for each sampling interval do |
6: |   for each VSD packet do |
7: |     if the packet is a request packet then |
8: |       if the source IP of the packet is in Upstream IP Set then |
9: |         if the CNF proposition is true for the packet then |
10: |           drop this packet |
11: |         end if |
12: |       end if |
13: | |
14: |     end if |
15: |     if the packet is a response packet then |
16: |       if the destination IP of the packet is in Downstream IP Set then |
17: |         if the CNF proposition is true for the packet then |
18: |           drop this packet |
19: | |
20: |         end if |
21: |       end if |
22: | |
23: |     end if |
24: |   end for |
25: |   for each IP in the merged IP set do |
26: |     calculate the HDTI feature for the IP |
27: |     identify the type of the IP using the deep DRDoS detection and defense forest model |
28: |     if the type of the IP is normal then |
29: |       do nothing |
30: |     else |
31: |       if the type of the IP is Upstream then |
32: |         add the IP to Upstream IP Set |
33: |       else |
34: |         if the type of the IP is Downstream then |
35: |           add the IP to Downstream IP Set |
36: |         else |
37: |           add the IP to both Upstream and Downstream IP Set |
38: |         end if |
39: |       end if |
40: |     end if |
41: |   end for |
42: | end for |
43: | return |
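The per-interval logic of Algorithm 1 can be sketched in Python as follows. This is a simplified illustration under assumptions: packets are plain dictionaries with hypothetical keys `is_request`, `src_ip`, and `dst_ip`; `is_abnormal` is any caller-supplied predicate standing in for the CNF proposition; and the classifier output is passed in as a mapping from IP to one of `"normal"`, `"upstream"`, `"downstream"`, or `"MUD"`.

```python
def filter_interval(packets, upstream, downstream, is_abnormal):
    """Return the packets that are forwarded during one sample interval."""
    forwarded = []
    for p in packets:
        if p["is_request"]:
            # Drop abnormal requests whose declared source IP is upstream.
            drop = p["src_ip"] in upstream and is_abnormal(p)
        else:
            # Drop abnormal responses whose destination IP is downstream.
            drop = p["dst_ip"] in downstream and is_abnormal(p)
        if not drop:
            forwarded.append(p)
    return forwarded


def update_ip_sets(ip_types, upstream, downstream):
    """Update both IP sets in place from the classifier output."""
    for ip, kind in ip_types.items():
        if kind in ("upstream", "MUD"):
            upstream.add(ip)
        if kind in ("downstream", "MUD"):
            downstream.add(ip)
```

Note that the IP sets used to filter an interval are those built from earlier intervals, so a newly observed attacker is filtered starting with the interval after it is classified.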