UNSW-NB15: A Comprehensive Data Set For Network Intrusion Detection Systems (UNSW-NB15 Network Data Set)
All content following this page was uploaded by Nour Moustafa on 24 December 2015.
Abstract— One of the major research challenges in this field is the unavailability of a comprehensive network-based data set which can reflect modern network traffic scenarios, vast varieties of low footprint intrusions and depth structured information about the network traffic. The KDD98, KDDCUP99 and NSLKDD benchmark data sets, used to evaluate network intrusion detection systems research efforts, were generated a decade ago. However, numerous current studies showed that for the current network threat environment, these data sets do not inclusively reflect network traffic and modern low footprint attacks. To counter the unavailability of network benchmark data sets, this paper examines the creation of the UNSW-NB15 data set. This data set has a hybrid of real modern normal activities and contemporary synthesized attack activities of the network traffic. Existing and novel methods are utilised to generate the features of the UNSW-NB15 data set. This data set is available for research purposes and can be accessed from the link¹.

Keywords- UNSW-NB15 data set; NIDS; low footprint attacks; pcap files; testbed

I. INTRODUCTION
Currently, due to the massive growth in computer networks and applications, many challenges arise for cyber security research. Intrusions/attacks can be defined as a set of events which are able to compromise the principles of computer systems, e.g. availability, authority, confidentiality and integrity [1]. Firewall systems cannot detect modern attack environments and are not able to analyse network packets in depth. For these reasons, IDSs are designed to achieve high protection for the cyber security infrastructure [2].

A Network Intrusion Detection System (NIDS) monitors network traffic flow to identify attacks. NIDSs are classified into misuse/signature based and anomaly based [4]. The signature-based approach matches traffic against known attacks to detect intrusions. In the anomaly-based approach, however, a normal profile is created from the normal behavior of the network, and any deviation from this profile is considered an attack [3] [4]. Further, signature-based NIDSs cannot detect unknown attacks, and for this reason anomaly-based NIDSs are recommended in many studies [4] [5].

The effectiveness of a NIDS is evaluated based on its performance in identifying attacks, which requires a comprehensive data set that contains normal and abnormal behaviors [6]. The older benchmark data sets are KDDCUP99 [7] and NSLKDD [8], which have been widely adopted for evaluating NIDS performance. Several studies [6][9][10][11] have found that evaluating a NIDS using these data sets does not reflect realistic output performance, for several reasons. First, the KDDCUP99 data set contains a tremendous number of redundant records in the training set. The redundant records bias the detection results toward the frequent records [10]. Second, there are also multiple missing records that are a factor in changing the nature of the data [9]. Third, the NSLKDD data set is the improved version of KDDCUP99 and tackles several issues, such as data unbalancing among the normal/abnormal records and the missing values [12]. However, this data set is not a comprehensive representation of a modern low footprint attack environment.

The above reasons have posed a serious challenge for the cyber security research group at the Australian Centre for Cyber Security (ACCS)² and other researchers in this domain around the globe. To counter this challenge, this paper presents an effort to create the UNSW-NB15 data set for evaluating NIDSs. The IXIA PerfectStorm tool³ is utilised in the Cyber Range Lab of the ACCS to create a hybrid of modern normal and abnormal network traffic. The abnormal traffic generated by the IXIA tool simulates nine families of attacks that are listed in Table VIII. The IXIA tool contains information about new attacks that is updated continuously from a CVE site⁴. This site is a dictionary of publicly known information security vulnerabilities and exposures. The tcpdump⁵ tool is used to capture network traffic in the form of packets. The simulation period was 16 hours on Jan 22, 2015 and 15 hours on Feb 17, 2015, capturing 100 GBs. Further, the capture is split into pcap files of 1000 MB each using the tcpdump tool. To create reliable features from the pcap files, the Argus⁶ and Bro-IDS⁷ tools are utilised. Additionally, twelve algorithms are developed in C# to analyse in depth the flows of the connection packets. The data set is labelled from a ground truth table that contains all simulated attack types. This table is designed from an IXIA report that is generated during the simulation period. The key characteristic of the UNSW-NB15 data set is a hybrid of real modern normal behaviors and synthetic attack activities.

The rest of the paper is organised as follows: section 2 examines the general goal and orientation of any IDS data set. Section 3 details the shortcomings of the existing benchmark data sets. The synthetic environment configuration and the generation of UNSW-NB15 are described in section 4. Section 5 is a comparative analysis between the KDDCUP99 and the UNSW-NB15 data sets. Section 6 presents the final shape of the files of the UNSW-NB15 data set. Finally, section 7 concludes the work and outlines future intentions.

¹ http://www.cybersecurity.unsw.adfa.edu.au/ADFA%20NB15%20Datasets/
² http://www.accs.unsw.adfa.edu.au/
³ http://www.ixiacom.com/products/perfectstorm
⁴ https://cve.mitre.org/
⁵ http://www.tcpdump.org/
⁶ http://qosient.com/argus/index.shtml
⁷ https://www.bro.org/index.html

II. THE GOAL AND ORIENTATION OF A NIDS DATA SET
A NIDS data set can be conceptualized as relational data [6]. The input to a NIDS is a set of data records. Each record consists of attributes of different data types (e.g., binary, float, nominal and integer) [6]. The label marks each record of the data as either normal (0) or abnormal (1). Labelling is done by matching each processed record, according to the particular NIDS scenario, with the ground truth table of all transaction records.

III. CRITICISMS OF EXISTING DATA SETS
The quality of a NIDS data set reflects two important characteristics: a comprehensive reflection of contemporary threats and an inclusive range of normal traffic. The quality of the data set ultimately affects the reliability of the outcome of any NIDS [6] [9]. In this section the disadvantages of existing data sets for NIDS are explored from the perspective of data set quality. The most widely adopted data sets for NIDS are KDDCUP99 and its improved version, NSLKDD.

A. KDDCup99 Data Set
To generate DARPA98 [13], the IST group of Lincoln Laboratories at MIT performed a simulation with normal and abnormal traffic in a military network (U.S. Air Force LAN) environment. The simulation ended with nine weeks of raw tcpdump files. The training data size was about four GBs and consisted of compressed binary tcpdump files from seven weeks of network traffic. This was processed into approximately five million connection records. The simulation provided two weeks of test data which contained two million connection records [7] [13].

To make the DARPA98 network data features more comprehensive, the same environment (U.S. Air Force LAN) was utilised, and the simulation ended with 41 features for each connection along with the class label, extracted using the Bro-IDS tool. The upgraded version of DARPA98 is referred to as KDDCUP99. In the KDDCUP99 data set, the extracted features were divided into three groups: intrinsic features, content features and traffic features. Further, attack records in this data set are categorised into four vectors (DoS, Probe, U2R, and R2L). The training set of KDDCUP99 included 22 attack types and the test data contained 15 attack types [13] [7].

A number of IDS researchers have utilised these data sets due to their public availability. However, many researchers have reported three major disadvantages of these data sets [6] [9] [10] [11] [12] which can affect the transparency of the IDS evaluation. First, every attack data packet has a time to live (TTL) value of 126 or 253, whereas the packets of the rest of the traffic mostly have a TTL of 127 or 254. However, TTL values 126 and 253 do not occur in the training records of the attack [9]. Second, the probability distribution of the testing set is different from the probability distribution of the training set, because new attack records were added to the testing set [10][12]. This skews or biases classification methods toward some records rather than balancing between the types of attack and normal observations. Third, the data set is not a comprehensive representation of recently reported low footprint attack projections [11].

B. NSLKDD Data Set
According to [12], an upgraded version of the KDD data set, referred to as NSLKDD, was created with three goals in mind. The first goal was to remove the duplicate records in the training and test sets of the KDDCUP99 data set, so that classifiers are not biased toward more frequently repeated records. The second was to select a variety of records from different parts of the original KDD data set in order to achieve reliable results from classifier systems. The third was to eliminate the unbalancing problem among the number of records in the training and testing phases in order to decrease the False Alarm Rates (FARs). The major disadvantage of NSLKDD is that it does not represent modern low footprint attack scenarios [9] [12].

IV. UNSW-NB15 DATA SET
In this section, the synthetic environment configuration and the generation of UNSW-NB15 are presented in detail. The section mainly covers the testbed configuration details and the whole process involved in generating UNSW-NB15 from the configured testbed.

A. An IXIA tool Testbed Configuration
According to Fig. 1, the IXIA traffic generator is configured with three virtual servers. Servers 1 and 3 are configured for the normal spread of the traffic, while server 2 forms the abnormal/malicious activities in the network traffic. To establish the intercommunication between the servers and to acquire public and private network traffic, there are two virtual interfaces with the IP addresses 10.40.85.30 and 10.40.184.30. The servers are connected to hosts via two routers. Router 1 has the IP addresses 10.40.85.1 and 10.40.182.1, whereas router 2 is configured with 10.40.184.1 and
10.40.183.1 IP addresses. These routers are connected to the firewall device, which is configured to pass all traffic, whether normal or abnormal. The tcpdump tool is installed on router 1 to capture the pcap files during the simulation uptime. Moreover, the central intent of this whole testbed was to capture the normal and abnormal traffic that originated from the IXIA tool and was dispersed among the network nodes (e.g., servers and clients). Importantly, the IXIA tool is utilised as a generator of both attack traffic and normal traffic; the attack behaviour is fed from the CVE site for the purpose of a real representation of a modern threat environment.

Figure 1. The Testbed Visualization for UNSW-NB15

the number of Kbytes that is sniffed during each simulation period.

Figure 2. The Concurrent Transactions of Flows during the Simulation Periods, panels (A) and (B).
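
The addressing given in the testbed description can be written down as a small configuration and sanity-checked. The Python sketch below is only illustrative: the device labels are made up for readability, and the /24 prefix length is an assumption, since the paper does not state subnet masks. Under that assumption, each virtual interface falls into the same network as one address of a router.

```python
import ipaddress
from collections import defaultdict

# IP addresses as stated in the testbed description; the keys are
# illustrative labels, not names used by the authors.
TESTBED = {
    "virtual interface 1": "10.40.85.30",
    "virtual interface 2": "10.40.184.30",
    "router 1, address 1": "10.40.85.1",
    "router 1, address 2": "10.40.182.1",
    "router 2, address 1": "10.40.184.1",
    "router 2, address 2": "10.40.183.1",
}

# ASSUMPTION: the paper gives no subnet masks; /24 is used for illustration.
ASSUMED_PREFIX = 24


def group_by_subnet(devices, prefix=ASSUMED_PREFIX):
    """Map each assumed network to the device labels whose address is in it."""
    groups = defaultdict(list)
    for name, address in devices.items():
        # strict=False lets us derive the network from a host address.
        network = ipaddress.ip_network(f"{address}/{prefix}", strict=False)
        groups[str(network)].append(name)
    return dict(groups)


subnets = group_by_subnet(TESTBED)
all_private = all(ipaddress.ip_address(a).is_private for a in TESTBED.values())

for network, names in sorted(subnets.items()):
    print(network, "->", ", ".join(names))
print("all addresses are private:", all_private)
```

With a different prefix length the grouping changes, so the result should be read as a consistency check on the description rather than as the actual network layout.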
Due to the speed of network traffic and the way modern attacks exploit it, the IXIA tool is configured to generate one attack per second during the first simulation to capture the first 50 GBs. On the other hand, the second simulation is configured to make ten attacks per second to extract another 50 GBs.

B. Traffic Analysis
The traffic analysis is described for the cumulative flows during the period of the simulation while generating the UNSW-NB15 data set. Table I provides the data set statistics: the simulation period, the number of flows, the total source bytes, the total destination bytes, the number of source packets, the number of destination packets, the protocol types, the number of normal and abnormal records, and the number of unique source/destination IP addresses.

C. Architectural Framework
The whole architecture involved in generating the final shape of UNSW-NB15, from pcap files to CSV files with 49 features (the attributes in any CSV file), is presented in Fig. 3. All 49 features of the UNSW-NB15 data set are elaborated in Tables II-VII, along with an explanation of the generation sequence for ease of understanding.
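
Summary statistics of the kind reported in Table I (flow counts, byte and packet totals, normal versus abnormal records, unique IP addresses) can be derived directly from flow records stored as CSV. The Python sketch below is a minimal illustration under stated assumptions: the column names (`src_ip`, `dst_ip`, `proto`, `src_bytes`, `dst_bytes`, `src_pkts`, `dst_pkts`, `label`) are hypothetical placeholders and may not match the feature names in Tables II-VII, and the three sample rows are invented values using documentation-range addresses, not records from the data set. Only the label convention (0 for normal, 1 for abnormal) is taken from the paper.

```python
import csv
import io
from collections import Counter

# Hypothetical column names and made-up rows, for illustration only.
SAMPLE_CSV = """\
src_ip,dst_ip,proto,src_bytes,dst_bytes,src_pkts,dst_pkts,label
192.0.2.10,198.51.100.5,tcp,500,1200,6,8,0
192.0.2.10,198.51.100.6,udp,146,178,2,2,0
192.0.2.77,198.51.100.5,tcp,2048,0,12,0,1
"""

NUMERIC_COLUMNS = ("src_bytes", "dst_bytes", "src_pkts", "dst_pkts")


def summarize_flows(rows):
    """Aggregate Table I style statistics from an iterable of flow dicts."""
    totals = Counter()
    protocols = Counter()
    labels = Counter()
    src_ips, dst_ips = set(), set()
    flows = 0
    for row in rows:
        flows += 1
        for column in NUMERIC_COLUMNS:
            totals[column] += int(row[column])
        protocols[row["proto"]] += 1
        # Label convention from the paper: 0 = normal, 1 = abnormal.
        labels["abnormal" if row["label"] == "1" else "normal"] += 1
        src_ips.add(row["src_ip"])
        dst_ips.add(row["dst_ip"])
    return {
        "flows": flows,
        **{column: totals[column] for column in NUMERIC_COLUMNS},
        "protocols": dict(protocols),
        "normal": labels["normal"],
        "abnormal": labels["abnormal"],
        "unique_src_ips": len(src_ips),
        "unique_dst_ips": len(dst_ips),
    }


summary = summarize_flows(csv.DictReader(io.StringIO(SAMPLE_CSV)))
for name, value in summary.items():
    print(f"{name}: {value}")
```

Because the function consumes rows one at a time, the same approach should extend to large CSV files by passing `csv.DictReader` an open file handle instead of an in-memory string.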
ACKNOWLEDGMENT
This work is supported by the Cyber Range Lab of the Australian Centre for Cyber Security (ACCS) at UNSW in Canberra. The authors are grateful to the manager of the Cyber Range Lab.
REFERENCES
[1] R. Heady, G. Luger, A. Maccabe, M. Servilla, “The architecture of a network level intrusion detection system”, Tech. rep., Computer Science Department, University of New Mexico, New Mexico, 1990.
[2] M. Aydın, M. Ali, A. Halim Zaim, and K. Gökhan Ceylan, “A hybrid intrusion detection system design for computer network security”, Computers & Electrical Engineering, 2009, pp. 517-526.
[3] S. Axelsson, “Intrusion detection systems: A survey and taxonomy”, Technical report, 2000, Vol. 99.