
Anomaly Detection for Data Streams in Large-Scale Distributed Heterogeneous Computing Environments

Yue Dang², Bin Wang¹, Ryan Brant¹, Zhiping Zhang², Maha Alqallaf³, Zhiqiang Wu²
¹ Department of Computer Science and Engineering, Wright State University, Dayton, Ohio 45435, USA
² Department of Electrical Engineering, Wright State University, Dayton, Ohio 45435, USA
³ Kuwait Ministry of Education

Abstract: Cyber security has become a chief area of concern for national security. Counteracting cyber threats
to ensure secure cyberspace faces great challenges as cyber-attacks are increasingly stealthy and
sophisticated, the protected cyber domains exhibit rapidly growing complexity and scale as a result of their
continuous development, and a large amount of system status and monitoring data is generated. Increasingly, it
is important to design big data-driven cyber security solutions that effectively and efficiently derive actionable
intelligence (e.g., patterns and models) from available heterogeneous sources of information using principled
data analytic methods to defend against cyber threats and to ensure system integrity.
In this work, we present a scalable distributed framework to collect and process extreme-scale networking and
computing system traffic and status data from multiple sources that collectively represent the system under
study, and develop and apply real-time adaptive data analytics for anomaly detection to monitor, understand,
maintain, and improve cybersecurity. The data analytics will integrate multiple sophisticated machine learning
algorithms and human-in-the-loop feedback for iterative ensemble learning. Given the volume, speed, and complex
nature of the system data gathered, plus the need for real-time data analytics (e.g., anomaly detection), a
scalable data processing framework that can handle big data with low latency is needed. Our proposed big-data
analytics will be implemented using an Apache Spark computing cluster.
Early anomaly detection can be extremely valuable in many domains, such as various government agencies, IT
security (e.g., identify changes in employee behavior that signal a security breach), finance, vehicle tracking,
healthcare, energy grid monitoring, e-commerce – essentially in any application where there are sensors that
produce important data that change over time. The analytics developed will offer significant improvements
over existing methods of anomaly detection in real time. Our preliminary evaluation studies have shown that
the developed techniques achieve better capabilities of defending against cyber threats.

Keywords: data analytics, distributed processing framework, anomaly detection, large-scale cyber system, big
data stream, Apache Spark

1. Introduction
Cyber security has become a chief area of concern for national security. In 2010, President Obama declared
that “America’s economic prosperity in the 21st century will depend on cyber security.” In this work, anomaly
detection refers to the problem of finding patterns in data that do not conform to expected behaviour in
cybersecurity settings, such as detection of attacks on computer networks, fraud detection in credit card
transactions, fault detection in safety critical systems, and detection of enemy activities in military
surveillance. These non-conforming patterns are often referred to as anomalies or outliers.

Anomalies in data can sometimes be translated into significant (and often critical) actionable information. For
example, an unusual traffic pattern in a computer network could mean that a hacked computer is sending out
sensitive data to an unauthorized destination, and an anomalous MRI image may indicate the presence of
malignant tumours. Today, anomaly detection finds extensive use in a wide variety of applications such as
insurance or health care, intrusion detection for cyber-security, financial transactions, etc. A variety of
anomaly detection techniques have been developed in several research communities, such as statistics,
network, data mining, medicine, and so on.

Anomaly detection attempts to determine if a data pattern does not conform to expected normal behaviour. A
straightforward approach, therefore, is to define a region representing normal behaviour and declare any
observation in the data which does not belong to this normal region as an anomaly (Bhuyan, Bhattacharyya,
and Kalita, 2014). Figure 1 shows a diagram that contains major components of anomaly detection regardless
of the areas of application (Chandola, Banerjee and Kumar, 2009). At an abstract level, a system of anomaly
detection takes in some input data set and applies a learning technique to create a predictive model. For a
given test data instance, the learned model then produces an output in the form of either a binary label that
characterizes the test data instance as normal or anomalous, or an anomaly score that indicates the degree or
likelihood of being abnormal. The input data set is a set of data instances with attributes or features and the
data instances may or may not be labelled. Depending on applications or circumstances, the input data may be
point data, contextual or collective data. Anomaly detection techniques have been extensively researched in
many different disciplines. The diagram (Figure 1) shows a few types of techniques including classification
based, nearest neighbour based, clustering based, statistical, information-theoretic, and spectral/subspace
approaches. In each type, there are often multiple subtypes and many proposed algorithms. These different
types of technique can be applied in different learning settings.

Three broad forms of machine learning and predictive model training exist. Unsupervised anomaly
detection techniques assume that the majority of the instances in the data set are normal and look for
instances that fit the remainder of the data set least well. Supervised anomaly detection techniques
require a data set with instances that have been labelled as “normal” or “abnormal”, and involve training and
using a classifier or other means. Semi-supervised anomaly detection techniques construct a model
representing the normal behaviour from a given normal training data set, and then test the likelihood of a test
instance to be anomalous using the learnt model. Ensemble learning is the process by which multiple
models/learning algorithms are strategically generated and integrated to obtain better predictive
performance than could be obtained by any of the constituent learning algorithms alone. Note that ensemble
learning has been quite successful in terms of boosting predictive performance in supervised settings. The
development of ensemble techniques for anomaly detection has been a difficult problem and lags significantly
behind. Unlike a classification problem, several factors make anomaly detection very
challenging:
§ Defining a normal region which indicates normal behaviour is very difficult and the boundary
between normal and anomalous behaviours is often not precise. An anomalous observation which
lies close to the boundary can actually be normal, and vice-versa. When anomalies are the result of
malicious actions, the malicious adversaries often adapt themselves to make the anomalous
observations appear like normal (i.e., stealthy attacks), thereby making the task of defining normal
behaviour more difficult.
§ In many real systems (e.g., network), normal behaviour or normal profile keeps evolving and a current
notion of normal behaviour might not be sufficiently representative in the future.
§ The exact notion of an anomaly is different for different applications. For example, in the medical
field a small deviation from normal might be an anomaly, whereas a similar deviation in the stock
market domain might be normal. Thus a technique developed in one domain may not be directly
applicable to another.
§ The ability to label data for the purpose of training/validation of models used by anomaly detection
techniques is usually a major issue. Additionally, sometimes obtaining such data (in particular, rare
abnormal data) can be very expensive, especially when sampling abnormal behaviour.
§ Anomaly detection needs to deal with the inherent class imbalance problem where abnormal data
points in the data set are often rare and diverse compared with normal data points.
§ Often the data contains noise which tends to be similar to the anomalies and hence is difficult to
distinguish and remove.
§ The amount of data upon which anomaly detection must operate is enormous and growing in many
disciplines. Real-time analysis of big data is increasingly required to detect attacks in time to
prevent costly damage or to mitigate their negative effects.
The contributions of this work are summarized as follows. We propose a high performance data processing
framework based on the Apache Spark and its streaming extension for dealing with the big data nature of
increasingly many anomaly detection problems in order to achieve timely detection and to provide reactive
countermeasures. As anomaly data instances are often rare and diverse, in our predictive model we apply a
selective set of unsupervised learning techniques to capture the inherent complexity of unknown anomaly
classes. Specific unsupervised learning algorithms include modified principal component analysis, replicator
neural networks, k-means clustering, and one-class support vector machine to name a few. These learning
algorithms transform the input data set to a new feature space in the form of anomaly scores of the data
instances. Our framework allows additional learning techniques to be added to increase its feature space
diversity profile. We then build an ensemble using both unsupervised and supervised approaches in
the space spanned by the features produced by the various unsupervised learning algorithms. These ensemble
learning approaches result in a more powerful predictive model. Last but not least, we reduce the false
alarm rate of anomaly detection by iteratively incorporating feedback on the performance of the
predictive model from human analysts into the ensemble learning model.

Figure 1: Components of anomaly detection

The remainder of the paper is organized as follows. Section 2 presents the details of the proposed big data
analytical data processing framework based on Apache Spark and Spark Streaming. The anomaly detection
problem under study is described along with our proposed technical approach. Performance evaluation and
comparison with existing anomaly detection techniques are reported in Section 3. Section 4 summarizes some
related work on network anomaly detection. The paper concludes in Section 5.

2. Big data analytics for ensemble anomaly detection

2.1 Apache Spark based distributed big data processing framework


We describe a generic distributed Apache Spark based real-time data analytics framework (Ryza et al, 2015,
Agneeswaran, 2014) for processing stream data from heterogeneous sources, for example system log data of
large computing environments such as a large enterprise network, a data centre, or an exascale high
performance computing system. Major components of the proposed framework are shown in Figure 2.
Management and monitoring of large computing environments generate a large amount of data. The collected
data may indicate whether a node is up or down, network connectivity and usage, storage availability, user
tracking & authentication, node resource use (e.g., processor usage per node and per core, memory usage),
software use, jobs/workflows execution information (including how long a job ran, who ran them, when the
job was submitted, when the job started and how long it sat in the queue, how many jobs are in the queue at
any one time, which queues have the most jobs waiting, and the most popular day of the week and time of day
jobs are submitted). The job scheduler also provides a great deal of information that can indicate how the
computing system is doing and how to improve its performance. This quickly leads to numerous big-data
analytics problems based on the data gathered. What will this information tell you? What kind of knowledge
can you gain? Can you detect any anomalies or malicious activities? Much of this “big data” is received in real time,
and is most valuable at its time of arrival. Arriving data need to be processed in real time to ensure timely
detection of anomalies and business continuity.
Given the volume, speed, and complex nature of the data gathered, plus the need for real-time data analytics, a
scalable data processing framework capable of handling big data with low latency is critical. Popular big
data frameworks, e.g., Hadoop (Hadoop, 2016), Mahout (Mahout, 2016), MapReduce (Dean and Ghemawat,
2008), HBase (HBase, 2016), Google Bigtable (Chang et al, 2008), are highly scalable, but are designed towards
batch processing. Recent developments of frameworks with built-in stream processing capabilities include
Apache Storm (Storm, 2016), Apache S4 (S4, 2016) and Apache Spark (Spark, 2016, Streaming, 2016). Our
proposed real-time distributed data analytics framework (Figure 2) takes advantage of the capabilities that
Apache Spark streaming offers due to its many great features.
Spark performs in-memory analytics on stream data by running streaming computations as a series of micro-batch
jobs. It makes the batch size as small as possible to achieve low latency. Each batch’s processed data can be
stored in any file system (e.g., HDFS), database (e.g., HBase), or live dashboards. Spark also supports stateful
operations that have a dependency on previous batches of data by keeping the states (intermediate batch
results) between batches in memory so that it can recover them quickly using a data abstraction called
resilient distributed datasets (RDD) (Zaharia et al, 2012). Spark uses an advanced DAG (Directed Acyclic Graph)
execution engine that supports cyclic data flow and in-memory computing, which makes it faster than other
distributed frameworks. Additionally, Apache Spark is open-source and uses Hadoop (Hadoop, 2016) for the
distributed file system and is capable of working on top of a next-generation Hadoop cluster called YARN for
resource sharing. Spark avoids the I/O bottleneck of the conventional two-stage MapReduce programs by
providing in-memory cluster computing that allows a user to load data into a cluster’s memory and query it
efficiently.
Our proposed framework will be built upon the Apache Spark streaming implementation known as Spark
Streaming (Streaming, 2016), which extends the capability of Spark and can be integrated with Spark’s batch
processing. It is highly scalable and fault-tolerant. It uses a micro-batch technique which splits the input
stream into a sequence of small chunks of data called D-Streams (Zaharia et al, 2012) that offer a high-level
functional programming API, strong consistency, and efficient fault recovery. It then processes these small
chunks as a batch and sends the results to the next batch process. The key idea behind D-Streams is to treat a
streaming computation as a series of deterministic batch computations on small time intervals to achieve low
latency.

[Figure 2 diagram: workflow components include applications, log search, unified log aggregation, monitoring, security and user tracking, fraud detection, model training, prediction, root cause analysis, operational metrics, and actionable intelligence.]
Figure 2: Proposed generic distributed real-time data analytics framework

Spark Streaming offers two types of operators: (1) transformation operators, which map a new DStream from one
or more parent streams, supporting both stateless (independent on each interval) and stateful (share data
across intervals) operations; and (2) output operators, which are action operators that allow the program to write data to
external systems (e.g., save or print a DStream). Spark Streaming supports all the transformation and action
operations such as map, reduce, groupBy, and join, etc. These operators will be used to build our techniques
for model training and prediction processing. We will design and implement customized machine learning and
data analytics algorithms for anomaly detection and other data analytics tasks.
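To make the micro-batch model concrete, the following minimal sketch shows a Spark Streaming job in Python; the socket source, the five-second batch interval, the window sizes, and the toy parsing logic are illustrative assumptions rather than details of the framework described above.

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

# Create a streaming context with a 5-second micro-batch interval (assumed value).
sc = SparkContext(appName="AnomalyDetectionStream")
ssc = StreamingContext(sc, batchDuration=5)

# Hypothetical source: newline-delimited monitoring records arriving on a TCP socket.
records = ssc.socketTextStream("localhost", 9999)

# Transformation operator (stateless): parse each record into a numeric feature vector.
features = records.map(lambda line: [float(v) for v in line.split(",")])

# Stateful operator: a sliding window over the last 60 seconds, evaluated every 10 seconds.
windowed = features.window(windowDuration=60, slideDuration=10)

# Output operator: hand each micro-batch (an RDD) to a scoring routine defined elsewhere.
def score_batch(time, rdd):
    if not rdd.isEmpty():
        print(time, rdd.count(), "records received in this micro-batch")

windowed.foreachRDD(score_batch)

ssc.start()
ssc.awaitTermination()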
To efficiently and reliably deal with and unify large volumes of data from heterogeneous sources, our proposed
framework will use a cluster based message broker, Apache Kafka (Kafka, 2016), to channel the data to the
Spark streaming processing framework without any loss of data. Data from the message broker is periodically
captured and processed. Kafka serves as an intermediary that communicates data between sources and
destinations by routing messages to one or more destinations and similarly collecting messages from multiple
sources. It performs message aggregation, decomposes messages into multiple messages and sends them to
their destination, then recomposes the responses into one message to return to the user. Kafka provides
guaranteed message delivery with proper ordering. A consumer receives messages in the order they are
stored. Moreover, a Kafka cluster can be formed to lower processing latency and provide fault tolerance
without loss of data.
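As one possible realization of the Kafka-to-Spark pipeline described above (a sketch only; the topic name, broker address, and application name are placeholders), a direct Kafka stream can be consumed with the KafkaUtils API that shipped with Spark 1.3 through 2.3:

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils  # available in Spark 1.3 - 2.3

sc = SparkContext(appName="KafkaIngest")
ssc = StreamingContext(sc, 5)

# Hypothetical topic carrying unified log/monitoring messages from many sources.
kafka_stream = KafkaUtils.createDirectStream(
    ssc,
    topics=["system-logs"],
    kafkaParams={"metadata.broker.list": "broker1:9092"})

# Each Kafka message is a (key, value) pair; keep the value (the log record).
log_lines = kafka_stream.map(lambda kv: kv[1])
log_lines.pprint()   # output operator: print a few records of every micro-batch

ssc.start()
ssc.awaitTermination()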

2.2 Problem statement and overview of solution


Given a data set that contains instances with attributes or features extracted from various data sources of an
enterprise network, a large-scale high performance computing system, or a data centre, where the data instances
are not labelled, the objective is to design an effective learning technique to detect anomalous data instances
or outliers with a low false alarm rate, and to do so within a time constraint while processing a potentially large
amount of data.
Our solution approach is outlined as follows. As data instances are not labelled and the data volume is huge,
plus anomalies are relatively rare and their class composition is unknown (in fact, anomalies may be quite
diverse), our predictive model and solution consist of a number of iterative steps and are fundamentally
rooted in unsupervised learning and ensemble building to achieve better anomaly detection performance.
First, a selected unsupervised anomaly detection algorithm is used to capture the inherent characteristics of
the “normal profile” of the data set, and in the process each original data instance is assigned a measure in the
form of an anomaly score. By applying a set of diverse unsupervised learning algorithms, a different “anomaly
score” is generated for the same input data instance. In this way, the original data instances or features are
converted to a set of new features in the form of anomaly scores. Second, we build an ensemble anomaly
detector upon the unsupervised learning algorithms using both supervised and unsupervised approaches.
Third, the learned ensemble is an ever-evolving predictive model which is applied for detecting
anomalies given test data instances. Fourth, to evolve the predictive model and to further improve its
performance, our solution incorporates feedback on predicted anomalies from human analysts. Analysts are
presented with a set of “anomalies” detected by the current predictive model and then provide as feedback
the ground truth of these “anomalies”. Armed with labelled data instances, our solution learns an improved
ensemble anomaly predictive model using supervised learning techniques. This process repeats and, over
time, our solution will be able to better capture and adapt to the evolving nature of the normal profile
and achieve better detection performance while maintaining a low false alarm rate.
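The sketch below summarizes this iterative loop in plain Python; the detector list, the analyst_label callback, the fit_ensemble helper, and the review budget are placeholders standing in for the components detailed in Sections 2.3 and 2.4, not part of the description above.

import numpy as np

def detection_loop(X, detectors, analyst_label, fit_ensemble, n_rounds=3, top_k=50):
    """One possible realization of the iterative ensemble workflow (a sketch).

    X             : unlabelled data matrix (instances x original features)
    detectors     : unsupervised algorithms, each returning an anomaly score per instance
    analyst_label : callback returning ground-truth labels for the selected instances
    fit_ensemble  : builds an ensemble over the score features and returns a scoring function
    """
    # Step 1: transform original features into a new feature space of anomaly scores.
    scores = np.column_stack([d(X) for d in detectors])

    labelled_idx, labels = [], []
    ensemble = fit_ensemble(scores, None, None)            # initial unsupervised ensemble
    for _ in range(n_rounds):
        ranked = np.argsort(-ensemble(scores))             # most anomalous first
        to_review = [i for i in ranked if i not in labelled_idx][:top_k]
        labels += list(analyst_label(to_review))           # Step 4: human feedback
        labelled_idx += to_review
        ensemble = fit_ensemble(scores, labelled_idx, labels)  # retrain with labels
    return ensemble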

2.3 Unsupervised anomaly detection

2.3.1 K-means clustering


K-means clustering assumes that normal data instances lie close to their closest cluster centroid while
anomalies are far away from their closest cluster centroid. It takes two simple steps: (1) an iterative refinement
technique is used to assign each data instance to the cluster whose mean yields the least within-cluster sum of
squares, and the new means are then calculated as the centroids of the data instances in the new clusters; (2)
for each data instance, its distance to its closest cluster centroid is calculated as its anomaly score.
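A minimal sketch of this scoring scheme using scikit-learn's KMeans is shown below; the number of clusters is an assumed parameter.

import numpy as np
from sklearn.cluster import KMeans

def kmeans_anomaly_scores(X, n_clusters=8):
    # Step 1: iterative refinement assigns each instance to its nearest centroid.
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(X)
    # Step 2: the distance to the closest cluster centroid is the anomaly score.
    nearest_centroids = km.cluster_centers_[km.labels_]
    return np.linalg.norm(X - nearest_centroids, axis=1)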

2.3.2 Principal component analysis


Principal component analysis (PCA) based anomaly detection assumes that data can be projected from a high
dimensional feature space to a lower dimensional subspace in which normal instances and anomalies appear
significantly different. We intend to capture the largest cumulative proportion of the total sample data
variance as well as the likelihood of a data instance’s departure from the data set’s correlation structure, which
is indicated by the last few components of the PCA (Shyu et al, 2003; Jolliffe, 2010). Let $X$ be a $p$-dimensional
dataset. Its covariance matrix $\Sigma$ can be decomposed as $\Sigma = P \times D \times P^{T}$, where $P$ is an orthonormal matrix
whose columns are the eigenvectors of $\Sigma$ and $D$ is the diagonal matrix containing the eigenvalues $\lambda_1, \dots, \lambda_p$.
The projection of the dataset onto the principal component space is given by $Y = X \times P$. This projection can be
performed with a reduced number of principal components. Let $Y^{j}$ be the projected dataset using the largest $j$
principal components, i.e., $Y^{j} = X \times P^{j}$. The reconstructed dataset from the projection space is given by
$R^{j} = (P^{j} \times (Y^{j})^{T})^{T}$, where $R^{j}$ is the reconstruction using the largest $j$ principal components. We then
define the outlier score of a point $X_i = (x_{i1}, \dots, x_{ip})$ as

$$S(X_i) = \lvert X_i - R^{j}_i \rvert \cdot \frac{\sum_{k=1}^{j} \lambda_k}{\sum_{k=1}^{p} \lambda_k}$$

where $R^{j}_i$ is the $i$th row of $R^{j}$. Note that $\frac{\sum_{k=1}^{j} \lambda_k}{\sum_{k=1}^{p} \lambda_k}$ is the percentage of variance explained by the largest $j$ principal components. With this
metric, large deviations in the large principal components are not heavily weighted (the weight is less than 1). On the
other hand, deviations in the last principal components are heavily weighted (the weight is close to 1). This approach
does not involve any parameters. We can set the outlier threshold based on the empirical distribution of $S(X_i)$
in the training data. We use PCC to represent this approach in our evaluation (Section 3).
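A compact NumPy sketch of this score follows; the choice of j (the number of retained components) is left to the caller, and the code reflects our reading of the formula above rather than the authors' implementation.

import numpy as np

def pcc_anomaly_scores(X, j):
    Xc = X - X.mean(axis=0)                        # centre the data
    eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    order = np.argsort(eigvals)[::-1]              # sort eigenpairs, largest first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    P_j = eigvecs[:, :j]                           # top-j principal directions
    R_j = (Xc @ P_j) @ P_j.T                       # project and reconstruct
    explained = eigvals[:j].sum() / eigvals.sum()  # variance explained by top j

    # S(X_i) = ||X_i - R_i|| weighted by the explained-variance fraction
    return np.linalg.norm(Xc - R_j, axis=1) * explained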

2.3.3 Replicator neural networks


The replicator neural network (Hawkins et al, 2002) uses a feed-forward multilayer perceptron to learn an
implicit, compressed model that minimizes the error between the input and the output of the neural network,
where the target output is the input itself. This model allows the reconstruction of the majority of the data
instances; data instances that cannot be accurately reconstructed then exhibit varying degrees of
outlierness. The anomaly score or outlier factor of the $i$th data instance is calculated as the mean square error:

$$OF_i = \frac{1}{n} \sum_{j=1}^{n} (x_{ij} - o_{ij})^2$$

where $x_{ij}$ is the input of the neural network ($i$th data instance, $j$th attribute), $o_{ij}$ is the corresponding output, and $n$ is the number of attributes. The larger the
value of the outlier factor, the more likely that the data instance is an anomaly. We use NN to represent this
approach in our evaluation (Section 3).
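As an illustration only (not the exact replicator network of Hawkins et al, which uses a specific staircase activation in its middle layer), a small autoencoder-style multilayer perceptron trained to reproduce its input yields the same per-instance outlier factor; the layer sizes are assumptions.

import numpy as np
from sklearn.neural_network import MLPRegressor

def replicator_outlier_factors(X, hidden_layers=(16, 4, 16)):
    # Train a feed-forward network whose target output equals its input.
    net = MLPRegressor(hidden_layer_sizes=hidden_layers, activation="tanh",
                       max_iter=2000, random_state=0)
    net.fit(X, X)
    X_hat = net.predict(X)
    # OF_i: mean squared reconstruction error over the n attributes of instance i.
    return np.mean((X - X_hat) ** 2, axis=1)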

2.3.4 One-class support vector machine


One-class support vector machines (OCSVMs) (Schölkopf et al, 1999; Tax and Duin, 2004) attempt to learn a
decision boundary that separates all the data points from the origin in feature space and achieves the
maximum separation between the points and the origin. This results in a binary function which captures
regions in the input space where the probability density of the data lives. A one-class SVM uses an implicit
transformation function $\phi(\cdot)$ defined by the kernel to project the data into a higher dimensional space. The
algorithm then learns the decision boundary (a hyperplane) that separates the majority of the data from the
origin. Only a small fraction of data points (considered as outliers) are allowed to lie on
the other side of the decision boundary. A one-class SVM can also be used in an unsupervised setup when it is
difficult to ascertain complete normalcy; training and testing are then applied on the same data. Unfortunately,
training on a dataset that already contains anomalies may not result in a good model. Let the function $g(\cdot)$ be
defined as follows:

$$g(x) = w^{T} \phi(x) - \rho$$

where $w$ is the vector perpendicular to the decision boundary and $\rho$ is the bias term. The decision function
that one-class SVMs use to identify normal points is given by $f(x) = \mathrm{sgn}(g(x))$. The primary objective of one-class
SVMs is

$$\min_{w,\, \xi,\, \rho} \; \frac{\lVert w \rVert^{2}}{2} - \rho + \frac{1}{\nu n} \sum_{i=1}^{n} \xi_i$$

subject to $w^{T} \phi(x_i) \ge \rho - \xi_i$, $\xi_i \ge 0$, where $\xi_i$ is the slack variable for point $i$ that allows it to lie on the other
side of the decision boundary, $n$ is the size of the training dataset, and $\nu$ is the regularization parameter; it
represents an upper bound on the fraction of outliers and a lower bound on the fraction of support vectors.
Varying $\nu$ controls the trade-off between $\xi$ and $\rho$. We can also compute an anomaly score such that a larger
score corresponds to a more significant anomaly. A possible way to compute such a score is

$$f(x) = \frac{g_{\max} - g(x)}{g_{\max}}$$

where $g_{\max}$ is the maximum directed distance between the dataset points and the decision boundary. The
score is scaled by that distance such that points lying on the decision boundary have an
outlier score of 1.0. A score larger than 1.0 indicates that the data instance is an anomaly. This allows ranking
the anomalies, which is often essential in an unsupervised anomaly detection setup. The score also has a
reference point, which means that scores in the range [0, 1] can be considered normal.
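A hedged sketch using scikit-learn's OneClassSVM with the score normalization described above; here decision_function plays the role of g(x), and the nu and kernel settings are assumed values.

import numpy as np
from sklearn.svm import OneClassSVM

def ocsvm_anomaly_scores(X, nu=0.01):
    # Unsupervised setup: train and score on the same (possibly contaminated) data.
    oc = OneClassSVM(kernel="rbf", nu=nu, gamma="scale").fit(X)
    g = oc.decision_function(X)    # signed distance g(x) to the learned boundary
    g_max = g.max()                # maximum directed distance among the points
    # f(x) = (g_max - g(x)) / g_max: near 0 for typical points, >= 1 for likely anomalies
    return (g_max - g) / g_max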

2.4 Anomaly detection ensemble

2.4.1 Unsupervised ensemble


We adapt the GMM (Gaussian Mixture Model) to produce an unsupervised ensemble predictive model from
multiple unsupervised anomaly detection algorithms. From prior knowledge, the proportion of anomalous data is
known. Assuming that the unsupervised anomaly detection algorithms work reasonably well, the majority of that
proportion of data instances, ranked by anomaly score as the most likely to be abnormal, are anomalous.
A GMM is used to model these data. A Gaussian mixture model is a weighted sum of $M$ component Gaussian
densities, given by

$$p(x \mid \lambda) = \sum_{i=1}^{M} w_i \, g(x \mid \mu_i, \Sigma_i)$$

where $x$ is a $D$-dimensional continuous-valued data vector (i.e., features), $w_i$, $i = 1, \dots, M$, are the mixture
weights, and $g(x \mid \mu_i, \Sigma_i)$, $i = 1, \dots, M$, are the component Gaussian densities. Each component density is a $D$-variate
Gaussian function of the form

$$g(x \mid \mu_i, \Sigma_i) = \frac{1}{(2\pi)^{D/2} \lvert \Sigma_i \rvert^{1/2}} \exp\left\{ -\frac{1}{2} (x - \mu_i)' \Sigma_i^{-1} (x - \mu_i) \right\}$$

with mean $\mu_i$ and covariance matrix $\Sigma_i$. The mixture weights satisfy the constraint $\sum_{i=1}^{M} w_i = 1$. The
expectation maximization (EM) algorithm is employed to simultaneously estimate the missing labels and the
parameters $\lambda = \{w_i, \mu_i, \Sigma_i\}$, $i = 1, \dots, M$. Due to the space limit, the algorithm details are not presented.
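A minimal sketch of this unsupervised ensemble step with scikit-learn's GaussianMixture follows; how the presumed-anomalous subset is selected (averaging standardized scores) and the contamination fraction are assumptions on our part, not details given above.

import numpy as np
from sklearn.mixture import GaussianMixture

def unsupervised_ensemble_likelihood(score_matrix, contamination=0.01, n_components=10):
    # score_matrix: one column of anomaly scores per unsupervised detector.
    S = (score_matrix - score_matrix.mean(axis=0)) / score_matrix.std(axis=0)
    combined = S.mean(axis=1)                      # simple combination of detector scores
    n_anom = max(n_components, int(contamination * len(S)))
    presumed_anomalies = S[np.argsort(-combined)[:n_anom]]

    # Model the presumed anomalous region with a GMM fitted by EM.
    gmm = GaussianMixture(n_components=n_components, covariance_type="full",
                          random_state=0).fit(presumed_anomalies)
    # Higher log-likelihood under this model means more likely to be anomalous.
    return gmm.score_samples(S)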

2.4.2 Supervised ensemble


An alternative approach that we propose is to use a supervised learning algorithm for building an ensemble of
unsupervised anomaly detection algorithms, given that a human analyst is consulted to provide ground truth
(labels) for at least some selected anomalies reported by the anomaly detection predictive model. Similar to
the unsupervised ensemble case, we train a GMM to model the anomalous data in a training set provided by
the analyst. We then use this model to obtain the likelihood of the data instances in a test set being anomalous.
A threshold can be set to distinguish normal instances from anomalies.
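A corresponding sketch of this supervised variant, assuming analyst labels are available for a training split; the component count and the threshold choice are left as assumptions.

import numpy as np
from sklearn.mixture import GaussianMixture

def supervised_ensemble_scores(train_scores, train_labels, test_scores, n_components=10):
    # Fit the GMM only on score vectors the analyst confirmed as anomalous (label == 1).
    anomalous = train_scores[np.asarray(train_labels) == 1]
    gmm = GaussianMixture(n_components=min(n_components, len(anomalous)),
                          covariance_type="full", random_state=0).fit(anomalous)
    return gmm.score_samples(test_scores)  # threshold this likelihood to flag anomalies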

3. Performance evaluation

3.1 Metrics
Receiver Operating Characteristic (ROC) curves and Precision-Recall Curves (PRC) are used to measure performance
in our evaluation experiments. The ROC curve shows the relationship between the TPR (True Positive Rate) and the FPR (False
Positive Rate). The PRC depicts how precision varies as recall increases. For anomaly detection, since the anomalous
classes are severely imbalanced compared with the class of normal data instances, it has been argued that the PRC
is more significant and effective than the ROC for performance comparison (Davis and Goadrich, 2006).
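Both curves can be computed directly from anomaly scores and ground-truth labels, e.g. with scikit-learn; the toy arrays below are hypothetical and only illustrate the calls.

import numpy as np
from sklearn.metrics import roc_curve, precision_recall_curve, auc

# Hypothetical example: y_true marks anomalous records (1) vs normal (0),
# scores are the anomaly scores produced by an ensemble model.
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 1, 0, 1])
scores = np.array([0.1, 0.2, 0.15, 0.3, 0.05, 0.4, 0.2, 0.9, 0.35, 0.7])

fpr, tpr, _ = roc_curve(y_true, scores)                        # ROC: TPR vs FPR
precision, recall, _ = precision_recall_curve(y_true, scores)  # PRC: precision vs recall
print("ROC AUC:", auc(fpr, tpr), "PR AUC:", auc(recall, precision))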

3.2 Datasets
The NSL-KDD dataset (Tavallaee et al, 2009) is an improved version of the KDD99 dataset, which has been widely used for the evaluation of
network intrusion detection due to the lack of publicly available datasets. There are three subsets in the NSL-KDD
dataset: KDDTrain, KDDTest, and KDDTest-21. They do not share the same distribution. For our evaluation,
training data and test data are generated from the KDDTrain dataset. The total number of normal records is
67343 and the total number of attack records is 58630. Each record contains 41 features, including 34 numeric
and 7 symbolic features. The 34 numeric features are used in our experiments. To simulate the class imbalance typical of
anomaly detection, the proportion of attack records is kept at less than 1% in each data set. To make full use
of all the data, normal records in KDDTrain are split into 5 equal parts and attack records are split into 500
equal parts. 13468 normal records and 117 attack records are combined to form one subset of data for
training or testing (with a total of 500 subsets). As a result, the proportion of attack records is 0.86% in each
data set.
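The subset construction just described can be sketched as follows; the random seed is an assumption, and normal_records and attack_records stand for the corresponding NSL-KDD numeric feature matrices.

import numpy as np

def build_subsets(normal_records, attack_records, n_normal_parts=5,
                  n_attack_parts=500, seed=0):
    rng = np.random.default_rng(seed)
    normal_parts = np.array_split(rng.permutation(normal_records), n_normal_parts)
    attack_parts = np.array_split(rng.permutation(attack_records), n_attack_parts)
    # Pair each of the 500 attack parts with one of the 5 normal parts (round-robin),
    # giving ~13468 normal + ~117 attack records (~0.86% attacks) per subset.
    return [np.vstack([normal_parts[i % n_normal_parts], part])
            for i, part in enumerate(attack_parts)]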
For the unsupervised ensemble method, each subset is used to run the algorithm independently. The
experiment is repeated 500 times and the average results are reported. For the supervised ensemble method,
the 5 equal parts of normal records and one part of attack records form 5 data sets for training and testing.
The experiment is repeated 100 times and the average results are reported.

(a) Unsupervised ensemble                        (b) Supervised ensemble
Figure 3: GMM density estimation and likelihood for unsupervised ensemble and supervised ensemble
methods: results from one subset of data

3.3 Results
Normalization is performed on the anomaly scores produced by the unsupervised anomaly detection algorithms
before learning an ensemble predictive model. Attacks fall into 4 main categories and 24 attack types.
The number of GMM components is set to 10. To fully specify the GMM, the parameters $\lambda = \{w_i, \mu_i, \Sigma_i\}$, $i = 1, \dots, M$,
need to be estimated. Due to limited data, parameters are tied among the Gaussian components. The
covariance matrix is not diagonal because the anomaly scores from the unsupervised anomaly detection
algorithms are not independent. Contour lines are plotted to show the GMM density estimation and the likelihood of
being anomalous for both the unsupervised and supervised ensemble methods. For each method, to illustrate the
component effect, we show in Figure 3 the cross-section views, which are set to the mean of the GMM means of the
corresponding axis.
From Figure 3(a), we can see that for the unsupervised ensemble method, the GMM depicts the anomalous data
(in black) distribution reasonably well. However, the supervised ensemble method appears to depict the
anomaly distribution more accurately due to the use of labelled training data for ensemble learning.

(a) PRC                        (b) ROC
Figure 4: PRC and ROC for the unsupervised ensemble method

Figure 4 shows the comparison among the unsupervised ensemble method, PCC, NN, and OCSVM in terms of PRC
and ROC. As discussed above, the PRC is more significant for class imbalance problems. For recall values between
0.6 and 0.8, the precision of the unsupervised ensemble method is close to that of PCC. However, when recall
is higher than 0.8, the precision of the unsupervised ensemble method is much higher than that of PCC, NN, or
OCSVM.

(a) PRC                        (b) ROC
Figure 5: PRC and ROC for the supervised ensemble method

Figure 5 shows the comparison among the supervised ensemble method, PCC, NN, and OCSVM in terms of PRC
and ROC. Figure 5(a) shows that the supervised ensemble method performs the best when recall is less than
0.9. In Figure 5(b) we can see that the true positive rate (which is equal to recall) is below 0.9 when the false
positive rate is less than 0.8. From Figures 4 and 5, we can see that the supervised ensemble method
outperforms the unsupervised ensemble method in general.
4. Related work
A few recent surveys of anomaly detection have discussed the taxonomy and the pros and cons of various types of
anomaly detection approaches; for details, refer to (Bhuyan et al, 2014; Chandola, Banerjee and Kumar, 2009).
Ensemble learning for anomaly detection is a relatively unexplored area of research (Aggarwal, 2013) with the
potential of producing more effective methods than individual anomaly detection algorithms. However,
significant challenges exist in terms of how to combine incomparable anomaly scores from different models.

5. Conclusions
In this work, we have presented a scalable distributed framework to collect and process extreme-scale
networking and computing system traffic and status data from multiple sources, and to develop real-time adaptive data
analytics for anomaly detection. The data analytics integrates multiple sophisticated machine learning
algorithms and human-in-the-loop feedback for iterative ensemble learning. Our performance studies show that the
ensemble methods outperform individual anomaly detection algorithms, in some cases significantly, and that supervised
ensemble learning results in better performance in general.

References
Aggarwal, C. (2013) “Outlier Ensembles: Position Paper”, SIGKDD Explor. Newsl., Vol. 14, No. 2, pp 49-58.
Agneeswaran, V.S. (2014) Big Data Analytics Beyond Hadoop, Pearson Education, Inc.
Apache Hadoop. Available: http://hadoop.apache.org/
Apache Hbase. Available: https://hbase.apache.org/
Apache Kafka. Available: http://kafka.apache.org/
Apache Mahout. Available: https://mahout.apache.org/
Apache Spark. Available: http://spark.apache.org/
Apache Spark streaming. Available: http://spark.apache.org/streaming/
Bhuyan, M.H., Bhattacharyya, D.K., and Kalita, J.K. (2014) “Network Anomaly Detection: Methods, Systems and
Tools”, IEEE Communications Surveys & Tutorials, Vol. 16, No. 1, pp 303-336.
Chandola, V., Banerjee, A., and Kumar, V. (2009) “Anomaly Detection: A Survey”, ACM Computing Surveys, Vol.
41, No. 3.
Chang, F., Dean, J., Ghemawat, S., Hsieh, W.C., Wallach, D.A., Burrows, M., Chandra, T., Fikes, A. and Gruber,
R.E. (2008) “Bigtable: A distributed storage system for structured data,” ACM Transactions on Computer
Systems (TOCS), Vol. 26, No. 2, pp 1-26.
Davis, J. and Goadrich, M. (2006) “The Relationship between Precision-Recall and ROC Curves”, Proceedings of
the 23rd international conference on Machine learning, pp 233-240.
Dean, J. and Ghemawat, S. (2008) “Mapreduce: simplified data processing on large clusters,” Communications
of the ACM, Vol. 51, No. 1, pp 107–113.
Gao, J. and Tan, P.N. (2006) “Converting output scores from outlier detection algorithms into probability
estimates”, Proceedings of Sixth International Conference on Data Mining, pp 212-221.
Hawkins, S., He, H., Williams, G.J., and Baxter, R.A. (2002) “Outlier Detection using Replicator Neural
Networks”, Proceedings of the 4th International Conference on Data Warehousing and Knowledge Discovery, pp
170-180.
Jolliffe, I.T. (2010) Principal Component Analysis, Second Edition, Springer.
Ryza, S., Laserson, U. Owen, S. and Wills, J. (2015) Advanced Analytics with Spark, O’Reilly.
Schölkopf, B., Williamson, R., Smola, A., Shawe-Taylor, J., and Platt, J. (1999) "Support Vector Method for
Novelty Detection", NIPS. Vol. 12. pp 582-588.
Shyu, M., Chen, S.C., Sarinnapakorn, K, and Chang, L. (2003) “A Novel Anomaly Detection Scheme based on
Principal Component Classifier”, Proceedings of the IEEE Foundations and New Directions of Data Mining
Workshop, the Third International Conference on Data Mining, pp 172-179.
Solaimani, M., Iftekhar, M., Khan, L., Thuraisingham, B. and Ingram, J.B. (2014) “Spark-based anomaly detection
over multi-source VMware performance data in real-time,” Proceedings of 2014 IEEE Symposium on
Computational Intelligence, Orlando, FL USA.
Solaimani, M., Khan, L. and Thuraisingham, B. (2014) “Real-time anomaly detection over VMware performance
data using storm,” IEEE International Conference on Information Reuse and Integration, San Francisco, USA.
Storm - distributed and fault-tolerant real-time computation. Available: http://storm.incubator.apache.org/
S4. Available: http://incubator.apache.org/
Tavallaee, M., Bagheri, E., Lu, W. and Ghorbani, A.A. (2009) “A Detailed Analysis of the KDD CUP 99 Data Set”,
Proceedings of the 2009 IEEE Symposium on Computational Intelligence in Security and Defense Applications,
pp 53-58.
Tax, D. and Duin, R. (2004) "Support vector data description." Machine learning, Vol. 54, No. 1, pp 45-66.
Zaharia, M., Chowdhury, M., Das, T., Dave, A., Ma, J., McCauley, M., Franklin, M.J., Shenker, S., and Stoica, I.
(2012) “Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing,” 9th
USENIX conference on Networked Systems Design and Implementation, pp 2–2.
Zaharia, M., Das, T., Li, H., Shenker, S., and Stoica, I. (2012) “Discretized streams: an efficient and fault-tolerant
model for stream processing on large clusters,” 4th USENIX conference on Hot Topics in Cloud Computing, pp
10–10.
