IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING,
VOL. 19,
NO. 5,
MAY 2007
Conditional Anomaly Detection
Xiuyao Song, Mingxi Wu, Christopher Jermaine, and Sanjay Ranka, Fellow, IEEE
Abstract—When anomaly detection software is used as a data analysis tool, finding the hardest-to-detect anomalies is not the most
critical task. Rather, it is often more important to make sure that those anomalies that are reported to the user are in fact interesting. If
too many unremarkable data points are returned to the user labeled as candidate anomalies, the software will soon fall into disuse.
One way to ensure that returned anomalies are useful is to make use of domain knowledge provided by the user. Often, the data in
question includes a set of environmental attributes whose values a user would never consider to be directly indicative of an anomaly.
However, such attributes cannot be ignored because they have a direct effect on the expected distribution of the result attributes
whose values can indicate an anomalous observation. This paper describes a general purpose method called conditional anomaly
detection for taking such differences among attributes into account, and proposes three different expectation-maximization algorithms
for learning the model that is used in conditional anomaly detection. Experiments with more than 13 different data sets compare our
algorithms with several other more standard methods for outlier or anomaly detection.
Index Terms—Data mining, mining methods and algorithms.
1 INTRODUCTION
Anomaly detection has been an active area of computer
science research for a very long time (see the recent
survey by Markou and Singh [1]). Applications include
medical informatics [2], computer vision [3], [4], computer
security [5], sensor networks [6], general-purpose data
analysis and mining [7], [8], [9], and many other areas.
However, in contrast to problems in supervised learning
where studies of classification accuracy are the norm, little
research has systematically addressed the issue of accuracy
in general-purpose unsupervised anomaly detection methods. Papers have suggested many alternate problem
definitions that are designed to boost the chances of finding
anomalies (again, see Markou and Singh’s survey [1]), but
there have been few systematic attempts to maintain high
coverage at the same time that false positives are kept to a
minimum.
Accuracy in unsupervised anomaly detection is important because, if used as a data mining or data analysis
tool, an unsupervised anomaly detection methodology will
be given a “budget” of a certain number of data points
that it may call anomalies. In most realistic scenarios, a
human being must investigate candidate anomalies reported by an automatic system, and usually has a limited
capacity to do so. This naturally limits the number of
candidate anomalies that a detection methodology may
usefully produce. Given that this number is likely to be
small, it is important that most of those candidate
anomalies are interesting to the end user.
. The authors are with the Computer and Information Sciences and
Engineering Department, University of Florida, CSE Building Room
E301, Gainesville, FL 32611.
E-mail: {xsong, mwu, cjermain, ranka}@cise.ufl.edu.
Manuscript received 25 Aug. 2005; revised 2 Apr. 2006; accepted 25 Oct.
2006; published online 29 Jan. 2007.
For information on obtaining reprints of this article, please send e-mail to:
tkde@computer.org, and reference IEEECS Log Number TKDE-0334-0805.
Digital Object Identifier no. 10.1109/TKDE.2007.1009.
1041-4347/07/$25.00 © 2007 IEEE. Published by the IEEE Computer Society.
This is especially true in applications where anomaly
detection software is used to monitor incoming data in
order to report anomalies in real time. When such events
are detected, an alarm is sounded that requires immediate
human investigation and response. Unlike the offline case
where the cost and frustration associated with human
involvement can usually be amortized over multiple alarms
in each batch of data, each false alarm in the online case will
likely result in an additional notification of a human expert,
and the cost cannot be amortized.
1.1 Conditional Anomaly Detection
Taking into account such considerations is the goal of the
research described in this paper. Rather than trying to find
new and intriguing classes of anomalies, it is perhaps more
important to ensure that those data points that a method
does find are, in fact, surprising. To accomplish this, we ask
the questions: What is the biggest source of inaccuracy for
existing anomaly detection methods? Why might they
return a large number of points that are not anomalies? To
answer, we note that, by definition, “statistical” methods for
anomaly detection look for data points that refute a standard
null hypothesis asserting that all data were produced by the
same generative process. The null hypothesis is represented
either explicitly (as in parametric methods [10], [11], [12]) or
implicitly (as in various distance-based methods [9], [7], [8]).
However, the questionable assumption made by virtually all
existing methods is that there is no a priori knowledge
indicating that all data attributes should not be treated in the
same fashion.
This is an assumption that is likely to cause problems
with false positives in many problem domains. In almost
every application of anomaly detection, there are likely to
be several data attributes that a human being would never
consider to be directly indicative of an anomaly. By
allowing an anomaly detection methodology to consider
such attributes equally, accuracy may suffer.
For example, consider the application of online anomaly
detection to syndromic surveillance, where the goal is to
Fig. 1. Syndromic surveillance application.
detect a disease outbreak at the earliest possible instant.
Imagine that we monitor two variables: max_daily_temp and
num_fever. max_daily_temp tells us the maximum outside
temperature on a given day, and num_fever tells us how
many people were admitted to a hospital emergency room
complaining of a high fever. Clearly, max_daily_temp should
never be taken as direct evidence of an anomaly. Whether it
was hot or cold on a given day should never directly
indicate whether or not we think we have seen the start of
an epidemic. For example, if the high in Gainesville,
Florida, on a given June day was only 70 degrees Fahrenheit
(when the average high temperature is closer to 90 degrees),
we simply have to accept that it was an abnormally cool
day, but this does not indicate in any way that an outbreak
has occurred.
While the temperature may not directly indicate an
anomaly, it is not acceptable to simply ignore max_daily_
temp, because num_fever (which is clearly of interest in
detecting an outbreak) may be directly affected by max_daily_temp, or by a hidden variable whose value can easily be
deduced from the value of the max_daily_temp attribute. In this
example, we know that people are generally more susceptible to illness in the winter, when the weather is cooler. We
call attributes such as max_daily_temp environmental attributes. The remainder of the data attributes (which the user
would consider to be directly indicative of anomalous data)
are called indicator attributes.
The anomaly detection methodology considered in this
paper, called conditional anomaly detection, or CAD, takes
into account the difference between the user-specified
environmental and indicator attributes during the anomaly
detection process and how this affects the idea of an
“anomaly.” For example, consider Fig. 1. In this figure,
Point A and Point B are both anomalies or outliers based on
most conventional definitions. However, if we make use of
the additional information that max_daily_temp is not
directly indicative of an anomaly, then it is likely safe for
an anomaly detection system to ignore Point A. Why? If we
accept that it is a cold day, then encountering a large
number of fever cases makes sense, reducing the interest of
this observation. For this reason, the CAD methodology will
only label Point B an anomaly.
The methodology we propose works as follows: We
assume that we are given a baseline data set and a user-defined partitioning of the data attributes into environmental
attributes and indicator attributes based upon which
attributes the user decides should be directly indicative of
an anomaly. Our method then analyzes the baseline data and
learns which values are usual or typical for the indicator
attributes. When a subsequent data point is observed, it is
labeled anomalous or not depending on how much its
indicator attribute values differ from the usual indicator
attribute values. However, the environmental attributes are
not necessarily ignored because it may be that indicator
attribute values are conditioned on environmental attribute
values. If no such relationships are present in the baseline
data, then the CAD methodology recognizes this and will
effectively ignore the environmental attributes. In this case,
or in the case when the user is unable or unwilling to
partition the attributes (and, so, by default, all attributes are
indicator attributes), the CAD methodology is equivalent to
simply performing anomaly detection only on the indicator
attributes. If such relationships are present, then the CAD
methodology learns them and takes them into account.
1.2 Our Contributions
In addition to describing the parametric CAD methodology,
this paper describes three EM-based [13] learning algorithms for the parametric CAD model. This paper describes
a rigorous testing protocol and shows that, if the true
definition of an “anomaly” is a data point whose indicator
attributes are not in keeping with its environmental
attributes, then the CAD methodology does indeed increase
accuracy. By comparing the CAD methodology and its three
learning algorithms against several conventional anomaly
detection methods on 13 different data sets, this paper
shows that the CAD methodology is significantly better at
recognizing data points where there is a mismatch among
environmental and indicator attributes and at ignoring
unusual indicator attribute values that are still in keeping
with the observed environmental attributes. Furthermore,
this effect is shown to be statistically significant, and not just
an artifact of the number of data sets chosen for testing.
1.3 Paper Organization
The remainder of the paper is organized as follows: In
Section 2, we describe the statistical model that we use to
model environmental and indicator attributes and the
relationships between them. Section 2 also details how this
model can be used to detect anomalies. Section 3 describes
how to learn this model from an existing data set.
Experimental results are given in Section 4, related work
in Section 5, and the paper is concluded in Section 6. The
Appendix gives a mathematical derivation of our learning
algorithms.
2 STATISTICAL MODEL
This section describes the statistical model we make use of
for reducing the false positive rate during online anomaly
detection. Like other parametric methods (such as those
enumerated in Section 2.1 of Markou and Singh’s survey
[1]), the detection algorithms described in this paper make
use of a two-step process:
1. First, existing data are used to build a statistical model that captures the trends and relationships present in the data.
2. Then, during the online anomaly detection process, future data records are checked against the model to see if they match with what is known about how various data attributes behave individually and how they interact with one another.
Like many other statistical methods, our algorithms rely
on the basic principle of maximum likelihood estimation (MLE)
[14]. In MLE, a (possibly complex) parametric distribution f
is first chosen to represent the data. f is then treated as a
generative model for the data set in question. That is, we
make the assumption that our data set of size n was, in fact,
generated by drawing n independent samples from f.
Given this assumption, we adjust the parameters governing
the behavior of f so as to maximize the probability that f
would have produced our data set.
Formally, to make use of MLE, we begin by specifying a
distribution $f(x \mid \theta_1, \theta_2, \ldots, \theta_k)$, where the probability density
function $f$ gives the probability that a single experiment or
sample from the distribution would produce the outcome $x$.
The parameter $\Theta = \langle \theta_1, \theta_2, \ldots, \theta_k \rangle$ governs the specific
characteristics of the distribution (such as the mean and
variance).
Since we are trying to model an entire data set, we do
not have only a single experiment $x$; rather, we view our
data set $X = \langle x_1, x_2, \ldots, x_n \rangle$ as the result of performing
$n$ experiments over $f$. Assuming that these experiments
are independent, the likelihood that we would have
observed $X = \langle x_1, x_2, \ldots, x_n \rangle$ is given by:

$$L(X \mid \Theta) = \prod_{k=1}^{n} f(x_k \mid \Theta).$$

The MLE problem can then be succinctly stated as: Given
$f$ and $X$, can we solve for $\Theta$ so as to maximize $L(X \mid \Theta)$ over
all possible values of $\Theta$? Once we have solved for $\Theta$, then
we have a complete model for our data set.
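As a concrete sketch of MLE (our toy example, not from the paper): for a one-dimensional Gaussian, the likelihood-maximizing parameters are available in closed form, namely the sample mean and the (biased) sample variance, and any other parameter choice yields a lower likelihood.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=10_000)  # n independent samples from f

# Closed-form MLE for a Gaussian: sample mean and (biased) sample variance
mu_hat = X.mean()
var_hat = ((X - mu_hat) ** 2).mean()

def log_likelihood(X, mu, var):
    """log L(X | mu, var) under a univariate Gaussian."""
    return np.sum(-0.5 * np.log(2 * np.pi * var) - (X - mu) ** 2 / (2 * var))

# Perturbing the parameters away from the MLE can only lower the likelihood
assert log_likelihood(X, mu_hat, var_hat) >= log_likelihood(X, mu_hat + 0.5, var_hat)
```

For richer models, such as the mixtures used below, no such closed form exists, which is what motivates the iterative algorithms of Section 3.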
2.1 The Gaussian Mixture Model
One of the common applications of MLE (particularly in
machine learning) is to fit a multidimensional data set to a
model described using the PDF fGMM (fGMM refers to the
PDF for a Gaussian Mixture Model; a GMM is essentially a
weighted mixture of multidimensional normal variables
and is widely used due to its ability to closely model even
difficult, nonparametric distributions). A GMM can very
easily be used as the basis for anomaly detection and is the
most ubiquitous parametric model used for the purpose
(see Section 2.1 of Markou and Singh [1]). The overall
process begins by first performing the MLE required to fit
the GMM to the existing data, using one of several
applicable optimization techniques.
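Once fitted, such a model ranks points from least to most typical simply by evaluating the mixture density. A minimal sketch (ours, with toy parameters) of evaluating $f_{GMM}$ as a weighted sum of multivariate normal densities:

```python
import numpy as np

def gaussian_pdf(x, mu, cov):
    """Multivariate normal density f_G(x | mu, cov)."""
    d = len(mu)
    diff = x - mu
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(cov))
    return np.exp(-0.5 * diff @ np.linalg.inv(cov) @ diff) / norm

def gmm_pdf(x, weights, mus, covs):
    """f_GMM(x) = sum_i p_i * f_G(x | mu_i, Sigma_i)."""
    return sum(w * gaussian_pdf(x, m, c) for w, m, c in zip(weights, mus, covs))

# A two-component toy mixture
weights = [0.7, 0.3]
mus = [np.zeros(2), np.array([4.0, 4.0])]
covs = [np.eye(2), np.eye(2)]

# Points near a component have much higher density than points far from both,
# so a low f_GMM value marks a candidate anomaly
assert gmm_pdf(np.zeros(2), weights, mus, covs) > gmm_pdf(np.array([10.0, -10.0]), weights, mus, covs)
```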
Of course, the problem described in the introduction of
this paper is that, if we allow all attributes to be treated in
the same fashion (as all existing parametric methods seem
to do), then false positives may become a serious problem.
Our solution to this problem is to define a generative
statistical model encapsulated by a PDF fCAD that does not
treat all attributes identically. This model is described in
detail in the next section.
2.2 Conditional Anomaly Detection
$f_{CAD}$ can be described intuitively as follows: We assume
that each data point is an ordered pair of two sets of
attribute values $(x, y)$, where $x$ is the set of environmental
attribute values and $y$ is the set of indicator attribute values.
Our PDF will be of the form $f_{CAD}(y \mid \Theta, x)$. The PDF $f_{CAD}$ is
conditioned on $x$, which implies that our generative model
does not generate the environmental attributes associated with a
data point. The PDF gives the likelihood that a single
experiment with input $x$ will give output $y$. Thus, a data
point's environmental attributes $x$ are taken as input and
used along with the model parameters $\Theta$ to generate the set
of indicator attributes $y$. The net result is that, when we
perform an MLE to fit our model to the data, we will learn
how the environmental attributes map to the indicator
attributes. Formally, our model uses the following three sets
of parameters, which together make up the parameter set $\Theta$:
- A GMM $U$ consisting of $n_U$ Gaussians (or multidimensional normal "point clouds"), each of dimensionality $d_U$. The purpose of $U$ is to model the distribution of the data set's environmental attributes. The $i$th Gaussian in $U$ is denoted by $U_i$ and the weight of the $i$th Gaussian is denoted by $p(U_i)$. The distribution parameters of the $i$th Gaussian in $U$ are given by $\theta_{U_i} = \langle \mu_{U_i}, \Sigma_{U_i} \rangle$.
- A set of $n_V$ additional Gaussians of dimensionality $d_V$, which we refer to collectively as $V$. Each Gaussian in $V$ models a portion of the data space for the indicator attributes. The distribution parameters of the $j$th Gaussian in $V$ are given by $\theta_{V_j} = \langle \mu_{V_j}, \Sigma_{V_j} \rangle$.
- A probabilistic mapping function $p(V_j \mid U_i)$. Given a Gaussian $U_i \in U$, this function gives the probability that $U_i$ maps to $V_j$. In other words, this function gives the probability that a tuple's indicator attributes are produced by the Gaussian $V_j$, if we know that the tuple's environmental attributes were produced by the Gaussian $U_i$.
Given these parameters, we assume the following generative process for each data point in the historical data set:
1. The process begins with the assumption that the values for a data point's environmental attributes were produced by a sample from $U$. Let $U_i$ denote the Gaussian in $U$ which was sampled. Note that the generative process does not actually produce $x$; it is assumed that $x$ was produced previously and the Gaussian which produced $x$ is given as input.
2. Next, we use $U_i$ to toss a set of "dice" to determine which Gaussian from $V$ will be used to produce $y$. The probability of choosing $V_j$ is given directly by the mapping function and is $p(V_j \mid U_i)$.
3. Finally, $y$ is produced by taking a single sample from the Gaussian that was randomly selected in Step 2. The process of generating a value for $y$ given a value for $x$ is illustrated in Fig. 2.
Fig. 2. The generative model used in conditional anomaly detection.
Given x, a vector of environmental attribute values, we first determine
which Gaussian from U generated x (Step 1). Next, we perform a
random trial using the mapping function to see which Gaussian from V
we map to (Step 2). In our example, we happen to choose V_3, which has
a probability 0.6 of being selected given that x is from U_2. Finally, we
perform a random trial over the selected Gaussian to generate y
(Step 3).

It is important to note that, while our generative model
assumes that we know which Gaussian was used to
generate $x$, in practice, this information is not included in
the data set. We can only infer this by computing a
probability that each Gaussian in $U$ was the one that
produced $x$. As a result, $f_{CAD}(y \mid \Theta, x)$ is defined as follows:

$$f_{CAD}(y \mid \Theta, x) = \sum_{i=1}^{n_U} p(x \in U_i) \sum_{j=1}^{n_V} f_G(y \mid \theta_{V_j})\, p(V_j \mid U_i),$$
where

- $p(x \in U_i)$ is the probability that $x$ was produced by the $i$th Gaussian in $U$ and is computed using Bayes' rule:
$$p(x \in U_i) = \frac{f_G(x \mid \theta_{U_i})\, p(U_i)}{\sum_{k=1}^{n_U} f_G(x \mid \theta_{U_k})\, p(U_k)}.$$
- $f_G(y \mid \theta_{V_j})$ is the likelihood that the $j$th Gaussian in $V$ would produce $y$ and is the standard Gaussian distribution (see the Appendix).
- $p(V_j \mid U_i)$ is the probability that the $i$th Gaussian from $U$ maps to the $j$th Gaussian from $V$. This is given directly as a parameter in $\Theta$.
Given the function $f_{CAD}(y \mid \Theta, x)$, the problem of modeling a data set $(X, Y)$ can be succinctly stated as:
Choose $\Theta$ so as to maximize

$$\Lambda = \sum_{k=1}^{n} \log f_{CAD}(y_k \mid \Theta, x_k)$$

over all possible values for $\Theta$. In this expression, $\Lambda$ is referred to as the log-likelihood of
the resulting model.
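To make the definition concrete, the double sum defining $f_{CAD}$ can be evaluated directly from the three parameter sets. The following sketch (ours, with toy one-dimensional parameters loosely echoing the syndromic surveillance example; the parameter values are invented for illustration) assumes nothing beyond the formulas above:

```python
import numpy as np

def f_gauss(v, mu, cov):
    """Multivariate normal density f_G(v | mu, cov)."""
    d = len(mu)
    diff = v - mu
    return np.exp(-0.5 * diff @ np.linalg.inv(cov) @ diff) / \
        np.sqrt((2 * np.pi) ** d * np.linalg.det(cov))

def f_cad(y, x, p_U, mu_U, cov_U, p_V_given_U, mu_V, cov_V):
    # p(x in U_i) via Bayes' rule over the environmental GMM
    lik = np.array([f_gauss(x, mu_U[i], cov_U[i]) * p_U[i] for i in range(len(p_U))])
    p_x_in_U = lik / lik.sum()
    # f_CAD(y | Theta, x) = sum_i p(x in U_i) * sum_j f_G(y | V_j) p(V_j | U_i)
    f_y = np.array([f_gauss(y, mu_V[j], cov_V[j]) for j in range(len(mu_V))])
    return sum(p_x_in_U[i] * np.dot(f_y, p_V_given_U[i]) for i in range(len(p_U)))

# Toy model: cold days (U_0) usually map to many fevers (V_1),
# warm days (U_1) usually map to few fevers (V_0)
p_U = np.array([0.5, 0.5])
mu_U = [np.array([50.0]), np.array([90.0])]       # environmental: temperature
cov_U = [np.array([[25.0]]), np.array([[25.0]])]
mu_V = [np.array([10.0]), np.array([40.0])]       # indicator: num_fever
cov_V = [np.array([[9.0]]), np.array([[9.0]])]
p_V_given_U = np.array([[0.1, 0.9],    # cold day -> mostly many fevers
                        [0.9, 0.1]])   # warm day -> mostly few fevers

# Many fevers on a cold day is expected; many fevers on a warm day is anomalous
cold = f_cad(np.array([40.0]), np.array([50.0]), p_U, mu_U, cov_U, p_V_given_U, mu_V, cov_V)
warm = f_cad(np.array([40.0]), np.array([90.0]), p_U, mu_U, cov_U, p_V_given_U, mu_V, cov_V)
assert cold > warm
```

Note how the same indicator value $y$ receives different densities depending on the environmental context $x$, which is exactly the behavior illustrated in Fig. 1.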
2.3 Detecting Conditional Anomalies
Like other parametric anomaly detection methods, the
methodology for detecting future anomalies using the
CAD model is based on the observation that, once an
optimal value for $\Theta$ has been learned, the function $f_{CAD}$ can
be used to give a meaningful ordering over all data points,
past and future, from the most astonishing to the most
typical. When a new point is observed that has a small
value for $f_{CAD}$, this means that it occurs in a portion of the
data space with low density, and it should be flagged as an
anomaly.
In order to determine the cutoff value for $f_{CAD}$, below
which a new point is determined to be anomalous, we allow
the user to first pick a fraction $\epsilon$ from 0 to 1, and then we
choose the $f_{CAD}$ value such that exactly a fraction $\epsilon$ of the
data points in the training set would have been flagged as
anomalies. A high value of $\epsilon$ means that the method is less
selective and, hence, it increases the chance that an
anomalous event will be flagged for further investigation.
However, at the same time, this will tend to increase the
rate of false positives.

Given $\epsilon$ and the size of the training data set $n$, in order to
choose a cutoff such that exactly a fraction $\epsilon$ of the training
points would be considered anomalous, we simply sort the
training data based upon their $f_{CAD}$ values, from lowest to
highest. Next, let $c = \epsilon n$. This is the index of the cutoff point
in the existing data set. Any future observation which is
found to be more unusual than the $c$th existing point will be
termed an anomaly. To check if a point $(x_{new}, y_{new})$ is
anomalous, we simply check if

$$f_{CAD}(y_{new} \mid \Theta, x_{new}) < f_{CAD}(y_c \mid \Theta, x_c).$$

If yes, then the new observation is flagged as an anomaly.
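The cutoff procedure amounts to taking an empirical quantile of the training scores. A sketch (ours), with uniform stand-in scores in place of real $f_{CAD}$ values:

```python
import numpy as np

def choose_cutoff(train_scores, eps):
    """Return the density value below which a fraction eps of training points fall."""
    scores = np.sort(train_scores)   # lowest (most anomalous) first
    c = int(eps * len(scores))       # index of the cutoff point
    return scores[c]

rng = np.random.default_rng(2)
train_scores = rng.uniform(0.0, 1.0, size=1000)  # stand-in for f_CAD values
cutoff = choose_cutoff(train_scores, eps=0.05)

# A new point is flagged iff its density falls below the cutoff;
# by construction, about 5 percent of the training set falls below it
flagged = np.mean(train_scores < cutoff)
assert abs(flagged - 0.05) < 0.01
```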
3 LEARNING THE MODEL
Of course, defining a model like $f_{CAD}$ is not enough. It is
also necessary to define a process whereby it is possible to
adjust the model parameters so that the model can be fitted
to the existing data. In general, this type of maximization is
intractable. This section defines several learning algorithms
for computing the model. The section begins with a high-level introduction to the Expectation Maximization [13]
framework that serves as a basis for all three of the
algorithms.
3.1 The Expectation Maximization Methodology
While EM is usually called an “algorithm,” in reality, it is a
generic technique for solving MLE problems. EM can either
be easy or nearly impossible to apply to a specific MLE
problem, but, in general, its application is nontrivial. We
now turn to a description of the EM methodology.
As mentioned previously, most interesting MLE problems are computationally intractable. In practice, this
intractability results from one or more "hidden" variables
that cannot be observed in the data and must be inferred
simultaneously with the MLE. In the conditional anomaly
detection problem, these hidden variables are the identities
of the Gaussians that were used to produce the environmental and indicator attributes of each data point in the
historical data set. The fact that these variables are not
observable makes the problem difficult; if these values were
included in the data, then solving for $\Theta$ would become a mostly
straightforward maximization problem relying on college-level calculus. The EM algorithm is useful in precisely such
cases. EM is an iterative method, whose basic outline is as
follows:
The Basic Expectation Maximization Algorithm
1) While the model continues to improve:
a) Let $\Theta$ be the current "best guess" as to the optimal configuration of the model.
b) Let $\bar{\Theta}$ be the next "best guess" as to the optimal configuration of the model.
c) E-Step: Compute $Q$, the expected value of $\Lambda$ with respect to $\bar{\Theta}$ over all possible values of the hidden parameters. The probability of observing each possible set of hidden parameter values (required to compute the expectation of $\Lambda$) is computed using $\Theta$.
d) M-Step: Choose $\bar{\Theta}$ so as to maximize the value for $Q$. $\bar{\Theta}$ then becomes the new "best guess."
EM sidesteps the problem of not knowing the values of
the hidden parameters by instead considering the expected
value of the PDF with respect to all possible values of the
hidden parameters. Since computing the expected value of
$\Lambda$ with respect to the hidden parameters requires that we be
able to compute the probability of every possible configuration of the hidden parameters (and to do this we need to
know the optimal configuration $\bar{\Theta}$), at each iteration, EM
computes this probability with respect to the best guess so
far for $\Theta$. Thus, EM does not compute a globally optimal
solution at each iteration; rather, it computes a best guess as
to the answer, which is guaranteed to improve at each
iteration. EM has the desirable property that, while it does
not always converge to the globally optimal solution, it does
always converge to a locally optimal solution, where no
combination of slight "tweaks" of the various values in $\Theta$
can improve the value of $\Lambda$.
3.2 The Direct-CAD Algorithm
The remainder of this section outlines three different
EM algorithms for learning the CAD model over a data
set. The first of the three algorithms attempts to learn all of
the parameters simultaneously: The Gaussians that govern
the generation of the environmental attributes, the Gaussians that govern the generation of the indicator attributes,
and the mapping function between the two sets of
Gaussians. This section only outlines the approach and
gives the equations that are used to implement the
algorithm; the complete mathematical derivation of the
EM algorithm is quite involved and is left to Appendix B.
As in any EM algorithm, the Direct-CAD algorithm relies on
the two classic steps: the E-Step and the M-Step.
3.2.1 E-Step
To begin the process of deriving an appropriate EM
algorithm, we first need to define the hidden parameters
in the context of the conditional anomaly detection
problem. In conditional anomaly detection, the hidden
parameters are the identities of the clusters from $U$ and $V$
that were used to produce each data point. To denote these
parameters, we define:

- $\gamma_k$ tells which cluster in $U$ produced the $k$th value in $X$. $\gamma_k^{(i)}$ denotes the assertion "the $k$th set of environmental attributes was produced by the $i$th cluster in $U$."
- $\lambda_k$ tells which cluster in $V$ produced the $k$th value in $Y$. $\lambda_k^{(j)}$ denotes the assertion "the $k$th set of indicator attributes was produced by the $j$th cluster in $V$."
Given these variables, we can then derive the $Q$ function
required by the E-Step of the EM algorithm. Let $Q(\bar{\Theta}, \Theta)$
denote the expected value of $\Lambda$ over all possible $\gamma, \lambda$ given
the current parameter values $\Theta$. Then, as derived in
Appendix B,

$$Q(\bar{\Theta}, \Theta) = E\left[\sum_{k=1}^{n} \log f_{CAD}(y_k, \gamma_k, \lambda_k \mid x_k, \bar{\Theta}) \,\Big|\, X, Y, \Theta\right]$$
$$= \sum_{k=1}^{n} \sum_{i=1}^{n_U} \sum_{j=1}^{n_V} \left[ \log f_G(y_k \mid \theta_{V_j}, \bar{\Theta}) + \log p(V_j \mid U_i, \bar{\Theta}) + \log f_G(x_k \mid \theta_{U_i}, \bar{\Theta}) + \log p(\bar{U}_i) - \log f(x_k \mid \bar{\Theta}) \right] b_{kij},$$

where

$$b_{kij} = \frac{f_G(x_k \mid \theta_{U_i})\, p(U_i)\, f_G(y_k \mid \theta_{V_j})\, p(V_j \mid U_i, \Theta)}{\sum_{t=1}^{n_U} \sum_{h=1}^{n_V} f_G(x_k \mid \theta_{U_t})\, p(U_t)\, f_G(y_k \mid \theta_{V_h})\, p(V_h \mid U_t, \Theta)}.$$
3.2.2 M-Step
In the M-Step, we now use standard calculus to maximize
the value of $Q$ over all $\bar{\Theta}$. This results in a series of equations
that can be used to compute the new parameter values that
will be used during the next iteration. The details of the
derivation of each of the equations are reasonably involved
and are given in the Appendix. However, the use of the
equations is relatively straightforward and requires only
that we perform standard arithmetic and linear algebra
operations over the previous model parameters and the
data set. The formulas for the components of $\bar{\Theta}$ are as
follows:
- $p(\bar{U}_i) = \sum_{k=1}^{n} \sum_{j=1}^{n_V} b_{kij} \Big/ \sum_{k=1}^{n} \sum_{h=1}^{n_U} \sum_{j=1}^{n_V} b_{khj}$.
- $\bar{\mu}_{U_i} = \sum_{k=1}^{n} \sum_{j=1}^{n_V} b_{kij}\, x_k \Big/ \sum_{k=1}^{n} \sum_{j=1}^{n_V} b_{kij}$.
- $\bar{\Sigma}_{U_i} = \sum_{k=1}^{n} \sum_{j=1}^{n_V} b_{kij} (x_k - \bar{\mu}_{U_i})(x_k - \bar{\mu}_{U_i})^T \Big/ \sum_{k=1}^{n} \sum_{j=1}^{n_V} b_{kij}$.
- $\bar{\mu}_{V_j} = \sum_{k=1}^{n} \sum_{i=1}^{n_U} b_{kij}\, y_k \Big/ \sum_{k=1}^{n} \sum_{i=1}^{n_U} b_{kij}$.
- $\bar{\Sigma}_{V_j} = \sum_{k=1}^{n} \sum_{i=1}^{n_U} b_{kij} (y_k - \bar{\mu}_{V_j})(y_k - \bar{\mu}_{V_j})^T \Big/ \sum_{k=1}^{n} \sum_{i=1}^{n_U} b_{kij}$.
- $p(\bar{V}_j \mid \bar{U}_i) = \sum_{k=1}^{n} b_{kij} \Big/ \sum_{k=1}^{n} \sum_{h=1}^{n_V} b_{kih}$.
Given the update rules described above, our EM
algorithm for learning the conditional anomaly detection
model is then as follows:

The Direct-CAD Algorithm
1) Choose an initial set of values for $\langle \mu_{V_j}, \Sigma_{V_j} \rangle$, $\langle \mu_{U_i}, \Sigma_{U_i} \rangle$, $p(V_j \mid U_i)$, and $p(U_i)$, for all $i$, $j$.
2) While the model continues to improve (measured by the improvement of $\Lambda$):
a) Compute $b_{kij}$ for all $k$, $i$, and $j$ as described under E-Step above.
b) Compute $\langle \bar{\mu}_{V_j}, \bar{\Sigma}_{V_j} \rangle$, $\langle \bar{\mu}_{U_i}, \bar{\Sigma}_{U_i} \rangle$, $p(\bar{V}_j \mid \bar{U}_i)$, and $p(\bar{U}_i)$ for all $i$, $j$ as described in Section 3.2.2 above.
c) Set $\langle \mu_{V_j}, \Sigma_{V_j} \rangle := \langle \bar{\mu}_{V_j}, \bar{\Sigma}_{V_j} \rangle$, $\langle \mu_{U_i}, \Sigma_{U_i} \rangle := \langle \bar{\mu}_{U_i}, \bar{\Sigma}_{U_i} \rangle$, $p(V_j \mid U_i) := p(\bar{V}_j \mid \bar{U}_i)$, and $p(U_i) := p(\bar{U}_i)$ for all $i$, $j$.
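A single iteration of this scheme, the E-Step computation of $b_{kij}$ followed by the M-Step updates, can be sketched in numpy as follows (our simplified illustration: the covariance updates are omitted for brevity, and the toy data and initialization are our choices):

```python
import numpy as np

def f_gauss(V, mu, cov):
    """Row-wise multivariate normal density over an (n, d) array."""
    d = len(mu)
    diff = V - mu
    inv = np.linalg.inv(cov)
    quad = np.einsum('kd,de,ke->k', diff, inv, diff)
    return np.exp(-0.5 * quad) / np.sqrt((2 * np.pi) ** d * np.linalg.det(cov))

def em_iteration(X, Y, p_U, mu_U, cov_U, mu_V, cov_V, p_VU):
    nU, nV = len(p_U), len(mu_V)
    fx = np.stack([f_gauss(X, mu_U[i], cov_U[i]) for i in range(nU)], axis=1)  # (n, nU)
    fy = np.stack([f_gauss(Y, mu_V[j], cov_V[j]) for j in range(nV)], axis=1)  # (n, nV)
    # E-Step: b_kij proportional to f_G(x_k|U_i) p(U_i) f_G(y_k|V_j) p(V_j|U_i)
    b = fx[:, :, None] * p_U[None, :, None] * fy[:, None, :] * p_VU[None, :, :]
    b /= b.sum(axis=(1, 2), keepdims=True)
    # M-Step updates (means and discrete probabilities; covariances omitted)
    p_U_new = b.sum(axis=(0, 2)) / b.sum()
    w_i = b.sum(axis=2)                                   # (n, nU)
    mu_U_new = (w_i.T @ X) / w_i.sum(axis=0)[:, None]
    w_j = b.sum(axis=1)                                   # (n, nV)
    mu_V_new = (w_j.T @ Y) / w_j.sum(axis=0)[:, None]
    p_VU_new = b.sum(axis=0) / b.sum(axis=(0, 2))[:, None]
    return p_U_new, mu_U_new, mu_V_new, p_VU_new

# Tiny synthetic check: two aligned environmental and indicator clusters
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 1, (100, 1)), rng.normal(10, 1, (100, 1))])
Y = np.vstack([rng.normal(0, 1, (100, 1)), rng.normal(10, 1, (100, 1))])
p_U = np.array([0.5, 0.5])
mu_U = np.array([[1.0], [9.0]]); cov_U = np.array([np.eye(1), np.eye(1)])
mu_V = np.array([[1.0], [9.0]]); cov_V = np.array([np.eye(1), np.eye(1)])
p_VU = np.full((2, 2), 0.5)

p_U_new, mu_U_new, mu_V_new, p_VU_new = em_iteration(X, Y, p_U, mu_U, cov_U, mu_V, cov_V, p_VU)
assert np.allclose(p_VU_new.sum(axis=1), 1.0)  # each row is a distribution
```

After one iteration, the environmental means move toward the true cluster centers and the mapping function concentrates on the aligned clusters, exactly as the update rules intend.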
The time and space requirements of this algorithm can be
derived as follows: To compute $b_{kij}$ in the E-Step, we first
compute $f_G(x_k \mid \theta_{U_i})$ and $f_G(y_k \mid \theta_{V_j})$, which takes $O(n n_U d_U^2)$
time and $O(n n_V d_V^2)$ time, respectively. Then, it takes
$O(n n_U n_V)$ time to compute all the $b_{kij}$ values. In the M-Step,
we update the value for each component of $\bar{\Theta}$. It takes
$O(n n_U n_V d_U^2)$ time to update $\bar{\Sigma}_{U_i}$, and $O(n n_U n_V d_V^2)$ time to
update $\bar{\Sigma}_{V_j}$. To update the rest of the components, the time
complexity is $O(n n_U n_V (d_U + d_V + n_V))$. Adding them together, the time complexity for one iteration (E-Step and
M-Step) is $O(n n_U n_V (d_U^2 + d_V^2))$. The total time complexity of
the conditional anomaly detection algorithm depends on
the convergence speed, i.e., the number of iterations, which
depends on the data set itself.

Accordingly, in the E-Step, storing the results for
$f_G(x_k \mid \theta_{U_i})$, $f_G(y_k \mid \theta_{V_j})$, and $b_{kij}$ requires $O(n n_U n_V)$ space.
Storing the updated components of $\bar{\Theta}$ requires $O(n_U d_U^2 + n_V d_V^2)$ space.
3.3 The GMM-CAD-Full Algorithm
Unfortunately, the algorithm described in the previous
subsection has a major potential drawback. The EM approach
is known to be somewhat sensitive to converging to locally
optimal solutions that are far from the globally optimal
solution, particularly when it is used in conjunction with very
complex optimization problems. While standard methods
such as multiple runs with randomized seeds may help [15],
the problem remains that the CAD problem is certainly
complex as MLE problems go: the task is to simultaneously learn two sets of Gaussians as well as a mapping
function, and the problem is complicated even more by the
fact that the function $f_{CAD}$ is only conditionally dependent
upon the set of Gaussians corresponding to the distribution
of the environmental attributes.
As a result, the second learning algorithm that we
consider in this paper makes use of a two-step process. We
begin by assuming that the number of Gaussians in $U$ and $V$
is identical; that is, $n_U = n_V$. Given this, a set of Gaussians is
learned in the combined environmental and indicator
attribute space. These Gaussians are then projected onto
the environmental attributes and indicator attributes to
obtain $U$ and $V$. Then, only after $U$ and $V$ have been fixed, a
simplified version of the Direct-CAD algorithm is used to
compute only the values $p(V_j \mid U_i)$ that make up the mapping
function. The benefit of this approach is that, by breaking
the problem into two much easier optimization tasks, each
solved by much simpler EM algorithms, we may be less
vulnerable to the problem of local optima. The drawback of
this approach is that we are no longer trying to maximize
$f_{CAD}$ directly; we have made the simplifying assumption
that it suffices to first learn the Gaussians and then learn the
mapping function.
The GMM-CAD-Full Algorithm
1) Learn a set of $n_U$ $(d_U + d_V)$-dimensional Gaussians over the data set $\{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$. Call this set $Z$.
2) Let $\mu_{Z_i}$ refer to the centroid of the $i$th Gaussian in $Z$, and let $\Sigma_{Z_i}$ refer to the covariance matrix of the $i$th Gaussian in $Z$. We then determine $U$ as follows:
For $i = 1$ to $n_U$ do:
  $p(U_i) := p(Z_i)$
  For $j = 1$ to $d_U$ do:
    $\mu_{U_i}[j] := \mu_{Z_i}[j]$
    For $k = 1$ to $d_U$ do:
      $\Sigma_{U_i}[j][k] := \Sigma_{Z_i}[j][k]$
3) Next, determine $V$ as follows:
For $i = 1$ to $n_V$ do:
  For $j = 1$ to $d_V$ do:
    $\mu_{V_i}[j] := \mu_{Z_i}[j + d_U]$
    For $k = 1$ to $d_V$ do:
      $\Sigma_{V_i}[j][k] := \Sigma_{Z_i}[j + d_U][k + d_U]$
4) Run a second EM algorithm to learn the mapping function. While the model continues to improve (measured by the improvement of $\Lambda$):
a) Compute $b_{kij}$ for all $k$, $i$, and $j$ as in the previous section.
b) Compute $p(\bar{V}_j \mid \bar{U}_i)$ as described in the previous section.
c) Set $p(V_j \mid U_i) := p(\bar{V}_j \mid \bar{U}_i)$.
The time required for the GMM-CAD-Full algorithm
consists of two parts: the time to learn the Gaussians, which
is $O(n n_U (d_U + d_V)^2)$, and the time to learn the mapping
function by the simplified Direct-CAD algorithm, which is
$O(n n_U^3)$. The memory required for the GMM-CAD-Full
algorithm is $O(n n_U + n_U (d_U + d_V)^2)$.
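Steps 2 and 3 of the algorithm are simply block extraction from each joint mean and covariance; in numpy, with toy dimensions and parameter values of our choosing, the projection is a matter of slicing:

```python
import numpy as np

d_U, d_V = 2, 3  # environmental and indicator dimensionalities (example values)

# One joint (d_U + d_V)-dimensional Gaussian learned over (x, y) pairs
mu_Z = np.arange(5.0)        # toy centroid
cov_Z = np.eye(5) + 0.1      # toy covariance (symmetric positive definite)

# Project onto the environmental attributes to get U_i ...
mu_U = mu_Z[:d_U]
cov_U = cov_Z[:d_U, :d_U]

# ... and onto the indicator attributes to get V_i
mu_V = mu_Z[d_U:]
cov_V = cov_Z[d_U:, d_U:]

assert mu_U.shape == (d_U,) and cov_U.shape == (d_U, d_U)
assert mu_V.shape == (d_V,) and cov_V.shape == (d_V, d_V)
# The cross-covariance blocks cov_Z[:d_U, d_U:] and cov_Z[d_U:, :d_U] are discarded
```

The discarded cross-covariance blocks are exactly the information that motivates the GMM-CAD-Split variant in the next subsection.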
3.4 The GMM-CAD-Split Algorithm
There is a third obvious way to learn the required model
that is closely related to the previous one. Since the
previous algorithm breaks up the optimization problem
into two smaller problems, there is no longer any reason to
think that the GMMs for $U$ and $V$ should be learned
simultaneously. Learning them simultaneously might
actually be overly restrictive, since it will try to learn $U$ and $V$ so that every Gaussian in
$U$ is "mated" to a Gaussian in $V$ through a covariance
structure. The CAD model has no such requirement, since
the Gaussians of $U$ and $V$ are related only through the
mapping function. In fact, the algorithm from the previous
subsection throws out all of this additional covariance
information, since $\Sigma_{Z_i}[j][k]$ is never used for $(j \le d_U, k > d_U)$
or $(j > d_U, k \le d_U)$.
As a result, the third algorithm we consider learns U and
V directly as two separate GMMs, and then learns the
mapping function between them. The algorithm is outlined
below:
The GMM-CAD-Split Algorithm
1) Learn U and V by performing two separate EM
optimizations.
2) Run Step 4 from the GMM-CAD-Full Algorithm to learn
the mapping function.
In the GMM-CAD-Split algorithm, learning the Gaussians in U and V needs O(n n_U (d_U^2 + d_V^2)) time. Learning the mapping function needs O(n n_U^3) time. The memory requirement for the algorithm is O(n n_U + n_U (d_U^2 + d_V^2)).
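Step 4, shared by both GMM-CAD algorithms, alternates between computing the memberships b_kij and refreshing the mapping matrix. The sketch below shows one such iteration, assuming b_kij is the normalized product f_G(x_k|U_i) p(U_i) f_G(y_k|V_j) p(V_j|U_i) and that p̄(V_j|U_i) is the row-normalized accumulation of the b_kij; this is our reading, as the exact update rule appears in a section not reproduced here:

```python
import numpy as np

def gaussian_pdf(x, mean, cov):
    """Multivariate normal density f_G(x | mean, cov)."""
    d = len(mean)
    diff = x - mean
    norm = np.sqrt(((2 * np.pi) ** d) * np.linalg.det(cov))
    return float(np.exp(-0.5 * diff @ np.linalg.inv(cov) @ diff) / norm)

def em_mapping_step(X, Y, U, V, p_v_given_u):
    """One iteration of the mapping-function EM: U is a list of
    (weight, mean, cov) triples, V a list of (mean, cov) pairs, and
    p_v_given_u an (n_U x n_V) matrix. Returns the refreshed matrix."""
    n_u, n_v = len(U), len(V)
    new_p = np.zeros((n_u, n_v))
    for x, y in zip(X, Y):
        b = np.zeros((n_u, n_v))
        for i, (w_i, mu_i, cov_i) in enumerate(U):
            fx = w_i * gaussian_pdf(x, mu_i, cov_i)
            for j, (mu_j, cov_j) in enumerate(V):
                b[i, j] = fx * gaussian_pdf(y, mu_j, cov_j) * p_v_given_u[i, j]
        new_p += b / b.sum()           # b / b.sum() is exactly b_kij
    # renormalize each row so that sum_j p(V_j | U_i) = 1
    return new_p / new_p.sum(axis=1, keepdims=True)
```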
3.5 Choosing the Complexity of the Model
Whatever learning algorithm is used, it is necessary to
choose as an input parameter the number of clusters or
Gaussians in the model. In general, it is difficult to choose
the optimal value of the number of Gaussians in a GMM, and
this is a research problem on its own [16]. However, our
problem is a bit easier because we are not actually trying to
cluster the data for the sake of producing a clustering; rather,
we are simply trying to build an accurate model for the data.
In our experiments, we generally found that a larger number
of clusters in the CAD model results in more accuracy. It is
true that a very large number of clusters could cause
overfitting, but in the CAD model, the computational cost
associated with any model large enough to cause overfitting
makes it virtually impossible to overfit a data set of
reasonable size. Thus, our recommendation is to choose
the largest model that can be comfortably computed. In our
implementation, we use 40 clusters for both U and V .
4 BENCHMARKING
In this section, we describe a set of experiments aimed at
testing the effectiveness of the CAD model. Our experiments
are aimed at answering the following three questions:
1. Which of the three learning algorithms should be used to fit the CAD model to a data set?
2. Can use of the CAD model reduce the incidence of false positives due to unusual values for environmental attributes, compared to obvious alternatives for use in an anomaly detection system?
3. If so, does the reduced incidence of false positives come at the expense of a lower detection level for actual anomalies?
4.1 Experimental Setup
Unfortunately, as with any unsupervised learning task, it
can be very difficult to answer questions regarding the
quality of the resulting models. Though the quality of the
models is hard to measure, we assert that, with a careful
experimental design, it is possible to give a strong
indication of the answer to our questions. Our experiments
are based on the following key assumptions:

- Most of the points in a given data set are not anomalous and, so, a random sample of a data set's points should contain relatively few anomalies.
- If we take a random sample of a data set's points and perturb them by swapping various attribute values among them, then the resulting set of data points should contain a much higher fraction of anomalies than a simple random sample from the data set.

Given these assumptions, our experimental design is as follows: For each data set, the following protocol was repeated 10 times:

1. For a given data set, we begin by randomly designating 80 percent of the data points as training data and 20 percent of the data points as test data (this latter set will be referred to as testData).
2. Next, 20 percent of the anomalies or outliers in the set testData are identified using a standard, parametric test based on Gaussian mixture modeling: the data are modeled using a GMM, and new data points are ranked as potential outliers based on the probability density at the point in question. This basic method is embodied in more than 12 papers cited in Markou and Singh's survey [1]. However, a key aspect of this protocol is that outliers are chosen based only on the values of their environmental attributes (indicator attributes are ignored). We call this set outliers.
3. The set testData is then itself randomly partitioned into two identically sized subsets, perturbed and nonPerturbed, subject to the constraint that outliers ⊆ nonPerturbed.
4. Next, the members of the set perturbed are perturbed by swapping indicator attribute values among them.
5. Finally, we use the anomaly detection software to check for anomalies among testData = perturbed ∪ nonPerturbed.
Note that, after this protocol has been executed, members
of the set perturbed are no longer samples from the original
data distribution, and members of the set nonP erturbed are
still samples from the original data distribution. Thus, it
should be relatively straightforward to use the resulting data
sets to differentiate between a robust anomaly detection
framework and one that is susceptible to high rates of false
positives. Specifically, given this experimental setup, we
expect that a useful anomaly detection mechanism would:

- Indicate that a relatively large fraction of the set perturbed are anomalies.
- Indicate that a much smaller fraction of the set nonPerturbed are anomalies.
- Indicate that a similarly small fraction of the set outliers are anomalies.
This last point bears some further discussion. Since these records are also samples from the original data distribution, the only difference between outliers and (nonPerturbed - outliers) is that the records in outliers have exceptional values for their environmental attributes. Given the definition of an environmental attribute, these should not be considered anomalous by a useful anomaly detection methodology. In fact, the ideal test result would show that the percentage of anomalies among outliers is identical to the percentage of anomalies among nonPerturbed. A high percentage of anomalies among the set outliers would be indicative of a tendency toward false alarms in the case of anomalous (but uninteresting) values for environmental attributes.
4.2 Creating the Set Perturbed
The above testing protocol requires that we be able to create a set of points perturbed with a high expected percentage of anomalies. An easy option would have been to generate perturbed data by adding noise. However, this can produce meaningless (or less meaningful) attribute values. We wanted to ensure that the perturbed data are structurally similar to the original distribution and domain values, so that members of this set cannot be recognized trivially: they share the characteristics of the training data (that is, the environmental attributes are realistic and the indicator attributes are realistic, but the relationship between them may not be). We achieve this through a natural perturbation scheme:
Let D be the set of data records that has been chosen from testData for perturbation. Consider a record z = (x, y) from D that we wish to perturb. Recall from Section 2.2 that x is a vector of environmental attribute values and y is a vector of indicator attribute values. To perturb z, we randomly choose k data points from D; in our tests, we use k = min(50, |D|/4). Let z' = (x', y') be the sampled record such that the Euclidean distance between y and y' is maximized over all k of our sampled data points. To perturb z, we simply create a new data point (x, y') and add this point to perturbed. This process is iteratively applied until all points in the set D have been used to produce a point in perturbed.
Note that every set of environmental attribute values
present in D is also present in perturbed after perturbation.
Furthermore, every set of indicator attributes present in
perturbed after perturbation is also present in D. What has
been changed is only the relationship between the environmental and indicator attributes in the records of perturbed.
We note that swapping the indicator values may not
always produce the desired anomalies (it is possible that
some of the swaps may result in perfectly typical data).
However, swapping the values should result in a new data
set that has a larger fraction of anomalies as compared to
sampling the data set from its original distribution, simply
because the new data has not been sampled from the
original distribution. Since some of the perturbed points
may not be abnormal, even a perfect method should deem a
small fraction of the perturbed set to be normal. Effectively,
this will create a nonzero false negative rate, even for a
“perfect” classifier. Despite this, the experiment is still a
valid one because each of the methods that we test must
face this same handicap.
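The perturbation scheme of the previous paragraphs can be sketched in a few lines (the list-of-tuples layout and function name are ours):

```python
import random
import numpy as np

def perturb_set(D, k=None, rng=random):
    """For each record z = (x, y) in D, sample k records from D, pick the
    one z' = (x', y') whose indicator vector y' is farthest from y in
    Euclidean distance, and emit the new record (x, y')."""
    if k is None:
        k = min(50, len(D) // 4)   # k = min(50, |D|/4), as in the tests
    out = []
    for x, y in D:
        sample = rng.sample(D, k)
        _, y_far = max(
            sample,
            key=lambda r: np.linalg.norm(np.asarray(r[1]) - np.asarray(y)))
        out.append((x, y_far))
    return out
```

Note that, as the text observes, every environmental vector in D survives unchanged, and every indicator vector in the output already occurred somewhere in D; only the pairing between the two is disturbed.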
4.3 Data Sets Tested
We performed the tests outlined above over the following
13 data sets.1 For each data set, the annotation (i × j) indicates that the data set had i environmental and j indicator attributes, which were chosen after a careful
evaluation of the data semantics. A brief description of the
environmental and indicator attributes for each data set is
also given. The 13 data sets used in testing are:
1. Synthetic data set (50 × 50). The synthetic data set is created using the CAD model. Ten thousand data points were sampled from the generative model f. f has 10 Gaussians in U and V, respectively. The centroids of the Gaussians are random vectors whose entries are chosen on the interval (0.0, 1.0). The diagonal of the covariance matrix is set to 1/4 of the average distance between the centroids in each dimension. The weight of the Gaussians is evenly assigned, and the mapping function p(V_j | U_i) for a certain value of i is a permutation of a geometric distribution.
2. Algae data set (11 × 6). The data is from the UCI KDD Archive. We removed the data records with missing values. The environmental attributes consist of the season, the river size, the fluid velocity, and some chemical concentrations. The indicator attributes are the distributions of different kinds of algae in surface water.

Footnote 1. We have conducted an experiment on a prelabeled data set as suggested. Refer to Appendix A for the results.
3. Streamflow data set (205 × 100). The data set depicts
the river flow levels (indicator) and precipitation
and temperature (environmental) for California. We
create this data set by concatenating the environmental and indicator data sets. The environmental
data set is obtained from National Climate Data
Center (NCDC). The indicator data set is obtained
from US Geological Survey (USGS).
4. ElNino data set (4 × 5). The data is from the UCI
KDD Archive. It contains oceanographic and surface
meteorological readings taken from a series of buoys
positioned throughout the equatorial Pacific. The
environmental attributes are the spatio-temporal
information. The indicator attributes are the wind
direction, relative humidity, and temperature at
various locations in the Pacific Ocean.
5. Physics data set (669 × 70). The data is from KDD
Cup 2003. For a large set of physics papers, we pick
out frequency information for keywords (environmental) and a list of the most referred articles
(indicator). One data record, which represents one
physics paper, has 0/1-valued attributes, corresponding to the appearance/absence of the keyword
(or referred-to article), respectively.
6. Bodyfat data set (13 × 2). The data is from the CMU
statlib. It depicts body fat percentages (indicator)
and physical characteristics (environmental) for a
group of people.
7. Houses data set (8 × 1). The data is from the CMU statlib. It depicts house prices (indicator) in California and related characteristics (environmental), such as median income, housing median age, total rooms, etc.
8. Boston data set (15 × 1). The data is from the CMU
statlib. We removed some attributes that are not
related to the environmental-indicator relationship.
The data depicts the house value of owner-occupied
homes (indicator) and economic data (environmental) for Boston.
9. FCAT-math data set (14 × 12). The data is from the
National Center for Education Statistics (NCES) and
the Florida Information Resource Network (FIRN).
We removed the data records with missing values. It
depicts mathematics achievement test results (indicator) of grade 3 of year 2002 and regular elementary
schools’ characteristics (environmental) for Florida.
10. FCAT-reading data set (14 × 11). The data is also
from NCES and FIRN. We processed the data
similarly as with the FCAT-math data set. It depicts
the reading achievement test scores (indicator) and
regular elementary schools’ characteristics (environmental) for Florida.
11. FLFarms data set (114 × 52). The data set depicts the
Florida state farms’ market value in various aspects
(indicator) and the farms’ operational and products
statistics (environmental).
12. CAFarms data set (115 × 51). The data set depicts the California state farms' market value in various aspects (indicator) and the farms' operational and products statistics (environmental).
13. CAPeaks data set (2 × 1). The data set depicts the California peaks' height (indicator) and longitude and latitude position (environmental).

TABLE 1
Head-to-Head Winner When Comparing Recall/Precision for Identifying Perturbed Data as Anomalous (Each Cell Has Winner, Wins Out of 13 Data Sets, Bootstrapped p-Value)

TABLE 2
Head-to-Head Winner When Comparing Recall/Precision for Identifying Nonperturbed Outlier Data as Nonanomalous (Each Cell Has Winner, Wins Out of 13 Data Sets, Bootstrapped p-Value)
4.4 Experimental Results
We performed the tests described above over six alternatives for anomaly detection:
1. Simple Gaussian mixture modeling (as described at the beginning of this section). In this method, for each data set, a GMM model is learned directly over the training data. Next, the set testData is processed, and anomalies are determined based on the value of the function f_GMM for each record in testData; those with the smallest f_GMM values are considered anomalous.
2. kth-NN (kth nearest neighbor) outlier detection [9] (with k = 5). kth-NN outlier detection is one of the most well-known methods for outlier detection. To make it applicable to the training/testing framework, the method must be modified slightly, since each test data point must be scored to describe the extent to which it deviates from the training set. Given a training data set, to determine the degree to which a test point is anomalous, we simply use the distance from the point to its fifth nearest neighbor in the training set. A larger distance indicates a more anomalous point.
3. Local outlier factor (LOF) anomaly detection [8]. LOF must be modified in a manner similar to kth-NN outlier detection: The LOF of each test point is computed with respect to the training data set in order to score the point.
4. Conditional anomaly detection, with the model constructed using the Direct-CAD Algorithm.
5. Conditional anomaly detection, with the model constructed using the GMM-CAD-Full Algorithm.
6. Conditional anomaly detection, with the model constructed using the GMM-CAD-Split Algorithm.
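The modified kth-NN scoring described in item 2 reduces to a few lines (a sketch; the function name is ours):

```python
import numpy as np

def kth_nn_score(test_point, train, k=5):
    """Anomaly score of a test point: the distance to its kth nearest
    neighbor in the training set. A larger distance indicates a more
    anomalous point."""
    dists = np.linalg.norm(train - test_point, axis=1)
    return float(np.sort(dists)[k - 1])
```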
For each experiment over each data set with the GMM/
CAD methods, a data record was considered an anomaly if
the probability density at the data point was less than the
median probability density at all of the test data points. For
5th-NN/LOF, the record was considered an anomaly if its
distance was greater than the median distance over all test
data points. The median was used because exactly one-half
of the points in the set testData were perturbed. Thus, if the
method was able to order the data points so that all
anomalies were before all nonanomalies, choosing the
median density should result in a recall2 rate of 100 percent
and a false positive rate of 0 percent (a false positive rate of
0 percent is equivalent to a precision3 of 100 percent). Also,
note that, since exactly one-half of the points are perturbed
and each method is asked to guess which half are
perturbed, it is always the case that the recall is exactly
equal to the precision when identifying either anomalous or
nonanomalous records in this experiment.
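The median-threshold decision rule can be sketched as follows; as the text notes, when exactly half of the test points are perturbed and exactly half are flagged, recall and precision necessarily coincide (both helper names are ours):

```python
import numpy as np

def median_threshold_flags(scores):
    """Flag as anomalous every test record whose anomaly score exceeds
    the median score (for GMM/CAD the score would be the negated density;
    for 5th-NN and LOF, the distance or LOF value itself)."""
    return np.asarray(scores) > np.median(scores)

def recall_and_precision(flags, is_perturbed):
    """Recall: fraction of perturbed points flagged as anomalies.
    Precision: fraction of flagged anomalies that are perturbed."""
    flags = np.asarray(flags)
    truth = np.asarray(is_perturbed)
    hits = (flags & truth).sum()
    return hits / truth.sum(), hits / flags.sum()
```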
The results are summarized in Tables 1 and 2. Table 1
summarizes and compares the quality of the recall/
precision obtained when identifying perturbed points in
testData. Each cell indicates which of the two methods
compared in the cell was found to be the “winner” in terms
of having superior recall/precision over the 13 sets. To
compute the “winner,” we do the following: For each of the
13 sets and each pair of detection methods, the recall/
precision of the two methods is averaged over the
10 different training/testing partitionings of the data set.
Let method1 and method2 be the two anomaly detection
methods considered in a cell. method1 is said to “win” the
test if it has a higher average recall/precision in seven or
more of the 13 data sets. Likewise, method2 is declared the
winner if it has a higher recall/precision in seven or more of
the data sets. In addition, we also give the number of the
13 data sets for which the winning method performed
better, and we give a p-value that indicates the significance
of the results.
2. Recall in this context is the fraction of perturbed points that are flagged
as anomalies.
3. Precision in this context is the fraction of anomalies that are actually
perturbed points.
Finally, the second column in Table 1 gives the average
recall/precision over all experiments for each method. This
is the average percentage of perturbed points that have
probability density less than the median (for GMM and
CAD) or have a score/distance greater than the median
(LOF and 5th-NN). Though these numbers are informative,
we caution that the number of head-to-head wins is
probably a better way to compare two methods. The reason
for this is that the recall/precision can vary significantly
from data set to data set and, thus, the average may tend to
overweight those data sets that are particularly difficult.
This p-value addresses the following question: Imagine
that the whole experiment was repeated by choosing a new,
arbitrary set of 13 data sets at random and performing
10 different training/testing partitionings of each data set.
The p-value gives the probability that we would obtain a
different winner than was observed in our experiments due
to choosing different data sets and partitionings. A low
p-value means that it is unlikely that the observed winner
was due mostly to chance and, so, the results are significant.
A high p-value means that, if the experiment was repeated
again with a new set of data sets, the result may be
different. The p-value was computed using a bootstrap resampling procedure [17].
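The paper does not spell out the resampling details, but one plausible reading of this bootstrapped p-value [17] is to resample the 13 per-data-set win indicators with replacement and count how often the majority winner flips (function name and parameters are ours):

```python
import numpy as np

def bootstrap_winner_pvalue(wins, n_boot=10_000, seed=1):
    """wins: a 0/1 vector over the 13 data sets (1 = method1 had the
    higher average recall/precision). Resample the data sets with
    replacement and estimate how often the resampled majority winner
    differs from the observed one."""
    rng = np.random.default_rng(seed)
    wins = np.asarray(wins)
    observed = wins.mean() > 0.5            # observed majority winner
    resamples = rng.choice(wins, size=(n_boot, len(wins)), replace=True)
    flipped = (resamples.mean(axis=1) > 0.5) != observed
    return float(flipped.mean())
```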
Table 2 is similar to Table 1, but rather than comparing
the recall/precision with which the methods can identify
the perturbed points, it summarizes the recall/precision in identifying points in the set outliers as nonanomalous.
Recall that the outliers are points with exceptional values
for their environmental attributes, but which are nonperturbed, and, so, they are actually obtained from the
training distribution. Thus, this table gives an indication of
how successful the various methods are in ignoring
anomalous environmental attributes when the points were,
in fact, sampled from the baseline (or training) distribution. Just as in Table 1, each cell gives the winner, the
number of trials won, and the associated p-value of the
experiment. The second column gives the average precision/recall. This is the percentage of nonperturbed points
with exceptional environmental attributes that were considered to be nonanomalous.
4.5 Discussion
Our experiments were designed to answer the following
question: If a human expert wishes to identify a set of
environmental attributes that should not be directly
indicative of an anomaly, can any of the methods successfully ignore exceptional values for those attributes while at
the same time taking them into account when performing
anomaly detection? The first part of the question is
addressed by the results given in Table 2. This table shows
that, if the goal is to ignore anomalous values among the
environmental attributes, then the CAD-GMM-Full algorithm is likely the best choice. This method declared the
lowest percentage of data points with exceptional environmental attribute values to be anomalous. In every head-to-head comparison, it did a better job for at least 77 percent of
the data sets (or 10 out of 13).
The second part of the question is addressed by the
results given in Table 1. This table shows that, if the goal is
to not only ignore exceptional environmental attributes, but
also to take the normal correlation between environmental
attributes and the indicator attributes into account, then the
CAD-GMM-Full algorithm is again the best choice. Table 1
shows that, compared to all of the other options, this
method was best able to spot cases where the indicator
attributes were not in keeping with the environmental
attributes because they had been swapped.
Finally, it is also important to point out that no method
was uniformly preferable to any other method for each and
every data set tested. It is probably unrealistic to expect that
any method would be uniformly preferable. For example, we
found that CAD-based methods generally did better
compared to the two distance-based methods (and the
GMM method) on higher-dimensional data. The results
were most striking for the thousand-dimensional physics
paper data set, where the precision/recall for LOF and
5th-NN when identifying perturbed data was worse than a
randomized labeling would have done (less than 50 percent). In this case, each of the CAD methods had greater
than 90 percent precision/recall. For lower-dimensional
data, the results were more mixed. Out of all of the head-to-head comparisons reported in Table 1 and Table 2, there is
only one case where any method won 13 out of 13 times
(CAD-GMM-Full versus LOF in terms of the ability to
ignore anomalous environmental attributes). Thus, our
results do not imply that any method will do better than
any other method on an arbitrary data set. What the results
do imply is that, if the goal is to choose the method that is
most likely to do better on an arbitrary data set, then the
CAD-GMM-Full algorithm is the best choice.
5 RELATED WORK
Anomaly and outlier detection have been widely studied,
and the breadth and depth of the existing research in the
area precludes a detailed summary here. An excellent
survey of so-called “statistical” approaches to anomaly
detection can be found in Markou and Singh’s 2003 survey
[1]. In this context, “statistical” refers to methods which try
to identify the underlying generative distribution for the
data, and then identify points that do not seem to belong to
that distribution. Markou and Singh give references to
around a dozen papers that, like the CAD methodology,
rely in some capacity on Gaussian Mixture Modeling to
represent the underlying data distribution. However, unlike
the CAD methodology, nearly all of those papers implicitly
assume that all attributes should be treated equally during
the detection process.
The specific application area of intrusion detection is the
topic of literally hundreds of papers that are related to
anomaly detection. Unfortunately, space precludes a detailed survey of methods for intrusion detection here. One
key difference between much of this work and our own
proposal is the specificity: Most methods for intrusion
detection are specifically tailored to that problem and are
not targeted toward the generic problem of anomaly
detection, as the CAD methodology is. An interesting
2003 study by Lazarevic et al. [18] considers the application
of generic anomaly detection methods to the specific
problem of intrusion detection. The authors test several
methods and conclude that LOF [8] (tested in this paper)
does a particularly good job in that domain.
While most general-purpose methods assume that all
attributes are equally important, there are some exceptions
to this. Aggarwal and Yu [19] describe a method for
identifying outliers in subspaces of the entire data space.
Their method identifies combinations of attributes where
outliers are particularly obvious, the idea being that a large
number of dimensions can obscure outliers by introducing
noise (the so-called “curse of dimensionality” [20]). By
considering subspaces, this effect can be mitigated. However, Aggarwal and Yu make no a priori distinctions among
the types of attribute values using domain knowledge, as is
done in the CAD methodology.
One body of work on anomaly detection originating in
statistics that does make an a priori distinction among
different attribute types in much the same way as the
CAD methodology is so-called spatial scan statistics [21],
[22], [23]. In this work, data are aggregated based upon
their spatial location, and then the spatial distribution is
scanned for sparse or dense areas. Spatial attributes are
treated in a fashion analogous to environmental attributes
in the CAD methodology, in the sense that they are not
taken as direct evidence of an anomaly; rather, the
expected sparsity or density is conditioned on the spatial
attributes. However, such statistics are limited in that
they cannot easily be extended to multiple indicator
attributes (the indicator attribute is assumed to be a
single count), nor are they meant to handle the sparsity
and computational complexity that accompany large
numbers of environmental attributes.
The existing work that is probably closest in spirit to our own is the paper on anomaly detection for finding
disease outbreaks by Wong et al. [24]. Their work makes use
of a Bayes Net as a baseline to describe possible environmental structures for a large number of input variables.
However, because their method relies on a Bayes Net, there
are certain limitations inherent to this approach. The most
significant limitation of relying on a Bayes Net is that
continuous phenomena are hard to describe in this framework. Among other issues, this makes it very difficult to
describe the temporal patterns such as time lags and
periodicity. Furthermore, a Bayes Net implicitly assumes
that even complex phenomena can be described in terms of
interaction between a few environmental attributes, and
that there are no hidden variables unseen in the data, which
is not true in real situations where many hidden attributes
and missing values are commonplace. A Bayes Net cannot learn two different explanations for an observed phenomenon and recognize both, because it cannot represent unobserved variables. Such hidden variables are precisely
the type of information captured by the use of multiple
Gaussian clusters in our model; each Gaussian corresponds
to a different hidden state.
Finally, we mention that traditional prediction models,
such as linear or nonlinear models of regression [25], [26],
can also be potentially used for deriving the cause-effect
interrelationships that the CAD method uses. For example,
for a given set of environmental attribute values, a trained
regression model will predict the indicator attribute value
and, by comparing it with the actual indicator attribute
value, we can determine if the test point is anomalous or
not. However, there are important limitations on such
regression methods as compared to the CAD model. First,
in regression, there is typically one response variable
(indicator variable) and multiple predictor variables
(environmental variables), while the CAD method allows
an arbitrary number of indicator attributes. Second, it is
unclear how a regression-based methodology could be
used to rank the interest of data points based on their indicator attributes. Distance from the predicted value could be used, but this seems somewhat arbitrary, since the natural variation may change depending on indicator attribute values.

Fig. 3. Precision rate on the KDD Cup 1999 data set by 10-fold cross-validation. The minimum and maximum precision rates over 10 runs are depicted as circles, and the overall precision is depicted as a star.
6 CONCLUSIONS AND FUTURE WORK
In this paper, we have described a new approach to
anomaly detection. We take a rigorous, statistical approach
where the various data attributes are partitioned into
environmental and indicator attributes, and a new observation is considered anomalous if its indicator attributes
cannot be explained in the context of the environmental
attributes. There are many avenues for future work. For
example, we have considered only one way to take into
account a very simple type of domain knowledge to boost
the accuracy of anomaly detection. How might more
complex domain knowledge boost accuracy even more?
For another example, an important issue that we have not
addressed is scalability during the construction of the
model. While the model can always quickly and efficiently
be used to check for new anomalies, training the model on
a multigigabyte database may be slow without a modification of our algorithms. One way to address this may be to
use a variation of the techniques of Bradley et al. [27] for
scaling standard, Gaussian EM to clustering of very large
databases.
APPENDIX A
EXPERIMENT ON THE KDD CUP 1999 DATA SET
One of the anonymous referees for this paper suggested
that we apply our algorithms to the KDD Cup 1999 data set.
Unlike the 13 data sets described in the paper, this data set
is labeled and, so, we know if a particular test record is
actually anomalous. After partitioning into causal and
result attributes, we used a 10-fold cross-validation test to
compare the precision of the different detection methods on
this data set; error bars showing the low and high accuracy
of each method as well as the mean are provided in Fig. 3.
The results show that this particular task is rather easy and,
so, it is difficult to draw any additional conclusions from
the results. All of the methods except for LOF perform very
well, with almost perfect overlap among the error bars.
APPENDIX B
EM DERIVATIONS

B.1 MLE Goal
The goal is to maximize the objective function:

$\Lambda = \log \prod_{k=1}^{n} f_{CAD}(y_k|x_k) = \sum_{k=1}^{n} \log \Big\{ \sum_{i=1}^{n_U} p(x_k \in U_i) \sum_{j=1}^{n_V} f_G(y_k|V_j)\, p(V_j|U_i) \Big\}$, (1)

where

$f_{CAD}(y_k|x_k) = \sum_{i=1}^{n_U} p(x_k \in U_i) \sum_{j=1}^{n_V} f_G(y_k|V_j)\, p(V_j|U_i)$. (2)

B.2 Expectation Step
The expectation of the objective function is

$Q(\Theta, \Theta^g) = E[\Lambda(X, Y, \Gamma, \Delta|\Theta)\,|\,X, Y, \Theta^g] = E\Big[\sum_{k=1}^{n} \log f_{CAD}(y_k, \gamma_k, \delta_k|x_k, \Theta)\,\Big|\,X, Y, \Theta^g\Big]$
$\qquad = \sum_{k=1}^{n} \sum_{i=1}^{n_U} \sum_{j=1}^{n_V} \big\{ \log f_{CAD}(y_k, \gamma_k^{(i)}, \delta_k^{(j)}|x_k, \Theta)\; f(\gamma_k^{(i)}, \delta_k^{(j)}|x_k, y_k, \Theta^g) \big\}$. (3)

Now, we derive the two subexpressions in (3). For the first one,

$f_{CAD}(y_k, \gamma_k^{(i)}, \delta_k^{(j)}|x_k, \Theta) = \dfrac{f(x_k, y_k, \gamma_k^{(i)}, \delta_k^{(j)}|\Theta)}{f(x_k|\Theta)} = \dfrac{f(x_k, \gamma_k^{(i)}|\Theta)\, f(y_k, \delta_k^{(j)}|x_k, \gamma_k^{(i)}, \Theta)}{f(x_k|\Theta)}$. (4)

Since

$f(x_k, \gamma_k^{(i)}|\Theta) = f(x_k, U_i|\Theta) = f_G(x_k|U_i)\, p(U_i)$, (5)

$f(y_k, \delta_k^{(j)}|x_k, \gamma_k^{(i)}, \Theta) = f_G(y_k|V_j)\, p(V_j|U_i, \Theta)$. (6)

Note that $\sum_{i=1}^{n_U} p(x_k \in U_i) = 1$; $p(x_k \in U_t)$ is the membership of $x_k$ in cluster $U_t$, so

$p(x_k \in U_t, \Theta) = \dfrac{f_G(x_k|U_t)\, p(U_t)}{\sum_{i=1}^{n_U} f_G(x_k|U_i)\, p(U_i)} = \dfrac{f_G(x_k|U_t)\, p(U_t)}{f(x_k|\Theta)}$, (7)

where $f(x_k|\Theta) = \sum_{t=1}^{n_U} f_G(x_k|U_t)\, p(U_t)$. Through (2), we can get

$f(y_k|x_k, \Theta) = \sum_{t=1}^{n_U} p(x_k \in U_t, \Theta) \sum_{h=1}^{n_V} f_G(y_k|V_h)\, p(V_h|U_t, \Theta)$. (8)

The remaining subexpression in the Q function is

$f(\gamma_k^{(i)}, \delta_k^{(j)}|x_k, y_k, \Theta^g) = \dfrac{f(x_k, y_k, \gamma_k^{(i)}, \delta_k^{(j)}|\Theta^g)}{f(x_k, y_k|\Theta^g)} = \dfrac{f(x_k, \gamma_k^{(i)}|\Theta^g)\, f(y_k, \delta_k^{(j)}|x_k, \gamma_k^{(i)}, \Theta^g)}{f(x_k|\Theta^g)\, f(y_k|x_k, \Theta^g)}$. (9)

Inserting (5), (6), (7), and (8) into (9), the factors of $f(x_k|\Theta^g)$ cancel and we get:

$f(\gamma_k^{(i)}, \delta_k^{(j)}|x_k, y_k, \Theta^g) = \dfrac{f_G(x_k|U_i)\, p(U_i)\, f_G(y_k|V_j)\, p(V_j|U_i, \Theta^g)}{\sum_{t=1}^{n_U} \sum_{h=1}^{n_V} \{ f_G(x_k|U_t)\, p(U_t)\, f_G(y_k|V_h)\, p(V_h|U_t, \Theta^g) \}}$. (10)

For simplicity, we introduce $b_{kij}$:

$b_{kij} = f(\gamma_k^{(i)}, \delta_k^{(j)}|x_k, y_k, \Theta^g) = \dfrac{f_G(x_k|U_i)\, p(U_i)\, f_G(y_k|V_j)\, p(V_j|U_i, \Theta^g)}{\sum_{t=1}^{n_U} \sum_{h=1}^{n_V} \{ f_G(x_k|U_t)\, p(U_t)\, f_G(y_k|V_h)\, p(V_h|U_t, \Theta^g) \}}$. (11)

Likewise, inserting (5) and (6) into (4),

$f_{CAD}(y_k, \gamma_k^{(i)}, \delta_k^{(j)}|x_k, \Theta) = \dfrac{f_G(x_k|U_i)\, p(U_i)\, f_G(y_k|V_j)\, p(V_j|U_i, \Theta)}{f(x_k|\Theta)}$. (12)

With the new denotation $b_{kij}$, inserting (11) and (12) into (3), the Q function becomes

$Q(\Theta, \Theta^g) = \sum_{k=1}^{n} \sum_{i=1}^{n_U} \sum_{j=1}^{n_V} \big[ \log f_G(y_k|V_j) + \log p(V_j|U_i, \Theta) + \log f_G(x_k|U_i) + \log p(U_i) - \log f(x_k|\Theta) \big]\, b_{kij}$. (13)

In the Q function from (13), we have $\log f(x_k|\Theta) = \log \sum_{t=1}^{n_U} f_G(x_k|U_t)\, p(U_t)$, which is hard to differentiate. We can play an approximation "trick" by using $f(x_k|\Theta^g)$ to approximate $f(x_k|\Theta)$, so that the last term becomes a constant with respect to $\Theta$. To make the maximization target clear, we can rewrite (13) as

$Q'(\Theta, \Theta^g) = \sum_{k=1}^{n}\sum_{i=1}^{n_U}\sum_{j=1}^{n_V} b_{kij} \log f_G(y_k|V_j) + \sum_{k=1}^{n}\sum_{i=1}^{n_U}\sum_{j=1}^{n_V} b_{kij} \log p(V_j|U_i)$
$\qquad + \sum_{k=1}^{n}\sum_{i=1}^{n_U}\sum_{j=1}^{n_V} b_{kij} \log f_G(x_k|U_i) + \sum_{k=1}^{n}\sum_{i=1}^{n_U}\sum_{j=1}^{n_V} b_{kij} \log p(U_i) - \sum_{k=1}^{n}\sum_{i=1}^{n_U}\sum_{j=1}^{n_V} b_{kij} \log f(x_k|\Theta^g)$. (14)

We will perform the maximization of $Q'$ by maximizing
each of the four terms in (14) that depend on $\Theta$.
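In practice, the expectation step amounts to computing the posterior weights $b_{kij}$ for every record. The following numpy sketch is illustrative only (not the authors' implementation; all variable names are hypothetical): `p_U`, `mu_U`, `cov_U` parameterize the Gaussians over the environmental attributes, `mu_V`, `cov_V` those over the result attributes, and `p_VU[i, j]` holds the mapping probability $p(V_j|U_i)$.

```python
import numpy as np
from scipy.stats import multivariate_normal

def e_step(X, Y, p_U, mu_U, cov_U, p_VU, mu_V, cov_V):
    """Responsibilities b[k, i, j] of the pair (U_i, V_j) for (x_k, y_k)."""
    n_U, n_V = len(p_U), len(mu_V)
    # f_G(x_k | U_i) p(U_i): weighted Gaussian density over the
    # environmental attributes, one column per cluster U_i.
    fx = np.stack([p_U[i] * multivariate_normal.pdf(X, mu_U[i], cov_U[i])
                   for i in range(n_U)], axis=1)            # (n, n_U)
    # f_G(y_k | V_j): Gaussian density over the result attributes.
    fy = np.stack([multivariate_normal.pdf(Y, mu_V[j], cov_V[j])
                   for j in range(n_V)], axis=1)            # (n, n_V)
    # Numerator: f_G(x|U_i) p(U_i) f_G(y|V_j) p(V_j|U_i).
    b = fx[:, :, None] * fy[:, None, :] * p_VU[None, :, :]  # (n, n_U, n_V)
    # Normalize over all (i, j) pairs so each b[k] sums to one.
    return b / b.sum(axis=(1, 2), keepdims=True)
```

Each slice `b[k]` is a discrete distribution over the $n_U \times n_V$ cluster pairs, which is exactly what the maximization step consumes.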
B.3 Maximization Step

1. First, we want to maximize

$\sum_{k=1}^{n}\sum_{i=1}^{n_U}\sum_{j=1}^{n_V} b_{kij} \log p(U_i)$ (15)

with respect to $p(U_i)$. Note that $p(U_i)$ is constrained by

$\sum_{i=1}^{n_U} p(U_i) = 1$ (16)

when maximizing (15). Through the method of Lagrange multipliers, maximizing (15) is the same as maximizing

$h = \sum_{k=1}^{n}\sum_{i=1}^{n_U}\sum_{j=1}^{n_V} b_{kij} \log p(U_i) + \lambda \Big[ \sum_{i=1}^{n_U} p(U_i) - 1 \Big]$. (17)

Taking the derivative of (17) with respect to $p(U_i)$ and setting it to zero,

$\dfrac{\partial h}{\partial p(U_i)} = \sum_{k=1}^{n}\sum_{j=1}^{n_V} \dfrac{b_{kij}}{p(U_i)} + \lambda = 0$, that is, $\sum_{k=1}^{n}\sum_{j=1}^{n_V} b_{kij} + \lambda\, p(U_i) = 0$. (18)

We now sum (18) over all $i$ and solve for $\lambda$:

$\sum_{h=1}^{n_U} \Big[ \sum_{k=1}^{n}\sum_{j=1}^{n_V} b_{khj} + \lambda\, p(U_h) \Big] = 0, \qquad \lambda = -\dfrac{\sum_{k=1}^{n}\sum_{h=1}^{n_U}\sum_{j=1}^{n_V} b_{khj}}{\sum_{h=1}^{n_U} p(U_h)} = -\sum_{k=1}^{n}\sum_{h=1}^{n_U}\sum_{j=1}^{n_V} b_{khj}$. (19)

Combining (18) and (19) and solving for $p(U_i)$,

$p(U_i) = \dfrac{\sum_{k=1}^{n}\sum_{j=1}^{n_V} b_{kij}}{\sum_{k=1}^{n}\sum_{h=1}^{n_U}\sum_{j=1}^{n_V} b_{khj}}, \quad i \in \{1, 2, \ldots, n_U\}$. (20)

2. Now, we want to maximize

$\sum_{k=1}^{n}\sum_{i=1}^{n_U}\sum_{j=1}^{n_V} b_{kij} \log f_G(y_k|V_j) = \sum_{k=1}^{n}\sum_{i=1}^{n_U}\sum_{j=1}^{n_V} b_{kij} \log \Big[ \dfrac{1}{(2\pi)^{d/2} |\Sigma_{V_j}|^{1/2}} \exp\big( -\tfrac{1}{2}(y_k - \mu_{V_j})^T \Sigma_{V_j}^{-1} (y_k - \mu_{V_j}) \big) \Big]$ (21)

with respect to $V_j = (\mu_{V_j}, \Sigma_{V_j})$. Recall that, by linear algebra, if $A$ is a symmetric matrix:

$\nabla_b (b^T A b) = 2Ab, \qquad \nabla_A (b^T A b) = bb^T, \qquad \nabla_A |A| = A^{-1} |A|$.

Taking the derivative of (21) with respect to $\mu_{V_j}$ and setting it to zero,

$\sum_{k=1}^{n}\sum_{i=1}^{n_U} b_{kij}\, \Sigma_{V_j}^{-1} (y_k - \mu_{V_j}) = 0$. (22)

Solving (22) for $\mu_{V_j},\ j \in \{1, 2, \ldots, n_V\}$,

$\mu_{V_j} = \sum_{k=1}^{n}\sum_{i=1}^{n_U} b_{kij}\, y_k \Big/ \sum_{k=1}^{n}\sum_{i=1}^{n_U} b_{kij}$. (23)

Now, taking the derivative of (21) with regard to $\Sigma_{V_j}^{-1}$ and setting it to zero,

$\tfrac{1}{2} \sum_{k=1}^{n}\sum_{i=1}^{n_U} b_{kij} \big[ \Sigma_{V_j} - (y_k - \mu_{V_j})(y_k - \mu_{V_j})^T \big] = 0$. (24)

Solving (24) for $\Sigma_{V_j},\ j \in \{1, 2, \ldots, n_V\}$,

$\Sigma_{V_j} = \sum_{k=1}^{n}\sum_{i=1}^{n_U} b_{kij} (y_k - \mu_{V_j})(y_k - \mu_{V_j})^T \Big/ \sum_{k=1}^{n}\sum_{i=1}^{n_U} b_{kij}$. (25)

We can maximize $\sum_{k=1}^{n}\sum_{i=1}^{n_U}\sum_{j=1}^{n_V} b_{kij} \log f_G(x_k|U_i)$ in a similar way and get the update rule for $U_i = (\mu_{U_i}, \Sigma_{U_i}),\ i \in \{1, 2, \ldots, n_U\}$:

$\mu_{U_i} = \sum_{k=1}^{n}\sum_{j=1}^{n_V} b_{kij}\, x_k \Big/ \sum_{k=1}^{n}\sum_{j=1}^{n_V} b_{kij}$, (26)

$\Sigma_{U_i} = \sum_{k=1}^{n}\sum_{j=1}^{n_V} b_{kij} (x_k - \mu_{U_i})(x_k - \mu_{U_i})^T \Big/ \sum_{k=1}^{n}\sum_{j=1}^{n_V} b_{kij}$. (27)

3. Now, we want to maximize

$\sum_{k=1}^{n}\sum_{i=1}^{n_U}\sum_{j=1}^{n_V} b_{kij} \log p(V_j|U_i)$ (28)

with regard to $p(V_j|U_i)$.
We can set a constraint on the $p(V_j|U_i)$:

$\sum_{j=1}^{n_V} p(V_j|U_i) = 1, \quad i \in \{1, 2, \ldots, n_U\}$. (29)

Again, by the method of Lagrange multipliers, maximizing (28) is the same as maximizing

$h = \sum_{k=1}^{n}\sum_{i=1}^{n_U}\sum_{j=1}^{n_V} b_{kij} \log p(V_j|U_i) + \lambda_i \Big( \sum_{j=1}^{n_V} p(V_j|U_i) - 1 \Big), \quad i \in \{1, 2, \ldots, n_U\}$. (30)

Taking the derivative of (30) with regard to $p(V_j|U_i)$ and setting it to zero,

$\dfrac{\partial h}{\partial p(V_j|U_i)} = \sum_{k=1}^{n} \dfrac{b_{kij}}{p(V_j|U_i)} + \lambda_i = 0$, that is, $\sum_{k=1}^{n} b_{kij} + \lambda_i\, p(V_j|U_i) = 0$. (31)

We now sum (31) over all $j$ and solve for $\lambda_i$:

$\sum_{h=1}^{n_V} \Big[ \sum_{k=1}^{n} b_{kih} + \lambda_i\, p(V_h|U_i) \Big] = 0$, (32)

$\lambda_i = -\dfrac{\sum_{k=1}^{n}\sum_{h=1}^{n_V} b_{kih}}{\sum_{h=1}^{n_V} p(V_h|U_i)} = -\sum_{k=1}^{n}\sum_{h=1}^{n_V} b_{kih}$. (33)

Finally, combining (31) and (33) and solving for $p(V_j|U_i),\ i \in \{1, 2, \ldots, n_U\},\ j \in \{1, 2, \ldots, n_V\}$, we have

$p(V_j|U_i) = \sum_{k=1}^{n} b_{kij} \Big/ \sum_{k=1}^{n}\sum_{h=1}^{n_V} b_{kih}$. (34)
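All of the maximization-step updates are closed-form weighted averages of the responsibilities $b_{kij}$. The numpy sketch below is illustrative only (hypothetical names, not the authors' code): given the responsibility tensor `b[k, i, j]`, it returns the refreshed priors, mapping probabilities, means, and covariances. In a full EM run, this step would alternate with the expectation step until the objective stops improving.

```python
import numpy as np

def m_step(X, Y, b):
    """Closed-form parameter updates from responsibilities b[k, i, j]."""
    n, n_U, n_V = b.shape
    bi = b.sum(axis=2)  # (n, n_U): total responsibility of U_i for x_k
    bj = b.sum(axis=1)  # (n, n_V): total responsibility of V_j for y_k
    # p(U_i): normalized total responsibility of each U cluster.
    p_U = bi.sum(axis=0) / b.sum()
    # p(V_j | U_i): mapping probabilities, normalized per row i.
    p_VU = b.sum(axis=0)
    p_VU /= p_VU.sum(axis=1, keepdims=True)
    # Weighted means/covariances of the U (environmental) Gaussians.
    mu_U = (bi.T @ X) / bi.sum(axis=0)[:, None]
    cov_U = [np.einsum('k,kp,kq->pq', bi[:, i], X - mu_U[i], X - mu_U[i])
             / bi[:, i].sum() for i in range(n_U)]
    # Weighted means/covariances of the V (result) Gaussians.
    mu_V = (bj.T @ Y) / bj.sum(axis=0)[:, None]
    cov_V = [np.einsum('k,kp,kq->pq', bj[:, j], Y - mu_V[j], Y - mu_V[j])
             / bj[:, j].sum() for j in range(n_V)]
    return p_U, mu_U, cov_U, p_VU, mu_V, cov_V
```

Note that summing `b` over j recovers the per-U weights used in the U updates, and summing over i recovers the per-V weights, mirroring the two "similar way" derivations above.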
ACKNOWLEDGMENTS
This material is based upon work supported by the US
National Science Foundation under Grants ACI-0325459,
IIS-0347408, and IIS-0612170. Any opinions, findings, and
conclusions or recommendations expressed in this material
are those of the author(s) and do not necessarily reflect the
views of the US National Science Foundation.
REFERENCES
[1] M. Markou and S. Singh, "Novelty Detection: A Review Part 1: Statistical Approaches," Signal Processing, vol. 83, pp. 2481-2497, Dec. 2003.
[2] W.-K. Wong, A. Moore, G. Cooper, and M. Wagner, "Rule-Based Anomaly Pattern Detection for Detecting Disease Outbreaks," Proc. Nat'l Conf. Artificial Intelligence (AAAI '02), pp. 217-223, Aug. 2002.
[3] A. Adam, E. Rivlin, and I. Shimshoni, "Ror: Rejection of Outliers by Rotations in Stereo Matching," Proc. Conf. Computer Vision and Pattern Recognition (CVPR '00), pp. 1002-1009, June 2000.
[4] F. de la Torre and M.J. Black, "Robust Principal Component Analysis for Computer Vision," Proc. Eighth Int'l Conf. Computer Vision (ICCV '01), pp. 362-369, 2001.
[5] A. Lazarevic, L. Ertöz, V. Kumar, A. Ozgur, and J. Srivastava, "A Comparative Study of Anomaly Detection Schemes in Network Intrusion Detection," Proc. Third SIAM Int'l Conf. Data Mining, 2003.
[6] Y. Zhang and W. Lee, "Intrusion Detection in Wireless Ad-Hoc Networks," Proc. MobiCom Conf., pp. 275-283, 2000.
[7] E.M. Knorr, R.T. Ng, and V. Tucakov, "Distance-Based Outliers: Algorithms and Applications," VLDB J., vol. 8, nos. 3-4, pp. 237-253, Feb. 2000.
[8] M.M. Breunig, H.-P. Kriegel, R.T. Ng, and J. Sander, "LOF: Identifying Density-Based Local Outliers," Proc. ACM SIGMOD Conf., pp. 93-104, May 2000.
[9] S. Ramaswamy, R. Rastogi, and K. Shim, "Efficient Algorithms for Mining Outliers from Large Data Sets," Proc. ACM SIGMOD Conf., pp. 427-438, May 2000.
[10] E. Eskin, "Anomaly Detection over Noisy Data Using Learned Probability Distributions," Proc. Int'l Conf. Machine Learning, pp. 255-262, June 2000.
[11] K. Yamanishi, J.-I. Takeuchi, G.J. Williams, and P. Milne, "On-Line Unsupervised Outlier Detection Using Finite Mixtures with Discounting Learning Algorithms," Data Mining and Knowledge Discovery, vol. 8, no. 3, pp. 275-300, May 2004.
[12] S. Roberts and L. Tarassenko, "A Probabilistic Resource Allocating Network for Novelty Detection," Neural Computation, vol. 6, pp. 270-284, Mar. 1994.
[13] A. Dempster, N. Laird, and D. Rubin, "Maximum Likelihood from Incomplete Data via the EM Algorithm," J. Royal Statistical Soc., Series B, vol. 39, no. 1, pp. 1-38, 1977.
[14] J. Bilmes, "A Gentle Tutorial on the EM Algorithm and Its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models," Technical Report ICSI-TR-97-021, Univ. of California at Berkeley, 1997.
[15] D.M. Chickering and D. Heckerman, "Efficient Approximations for the Marginal Likelihood of Bayesian Networks with Hidden Variables," Machine Learning, vol. 29, nos. 2-3, pp. 181-212, 1997.
[16] M.A.T. Figueiredo and A.K. Jain, "Unsupervised Learning of Finite Mixture Models," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 24, no. 3, pp. 381-396, Mar. 2002.
[17] B. Efron and R.J. Tibshirani, An Introduction to the Bootstrap. Chapman and Hall, 1993.
[18] A. Lazarevic, L. Ertöz, V. Kumar, A. Ozgur, and J. Srivastava, "A Comparative Study of Anomaly Detection Schemes in Network Intrusion Detection," Proc. SIAM Conf. Data Mining, 2003.
[19] C.C. Aggarwal and P.S. Yu, "Outlier Detection for High Dimensional Data," Proc. ACM SIGMOD Conf., pp. 37-46, May 2001.
[20] K.S. Beyer, J. Goldstein, R. Ramakrishnan, and U. Shaft, "When Is 'Nearest Neighbor' Meaningful?" Proc. Int'l Conf. Database Theory, pp. 217-235, Jan. 1999.
[21] M. Kulldorff, "A Spatial Scan Statistic," Comm. in Statistics: Theory and Methods, vol. 26, no. 6, pp. 1481-1496, 1997.
[22] D.B. Neill and A.W. Moore, "Rapid Detection of Significant Spatial Clusters," Proc. 2004 ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining, pp. 256-265, 2004.
[23] M. Kulldorff, "Spatial Scan Statistics: Models, Calculations, and Applications," Scan Statistics and Applications, pp. 303-322, 1999.
[24] W.-K. Wong, A. Moore, G. Cooper, and M. Wagner, "Bayesian Network Anomaly Pattern Detection for Disease Outbreaks," Proc. Int'l Conf. Machine Learning, pp. 808-815, Aug. 2003.
[25] L. Breiman, R.A. Olshen, J.H. Friedman, and C.J. Stone, Classification and Regression Trees. Chapman and Hall, 1984.
[26] J.R. Quinlan, "Learning with Continuous Classes," Proc. Fifth Australian Joint Conf. Artificial Intelligence, pp. 343-348, 1992.
[27] P.S. Bradley, U.M. Fayyad, and C. Reina, "Scaling Clustering Algorithms to Large Databases," Knowledge Discovery and Data Mining, pp. 9-15, 1998.
Xiuyao Song received the BS and MS degrees from the Computer Science Department at the Huazhong University of Science and Technology, China, in 1998 and 2001, respectively. She is working toward the PhD degree in the Computer and Information Science and Engineering Department at the University of Florida. She spent a year as a project management specialist at Lucent Technologies (China) Co., Ltd. She began her PhD program in September 2002, and her research interests are in data mining, more specifically, anomaly detection in large databases.
Mingxi Wu received the BS degree in computer
science from Fudan University in 2000. After
that, he worked at Microsoft for two years as a
global technical support engineer. He is a PhD
student in the Computer and Information
Science and Engineering Department at the
University of Florida (UF). His current research
includes outlier/anomaly detection in high-dimensional data sets and top-k related database
query processing.
Christopher Jermaine received the PhD degree from the Georgia Institute of Technology.
He is an assistant professor in the Computer and
Information Science and Engineering Department at the University of Florida, Gainesville. His
research interests are in the areas of databases,
data mining, and applied statistics, with an
emphasis on data analysis and randomized
algorithms.
Sanjay Ranka is a professor of computer and
information science and engineering at the
University of Florida, Gainesville. His research
interests are in the areas of large-scale software
systems, grid computing, high-performance
computing, data mining, biomedical computing,
and optimization. Most recently, he was the chief
technology officer at Paramark, where he
developed real-time optimization software for
optimizing marketing campaigns. He has also
held positions as a tenured faculty member at Syracuse University and
as a researcher/visitor at the IBM T.J. Watson Research Labs and
Hitachi America Limited. He has coauthored two books: Elements of
Artificial Neural Networks (MIT Press) and Hypercube Algorithms
(Springer Verlag). He has also coauthored six book chapters, more
than 55 journal papers, and 100 conference and workshop papers. He
was one of the main architects of the Syracuse Fortran 90D/HPF
compiler and served on the MPI Standards committee. He is a fellow of
the IEEE and a member of IFIP Committee on System Modeling and
Optimization.