IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 19, NO. 5, MAY 2007

Conditional Anomaly Detection

Xiuyao Song, Mingxi Wu, Christopher Jermaine, and Sanjay Ranka, Fellow, IEEE

Abstract—When anomaly detection software is used as a data analysis tool, finding the hardest-to-detect anomalies is not the most critical task. Rather, it is often more important to make sure that the anomalies reported to the user are in fact interesting. If too many unremarkable data points are returned to the user labeled as candidate anomalies, the software will soon fall into disuse. One way to ensure that returned anomalies are useful is to make use of domain knowledge provided by the user. Often, the data in question include a set of environmental attributes whose values a user would never consider to be directly indicative of an anomaly. However, such attributes cannot be ignored because they have a direct effect on the expected distribution of the result attributes whose values can indicate an anomalous observation. This paper describes a general-purpose method called conditional anomaly detection for taking such differences among attributes into account, and proposes three different expectation-maximization algorithms for learning the model used in conditional anomaly detection. Experiments with 13 different data sets compare our algorithms with several other more standard methods for outlier or anomaly detection.

Index Terms—Data mining, mining methods and algorithms.

1 INTRODUCTION

ANOMALY detection has been an active area of computer science research for a very long time (see the recent survey by Markou and Singh [1]). Applications include medical informatics [2], computer vision [3], [4], computer security [5], sensor networks [6], general-purpose data analysis and mining [7], [8], [9], and many other areas.
However, in contrast to problems in supervised learning, where studies of classification accuracy are the norm, little research has systematically addressed the issue of accuracy in general-purpose unsupervised anomaly detection methods. Papers have suggested many alternate problem definitions designed to boost the chances of finding anomalies (again, see Markou and Singh's survey [1]), but there have been few systematic attempts to maintain high coverage while keeping false positives to a minimum. Accuracy in unsupervised anomaly detection is important because, if used as a data mining or data analysis tool, an unsupervised anomaly detection methodology will be given a "budget" of a certain number of data points that it may call anomalies. In most realistic scenarios, a human being must investigate candidate anomalies reported by an automatic system, and usually has a limited capacity to do so. This naturally limits the number of candidate anomalies that a detection methodology may usefully produce. Given that this number is likely to be small, it is important that most of those candidate anomalies are interesting to the end user.

The authors are with the Computer and Information Sciences and Engineering Department, University of Florida, CSE Building Room E301, Gainesville, FL 32611. E-mail: {xsong, mwu, cjermain, ranka}@cise.ufl.edu. Manuscript received 25 Aug. 2005; revised 2 Apr. 2006; accepted 25 Oct. 2006; published online 29 Jan. 2007. For information on obtaining reprints of this article, please send e-mail to: tkde@computer.org, and reference IEEECS Log Number TKDE-0334-0805. Digital Object Identifier no. 10.1109/TKDE.2007.1009.

This is especially true in applications where anomaly detection software is used to monitor incoming data in order to report anomalies in real time. When such events are detected, an alarm is sounded that requires immediate human investigation and response.
Unlike the offline case, where the cost and frustration associated with human involvement can usually be amortized over multiple alarms in each batch of data, each false alarm in the online case will likely result in an additional notification of a human expert, and the cost cannot be amortized.

1.1 Conditional Anomaly Detection

Taking such considerations into account is the goal of the research described in this paper. Rather than trying to find new and intriguing classes of anomalies, it is perhaps more important to ensure that those data points that a method does find are, in fact, surprising. To accomplish this, we ask the questions: What is the biggest source of inaccuracy for existing anomaly detection methods? Why might they return a large number of points that are not anomalies? To answer, we note that, by definition, "statistical" methods for anomaly detection look for data points that refute a standard null hypothesis asserting that all data were produced by the same generative process. The null hypothesis is represented either explicitly (as in parametric methods [10], [11], [12]) or implicitly (as in various distance-based methods [9], [7], [8]). However, the questionable assumption made by virtually all existing methods is that there is no a priori knowledge indicating that all data attributes should not be treated in the same fashion. This assumption is likely to cause problems with false positives in many problem domains. In almost every application of anomaly detection, there are likely to be several data attributes that a human being would never consider to be directly indicative of an anomaly. By allowing an anomaly detection methodology to consider such attributes equally, accuracy may suffer.

For example, consider the application of online anomaly detection to syndromic surveillance, where the goal is to detect a disease outbreak at the earliest possible instant. Imagine that we monitor two variables: max_daily_temp and num_fever. max_daily_temp tells us the maximum outside temperature on a given day, and num_fever tells us how many people were admitted to a hospital emergency room complaining of a high fever. Clearly, max_daily_temp should never be taken as direct evidence of an anomaly. Whether it was hot or cold on a given day should never directly indicate whether or not we think we have seen the start of an epidemic. For example, if the high in Gainesville, Florida, on a given June day was only 70 degrees Fahrenheit (when the average high temperature is closer to 90 degrees), we simply have to accept that it was an abnormally cool day, but this does not indicate in any way that an outbreak has occurred.

While the temperature may not directly indicate an anomaly, it is not acceptable to simply ignore max_daily_temp, because num_fever (which is clearly of interest in detecting an outbreak) may be directly affected by max_daily_temp, or by a hidden variable whose value can easily be deduced from the value of the max_daily_temp attribute. In this example, we know that people are generally more susceptible to illness in the winter, when the weather is cooler. We call attributes such as max_daily_temp environmental attributes. The remainder of the data attributes (which the user would consider to be directly indicative of anomalous data) are called indicator attributes.

The anomaly detection methodology considered in this paper, called conditional anomaly detection, or CAD, takes into account the difference between the user-specified environmental and indicator attributes during the anomaly detection process and how this affects the idea of an "anomaly." For example, consider Fig. 1. In this figure, Point A and Point B are both anomalies or outliers based on most conventional definitions.

Fig. 1. Syndromic surveillance application.
However, if we make use of the additional information that max_daily_temp is not directly indicative of an anomaly, then it is likely safe for an anomaly detection system to ignore Point A. Why? If we accept that it is a cold day, then encountering a large number of fever cases makes sense, reducing the interest of this observation. For this reason, the CAD methodology will only label Point B an anomaly.

The methodology we propose works as follows: We assume that we are given a baseline data set and a user-defined partitioning of the data attributes into environmental attributes and indicator attributes, based upon which attributes the user decides should be directly indicative of an anomaly. Our method then analyzes the baseline data and learns which values are usual or typical for the indicator attributes. When a subsequent data point is observed, it is labeled anomalous or not depending on how much its indicator attribute values differ from the usual indicator attribute values. However, the environmental attributes are not necessarily ignored, because it may be that indicator attribute values are conditioned on environmental attribute values. If no such relationships are present in the baseline data, then the CAD methodology recognizes this and will effectively ignore the environmental attributes. In this case, or in the case when the user is unable or unwilling to partition the attributes (and so, by default, all attributes are indicator attributes), the CAD methodology is equivalent to simply performing anomaly detection only on the indicator attributes. If such relationships are present, then the CAD methodology learns them and takes them into account.

1.2 Our Contributions

In addition to describing the parametric CAD methodology, this paper describes three EM-based [13] learning algorithms for the parametric CAD model.
This paper describes a rigorous testing protocol and shows that, if the true definition of an "anomaly" is a data point whose indicator attributes are not in keeping with its environmental attributes, then the CAD methodology does indeed increase accuracy. By comparing the CAD methodology and its three learning algorithms against several conventional anomaly detection methods on 13 different data sets, this paper shows that the CAD methodology is significantly better at recognizing data points where there is a mismatch between environmental and indicator attributes and at ignoring unusual indicator attribute values that are still in keeping with the observed environmental attributes. Furthermore, this effect is shown to be statistically significant, and not just an artifact of the number of data sets chosen for testing.

1.3 Paper Organization

The remainder of the paper is organized as follows: In Section 2, we describe the statistical model that we use to model environmental and indicator attributes and the relationships between them. Section 2 also details how this model can be used to detect anomalies. Section 3 describes how to learn this model from an existing data set. Experimental results are given in Section 4, related work in Section 5, and the paper is concluded in Section 6. The Appendix gives a mathematical derivation of our learning algorithms.

2 STATISTICAL MODEL

This section describes the statistical model we make use of for reducing the false positive rate during online anomaly detection. Like other parametric methods (such as those enumerated in Section 2.1 of Markou and Singh's survey [1]), the detection algorithms described in this paper make use of a two-step process:

1. First, existing data are used to build a statistical model that captures the trends and relationships present in the data.
2. Then, during the online anomaly detection process, future data records are checked against the model to see if they match with what is known about how various data attributes behave individually and how they interact with one another.

Like many other statistical methods, our algorithms rely on the basic principle of maximum likelihood estimation (MLE) [14]. In MLE, a (possibly complex) parametric distribution f is first chosen to represent the data. f is then treated as a generative model for the data set in question. That is, we make the assumption that our data set of size n was, in fact, generated by drawing n independent samples from f. Given this assumption, we adjust the parameters governing the behavior of f so as to maximize the probability that f would have produced our data set.

Formally, to make use of MLE, we begin by specifying a distribution $f(x \mid \theta_1, \theta_2, \ldots, \theta_k)$, where the probability density function f gives the probability that a single experiment or sample from the distribution would produce the outcome x. The parameter set $\Theta = \langle \theta_1, \theta_2, \ldots, \theta_k \rangle$ governs the specific characteristics of the distribution (such as the mean and variance). Since we are trying to model an entire data set, we do not have only a single experiment x; rather, we view our data set $X = \langle x_1, x_2, \ldots, x_n \rangle$ as the result of performing n experiments over f. Assuming that these experiments are independent, the likelihood that we would have observed X is given by:

$$L(X \mid \Theta) = \prod_{k=1}^{n} f(x_k \mid \Theta).$$

The MLE problem can then be succinctly stated as: Given f and X, can we solve for $\Theta$ so as to maximize $L(X \mid \Theta)$ over all possible values of $\Theta$? Once we have solved for $\Theta$, we have a complete model for our data set.
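To make the maximization concrete, the following minimal sketch (ours, not the paper's code) shows MLE for the simplest case of a single one-dimensional Gaussian, where the maximizing parameters have a closed form: the sample mean and the $1/n$ sample variance.

```python
import numpy as np

# Minimal illustration: MLE for a single 1-D Gaussian, where maximizing
# L(X | Theta) has a closed-form answer.
def gaussian_mle(X):
    mu = X.mean()
    sigma2 = ((X - mu) ** 2).mean()   # MLE variance uses 1/n, not 1/(n-1)
    return mu, sigma2

def log_likelihood(X, mu, sigma2):
    # log L(X | Theta) = sum_k log f(x_k | mu, sigma2)
    return np.sum(-0.5 * np.log(2 * np.pi * sigma2)
                  - (X - mu) ** 2 / (2 * sigma2))

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=10_000)
mu, sigma2 = gaussian_mle(X)
# No other parameter choice can score the data higher than the MLE:
assert log_likelihood(X, mu, sigma2) >= log_likelihood(X, 4.5, 4.0)
```

For richer models such as the mixtures used below, no closed form exists, and iterative procedures such as EM (Section 3) take the place of these one-line estimators.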
2.1 The Gaussian Mixture Model

One of the common applications of MLE (particularly in machine learning) is to fit a multidimensional data set to a model described using the PDF $f_{GMM}$ ($f_{GMM}$ refers to the PDF for a Gaussian Mixture Model; a GMM is essentially a weighted mixture of multidimensional normal variables and is widely used due to its ability to closely model even difficult, nonparametric distributions). A GMM can very easily be used as the basis for anomaly detection and is the most ubiquitous parametric model used for the purpose (see Section 2.1 of Markou and Singh [1]). The overall process begins by first performing the MLE required to fit the GMM to the existing data, using one of several applicable optimization techniques.

Of course, the problem described in the introduction of this paper is that, if we allow all attributes to be treated in the same fashion (as all existing parametric methods seem to do), then false positives may become a serious problem. Our solution to this problem is to define a generative statistical model encapsulated by a PDF $f_{CAD}$ that does not treat all attributes identically. This model is described in detail in the next section.

2.2 Conditional Anomaly Detection

$f_{CAD}$ can be described intuitively as follows: We assume that each data point is an ordered pair of two sets of attribute values $(x, y)$, where x is the set of environmental attribute values and y is the set of indicator attribute values. Our PDF will be of the form $f_{CAD}(y \mid \Theta, x)$. The PDF $f_{CAD}$ is conditioned on x, which implies that our generative model does not generate the environmental attributes associated with a data point. The PDF gives the likelihood that a single experiment with input x will give output y. Thus, a data point's environmental attributes x are taken as input and used along with the model parameters $\Theta$ to generate the set of indicator attributes y.
The net result is that, when we perform an MLE to fit our model to the data, we will learn how the environmental attributes map to the indicator attributes. Formally, our model uses the following three sets of parameters, which together make up the parameter set $\Theta$:

- A GMM U consisting of $n_U$ Gaussians (or multidimensional normal "point clouds"), each of dimensionality $d_U$. The purpose of U is to model the distribution of the data set's environmental attributes. The ith Gaussian in U is denoted by $U_i$, and the weight of the ith Gaussian is denoted by $p(U_i)$. The distribution parameters of the ith Gaussian in U are given by $\theta_{U_i} = \langle \mu_{U_i}, \Sigma_{U_i} \rangle$.
- A set of $n_V$ additional Gaussians of dimensionality $d_V$, which we refer to collectively as V. Each Gaussian in V models a portion of the data space for the indicator attributes. The distribution parameters of the jth Gaussian in V are given by $\theta_{V_j} = \langle \mu_{V_j}, \Sigma_{V_j} \rangle$.
- A probabilistic mapping function $p(V_j \mid U_i)$. Given a Gaussian $U_i \in U$, this function gives the probability that $U_i$ maps to $V_j$. In other words, this function gives the probability that a tuple's indicator attributes are produced by the Gaussian $V_j$, if we know that the tuple's environmental attributes were produced by the Gaussian $U_i$.

Given these parameters, we assume the following generative process for each data point in the historical data set:

1. The process begins with the assumption that the values for a data point's environmental attributes were produced by a sample from U. Let $U_i$ denote the Gaussian in U which was sampled. Note that the generative process does not actually produce x; it is assumed that x was produced previously, and the Gaussian which produced x is given as input.
2. Next, we use $U_i$ to toss a set of "dice" to determine which Gaussian from V will be used to produce y. The probability of choosing $V_j$ is given directly by the mapping function and is $p(V_j \mid U_i)$.
3. Finally, y is produced by taking a single sample from the Gaussian that was randomly selected in Step 2.

The process of generating a value for y given a value for x is shown in Fig. 2. It is important to note that, while our generative model assumes that we know which Gaussian was used to generate x, in practice, this information is not included in the data set. We can only infer it by computing a probability that each Gaussian in U was the one that produced x. As a result, $f_{CAD}(y \mid \Theta, x)$ is defined as follows:

$$f_{CAD}(y \mid \Theta, x) = \sum_{i=1}^{n_U} p(x \in U_i) \sum_{j=1}^{n_V} f_G(y \mid \theta_{V_j}) \, p(V_j \mid U_i),$$

where:

- $p(x \in U_i)$ is the probability that x was produced by the ith Gaussian in U and is computed using Bayes' rule:
$$p(x \in U_i) = \frac{f_G(x \mid \theta_{U_i}) \, p(U_i)}{\sum_{k=1}^{n_U} f_G(x \mid \theta_{U_k}) \, p(U_k)}.$$
- $f_G(y \mid \theta_{V_j})$ is the likelihood that the jth Gaussian in V would produce y and is the standard Gaussian distribution (see the Appendix).
- $p(V_j \mid U_i)$ is the probability that the ith Gaussian from U maps to the jth Gaussian from V. This is given directly as a parameter in $\Theta$.

Given the function $f_{CAD}(y \mid \Theta, x)$, the problem of modeling a data set $(X, Y)$ can be succinctly stated as: Choose $\Theta$ so as to maximize

$$\Lambda = \sum_{k=1}^{n} \log f_{CAD}(y_k \mid \Theta, x_k)$$

over all possible values of $\Theta$. In this expression, $\Lambda$ is referred to as the log-likelihood of the resulting model.

Fig. 2. The generative model used in conditional anomaly detection. Given x, a vector of environmental attribute values, we first determine which Gaussian from U generated x (Step 1). Next, we perform a random trial using the mapping function to see which Gaussian from V we map to (Step 2). In our example, we happen to choose $V_3$, which has a probability 0.6 of being selected given that x is from $U_2$. Finally, we perform a random trial over the selected Gaussian to generate y (Step 3).
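Once the parameters are in hand, the definition of $f_{CAD}$ can be evaluated directly. The sketch below is our own rendering, not the authors' code; the container names (`U_params`, `V_params`, `p_U`, `mapping`) are hypothetical. `mapping` is the $n_U \times n_V$ matrix of $p(V_j \mid U_i)$ values, so a posterior-weighted matrix product computes the double sum.

```python
import numpy as np
from scipy.stats import multivariate_normal

def f_cad(x, y, U_params, p_U, V_params, mapping):
    """Evaluate f_CAD(y | Theta, x) for one data point.

    U_params, V_params: lists of (mean, cov) pairs; p_U: weights p(U_i);
    mapping: n_U x n_V matrix with mapping[i, j] = p(V_j | U_i).
    """
    # p(x in U_i) via Bayes' rule over the environmental Gaussians
    fx = np.array([multivariate_normal.pdf(x, m, c) for m, c in U_params])
    post = fx * p_U
    post /= post.sum()
    # f_G(y | theta_{V_j}) for each indicator Gaussian
    fy = np.array([multivariate_normal.pdf(y, m, c) for m, c in V_params])
    # sum_i p(x in U_i) sum_j p(V_j | U_i) f_G(y | theta_{V_j})
    return float(post @ mapping @ fy)

# Toy model in the spirit of Section 1.1: cold days (U_0) usually map to
# many fevers (V_0), hot days (U_1) to few (V_1). All values are made up.
U = [(np.array([50.0]), np.array([[25.0]])),
     (np.array([90.0]), np.array([[25.0]]))]
V = [(np.array([100.0]), np.array([[100.0]])),
     (np.array([20.0]), np.array([[100.0]]))]
M = np.array([[0.9, 0.1], [0.1, 0.9]])
cold, hot, fevers = np.array([50.0]), np.array([90.0]), np.array([100.0])
# Many fevers are far less surprising on a cold day than on a hot one:
assert f_cad(cold, fevers, U, np.array([0.5, 0.5]), V, M) > \
       f_cad(hot, fevers, U, np.array([0.5, 0.5]), V, M)
```

This is exactly the behavior illustrated by Point A in Fig. 1: the same indicator value receives a high density (unsurprising) under one environmental context and a low density (anomalous) under another.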
2.3 Detecting Conditional Anomalies

Like other parametric anomaly detection methods, the methodology for detecting future anomalies using the CAD model is based on the observation that, once an optimal value for $\Theta$ has been learned, the function $f_{CAD}$ can be used to give a meaningful ordering over all data points past and future, from the most astonishing to the most typical. When a new point is observed that has a small value for $f_{CAD}$, this means that it occurs in a portion of the data space with low density, and it should be flagged as an anomaly.

In order to determine the cutoff value for $f_{CAD}$, below which a new point is determined to be anomalous, we allow the user to first pick a fraction $\epsilon$ from 0 to 1, and then we choose the $f_{CAD}$ value such that exactly a fraction $\epsilon$ of the data points in the training set would have been flagged as anomalies. A high value of $\epsilon$ means that the method is less selective and, hence, it increases the chance that an anomalous event will be flagged for further investigation. However, at the same time, this will tend to increase the rate of false positives.

Given $\epsilon$ and the size of the training data set n, in order to choose a cutoff such that exactly a fraction $\epsilon$ of the training points would be considered anomalous, we simply sort the training data based upon their $f_{CAD}$ values, from lowest to highest. Next, let $c = \epsilon n$. This is the index of the cutoff point in the existing data set. Any future observation which is found to be more unusual than the cth existing point will be termed an anomaly. To check if a point $(x_{new}, y_{new})$ is anomalous, we simply check if

$$f_{CAD}(y_{new} \mid \Theta, x_{new}) < f_{CAD}(y_c \mid \Theta, x_c).$$

If yes, then the new observation is flagged as an anomaly.

3 LEARNING THE MODEL

Of course, defining a model like $f_{CAD}$ is not enough. It is also necessary to define a process whereby it is possible to adjust the model parameters so that the model can be fitted to the existing data.
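The cutoff procedure above is easy to state in code. The following sketch assumes the training scores are precomputed $f_{CAD}$ values; the function and variable names are ours, not the paper's API.

```python
import numpy as np

def learn_cutoff(train_scores, eps):
    """Pick the f_CAD cutoff so that a fraction eps of the training
    points would have been flagged as anomalies (Section 2.3)."""
    s = np.sort(np.asarray(train_scores))   # lowest (most unusual) first
    c = int(eps * len(s))                   # index c = eps * n of the cutoff point
    return s[c]

def is_anomaly(new_score, cutoff):
    # A future point is an anomaly if it is more unusual than the cth
    # existing point, i.e., f_CAD(y_new | Theta, x_new) < cutoff.
    return new_score < cutoff

scores = np.linspace(0.01, 1.0, 100)   # stand-in training scores
cutoff = learn_cutoff(scores, 0.05)
assert int(np.sum(scores < cutoff)) == 5   # exactly 5% of training flagged
assert is_anomaly(0.001, cutoff) and not is_anomaly(0.5, cutoff)
```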
In general, this type of maximization is intractable. This section defines several learning algorithms for computing the model. The section begins with a high-level introduction to the Expectation Maximization [13] framework that serves as a basis for all three of the algorithms.

3.1 The Expectation Maximization Methodology

While EM is usually called an "algorithm," in reality, it is a generic technique for solving MLE problems. EM can be either easy or nearly impossible to apply to a specific MLE problem, but, in general, its application is nontrivial. We now turn to a description of the EM methodology.

As mentioned previously, most interesting MLE problems are computationally intractable. In practice, this intractability results from one or more "hidden" variables that cannot be observed in the data and must be inferred simultaneously with the MLE. In the conditional anomaly detection problem, these hidden variables are the identities of the Gaussians that were used to produce the environmental and indicator attributes of each data point in the historical data set. The fact that these variables are not observable makes the problem difficult; if these values were included in the data, then solving for $\Theta$ becomes a mostly straightforward maximization problem relying on college-level calculus. The EM algorithm is useful in precisely such cases. EM is an iterative method, whose basic outline is as follows:

The Basic Expectation Maximization Algorithm
1) While the model continues to improve:
   a) Let $\Theta$ be the current "best guess" as to the optimal configuration of the model.
   b) Let $\bar{\Theta}$ be the next "best guess" as to the optimal configuration of the model.
   c) E-Step: Compute Q, the expected value of $\Lambda$ with respect to $\bar{\Theta}$ over all possible values of the hidden parameters. The probability of observing each possible set of hidden parameter values (required to compute the expectation of $\Lambda$) is computed using $\Theta$.
   d) M-Step: Choose $\bar{\Theta}$ so as to maximize the value for Q. $\bar{\Theta}$ then becomes the new "best guess."

EM sidesteps the problem of not knowing the values of the hidden parameters by instead considering the expected value of the PDF with respect to all possible values of the hidden parameters. Since computing the expected value of $\Lambda$ with respect to the hidden parameters requires that we be able to compute the probability of every possible configuration of the hidden parameters (and to do this we need to know the optimal configuration $\Theta$), at each iteration, EM computes this probability with respect to the best guess so far for $\Theta$. Thus, EM does not compute a globally optimal solution at each iteration; rather, it computes a best guess as to the answer, which is guaranteed to improve at each iteration. EM has the desirable property that, while it does not always converge to the globally optimal solution, it does always converge to a locally optimal solution, where no combination of slight "tweaks" of the various values in $\Theta$ can improve the value of $\Lambda$.

3.2 The Direct-CAD Algorithm

The remainder of this section outlines three different EM algorithms for learning the CAD model over a data set. The first of the three algorithms attempts to learn all of the parameters simultaneously: the Gaussians that govern the generation of the environmental attributes, the Gaussians that govern the generation of the indicator attributes, and the mapping function between the two sets of Gaussians. This section only outlines the approach and gives the equations that are used to implement the algorithm; the complete mathematical derivation of the EM algorithm is quite involved and is left to Appendix B. As in any EM algorithm, the Direct-CAD algorithm relies on the two classic steps: the E-Step and the M-Step.

3.2.1 E-Step

To begin the process of deriving an appropriate EM algorithm, we first need to define the hidden parameters in the context of the conditional anomaly detection problem.
In conditional anomaly detection, the hidden parameters are the identities of the clusters from U and V that were used to produce each data point. To denote these parameters, we define:

- $\gamma_k$ tells which cluster in U produced the kth value in X. $\gamma_k^{(i)}$ denotes the assertion "the kth set of environmental attributes was produced by the ith cluster in U."
- $\delta_k$ tells which cluster in V produced the kth value in Y. $\delta_k^{(j)}$ denotes the assertion "the kth set of indicator attributes was produced by the jth cluster in V."

Given these variables, we can then derive the Q function required by the E-Step of the EM algorithm. Let $Q(\bar{\Theta}, \Theta)$ denote the expected value of $\Lambda$ over all possible $\gamma, \delta$, given the current parameter values $\Theta$. Then, as derived in Appendix B,

$$Q(\bar{\Theta}, \Theta) = E\left[ \sum_{k=1}^{n} \log f_{CAD}(y_k, \gamma_k, \delta_k \mid x_k, \bar{\Theta}) \,\middle|\, X, Y, \Theta \right]$$
$$= \sum_{k=1}^{n} \sum_{i=1}^{n_U} \sum_{j=1}^{n_V} \left\{ \log f_G(y_k \mid \bar{\theta}_{V_j}) + \log \bar{p}(V_j \mid U_i) + \log f_G(x_k \mid \bar{\theta}_{U_i}) + \log \bar{p}(U_i) - \log f(x_k \mid \bar{\Theta}) \right\} b_{kij},$$

where

$$b_{kij} = \frac{f_G(x_k \mid \theta_{U_i}) \, p(U_i) \, f_G(y_k \mid \theta_{V_j}) \, p(V_j \mid U_i)}{\sum_{t=1}^{n_U} \sum_{h=1}^{n_V} f_G(x_k \mid \theta_{U_t}) \, p(U_t) \, f_G(y_k \mid \theta_{V_h}) \, p(V_h \mid U_t)}.$$

3.2.2 M-Step

In the M-Step, we now use standard calculus to maximize the value of Q over all $\bar{\Theta}$. This results in a series of equations that can be used to compute the new parameter values that will be used during the next iteration. The details of the derivation of each of the equations are reasonably involved and are given in the Appendix. However, the use of the equations is relatively straightforward and requires only that we perform standard arithmetic and linear algebra operations over the previous model parameters and the data set. The formulas for the components of $\bar{\Theta}$ are as follows:

$$\bar{p}(U_i) = \sum_{k=1}^{n} \sum_{j=1}^{n_V} b_{kij} \Big/ \sum_{k=1}^{n} \sum_{h=1}^{n_U} \sum_{j=1}^{n_V} b_{khj},$$
$$\bar{\mu}_{U_i} = \sum_{k=1}^{n} \sum_{j=1}^{n_V} b_{kij} \, x_k \Big/ \sum_{k=1}^{n} \sum_{j=1}^{n_V} b_{kij},$$
$$\bar{\Sigma}_{U_i} = \sum_{k=1}^{n} \sum_{j=1}^{n_V} b_{kij} (x_k - \bar{\mu}_{U_i})(x_k - \bar{\mu}_{U_i})^T \Big/ \sum_{k=1}^{n} \sum_{j=1}^{n_V} b_{kij},$$
$$\bar{\mu}_{V_j} = \sum_{k=1}^{n} \sum_{i=1}^{n_U} b_{kij} \, y_k \Big/ \sum_{k=1}^{n} \sum_{i=1}^{n_U} b_{kij},$$
$$\bar{\Sigma}_{V_j} = \sum_{k=1}^{n} \sum_{i=1}^{n_U} b_{kij} (y_k - \bar{\mu}_{V_j})(y_k - \bar{\mu}_{V_j})^T \Big/ \sum_{k=1}^{n} \sum_{i=1}^{n_U} b_{kij},$$
$$\bar{p}(V_j \mid U_i) = \sum_{k=1}^{n} b_{kij} \Big/ \sum_{k=1}^{n} \sum_{h=1}^{n_V} b_{kih}.$$

Given the update rules described above, our EM algorithm for learning the conditional anomaly detection model is then as follows:

The Direct-CAD Algorithm
1) Choose an initial set of values for $\langle \mu_{V_j}, \Sigma_{V_j} \rangle$, $\langle \mu_{U_i}, \Sigma_{U_i} \rangle$, $p(V_j \mid U_i)$, $p(U_i)$, for all i, j.
2) While the model continues to improve (measured by the improvement of $\Lambda$):
   a) Compute $b_{kij}$ for all k, i, and j as described under E-Step above.
   b) Compute $\langle \bar{\mu}_{V_j}, \bar{\Sigma}_{V_j} \rangle$, $\langle \bar{\mu}_{U_i}, \bar{\Sigma}_{U_i} \rangle$, $\bar{p}(V_j \mid U_i)$, and $\bar{p}(U_i)$ for all i, j as described in Section 3.2.2 above.
   c) Set $\langle \mu_{V_j}, \Sigma_{V_j} \rangle := \langle \bar{\mu}_{V_j}, \bar{\Sigma}_{V_j} \rangle$, $\langle \mu_{U_i}, \Sigma_{U_i} \rangle := \langle \bar{\mu}_{U_i}, \bar{\Sigma}_{U_i} \rangle$, $p(V_j \mid U_i) := \bar{p}(V_j \mid U_i)$, $p(U_i) := \bar{p}(U_i)$ for all i, j.

The time and space requirements of this algorithm can be derived as follows: To compute $b_{kij}$ in the E-Step, we first compute $f_G(x_k \mid \theta_{U_i})$ and $f_G(y_k \mid \theta_{V_j})$, which takes $O(n n_U d_U^2)$ time and $O(n n_V d_V^2)$ time, respectively. Then, it takes $O(n n_U n_V)$ time to compute all the $b_{kij}$ values. In the M-Step, we update the value for each component of $\bar{\Theta}$. It takes $O(n n_U n_V d_U^2)$ time to update $\bar{\Sigma}_{U_i}$, and $O(n n_U n_V d_V^2)$ time to update $\bar{\Sigma}_{V_j}$. To update the rest of the components, the time complexity is $O(n n_U n_V (d_U + d_V + n_V))$. Adding these together, the time complexity for one iteration (E-Step and M-Step) is $O(n n_U n_V (d_U^2 + d_V^2))$. The total time complexity of the conditional anomaly detection algorithm depends on the convergence speed, i.e., the number of iterations, which depends on the data set itself. Accordingly, in the E-Step, storing the results for $f_G(x_k \mid \theta_{U_i})$, $f_G(y_k \mid \theta_{V_j})$, and $b_{kij}$ requires $O(n n_U n_V)$ space. Storing the updated components of $\bar{\Theta}$ requires $O(n_U d_U^2 + n_V d_V^2)$ space.

3.3 The GMM-CAD-Full Algorithm

Unfortunately, the algorithm described in the previous subsection has a major potential drawback.
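Before turning to the alternatives, here is a compact sketch of the Direct-CAD updates just described. This is our own vectorized rendering (NumPy/SciPy, hypothetical container names), not the authors' implementation; it shows the responsibilities $b_{kij}$ and the two weight updates, which are the simplest of the M-Step formulas.

```python
import numpy as np
from scipy.stats import multivariate_normal

def e_step(X, Y, U_params, p_U, V_params, p_V_given_U):
    """Direct-CAD E-Step: responsibilities b_kij, shape (n, n_U, n_V)."""
    n = len(X)
    # f_G(x_k | theta_{U_i}) and f_G(y_k | theta_{V_j})
    fx = np.array([[multivariate_normal.pdf(X[k], m, c) for m, c in U_params]
                   for k in range(n)])                       # (n, n_U)
    fy = np.array([[multivariate_normal.pdf(Y[k], m, c) for m, c in V_params]
                   for k in range(n)])                       # (n, n_V)
    # b_kij proportional to f_G(x_k|U_i) p(U_i) f_G(y_k|V_j) p(V_j|U_i)
    b = (fx * p_U)[:, :, None] * p_V_given_U[None, :, :] * fy[:, None, :]
    return b / b.sum(axis=(1, 2), keepdims=True)   # normalize per point k

def m_step_weights(b):
    """Two of the M-Step updates from Section 3.2.2."""
    # p-bar(U_i) = sum_{k,j} b_kij / sum_{k,h,j} b_khj
    p_U = b.sum(axis=(0, 2)) / b.sum()
    # p-bar(V_j | U_i) = sum_k b_kij / sum_{k,h} b_kih
    p_V_given_U = b.sum(axis=0)
    p_V_given_U /= p_V_given_U.sum(axis=1, keepdims=True)
    return p_U, p_V_given_U

# Tiny demo: n = 4 points, n_U = n_V = 2, all one-dimensional.
rng = np.random.default_rng(1)
X, Y = rng.normal(size=(4, 1)), rng.normal(size=(4, 1))
U_params = [(np.zeros(1), np.eye(1)), (np.ones(1), np.eye(1))]
V_params = [(np.zeros(1), np.eye(1)), (-np.ones(1), np.eye(1))]
b = e_step(X, Y, U_params, np.array([0.5, 0.5]), V_params,
           np.array([[0.7, 0.3], [0.4, 0.6]]))
```

The mean and covariance updates follow the same pattern: weight each $x_k$ (or $y_k$) by $b_{kij}$ summed over the opposite mixture index, then normalize by the same sum.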
The EM approach is known to be somewhat sensitive to converging to locally optimal solutions that are far from the globally optimal solution, particularly when it is used in conjunction with very complex optimization problems. While standard methods such as multiple runs with randomized seeds may help [15], the problem remains that the CAD problem is certainly complex as MLE problems go, since the task is to simultaneously learn two sets of Gaussians as well as a mapping function, and the problem is complicated even more by the fact that the function $f_{CAD}$ is only conditionally dependent upon the set of Gaussians corresponding to the distribution of the environmental attributes.

As a result, the second learning algorithm that we consider in this paper makes use of a two-step process. We begin by assuming that the number of Gaussians in U and V is identical; that is, $n_U = n_V$. Given this, a set of Gaussians is learned in the combined environmental and indicator attribute space. These Gaussians are then projected onto the environmental attributes and indicator attributes to obtain U and V. Then, only after U and V have been fixed, a simplified version of the Direct-CAD algorithm is used to compute only the values $p(V_j \mid U_i)$ that make up the mapping function. The benefit of this approach is that, by breaking the problem into two much easier optimization tasks, each solved by much simpler EM algorithms, we may be less vulnerable to the problem of local optima. The drawback of this approach is that we are no longer trying to maximize $f_{CAD}$ directly; we have made the simplifying assumption that it suffices to first learn the Gaussians and then learn the mapping function.

The GMM-CAD-Full Algorithm
1) Learn a set of $n_U$ $(d_U + d_V)$-dimensional Gaussians over the data set $\{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$. Call this set Z.
2) Let $\mu_{Z_i}$ refer to the centroid of the ith Gaussian in Z, and let $\Sigma_{Z_i}$ refer to the covariance matrix of the ith Gaussian in Z.
We then determine U as follows:
   For i = 1 to $n_U$ do:
      $p(U_i) := p(Z_i)$
      For j = 1 to $d_U$ do:
         $\mu_{U_i}[j] := \mu_{Z_i}[j]$
         For k = 1 to $d_U$ do:
            $\Sigma_{U_i}[j][k] := \Sigma_{Z_i}[j][k]$
3) Next, determine V as follows:
   For i = 1 to $n_V$ do:
      For j = 1 to $d_V$ do:
         $\mu_{V_i}[j] := \mu_{Z_i}[j + d_U]$
         For k = 1 to $d_V$ do:
            $\Sigma_{V_i}[j][k] := \Sigma_{Z_i}[j + d_U][k + d_U]$
4) Run a second EM algorithm to learn the mapping function. While the model continues to improve (measured by the improvement of $\Lambda$):
   a) Compute $b_{kij}$ for all k, i, and j as in the previous section.
   b) Compute $\bar{p}(V_j \mid U_i)$ as described in the previous section.
   c) Set $p(V_j \mid U_i) := \bar{p}(V_j \mid U_i)$.

The time required for the GMM-CAD-Full algorithm consists of two parts: the time to learn the Gaussians, which is $O(n n_U (d_U + d_V)^2)$, and the time to learn the mapping function by the simplified Direct-CAD algorithm, which is $O(n n_U^3)$. The memory required for the GMM-CAD-Full algorithm is $O(n n_U + n_U (d_U + d_V)^2)$.

3.4 The GMM-CAD-Split Algorithm

There is a third obvious way to learn the required model that is closely related to the previous one. Since the previous algorithm breaks up the optimization problem into two smaller problems, there is no longer any reason to think that the GMMs for U and V should be learned simultaneously. Learning them simultaneously might actually be overly restrictive, since it will try to learn U and V so that every Gaussian in U is "mated" to a Gaussian in V through a covariance structure. The CAD model has no such requirement, since the Gaussians of U and V are related only through the mapping function. In fact, the algorithm from the previous subsection throws out all of this additional covariance information, since $\Sigma_{Z_i}[j][k]$ is never used for ($j \le d_U$, $k > d_U$) or ($j > d_U$, $k \le d_U$). As a result, the third algorithm we consider learns U and V directly as two separate GMMs, and then learns the mapping function between them.
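The projection in Steps 2 and 3 of GMM-CAD-Full amounts to block slicing of the joint mean and covariance, and the discarded off-diagonal blocks are exactly the cross-covariance information just mentioned. A sketch under our own naming (`Z_mu`, `Z_cov` stand for one learned joint Gaussian):

```python
import numpy as np

def project_joint_gaussian(Z_mu, Z_cov, d_U):
    """Project a (d_U + d_V)-dimensional Gaussian onto the environmental
    and indicator subspaces by slicing its mean and covariance."""
    U_mu, U_cov = Z_mu[:d_U], Z_cov[:d_U, :d_U]
    V_mu, V_cov = Z_mu[d_U:], Z_cov[d_U:, d_U:]
    # The cross-covariance blocks Z_cov[:d_U, d_U:] and Z_cov[d_U:, :d_U]
    # are discarded, as observed in Section 3.4.
    return (U_mu, U_cov), (V_mu, V_cov)

Z_mu = np.array([1.0, 2.0, 3.0])          # d_U = 2, d_V = 1
Z_cov = np.array([[4.0, 1.0, 0.5],
                  [1.0, 3.0, 0.2],
                  [0.5, 0.2, 2.0]])
(U_mu, U_cov), (V_mu, V_cov) = project_joint_gaussian(Z_mu, Z_cov, d_U=2)
assert U_cov.shape == (2, 2) and V_cov.shape == (1, 1)
```

Slicing yields the exact marginal distributions of a multivariate Gaussian, which is why this projection is legitimate: the marginal of a joint Gaussian over a subset of coordinates is itself Gaussian with the corresponding sub-mean and sub-covariance.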
The algorithm is outlined below:

The GMM-CAD-Split Algorithm

1) Learn U and V by performing two separate EM optimizations.
2) Run Step 4 from the GMM-CAD-Full algorithm to learn the mapping function.

In the GMM-CAD-Split algorithm, learning the Gaussians in U and V requires O(n n_U (d_U^2 + d_V^2)) time. Learning the mapping function requires O(n n_U^3) time. The memory requirement for the algorithm is O(n n_U + n_U (d_U^2 + d_V^2)).

3.5 Choosing the Complexity of the Model

Whatever learning algorithm is used, it is necessary to choose as an input parameter the number of clusters or Gaussians in the model. In general, it is difficult to choose the optimal number of Gaussians in a GMM, and this is a research problem in its own right [16]. However, our problem is a bit easier because we are not actually trying to cluster the data for the sake of producing a clustering; rather, we are simply trying to build an accurate model for the data. In our experiments, we generally found that a larger number of clusters in the CAD model results in more accuracy. It is true that a very large number of clusters could cause overfitting, but in the CAD model, the computational cost associated with any model large enough to cause overfitting makes it virtually impossible to overfit a data set of reasonable size. Thus, our recommendation is to choose the largest model that can be comfortably computed. In our implementation, we use 40 clusters for both U and V.

4 BENCHMARKING

In this section, we describe a set of experiments aimed at testing the effectiveness of the CAD model. Our experiments are aimed at answering the following three questions:

1. Which of the three learning algorithms should be used to fit the CAD model to a data set?
2. Can use of the CAD model reduce the incidence of false positives due to unusual values for environmental attributes, compared to obvious alternatives for use in an anomaly detection system?
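The mapping-function update used in Step 2 above (Step 4 of GMM-CAD-Full) has a simple closed form once the responsibilities b_kij are available: sum over the data points, then normalize over j. A minimal sketch, with b supplied as a NumPy array (an assumption; the paper works with the quantities defined in Section 3 and derived in Appendix B):

```python
import numpy as np

def update_mapping(b):
    """M-step update of the mapping function p(V_j | U_i) from the
    E-step responsibilities b[k, i, j]: sum out the data points k,
    then normalize each row so it is a distribution over j."""
    s = b.sum(axis=0)                       # shape (n_U, n_V)
    return s / s.sum(axis=1, keepdims=True)
```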
3. If so, does the reduced incidence of false positives come at the expense of a lower detection level for actual anomalies?

4.1 Experimental Setup

Unfortunately, as with any unsupervised learning task, it can be very difficult to answer questions regarding the quality of the resulting models. Though that quality is hard to measure, we assert that, with a careful experimental design, it is possible to give a strong indication of the answer to our questions. Our experiments are based on the following key assumptions:

- Most of the points in a given data set are not anomalous, so a random sample of a data set's points should contain relatively few anomalies.
- If we take a random sample of a data set's points and perturb them by swapping various attribute values among them, then the resulting set of data points should contain a much higher fraction of anomalies than a simple random sample from the data set.

Given these assumptions, our experimental design is as follows. For each data set, the following protocol was repeated 10 times:

1. For a given data set, we begin by randomly designating 80 percent of the data points as training data and 20 percent of the data points as test data (this latter set will be referred to as testData).
2. Next, 20 percent of the points in the set testData are identified as anomalies or outliers using a standard, parametric test based on Gaussian mixture modeling: the data are modeled using a GMM, and new data points are ranked as potential outliers based on the probability density at the point in question. This basic method is embodied in more than 12 papers cited in Markou and Singh's survey [1]. However, a key aspect of this protocol is that outliers are chosen based only on the values of their environmental attributes (indicator attributes are ignored). We call this set outliers.
3. The set testData is then itself randomly partitioned into two identically sized subsets, perturbed and nonPerturbed, subject to the constraint that outliers ⊆ nonPerturbed.
4. Next, the members of the set perturbed are perturbed by swapping indicator attribute values among them.
5. Finally, we use the anomaly detection software to check for anomalies among testData = perturbed ∪ nonPerturbed.

Note that, after this protocol has been executed, members of the set perturbed are no longer samples from the original data distribution, while members of the set nonPerturbed still are. Thus, it should be relatively straightforward to use the resulting data sets to differentiate between a robust anomaly detection framework and one that is susceptible to high rates of false positives. Specifically, given this experimental setup, we expect that a useful anomaly detection mechanism would:

- Indicate that a relatively large fraction of the set perturbed are anomalies.
- Indicate that a much smaller fraction of the set nonPerturbed are anomalies.
- Indicate that a similarly small fraction of the set outliers are anomalies.

This last point bears some further discussion. Since these records are also samples from the original data distribution, the only difference between outliers and (nonPerturbed − outliers) is that the records in outliers have exceptional values for their environmental attributes. Given the definition of an environmental attribute, these should not be considered anomalous by a useful anomaly detection methodology. In fact, the ideal test result would show that the percentage of anomalies among outliers is identical to the percentage of anomalies among nonPerturbed. A high percentage of anomalies among the set outliers would be indicative of a tendency toward false alarms in the case of anomalous (but uninteresting) values for environmental attributes.
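The protocol above can be sketched end to end. In this sketch, env_score stands in for the GMM density over the environmental attributes (a hypothetical callable; lower density means more outlying), and all function and variable names are ours:

```python
import random

def split_protocol(data, env_score, seed=0):
    """Steps 1-3 of the evaluation protocol: 80/20 train/test split,
    flag the 20 percent of test points with the lowest environmental
    density as `outliers`, then split the test set into equal halves
    `perturbed` / `nonPerturbed` with outliers forced into the latter."""
    rng = random.Random(seed)
    idx = list(range(len(data)))
    rng.shuffle(idx)
    cut = int(0.8 * len(idx))
    train, test = idx[:cut], idx[cut:]
    # rank test points by environmental density, ascending
    ranked = sorted(test, key=lambda i: env_score(data[i]))
    outliers = set(ranked[: len(test) // 5])
    non_perturbed = set(outliers)           # outliers subset of nonPerturbed
    rest = [i for i in test if i not in outliers]
    rng.shuffle(rest)
    half = len(test) // 2
    for i in rest:
        if len(non_perturbed) < half:
            non_perturbed.add(i)
    perturbed = [i for i in test if i not in non_perturbed]
    return train, sorted(perturbed), sorted(non_perturbed), sorted(outliers)
```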
4.2 Creating the Set Perturbed

The testing protocol above requires that we be able to create a set of points, perturbed, with an expectedly high percentage of anomalies. An easy option would have been to generate perturbed data by adding noise. However, this can produce meaningless (or less meaningful) attribute values. We wanted to ensure that the perturbed data is structurally similar to the original distribution and domain values, so that recognizing members of this set is not trivial: the perturbed records share the characteristics of the training data (that is, the environmental attributes are realistic and the indicator attributes are realistic, but the relationship between them may not be). We achieve this through a natural perturbation scheme.

Let D be the set of data records that has been chosen from testData for perturbation. Consider a record z = (x, y) from D that we wish to perturb. Recall from Section 2.2 that x is a vector of environmental attribute values and y is a vector of indicator attribute values. To perturb z, we randomly choose k data points from D; in our tests, we use k = min(50, |D|/4). Let z' = (x', y') be the sampled record such that the Euclidean distance between y and y' is maximized over all k of our sampled data points. To perturb z, we simply create a new data point (x, y') and add this point to perturbed. This process is applied iteratively until all points in the set D have been used to produce a point in perturbed. Note that every set of environmental attribute values present in D is also present in perturbed after perturbation. Furthermore, every set of indicator attributes present in perturbed after perturbation is also present in D. What has been changed is only the relationship between the environmental and indicator attributes in the records of perturbed.
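The swap-based scheme can be sketched directly from the description above (a minimal sketch; the sampling details beyond k = min(50, |D|/4) are ours):

```python
import random

def perturb(D, k=None, seed=0):
    """For each record (x, y) in D, sample k records from D and replace
    y with the sampled indicator vector y' whose Euclidean distance
    from y is largest; the environmental part x is kept unchanged."""
    rng = random.Random(seed)
    if k is None:
        k = min(50, len(D) // 4)
    def dist2(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))
    out = []
    for x, y in D:
        sample = rng.sample(D, k)
        _, y_far = max(sample, key=lambda rec: dist2(rec[1], y))
        out.append((x, y_far))
    return out
```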
We note that swapping the indicator values may not always produce the desired anomalies (it is possible that some of the swaps result in perfectly typical data). However, swapping the values should yield a new data set with a larger fraction of anomalies than a sample from the original distribution, simply because the new data has not been sampled from the original distribution. Since some of the perturbed points may not be abnormal, even a perfect method should deem a small fraction of the perturbed set to be normal. Effectively, this creates a nonzero false negative rate, even for a "perfect" classifier. Despite this, the experiment is still a valid one, because each of the methods that we test faces this same handicap.

4.3 Data Sets Tested

We performed the tests outlined above over the following 13 data sets.(1) For each data set, the annotation (i × j) indicates that the data set had i environmental and j indicator attributes, which were chosen after a careful evaluation of the data semantics. A brief description of the environmental and indicator attributes for each data set is also given. The 13 data sets used in testing are:

1. Synthetic data set (50 × 50). The synthetic data set is created using the CAD model. Ten thousand data points were sampled from the generative model f. f has 10 Gaussians in U and V, respectively. The centroids of the Gaussians are random vectors whose entries are chosen on the interval (0.0, 1.0). The diagonal of the covariance matrix is set to 1/4 of the average distance between the centroids in each dimension. The weights of the Gaussians are evenly assigned, and the mapping function P(V_j|U_i) for a given value of i is a permutation of a geometric distribution.

2. Algae data set (11 × 6). The data is from the UCI KDD Archive. We removed the data records with missing values. The environmental attributes consist of the season, the river size, the fluid velocity, and some chemical concentrations.
The indicator attributes are the distributions of different kinds of algae in surface water.

(1) We have conducted an experiment on a prelabeled data set, as suggested; refer to Appendix A for the results.

3. Streamflow data set (205 × 100). The data set depicts the river flow levels (indicator) and precipitation and temperature (environmental) for California. We created this data set by concatenating the environmental and indicator data sets. The environmental data set is obtained from the National Climatic Data Center (NCDC). The indicator data set is obtained from the US Geological Survey (USGS).

4. ElNino data set (4 × 5). The data is from the UCI KDD Archive. It contains oceanographic and surface meteorological readings taken from a series of buoys positioned throughout the equatorial Pacific. The environmental attributes are the spatio-temporal information. The indicator attributes are the wind direction, relative humidity, and temperature at various locations in the Pacific Ocean.

5. Physics data set (669 × 70). The data is from KDD Cup 2003. For a large set of physics papers, we pick out frequency information for keywords (environmental) and a list of the most-referred-to articles (indicator). One data record, which represents one physics paper, has 0/1-valued attributes corresponding to the presence/absence of the keyword (or referred-to article), respectively.

6. Bodyfat data set (13 × 2). The data is from the CMU StatLib. It depicts body fat percentages (indicator) and physical characteristics (environmental) for a group of people.

7. Houses data set (8 × 1). The data is from the CMU StatLib. It depicts house prices (indicator) in California and related characteristics (environmental), such as median income, housing median age, total rooms, etc.

8. Boston data set (15 × 1). The data is from the CMU StatLib. We removed some attributes that are not related to the environmental-indicator relationship.
The data depicts the value of owner-occupied homes (indicator) and economic data (environmental) for Boston.

9. FCAT-math data set (14 × 12). The data is from the National Center for Education Statistics (NCES) and the Florida Information Resource Network (FIRN). We removed the data records with missing values. It depicts mathematics achievement test results (indicator) of grade 3 of year 2002 and regular elementary schools' characteristics (environmental) for Florida.

10. FCAT-reading data set (14 × 11). The data is also from NCES and FIRN. We processed the data similarly to the FCAT-math data set. It depicts the reading achievement test scores (indicator) and regular elementary schools' characteristics (environmental) for Florida.

11. FLFarms data set (114 × 52). The data set depicts the Florida state farms' market value in various aspects (indicator) and the farms' operational and products statistics (environmental).

12. CAFarms data set (115 × 51). The data set depicts the California state farms' market value in various aspects (indicator) and the farms' operational and products statistics (environmental).

13. CAPeaks data set (2 × 1). The data set depicts the California peaks' height (indicator) and longitude and latitude position (environmental).

TABLE 1: Head-to-Head Winner When Comparing Recall/Precision for Identifying Perturbed Data as Anomalous (Each Cell Has Winner, Wins Out of 13 Data Sets, Bootstrapped p-Value)

TABLE 2: Head-to-Head Winner When Comparing Recall/Precision for Identifying Nonperturbed Outlier Data as Nonanomalous (Each Cell Has Winner, Wins Out of 13 Data Sets, Bootstrapped p-Value)

4.4 Experimental Results

We performed the tests described above over six alternatives for anomaly detection:

1. Simple Gaussian mixture modeling (as described at the beginning of this section). In this method, for each data set, a GMM model is learned directly over the training data.
Next, the set testData is processed, and anomalies are determined based on the value of the function f_GMM for each record in testData; those with the smallest f_GMM values are considered anomalous.

2. kth-NN (kth nearest neighbor) outlier detection [9] (with k = 5). kth-NN outlier detection is one of the most well-known methods for outlier detection. To make it applicable to the training/testing framework, the method must be modified slightly, since each test data point must be scored according to the extent to which it deviates from the training set. Given a training data set, to determine the degree to which a test point is anomalous, we simply use the distance from the point to its fifth nearest neighbor in the training set. A larger distance indicates a more anomalous point.

3. Local outlier factor (LOF) anomaly detection [8]. LOF must be modified in a manner similar to kth-NN outlier detection: the LOF of each test point is computed with respect to the training data set in order to score the point.

4. Conditional anomaly detection, with the model constructed using the Direct-CAD algorithm.

5. Conditional anomaly detection, with the model constructed using the GMM-CAD-Full algorithm.

6. Conditional anomaly detection, with the model constructed using the GMM-CAD-Split algorithm.

For each experiment over each data set with the GMM/CAD methods, a data record was considered an anomaly if the probability density at the data point was less than the median probability density over all of the test data points. For 5th-NN/LOF, the record was considered an anomaly if its distance (or score) was greater than the median over all test data points. The median was used because exactly one-half of the points in the set testData were perturbed.
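The modified kth-NN scoring and the median cutoff can be sketched as follows (a minimal sketch; the function names are ours):

```python
def kth_nn_score(train, point, k=5):
    """Anomaly score: Euclidean distance from `point` to its kth
    nearest neighbor in the training set (larger = more anomalous)."""
    dists = sorted(
        sum((a - b) ** 2 for a, b in zip(t, point)) ** 0.5 for t in train
    )
    return dists[k - 1]

def flag_by_median(scores):
    """Flag the points whose score exceeds the median score, mirroring
    the cutoff used when exactly half of the test set is perturbed."""
    s = sorted(scores)
    n = len(s)
    median = s[n // 2] if n % 2 else 0.5 * (s[n // 2 - 1] + s[n // 2])
    return [sc > median for sc in scores]
```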
Thus, if a method were able to order the data points so that all anomalies came before all nonanomalies, choosing the median density would result in a recall(2) rate of 100 percent and a false positive rate of 0 percent (a false positive rate of 0 percent is equivalent to a precision(3) of 100 percent). Also, note that, since exactly one-half of the points are perturbed and each method is asked to guess which half, the recall is always exactly equal to the precision when identifying either anomalous or nonanomalous records in this experiment.

The results are summarized in Tables 1 and 2. Table 1 summarizes and compares the quality of the recall/precision obtained when identifying perturbed points in testData. Each cell indicates which of the two methods compared in the cell was found to be the "winner" in terms of having superior recall/precision over the 13 data sets. To compute the "winner," we do the following: for each of the 13 data sets and each pair of detection methods, the recall/precision of the two methods is averaged over the 10 different training/testing partitionings of the data set. Let method1 and method2 be the two anomaly detection methods considered in a cell. method1 is said to "win" the test if it has a higher average recall/precision on seven or more of the 13 data sets; likewise, method2 is declared the winner if it has a higher recall/precision on seven or more of the data sets. In addition, we also give the number of the 13 data sets for which the winning method performed better, and we give a p-value that indicates the significance of the result.

(2) Recall in this context is the fraction of perturbed points that are flagged as anomalies.
(3) Precision in this context is the fraction of flagged anomalies that are actually perturbed points.

Finally, the second column in Table 1 gives the average recall/precision over all experiments for each method.
This is the average percentage of perturbed points that have probability density less than the median (for GMM and CAD) or have a score/distance greater than the median (LOF and 5th-NN). Though these numbers are informative, we caution that the number of head-to-head wins is probably a better way to compare two methods. The reason is that the recall/precision can vary significantly from data set to data set, so the average may tend to overweight those data sets that are particularly difficult.

The p-value addresses the following question. Imagine that the whole experiment were repeated with a new, arbitrary set of 13 data sets chosen at random, performing 10 different training/testing partitionings of each data set. The p-value gives the probability that we would obtain a different winner than was observed in our experiments due to choosing different data sets and partitionings. A low p-value means that it is unlikely that the observed winner was due mostly to chance, so the results are significant. A high p-value means that, if the experiment were repeated with a new set of data sets, the result might be different. The p-value was computed using a bootstrap resampling procedure [17].

Table 2 is similar to Table 1, but rather than comparing the recall/precision with which the methods identify the perturbed points, it summarizes the recall/precision in identifying the points in the set outliers as nonanomalous. Recall that the outliers are points with exceptional values for their environmental attributes, but which are nonperturbed, so they were actually drawn from the training distribution. Thus, this table gives an indication of how successful the various methods are at ignoring anomalous environmental attributes when the points were, in fact, sampled from the baseline (or training) distribution. Just as in Table 1, each cell gives the winner, the number of trials won, and the associated p-value of the experiment.
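The bootstrap test can be sketched as follows (a sketch under our reading of the procedure: resample the 13 per-data-set outcomes with replacement and estimate how often the overall winner flips; [17] describes the general technique):

```python
import random

def bootstrap_pvalue(wins, n_boot=10000, seed=0):
    """`wins` holds one entry per data set: +1 if method1 had the higher
    average recall/precision, -1 otherwise. Returns the estimated
    probability that a redrawn set of data sets would yield a different
    overall winner than the one observed."""
    rng = random.Random(seed)
    observed = 1 if sum(wins) > 0 else -1
    flips = 0
    for _ in range(n_boot):
        resample = [rng.choice(wins) for _ in wins]
        if (1 if sum(resample) > 0 else -1) != observed:
            flips += 1
    return flips / n_boot
```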
The second column gives the average precision/recall. This is the percentage of nonperturbed points with exceptional environmental attributes that were considered to be nonanomalous.

4.5 Discussion

Our experiments were designed to answer the following question: if a human expert identifies a set of environmental attributes that should not be directly indicative of an anomaly, can any of the methods successfully ignore exceptional values for those attributes while at the same time taking them into account when performing anomaly detection?

The first part of the question is addressed by the results given in Table 2. This table shows that, if the goal is to ignore anomalous values among the environmental attributes, then the GMM-CAD-Full algorithm is likely the best choice. This method declared the lowest percentage of data points with exceptional environmental attribute values to be anomalous. In every head-to-head comparison, it did a better job on at least 77 percent of the data sets (10 out of 13).

The second part of the question is addressed by the results given in Table 1. This table shows that, if the goal is not only to ignore exceptional environmental attributes, but also to take the normal correlation between the environmental and indicator attributes into account, then the GMM-CAD-Full algorithm is again the best choice. Table 1 shows that, compared to all of the other options, this method was best able to spot cases where the indicator attributes were not in keeping with the environmental attributes because they had been swapped.

Finally, it is also important to point out that no method was uniformly preferable to any other method on each and every data set tested. It is probably unrealistic to expect that any method would be. For example, we found that the CAD-based methods generally did better than the two distance-based methods (and the GMM method) on higher-dimensional data.
The results were most striking for the thousand-dimensional Physics paper data set, where the precision/recall for LOF and 5th-NN when identifying perturbed data was worse than a randomized labeling would have achieved (less than 50 percent). In this case, each of the CAD methods had greater than 90 percent precision/recall. For lower-dimensional data, the results were more mixed. Out of all of the head-to-head comparisons reported in Tables 1 and 2, there is only one case where a method won 13 out of 13 times (GMM-CAD-Full versus LOF in terms of the ability to ignore anomalous environmental attributes). Thus, our results do not imply that any one method will do better than any other on an arbitrary data set. What the results do imply is that, if the goal is to choose the method that is most likely to do better on an arbitrary data set, then the GMM-CAD-Full algorithm is the best choice.

5 RELATED WORK

Anomaly and outlier detection have been widely studied, and the breadth and depth of the existing research in the area preclude a detailed summary here. An excellent survey of so-called "statistical" approaches to anomaly detection can be found in Markou and Singh's 2003 survey [1]. In this context, "statistical" refers to methods that try to identify the underlying generative distribution for the data, and then identify points that do not seem to belong to that distribution. Markou and Singh give references to around a dozen papers that, like the CAD methodology, rely in some capacity on Gaussian mixture modeling to represent the underlying data distribution. However, unlike the CAD methodology, nearly all of those papers implicitly assume that all attributes should be treated equally during the detection process.

The specific application area of intrusion detection is the topic of literally hundreds of papers related to anomaly detection. Unfortunately, space precludes a detailed survey of methods for intrusion detection here.
One key difference between much of this work and our own proposal is specificity: most methods for intrusion detection are tailored to that problem and are not targeted toward the generic problem of anomaly detection, as the CAD methodology is. An interesting 2003 study by Lazarevic et al. [18] considers the application of generic anomaly detection methods to the specific problem of intrusion detection. The authors test several methods and conclude that LOF [8] (tested in this paper) does a particularly good job in that domain.

While most general-purpose methods assume that all attributes are equally important, there are some exceptions. Aggarwal and Yu [19] describe a method for identifying outliers in subspaces of the entire data space. Their method identifies combinations of attributes where outliers are particularly obvious, the idea being that a large number of dimensions can obscure outliers by introducing noise (the so-called "curse of dimensionality" [20]). By considering subspaces, this effect can be mitigated. However, Aggarwal and Yu make no a priori distinctions among the types of attribute values using domain knowledge, as is done in the CAD methodology.

One body of work on anomaly detection originating in statistics that does make an a priori distinction among different attribute types, in much the same way as the CAD methodology, is the work on spatial scan statistics [21], [22], [23]. In this work, data are aggregated based upon their spatial location, and the spatial distribution is then scanned for sparse or dense areas. Spatial attributes are treated in a fashion analogous to environmental attributes in the CAD methodology, in the sense that they are not taken as direct evidence of an anomaly; rather, the expected sparsity or density is conditioned on the spatial attributes.
However, such statistics are limited in that they cannot easily be extended to multiple indicator attributes (the indicator attribute is assumed to be a single count), nor are they meant to handle the sparsity and computational complexity that accompany large numbers of environmental attributes.

The existing work that is probably closest in spirit to our own is the paper on anomaly detection for finding disease outbreaks by Wong et al. [24]. Their work makes use of a Bayes net as a baseline to describe possible environmental structures for a large number of input variables. However, because their method relies on a Bayes net, certain limitations are inherent to the approach. The most significant is that continuous phenomena are hard to describe in this framework; among other issues, this makes it very difficult to describe temporal patterns such as time lags and periodicity. Furthermore, a Bayes net implicitly assumes that even complex phenomena can be described in terms of interactions between a few environmental attributes, and that there are no hidden variables unseen in the data; neither assumption holds in real situations, where many hidden attributes and missing values are commonplace. A Bayes net cannot learn two different explanations for an observed phenomenon and learn to recognize both, for it cannot recognize unobserved variables. Such hidden variables are precisely the type of information captured by the use of multiple Gaussian clusters in our model: each Gaussian corresponds to a different hidden state.

Finally, we mention that traditional prediction models, such as linear or nonlinear regression models [25], [26], could also potentially be used to derive the cause-effect interrelationships that the CAD method uses.
For example, for a given set of environmental attribute values, a trained regression model will predict the indicator attribute value; by comparing the prediction with the actual indicator attribute value, we can determine whether the test point is anomalous. However, such regression methods have important limitations compared to the CAD model. First, in regression there is typically one response variable (indicator variable) and multiple predictor variables (environmental variables), while the CAD method allows an arbitrary number of indicator attributes. Second, it is unclear how a regression-based methodology could be used to rank the interest of data points based on their indicator attributes. Distance from the predicted value could be used, but this seems somewhat arbitrary, since the natural variation may change depending on the indicator attribute values.

Fig. 3. Precision rate on the KDD Cup 1999 data set by 10-fold cross-validation. The minimum and maximum precision rates over 10 runs are depicted as circles, and the overall precision is depicted as a star.

6 CONCLUSIONS AND FUTURE WORK

In this paper, we have described a new approach to anomaly detection. We take a rigorous, statistical approach in which the various data attributes are partitioned into environmental and indicator attributes, and a new observation is considered anomalous if its indicator attributes cannot be explained in the context of the environmental attributes.

There are many avenues for future work. For example, we have considered only one way to take into account a very simple type of domain knowledge to boost the accuracy of anomaly detection; how might more complex domain knowledge boost accuracy even more? As another example, an important issue that we have not addressed is scalability during the construction of the model.
While the model can always be used quickly and efficiently to check for new anomalies, training the model on a multigigabyte database may be slow without a modification of our algorithms. One way to address this may be to use a variation of the techniques of Bradley et al. [27] for scaling standard Gaussian EM to the clustering of very large databases.

APPENDIX A
EXPERIMENT ON THE KDD CUP 1999 DATA SET

One of the anonymous referees for this paper suggested that we apply our algorithms to the KDD Cup 1999 data set. Unlike the 13 data sets described in the paper, this data set is labeled, so we know whether a particular test record is actually anomalous. After partitioning the attributes into causal and result attributes, we used a 10-fold cross-validation test to compare the precision of the different detection methods on this data set; error bars showing the low and high accuracy of each method, as well as the mean, are provided in Fig. 3. The results show that this particular task is rather easy, so it is difficult to draw any additional conclusions from the results. All of the methods except for LOF perform very well, with almost perfect overlap among the error bars.
APPENDIX B
EM DERIVATIONS

B.1 MLE Goal

The goal is to maximize the objective function

$$\Lambda = \log \prod_{k=1}^{n} f_{\mathrm{CAD}}(y_k \mid x_k) = \sum_{k=1}^{n} \log\Bigg\{ \sum_{i=1}^{n_U} p(x_k \in U_i) \sum_{j=1}^{n_V} f_G(y_k \mid V_j)\, p(V_j \mid U_i) \Bigg\}, \tag{1}$$

where

$$f_{\mathrm{CAD}}(y_k \mid x_k) = \sum_{i=1}^{n_U} p(x_k \in U_i) \sum_{j=1}^{n_V} f_G(y_k \mid V_j)\, p(V_j \mid U_i). \tag{2}$$

B.2 Expectation Step

The expectation of the objective function is

$$Q(\Theta, \bar\Theta) = E\big[\Lambda(X, Y, \Gamma, \Delta \mid \Theta) \mid X, Y, \bar\Theta\big] = E\Bigg[\sum_{k=1}^{n} \log f_{\mathrm{CAD}}(y_k, \gamma_k, \delta_k \mid x_k, \Theta) \,\Big|\, X, Y, \bar\Theta\Bigg] = \sum_{k=1}^{n} \sum_{i=1}^{n_U} \sum_{j=1}^{n_V} \Big\{ \log f_{\mathrm{CAD}}\big(y_k, \gamma_k^{(i)}, \delta_k^{(j)} \mid x_k, \Theta\big) \cdot f\big(\gamma_k^{(i)}, \delta_k^{(j)} \mid x_k, y_k, \bar\Theta\big) \Big\}. \tag{3}$$

Now, we derive the two subexpressions in (3). First,

$$f_{\mathrm{CAD}}\big(y_k, \gamma_k^{(i)}, \delta_k^{(j)} \mid x_k, \Theta\big) = f\big(y_k, \delta_k^{(j)} \mid x_k, \gamma_k^{(i)}, \Theta\big)\, f\big(\gamma_k^{(i)} \mid x_k, \Theta\big) = f\big(y_k, \delta_k^{(j)} \mid x_k, \gamma_k^{(i)}, \Theta\big)\, \frac{f\big(x_k, \gamma_k^{(i)} \mid \Theta\big)}{f(x_k \mid \Theta)}. \tag{4}$$

Since

$$f\big(x_k, \gamma_k^{(i)} \mid \Theta\big) = f(x_k, U_i \mid \Theta) = f_G(x_k \mid U_i)\, p(U_i) \tag{5}$$

and

$$f\big(y_k, \delta_k^{(j)} \mid x_k, \gamma_k^{(i)}, \Theta\big) = f_G(y_k \mid V_j)\, p(V_j \mid U_i), \tag{6}$$

and since $p(x_k \in U_t, \Theta)$ is the membership of $x_k$ in cluster $U_t$, so that

$$p(x_k \in U_t, \Theta) = \frac{f_G(x_k \mid U_t)\, p(U_t)}{f(x_k \mid \Theta)} = \frac{f_G(x_k \mid U_t)\, p(U_t)}{\sum_{i=1}^{n_U} f_G(x_k \mid U_i)\, p(U_i)}, \tag{7}$$

inserting (5), (6), and (7) into (4), we get

$$f_{\mathrm{CAD}}\big(y_k, \gamma_k^{(i)}, \delta_k^{(j)} \mid x_k, \Theta\big) = f_G(y_k \mid V_j) \cdot p(V_j \mid U_i, \Theta) \cdot \frac{f_G(x_k \mid U_i)\, p(U_i)}{f(x_k \mid \Theta)}. \tag{8}$$

The remaining subexpression in the $Q$ function is the posterior

$$f\big(\gamma_k^{(i)}, \delta_k^{(j)} \mid x_k, y_k, \bar\Theta\big) = \frac{f\big(x_k, \gamma_k^{(i)} \mid \bar\Theta\big)\, f\big(y_k, \delta_k^{(j)} \mid x_k, \gamma_k^{(i)}, \bar\Theta\big)}{f(x_k \mid \bar\Theta)\, f_{\mathrm{CAD}}(y_k \mid x_k, \bar\Theta)}, \tag{9}$$

where

$$f(x_k \mid \bar\Theta) = \sum_{t=1}^{n_U} f_G(x_k \mid U_t)\, p(U_t). \tag{10}$$

Expanding (9) using (2), (5), (6), (7), and (10), and denoting the result by $b_{kij}$ for simplicity,

$$b_{kij} = \frac{f_G(x_k \mid U_i)\, p(U_i)\, f_G(y_k \mid V_j)\, p(V_j \mid U_i, \bar\Theta)}{\sum_{t=1}^{n_U} \sum_{h=1}^{n_V} \big\{ f_G(x_k \mid U_t)\, p(U_t)\, f_G(y_k \mid V_h)\, p(V_h \mid U_t, \bar\Theta) \big\}}. \tag{11}$$

With the new notation $b_{kij}$, the $Q$ function in (3) becomes

$$Q(\Theta, \bar\Theta) = \sum_{k=1}^{n} \sum_{i=1}^{n_U} \sum_{j=1}^{n_V} \Big\{ \log f_{\mathrm{CAD}}\big(y_k, \gamma_k^{(i)}, \delta_k^{(j)} \mid x_k, \Theta\big) \cdot b_{kij} \Big\}. \tag{12}$$

Inserting (8) into (12), the $Q$ function becomes

$$Q(\Theta, \bar\Theta) = \sum_{k=1}^{n} \sum_{i=1}^{n_U} \sum_{j=1}^{n_V} b_{kij} \Big[ \log f_G(y_k \mid V_j) + \log p(V_j \mid U_i, \Theta) + \log f_G(x_k \mid U_i) + \log p(U_i) - \log f(x_k \mid \Theta) \Big]. \tag{13}$$

The term $\log f(x_k \mid \Theta)$ is hard to differentiate. We can play an approximation "trick" by using $f(x_k \mid \bar\Theta)$ to approximate $f(x_k \mid \Theta)$. To make the maximization target clear, we can rewrite (13) as

$$Q'(\Theta, \bar\Theta) = \sum_{k=1}^{n} \sum_{i=1}^{n_U} \sum_{j=1}^{n_V} b_{kij} \log f_G(y_k \mid V_j) + \sum_{k=1}^{n} \sum_{i=1}^{n_U} \sum_{j=1}^{n_V} b_{kij} \log p(V_j \mid U_i) + \sum_{k=1}^{n} \sum_{i=1}^{n_U} \sum_{j=1}^{n_V} b_{kij} \log f_G(x_k \mid U_i) + \sum_{k=1}^{n} \sum_{i=1}^{n_U} \sum_{j=1}^{n_V} b_{kij} \log p(U_i) - \sum_{k=1}^{n} \sum_{i=1}^{n_U} \sum_{j=1}^{n_V} b_{kij} \log f(x_k \mid \bar\Theta). \tag{14}$$

We will perform the maximization of $Q'$ by maximizing each of the first four terms in (14); the last term does not depend on $\Thetata$ under the approximation and is therefore not counted during maximization.

B.3 Maximization Step

We will make use of the following linear-algebra identities, valid when $A$ is a symmetric matrix:

$$\nabla_b (b^T A b) = 2Ab, \qquad \nabla_A (b^T A b) = b b^T, \qquad \nabla_A |A| = A^{-1} |A|.$$

1. First, we want to maximize

$$\sum_{k=1}^{n} \sum_{i=1}^{n_U} \sum_{j=1}^{n_V} b_{kij} \log p(U_i) \tag{15}$$

with respect to $p(U_i)$, subject to the constraint

$$\sum_{i=1}^{n_U} p(U_i) = 1. \tag{16}$$

Through the method of Lagrange multipliers, maximizing (15) is the same as maximizing

$$h = \sum_{k=1}^{n} \sum_{i=1}^{n_U} \sum_{j=1}^{n_V} b_{kij} \log p(U_i) + \lambda \Bigg( \sum_{i=1}^{n_U} p(U_i) - 1 \Bigg). \tag{17}$$

Taking the derivative of (17) with respect to $p(U_i)$ and setting it to zero,

$$\frac{\partial h}{\partial p(U_i)} = \frac{\sum_{k=1}^{n} \sum_{j=1}^{n_V} b_{kij}}{p(U_i)} + \lambda = 0. \tag{18}$$

We now sum (18) over all $i$ and solve for $\lambda$:

$$\lambda = -\sum_{k=1}^{n} \sum_{h=1}^{n_U} \sum_{j=1}^{n_V} b_{khj}. \tag{19}$$

Combining (18) and (19) and solving for $p(U_i)$,

$$p(U_i) = \frac{\sum_{k=1}^{n} \sum_{j=1}^{n_V} b_{kij}}{\sum_{k=1}^{n} \sum_{h=1}^{n_U} \sum_{j=1}^{n_V} b_{khj}}, \qquad i \in \{1, 2, \ldots, n_U\}. \tag{20}$$

2. Next, we want to maximize

$$\sum_{k=1}^{n} \sum_{i=1}^{n_U} \sum_{j=1}^{n_V} b_{kij} \log f_G(y_k \mid V_j, \mu_{V_j}, \Sigma_{V_j}) = \sum_{k=1}^{n} \sum_{i=1}^{n_U} \sum_{j=1}^{n_V} b_{kij} \log \frac{\exp\big[ -\tfrac{1}{2} (y_k - \mu_{V_j})^T \Sigma_{V_j}^{-1} (y_k - \mu_{V_j}) \big]}{(2\pi)^{d_V/2}\, |\Sigma_{V_j}|^{1/2}} \tag{21}$$

with respect to $\mu_{V_j}$ and $\Sigma_{V_j}$. Taking the derivative of (21) with respect to $\mu_{V_j}$ and setting it to zero,

$$\sum_{k=1}^{n} \sum_{i=1}^{n_U} b_{kij}\, \Sigma_{V_j}^{-1} (y_k - \mu_{V_j}) = 0. \tag{22}$$

Solving (22) for $\mu_{V_j}$, $j \in \{1, 2, \ldots, n_V\}$,

$$\mu_{V_j} = \sum_{k=1}^{n} \sum_{i=1}^{n_U} b_{kij}\, y_k \Big/ \sum_{k=1}^{n} \sum_{i=1}^{n_U} b_{kij}. \tag{23}$$

Now, taking the derivative of (21) with respect to $\Sigma_{V_j}^{-1}$ (using the identities above) and setting it to zero,

$$\sum_{k=1}^{n} \sum_{i=1}^{n_U} b_{kij} \Big[ \Sigma_{V_j} - (y_k - \mu_{V_j})(y_k - \mu_{V_j})^T \Big] = 0. \tag{24}$$

Solving (24) for $\Sigma_{V_j}$, $j \in \{1, 2, \ldots, n_V\}$,

$$\Sigma_{V_j} = \sum_{k=1}^{n} \sum_{i=1}^{n_U} b_{kij}\, (y_k - \mu_{V_j})(y_k - \mu_{V_j})^T \Big/ \sum_{k=1}^{n} \sum_{i=1}^{n_U} b_{kij}. \tag{25}$$

We can maximize $\sum_{k=1}^{n} \sum_{i=1}^{n_U} \sum_{j=1}^{n_V} b_{kij} \log f_G(x_k \mid U_i)$ in a similar way and obtain the update rules for $U_i = (\mu_{U_i}, \Sigma_{U_i})$, $i \in \{1, 2, \ldots, n_U\}$:

$$\mu_{U_i} = \sum_{k=1}^{n} \sum_{j=1}^{n_V} b_{kij}\, x_k \Big/ \sum_{k=1}^{n} \sum_{j=1}^{n_V} b_{kij}, \tag{26}$$

$$\Sigma_{U_i} = \sum_{k=1}^{n} \sum_{j=1}^{n_V} b_{kij}\, (x_k - \mu_{U_i})(x_k - \mu_{U_i})^T \Big/ \sum_{k=1}^{n} \sum_{j=1}^{n_V} b_{kij}. \tag{27}$$

3. Now, we want to maximize

$$\sum_{k=1}^{n} \sum_{i=1}^{n_U} \sum_{j=1}^{n_V} b_{kij} \log p(V_j \mid U_i) \tag{28}$$

with regard to $p(V_j \mid U_i)$. We can set a constraint on the $p(V_j \mid U_i)$:

$$\sum_{j=1}^{n_V} p(V_j \mid U_i) = 1, \qquad i \in \{1, 2, \ldots, n_U\}. \tag{29}$$

Again, by the method of Lagrange multipliers, maximizing (28) is the same as maximizing

$$h = \sum_{k=1}^{n} \sum_{i=1}^{n_U} \sum_{j=1}^{n_V} b_{kij} \log p(V_j \mid U_i) + \sum_{i=1}^{n_U} \lambda_i \Bigg( \sum_{j=1}^{n_V} p(V_j \mid U_i) - 1 \Bigg), \qquad i \in \{1, 2, \ldots, n_U\}. \tag{30}$$
; nU g: j¼1 Taking the derivative of (30) with regard to pðVj jUi Þ and setting it to zero, @h @pðVj jUi Þ n X bkij þ i ¼ 0; pðV j jUi Þ k¼1 n X bkij þ i pðVj jUi Þ ¼ 0: ¼ [9] [12] [13] ð31Þ [14] k¼1 We now sum (31) over all j and solve for i , " # nV X n X bkih þ i pðVh jUi Þ ¼ 0; h¼1 [15] ð32Þ [16] k¼1 [17] Pn PnV Pk¼1 nV h¼1 h¼1 bkih nV n X X ð33Þ [18] Finally, combining (31) and (33) and solving for pðVj jUi Þ; i 2 f1; 2; . . . ; nU g; j 2 f1; 2; . . . ; nV g, we have [19] i ¼  pðVh jUi Þ pðVj jUi Þ ¼ n X k¼1 ¼ bkij = bkih : k¼1 h¼1 nV n X X [20] bkih : ð34Þ k¼1 h¼1 [21] [22] ACKNOWLEDGMENTS This material is based upon work supported by the US National Science Foundation under Grant No. ACI-0325459, IIS-0347408, and IIS-0612170. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the US National Science Foundation. REFERENCES [1] [2] [3] [4] [5] M. Markou and S. Singh, “Novelty Detection: A Review Part 1: Statistical Approaches,” Signal Processing, vol. 83, pp. 2481-2497, Dec. 2003. W.-K. Wong, A. Moore, G. Cooper, and M. Wagner, “Rule-Based Anomaly Pattern Detection for Detecting Disease Outbreaks,” Proc. Nat’l Conf. Artificial Intelligence (AAAI ’02), pp. 217-223, Aug. 2002. A. Adam, E. Rivlin, and I. Shimshoni, “Ror: Rejection of Outliers by Rotations in Stereo Matching,” Proc. Conf. Computer Vision and Pattern Recognition (CVPR ’00), pp. 1002-1009, June 2000. F. de la Torre and M.J. Black, “Robust Principal Component Analysis for Computer Vision,” Proc. Eighth Int’l Conf. Computer Vision (ICCV ’01), pp. 362-369, 2001. A. Lazarevic, L. Ertöz, V. Kumar, A. Ozgur, and J. Srivastava, “A Comparative Study of Anomaly Detection Schemes in Network Intrusion Detection,” Proc. Third SIAM Int’l Conf. Data Mining, 2003. [23] [24] [25] [26] [27] VOL. 19, NO. 5, MAY 2007 Y. Zhang and W. Lee, “Intrusion Detection in Wireless Ad-Hoc Networks,” Proc. MobiCom Conf., pp. 
275-283, 2000. E.M. Knorr, R.T. Ng, and V. Tucakov, “Distance-Based Outliers: Algorithms and Applications,” VLDB J., vol. 8, nos. 3-4, pp. 237253, Feb. 2000. M.M. Breunig, H.-P. Kriegel, R.T. Ng, and J. Sander, “LOF: Identifying Density-Based Local Outliers,” Proc. ACM SIGMOD Conf., pp. 93-104, May 2000. S. Ramaswamy, R. Rastogi, and K. Shim, “Efficient Algorithms for Mining Outliers from Large Data Sets,” Proc. ACM SIGMOD Conf., pp. 427-438, May 2000. E. Eskin, “Anomaly Detection over Noisy Data Using Learned Probability Distributions,” Proc. Int’l Conf. Machine Learning, pp. 255-262, June 2000. K. Yamanishi, J. ichi Takeuchi, G.J. Williams, and P. Milne, “OnLine Unsupervised Outlier Detection Using Finite Mixtures with Discounting Learning Algorithms,” Data Mining and Knowledge Discovery, vol. 8, no. 3, pp. 275-300, May 2004. S. Roberts and L. Tarassenko, “A Probabilistic Resource Allocating Network for Novelty Detection,” Neural Computation, vol. 6, pp. 270-284, Mar. 1994. A. Dempster, N. Laird, and D. Rubin, “Maximum Likelihood from Incomplete Data via the EM Algorithm,” J. Royal Statistical Soc., Series B, vol. 39, no. 1, pp. 1-38, 1977. J. Bilmes, “A Gentle Tutorial on the EM Algorithm and Its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models,” Technical Report ICSI-TR-97-021, Univ. of California at Berkeley, 1997. D.M. Chickering and D. Heckerman, “Efficient Approximations for the Marginal Likelihood of Bayesian Networks with Hidden Variables,” Machine Learning, vol. 29, nos. 2-3, pp. 181-212, 1997. M.A.T. Figueiredo and A.K. Jain, “Unsupervised Learning of Finite Mixture Models,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 24, no. 3, pp. 381-396, Mar. 2002. B. Efron and R.J. Tibshirani, An Introduction to the Bootstrap. Chapman and Hall, 1993. A. Lazarevic, L. Ertöz, V. Kumar, A. Ozgur, and J. Srivastava, “A Comparative Study of Anomaly Detection Schemes in Network Intrusion Detection,” Proc. SIAM Conf. 
Data Mining, 2003. C.C. Aggarwal and P.S. Yu, “Outlier Detection for High Dimensional Data,” Proc. ACM SIGMOD Conf., pp. 37-46, May 2001. K.S. Beyer, J. Goldstein, R. Ramakrishnan, and U. Shaft, “When Is ‘Nearest Neighbor’ Meaningful?” Proc. Int’l Conf. Database Theory, pp. 217-235, Jan. 1999. M. Kulldorff, “A Spatial Scan Statistic,” Comm. in Statistics: Theory and Methods, vol. 26, no. 6, pp. 1481-1496, 1997. D.B. Neill and A.W. Moore, “Rapid Detection of Significant Spatial Clusters,” Proc. 2004 ACM SIGKDD Int’l Conf. Knowledge Discovery and Data Mining, pp. 256-265, 2004. M. Kulldorff, “Spatial Scan Statistics: Models, Calculations, and Applications,” Scan Statistics and Applications, pp. 303-322, 1999. W.-K. Wong, A. Moore, G. Cooper, and M. Wagner, “Bayesian Network Anomaly Pattern Detection for Disease Outbreaks,” Proc. Int’l Conf. Machine Learning, pp. 808-815, Aug. 2003. L. Breiman, R.A. Olshen, J.H. Friedman, and C.J. Stone, Classification and Regression Trees. Chapman and Hall, 1984. J.R. Quinlan, “Learning with Continuous Classes,” Proc. Fifth Australian Joint Conf. Artificial Intelligence, pp. 343-348, 1992. P.S. Bradley, U.M. Fayyad, and C. Reina, “Scaling Clustering Algorithms to Large Databases,” Knowledge Discovery and Data Mining, pp. 9-15, 1998. Xiuyao Song received the BS and MS degrees from the Computer Science Department at the Huazhong University of Science and Technology, China, in 1998 and 2001, respectively. She is working toward the PhD degree in the Computer and Information Sciences and Engineering Department at the University of Florida. She spent a year as a project management specialist at Lucent Technologies (China) Co., Ltd. She began her PhD program in September 2002 and her research interest is data mining, more specifically, anomaly detection in large databases. SONG ET AL.: CONDITIONAL ANOMALY DETECTION Mingxi Wu received the BS degree in computer science from Fudan University in 2000. 
After that, he worked at Microsoft for two years as a global technical support engineer. He is a PhD student in the Computer and Information Science and Engineering Department at the University of Florida (UF). His current research includes outlier/anomaly detection in high-dimensional data sets and top-k related database query processing.

Christopher Jermaine received the PhD degree from the Georgia Institute of Technology. He is an assistant professor in the Computer and Information Science and Engineering Department at the University of Florida, Gainesville. His research interests are in the areas of databases, data mining, and applied statistics, with an emphasis on data analysis and randomized algorithms.

Sanjay Ranka is a professor of computer and information science and engineering at the University of Florida, Gainesville. His research interests are in the areas of large-scale software systems, grid computing, high-performance computing, data mining, biomedical computing, and optimization. Most recently, he was the chief technology officer at Paramark, where he developed real-time optimization software for optimizing marketing campaigns. He has also held positions as a tenured faculty member at Syracuse University and as a researcher/visitor at the IBM T.J. Watson Research Labs and Hitachi America Limited. He has coauthored two books, Elements of Artificial Neural Networks (MIT Press) and Hypercube Algorithms (Springer Verlag), as well as six book chapters, more than 55 journal papers, and 100 conference and workshop papers. He was one of the main architects of the Syracuse Fortran 90D/HPF compiler and served on the MPI Standards committee. He is a fellow of the IEEE and a member of the IFIP Committee on System Modeling and Optimization.

For more information on this or any other computing topic, please visit our Digital Library at www.computer.org/publications/dlib.
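The E-step quantity b_kij from (12) and the closed-form M-step updates (20), (23), (25), (26), (27), and (34) derived in Appendix B translate directly into code. The following is a minimal NumPy sketch of one EM iteration for the CAD model; the function names (em_step, gaussian_pdf), argument layout, and the absence of log-space computation and covariance regularization are illustrative choices of ours, not part of the paper, so treat this as a sketch rather than the authors' implementation.

```python
import numpy as np

def gaussian_pdf(X, mean, cov):
    """Multivariate Gaussian density f_G evaluated at each row of X."""
    d = X.shape[1]
    diff = X - mean
    inv = np.linalg.inv(cov)
    norm = 1.0 / np.sqrt((2.0 * np.pi) ** d * np.linalg.det(cov))
    # Quadratic form (y - mu)^T Sigma^{-1} (y - mu) for every row at once.
    quad = np.einsum('nd,de,ne->n', diff, inv, diff)
    return norm * np.exp(-0.5 * quad)

def em_step(X, Y, muU, covU, pU, muV, covV, pVgU):
    """One EM iteration for the CAD model.
    X: (n, dx) environmental attributes; Y: (n, dy) indicator attributes.
    muU, covU, pU: the n_U Gaussian clusters over x and their priors p(U_i).
    muV, covV: the n_V Gaussian clusters over y; pVgU[i, j] = p(V_j | U_i).
    """
    nU, nV = len(pU), len(muV)

    # E-step: b_kij per (12), proportional to
    # f_G(x_k|U_i) p(U_i) f_G(y_k|V_j) p(V_j|U_i), normalized over (i, j).
    fx = np.stack([gaussian_pdf(X, muU[i], covU[i]) for i in range(nU)], axis=1)
    fy = np.stack([gaussian_pdf(Y, muV[j], covV[j]) for j in range(nV)], axis=1)
    b = fx[:, :, None] * pU[None, :, None] * fy[:, None, :] * pVgU[None, :, :]
    b /= b.sum(axis=(1, 2), keepdims=True)

    # M-step: marginal responsibilities summed over the "other" index.
    bi = b.sum(axis=2)   # (n, nU): sum over j
    bj = b.sum(axis=1)   # (n, nV): sum over i

    new_pU = bi.sum(axis=0) / b.sum()                       # (20)
    new_muU = (bi.T @ X) / bi.sum(axis=0)[:, None]          # (26)
    new_muV = (bj.T @ Y) / bj.sum(axis=0)[:, None]          # (23)
    new_covU = np.stack([                                   # (27)
        np.einsum('n,nd,ne->de', bi[:, i], X - new_muU[i], X - new_muU[i])
        / bi[:, i].sum() for i in range(nU)])
    new_covV = np.stack([                                   # (25)
        np.einsum('n,nd,ne->de', bj[:, j], Y - new_muV[j], Y - new_muV[j])
        / bj[:, j].sum() for j in range(nV)])
    new_pVgU = b.sum(axis=0) / b.sum(axis=(0, 2))[:, None]  # (34)
    return new_muU, new_covU, new_pU, new_muV, new_covV, new_pVgU
```

Iterating em_step until the objective Λ in (1) stops improving yields the (approximate) maximum likelihood parameters; note that the updated p(U_i) sum to one and each row of p(V_j | U_i) sums to one, exactly as the Lagrange-multiplier derivations in (16) and (29) require.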