A Survey On Outlier Detection Techniques
ABSTRACT Detecting outliers is a significant problem that has been studied in various research and application areas. Researchers continue to design robust schemes to provide solutions to detect outliers efficiently. In this survey, we present a comprehensive and organized review of the progress of outlier detection methods from 2000 to 2019. First, we offer the fundamental concepts of outlier detection and then categorize the methods into diverse techniques, such as distance-, clustering-, density-, ensemble-, and learning-based methods. In each category, we introduce some state-of-the-art outlier detection methods and further discuss them in detail in terms of their performance. Second, we delineate their pros, cons, and challenges to provide researchers with a concise overview of each technique and recommend solutions and possible research directions. This paper presents the current progress of outlier detection techniques and provides a better understanding of the different outlier detection methods. The open research issues and challenges at the end will provide researchers with a clear path for the future of outlier detection methods.
TABLE 1. The different categories covered by our survey and other related surveys.
networks [14], efficient trajectories [15], and in data quality and cleaning [16].

For example, consider the challenges present in large multidimensional data: whether the data is relatively large or extremely large, it always contains some outliers. In most cases, as the data increases in size, the number of outliers also increases [17]. Therefore, with a large volume of data, it is essential to design scalable outlier detection techniques to handle large datasets (Volume). As data increases in size, this proportionally influences the computational cost, rendering the process slow and expensive. It is of great importance that these outliers are detected in a timely manner, to minimize dirty data, prevent data infection, and for the data to provide a well-timed value (Velocity and Value). In another case, when varieties of data are present, some of which are structured, mixed-valued, semi-structured, and unstructured (Variety), computing outliers of this nature can be daunting and complicated. Other areas confronted with challenges include application areas such as mobile social networks, security surveillance [18], [239], trajectory streams [19], and traffic management [20], [21]. These areas demand constant discovery of abnormal objects to deliver crucial information promptly. Many other outlier detection areas share similar and new challenges, and they will be referred to in subsequent sections of this paper.

As a result of the inherent importance of outlier detection in various areas, considerable research efforts on surveys of outlier detection (OD) methods have been made [22]–[34]. Despite the increasing number of existing reviews of outlier detection, it remains an all-embracing topic in the research domain. There are still newly proposed methods and essential issues to be addressed. Therefore, this article serves a vital role in keeping researchers abreast of the latest progress in outlier detection techniques. To the best of our knowledge, most surveys conducted so far only address specific areas rather than providing in-depth coverage of, and insights into, up-to-date research studies, as can be seen in Table 1. For example, the review in [25] focuses only on data streams, [27] focuses on high-dimensional numeric data, [23], [33] focus on dynamic networks, and the most recent focuses on deep learning [32]. The most comprehensive ones [28], [33], [41], despite containing many insights, do not review most of the primary state-of-the-art methods, with most published at least five years ago.

In recent years, more contemporary studies have been conducted, especially in the areas of deep learning [35], [36] and ensemble techniques [37], [38]. Therefore, more of these recent studies and discoveries need a review. Our survey presents a comprehensive review of the most prominent state-of-the-art outlier detection methods, covering both conventional and emerging challenges. This survey is different from others because it captures and presents a more comprehensive review of the state-of-the-art literature, as well as consolidating and complementing existing studies in the outlier detection domain. In addition, we did extensive research to bring forth significant categories of outlier detection approaches and critically discuss and evaluate them. We further discuss commonly adopted evaluation criteria, as well as the tools and available public databases for outlier detection techniques. We believe this survey will significantly benefit researchers and practitioners, as it gives a thorough understanding of the various advantages, disadvantages, open challenges, and gaps associated with state-of-the-art outlier detection methods. This will provide them with better insight into what needs to be focused on in the future. In summary, the novel and significant contributions of the paper are:
• We present the different up-to-date outlier definitions, the different kinds, causes, contemporary detection and handling process, and the latest challenges and
detecting outliers to know the number of features that need to be considered - a univariate or multivariate case. Also, for a statistical-based approach scenario, one must know whether the selected features permit assumptions about the distribution of values, for the parametric or non-parametric cases.

Many techniques have been designed to identify outliers, and in Section 3 we will present and discuss further the different recently proposed methods for outlier detection. In this paper, we categorize these outlier identification methods into the following:

• Statistical-based methods
The fundamental idea of statistical-based techniques for labeling or identifying outliers depends on the relationship with the distribution model. These methods are usually classified into two main groups - the parametric and the non-parametric methods.

• Distance-based methods
The underlying principle of the distance-based detection algorithms focuses on the distance computation between observations. A point is viewed as an outlier if it is far away from its nearby neighbors.

• Density-based methods
The core principle of these methods is that an outlier can be found in a low-density region, whereas inliers are in a dense neighborhood.

• Clustering-based methods
The key idea of clustering-based techniques is the application of standard clustering techniques to detect outliers in given data. Outliers are the observations that are not within or near any large or dense clusters.

• Graph-based methods
Graph-based methods use graph techniques to efficiently capture the interdependencies of interconnected entities in order to identify the outliers.

• Ensemble-based methods
Ensemble methods focus on the idea of combining the results from dissimilar models to produce more robust models that detect outliers efficiently. They help to answer the question of whether an outlier detector should be linear-model based, distance-based, or another kind of model based.

• Learning-based methods
In learning-based methods, such as active learning and deep learning, the underlying idea is to learn different models through the application of these learning methods to detect outliers.

2) HOW TO HANDLE OUTLIERS
There is still considerable discussion on what is considered an outlier. The most applicable rule of thumb used by many researchers is to flag a data point as an outlier when the data point is three or more standard deviations from the mean [39]. This, however, is a weak basis on which to take the argument further, since it cannot hold for all scenarios. This is especially true in recent times, when we are faced with large, dynamic, and unstructured data. Therefore, in modern times, it is imperative to further deliberate on some crucial questions to determine how to handle outliers - for example, whether it is prudent to remove outliers or to acknowledge them as part of the data. Outliers in data can sometimes have a negative impact; in machine learning and deep learning outlier detection processes, they will consequently result in a longer training process, less accurate models, and eventually degraded results.
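To make the three-standard-deviations rule of thumb discussed above concrete, the following is a minimal Python sketch (the function name and toy data are our own, hypothetical choices). Note that, as the discussion above suggests, a large outlier inflates the mean and standard deviation themselves, so the rule can easily miss outliers in small or heavily contaminated samples.

```python
import numpy as np

def three_sigma_outliers(x: np.ndarray) -> np.ndarray:
    """Return a boolean mask of points three or more standard deviations from the mean."""
    return np.abs(x - x.mean()) >= 3 * x.std()

# Hypothetical toy data: twenty typical readings and one extreme value.
data = np.concatenate([np.full(20, 10.0), [25.0]])
print(data[three_sigma_outliers(data)])  # -> [25.]
```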
With the recent development of new techniques to detect outliers, new approaches have been proposed to deal with them. In some cases [42], [43], visual examination of the data is preferred, to get a clear picture of the degree of outlierness of a data point. In another case [44], an approach such as the univariate technique is used to search for data points that contain extreme values on a single variable, while other strategies, such as the multivariate technique, search for combinations across the entire set of variables, and the Minkowski error minimizes prospective outliers during the training phase. There is another great deal of controversy as to what to do once outliers are identified. In many situations, just answering why there are outliers in the data can inform the decision about what can be done with them. In some scenarios, outliers are illegally included [45], while in other cases, they might be a genuine part of the data [14]. In cases of high-dimensional numeric data computation [27], [46], [47], critical factors like the curse of dimensionality need to be considered. Researchers have recently tried to use more accurate data that is uncontaminated and suitable for an outlier detection process [48]–[52], before they start the outlier detection procedure.

Generally, dealing with outliers depends on the application domain. For example, in cases where the influence of outliers might cause serious issues, such as errors in instrument readings, critical environment safety scenarios, or real-time situations (fraud detection/intrusion detection), the outliers can be purged, or an alarm can be set up. In a no-cause-for-alarm scenario, such as a population census survey in which a few people stand out in some features like height, the outliers can simply be noted and verified, since they are naturally occurring; there is no need to delete them as in the former case.

In most cases, to answer the question of how to handle outliers, one has to use intuition, analytic argument through some experiments, and also thoughtful deliberation before making decisions. Other noteworthy questions in the outlier detection process include the significance of considering the context and scenario, and deliberating on the purpose of detecting the outliers. It is essential to know the reason why the outliers are to be identified and what they signify in the end. In the subsequent sections, we will see that different methods and application areas call for various measures on how to deal with outliers.

C. APPLICATION AREAS OF OUTLIER DETECTION
Outlier detection, with its ever-growing interest, has several application areas in wide-ranging fields. The application areas where outlier detection is applied are so diverse that it
is impossible to cover them thoroughly in just a single survey, because of space limitations. Therefore, in this paper, we list and introduce existing and recent application areas. We refer our readers to some previous surveys that exhaustively cover many application domains in which OD methods are applied.

Chandola et al. [20] provided a broad outline and in-depth knowledge of outlier detection application domains. Also, the survey [5] presented an exhaustive list and discussion of applications that adopt outlier detection. Some existing application areas include credit card fraud detection [53], [54], intrusion detection [55], defect detection from behavioral patterns of industrial machines [56], sensor networks [14], finding unusual patterns in time-series data [57], [58], trajectories [19], [59], e-commerce [60], energy consumption [62], data quality and cleaning [16], [45], textual outliers [61], big data analysis [12], [63], social media [64], [65], and so on.

Recently, detecting outliers has become essential in these application domains. We consider only a few new application areas of interest, for just a short introduction.

1) DATA LOGS AND PROCESS LOGS
Providing outlier detection solutions gives companies the edge in gaining concealed insights from their websites, which, if not carried out, would necessitate more effort and additional cost. In processing logs, some automated data mining techniques are needed to search for unusual patterns in the large volume of logs [66]. These logs provide a good source of information for outlier detection monitoring.

2) FRAUD DETECTION AND INTRUSION DETECTION
In fraud detection, if a card is stolen, the purchasing behavior of the card user usually changes, and we will notice an abnormal buying pattern. The same is valid for unauthorized access in computer networks, which results in an unusual pattern [55]. Detecting these abnormal (outlier) patterns is essential for security.

3) SECURITY AND SURVEILLANCE
Consider safety and surveillance in the field of cybersecurity. When we take computer networks into consideration, the processes of ensuring safe logging and log administration are very significant, as they improve authenticity and security intelligence. Detecting outliers in surveillance videos is a practical and exciting research area [239].

4) FAKE NEWS AND INFORMATION, SOCIAL NETWORKS
In recent times, social media has given people a platform to continually spread fake news. Sometimes, it is difficult to differentiate between real and fake news. However, given a reliable source, false news reports can be seen as outliers, since they stand out [237]. The spread of fake news has a negative influence on people and society at large, so it is also crucial that it be identified.

5) HEALTH CARE ANALYSIS AND MEDICAL DIAGNOSIS
In the health care system and medical applications, we usually get unusual patterns or readings from devices, which generally show that a disease condition is present. The detection and understanding of the abnormal patterns helps in the proper diagnosis of the disease and its underlying consequences, and it allows doctors to take adequate measures.

6) DATA SOURCES OF TRANSACTIONS
Audit logs for financial transactions contain information about the database operations. The audit logs help in verifying accuracy and legality, and in reporting risks. It is essential to monitor the audit logs constantly to identify and report unusual behaviors [67].

7) SENSOR NETWORKS AND DATABASES
Detecting outliers in sensor environments, such as a wireless sensor environment [68], [69], a target tracking environment [70], and body sensor networks [71], has helped in ensuring quality network routing and in giving accurate results from sensors. It helps in monitoring computer network performance, for example, to detect network bottlenecks.

8) DATA QUALITY AND DATA CLEANING
Data from different application areas may contain and generate measurement errors and dirty data. Thus, the process of outlier detection [16], [45] can enhance data quality and cleaning. The method of cleaning and correcting data is essential for training high-quality models and for the fast computation and prediction of accurate results.

9) TIME-SERIES MONITORING AND DATA STREAMS
Detecting outliers in time-series data [31], [57] and detecting abnormal patterns in data streaming [10], [25], [72]–[74] is essential. This is because abnormal patterns will influence the fast computation and estimation of correct results.

10) INTERNET OF THINGS (IOT)
IoT devices are made of many sensors that unceasingly sense environmental parameters. These sensors are successfully fused to obtain information on a specific area or region, depending on the desired task. Before carrying out this task, it is essential to check the quality of the data, since the data might be polluted with outliers. It is important to identify or detect these outliers so as not to limit the overall efficiency.

III. OUTLIER DETECTION METHODS
Outlier detection methods have been classified into different techniques, such as statistical-based, distance-based, graphical-based, geometric-based, depth-based, profiling, model-based, and density-based methods, in a wide range of surveys [23], [24]. In this paper, we categorize outlier detection techniques into six key groups - statistical-based, distance-based, density-based, clustering-based,
ensemble-based, and learning-based techniques. We give a short overview of the different methods and the research progress that has been made in the following categories. Also, we present the advantages, disadvantages, challenges, and some possible future research directions for the different methods. For some approaches, we offer a concise summary in table format (Tables 2-5) of the various methods' performances and the issues addressed.

A. DENSITY-BASED APPROACHES
Applying density-based methods to outlier detection is one of the earliest known approaches to outlier detection problems. The core principle of density-based outlier detection methods is that an outlier can be found in a low-density region, whereas non-outliers (inliers) are assumed to appear in dense neighborhoods. The objects that differ considerably from their nearest neighbors, i.e., those that occur far from their closest neighbors, are flagged and always treated as outliers. These methods compare the densities of local points with the densities of their local neighbors. In density-based outlier detection methods, more complex mechanisms are applied to model the outliers, when compared to distance-based methods. Notwithstanding this, the simplicity and effectiveness of density-based methods have made them widely adopted for detecting outliers. Some algorithms designed using this approach have served as baseline algorithms [8], [75] for many new algorithms [76]–[78].

Breunig et al. [8] proposed the Local Outlier Factor (LOF) method, one of the first fundamental, loosely related density-based clustering outlier detection methods. The technique makes use of the k-nearest neighbors. In the KNN set of each point, LOF makes use of the local reachability density (lrd) and compares it with those of the neighbors of each member of that KNN set. The local reachability density (a density estimate that reduces the variability) of an object p is defined as:

$\mathrm{lrd}(p) = 1 \Big/ \frac{\sum_{o \in kNN(p)} \mathrm{reach\text{-}dist}_k(p \leftarrow o)}{|kNN(p)|}$ (1)

where

$\mathrm{reach\text{-}dist}_k(p \leftarrow o) = \max\{k\text{-}\mathrm{dist}(o),\ d(p, o)\}$ (2)

The final local outlier factor score is given as:

$\mathrm{LOF}_k(p) = \frac{1}{|kNN(p)|} \sum_{o \in kNN(p)} \frac{\mathrm{lrd}_k(o)}{\mathrm{lrd}_k(p)}$ (3)

where lrd_k(p) and lrd_k(o) are the local reachability densities of p and o, respectively. The main idea of the approach is that the outlier degree of an observation is defined by its clustering structure in an adjacent neighborhood. The LOF score is at its peak when the lrd of the test point is small compared to the estimates of its nearest neighbors. Storing the KNN and lrd values simultaneously when computing the LOF scores of all data points incurs O(k) additional operations per point. It is therefore prudent to apply a valid index, because in the absence of a useful index, for a data set of size n, LOF incurs O(n²) time when a sequential search is applied. Because of these shortcomings, Schubert et al. [79] found that the LOF density estimate can be simplified, and they proposed SimplifiedLOF, which replaces LOF's reachability distance with the kNN distance:

$\mathrm{dens}(p) = \frac{1}{k\text{-}\mathrm{dist}(p)}$ (4)

where k-dist(p) replaces k-dist(o) in (3). Even though SimplifiedLOF showed improved performance, it has a computational complexity similar to LOF.
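To illustrate how LOF is used in practice, the following is a minimal, hypothetical Python sketch using scikit-learn's LocalOutlierFactor, which implements Breunig et al.'s method; the toy data and parameter choices are our own assumptions, not from [8].

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)),   # a dense inlier cluster
               [[8.0, 8.0]]])                # one isolated point

lof = LocalOutlierFactor(n_neighbors=20)     # k of the kNN set in Eqs. (1)-(3)
labels = lof.fit_predict(X)                  # -1 marks predicted outliers
scores = -lof.negative_outlier_factor_       # LOF score per point; values >> 1 are outlying

print(labels[-1], scores[-1])                # the isolated point should get a large LOF
```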
In a later study, an improvement on LOF [8] and SimplifiedLOF [79] was introduced by Tang et al. [80], which they called the Connectivity-based Outlier Factor (COF). The method is closely similar to LOF, with the only difference being the way the density estimation of the records is computed. COF uses a chaining distance as the shortest path to estimate the local densities of the neighbors, while LOF uses the Euclidean distance when selecting the k-nearest neighbors. The drawback of the Euclidean approach is the indirect assumption it makes about the data distribution, which can result in incorrect density estimation. The key idea proposed by the authors is based on differentiating "low density" from "isolativity", where isolativity is defined as the degree of an object's connectivity to other objects. The COF value at p with respect to its k-neighborhood is expressed as

$\mathrm{COF}_k(p) = \frac{|N_k(p)| \cdot \mathrm{ac\text{-}dist}_{N_k(p)}(p)}{\sum_{o \in N_k(p)} \mathrm{ac\text{-}dist}_{N_k(o)}(o)}$ (5)

where ac-dist_{N_k(p)}(p) is the average chaining distance from p to N_k(p). COF adjusts SimplifiedLOF's density estimate to account for "connectedness" via a minimum spanning tree (MST) of the neighborhood. A cost of O(k²) is incurred when computing the MST of the KNNs. The method still maintains a time complexity similar to LOF, except in cases where connective data patterns characterize the data sets.

Even after these techniques, it remained unclear which threshold score should be considered an outlier in LOF. Kriegel et al. [81] then formulated a more robust local density estimate for an outlier detection method called the Local Outlier Probabilities (LoOP), which combines the idea of providing an outlier "score" with a probabilistic, statistically oriented approach. It makes use of a density estimation based on the distance distribution, and the local outlier score is defined as a probability. LoOP tries to address the issue of LOF outputting an outlier score instead of an outlier probability. The advantage of using LoOP's probability score is that it may give a better comparison of the outlier records across different datasets. The LoOP score indicating that a point is an outlier is given as:

$\mathrm{LoOP}_S(o) = \max\left\{0,\ \mathrm{erf}\left(\frac{\mathrm{PLOF}_{\lambda,S}(o)}{\mathrm{nPLOF} \cdot \sqrt{2}}\right)\right\}$ (6)

where PLOF_{λ,S}(o) is the probabilistic local outlier factor of an object with respect to the significance λ and a context
set S(o) ⊆ D, and nPLOF is the aggregated value. Points within a dense region will have a LoOP value close to 0, while values close to 1 indicate density-based outliers. Similar to SimplifiedLOF [79], LoOP normalizes its outlier detection score, which gives it the same complexity of O(k) per point as in [79]. LoOP, like the previous local outlier algorithms, computes the local density estimation using the neighborhood set. However, the density computation is different: it follows the assumption of a "half-Gaussian" distribution and applies the probabilistic set distance (standard deviation).

LOF [8] and COF [80] fall short in handling the issue of multi-granularity correctly. Papadimitriou et al. [82] proposed a technique using the LOcal Correlation Integral, called LOCI, and its outlier metric, the multi-granularity deviation factor (MDEF), to handle this drawback. Points whose MDEF deviates by at least three standard deviations from that of their neighborhood are marked as outliers. It deals well with local density variations in the feature space and also detects both distant clusters and secluded outliers. The MDEF of a point p_i at a radius r is mathematically defined as:

$\mathrm{MDEF}(p_i, r, \alpha) = 1 - \frac{n(p_i, \alpha r)}{\hat{n}(p_i, r, \alpha)}$ (7)

where n(p_i, αr) is the number of objects in the αr-neighborhood of p_i, and n̂(p_i, r, α) is the average of n(p, αr) over all objects p in the r-neighborhood of p_i. For faster computation of the MDEF, estimating the values of the numerator and denominator of the fraction on the right-hand side gives a better result. All the previous algorithms have shown that it is crucial to choose an appropriate k for excellent detection performance. In LOCI, a maximization approach is used to address this issue. The method adopts the half-Gaussian distribution to estimate the local density, similar to LoOP; however, instead of using the distance, the aggregate of the records in the neighborhood is used. Another point worth noting is that LoOP estimates the local density differently from LOCI: instead of comparing the ratio of local densities, LOCI examines two neighborhoods of different sizes. Even though LOCI showed good performance, it has a longer runtime, and Papadimitriou et al. [82] proposed another method, an approximate version of LOCI called aLOCI. To increase the counting speed over the two neighborhoods, quad-trees are applied with some constraints.
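The following is a minimal, single-radius Python sketch of the MDEF test in (7), based on our own simplified reading of LOCI; the function name and the fixed radius are hypothetical assumptions, whereas LOCI itself scans over many radii and uses the approximations mentioned above.

```python
import numpy as np
from scipy.spatial.distance import cdist

def mdef_outliers(X: np.ndarray, r: float, alpha: float = 0.5):
    """Single-radius sketch of LOCI's MDEF test (Eq. 7)."""
    D = cdist(X, X)                              # pairwise distances, O(n^2) memory
    n_alpha = (D <= alpha * r).sum(axis=1)       # counting-neighborhood sizes n(p, alpha*r)
    mdef = np.empty(len(X))
    flags = np.zeros(len(X), dtype=bool)
    for i in range(len(X)):
        sampling = np.where(D[i] <= r)[0]        # r-neighborhood (sampling neighborhood) of p_i
        mean_n = n_alpha[sampling].mean()
        sigma_n = n_alpha[sampling].std()
        mdef[i] = 1.0 - n_alpha[i] / mean_n
        flags[i] = mdef[i] > 3.0 * sigma_n / mean_n   # the three-sigma deviation rule
    return mdef, flags
```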
Another technique, which performs more efficiently than the existing methods LOF [8] and LOCI [82] as a result of its ability to prune data points that are deep within a cluster, was proposed by Ren et al. [83]. It shows better scalability with an increase in data size. They proposed a method called the Relative Density Factor (RDF) method, which uses a vertical data model (P-trees) to detect outliers. The RDF is the degree of the measure of outlierness, and outliers are points with high RDF values. The RDF of a point p is the ratio of the neighborhood density factor of the point divided by its density factor:

$\mathrm{RDF}(p, r) = \frac{\mathrm{DF}_{nbr}(p, r)}{\mathrm{DF}(p, r)}$ (8)

where DF(p, r) is the density factor, defined as the ratio of the number of neighbors of p to the radius r, and DF_nbr(p, r) is the neighborhood density factor of the point p.

Jin et al. [75] proposed INFLuenced Outlierness (INFLO), another technique for local outlier detection, similar to LOF, that uses the symmetric neighborhood relationship to mine outliers. In LOF, for a dataset with closely related clusters of different densities, correctly computing the score of the instances at the cluster borders is not guaranteed. INFLO addresses this shortcoming and solves the problem of inaccurate space representation in LOF. INFLO uses different descriptions of the neighborhood for the reference set and the context set. The INFLO score is computed using both the k-nearest neighbors and the reverse nearest neighbors. To achieve an enhanced estimation of the neighborhood's density distribution, both the nearest neighbors (NNs) and reverse nearest neighbors (RNNs) of data points are considered. INFLO is defined as the "ratio of the average density of objects in IS_k(p) to p's local density":

$\mathrm{INFLO}_k(p) = \frac{\sum_{o \in IS_k(p)} \mathrm{den}(o)}{|IS_k(p)| \cdot \mathrm{den}(p)}$ (9)

where den(o) and den(p) are the densities of o and p, respectively, and IS_k(p) is the influence space of p. The higher the INFLO value, the higher the probability that the object is an outlier.
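To make the role of the influence space concrete, the following is a minimal, hypothetical Python sketch of the INFLO idea in (9). It assumes the influence space is the union of a point's kNN and reverse kNN sets, uses 1/k-distance as den(p), and relies on scikit-learn's NearestNeighbors for the search; it is our simplified reading, not Jin et al.'s implementation.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def inflo_scores(X: np.ndarray, k: int = 10) -> np.ndarray:
    """INFLO sketch: density ratio over the influence space (kNN plus reverse kNN)."""
    dist, idx = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
    knn = [set(row[1:]) for row in idx]          # column 0 is the point itself
    density = 1.0 / dist[:, -1]                  # den(p) = 1 / k-distance (assumes no duplicates)
    rnn = [set() for _ in range(len(X))]         # reverse nearest neighbor sets
    for p, neighbors in enumerate(knn):
        for o in neighbors:
            rnn[o].add(p)
    scores = np.empty(len(X))
    for p in range(len(X)):
        influence_space = knn[p] | rnn[p]        # IS_k(p)
        scores[p] = density[list(influence_space)].mean() / density[p]   # Eq. (9)
    return scores
```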
In 2014, still using the density-based approach to tackle local outlier detection problems, Cao et al. [84] proposed a novel density-based local outlier detection (UDLO) notion for uncertain data characterized by some discrete instances. Here, an exact algorithm is recommended to compute the density of an instance, rather than using the naive method of finding all k-neighbors to calculate the outliers, as in LOF. However, in their approach, they only applied the Euclidean distance metric. Using other distance computation methods to investigate the possibility of improving the performance of the algorithm could be a future study.

After the introduction of LOF [8], several variations of LOF were established, such as COF [80], INFLO [75], and LOCI [82]. However, these algorithms are challenged by the distance computations for high-dimensional datasets. Keller et al. [85] proposed a high contrast subspace method (HiCS) to improve the evaluation and ranking of outliers whose outlier scores are closely related. Extending the focus beyond only local outliers to include global outliers, Campello et al. [86] proposed a new, effective outlier detection measure algorithm called the Global-Local Outlier Score from Hierarchies (GLOSH). It is capable of simultaneously detecting both global and local outlier types based on a complete statistical interpretation. Generally, even though GLOSH cannot perform better than other techniques in all cases, it still has the strength of scaling well for
different tasks. Since the study is based on a specific k-nearest neighbor density estimate, it has some limitations. A future study could investigate how other density estimates would improve this work.

Momtaz et al. [87] deviate a little from the central focus of most previous algorithms in computing the local outliers. They introduced a novel density-based outlier detection technique that detects the top-n outliers by assigning every object a score called the Dynamic-Window Outlier Factor (DWOF). This algorithm is a modified and improved version of Fan et al.'s [88] Resolution-based Outlier Factor (ROF) algorithm, and it overcomes some of ROF's setbacks, such as low accuracy and high sensitivity to parameters in data sets.

With the massive flow of high-dimensional data, new research motivations are linked to improving the effectiveness and efficiency of algorithms for detecting outliers in big data. Wu et al. [89] proposed an algorithm for the detection of outliers in big data streams. They use a fast and accurate density estimator called RS-Forest and a semi-supervised one-class machine-learning algorithm. Bai et al. [77] considered density-based outlier detection in big data and proposed a Distributed LOF Computing (DLC) method, which detects outliers in parallel. The main idea here is twofold: the preprocessing stage uses the Grid-Based Partitioning (GBP) algorithm, and the DLC is then used for the outlier detection stage. However, despite the improved performance, it still does not scale as well as Lozano et al.'s [90] Parallel LOF Algorithm (PLOFA). Improving the scalability of the algorithm can be an interesting research problem for a future direction.

Tang and He [78] proposed an outlier detection method using local KDE. To measure the local outlierness, a Relative Density-Based Outlier Score (RDOS) is used. Here, the local KDE method with an extension of the object's nearest neighbors is used to estimate the density distribution at the object's location. They place more emphasis on the reverse and shared nearest neighbors, rather than the k-nearest neighbors of an object, for the density distribution estimation. In their method, only the Euclidean distance metric is applied, similar to UDLO in [84]. As a related extension for a future study, there is a need to involve other distance methods to investigate their effect, and to extend the work to real-life applications.

Vázquez et al. [91] proposed a novel algorithm to detect outliers based on low-density models of the data, called Sparse Data Observers (SDO). SDO reduces the quadratic complexity experienced by most lazy-learner OD algorithms. It is an eager learner and severely reduces the computational cost, and it performs well when compared to other best-ranked outlier detection algorithms. Ning et al. [92] proposed a relative density-based OD method that uses a novel technique to measure an object's neighborhood density. Su et al. [93] proposed an efficient density-based scheme based on a local OD approach for scattered data, called E2DLOS. They rename the local outlier factor as the Local Deviation Coefficient (LDC), taking full advantage of the distribution of the objects and of their neighbors. They then proposed a safe non-outlier object removal method to preprocess the datasets and remove all non-outlier objects. This process is named rough clustering based on multi-level queries (RCMLQ). It helps to reduce the amount of data for which the local outlier factor must be computed. The proposed method is based on LDC and RCMLQ, and in the experiments it improves on existing local outlier detection methods in both detection accuracy and time efficiency.

We present a summary in Table 2, showing the progress of this technique for some key algorithms mentioned above. In our overview, it is essential to note that when we say one method outperforms another, it does not necessarily mean that it is superior in all scenarios and datasets. The analysis and summary presented here are based on the experiments done in these papers, as reported by the authors. While a method might outperform another, this might hold only for a particular set of parameters, a scenario, or the assumptions used in the experiment. We cannot claim that a method is superior to another in all cases, since we did not perform experiments under the same parameter settings and environment. This is true for all the following tables (Tables 2-5) in this paper.

1) DENSITY-BASED APPROACHES - ADVANTAGES, DISADVANTAGES, CHALLENGES, AND GAPS
a: ADVANTAGES
In density-based methods, the density estimates used are non-parametric; they do not rely on assumed distributions to fit the data. Some of the density-based techniques [8], [75], [81], [82] have served as a fundamental baseline for many subsequent algorithms. They have experimentally been shown to work well as modern methods, often outperforming their competitors, such as some existing statistical and distance-based approaches [39], [94], [95]. Since outliers in these methods are often analyzed through the object's neighborhood density [8], [82], this, in turn, gives them an advantage in identifying crucial outliers missed by most other outlier detection methods. These methods facilitate the process of efficiently ruling out outliers near dense neighborhoods. They require only minimal prior knowledge, such as the probability distribution, and only a single parameter to tune. They are also known for their ability to compute local outliers efficiently.

b: DISADVANTAGES, CHALLENGES, AND GAPS
Even though some density-based methods have shown improved performance, they are more complicated and computationally expensive, especially when compared to statistical methods in most cases [96]. They are sensitive to parameter settings, such as determining the size of the neighborhood. They need to cautiously take several factors into consideration, which consequently results in expensive computations. For varying density regions, it becomes more
complicated and results in poor performance. Due to their inherent complexity and the lack of updates to their outlierness measures, some of these algorithms, such as INFLO and MDEF, cannot resourcefully handle data streams, and they can be a poor choice for outlier detection in data stream scenarios. High-dimensional data are also challenging when the outlier scores are closely related to each other.

To discuss further, in Table 2 we present a summary of randomly handpicked (because of space limitations) well-known density-based outlier detection algorithms. We include the performance, issues addressed, and drawbacks of standard algorithms and show how these algorithms have evolved. For one of the most popularly known density-based methods, LOF [8], it is crucial to note that in an outlier detection process where the local outliers are not significant, the algorithm can create many false alarms. Generally, since density-based methods are non-parametric, for high-dimensional data spaces the sample size is considered too small [27]. Additional re-sampling, to draw new samples, can be adopted to enhance the performance. We also note that, since most density-based methods rely on nearest neighbor computations, the choice of k is very significant for the evaluation of these algorithms. Usually, for finding the nearest neighbors in nearest-neighbor outlier detection algorithms, the computational cost is about O(n²). A rare case is LOCI, where the radius r is extended, raising its complexity to O(n³); this makes it very slow for massive datasets. An improved version is aLOCI, which shows a faster runtime depending on the number of quad-trees that are utilized. Goldstein et al. [97] compared COF and LOF, and found that the LOF spherical density estimation was a poor choice for efficiently detecting outliers. COF estimates its local density by connecting the regular records with each other to solve this drawback. INFLO shows improved outlier scores when clusters with different densities are not far from each other. Table 2 gives the remaining summary of critical points for the different algorithms.

B. STATISTICAL-BASED APPROACHES
Detecting outliers using statistical techniques can be done in supervised, semi-supervised, and unsupervised styles. In statistical-based OD methods, the data points are sometimes modeled using a stochastic distribution, and data points are labeled as outliers depending on their relationship with the distribution model. Outliers and inliers are declared depending on the data distribution model. Statistical-based methods are usually classified into two main groups - parametric and non-parametric methods. The major difference between the two is that the former assumes an underlying distribution model for the given data and estimates the model's parameters from the known data, whereas the latter does not assume any prior knowledge of the distribution model [98]. In this paper, we classify some of the current research that has been done using a statistical approach to detect outliers into three categories - parametric methods, non-parametric methods, and other kinds of statistical techniques.

1) PARAMETRIC METHODS
For this type of method, which assumes an underlying distribution model, two well-known methods adopted for outlier detection are the Gaussian Mixture model and the Regression model.

a: GAUSSIAN MIXTURE MODEL METHODS
The Gaussian model is one of the most prevalent statistical approaches used to detect outliers. In this model, the training phase uses the Maximum Likelihood Estimation (MLE) method [100] to estimate the mean and variance of the Gaussian distribution. In the test stage, some statistical discordancy tests (box-plot, mean-variance test) are applied. Yang et al. [101] introduced an unsupervised outlier detection method with the globally optimal Exemplar-Based GMM (Gaussian Mixture Model). In their technique, they first realized a globally optimal expectation maximization (EM) algorithm to fit the GMM to a given data set. The outlier factor for every data point is considered as the sum of the weighted mixture proportions, with the weight signifying the relationship with other data points. The outlier factor F_k at x_k is mathematically defined as:

$F_k = z_k(t_h) = \sum_{j=1}^{n} s_{kj}\,\pi_j(t_h)$ (10)

where s_{kj} π_j(t_h) shows the depth of point x_k's influence on another point x_j, s_{kj} refers to the connection strength, t_h is the final iteration, and π_j is the measure of the significance of point j. The data point x_k is more likely to be flagged as an outlier if F_k is smaller. This technique is in contrast to other existing methods [8], [80], [82], which focus solely on local rather than global properties. Yang et al.'s [101] technique can be applied to solve the clustering-based techniques' inability to detect outliers in the presence of noisy data, by fitting the GMM at every data point in a given dataset. We note that, notwithstanding the method's capacity to identify unusual shapes quickly, it still has a high complexity (O(n³) for a single iteration and O(Nn³) for N iterations). An algorithm that reduces this computational complexity, and thus scales better, is left for future study.
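For reference, the general GMM recipe described above can be sketched in a few lines of Python with scikit-learn's GaussianMixture: fit a mixture, then flag the points with the lowest likelihood. This is a minimal, hypothetical example with toy data; it illustrates plain likelihood-based flagging, not Yang et al.'s exemplar-based outlier factor F_k itself.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (200, 2)),    # two well-separated inlier clusters
               rng.normal(6, 1, (200, 2)),
               [[20.0, 20.0]]])               # one gross outlier

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
log_density = gmm.score_samples(X)            # per-point log-likelihood under the mixture
threshold = np.quantile(log_density, 0.01)    # flag the lowest 1% as outlier candidates
print(X[log_density < threshold])             # the point at (20, 20) appears here
```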
In 2015, for a more robust approach to outlier detection, the use of GMM with locality preserving projections was proposed by Tang et al. [102]. They combined GMM and subspace learning for robust outlier detection in energy disaggregation. In their approach, the locality preserving projection (LPP) of subspace learning is used to preserve the neighborhood structure efficiently and then reveal the intrinsic manifold structure of the data. The outliers are kept far away from the normal samples, which is the reverse of Saha et al.'s [103] principal component analysis (PCA) method.
Addressing the shortcoming of existing works, which are not well equipped to tackle current high-velocity data streams owing to their high intricacy and unpredictability under data updates, a more recent study designed KELOS, an approach to unceasingly identify the top-N KDE-based local outliers over streaming data.

To conclude, one big setback of most KDE methods is that they usually suffer from a high computational cost and the curse of dimensionality, which makes them very unreliable in practice. Despite KDE's better performance when compared to other non-parametric OD approaches, there is a relatively low number of reports that adopt a KDE-based approach to this problem.
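The basic recipe shared by these KDE-based detectors can be sketched in a few lines. The following is a minimal, hypothetical example using scikit-learn's KernelDensity with toy data, where the points of lowest estimated density are treated as outlier candidates; the bandwidth is the sensitive (and, in high dimensions, costly) parameter alluded to above.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 1, (300, 2)),    # inlier cluster
               [[9.0, 9.0]]])                 # one isolated point

kde = KernelDensity(kernel="gaussian", bandwidth=0.5).fit(X)
log_density = kde.score_samples(X)            # low density = candidate outlier
print(X[log_density < np.quantile(log_density, 0.01)])
```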
3) OTHER STATISTICAL METHODS
Many statistical approaches have been proposed, but among the more straightforward statistical methods to identify outliers are the histogram [116] and other statistical tests [40], such as the Boxplot, the Trimmed mean, the Extreme Studentized Deviate, and the Dixon-type test [40]. The Trimmed mean, among the others, is comparatively more resistant to outliers, while to identify single outliers, the Extreme Studentized Deviate test is the right choice. The Dixon-type test has the advantage of performing well with a small sample size, because there is no need to assume the normality of the data. Barnett et al. [39] discuss several tests for the optimization of different distribution models to effectively detect outliers. Optimization can depend on the actual parameters of the conforming distributions, that is, the expected space for outliers and the number of outliers. Rousseeuw and Hubert [117] also gave a broader discussion of statistical techniques for outlier detection. Using a histogram-based approach, Goldstein and Dengel [116] proposed the Histogram-Based Outlier Score (HBOS) detection algorithm, which uses static and dynamic bin-width histograms to model univariate feature densities. These histograms are then used to calculate the outlier score for each of the data instances. The algorithm showed improved performance on some metrics, such as computational speed, when compared to some other popular OD methods such as LOF [8], COF [80], and INFLO [75]. However, it falls short in local outlier detection problems, because the algorithm cannot model local outliers using the proposed density estimation.
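A minimal sketch of the static bin-width variant of this idea follows; it is our own simplified reading of HBOS with a hypothetical helper name. Each feature is modeled by an independent histogram, and a point's score is the sum of the negative log densities of the bins it falls into, which also makes the inability to capture feature interactions (discussed below) plain to see.

```python
import numpy as np

def hbos_scores(X: np.ndarray, bins: int = 10) -> np.ndarray:
    """Static bin-width HBOS sketch: sum of negative log histogram densities per feature."""
    n, d = X.shape
    scores = np.zeros(n)
    for j in range(d):
        hist, edges = np.histogram(X[:, j], bins=bins, density=True)
        hist = np.maximum(hist, 1e-12)           # avoid log(0) for empty bins
        idx = np.digitize(X[:, j], edges[1:-1])  # bin index in [0, bins - 1] per point
        scores += -np.log(hist[idx])             # features treated as independent
    return scores
```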
Hido et al. [95] proposed a new statistical approach for inlier-based outlier detection problems using direct density ratio estimation. The main idea is to use the ratio of the training and test data densities as outlier scores. The method of unconstrained least-squares importance fitting (uLSIF) was applied because it works well with natural cross-validation measures that allow it to accurately optimize the tuning parameters' values, such as the kernel width and the regularization parameter. The proposed technique, when compared to the non-parametric KDE, is more advantageous because it can bypass the hard density estimation computation. The method also showed improved accuracy; even though it did not show a better performance than the other methods in all cases, it demonstrates that the approach is more efficient from a broader perspective. Improving the accuracy of the density ratio estimation can be important future work for this approach.

Du et al. [118] proposed another robust technique with statistical parameters to solve the problem of local outlier detection, called Robust Local Outlier Detection (RLOD). This study was motivated by the fact that most OD methods focus on identifying global outliers, and most of these methods [119], [120] are very sensitive to parameter changes. The framework has three stages. In the first stage, the authors propose a method to find density peaks in the dataset using the 3σ standard. In the second stage, each remaining data object in the dataset is allocated to the same cluster as its nearest neighbor of higher density. In the final stage, they use Chebyshev's inequality and the density peak reachability to recognize the local outliers in each group. The method supports the detection of both local and global outliers, as in Campello et al.'s [86] technique, and the authors experimentally showed that the method outperforms other methods [8], [26] in terms of running time and detection rate. They recommend further experiments on how to improve efficiency through the use of a robust method for distributed and parallel computing.

Other studies have been done using statistical methods for computing outliers. In Table 3, we present a summary showing the progress of this technique for some key algorithms mentioned above.

4) STATISTICAL-BASED APPROACHES - ADVANTAGES, DISADVANTAGES, CHALLENGES, AND GAPS
Advantages:
i. They are mathematically acceptable and have a fast evaluation process once the models are built. This is because most models are made in a compacted form, and they show improved performance given the probabilistic model.
ii. The models generally fit quantitative real-valued data sets or some quantitative ordinal data distributions. Ordinal data can be changed to an appropriate value for processing, which results in improved processing time for complex data.
iii. They are easier to implement, even though limited to specific problems.

Disadvantages, Challenges, and Gaps:
i. Because of their dependency on, and assumption of, a distribution model, the quality of the results produced by parametric models is mostly unreliable in practical situations and applications, due to the lack of prior knowledge regarding the underlying distribution.
ii. Since most models apply to a univariate feature space, they are characteristically not applicable in a multi-dimensional scenario. They incur high computational costs when dealing with multivariate data, and this,
in turn, makes most of the statistical non-parametric models a poor choice for real-time applications.
iii. In the histogram technique, one fundamental shortcoming for multivariate data is the inability to capture the interactions between different attributes, because the technique cannot analyze multiple features simultaneously. In general, some prevailing statistical methods are not applicable to very high-dimensional data. There is a need to design statistical techniques that support high-dimensional data and are capable of simultaneously analyzing multiple features.
iv. When faced with problems of increased dimensionality, statistical techniques adopt different methods. This results in an increase in processing time and a misrepresentation of the distribution of the data.

Discussing further from a more global view, statistical methods come with many advantages and drawbacks. In outlier detection problems, outlier-free data is significant for building reliable systems. This is because outliers can have a drastic effect on system efficiency, so it is prudent to identify and remove those that affect the system's accuracy. Most of the drawbacks of statistical methods are centered around outlier detection accuracy, the lack of efficient techniques for very large data sets, the curse of dimensionality, and computational cost.

Statistical-based methods can be effective in the outlier detection process when the correct distribution model is captured. In some real-life situations, for instance in sensor stream distributions, there is no prior knowledge available to be learned. In such a scenario, when the data does not follow the predetermined distribution, the approach may become impractical. Therefore, non-parametric methods are more appealing, since they do not depend on assumptions about the distribution characteristics. This is also true for big data streams, where the data distribution cannot be assumed. For evenly dispersed outliers in a dataset, using statistical techniques becomes complicated. Therefore, parametric methods are not applicable for big data streams, but non-parametric methods are. In addition, defining the threshold of a standard distribution to differentiate the outliers carries a higher probability of inaccurate labeling.

For parametric cases using Gaussian mixture models, a point worth noting is the daunting task of adopting Gaussian techniques for computing outliers in both high-dimensional data subspaces and data streams. Yang et al.'s [101] method, for instance, has high complexity; an algorithm that reduces this computational complexity would be more scalable. Regression techniques are also not suitable for high-dimensional subspace data. For a more efficient and robust solution in finding and discovering outliers, it is more appropriate to apply robust regressions rather than ordinary regressions, because outliers can impact the latter. For the non-parametric case, KDE performs better in most cases, despite its sensitivity to outliers and the complexity of determining a good estimate of the nominal data density in polluted data sets. In multivariate data, non-parametric methods scale well and are computationally inexpensive. The histogram models work well for univariate data but are not suitable for multivariate data, because they cannot capture the relations between the different attributes.

Some statistical techniques are not well adapted to recent times because of the kinds of data and application areas involved. However, they are considered great practical approaches to outlier detection problems. Tang et al.'s method [102], when compared to the PCA method [103], gives a robust improvement for outlier and noise detection problems. HBOS [116] shows a good computational speed, higher even than some clustering-based algorithms and other types of algorithms (LOCI, LOF, INFLO), which makes it suitable for large-scale, near real-time applications. Meanwhile, Hido et al.'s [95] method is more scalable for massive data sets, and Du et al.'s [118] method has a more robust analysis.

C. DISTANCE-BASED APPROACHES
Distance-based methods detect outliers by computing the distances between points. A data point that is far from its nearest neighbor is regarded as an outlier. The most commonly used distance-based outlier detection definition is centered on the concepts of the local neighborhood, the k-nearest neighbor (KNN) [121], and the traditional distance threshold. One of the earliest studies of computing distance-based outliers, by Knorr and Ng [122], defined distance-based outliers as follows:

Definition: In a dataset T, an object O is a DB(p, D)-outlier if at least a fraction p of the objects in the dataset lies at a distance greater than D from O.

Other well-known definitions of distance-based outliers, given a distance measure on a feature space, define outliers as:
i. Points with fewer than p other samples within the distance d [99].
ii. The top n examples whose distance to the k-th nearest neighbor is the greatest [123].
iii. The top n examples whose average distance to the k nearest neighbors is the greatest [7].

The abbreviation DB(p, D) denotes a Distance-Based outlier detected using the parameters p and D.
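As an illustration of the DB(p, D) definition above, here is a naive, hypothetical nested-loop sketch in Python. It is quadratic in the number of points, which is precisely why the indexing and pruning techniques discussed below matter.

```python
import numpy as np

def db_outliers(X: np.ndarray, p: float, D: float) -> np.ndarray:
    """Naive O(n^2) check of the DB(p, D) definition."""
    n = len(X)
    flags = np.zeros(n, dtype=bool)
    for i in range(n):
        dists = np.linalg.norm(X - X[i], axis=1)
        far_fraction = np.sum(dists > D) / (n - 1)  # fraction of other objects beyond D
        flags[i] = far_fraction >= p                # outlier if at least fraction p lie beyond D
    return flags
```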
DB outlier detection methods are moderate non-parametric approaches that scale well for large data sets with medium to high dimensionality. In comparison with statistical techniques, they tend to have a more robust foundation and are more flexible and computationally efficient. In the subsequent sections, we classify the distance-based methods into the following groups - distance-based computation methods using k-nearest neighbor computation, pruning techniques, and data-stream-related works. Some of the most commonly used distance-based approaches to detect outliers are as follows:
1) K-NEAREST NEIGHBOR METHODS The key difference from the earlier algorithms is that it sup-
Using these methods for computing outliers have been one ports the fast merging of a point’s approximate nearest neigh-
of the most popular ways adopted by many researchers to bors. In terms of its efficiency, only the points’ approximate
detect outliers. It is not the same as the k-nearest neighbor nearest neighbors are of value. It scales linearly as a function
classification. These methods are mostly used for detecting of the number of dimensions and log-linear for the number of
global outliers. Initially, a search for the k-nearest neighbor data points. One key difference from other methods is, instead
of every record, and then these neighbors are used to compute of using the nearest neighbors, the approximate nearest neigh-
the outlier score. They mainly examine the nature of a given bor is used, which makes the computation faster.
object neighborhood information to determine whether they In 2009, instead of following the trend in outlier detection
are close to their neighbors or have a low density or not. The for global outliers using distance-based computation tech-
key concept is to exploit the neighborhood information to niques, the authors decided to divert to local outlier detection.
detect the outliers. Zhang et al. [76] proposed a local distance-based outlier
Knorr and Ng [122] and Ramaswamy et al. [123] were among the first to propose techniques for detecting outliers in large data sets, showing significant progress over the existing state-of-the-art studies. Knorr and Ng [122] proposed a non-parametric approach, in contrast to some previous statistical techniques [101], [104], in which the users lack knowledge about the underlying distribution. The index-based and nested-loop based algorithms were the two algorithms proposed, with a computational complexity of O(kN²), where k is the dimensionality and N is the number of data points. Later, Ramaswamy et al. [123] proposed a cell-based algorithm that is linear with respect to N and exponential with respect to k to optimize the previous algorithm [122]; it has a computational complexity lower than the two previous methods. Ramaswamy et al. [123] also tried to improve on several of the shortcomings of [122], such as specifying the distance, the ranking method adopted, and minimizing the computational cost. To address the problem of determining the distance, they defined their approach as one that does not require the users to specify a distance parameter but instead adopts the distance to the kth nearest neighbor. In the expanded version of [122], spatial indexing structures such as the KD-tree, X-tree, and R-tree are used to find the nearest neighbors of each candidate [99]. This is done by querying the index structure for the closest k points of each example and finally, in line with the outlier definition, selecting the top n candidates. One main concern with this method is that the index structures break down as the dimensionality increases.
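To make the kth-nearest-neighbor definition concrete, here is a minimal brute-force sketch that ranks records by the distance to their kth nearest neighbor and returns the top n, in the spirit of the definition in [123]. The function name, parameters, and test data are ours for illustration; this is not the indexed or cell-based algorithm itself.

```python
import numpy as np

def knn_outliers(X, k=5, n=10):
    """Rank points by the distance to their k-th nearest neighbor
    and return the indices of the top-n outliers."""
    # Pairwise Euclidean distances (brute force, O(N^2) -- fine for a sketch).
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(-1))
    np.fill_diagonal(dist, np.inf)           # a point is not its own neighbor
    kth_dist = np.sort(dist, axis=1)[:, k - 1]  # distance to the k-th neighbor
    top_n = np.argsort(-kth_dist)[:n]        # largest k-NN distance first
    return top_n, kth_dist

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (200, 2)),   # dense inlier cloud
               rng.uniform(-6, 6, (5, 2))])  # a few scattered points
outliers, scores = knn_outliers(X, k=5, n=5)
print(outliers)
```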
Angiulli et al. [7] differ a bit from the traditional approach of developing techniques that detect outliers in a given input dataset; their technique instead learns a model and predicts outliers in an incoming dataset. They designed a distance-based algorithm that detects the top outliers in a given unlabeled dataset and predicts whether an unseen data point is an outlier. The outlier detection process involves detecting the top n outliers in a given dataset, that is, the n objects of the dataset with the highest weight. This is done by determining whether an incoming object's weight in the dataset is greater than or equal to the nth highest weight. This process results in an O(n²) complexity.

Ghoting et al. [124] proposed an algorithm called Recursive Binning and Re-Projection (RBRP) to enhance the computational speed for high-dimensional datasets and improve on the drawbacks of previous methods [122], [123] by first grouping similar data points into bins. One key difference from other methods is that, instead of the exact nearest neighbors, approximate nearest neighbors are used, which makes the computation faster.

In 2009, instead of following the trend of detecting global outliers using distance-based computation techniques, Zhang et al. [76] turned to local outlier detection and proposed the Local Distance-based Outlier Factor (LDOF). Their study shows improved performance over a range of neighborhood sizes when compared to LOF [8]. The demand for pairwise distance computation is O(k²), similar to COF [80]. The method is comparable to the k-nearest neighbor outlier detection techniques in performance; however, it is less sensitive to parameter values. Liu et al. [125], in a later study, extended the traditional LOF to uncertain data.
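For reference, the LDOF score of [76] admits a very short implementation: the average distance from a point to its k neighbors, divided by the average pairwise distance among those neighbors. The sketch below is our own minimal rendering of that formula; the defaults are illustrative.

```python
import numpy as np

def ldof(X, k=10):
    """Local Distance-based Outlier Factor: mean point-to-neighbor distance
    divided by the mean pairwise distance among the neighbors. Values well
    above 1 suggest the point lies outside its own neighborhood."""
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(-1))
    np.fill_diagonal(dist, np.inf)
    nn = np.argsort(dist, axis=1)[:, :k]      # k nearest neighbors of each point
    scores = np.empty(len(X))
    for i, idx in enumerate(nn):
        d_xp = dist[i, idx].mean()            # average distance to neighbors
        inner = dist[np.ix_(idx, idx)]        # pairwise block among neighbors
        d_nn = inner[inner != np.inf].mean() if k > 1 else 1.0
        scores[i] = d_xp / d_nn
    return scores
```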
Huang et al. [126] proposed a method called the Rank-Based Detection Algorithm (RBDA) to rank the neighbors. It provides a feasible solution and ensures that the nature of high-dimensional data becomes meaningful. For illustration, in [17], the fundamental assumption is that objects become close to each other, or share similar neighbors, when they are produced from the same mechanism. RBDA uses the ranks of the individual objects that are close as the degree of proximity of the object, but it does not take into consideration the objects' distance information with respect to their neighbors. Bhattacharya et al. [127] propose a method that further uses both the ranks of the nearest neighbors and the reverse nearest neighbors, which ensures that each candidate object's outlier score is effectively measured.

In another study, Dang et al. [121] applied the k-nearest neighbor approach to detect outliers in daily collected large-scale traffic data in some advanced cities. Outliers are detected by exploiting the relationships among neighborhoods; an outlier here is a data point that is farther from its neighbors. Their method showed good results, with success detection rates of 95.5% and 96.2%, respectively, outperforming statistical approaches such as KDE (95%) and GMM (80.9%). However, a shortcoming of their work is that they only considered a single distance-based metric, so a future study with more complicated variations, such as different weights on multiple distances, could improve the outlier detection rate. In another study, to improve the search effectiveness of the KNN neighbors, Wang et al. [128] applied a minimum spanning tree.

Radovanović et al. [129] presented a reverse nearest neighbor approach to tackle one of the biggest challenges in computing outliers in high-dimensional data sets, that is, ''the curse of dimensionality.'' They showed that their approach could be effectively applied in both low and high-dimensional settings. When compared with the original KNN method [123], it showed an improved detection rate. Their primary focus is centered on the influence of high dimensionality and the hubness phenomenon; an antihub technique was then proposed, which improves the discrimination between outlier scores. Huang et al. [130] implemented the concept of the natural neighbor to acquire the neighborhood information, and Ha et al. [131] proposed a heuristic approach to determine a suitable value for k by employing iterative random sampling. To this end, most recently, Tang et al. [78] proposed a method to determine outlier scores using local kernel density estimation (KDE). They examine different types of neighborhoods, including the reverse nearest neighbors, shared nearest neighbors, and the k nearest neighbors. The neighbor-based detection methods are independent of the data distribution model and can be easily understood and interpreted; however, they are sensitive to parameter settings and sometimes deficient in performance.
2) PRUNING METHODS
Bay et al. [132] presented an algorithm based on a nested loop that uses randomization and a pruning rule. By modifying the nested-loop algorithm, which is recognized for its quadratic performance O(N²), they were able to obtain near-linear time on most of the data sets that previously showed quadratic performance under the earlier method [122]. However, the algorithm makes a lot of assumptions, which can consequently lead to poor performance. Since most previous research [99], [122], [123] was unable to simultaneously meet the demands of both the CPU cost and minimizing the I/O cost, Angiulli et al. [133] presented a novel algorithm called Detecting OutLiers PusHing data into an Index (DOLPHIN) to address these challenges. In the proposed algorithm, only two sequential scans of the data set are performed, while that of [132] implements a block-nested-loop analysis of the disk pages, which results in a quadratic input and output cost. Ren et al. [134] presented an improved version of the Ramaswamy et al. technique [123], a vertical distance-based outlier detection method for large data sets that also applies the pruning method and a ''by-neighbor'' labeling technique. In their study, as an alternative to the outdated horizontal structures, a vertical structure is adopted to facilitate the efficient detection of outliers. The technique is implemented in two phases (with and without pruning) with P-Trees for the outlier detection. According to the authors, a future study could explore the use of P-Trees in other OD methods, such as the density-based approach. In another work, Vu et al. [135] introduced the MultI-Rule Outlier (MIRO) method, which adopts a similar technique as in [134] by using pruning to speed up the process of detecting outliers.
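The pruning rule is easiest to see in code. The sketch below is a simplified, in-memory rendering of the randomized nested loop with a cutoff described for [132]; the real algorithm additionally processes data in blocks from disk, and all names and defaults here are ours.

```python
import numpy as np

def pruned_top_outliers(X, k=5, n=10, seed=0):
    """Nested-loop k-NN outlier search with randomization and pruning:
    once a candidate's distance to its k-th closest neighbor seen so far
    falls below the weakest score in the current top-n, it can never be
    a top outlier and is abandoned early."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(X))       # randomized scan order
    X = X[perm]
    top = []                             # (score, original index), sorted desc
    cutoff = 0.0
    for i, x in enumerate(X):
        closest = []                     # k smallest distances seen so far
        pruned = False
        for j, y in enumerate(X):
            if i == j:
                continue
            closest.append(np.linalg.norm(x - y))
            if len(closest) >= k:
                closest.sort()
                closest = closest[:k]
                if closest[-1] < cutoff:  # upper bound already below cutoff
                    pruned = True
                    break
        if not pruned and len(closest) >= k:
            top.append((closest[k - 1], perm[i]))
            top.sort(reverse=True)
            top = top[:n]
            if len(top) == n:
                cutoff = top[-1][0]       # weakest of the current top-n
    return top
```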
3) IN DATA STREAMS
Lately, most incoming data are in the form of a continuous flow, and storing these data can be impractical because they need to be computed fast. In data streams, for distance-based approaches, researchers continue to face significant challenges such as the notion of time, multi-dimensionality, concept drift, and uncertainty problems [72]. Researchers have seen these as interesting challenges, and they have focused on designing algorithms to detect outliers in the data stream environment. A data stream is considered to be a large volume of unlimited incoming sequence data. Since the mining of these kinds of data is highly dependent on time intervals, the computation is usually done in windows. The two well-known data stream window models are the landmark and the sliding window [136]. In the former, a time point in the data stream is identified, and the points between that landmark time point and the current time are then analyzed, while in the latter, the window is marked by two sliding endpoints.
Angiulli et al. [136] propose a novel idea for the one-time query of outliers in data streams, which differs from the continuous-query approach presented by the authors in [137], [138]. They proposed three kinds of Stream Outlier Miner (STORM) algorithms to detect outliers in data streams using the distance-based method. The first one is based on computing the exact outlier query, and the other two focus on retrieving approximate results of the queries. The exact algorithm (Exact-Storm) makes use of a stream manager (which collects the incoming streams) and a suitable data structure (used by the query manager to answer outlier queries). One shortcoming of this algorithm is the cost of storing all the window objects; it is also not suitable when the window is colossal, since the window cannot fit into memory. To tackle this issue, the approximate algorithm (Approx-Storm) adapts Exact-Storm with two approximations, that is, by reducing the number of data points stored in each window and by decreasing the space used to store each data point's neighbors. The final algorithm (Approx-fixed-memory) aims to minimize memory usage by keeping only a controlled fraction of the safe inliers.
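To make the window mechanics concrete, the following sketch implements a sliding-window, distance-based check in the spirit of the exact STORM query: a point is flagged when fewer than m points of the current window lie within distance R of it. The parameter names (W, R, m) and values are our illustrative choices, not the paper's notation.

```python
import numpy as np
from collections import deque

def stream_outliers(stream, W=100, R=1.0, m=3):
    """Yield (time, point) for every arriving point that has fewer than
    m neighbors within distance R inside the current sliding window."""
    window = deque(maxlen=W)          # expired points fall off automatically
    for t, x in enumerate(stream):
        x = np.asarray(x, dtype=float)
        if len(window) == W:          # only query once the window is full
            d = np.linalg.norm(np.stack(window) - x, axis=1)
            if (d <= R).sum() < m:
                yield t, x            # too few neighbors: report an outlier
        window.append(x)

rng = np.random.default_rng(1)
data = rng.normal(0, 1, (500, 2))
data[250] = (8.0, 8.0)                # inject one obvious outlier
for t, x in stream_outliers(data):
    print("outlier at time", t)       # prints 250 (plus any borderline points)
```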
Yang et al. [139] proposed several methods (Abstract-C, Abstract-M, Exact-N, and Extra-N) to deal with the incremental detection of neighbor-based patterns in sliding-window scenarios over data streams. The old static approach to pattern detection is costly and results in high complexity. Therefore, in this technique, the authors address the issue of handling sliding windows, which was not supported in earlier incremental neighbor-based pattern detection algorithms such as the incremental DBSCAN [26]. Their experimental studies show less CPU usage, and the methods maintain memory usage that is linear in the number of objects in the window. Among these algorithms, Abstract-C is the only related algorithm that is distance-based, while the others are more closely linked to density-based cluster methods. Table 3 gives further details and summaries of these methods.

Kontaki et al. [140] proposed algorithms that tackle some issues in event detection in data streams [141] and in sliding-window scenarios over data streams [139], both of which are characterized by continuous outlier detection. In the Angiulli et al. technique [141], two of the algorithms use the sliding window in parallel with a step function in the process of detecting the outliers. The main objective in [140] is to minimize the storage consumption, improve the algorithm efficiency, and make it more flexible. The authors designed three algorithms to support this aim: Continuous Outlier Detection (COD), Advance Continuous Outlier Detection (ACOD), and Micro-Cluster-Based Continuous Outlier Detection (MCOD). The first algorithm, COD, has two versions that support a fixed radius and multiple k values, while ACOD supports multiple k and R values, where k is the parameter for the number of neighbors and R is the distance parameter for the outlier detection. The final algorithm, MCOD, minimizes the range queries and thereby reduces the amount of distance computation that needs to be done. The key dissimilarity between STORM and COD is the decline in the number of examined objects in each slide, while between Abstract-C and COD it is the speed and memory consumption; that is, COD is much faster and requires less space. Another algorithm, designed specifically for high-volume data streams, was proposed by Cao et al. [142] and is called ThreshLEAP. It is a technique that tries to mitigate expensive range queries, which is achieved by not storing data points from the same window in the same index structure. Leveraging modern distributed multi-core clusters to improve the scalability of detecting the outliers can be an exciting direction for future studies.

In Table 4, in addition to the survey done by Tamboli et al. [25], which compared some distance-based outlier detection algorithms using the Massive Online Analysis tool [143], we added other methods that were not included in their work. In addition, Tran et al. [73] performed an evaluation study with detailed experiments of outlier detection methods in the data stream; among all the algorithms presented, they conclude that MCOD has the best performance in most settings.
4) DISTANCE-BASED APPROACHES-ADVANTAGES, DISADVANTAGES, CHALLENGES, AND GAPS
Advantages:
i. They are straightforward and easy to comprehend, as they mostly do not rely on an assumed distribution to fit the data.
ii. In terms of scalability, they scale better in a multi-dimensional space, as they have a robust theoretical foundation and are computationally efficient when compared to statistical methods.
Disadvantages, Challenges, And Gaps:
i. They share some drawbacks with statistical and density-based approaches in high-dimensional space, as their performance declines due to the curse of dimensionality. The objects in the data often have discrete attributes, which makes it challenging to define distances between such objects.
ii. Search techniques such as the neighborhood and KNN searches in high-dimensional space are expensive tasks when using a distance-based approach. In large data sets, the scalability is also not cost effective.
iii. Most of the existing distance-based methods cannot deal with data streams because it is difficult for them to maintain the data distribution in the local neighborhood and to find the KNN in the data stream. The exception is methods that were specially designed to tackle data streams.
Discussing further, in Table 4 we present comprehensive, well-known distance-based outlier detection algorithms. We give a summary of the different techniques in terms of their computational complexity (running time and memory consumption), the issues they address, and their drawbacks. Distance-based methods are a widely adopted approach since they have a strong theoretical basis and are computationally effective. However, they are faced with some challenges. One of the critical underlying drawbacks of most distance-based methods is their inability to scale well for very high-dimensional data sets [144]. Issues like the curse of dimensionality continue to be an evolving challenge. When the data dimension grows, this influences the descriptive ability of the distance measures and makes it quite tricky to apply indexing techniques to search for the neighbors. In multivariate data sets, computing the distance between data instances can be computationally demanding, consequently resulting in a lack of scalability. Even though researchers have focused on solving these problems, we still believe better algorithms can be designed that simultaneously address the problems of low memory cost and computational time. To address the issue of quadratic complexity, researchers have focused on proposing several significant algorithms and optimization techniques, such as applying compact data structures [124], [145] and using pruning and randomization [132], among many others. Another challenge worth noting is the inability of distance-based techniques to identify local outliers, since distance-based calculations are often done with respect to global information. For k-nearest neighbor approaches, the dataset plays a vital role in determining the perfect KNN score. For most of the algorithms mentioned, choosing an appropriate threshold, when one is required, is one of the most complex tasks. Another important factor that influences the results obtained in these outlier detection processes is the choice of k and of the other input parameters.
Furthermore, in terms of detecting outliers in data streams, the fundamental requirement is related to computational speed. We believe that designing algorithms that can support fast computation in both single and multiple data streams using distance-based techniques will be an exciting challenge for future directions. For currently growing research areas like big data, which demand the computation of more massive data sets, it is imperative to design robust algorithms using distance-based techniques that scale well with a low computational cost (running time and memory) on large, up-to-date, real data sets for both batch and stream processes.
D. CLUSTERING-BASED APPROACHES
Clustering-based techniques generally rely on using clustering methods to describe the behavior of the data. To do this, the smaller clusters that comprise significantly fewer data points than the other clusters are labeled as outliers. It is important to note that clustering methods are different from the outlier detection process: the main aim of clustering methods is to recognize the clusters, while outlier detection aims to detect outliers. The performance of clustering-based techniques is highly dependent on the effectiveness of the clustering algorithm in capturing the cluster structure of the normal instances [146]. Clustering-based methods are unsupervised since they do not require any prior knowledge. So far, numerous research studies use clustering-based techniques, and some of them are furnished with mechanisms to minimize the adverse influence of the outliers. Zhang [26], in his work, introduced many clustering-based algorithms and divided them into different groups. As most of these clustering-based algorithms were not proposed within this decade, we deem it unnecessary to repeat them in our work and refer our readers to [26] or to the original references of the studies for a detailed introduction to the algorithms listed below.
Clustering-based outlier detection algorithms can be grouped into the following subgroups; a small illustrative sketch follows the list.
i. Partitioning clustering methods: These are also known as distance-based clustering algorithms. Here, the number of clusters is either randomly chosen or given in advance. Some examples of algorithms that fall under this group include PAM [147], CLARANS [148], K-Means [149], CLARA [147], etc.
ii. Hierarchical clustering methods: They partition the set of objects into groups at different levels and form a tree-like structure. To group into different levels, they usually require the maximum number of clusters to be given. Some examples include MST [150], CURE [151], and CHAMELEON [152].
iii. Density-based clustering methods: They do not require the number of clusters to be given initially, as in the case of partitioning methods such as K-Means. Given the radius of the cluster, they can model the clusters as dense regions. Some examples of density clustering methods include DBSCAN [153] and DENCLUE [154].
Other groups include:
iv. Grid-based clustering methods: STING [94], Wavecluster [155], DCluster [156].
v. Clustering methods for high-dimensional data: CLIQUE [157], HPStream [158].
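As noted above, the basic clustering-based labeling rule can be sketched in a few lines: points that fall into very small clusters, or that lie unusually far from their own cluster centroid, are flagged. The thresholds below are illustrative choices of ours, not values from the cited papers.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_outliers(X, n_clusters=5, min_frac=0.05, dist_quantile=0.95):
    """Flag members of tiny clusters and points far from their own centroid."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X)
    labels, centers = km.labels_, km.cluster_centers_
    sizes = np.bincount(labels, minlength=n_clusters)
    d = np.linalg.norm(X - centers[labels], axis=1)  # distance to own centroid
    small = sizes[labels] < min_frac * len(X)        # member of a tiny cluster
    far = d > np.quantile(d, dist_quantile)          # unusually far from centroid
    return np.where(small | far)[0]
```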
In addition to the algorithms that have been covered in the existing literature [5], [22], [26], [33], a two-phase algorithm called DenStream was proposed by Cao et al. [9], and D-Stream by Chen et al. [159]. The authors make use of density clustering-based techniques to address the problems of both online and offline outlier detection. In DenStream, the initial phase records the summary information of the data streams, and the latter phase clusters the already summarized data. Outliers are detected by introducing potential outlier micro-clusters to differentiate between the real data points and outliers; the main distinction between the two is the weight. If the weight is less than the density threshold of the outlier micro-cluster, then the micro-cluster is a real outlier, and the algorithm therefore removes the outlier micro-clusters. To show the effectiveness of the algorithm, it was evaluated against CluStream [160]. The algorithm showed improved efficiency over CluStream in terms of memory, since snapshots are saved on disk rather than in memory. However, the method still falls short in some areas, for example, in finding arbitrarily shaped clusters at multiple levels of granularity and in adapting to dynamic parameters in the data streams. Even though the proposed technique dates from 2006, we still believe some future work can be done to address these issues since, to the best of our knowledge, the problems continue to exist. The other technique, D-Stream [159], is similar to DenStream, with an online and an offline component, except that it is a density grid-based clustering algorithm. Detecting outliers here is not as difficult as in the previous method, due to the introduction of sparse, dense, and sporadic grids that define the noise: outliers are considered to be grids whose sparse-grid density is less than the defined density threshold. The algorithm also shows better performance than CluStream in terms of time performance and clustering performance.
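The weight test that separates real from potential outlier micro-clusters can also be sketched compactly. The damped-window weighting below follows the usual DenStream-style convention, where each point's contribution decays exponentially with its age; the parameter values are illustrative, not the paper's.

```python
import numpy as np

def is_outlier_microcluster(arrival_times, now, lam=0.1, beta=0.5, mu=10):
    """Each point contributes 2^(-lam * age) to the micro-cluster weight;
    a micro-cluster whose total weight stays below beta * mu is treated
    as an outlier micro-cluster and can be pruned."""
    ages = now - np.asarray(arrival_times, dtype=float)
    weight = np.sum(2.0 ** (-lam * ages))
    return weight < beta * mu

print(is_outlier_microcluster([0, 1, 2], now=50))       # True: stale, decayed
print(is_outlier_microcluster(range(40, 50), now=50))   # False: fresh and dense
```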
Since the algorithms in [9] and [159] use damped window models, Ren et al. [161] proposed SDstream, an algorithm that uses the sliding window model. Assent et al. [162] proposed AnyOut to compute and detect outliers quickly at any time in streaming data. To identify outliers under constantly varying arrival rates at a given time, AnyOut uses a ClusTree to build a precise tree structure; the ClusTree is appropriate for anytime clustering.

Elahi et al. [163] proposed a k-means, clustering-based outlier detection technique for the data stream that splits the stream into chunks for processing; however, it does not fit well for grouped outliers. The experimental results illustrated that their method achieved better performance than some existing techniques [141], [164] for discovering significant outliers over the data stream. However, the authors still believe that integrating distance-based methods more firmly with the clustering algorithm could yield a better result. Moreover, finding other ways to assign the outlierness degree to the detected outliers in the data stream is another good research quest to investigate.

In another study, using a concept similar to the k-means of MacQueen et al. [149], together with a rule for the weights, the authors proposed a clustering-based framework to detect outliers in changing data streams [165]. They assign a weight to each feature with respect to its relevance. The weighted attributes are significant because they curb the effect of noise attributes in the algorithm process. When this technique is compared to LOF [8], it consumes less time, shows a higher outlier detection rate, and also has a lower false alarm rate. Even though the work showed improved performance over the baseline algorithm (LOF), it falls short in extending the algorithm to real-world data sets and in investigating its effects. Extending the algorithm to address this issue, and designing new scales for the outlierness degree in the data stream, can be an exciting future study. Hosein et al. [166] proposed a clustering-based technique, an advanced incremental k-means algorithm, for detecting outliers in big data streams.

In another work, an unsupervised outlier detection scheme that uses both density-based and partitioning-based schemes for streaming data was proposed by Bhosale et al. [167]. The main idea is based on partitioning clustering techniques [168], [148], which assign weights (using weighted k-means clustering) to attributes based on their relevance and adaptivity. The technique is incremental and can adapt to concept evolution. It has a higher outlier detection rate than [163]. The authors in this study suggested extending the work to mixed and categorical data for future research.

Moshtaghi et al. [169] used a clustering approach to propose a model that labels objects outside the cluster boundaries as outliers. The mean and covariance matrix are incrementally updated to observe the fundamental distribution changes in the data stream. Similar to [169], Moshtaghi et al., in another work, proposed eTSAD [170], an approach that models streaming data with established elliptical fuzzy rules; the fuzzy parameters of incoming data are updated as in [169], which helps in the detection of outliers. Salehi et al. [171] proposed an ensemble technique for evolving data streams: instead of modeling and updating the data streams over time, they proposed using an ensemble to generate clustering models, and the outlierness value of an incoming data point is calculated by utilizing only the applicable set of clustering models. Chenaghlou et al. [172] proposed an efficient outlier detection algorithm in which the concept of active clusters is presented for a better time- and memory-efficient outlier detection result. The input data is split into chunks, and for each arriving data chunk, the active clusters are identified; here, the models of the underlying distributions are also updated. Rizk et al. [173] proposed an optimized calculation algorithm that enhances the process of searching for outliers in both large and small clusters. Chenaghlou et al. [174] extended their work in [172] by proposing an algorithm that detects outliers in real time and also discovers the sequential evolution of the clusters.
Yin et al. [175] proposed some new and effective methods in the context of clustered text outlier detection. In their approach, documents with a low prospect of matching an existing cluster are referred to as outliers. They conclude that clusters that possess only one document in the result of Gibbs Sampling of the Dirichlet Process Multinomial Mixture (GSDPMM) are classified as outliers in the data set. Since GSDPMM has great potential for incremental clustering, how to relate GSDPMM to incremental clustering will be an interesting research direction for future work.

When designing clustering-based algorithms for outlier detection, the following questions are usually taken into consideration.
i. Whether the object belongs to a cluster or not, and whether the objects outside the clusters can be regarded as outliers.
ii. Whether the distance between the cluster and the object is large or small; if the object is distant, can it be regarded as an outlier?
iii. Whether the object belongs to an insignificant, smaller, or sparse cluster, and how to label the objects within such a cluster.

1) CLUSTERING-BASED APPROACHES-ADVANTAGES, DISADVANTAGES, CHALLENGES, AND GAPS
Advantages:
i. They are unsupervised methods, which makes them a suitable choice and very useful for outlier detection in data streams. After learning from the clusters, additional new points can be inserted and then tested for outliers; this makes them adaptable to an incremental mode. Also, since no prior knowledge of the data distribution is required, they are more suitable for incremental mode.
ii. They are robust to different data types. The hierarchical-based methods are versatile; they maintain good performance on data sets containing non-isotropic clusters and also produce multiple nested partitions, which give users the option to choose different partitions according to their similarity level.
iii. Partitioning cluster-related techniques are said to be relatively simple and scalable, which qualifies them for datasets with compact, well-separated spherical clusters.
Disadvantages, Challenges, And Gaps:
i. In clustering settings, outliers are binary; that is, there is no quantitative indication of an object's outlierness. They are also known for their lack of back-tracking ability; therefore, they can never undo what has already been done.
ii. Most clustering methods rely and depend on the users to specify the number of clusters in advance, which is a difficult task. In clustering methods, arbitrarily shaped clusters also cause some difficulties in realizing the exact clusters of the data. Therefore, most existing clustering algorithms require the number of clusters, in advance, and the shape of the clusters to be defined. However, in a data stream scenario, assuming a number of clusters in advance is very daunting.
iii. Partitioning methods are said to be highly sensitive to the initialization phase, outliers, and noise. Similar to the density-based clustering techniques, they also carry some setbacks with regard to the fact that they are highly sensitive to the setting of the input parameters. They have inadequate cluster descriptors and are mostly unsuitable for very large, high-dimensional datasets because of the curse of dimensionality. Furthermore, in some of the hierarchical methods [150], [152], the cost of clustering is enormous for high-dimensional and massive datasets. The vagueness of the termination criteria and the severe degradation of effectiveness in high-dimensional spaces as a result of the curse of dimensionality are other drawbacks.
Despite the challenges and drawbacks, clustering-based techniques are useful in outlier detection, especially in data streams. Clustering-based algorithms for the process of outlier detection in data streams have drawn the attention of researchers and are seen as an interesting domain. The challenges of choosing an appropriate cluster width and calculating the distance between objects in multivariate data are among the obstacles researchers try to solve. The detection rate is high in most cases, but these methods are also challenged by high false positives. The density clustering-based techniques handle noise, while the partitioning and hierarchical methods do not. These techniques have their strengths and weaknesses, and it is challenging to decide which one is superior to the others. Interesting future work will be to choose a suitable dataset and evaluate these methods using different evaluation metrics, as well as to build hybrid approaches using the pros of the different techniques; for instance, density-based clustering methods are suitable for clustering arbitrary shapes. Reducing the computational cost and improving the speed of partitioning-based and hierarchical methods on complex and large datasets is also another interesting direction. Other interesting future studies will be to propose algorithms that detect outliers in low-density regions, or where outliers lie within clusters with a small number of data points, and to suggest methods that calculate the distance of a data point to the nearest cluster centroid to detect the outliers efficiently.
From the selected clustering-based techniques mentioned, we can see that not much work has been done recently, to the best of our knowledge, using a thorough analysis of this technique for outlier detection. Therefore, we do not include any summary table for this technique, unlike for the other previously discussed approaches.
E. ENSEMBLE-BASED APPROACHES
Ensemble-based methods are generally used in machine learning as a result of their comparatively better performance when compared to other related methods, although ensemble-based techniques for outlier detection, when compared to other OD methods, have had very few reports [37], [38], [176]–[182]. However, they are often used in recent outlier detection problems [183], [184], and have more open challenges. Ensemble techniques are used in cases where one is prompted to answer the question of whether an outlier should be linear-model based, distance-based, or based on other kinds of models. They are usually applied in classification and clustering problems. They combine the results from dissimilar models to produce more robust models and thereby reduce the dependency of one model on a particular dataset or data locality. However, ensemble methods in the context of outlier detection are known to be very difficult. In recent years, several techniques have been introduced, including the following: (i) bagging [37] and boosting [184] for classification problems; (ii) isolation forest [192] for parallel techniques; (iii) sequential methods [185]; and Extreme Gradient Boosting Outlier Detection (XGBOD) [183] and the Bagged Outlier Representation Ensemble (BORE) [186] for the hybrid methods.
Lazarevic et al. [37] proposed the first known method for improving outlier detection using ensembles. It makes use of the feature bagging approach to handle very large, high-dimensional datasets. The technique combines the outputs of multiple outlier detection algorithms, each of which is run on a randomly designated subset of features. Each of the algorithms randomly selects a small subset of the real feature set and then assigns an outlier score to all the data records, corresponding to the probability of them being outliers. The outlier scores obtained from the different algorithms are then combined to get better quality outliers. Their experiments show that the combined method can produce a better outlier detection performance because it focuses on smaller feature projections from the combined multiple outputs and distinct predictions. However, considering how to fully characterize these methods for very large and high-dimensional datasets would be motivating future work. Also, examining the impact of shifting data distributions on detecting outliers for each round of the combined methods (not limited to distance-based approaches only, but other approaches as well) is worth considering.
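A compact sketch of the feature-bagging scheme just described follows. The essential ingredients, a base detector run on random subsets of between d/2 and d-1 features with the per-round scores combined, follow [37]; the function names, the cumulative-sum combination choice, and the defaults are ours.

```python
import numpy as np

def knn_score(X, k=5):
    """Distance to the k-th nearest neighbor, used here as the base detector."""
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(-1))
    np.fill_diagonal(dist, np.inf)
    return np.sort(dist, axis=1)[:, k - 1]

def feature_bagging(X, rounds=10, k=5, seed=0):
    """Score the data on a random feature subset each round and
    combine the rounds by averaging the accumulated scores."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    total = np.zeros(len(X))
    for _ in range(rounds):
        size = rng.integers(max(1, d // 2), d) if d > 1 else 1
        feats = rng.choice(d, size=size, replace=False)
        total += knn_score(X[:, feats], k=k)
    return total / rounds
```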
Aggarwal [178] presented a study on outlier ensemble analysis, which has recently provoked great interest in the literature [187], [236]. He discusses various outlier ensemble methods and how such outlier ensemble analyses can be made more effective. He further explained how these methods are connected to ensemble methods in data mining problems. Some examples of outlier ensembles in the context of classification and clustering were then given: in the classification context, boosting [187] and bagging (Bootstrap Aggregating) [37] are two examples of ensemble-based methods that have been proposed, while in the context of clustering, multiview clustering [188] and alternative clustering [189] serve as examples. Another critical aspect of this work is how to categorize ensemble outlier analysis problems: whether they are independent or sequential ensembles, and data-centered or model-centered ensembles. Ensemble algorithms are classified by ''component independence,'' which tries to answer the question of whether the different components of the ensemble are independent of one another. For example, in boosting, where the results depend on a prior execution, the components are not independent of one another, while bagging is the opposite: the components are independent. For the ''component type,'' each component of an ensemble is described according to its model choice or data choice; the ''model-centered'' type is independent, while the ''data-centered'' type is sequential. However, one cannot give an ultimate conclusion, because it might depend on the foundation of the data and models.

Other succeeding studies [38], [190], [191] in later years that focused on using ensembles for outlier detection faced numerous challenges. Some of these challenges include comparing scores produced by different functions, fitting mixture models to the outlier scores, and producing a score combination; in addition, issues arise concerning how to support the combination of different detectors or methods to form one ensemble. Schubert et al. [191] compared the outlier rankings based on the scores using similarity measures. A greedy ensemble technique was proposed as an application, which shows the significance of the performance of ensembles through diversifying approaches. Earlier, in 2010, Nguyen et al. [38] studied the difficulties of ensemble OD methods for high-dimensional datasets. They proposed a unified framework that combines non-compatible methods from different outlier detection algorithms. Instead of applying the same approach each time to determine the outlier score, various detection methods are applied to approximate the outlier score. Using the formal concept of the outlier score, they propose the Heterogeneous Detector Ensemble on random Subspaces (HeDES), built through combination functions, to address the issue of heterogeneity. Unlike the Lazarevic et al. [37] framework, HeDES can bring together different techniques that produce different outlier scores and score types, for instance, a real value against a binary value. Even though, from their experimental studies, the framework shows effectiveness in detecting outliers in a real-world data set, we believe that an orderly extension with further experiments on all possible combination functions, as well as extending the analysis to larger and higher-dimensional datasets, could be interesting future work.

Zimek et al. [176] proposed a random subsampling technique to estimate the nearest neighbors and then the local density. Usually, applying subsampling techniques to a given dataset obtains the training objects without replacement. This can improve and enhance the outlier detection method's performance. Using other outlier detection algorithms coupled with a subsampling technique can give a different set of results and higher efficiency.
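In its simplest form, the subsampling idea can be rendered as follows: each round scores every point against a random subsample drawn without replacement, and the rounds are averaged. This is our own hedged reading of the scheme in [176], with illustrative parameters.

```python
import numpy as np

def subsample_ensemble(X, rounds=10, sub_frac=0.2, k=5, seed=0):
    """Average, over several rounds, each point's distance to its k-th
    nearest neighbor within a random subsample of the data."""
    rng = np.random.default_rng(seed)
    N = len(X)
    scores = np.zeros(N)
    m = max(k + 1, int(sub_frac * N))
    for _ in range(rounds):
        idx = rng.choice(N, size=m, replace=False)
        S = X[idx]
        d = np.sqrt(((X[:, None, :] - S[None, :, :]) ** 2).sum(-1))
        d[idx, np.arange(m)] = np.inf    # subsampled points skip themselves
        scores += np.sort(d, axis=1)[:, k - 1]
    return scores / rounds
```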
In another work, Zimek et al. [180] considered their technique from the perspective of learning theory as another possible approach to ensemble outlier detection. To construct the outlier detection ensemble, the authors proposed a data perturbation technique that brings forth diversity in the different outlier detectors, and a method that combines the distinct outlier rankings. The main focus of their approach utilizes the notion of distance and density estimation in Euclidean-distance-type data spaces. To get a more consistent density estimate, the attribute values at each point are altered by adding small randomized amounts of noise. All the i perturbed, bootstrapped data sets then go through a selected outlier detection algorithm, which records each data point's identity, aggregates the scores, and then ranks the positions. The i outlier scorings (or rankings) are then combined to attain a steadier and more dependable outlier scoring of the data.

Pasillas-Diaz et al. [177] considered both subsampling and feature bagging techniques. The feature bagging technique is used to obtain the various features at each iteration, while the subsampling technique calculates the outlier scores on different subsets of the data. One key drawback of their method is the difficulty of obtaining the variance of the objects through feature bagging; also, the size of the subsampled dataset influences the sensitivity of the final result.

Zhao et al. [227] proposed the Dynamic Combination of Detector Scores for Outlier Ensembles (DCSO), an unsupervised outlier detector framework. DCSO tries to solve the challenge of choosing and combining the outlier scores of different base detectors in the absence of ground truth. It selects the most suitable base detectors, with a focus on the locality of the data: DCSO initially defines the local region of a test instance with respect to its k nearest neighbors, and then identifies the base detectors that show the best performance within that local region. Zhao et al. [228] proposed the Locally Selective Combination in Parallel Outlier Ensembles (LSCP) framework to address the same issues as in [227]. They use a similar approach as in [227] and presented four variations of the LSCP framework.

For more details and broader discussions about outlier ensemble techniques, the Aggarwal et al. [229] outlier ensemble book gives detailed treatments of outlier ensemble methods. Although most of the studies mentioned there were done before 2017, the book itself is very comprehensive and rich in detail for understanding outlier ensemble methods. It presents the different types of ensemble methods and categorizes them, and, in addition, it gives an overview of outlier ensemble design frameworks.

1) ENSEMBLE-BASED APPROACHES-ADVANTAGES, DISADVANTAGES, CHALLENGES, AND GAPS
Advantages:
i. They are more stable and give better predictive models. The availability of algorithms like boosting and bagging enables ensemble methods to perform more efficiently. They improve the robustness of the data mining process by minimizing the dependence of the model on a particular data set.
ii. They are suitable for outlier analysis in high-dimensional data; for example, Lazarevic et al. [37] applied feature bagging for outlier detection in high-dimensional data.
iii. In noisy and streaming scenarios, where an individual classifier's result is not very robust due to processing time and data quality problems, ensemble analysis is instrumental.
Disadvantages, Challenges, And Gaps:
i. Ensemble techniques in the context of detecting outliers are poorly developed when compared to other data mining problems. This is a result of the difficulties in evaluating the features of the ensembles; moreover, selecting the right meta-detectors is a difficult task.
ii. For real datasets, the outlier analysis can be very complex to evaluate due to the combination of a smaller sample space and its unsupervised nature. This can further result in incorrect prediction of the steps of the algorithm in making robust decisions without triggering the over-fitting problem.
Although outlier ensemble techniques have shown promising results, they still have areas for improvement. Ensemble analysis techniques can be very useful in areas where the data is noisy and in streaming scenarios. This is mainly because these scenarios are usually challenged by drawbacks, such as the quality of the data and the processing time, that make the results produced by individual classifiers not very robust. More techniques are being proposed to address the challenge of model combinations.
To address these challenges and many others posed by Zimek et al. [181], several additional methods have been proposed [38], [182], [190]–[192] to improve outlier detection using ensemble methods, but most of these methods are meta methods, except for those suggested by [37]. To further discuss and delve deep into outlier ensemble techniques, Zimek et al. [181] in their study presented several open questions and challenges in using ensemble methods for the detection of outliers. Although some new emerging research work has started to contribute to these open research problems [180], topics concerning the issues of proposing diversifying principles and how to combine outlier rankings remain open and engaging future research directions. Some techniques [181], [184] are static and do not involve any detector selection methods; this kind of technique [184], characterized by the absence of a detector selection process, hinders performance in identifying unknown outlier cases. Another significant aspect that is not given much attention is the importance of data locality. An open research problem will be to consider the data locality; that is, instead of only focusing on evaluating the competence of the base detector from a more global view, the local region with respect to the test objects can be considered as well. This will help in the detector selection and combination processes.
Other essential problems for further research are addressing the issue, among ensemble outlier methods, of creating a variety of models, and finding meaningful ways of combining the outlier rankings.

F. LEARNING-BASED APPROACHES
Learning-based methods for the process of outlier detection have been applied in different sub-disciplines of machine learning: in active learning, graph-based learning, and deep learning. In the subsequent sections, we introduce some research in outlier detection that makes use of these learning methods.

1) ACTIVE LEARNING
Active learning is an example of a semi-supervised learning method in which designed algorithms interact with the user or information source to get the desired outputs [193], [194]. For example, for some real datasets with huge amounts of unlabeled data, the task of manually labeling the data is expensive; such a scenario demands that the learning algorithm actively query the information source or user. When applying an active learning algorithm in such a scenario, the algorithm will be able to discover the smaller fraction of instances that were labeled by the user in the training data set; this is done to boost the improvement of the retrained model. Active learning resembles a system in which the learning algorithm can request input labels for instances from the user in order to give better predictions. Active learning for outlier detection has recently been embraced in different research domains [195]–[199].
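In code, the interaction pattern looks roughly like the sketch below. This is a generic query loop of our own construction, not the algorithm of any single cited paper: the analyst is repeatedly shown the highest-scoring unlabeled instances, and the scores are adjusted using the feedback. The damping radius and factor are illustrative.

```python
import numpy as np

def active_outlier_loop(X, base_scores, label_fn, budget=20, batch=5):
    """Query the analyst (label_fn(i) -> bool) on the top-scored unlabeled
    instances; confirmed inliers damp the scores of nearby instances so
    that later queries move on to other regions."""
    scores = base_scores.astype(float).copy()
    labeled, found = {}, []
    for _ in range(budget // batch):
        order = [i for i in np.argsort(-scores) if i not in labeled][:batch]
        for i in order:
            labeled[i] = label_fn(i)          # ask the human analyst
            if labeled[i]:
                found.append(i)               # confirmed outlier
            else:
                scores[i] = -np.inf           # never query this point again
                near = np.linalg.norm(X - X[i], axis=1) < 1.0
                scores[near] *= 0.5           # damp the confirmed-normal region
    return found, labeled
```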
Aggarwal et al. [6] use the concept of active learning in outlier detection to resolve the ambiguity of giving clear reasons why outliers are flagged, and to address the relatively high computational demand of density-estimation-based OD methods. In their approach, they initially apply classification techniques to a labeled dataset that contains potential (artificially generated) outliers. The active learning method is then applied to minimize the classification problem through a selective sampling mechanism known as ''ensemble-based minimum margin active learning.'' Gornitz et al. [200] proposed another work in which an active learning strategy is applied for anomaly detection. To obtain good predictive performance, they repeatedly alternate between the active learning process and the update of the model. The active learning rule is applied after the method is trained on unlabeled and improved examples.

Das et al. [196], [197] used an active approach that queries the human analyst to obtain a better result. They avidly select the best data instances for the querying process, but no clear insight, explanation, or interpretation of the design choice was given; in their next study, they tried to address these issues. In 2019, Das et al. [201] then proposed an active outlier detection method via ensembles called GLocalized Anomaly Detection (GLAD). They study how to automatically fit ensemble outlier detectors in active learning problems. In GLAD, the end-users maintain the use of modest and comprehensible global outlier detectors; this is attained by automatically learning their local weights on particular data instances by means of label feedback. The fine-tuning of the ensemble helps in maximizing the number of correct outliers discovered. This kind of framework is referred to as human-in-the-loop, because the human analyst gives label feedback in each round of iterations.

Even though active learning for outlier detection has recently been embraced in the research domain, such methods are still lacking in the literature, and there is still more work that needs to be done. The process of discovering true outliers by the human analyst can be difficult, so techniques that minimize the effect of false positives, through the design and configuration of an effective outlier detector, are needed to assist the human analyst in the future. In addition, better insights into, and interpretations of, the outlier scores and related results obtained by employing the different algorithms are needed. Active learning in the context of outlier detection needs solid interpretations and explanations for it to be well understood in the research community. Finally, the design of active learning algorithms for handling data streams is also a promising research challenge.

2) SUBSPACE LEARNING
The outlier detection methods mentioned to this point usually detect outliers in the complete data space, considering all the dimensions. But most outliers are often revealed as rare neighborhood activities in a lower-dimensional subspace. For objects with several attributes, Zimek et al. [179] note that only the subsets of important attributes give valuable information, while the residual attributes contribute little or nothing to the task and might delay the process. For an OD model to solve such an issue, it is interesting to perceive the outliers in a suitable subspace.

In the outlier detection field, subspace learning is widely studied and applied in high-dimensional problems. For subspace learning-based OD approaches, the main objective is to discover meaningful outliers in a well-organized way by examining dissimilar subsets of dimensions in the dataset. Mostly, these approaches are divided into sparse subspace [195], [196] and relevant subspace [126], [198], [202] methods. The former project the high-dimensional data points onto sparse, low-dimensional subspaces; the objects within a sparse subspace can then be labeled as outliers, since they are characterized by a lower density. One big drawback of these methods is the time consumed in exploring the sparse projections of the entire high-dimensional space. To address this drawback, Aggarwal et al. [6] proposed a method that improves the effectiveness of exploring the subspaces, which are found through an evolutionary algorithm. However, the performance of the algorithm is highly dependent on the initial population.
An additional method that follows the path of the sparse subspace approaches is the Zhang et al. [195] method. Here, the concept of a lattice is used to represent the subspace relationships, and the sparse subspaces are again those with low density coefficients. However, building the lattice hinders the efficiency of the method because of its complexity, and this results in low efficiency. Dutta et al. [196] proposed another way to achieve a sparse space: they applied sparse encoding to map objects to multiple linear transformation spaces. The relevant-subspace OD methods instead examine local information, which is useful for identifying outliers, since the relevant attributes are the essential features. Huang et al. [126] proposed Subspace Outlier Detection (SOD), a kind of relevant subspace method. Here, every object's correlation with its shared nearest neighbors is examined. They use the ranks of the individual objects that are close as the degree of proximity of the object, but do not take into consideration the objects' distance information with respect to their neighbors. SOD focuses mainly on the variances of the features. Muller et al. [202] proposed another method to determine the subspaces; they make use of the significant relationships among the features, unlike SOD, which only focuses on the variances of the features. However, a significant drawback of their method is its computational demand.

In a similar study, Kriegel et al. [17] applied principal component analysis to get the relevant subspaces, with a Mahalanobis distance computation under a gamma distribution to detect the outliers. The key difference from the previous study [202] is that a large amount of local data is needed to identify the deviation trend, which consequently affects the flexibility and scalability of the method. To address the issue of the flexibility of the technique, Keller et al. [85] designed a flexible OD technique that uses subspace searching and an outlier ranking process: initially, using a Monte Carlo sampling method, they obtain the High Contrast Subspaces (HiCS), and then combine LOF scores based on the HiCS. Stein et al. [203] then proposed a local subspace OD method adopting global neighborhoods in another study; initially, HiCS obtains all the relevant subspaces, and, instead of LOF, the LoOP technique is used to calculate the outlier scores.
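A simplified, global variant of the relevant-subspace idea of [17] can be sketched as follows: take the principal components that retain most of the variance as the relevant subspace, and score points by their Mahalanobis distance within it. The cited work does this locally, per neighborhood, with a gamma-distribution model on top; the version below is our illustration only.

```python
import numpy as np

def pca_mahalanobis_scores(X, var_kept=0.9):
    """Project onto the principal subspace keeping `var_kept` of the
    variance and return the squared Mahalanobis distance inside it."""
    Xc = X - X.mean(0)
    cov = np.cov(Xc, rowvar=False)
    vals, vecs = np.linalg.eigh(cov)              # ascending eigenvalues
    order = np.argsort(vals)[::-1]
    vals, vecs = vals[order], vecs[:, order]
    r = np.searchsorted(np.cumsum(vals) / vals.sum(), var_kept) + 1
    Z = Xc @ vecs[:, :r]                          # coordinates in the subspace
    return np.sum(Z ** 2 / vals[:r], axis=1)      # squared Mahalanobis distance
```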
Although subspace learning OD methods show high efficiency and are useful in some cases, they are generally computationally expensive. This is because, in subspace learning methods, there is a prerequisite of exploring the subspaces of a high-dimensional space. Discovering the relevant subspaces for the outliers can also be a difficult task. Designing and proposing effective methods to handle these challenges can be exciting research for the future of subspace-related OD methods.

3) GRAPH-BASED LEARNING METHODS
The use of graph data is becoming universal in many domains, and using graph-based learning for OD methods has been the focus of some researchers. In graphs, objects usually take the form of long-range connections, and a set of new techniques has been proposed for outlier detection in graph data. Akoglu et al. [34] presented a comprehensive survey of graph-based outlier detection techniques and descriptions. They included state-of-the-art methods and some open research challenges and questions. Furthermore, the importance of adopting graphs for outlier detection techniques was given: the graph-based approach to outlier detection is vital, as graphs show the inter-dependent nature of the data, provide insightful representations, and offer robust machinery.
Moonesinghe et al. [204] proposed OutRank, which is among the first graph-based outlier detection frameworks. From the original data set, they build fully connected undirected graphs and apply the Markov random walk method on the predefined graph; the stationary distribution values of the random walk serve as the outlier scores. In a more recent study, Wang et al. [205] proposed a novel method that combines the representation of the graph with each object's local information from its surroundings. They address the problem of high false-positive rates in OD methods, which is usually a result of graph-based methods neglecting the local information around each node. The local information obtained from around each object helps in the construction of a local information graph, and the outliers are detected by computing the outlier scores through a random walk process on the graph. Wang et al. [206], in another study, proposed another OD method that captures different local information from different standpoints: they used multiple neighborhood graphs, and the outlier scores are deduced through a random walk on the predetermined graphs. These methods all show improved performance, as claimed by the authors. However, since graph-based learning methods have not yet been widely embraced, this is another domain for outlier detection research in the future.
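The OutRank-style scoring just described is short to sketch: build a fully connected similarity graph, normalize it into a Markov transition matrix, and take the stationary distribution of the random walk, where a low stationary probability indicates an outlier. The Gaussian similarity, the bandwidth, and the tolerance below are our illustrative choices.

```python
import numpy as np

def outrank_scores(X, sigma=1.0, tol=1e-10, max_iter=10000):
    """Stationary distribution of a random walk on a fully connected
    similarity graph; poorly connected points get little probability,
    so LOW values indicate outliers."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-d2 / (2 * sigma ** 2))       # Gaussian similarity
    np.fill_diagonal(W, 0.0)
    P = W / W.sum(axis=1, keepdims=True)     # row-stochastic transition matrix
    pi = np.full(len(X), 1.0 / len(X))
    for _ in range(max_iter):                # power iteration
        nxt = pi @ P
        if np.abs(nxt - pi).sum() < tol:
            break
        pi = nxt
    return pi
```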
In unsupervised deep OD methods, the outliers are detected challenges for future research work.
exclusively on the essential features of the data instances.
Here, the data samples are unlabeled, and unsupervised OD 5) LEARNING-BASED APPROACHES-ADVANTAGES,
techniques are used to label the unlabeled data samples. DISADVANTAGES, CHALLENGES, AND GAPS
In most of unsupervised deep OD models, autoencoders play
a: ADVANTAGES
a central role [210], [238]. Most emerging research studies
In OD based learning methods, such as in active learning,
adopting deep learning techniques for OD methods utilize
the time-consumption in detecting outliers it reduced since
unsupervised methods. Using deep learning for unsupervised
the technique is not passive learning. It helps to reduce the
outlier detection problems has shown to be effective [211],
number of labeled data needed for training the model to
[212]. They are mostly categorized into the model architec-
discover the outliers. Graph-based methods show the vital
ture adopting autoencoders [213] and hybrid models [214].
inter-dependent state of the data and provide an insightful rep-
The autoencoder related models assess the anomalies through
resentation for detecting the outliers. The deep learning tech-
reconstruction errors, i.e., employing the magnitude of the
niques help in delivering and showing better ways of learning
residual vector, whereas, in the later models, the autoencoder
the hierarchical discriminative features from the data. They
is used as the feature extractor, and then the hidden layers
provide better ways of detecting outliers in large-scale data.
represent the input. In another study for deep learning models,
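To make the reconstruction-error idea concrete, the following is a minimal, hedged sketch (not any cited author's implementation), assuming tabular data in a NumPy array and using PyTorch; the hidden size, epoch count, and threshold quantile are illustrative assumptions only.

```python
# Minimal sketch: autoencoder-based outlier scoring via reconstruction error.
# Assumes tabular data in a NumPy array X (n_samples x n_features); the hidden
# size, epochs, and threshold quantile are illustrative, not prescriptive.
import numpy as np
import torch
import torch.nn as nn

def autoencoder_scores(X: np.ndarray, hidden: int = 8, epochs: int = 50) -> np.ndarray:
    x = torch.tensor(X, dtype=torch.float32)
    d = x.shape[1]
    model = nn.Sequential(                  # simple encoder-decoder pair
        nn.Linear(d, hidden), nn.ReLU(),
        nn.Linear(hidden, d),
    )
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):                 # fit on (mostly) normal data
        opt.zero_grad()
        loss = loss_fn(model(x), x)
        loss.backward()
        opt.step()
    with torch.no_grad():                   # outlier score = magnitude of the
        residual = model(x) - x             # residual vector per sample
        return residual.pow(2).sum(dim=1).sqrt().numpy()

# Usage: samples whose score exceeds, e.g., the 95th percentile are flagged.
# scores = autoencoder_scores(X); outliers = scores > np.quantile(scores, 0.95)
```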
In another study on deep learning models, Hendrycks et al. [215] proposed Outlier Exposure (OE) to improve outlier detection performance. They offered an iterative method to find a suitable classifier for the model to learn heuristics, which helps in differentiating between the outliers and in-distribution samples.
In another study, Du et al. [216] proposed DeepLog, a universal framework that adopts a deep neural network approach for online log outlier detection and analysis. DeepLog utilizes Long Short-Term Memory (LSTM) networks to model the system log, and whole log messages are learned and encoded by the framework. Here, anomaly detection is performed at the level of every log entry, in contrast to the per-session approach of other methods.
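As an illustration of the per-entry idea, here is a hedged sketch in the spirit of DeepLog rather than the authors' released code: an LSTM predicts the next log key, and an entry is flagged when its observed key is not among the top-g predictions. The class name, window handling, and parameter g are our own illustrative assumptions.

```python
# Sketch of DeepLog's core idea (not the authors' code): treat parsed log keys
# as a sequence, train an LSTM to predict the next key, and flag a log entry as
# anomalous when the observed key is not among the model's top-g predictions.
import torch
import torch.nn as nn

class NextLogKey(nn.Module):
    def __init__(self, n_keys: int, embed: int = 16, hidden: int = 32):
        super().__init__()
        self.emb = nn.Embedding(n_keys, embed)
        self.lstm = nn.LSTM(embed, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_keys)

    def forward(self, window):              # window: (batch, seq_len) of key ids
        h, _ = self.lstm(self.emb(window))
        return self.out(h[:, -1, :])        # logits for the next log key

def is_anomalous(model, window, observed_key, g: int = 3) -> bool:
    # (training against observed next keys is omitted here for brevity)
    with torch.no_grad():
        logits = model(window)
        topg = torch.topk(logits, g, dim=-1).indices[0]
    return observed_key not in topg.tolist()  # per-entry decision, as in DeepLog
```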
Borghesi et al. [217] proposed a new way of detecting anomalies in High-Performance Computing Systems (HPCS) by adopting a type of neural network called the autoencoder. They first choose a set of autoencoders and train them to learn the normal behavior pattern of the supercomputer nodes; after the training phase, the autoencoders are applied to identify abnormal conditions.
In deep OD methods, based on the training objectives, these methods can employ Deep Hybrid Models (DHM) or One-Class Neural Networks (OC-NN) [32]. The DHM uses deep neural networks and focuses primarily on autoencoders for feature extraction; the learned hidden representation of the autoencoder serves as the input for detecting outliers with classical OD algorithms such as the One-Class SVM. Although hybrid approaches maximize outlier detection performance, a notable limitation is the shortage of a trainable objective designed solely for outlier detection; therefore, the DHM is limited in extracting rich differential features to detect the outliers. To solve this drawback, Chalapathy et al. [218] and Ruff et al. [219] proposed One-Class Neural Networks and Deep One-Class Classification, respectively. The One-Class Neural Network (OC-NN) combines the advantage of deep networks' ability to extract rich feature representations of the data with the benefit of the one-class objective of creating a close-fitting boundary around the normal data.
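The hybrid (DHM) pattern described above can be sketched as follows, under loud assumptions: for brevity a fixed random projection stands in for a trained encoder's hidden representation, and the One-Class SVM hyperparameters are arbitrary illustrative choices.

```python
# Minimal sketch of the hybrid (DHM) pattern: a learned feature representation
# is handed to a classical one-class detector. A random projection stands in
# for a trained autoencoder's hidden layer purely for illustration.
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 20))               # mostly normal training data
X_test = np.vstack([rng.normal(size=(95, 20)),
                    rng.normal(loc=6.0, size=(5, 20))])  # 5 injected outliers

W = rng.normal(size=(20, 8))                       # stand-in "encoder": X -> hidden
encode = lambda X: np.tanh(X @ W)

ocsvm = OneClassSVM(nu=0.05, kernel="rbf").fit(encode(X_train))
pred = ocsvm.predict(encode(X_test))               # -1 = outlier, +1 = inlier
print("flagged outliers:", np.where(pred == -1)[0])
```

In a real DHM, the projection above would be replaced by the trained encoder from the autoencoder sketch given earlier.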
Deep learning-based OD techniques are still an active area to be explored further and are promising for future work. In the discussion section, we propose and recommend some open challenges for future research work.

3) GRAPH-BASED LEARNING METHODS
The use of graph data is becoming universal in many domains, and using graph-based learning for OD methods has been the focus of some researchers. In graphs, objects usually take the form of long-range connections, and a set of new techniques has been proposed for outlier detection in graph data. Akoglu et al. [34] presented a comprehensive survey of graph-based anomaly detection and description techniques.

5) LEARNING-BASED APPROACHES: ADVANTAGES, DISADVANTAGES, CHALLENGES, AND GAPS

a: ADVANTAGES
In learning-based OD methods such as active learning, the time consumed in detecting outliers is reduced since the technique is not passive learning, and it helps to reduce the amount of labeled data needed for training the model to discover the outliers. Graph-based methods expose the vital inter-dependent state of the data and provide an insightful representation for detecting the outliers. Deep learning techniques help in delivering better ways of learning the hierarchical discriminative features from the data, provide better ways of detecting outliers in large-scale data, and offer better ways to set the boundary between normal and unusual behavior in continuously evolving data sets.

b: DISADVANTAGES, CHALLENGES, AND GAPS
Some learning-based techniques, such as subspace learning, can be computationally expensive, and it is challenging to discover the relevant subspaces for the outliers. In areas like deep learning, with an increase in the volume of data it becomes a big challenge for traditional methods to scale well enough to detect outliers; the need to design deep learning OD techniques that capture complex structures in large-scale data is crucial. In addition, the traditional manual learning process of extracting features from the data has many disadvantages, so finding better ways of learning the hierarchical discriminative features from the data is vital. The lack of an accurate representation of the boundary between normal and abnormal instances also presents challenges for both traditional and deep learning-based methods; addressing these challenges can be interesting work in the future. There are still limited studies using unsupervised algorithms such as Long Short-Term Memory networks (LSTM), Recurrent Neural Networks (RNN), Deep Belief Networks (DBN), etc. in the area of outlier detection. For in-depth knowledge and more references, we suggest the surveys by Chalapathy et al. [32] and Kwon et al. [30].

IV. EVALUATION TECHNIQUES, TOOLS, AND DATASETS FOR OUTLIER DETECTION PROBLEMS
A. EVALUATION METHODS
Many outlier detection algorithms have been proposed over the years, but one major challenge in data mining research is how to evaluate these methods, and different techniques have been proposed to tackle this problem. Over the years, with the increasing flow of outlier detection algorithms, some researchers have claimed that their method outperforms other methods without a thorough analysis from a broader perspective. Therefore, researchers have seen this as an open research direction: finding ways of evaluating different algorithms. In recent years, some research studies have concentrated on the evaluation of OD methods, for example, for distance-based and ensemble-based approaches and for unsupervised methods.
Distance-based methods have accumulated a considerable amount of literature over the last decade, and evaluating these methods is very significant; however, there are many challenges in assessing these techniques against each other. The main interest of most researchers and practitioners in outlier detection problems is effectiveness and efficiency. In terms of efficiency, for example, evaluating the efficiency of different methods can be quite complicated because the performance depends on factors such as the size of the data, the dataset dimensionality, parameter choices, and other details related to the implementation.
Orair et al. [220] focused on evaluating several distance-based outlier detection approaches to infer some useful guidelines for designing optimized outlier detection algorithms. An outlier detection framework was implemented, and a factorial experiment was conducted on a couple of the optimization strategies proposed in recent times, in order to evaluate the pros and cons of these optimization techniques. The results obtained from the factorial experiment provide significant and interesting insights. For instance, they found that certain combinations of optimizations work efficiently for real and synthetic data sets; however, no combination of optimizations can claim to be superior to the others on all types of data. The authors considered three optimization techniques:
i. Approximate Nearest Neighbor Search (ANNS).
ii. Pruning: the preprocessing step used by several algorithms to facilitate the partitioning or clustering of data points. The pruning schemes considered by the authors are:
• Pruning Partitions during the Search for Neighbors (PPSN).
• Pruning Partitions during the Search for Outliers (PPSO).
iii. Ranking: the main objective is to improve the efficiency of the ANNS pruning rule. There are two sub-categories of optimization strategies:
• Ranking Object Candidates for Neighbors (ROCN).
• Ranking Object Candidates for the Outlier (ROCO).
The authors classified their work according to whether these algorithms go through a clustering preprocessing phase, the type of pruning scheme used, and whether an object's candidates can be ranked according to neighbors and outliers. From their study, it is apparent that one cannot justify the effectiveness of any single optimization or combination of optimizations over the others, as it always depends on the characteristics of the dataset.
In another study, Achtert et al. [43] proposed a visualization tool [221], [222] to compare and evaluate outlier scores for high-dimensional data sets. Over the years, many approaches have expressed the degree to which an object is considered an outlier through an outlier score or factor. However, these outlier scores or factors vary in their contrast, range, and definition among the various outlier models, which makes it quite difficult for a user new to OD methods to interpret the outlier score or factor. In some cases, even for the same or a similar outlier model, the same score within one data set can depict a different outlier degree; for illustration, the same outlier score x in database y and database z can correspond to considerably different degrees of outlierness. This also makes it very tedious to interpret and compare outlier scores. In addition, different models make various assumptions, which might directly influence the interpretation of the degree of outlierness and how an outlier is defined in the same or in different datasets.
Not much provision for better evaluation techniques has been made in recent studies; most studies concentrate on introducing new methods to improve the detection rate and computational time. In contrast to classification problems, evaluating the performance of outlier detection algorithms is more complicated. Researchers have provided several adopted measurements to evaluate outlier detection algorithm performance [223]. They are defined as follows:
i. Precision - denotes the ratio of the number of correct outliers m to the whole number of outliers t. In a particular application, setting t can be difficult; therefore, t is usually assigned as the number of outliers in the ground truth.
ii. R-precision - the proportion of correct outliers among the top number of ground-truth potential outliers identified. The R-precision does not carry enough information because the number of true outliers is minimal compared to the total size of the data.
iii. Average precision - denotes the average of the precision scores over the ranks of the outlier points; it combines recall and precision.
iv. Receiver Operating Characteristic (ROC) and Area Under the Curve (AUC) - the ROC is a graphical plot that shows the true positive rate against the false positive rate. The true or false positive rate signifies the number of outliers or inliers ranked among the potential outliers in the top number of outliers in the ground truth. The AUC gives a single numerical evaluation of the outlier detection method's performance.
v. Correlation coefficient - a numerical measure of correlation, i.e., a statistical relationship between two variables, for instance Spearman's rank correlation or the Pearson correlation. More importance is placed on the possible outliers ranked at the top.
vi. Rank power (RP) - ranks the true outliers at the top and the normal ones at the bottom; it comprehensively evaluates the ranking of the true outliers.
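A hedged sketch of how several of the listed measures can be computed on a toy ranking follows; the labels and scores are fabricated purely for illustration, and with t taken from the ground truth the precision-at-t value coincides with the R-precision.

```python
# Toy illustration of measures i-iv: y holds ground-truth labels (1 = outlier)
# and s holds detector scores (higher = more outlying); both are fabricated.
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

y = np.array([0, 0, 1, 0, 1, 0, 0, 0, 1, 0])   # 3 true outliers (t = 3)
s = np.array([.1, .2, .9, .3, .8, .2, .1, .4, .35, .05])

t = int(y.sum())                                 # i. precision among the t
top_t = np.argsort(-s)[:t]                       # top-ranked points; with t from
precision_at_t = y[top_t].mean()                 # the ground truth this is also
                                                 # the R-precision (ii.)
ap = average_precision_score(y, s)               # iii. average precision
auc = roc_auc_score(y, s)                        # iv. area under the ROC curve
print(precision_at_t, ap, auc)
```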
Most of the evaluation methods are instead heuristic and focus on precision, the receiver operating characteristic (ROC) curve, and the area under the curve (AUC) when presenting results. The drawback of these evaluation procedures is that there is no provision for a similarity check among the methods, and knowing how similar or correlated the rankings of outlier scores are is considered a very significant step towards constructing better OD methods. AUC completely disregards the small variations among scores and only considers the ranking. It is also inferior for unbalanced class problems when compared to techniques such as the area under the precision-recall curve, which is better at highlighting small detection changes. Despite these drawbacks, however, AUC, ROC, and precision-recall still serve as the de facto standards for evaluating many outlier detection problems. Since knowing how similar or correlated the rankings of outlier scores are is a very significant step towards constructing better OD methods, Schubert et al. [191] in their study gave a global view that permits the evaluation of the performance of different approaches against each other. The proposed framework considers the problem of class imbalance and then offers a new understanding of the similarity and redundancy of prevailing outlier detection techniques. To achieve the main objective of better evaluating both outlier rankings and scores, a suitable correlation measure for comparing rankings while taking the outlier scores into account was established.
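A minimal sketch of such a similarity check, assuming two detectors' score vectors for the same objects (toy values here), is the following rank-correlation comparison.

```python
# Comparing how correlated the outlier rankings of two detectors are via
# Spearman's rank correlation; the score vectors are toy stand-ins for the
# outputs of any two OD methods on the same data.
import numpy as np
from scipy.stats import spearmanr, pearsonr

scores_a = np.array([0.9, 0.1, 0.4, 0.8, 0.2])  # e.g., LOF-style scores
scores_b = np.array([0.7, 0.2, 0.5, 0.9, 0.1])  # e.g., kNN-distance scores

rho, _ = spearmanr(scores_a, scores_b)          # rank similarity of the methods
r, _ = pearsonr(scores_a, scores_b)             # linear correlation of raw scores
print(f"Spearman rho={rho:.2f}, Pearson r={r:.2f}")
```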
In another study, Goldstein and Uchida [97] presented a comparative evaluation of nineteen different unsupervised outlier detection algorithms on ten publicly available datasets. The main aim was to address the lack of literature [224], [225] that gives a thorough evaluation of outlier detection algorithms. One notable trend in the existing research literature is the comparison of newly proposed algorithms with some previous or state-of-the-art methods; however, most of these studies fail to publish the datasets with appropriate preprocessing or to indicate for which application scenarios they are most suitable. They also mostly lack a clear understanding of the effect of the parameter k and established criteria for whether an outlier is local or global. The authors address these issues by performing an evaluation study that reveals the effect of parameter settings, the computational cost, and the overall strengths and weaknesses of the different algorithms. The algorithms were categorized into nearest-neighbor-based methods, statistical-based methods, clustering-based methods, subspace methods, and classifier-based techniques, and were then compared. For the KNN methods, the choice of the parameter k is very significant as it influences the outlier score; other important aspects to consider are the dataset, the dimensionality, and normalization. They experimented with the influence of the parameter and then evaluated the nearest-neighbor algorithms. A key finding from their study was that local outlier detection algorithms such as LOF [8], INFLO [75], COF [80], and LoOP [81] are not suitable for detecting global outliers, since they showed poor performance on datasets comprised of global outliers; the opposite holds for global outlier detection methods applied to local outlier detection problems. In addition, they found that the clustering-based algorithms were in most cases inferior to the nearest-neighbor-based algorithms. Therefore, it is recommended to apply nearest-neighbor techniques for a global task, while for a local task the local outlier algorithms like LOF are more suitable than clustering-based methods.
Another issue with regard to evaluating most OD models is the scarcity of rich knowledge about the strengths and weaknesses of these outlier detection models, about suitable benchmark datasets for the outlier detection task, and about some biases used in the evaluation process that are not well understood. Campos et al. [226], similar to [97], did an experimental study across a wide variety of specific datasets to observe the performance of different unsupervised outlier detection algorithms. In their study, they classified different datasets and deliberated on how suitable they are as outlier detection standard datasets. They further discuss and examine the commonly known and used methods/measures for comparing outlier detection performance. Some common misconceptions the authors clarify concern, for instance, ground-truth datasets containing a large number of outliers: it is sometimes believed that these outliers will influence the evaluation performance of the methods, but this does not hold in all scenarios. Usually, large proportions of outliers in datasets are not suitable for evaluating outlier detection techniques, because outliers are supposed to be less common in the datasets; a small percentage of outliers and normalized datasets usually produce much better performance in most cases. Another critical misconception concerns the influence of dimensionality: an increase in dimensionality often results in a high computational cost but is not directly proportional to the overall performance, especially in terms of the detection rate.
Another important area to consider is the evaluation of outliers in data streams. Outlier detection in data streams is usually a difficult task, because the data should be learned and processed in real time while concurrently making good predictions. Except for the Numenta Anomaly Benchmark (NAB) framework of Lavin et al. [230], there is a lack of benchmarks to effectively test and score the effectiveness of real-time outlier detection methods. With more recent studies concentrated in this domain, there is a need to propose efficient and rigorous benchmarks to effectively evaluate real-time outlier detection algorithms in data streams.
B. TOOLS FOR OUTLIER DETECTION
In outlier detection, many tools and datasets have been used. Here, we introduce some popular tools used for outlier detection processes and some outlier detection databases. The prevalence of outlier detection in industrial applications has seen the development of many software tools, such as the following.
1) Scikit-learn Outlier Detection [231]
The scikit-learn project offers some machine learning tools that can be applied to outlier detection problems. It includes algorithms such as LOF [8] and Isolation Forest [192].
2) Python Outlier Detection (PyOD) [232]
PyOD is used for detecting outliers in multivariate data. It is a scalable Python tool that has been used in many research and commercial projects, and it includes new deep learning and outlier ensemble models [60], [62], [233].
3) Environment for Developing KDD-Applications Supported by Index-Structures (ELKI) [43]
ELKI is an open-source data mining framework that provides a collection of data mining algorithms, including OD algorithms. It allows easy and fair assessment and benchmarking of OD algorithms. It is written in Java.
4) RapidMiner [234]
The extension of this tool contains many popular unsupervised outlier detection algorithms such as LOF, COF [80], LOCI [82], and LoOP [81].
5) MATLAB [235]
MATLAB also supports many outlier detection algorithms and functions, and algorithms can be implemented conveniently because MATLAB is user-friendly.
6) Massive Online Analysis (MOA) [143]
MOA is an open-source framework that provides a collection of data stream mining algorithms. It includes some distance-based outlier detection algorithms, such as COD, ACOD, Abstract-C, and MCOD, and some tools for evaluation.
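As a brief illustration of the first two tools, the following hedged sketch uses standard scikit-learn detectors and notes the analogous PyOD pattern; the data are random and all parameter choices are illustrative assumptions.

```python
# Minimal usage sketch for scikit-learn's outlier detectors, with the analogous
# PyOD pattern noted in comments; random data, illustrative parameters only.
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

X = np.random.RandomState(42).normal(size=(200, 5))

iso = IsolationForest(random_state=42).fit(X)
iso_labels = iso.predict(X)                 # -1 = outlier, +1 = inlier

lof = LocalOutlierFactor(n_neighbors=20)
lof_labels = lof.fit_predict(X)             # same -1/+1 convention

# PyOD equivalent (if installed): detectors expose decision_scores_ / labels_.
# from pyod.models.knn import KNN
# clf = KNN().fit(X)
# outliers = clf.labels_ == 1               # 1 = outlier in PyOD's convention
```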
C. DATASETS FOR OUTLIER DETECTION
Outlier detection methods have been applied to different kinds of data, such as regular and high-dimensional data sets [240], streaming datasets, network data, uncertain data [241], and time series data. In the outlier detection literature, two types of data are mostly considered and required for evaluating the performance of the algorithms: real-world datasets and synthetic datasets. The real-world datasets can be obtained from publicly available databases. Some of the most popular and useful databases that contain real-world datasets for outlier detection include the following:
1) The UCI repository [52]. The UCI repository has hundreds of freely available data sets, and many OD methods use the repository to evaluate the performance of the algorithms. However, the majority of these datasets are designed for classification methods. In outlier detection scenarios, the generally used approach is to preprocess the datasets: the outliers represent objects in the minority class, and the rest are considered normal.
2) Outlier Detection Datasets (ODDS) [51]. Unlike the UCI repository, ODDS provides open access to a collection of datasets suitable only for the outlier detection process. The datasets are grouped into different types, including multi-dimensional datasets, time series univariate and multivariate datasets, and time series graph datasets.
3) ELKI Outlier Datasets [50]. ELKI has a collection of data sets for outlier detection and also many data sets for OD method evaluation. These data sets are used to study the performance of several OD algorithms and parameters.
4) Unsupervised Anomaly Detection Dataverse [49]. These datasets are used for evaluating unsupervised outlier detection algorithms by comparison with standards. They are obtained from multiple sources, with the majority of the data sets coming from supervised machine learning datasets.
It is important to note that with real-world data sets, a lot of data is not publicly accessible due to privacy and security concerns.
Synthetic datasets are often created under the settings of defined constraints and conditions. Compared to real-world datasets, synthetic datasets are mostly less complex and eccentric, and they show better validity of OD algorithm performance. For the outlier detection process, since most of the adopted data are not purpose-specific for OD methods, the repurposing of supervised classification data has been widely adopted; in many studies, the data has been treated as it is rather than manipulated.
As stated earlier, in outlier detection experiments there is a need to use both real-world and synthetic data sets to evaluate the OD methods. Also, many benchmark datasets are required to develop an algorithm that captures a broader view of the problems, and the availability of many benchmark datasets helps in reporting and presenting the results in a better and more robust way. Most supervised classification datasets require some preprocessing for outlier detection tasks. Two important aspects are considered in the preprocessing phase [226]: for semantically significant outlier datasets, the outliers are the classes related to the minority objects, and the normal data is the rest of the data.
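A small sketch of this repurposing convention, assuming a labeled classification set (X, y) and an illustrative 5% contamination rate of our own choosing, is shown below.

```python
# Sketch of the preprocessing convention described above: treat the minority
# class of a classification dataset as "outliers" and downsample it so that
# outliers stay rare. The 5% rate and names are illustrative, not a standard.
import numpy as np

def to_od_benchmark(X, y, outlier_class, rate=0.05, seed=0):
    rng = np.random.default_rng(seed)
    normal = y != outlier_class                        # majority class = inliers
    candidates = np.where(~normal)[0]
    n_out = max(1, int(rate * normal.sum()))           # keep outliers rare
    keep = rng.choice(candidates, size=min(n_out, candidates.size), replace=False)
    idx = np.concatenate([np.where(normal)[0], keep])
    labels = np.zeros(idx.size, dtype=int)
    labels[normal.sum():] = 1                          # 1 marks the kept outliers
    return X[idx], labels
```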
When choosing a data set for OD methods, the data should be tailored in terms of precise and meaningful attributes which can fit the problem definition. For example, for an OD method related to data streams, it is better to use streaming data rather than other kinds of data. The selected algorithm should fit the data in terms of the right attribute types, the correct distribution model, the speed and scalability, and other anticipated incremental capabilities that can be managed and modeled well upon the arrival of new objects.
Some of the other concerns in dealing with datasets include how to handle the downsampling of data, dealing with duplicate data, transforming categorical attributes to numeric types, normalization, and dealing with missing values. In future work, it will be crucial to study how to evaluate datasets for outlier detection methods and what key attributes to take into consideration.

V. CONCLUSION AND OPEN RESEARCH GAPS
In this paper, we have provided a comprehensive, structured survey that reviews state-of-the-art methods of detecting outliers by grouping them into different categories. We have grouped the algorithms into density-based, statistical-based, distance-based, clustering-based, ensemble-based, and learning-based approaches. In our discussion sections, we examined their most significant advantages, drawbacks, and challenges. Furthermore, we reviewed state-of-the-art open research problems and challenges, and we discussed the evaluation techniques, tools, and data sets adopted for outlier detection methods. This provides researchers with in-depth knowledge of the fundamental requirements of these techniques before choosing a particular technique for an outlier detection problem.
From our review, it is evident that despite the progress in outlier detection research, there are still many open research questions and issues to be addressed. It is apparent that future studies are needed in most of the outlier detection approaches. Therefore, in addition to the future work already stated in each of the categories, the following open research gaps remain:
• Further studies need to be done to fully characterize and relate some of these methods to real-life data, particularly in very large and high-dimensional databases, where first-hand techniques for estimating data densities are worth bearing in mind. In high-dimensional data sets, the curse of dimensionality and distance concentration are still open challenges to be addressed.
• Outliers usually show unusual local behaviors. Discovering these local correlations in high-dimensional space is challenging, and accurately determining the correlations makes the issue more complicated. Solving these challenges is still an open research problem.
• It would also be thought-provoking to examine the influence of extraneous features in outlier detection processes so as to select the appropriate features for the outlier detection task.
• Since a vast amount of data now comes in the form of data streams, which are characterized by the issues mentioned earlier, it will be of interest for further research work to address these challenging issues to detect outliers more efficiently. In the presence of high-dimensional data, most existing data stream algorithms for OD methods lose their effectiveness; therefore, future studies are needed on how to redesign contemporary models to detect outlying patterns correctly and efficiently.
• The recent explosion of massive datasets gives rise to many openings for future research relating to the design of efficient approaches to identify outliers, which are usually the most significant points within the data set; their discovery can lead to vital and unforeseen insights. We also suggest the need for designing robust outlier detection algorithms that are scalable, can handle high-dimensional data sets, and have a minimal run time.
• We found that in distance-based methods based on KNN, the parameter k is sensitive; therefore, setting and finding the appropriate k is very significant and worth considering for neighbor ranking-based OD methods. Also, the distance metrics usually adopted for neighbor-based approaches do not fit well for high-dimensional data, so addressing the equidistance issue and introducing effective distance metrics is necessary for high-dimensional data. Neighbor-based OD algorithms are sensitive to the nearest neighbors chosen for the models; therefore, further studies can be done on how to determine the precise number of neighbors needed.
• In statistical methods, apart from designing more robust algorithms for detecting outliers more efficiently, we noticed that, to the best of our knowledge, no work has been done to compare the influence of parametric and non-parametric approaches in the outlier detection process. It is essential for researchers to know the pros and cons of the parametric and non-parametric approaches and to design algorithms that can outperform and address some of the drawbacks of statistical OD methods.
• For clustering techniques, since they are generally not designed explicitly for outlier detection, ensemble techniques, which combine the results from dissimilar models to produce a more robust model, will create much better results. Ensemble methods are well known to improve the performance of outlier detection in both the quality of the detected outliers and the run time. Therefore, ensemble outlier detection, which shows great potential in enhancing outlier detection algorithms, can be another worthy future research direction, and more accurate models can be proposed to address the unexplored areas.
• For learning methods such as subspace-based and ensemble-based learning methods, with a large range of subspaces or base learners we often get reasonably good performance. Therefore, finding ways to choose
the precise subspaces and base learners is important. Also, choosing the right quantities and combination strategies are all still open research issues to address.
• In terms of evaluating OD methods, it is still an open challenge how to effectively and broadly assess OD method performance. This has been difficult to achieve because outliers are not frequent and the ground truth in real situations is often absent.
• Another exciting open research direction for the future is the arrival or loss of new or existing dimensions over time. In potential application areas such as outlier detection in IoT (Internet of Things) devices, the sensors can be on or off sporadically over a period, and new ways are needed to detect outliers more efficiently in this challenging scenario.
• Although considerable advances have been made in using deep learning methods in other application areas, there is a relative shortage of deep learning methods for outlier detection problems. Therefore, the use of deep learning techniques for OD methods is still open for further research.

REFERENCES
[1] E. L. Paula, M. Ladeira, R. N. Carvalho, and T. Marzagão, "Deep learning anomaly detection as support fraud investigation in Brazilian exports and anti-money laundering," in Proc. 15th IEEE Int. Conf. Mach. Learn. Appl. (ICMLA), Anaheim, CA, USA, Oct. 2016, pp. 954–960.
[2] U. Porwal and S. Mukund, "Credit card fraud detection in e-commerce: An outlier detection approach," 2018, arXiv:1811.02196. [Online]. Available: https://arxiv.org/abs/1811.02196
[3] K. Alrawashdeh and C. Purdy, "Toward an online anomaly intrusion detection system based on deep learning," in Proc. 15th IEEE Int. Conf. Mach. Learn. Appl. (ICMLA), Anaheim, CA, USA, Dec. 2016, pp. 195–200.
[4] G. Gebremeskel, C. Yi, Z. He, and D. Haile, "Combined data mining techniques based patient data outlier detection for healthcare safety," Int. J. Intell. Comput. Cybern., vol. 9, no. 1, pp. 42–68, 2016.
[5] V. J. Hodge and J. Austin, "A survey of outlier detection methodologies," Artif. Intell. Rev., vol. 22, no. 2, pp. 85–126, 2004.
[6] C. C. Aggarwal and P. S. Yu, "An effective and efficient algorithm for high-dimensional outlier detection," Int. J. Very Large Data Bases, vol. 14, no. 2, pp. 211–221, 2005.
[7] F. Angiulli, S. Basta, and C. Pizzuti, "Distance-based detection and prediction of outliers," IEEE Trans. Knowl. Data Eng., vol. 18, no. 2, pp. 145–160, Feb. 2006.
[8] M. Breunig, H. Kriegel, R. T. Ng, and J. Sander, "LOF: Identifying density-based local outliers," ACM SIGMOD Rec., vol. 29, no. 2, pp. 93–104, 2000.
[9] F. Cao, M. Ester, W. Qian, and A. Zhou, "Density-based clustering over an evolving data stream with noise," in Proc. SIAM Conf. Data Mining, Apr. 2006, pp. 328–339.
[10] Z. Zheng, H. Y. Jeong, T. Huang, and J. Shu, "KDE based outlier detection on distributed data streams in multimedia network," Multimedia Tools Appl., vol. 76, no. 17, pp. 18027–18045, Sep. 2017.
[11] L. L. Sheng, "Fractal-based outlier detection algorithm over RFID data streams," Int. J. Online Eng., vol. 12, no. 1, pp. 35–41, Feb. 2016.
[12] D. van Hieu and P. Meesad, "A fast outlier detection algorithm for big datasets," in Recent Advances in Information and Communication Technology (Advances in Intelligent Systems and Computing), vol. 463, P. Meesad, S. Boonkrong, and H. Unger, Eds. Cham, Switzerland: Springer, 2016.
[13] X. T. Wang, D. R. Shen, M. Bai, T. Z. Nie, Y. Kou, and G. Yu, "An efficient algorithm for distributed outlier detection in large multi-dimensional datasets," J. Comput. Sci. Technol., vol. 30, no. 6, pp. 1233–1248, Nov. 2015.
[14] A. Ayadi, O. Ghorbel, A. M. Obeid, and M. Abid, "Outlier detection approaches for wireless sensor networks: A survey," Comput. Netw., vol. 129, pp. 319–333, Dec. 2017.
[15] J. Mao, W. Tao, C. Jin, and A. Zhou, "Feature grouping-based outlier detection upon streaming trajectories," IEEE Trans. Knowl. Data Eng., vol. 29, no. 12, pp. 2696–2709, Dec. 2017.
[16] C. D'Urso, "EXPERIENCE: Glitches in databases, how to ensure data quality by outlier detection techniques," J. Data Inf. Qual., vol. 7, no. 3, 2016, Art. no. 14.
[17] H. P. Kriegel, P. Kröger, E. Schubert, and A. Zimek, "Outlier detection in axis-parallel subspaces of high dimensional data," in Proc. Pacific-Asia Conf. Knowl. Discovery Data Mining, Berlin, Germany: Springer, 2009, pp. 831–838.
[18] Z. Cai, Z. He, X. Guan, and Y. Li, "Collective data-sanitization for preventing sensitive information inference attacks in social networks," IEEE Trans. Depend. Sec. Comput., vol. 15, no. 4, pp. 577–590, Aug. 2018.
[19] Y. Yu, L. Cao, E. A. Rundensteiner, and Q. Wang, "Outlier detection over massive-scale trajectory streams," ACM Trans. Database Syst., vol. 42, no. 2, pp. 10:1–10:33, 2017.
[20] Y. Djenouri, A. Belhadi, J. C.-W. Lin, D. Djenouri, and A. Cano, "A survey on urban traffic anomalies detection algorithms," IEEE Access, vol. 7, pp. 12192–12205, 2019.
[21] Y. Djenouri, A. Zimek, and M. Chiarandini, "Outlier detection in urban traffic flow distributions," in Proc. IEEE Int. Conf. Data Mining (ICDM), Nov. 2018, pp. 935–940.
[22] V. Chandola, A. Banerjee, and V. Kumar, "Anomaly detection: A survey," ACM Comput. Surv., vol. 41, no. 3, pp. 1–58, Jul. 2009.
[23] S. Ranshous, S. Shen, D. Koutra, S. Harenberg, C. Faloutsos, and N. F. Samatova, "Anomaly detection in dynamic networks: A survey," Wiley Interdiscipl. Rev., Comput. Stat., vol. 7, no. 3, pp. 223–247, 2015.
[24] X. Su and T. C. Leroy, "Outlier detection," in Robust Regression and Outlier Detection, vol. 1, no. 3. Hoboken, NJ, USA: Wiley, 2011, pp. 261–268.
[25] J. Tamboli and M. Shukla, "A survey of outlier detection algorithms for data streams," in Proc. 3rd Int. Conf. Comput. Sustain. Global Develop., Mar. 2016, pp. 3535–3540.
[26] J. Zhang, "Advancement of outlier detection: A survey," ICST Trans. Scalable Inf. Syst., vol. 13, pp. 1–26, Feb. 2013.
[27] A. Zimek, E. Schubert, and H.-P. Kriegel, "A survey on unsupervised outlier detection in high-dimensional numerical data," Stat. Anal. Data Mining, vol. 5, no. 5, pp. 363–387, Oct. 2012.
[28] C. C. Aggarwal, Outlier Analysis, 2nd ed. New York, NY, USA: Springer, 2016.
[29] A. S. Hadi, R. Imon, and M. Werner, Detection of Outliers, vol. 1, no. 1. Hoboken, NJ, USA: Wiley, 2009, pp. 57–70.
[30] D. Kwon, H. Kim, J. Kim, S. C. Suh, I. Kim, and K. J. Kim, "A survey of deep learning-based network anomaly detection," Cluster Comput., vol. 10, pp. 1–13, Sep. 2017.
[31] M. Gupta, J. Gao, C. C. Aggarwal, and J. Han, "Outlier detection for temporal data: A survey," IEEE Trans. Knowl. Data Eng., vol. 26, no. 9, pp. 2250–2267, Sep. 2014.
[32] R. Chalapathy and S. Chawla, "Deep learning for anomaly detection: A survey," 2019, arXiv:1901.03407. [Online]. Available: https://arxiv.org/abs/1901.03407
[33] P. Gogoi, D. K. Bhattacharyya, B. Borah, and J. K. Kalita, "A survey of outlier detection methods in network anomaly identification," Comput. J., vol. 54, no. 4, pp. 570–588, Apr. 2011.
[34] L. Akoglu, H. Tong, and D. Koutra, "Graph based anomaly detection and description: A survey," Data Mining Knowl. Discovery, vol. 29, no. 3, pp. 626–688, 2015.
[35] L. F. Maimó, A. L. P. Gómez, F. G. J. Clemente, M. G. Pérez, and G. M. A. Pérez, "Self-adaptive deep learning-based system for anomaly detection in 5G networks," IEEE Access, vol. 6, pp. 7700–7712, 2018.
[36] I. Kakanakova and S. Stoyanov, "Outlier detection via deep learning architecture," in Proc. 18th Int. Conf. Comput. Syst. Technol., 2017, pp. 73–79.
[37] A. Lazarevic and V. Kumar, "Feature bagging for outlier detection," in Proc. 11th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, Aug. 2005, pp. 157–166.
[38] H. V. Nguyen, H. H. Ang, and V. Gopalkrishnan, "Mining outliers with ensemble of heterogeneous detectors on random subspaces," in Database Systems for Advanced Applications. Berlin, Germany: Springer, 2010, pp. 368–383.
[39] V. Barnett and T. Lewis, Outliers in Statistical Data. Hoboken, NJ, USA: Wiley, 1994.
[40] S. Walfish, "A review of statistical outlier methods," Pharmaceutical Technol., vol. 30, no. 11, pp. 1–5, 2006.
[41] A. Patcha and J.-M. Park, "An overview of anomaly detection techniques: Existing solutions and latest technological trends," Comput. Netw., vol. 51, no. 12, pp. 3448–3470, Aug. 2007.
[42] M. F. Jiang, S. S. Tseng, and C. M. Su, "Two-phase clustering process for outliers detection," Pattern Recognit. Lett., vol. 22, pp. 691–700, May 2011.
[43] E. Achtert, H. P. Kriegel, L. Reichert, E. Schubert, R. Wojdanowski, and A. Zimek, "Visual evaluation of outlier detection models," in Proc. 15th Int. Conf. Database Syst. Adv. Appl. (DASFAA), 2010, pp. 396–399.
[44] A. Quesada. Three Methods to Deal With Outliers. Artelnics, Machine Learning Blog. Accessed: Feb. 20, 2018. [Online]. Available: https://www.neuraldesigner.com/blog/3_methods_to_deal_with_outliers
[45] K. Chenaoua, F. Kurugollu, and A. Bouridane, "Data cleaning and outlier removal: Application in human skin detection," in Proc. 5th Eur. Workshop Vis. Inf. Process. (EUVIP), Dec. 2014, pp. 1–6.
[46] G. Pang, L. Cao, L. Chen, and H. Liu, "Learning representations of ultrahigh-dimensional data for random distance-based outlier detection," in Proc. 24th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining (KDD), Jul. 2018, pp. 2041–2050.
[47] H. Liu, X. Li, J. Li, and S. Zhang, "Efficient outlier detection for high-dimensional data," IEEE Trans. Syst., Man, Cybern. Syst., vol. 48, no. 12, pp. 2451–2461, Dec. 2018.
[48] A. Emmott, S. Das, T. Dietterich, A. Fern, and W. K. Wong, "Anomaly detection meta-analysis benchmarks," Oregon State University, Dataset, 2016. doi: 10.7267/N97H1GGX.
[49] M. Goldstein, Unsupervised Anomaly Detection Benchmark. Harvard Dataverse, 2015. doi: 10.7910/DVN/OPQMVF.
[50] ELKI Outlier Datasets. (2018). [Online]. Available: https://elki-project.github.io/datasets/outlier
[51] S. Rayana. (2018). ODDS Library. Stony Brook, NY: Stony Brook University, Department of Computer Science. [Online]. Available: http://odds.cs.stonybrook.edu
[52] K. Bache and M. Lichman. (2013). UCI Machine Learning Repository. [Online]. Available: http://archive.ics.uci.edu/ml
[53] A. V. Pawar, P. P. Kalavadekar, and S. N. Tambe, "A survey on outlier detection techniques for credit card fraud detection," IOSR J. Comput. Eng., vol. 16, no. 2, pp. 44–48, 2014.
[54] D. Huang, D. Mu, L. Yang, and X. Cai, "CoDetect: Financial fraud detection with anomaly feature detection," IEEE Access, vol. 6, pp. 19161–19174, 2018.
[55] G. Singh, F. Masseglia, C. Fiot, A. Marascu, and P. Poncelet, "Mining common outliers for intrusion detection," in Advances in Knowledge Discovery and Management, F. Guillet, G. Ritschard, D. A. Zighed, and H. Briand, Eds. Berlin, Germany: Springer, 2009, pp. 217–234.
[56] S. Cateni, V. Colla, and M. Vannucci, "Outlier detection methods for industrial applications," in Advances in Robotics, Automation and Control, J. Aramburo and A. R. Trevino, Eds. Rijeka, Croatia: InTech, 2008. [Online]. Available: http://www.intechopen.com/books/advances_in_robotics_automation_and_control/outlier_detection_methods_for_industrial_applications
[57] A. Zhang, S. Song, J. Wang, and P. S. Yu, "Time series data cleaning: From anomaly detection to anomaly repairing," Proc. VLDB Endowment, vol. 10, no. 10, pp. 1046–1057, Jun. 2017.
[58] S. E. Benkabou, K. Benabdeslem, and B. Canitia, "Unsupervised outlier detection for time series by entropy and dynamic time warping," Knowl. Inf. Syst., vol. 54, no. 2, pp. 463–486, Feb. 2018.
[59] J. Zhu, W. Jiang, A. Liu, G. Liu, and L. Zhao, "Effective and efficient trajectory outlier detection based on time-dependent popular route," World Wide Web, vol. 20, no. 1, pp. 111–134, Jan. 2017.
[60] J. Ramakrishnan, E. Shaabani, C. Li, and M. A. Sustik, "Anomaly detection for an e-commerce pricing system," 2019, arXiv:1902.09566. [Online]. Available: https://arxiv.org/abs/1902.09566
[61] R. Kannan, H. Woo, C. C. Aggarwal, and H. Park, "Outlier detection for text data: An extended version," 2017, arXiv:1701.01325. [Online]. Available: https://arxiv.org/abs/1701.01325
[62] Y. Weng, N. Zhang, and C. Xia, "Multi-agent-based unsupervised detection of energy consumption anomalies on smart campus," IEEE Access, vol. 7, pp. 2169–2178, 2019.
[63] J. Lei, T. Jiang, K. Wu, H. Du, and L. Zhu, "Robust local outlier detection with statistical parameters for big data," Comput. Syst. Sci. Eng., vol. 30, no. 5, pp. 411–419, 2015.
[64] R. Yu, H. Qiu, Z. Wen, C. Lin, and Y. Liu, "A survey on social media anomaly detection," ACM SIGKDD Explor. Newslett., vol. 18, no. 1, pp. 1–14, 2016.
[65] R. Yu, X. He, and Y. Liu, "GLAD: Group anomaly detection in social media analysis," ACM Trans. Knowl. Discovery Data (TKDD), vol. 10, no. 2, p. 18, 2015.
[66] S. Ghanbari, B. Ali Hashemi, and C. Amza, "Stage-aware anomaly detection through tracking log points," in Proc. 15th Int. Middleware Conf., Dec. 2014, pp. 253–264. doi: 10.1145/2663165.2663319.
[67] K. Pradnya and H. K. Khanuja, "A methodology for outlier detection in audit logs for financial transactions," in Proc. Int. Conf. Comput. Commun. Control Automat., Feb. 2015, pp. 837–840. doi: 10.1109/ICCUBEA.2015.167.
[68] A. Abid, A. Kachouri, and A. Mahfoudhi, "Outlier detection for wireless sensor networks using density-based clustering approach," IET Wireless Sensor Syst., vol. 7, no. 4, pp. 83–90, Aug. 2017.
[69] H. Feng, L. Liang, and H. Lei, "Distributed outlier detection algorithm based on credibility feedback in wireless sensor networks," IET Commun., vol. 11, no. 8, pp. 1291–1296, Jun. 2017.
[70] N. Shahid, I. H. Naqvi, and S. B. Qaisar, "Characteristics and classification of outlier detection techniques for wireless sensor networks in harsh environments: A survey," Artif. Intell. Rev., vol. 43, no. 2, pp. 193–228, Feb. 2015.
[71] H. Zhang, J. Liu, and C. Zhao, "Distance based method for outlier detection in body sensor networks," EAI Endorsed Trans. Wireless Spectr., vol. 2, no. 7, p. e4, Apr. 2016.
[72] M. Shukla, Y. P. Kosta, and P. Chauhan, "Analysis and evaluation of outlier detection algorithms in data streams," in Proc. IEEE Int. Conf. Comput., Commun. Control (IC4), Sep. 2015, pp. 1–8.
[73] L. Tran, L. Fan, and C. Shahabi, "Distance-based outlier detection in data streams," Proc. VLDB Endowment (PVLDB), vol. 9, no. 12, pp. 1089–1100, Aug. 2016.
[74] E. Manzoor, H. Lamba, and L. Akoglu, "Outlier detection in feature-evolving data streams," in Proc. 24th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining (KDD), Aug. 2018, pp. 1963–1972.
[75] W. Jin, A. K. Tung, J. Han, and W. Wang, "Ranking outliers using symmetric neighborhood relationship," in Proc. 10th Pacific-Asia Conf. Adv. Knowl. Discovery Data Mining, 2006, pp. 577–593.
[76] K. Zhang, M. Hutter, and H. Jin, "A new local distance-based outlier detection approach for scattered real-world data," in Proc. Pacific-Asia Conf. Knowl. Discovery Data Mining, 2009, pp. 813–822.
[77] M. Bai, X. Wang, J. Xin, and G. Wang, "An efficient algorithm for distributed density-based outlier detection on big data," Neurocomputing, vol. 181, pp. 19–28, Mar. 2016.
[78] B. Tang and H. He, "A local density-based approach for outlier detection," Neurocomputing, vol. 241, pp. 171–180, Jun. 2017. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0925231217303302
[79] E. Schubert, A. Zimek, and H.-P. Kriegel, "Local outlier detection reconsidered: A generalized view on locality with applications to spatial, video, and network outlier detection," Data Mining Knowl. Discovery, vol. 28, no. 1, pp. 190–237, 2014.
[80] J. Tang, Z. Chen, A. Fu, and D. Cheung, "Enhancing effectiveness of outlier detections for low density patterns," in Advances in Knowledge Discovery and Data Mining. Berlin, Germany: Springer, 2002, pp. 535–548.
[81] H. Kriegel, P. Kröger, E. Schubert, and A. Zimek, "LoOP: Local outlier probabilities," in Proc. 18th ACM Conf. Inf. Knowl. Manage., Nov. 2009, pp. 1649–1652.
[82] S. Papadimitriou, H. Kitagawa, P. B. Gibbons, and C. Faloutsos, "LOCI: Fast outlier detection using the local correlation integral," in Proc. 19th Int. Conf. Data Eng., Mar. 2003, pp. 315–326.
[83] D. Ren, B. Wang, and W. Perrizo, "RDF: A density-based outlier detection method using vertical data representation," in Proc. Int. Conf. Data Mining, Nov. 2004, pp. 503–506.
[84] K. Cao, L. Shi, G. Wang, D. Han, and M. Bai, "Density-based local outlier detection on uncertain data," in Web-Age Information Management. WAIM (Lecture Notes in Computer Science), vol. 8485, F. Li, G. Li, S. Hwang, B. Yao, and Z. Zhang, Eds. Cham, Switzerland: Springer, 2014.
[85] F. Keller, E. Müller, and K. Bohm, "HiCS: High contrast subspaces for density-based outlier ranking," in Proc. IEEE 28th Int. Conf. Data Eng. (ICDE), Apr. 2012, pp. 1037–1048.
[86] R. J. G. B. Campello, D. Moulavi, A. Zimek, and J. Sander, "Hierarchical density estimates for data clustering, visualization, and outlier detection," ACM Trans. Knowl. Discovery Data (TKDD), vol. 10, no. 1, 2015, Art. no. 5. doi: 10.1145/2733381.
[87] R. Momtaz, N. Mohssen, and M. A. Gowayyed, "DWOF: A robust density-based outlier detection approach," in Proc. Iberian Conf. Pattern Recognit. Image Anal., 2013, pp. 517–525.
[88] H. Fan, O. R. Zaïane, A. Foss, and J. Wu, "Resolution-based outlier factor: Detecting the top-n most outlying data points in engineering data," Knowl. Inf. Syst., vol. 19, no. 1, pp. 31–51, 2009.
[89] K. Wu, K. Zhang, W. Fan, A. Edwards, and P. S. Yu, "RS-forest: A rapid density estimator for streaming anomaly detection," in Proc. IEEE Int. Conf. Data Mining, Dec. 2014, pp. 600–609.
[90] E. Lozano and E. Acuña, "Parallel algorithms for distance-based and density-based outliers," in Proc. 5th IEEE Int. Conf. Data Mining, Nov. 2005, pp. 729–732.
[91] F. I. Vázquez, T. Zseby, and A. Zimek, "Outlier detection based on low density models," in Proc. ICDM Workshops, 2018, pp. 970–979.
[92] J. Ning, L. Chen, and J. Chen, "Relative density-based outlier detection algorithm," in Proc. CSAI/ICIMT, Dec. 2018, pp. 227–231.
[93] S. Su, L. Xiao, L. Ruan, F. Gu, S. Li, Z. Wang, and R. Xu, "An efficient density-based local outlier detection approach for scattered data," IEEE Access, vol. 7, pp. 1006–1020, 2019.
[94] W. Wang, J. Yang, and R. Muntz, "STING: A statistical information grid approach to spatial data mining," in Proc. 23rd VLDB Conf., Aug. 1997, pp. 186–195.
[95] S. Hido, Y. Tsuboi, H. Kashima, M. Sugiyama, and T. Kanamori, "Statistical outlier detection using direct density ratio estimation," Knowl. Inf. Syst., vol. 26, no. 2, pp. 309–336, 2011.
[96] H. Kriegel, P. Kröger, and A. Zimek, "Outlier detection techniques," in Proc. Tutorial KDD, 2009, pp. 1–10.
[97] M. Goldstein and S. Uchida, "A comparative evaluation of unsupervised anomaly detection algorithms for multivariate data," PLoS One, vol. 11, no. 4, 2016, Art. no. e0152173. doi: 10.1371/journal.pone.0152173.
[98] E. Eskin, "Anomaly detection over noisy data using learned probability distributions," in Proc. 17th Int. Conf. Mach. Learn. (ICML), Jul. 2000, pp. 255–262.
[99] E. M. Knorr, R. T. Ng, and V. Tucakov, "Distance-based outliers: Algorithms and applications," VLDB J., vol. 8, nos. 3–4, pp. 237–253, 2000.
[100] Wikipedia contributors. (2015). Maximum Likelihood Estimation. [Online]. Available: https://en.wikipedia.org/w/index.php?title=Maximum_likelihood_estimation&oldid=857905834
[101] X. Yang, L. J. Latecki, and D. Pokrajac, "Outlier detection with globally optimal exemplar-based GMM," in Proc. SIAM Int. Conf. Data Mining (SDM), Apr. 2009, pp. 145–154.
[102] X. Tang, R. Yuan, and J. Chen, "Outlier detection in energy disaggregation using subspace learning and Gaussian mixture model," Int. J. Control Autom., vol. 8, no. 8, pp. 161–170, 2015.
[103] B. N. Saha, N. Ray, and H. Zhang, "Snake validation: A PCA-based outlier detection method," IEEE Signal Process. Lett., vol. 16, no. 6, pp. 549–552, Jun. 2009.
[104] M. H. Satman, "A new algorithm for detecting outliers in linear regression," Int. J. Statist. Probab., vol. 2, no. 3, pp. 101–109, Aug. 2013.
[105] C. M. Park and J. Jeon, "Regression-based outlier detection of sensor measurements using independent variable synthesis," in Proc. Int. Conf. Data Sci., Dec. 2015, pp. 78–86.
[106] P. I. F. Dalatu, A. Fitrianto, and A. Mustapha, "A comparative study of linear and nonlinear regression models for outlier detection," in Proc. Int. Conf. Soft Comput. Data Mining, 2017, vol. 549, pp. 316–327.
[107] M. Pavlidou and G. Zioutas, "Kernel density outlier detector," in Topics in Nonparametric Statistics. New York, NY, USA: Springer, 2014, pp. 241–250.
[108] L. J. Latecki, A. Lazarevic, and D. Pokrajac, "Outlier detection with kernel density functions," in Proc. 5th Int. Conf. Mach. Learn. Data Mining Pattern Recognit., 2007, pp. 61–75.
[109] J. Gao, W. Hu, Z. Zhang, X. Zhang, and O. Wu, "RKOF: Robust kernel-based local outlier detection," in Proc. Pacific-Asia Conf. Knowl. Discovery Data Mining, 2011, pp. 270–283.
[110] V. S. K. Samparthi and H. K. Verma, "Outlier detection of data in wireless sensor networks using kernel density estimation," Int. J. Comput. Appl., vol. 5, no. 7, pp. 28–32, Aug. 2010.
[111] A. O. Boedihardjo, C.-T. Lu, and F. Chen, "Fast adaptive kernel density estimator for data streams," Knowl. Inf. Syst., vol. 42, no. 2, pp. 285–317, Feb. 2015.
[112] M. S. Uddin, A. Kuh, and Y. Weng, "Online bad data detection using kernel density estimation," in Proc. IEEE Power Energy Soc. General Meeting, Jul. 2015, pp. 1–5.
[113] S. Smrithy, S. Munirathinam, and R. Balakrishnan, "Online anomaly detection using non-parametric technique for big data streams in cloud collaborative environment," in Proc. IEEE Int. Conf. Big Data (Big Data), Dec. 2016, pp. 1950–1955.
[114] L. Zhang, J. Lin, and R. Karim, "Adaptive kernel density-based anomaly detection for nonlinear systems," Knowl.-Based Syst., vol. 139, pp. 50–63, Jan. 2018.
[115] X. Qin, L. Cao, E. A. Rundensteiner, and S. Madden, "Scalable kernel density estimation-based local outlier detection over large data streams," in Proc. EDBT, 2019, pp. 421–432. doi: 10.5441/002/edbt.2019.37.
[116] M. Goldstein and A. Dengel, "Histogram-based outlier score (HBOS): A fast unsupervised anomaly detection algorithm," in Proc. Poster Demo Track, Sep. 2012, pp. 59–63.
[117] P. J. Rousseeuw and M. Hubert, "Robust statistics for outlier detection," Data Mining Knowl. Discovery, vol. 1, no. 1, pp. 73–79, 2011.
[118] H. Du, S. Zhao, and D. Zhang, "Robust local outlier detection," in Proc. IEEE Int. Conf. Data Mining Workshop (ICDMW), Nov. 2015, pp. 116–123.
[119] J. Gebhardt, M. Goldstein, F. Shafait, and A. Dengel, "Document authentication using printing technique features and unsupervised anomaly detection," in Proc. 12th Int. Conf. Document Anal. Recognit., Aug. 2013, pp. 479–483.
[120] F. Angiulli and C. Pizzuti, "Fast outlier detection in high dimensional spaces," in Proc. Eur. Conf. Princ. Data Mining Knowl. Discovery, Sep. 2002, pp. 15–26.
[121] T. T. Dang, H. Y. T. Ngan, and W. Liu, "Distance-based k-nearest neighbors outlier detection method in large-scale traffic data," in Proc. IEEE Int. Conf. Digital Signal Process., Jul. 2015, pp. 507–510.
[122] E. M. Knorr and R. T. Ng, "Algorithms for mining distance based outliers in large data sets," in Proc. 24th Int. Conf. Very Large Data Bases, 1998, pp. 392–403.
[123] S. Ramaswamy, R. Rastogi, and S. Kyuseok, "Efficient algorithms for mining outliers from large data sets," in Proc. ACM SIGMOD Int. Conf. Manage. Data, May 2000, pp. 427–438.
[124] A. Ghoting, S. Parthasarathy, and M. E. Otey, "Fast mining of distance-based outliers in high-dimensional datasets," Data Mining Knowl. Discovery, vol. 16, no. 3, pp. 349–364, Jun. 2008.
[125] J. Liu and H. Deng, "Outlier detection on uncertain data based on local information," Knowl.-Based Syst., vol. 51, pp. 60–71, Oct. 2013.
[126] H. Huang, K. Mehrotra, and C. K. Mohan, "Rank-based outlier detection," J. Stat. Comput. Simul., vol. 83, no. 3, pp. 518–531, Oct. 2013.
[127] G. Bhattacharya, K. Ghosh, and A. S. Chowdhury, "Outlier detection using neighborhood rank difference," Pattern Recognit. Lett., vol. 60, pp. 24–31, Aug. 2015.
[128] X. Wang, X. L. Wang, Y. Ma, and D. M. Wilkes, "A fast MST-inspired kNN-based outlier detection method," Inf. Syst., vol. 48, pp. 89–112, Mar. 2015.
[129] M. Radovanović, A. Nanopoulos, and M. Ivanović, "Reverse nearest neighbors in unsupervised distance-based outlier detection," IEEE Trans. Knowl. Data Eng., vol. 27, no. 5, pp. 1369–1382, May 2015.
[130] J. Huang, Q. Zhu, L. Yang, and J. Feng, "A non-parameter outlier detection algorithm based on natural neighbor," Knowl.-Based Syst., vol. 92, pp. 71–77, Jan. 2016.
[131] J. Ha, S. Seok, and J.-S. Lee, "A precise ranking method for outlier detection," Inf. Sci., vol. 324, pp. 88–107, Dec. 2015.
[132] S. Bay and M. Schwabacher, "Mining distance-based outliers in near linear time with randomization and a simple pruning rule," in Proc. ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, Aug. 2003, pp. 29–38.
[133] F. Angiulli and F. Fassetti, "Very efficient mining of distance-based outliers," in Proc. 16th ACM Conf. Inf. Knowl. Manage., Nov. 2007, pp. 791–800.
[134] D. Ren, I. Rahal, W. Perrizo, and K. Scott, "A vertical distance-based outlier detection method with local pruning," in Proc. 13th ACM CIKM Int. Conf. Inf. Knowl. Manage., Nov. 2004, pp. 279–284.
[135] N. H. Vu and V. Gopalkrishnan, "Efficient pruning schemes for distance-based outlier detection," in Proc. Eur. Conf. Mach. Learn. Knowl. Discovery Databases, 2009, pp. 160–175.
[136] F. Angiulli and F. Fassetti, "Distance-based outlier queries in data streams: The novel task and algorithms," Data Mining Knowl. Discovery, vol. 20, pp. 290–324, Mar. 2010.
[137] C. C. Aggarwal, "On abnormality detection in spuriously populated data streams," in Proc. SIAM Int. Conf. Data Mining, Apr. 2005, pp. 80–91.
[138] S. Subramaniam, T. Palpanas, D. Papadopoulos, V. Kalogeraki, and D. Gunopulos, "Online outlier detection in sensor data using non-parametric models," in Proc. Int. Conf. Very Large Data Bases, Sep. 2006, pp. 187–198.
[139] D. Yang, E. A. Rundensteiner, and M. Ward, "Neighbor-based pattern detection for windows over streaming data," in Proc. 12th Int. Conf. Extending Database Technol., Mar. 2009, pp. 529–540.
[140] M. Kontaki, A. Gounaris, A. N. Papadopoulos, and K. Tsichlas, "Continuous monitoring of distance-based outliers over data streams," in Proc. IEEE 27th Int. Conf. Data Eng., Apr. 2011, pp. 135–146.
[141] F. Angiulli and F. Fassetti, "Detecting distance-based outliers in streams of data," in Proc. 16th ACM Conf. Inf. Knowl. Manage., Nov. 2007, pp. 811–820.
[142] L. Cao, D. Yang, Q. Wang, Y. Yu, J. Wang, and E. A. Rundensteiner, "Scalable distance-based outlier detection over high-volume data streams," in Proc. IEEE 30th Int. Conf. Data Eng., Apr. 2014, pp. 76–87.
[143] A. Bifet, G. Holmes, R. Kirkby, and B. Pfahringer, "MOA: Massive online analysis tool," J. Mach. Learn. Res., vol. 11, pp. 1601–1604, Oct. 2011. [Online]. Available: https://moa.cms.waikato.ac.nz/
[144] C. C. Aggarwal and P. S. Yu, "Outlier detection for high dimensional data," ACM SIGMOD Rec., vol. 30, no. 2, pp. 37–46, 2001.
[145] K. Bhaduri, B. L. Mathews, and C. R. Giannella, "Algorithms for speeding up distance-based outlier detection," in Proc. ACM KDD Conf., Aug. 2011, pp. 859–867.
[146] M. B. Al-Zoubi, "An effective clustering-based approach for outlier detection," Eur. J. Sci. Res., vol. 28, no. 2, pp. 310–316, Jan. 2009.
[147] L. Kaufman and P. J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis. Hoboken, NJ, USA: Wiley, 1990.
[148] R. Ng and J. Han, "Efficient and effective clustering methods for spatial data mining," in Proc. 20th VLDB Conf., 1994, pp. 144–155.
[149] J. MacQueen, "Some methods for classification and analysis of multivariate observations," in Proc. 5th Berkeley Symp. Math. Statist. Probab., Jun. 1967, vol. 1, pp. 281–297.
[150] C. T. Zahn, "Graph-theoretical methods for detecting and describing gestalt clusters," IEEE Trans. Comput., vol. C-20, no. 2, pp. 68–86, Jan. 1971.
[151] S. Guha, R. Rastogi, and K. Shim, "CURE: An efficient clustering algorithm for large databases," in Proc. ACM SIGMOD Int. Conf. Manage. Data, Jun. 1998, pp. 73–84.
[152] G. Karypis, E. H. Han, and V. Kumar, "CHAMELEON: A hierarchical clustering algorithm using dynamic modeling," IEEE Comput., vol. 27, no. 3, pp. 329–341, Aug. 1999.
[153] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu, "A density-based algorithm for discovering clusters in large spatial databases with noise," in Proc. 2nd Int. Conf. Knowl. Discovery Data Mining, Aug. 1996, pp. 226–231.
[154] A. Hinneburg and D. A. Keim, "An efficient approach to clustering in large multimedia databases with noise," in Proc. SIGKDD, 1998, pp. 58–65.
[155] G. Sheikholeslami, S. Chatterjee, and A. Zhang, "WaveCluster: A wavelet-based clustering approach for spatial data in very large databases," VLDB J., vol. 8, nos. 3–4, pp. 289–304, Feb. 2000.
[156] J. Zhang, W. Hsu, and M. L. Lee, "Clustering in dynamic spatial
[163] M. Elahi, K. Li, W. Nisar, X. Lv, and H. Wang, "Efficient clustering-based outlier detection algorithm for dynamic data stream," in Proc. 5th Int. Conf. Fuzzy Syst. Knowl. Discovery, vol. 5, Oct. 2008, pp. 298–304.
[164] D. Pokrajac, A. Lazarevic, and L. J. Latecki, "Incremental local outlier detection for data streams," in Proc. IEEE Symp. Comput. Intell. Data Mining, Apr. 2007, pp. 504–551.
[165] Yogita and D. Toshniwala, "A framework for outlier detection in evolving data streams by weighting attributes in clustering," in Proc. 2nd Int. Conf. Commun., Comput. Secur., vol. 6, 2012, pp. 214–222. doi: 10.1016/j.protcy.2012.10.026.
[166] H. Moradi, S. Ibrahim, and J. Hosseinkhani, "Outlier detection in stream data by clustering method," Int. J. Adv. Comput. Sci. Inf. Technol., vol. 2, no. 3, pp. 25–34, 2013.
[167] S. V. Bhosale, "Outlier detection in streaming data using clustering approaches," Int. J. Adv. Comput. Sci. Inf. Technol., vol. 5, no. 5, pp. 6050–6053, 2014.
[168] L. Kaufman and P. J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis. Hoboken, NJ, USA: Wiley, 1990.
[169] M. Moshtaghi, J. C. Bezdek, T. C. Havens, C. Leckie, S. S. Karunasekera, S. Rajasegarar, and M. Palaniswami, "Streaming analysis in wireless sensor networks," Wireless Commun. Mobile Comput., vol. 14, no. 9, pp. 905–921, Jun. 2014.
[170] M. Moshtaghi, J. C. Bezdek, C. Leckie, S. Karunasekera, and M. Palaniswami, "Evolving fuzzy rules for anomaly detection in data streams," IEEE Trans. Fuzzy Syst., vol. 23, no. 3, pp. 688–700, 2015.
[171] M. Salehi, C. A. Leckie, M. Moshtaghi, and T. Vaithianathan, "A relevance weighted ensemble model for anomaly detection in switching data streams," in Proc. Pacific-Asia Conf. Knowl. Discovery Data Mining, 2014, pp. 461–473.
[172] M. Chenaghlou, M. Moshtaghi, C. Leckie, and M. Salehi, "An efficient method for anomaly detection in non-stationary data streams," in Proc. IEEE Global Commun. Conf. (GLOBECOM), Dec. 2017, pp. 1–6.
[173] H. Rizk, S. Elgokhy, and A. Sarhan, "A hybrid outlier detection algorithm based on partitioning clustering and density measures," in Proc. 10th Int. Conf. Comput. Eng. Syst. (ICCES), Dec. 2015, pp. 175–181.
[174] M. Chenaghlou, M. Moshtaghi, C. Leckie, and M. Salehi, "Online clustering for evolving data streams with online anomaly detection," in Proc. Pacific-Asia Conf. Knowl. Discovery Data Mining. New York, NY, USA: Springer, 2018, pp. 508–521.
[175] J. Yin and J. Wang, "A model-based approach for text clustering with outlier detection," in Proc. 32nd Int. Conf. Data Eng. (ICDE), May 2016, pp. 625–636.
[176] A. Zimek, M. Gaudet, R. J. Campello, and J. Sander, "Subsampling for efficient and effective unsupervised outlier detection ensembles," in Proc. 19th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, Aug. 2013, pp. 428–436.
[177] J. R. Pasillas-Díaz and S. Ratté, "Bagged subspaces for unsupervised outlier detection," Int. J. Comput. Intell., vol. 33, no. 3, pp. 507–523, Aug. 2017.
[178] C. C. Aggarwal, "Outlier ensembles: Position paper," SIGKDD Explor. Newslett., vol. 14, pp. 49–58, Apr. 2013.
[179] A. Zimek, M. Gaudet, R. J. Campello, and J. Sander, "Subsampling for efficient and effective unsupervised outlier detection ensembles," in Proc.
[153] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu, ‘‘A density-based algorithm for efficient and effective unsupervised outlier detection ensembles,’’ in
for discovering clusters a density-based algorithm for discovering clusters Proc. 19th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining,
in large spatial databases with noise,’’ in Proc. 2nd Int. Conf. Knowl. Aug. 2013, pp. 428–436.
Discovery Data Mining, Aug. 1996, pp. 226–231. [177] J. R. Pasillas-Díaz and S. Ratté, ‘‘Bagged subspaces for unsupervised
[154] A. Hinneburg and D. A. Keim, ‘‘An efficient approach to cluster in large outlier detection,’’ Int. J. Comput. Intell., vol. 33, no. 3, pp. 507–523,
multimedia databases with noise,’’ in Proc. SIGKDD, 1998, pp. 58–65. Aug. 2017.
[155] G. Sheikholeslami, S. Chatterjee, and A. Zhang, ‘‘WaveCluster: A [178] C. C. Aggarwal, ‘‘Outlier ensembles: Position paper,’’ SIGKDD Explor.
wavelet-based clustering approach for spatial data in very large Newslett., vol. 14, pp. 49–58, Apr. 2013.
databases,’’ VLDB J., vol. 8, nos. 3–4, pp. 289–304, Feb. 2000. [179] A. Zimek, M. Gaudet, R. J. Compello, and J. Sander, ‘‘Subsampling for
[156] J. Zhang, W. Hsu, and M. L. Lee, ‘‘Clustering in dynamic spatial effcient and effective unsupervised outlier detection ensembles,’’ in Proc.
databases,’’ J. Intell. Inf. Syst. (JIIS), vol. 24, no. 1, pp. 5–27, Jan. 2005. 19th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2013,
[157] R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan, ‘‘Automatic sub- pp. 428–436.
space clustering of high dimensional data for data mining applications,’’ [180] A. Zimek, R. J. Campello, and J. Sander, ‘‘Data perturbation for outlier
in Proc. ACM SIGMOD Int. Conf. Manage. Data, 1998, pp. 94–105. detection ensembles,’’ in Proc. 26th Int. Conf. Sci. Stat. Database Man-
[158] C. C. Aggarwal, J. Han, J. Wang, and P. S. Yu, ‘‘A framework for projected age., Jul. 2014, pp. 1–13. doi: 10.1145/2618243.2618257.
clustering of high dimensional data streams,’’ in Proc. 30th Very Large [181] A. Zimek, R. J. G. B. Campello, and J. Sander, ‘‘Ensembles for unsu-
Data Bases, Aug. 2004, pp. 852–863. pervised outlier detection: Challenges and research questions a position
[159] Y. Chen and L. Tu, ‘‘Density-based clustering for real-time stream data,’’ paper,’’ ACM SIGKDD Explor. Newslett., vol. 15, no. 1, pp. 11–22,
inProc. 13th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, Jun. 2014.
Aug. 2007, pp. 133–142. [182] C. C. Aggarwal and S. Sathe, ‘‘Theoretical foundations and algorithms
[160] C. C. Aggarwal, J. Han, J. Wang, and P. S. Yu, ‘‘A framework for for outlier ensembles,’’ ACM SIGKDD Explor. Newslett., vol. 17, no. 1,
clustering evolving data streams,’’ in Proc. 29 th Int. Conf. Very Large pp. 24–47, Jun. 2015.
Database, vol. 29, pp. 81–92. [183] Y. Zhao and M. K. Hryniewicki, ‘‘XGBOD: Improving supervised
[161] J. Ren and R. Ma, ‘‘Density-based data streams clustering over sliding outlier detection with unsupervised representation learning,’’
windows,’’ in Proc. Int. Conf. Fuzzy Syst. Knowl. Discovery (FSKD), in Proc. Int. Joint Conf. Neural Netw., Jul. 2018, pp. 1–8.
Jul. 2009, pp. 248–252. doi: 10.1109/IJCNN.2018.8489605.
[162] I. Assent, P. Kranen, C. Baldauf, and T. Seidl, ‘‘AnyOut: Anytime outlier [184] S. Rayana and L. Akoglu, ‘‘Less is more: Building selective anomaly
detection on streaming data,’’ in Proc. 17th Int. Conf. Database Syst. Adv. ensembles,’’ ACM Trans. Knowl. Discovery Data, vol. 10, no. 4, pp. 1–33,
Appl., 2012, pp: 228-242. Jul. 2016.
[185] S. Rayana, W. Zhong, and L. Akoglu, ‘‘Sequential ensemble learning for outlier detection: A bias-variance perspective,’’ in Proc. ICDM, Dec. 2017, pp. 1167–1172.
[186] B. Micenková, B. McWilliams, and I. Assent, ‘‘Learning representations for outlier detection on a budget,’’ 2015, arXiv:1507.08104. [Online]. Available: https://arxiv.org/abs/1507.08104
[187] G. O. Campos, A. Zimek, and W. Meira, Jr., ‘‘An unsupervised boosting strategy for outlier detection ensembles,’’ in Proc. 22nd Pacific-Asia Conf. Adv. Knowl. Discovery Data Mining (PAKDD), 2018, pp. 564–576.
[188] S. Bickel and T. Scheffer, ‘‘Multi-view clustering,’’ in Proc. 4th IEEE Int. Conf. Data Mining, 2004, pp. 19–26.
[189] E. Müller, S. Günnemann, T. Seidl, and I. Färber, ‘‘Discovering multiple clustering solutions: Grouping objects in different views of the data,’’ in Proc. 28th IEEE ICDE Conf., Apr. 2012, pp. 1207–1210.
[190] H.-P. Kriegel, P. Kröger, E. Schubert, and A. Zimek, ‘‘Interpreting and unifying outlier scores,’’ in Proc. SDM, Apr. 2011, pp. 13–24.
[191] E. Schubert, R. Wojdanowski, A. Zimek, and H.-P. Kriegel, ‘‘On evaluation of outlier rankings and outlier scores,’’ in Proc. SDM, Apr. 2012, pp. 1047–1058.
[192] F. T. Liu, K. M. Ting, and Z.-H. Zhou, ‘‘Isolation forest,’’ in Proc. 8th IEEE Int. Conf. Data Mining, Dec. 2008, pp. 413–422, doi: 10.1109/ICDM.2008.17.
[193] S. Das, W. K. Wong, T. Dietterich, A. Fern, and A. Emmott, ‘‘Incorporating expert feedback into active anomaly discovery,’’ in Proc. IEEE 16th Int. Conf. Data Mining (ICDM), Dec. 2016, pp. 853–858, doi: 10.1109/ICDM.2016.0102.
[194] T. Pimentel, M. Monteiro, and J. Viana, ‘‘A generalized active learning approach for unsupervised anomaly detection,’’ 2018, arXiv:1805.09411. [Online]. Available: https://arxiv.org/abs/1805.09411
[195] J. Zhang, Y. Jiang, K. H. Chang, S. Zhang, J. Cai, and L. Hu, ‘‘A concept lattice based outlier mining method in low-dimensional subspaces,’’ Pattern Recognit. Lett., vol. 30, no. 15, pp. 1434–1439, Nov. 2009.
[196] J. K. Dutta, B. Banerjee, and C. K. Reddy, ‘‘RODS: Rarity based outlier detection in a sparse coding framework,’’ IEEE Trans. Knowl. Data Eng., vol. 28, no. 2, pp. 483–495, Feb. 2016.
[197] J. Zhang, S. Zhang, K. H. Chang, and X. Qin, ‘‘An outlier mining algorithm based on constrained concept lattice,’’ Int. J. Syst. Sci., vol. 45, no. 5, pp. 1170–1179, May 2014.
[198] E. Müller, I. Assent, U. Steinhausen, and T. Seidl, ‘‘OutRank: Ranking outliers in high dimensional data,’’ in Proc. IEEE 24th Int. Conf. Data Eng. Workshop, Apr. 2008, pp. 600–603.
[199] Y. Liu, Z. Li, C. Zhou, Y. Jiang, J. Sun, M. Wang, and X. He, ‘‘Generative adversarial active learning for unsupervised outlier detection,’’ IEEE Trans. Knowl. Data Eng., to be published, doi: 10.1109/TKDE.2019.2905606.
[200] N. Görnitz, M. Kloft, K. Rieck, and U. Brefeld, ‘‘Toward supervised anomaly detection,’’ J. Artif. Intell. Res., vol. 46, pp. 235–262, 2013.
[201] S. Das, M. R. Islam, N. K. Jayakodi, and J. R. Doppa, ‘‘Active anomaly detection via ensembles: Insights, algorithms, and interpretability,’’ 2019, arXiv:1901.08930. [Online]. Available: https://arxiv.org/abs/1901.08930
[202] E. Müller, M. Schiffer, and T. Seidl, ‘‘Statistical selection of relevant subspace projections for outlier ranking,’’ in Proc. IEEE 27th Int. Conf. Data Eng., Apr. 2011, pp. 434–445.
[203] B. van Stein, M. van Leeuwen, and T. Bäck, ‘‘Local subspace-based outlier detection using global neighbourhoods,’’ in Proc. IEEE Int. Conf. Big Data (Big Data), Dec. 2016, pp. 1136–1142.
[204] H. D. K. Moonesinghe and P.-N. Tan, ‘‘OutRank: A graph-based outlier detection framework using random walk,’’ Int. J. Artif. Intell. Tools, vol. 17, no. 1, pp. 19–36, 2008, doi: 10.1142/s0218213008003753.
[205] C. Wang, H. Gao, Z. Liu, and Y. Fu, ‘‘A new outlier detection model using random walk on local information graph,’’ IEEE Access, vol. 6, pp. 75531–75544, 2018.
[206] W. Chao, G. Hui, L. Zhen, and F. Yan, ‘‘Outlier detection using diverse neighborhood graphs,’’ in Proc. 15th Int. Comput. Conf. Wavelet Act. Media Technol., 2018, pp. 58–62, doi: 10.1109/ICCWAMTIP.2018.8632604.
[207] B. R. Kiran, D. M. Thomas, and R. Parakkal, ‘‘An overview of deep learning-based methods for unsupervised and semi-supervised anomaly detection in videos,’’ J. Imag., vol. 4, no. 2, p. 36, 2018.
[208] K. Hundman, V. Constantinou, C. Laporte, I. Colwell, and T. Soderstrom, ‘‘Detecting spacecraft anomalies using LSTMs and nonparametric dynamic thresholding,’’ in Proc. 24th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, Aug. 2018, pp. 387–395.
[209] D. Li, D. Chen, L. Shi, B. Jin, J. Goh, and S. K. Ng, ‘‘MAD-GAN: Multivariate anomaly detection for time series data with generative adversarial networks,’’ 2019, arXiv:1901.04997. [Online]. Available: https://arxiv.org/abs/1901.04997
[210] B. Zong, Q. Song, M. R. Min, W. Cheng, C. Lumezanu, D. Cho, and H. Chen, ‘‘Deep autoencoding Gaussian mixture model for unsupervised anomaly detection,’’ in Proc. Int. Conf. Learn. Represent. (ICLR), 2018, pp. 1–19.
[211] C. Zhou and R. C. Paffenroth, ‘‘Anomaly detection with robust deep autoencoders,’’ in Proc. 23rd ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, Aug. 2017, pp. 665–674.
[212] R. Chalapathy, A. K. Menon, and S. Chawla, ‘‘Robust, deep and inductive anomaly detection,’’ 2017, arXiv:1704.06743. [Online]. Available: https://arxiv.org/abs/1704.06743
[213] J. T. A. Andrews, E. J. Morton, and L. D. Griffin, ‘‘Detecting anomalous data using auto-encoders,’’ Int. J. Mach. Learn. Comput., vol. 6, no. 1, pp. 21–26, 2016.
[214] S. M. Erfani, S. Rajasegarar, S. Karunasekera, and C. Leckie, ‘‘High-dimensional and large-scale anomaly detection using a linear one-class SVM with deep learning,’’ Pattern Recognit., vol. 58, pp. 121–134, Oct. 2016.
[215] D. Hendrycks, M. Mazeika, and T. Dietterich, ‘‘Deep anomaly detection with outlier exposure,’’ 2019, arXiv:1812.04606. [Online]. Available: https://arxiv.org/abs/1812.04606
[216] M. Du, F. Li, G. Zheng, and V. Srikumar, ‘‘DeepLog: Anomaly detection and diagnosis from system logs through deep learning,’’ in Proc. ACM SIGSAC Conf. Comput. Commun. Secur., Oct. 2017, pp. 1285–1298.
[217] A. Borghesi, A. Bartolini, M. Lombardi, M. Milano, and L. Benini, ‘‘Anomaly detection using autoencoders in high performance computing systems,’’ 2018, arXiv:1811.05269. [Online]. Available: https://arxiv.org/abs/1811.05269
[218] R. Chalapathy, A. K. Menon, and S. Chawla, ‘‘Anomaly detection using one-class neural networks,’’ 2018, arXiv:1802.06360. [Online]. Available: https://arxiv.org/abs/1802.06360
[219] L. Ruff, N. Görnitz, L. Deecke, S. A. Siddiqui, R. Vandermeulen, A. Binder, E. Müller, and M. Kloft, ‘‘Deep one-class classification,’’ in Proc. Int. Conf. Mach. Learn., Jul. 2018, pp. 4390–4399.
[220] G. H. Orair, C. Teixeira, Y. Wang, W. Meira, Jr., and S. Parthasarathy, ‘‘Distance-based outlier detection: Consolidation and renewed bearing,’’ Proc. VLDB Endowment, vol. 3, no. 2, pp. 1469–1480, 2010.
[221] (2018). ELKI Tutorials - The Demonstrated Software is Available as Release 0.3 of the ELKI Framework. [Online]. Available: https://elki-project.github.io/tutorial/
[222] E. Achtert, T. Bernecker, H.-P. Kriegel, E. Schubert, and A. Zimek, ‘‘ELKI in time: ELKI 0.2 for the performance evaluation of distance measures for time series,’’ in Advances in Spatial and Temporal Databases, N. Mamoulis, T. Seidl, T. B. Pedersen, K. Torp, and I. Assent, Eds. Heidelberg, Germany: Springer, 2009, pp. 436–440.
[223] R. Domingues, M. Filippone, P. Michiardi, and J. Zouaoui, ‘‘A comparative evaluation of outlier detection algorithms: Experiments and analyses,’’ Pattern Recognit., vol. 74, pp. 406–421, Feb. 2018.
[224] X. Ding, Y. Li, A. Belatreche, and L. P. Maguire, ‘‘An experimental evaluation of novelty detection methods,’’ Neurocomputing, vol. 135, pp. 313–327, Jul. 2014.
[225] U. Carrasquilla, ‘‘Benchmarking algorithms for detecting anomalies in large datasets,’’ CMG J., vol. 1, pp. 1–16, Nov. 2011.
[226] G. O. Campos, A. Zimek, and J. Sander, ‘‘On the evaluation of unsupervised outlier detection: Measures, datasets, and an empirical study,’’ Data Mining Knowl. Discovery, vol. 30, no. 4, pp. 891–927, Jul. 2016.
[227] Y. Zhao and M. K. Hryniewicki, ‘‘DCSO: Dynamic combination of detector scores for outlier ensembles,’’ in Proc. ACM SIGKDD Conf. Knowl. Discovery Data Mining (KDD), 2018, pp. 1–8.
[228] Y. Zhao, Z. Nasrullah, M. K. Hryniewicki, and Z. Li, ‘‘LSCP: Locally selective combination in parallel outlier ensembles,’’ in Proc. SIAM Int. Conf. Data Mining, May 2019, pp. 585–593.
[229] C. C. Aggarwal, ‘‘Outlier ensembles,’’ ACM SIGKDD Explor. Newslett., vol. 14, no. 2, pp. 49–80, 2017.
[230] A. Lavin and S. Ahmad, ‘‘Evaluating real-time anomaly detection algorithms—The Numenta anomaly benchmark,’’ in Proc. 14th Int. Conf. Mach. Learn. Appl., Dec. 2015, pp. 38–44.
[231] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, and J. Vanderplas, ‘‘Scikit-learn: Machine learning in Python,’’ J. Mach. Learn. Res., vol. 12, pp. 2825–2830, 2011.
[232] Y. Zhao, Z. Nasrullah, and Z. Li, ‘‘PyOD: A Python toolbox for scalable outlier detection,’’ 2019, arXiv:1901.01588. [Online]. Available: https://arxiv.org/abs/1901.01588
[233] I. Kalayci and T. Ercan, ‘‘Anomaly detection in wireless sensor networks data by using histogram based outlier score method,’’ in Proc. 2nd Int. Symp. Multidisciplinary Studies Innov. Technol. (ISMSIT), Oct. 2018, pp. 1–6.
[234] M. Goldstein, M. Amer, J. Gebhardt, P. Kalka, and A. Elsawy. (2018). RapidMiner Anomaly Detection Extension. [Online]. Available: https://github.com/Markus-Go/rapidminer-anomalydetection
[235] Detect and Remove Outliers in Data. Accessed: Oct. 15, 2018. [Online]. Available: https://www.mathworks.com/help/matlab/ref/rmoutliers.html
[236] E. Kirner, E. Schubert, and A. Zimek, ‘‘Good and bad neighborhood approximations for outlier detection ensembles,’’ in Proc. Int. Conf. Similarity Search Appl., 2017, pp. 173–187.
[237] K. Shu, A. Sliva, S. Wang, J. Tang, and H. Liu, ‘‘Fake news detection on social media: A data mining perspective,’’ ACM SIGKDD Explor. Newslett., vol. 19, no. 1, pp. 22–36, 2017.
[238] J. Chen, S. Sathe, C. Aggarwal, and D. Turaga, ‘‘Outlier detection with autoencoder ensembles,’’ in Proc. SIAM Int. Conf. Data Mining, Jul. 2017, pp. 90–98.
[239] T. Xiao, C. Zhang, and H. Zha, ‘‘Learning to detect anomalies in surveillance video,’’ IEEE Signal Process. Lett., vol. 22, no. 9, pp. 1477–1481, Sep. 2015.
[240] A. Koufakou and M. Georgiopoulos, ‘‘A fast outlier detection strategy for distributed high-dimensional data sets with mixed attributes,’’ Data Mining Knowl. Discovery, vol. 20, no. 2, pp. 259–289, 2010.
[241] C. C. Aggarwal and P. S. Yu, ‘‘Outlier detection with uncertain data,’’ in Proc. SIAM Int. Conf. Data Mining, 2008, pp. 483–493.

HONGZHI WANG is currently a Professor and a Doctoral Supervisor with the Harbin Institute of Technology. He was awarded the Microsoft Fellowship, the Chinese Excellent Database Engineer, and the IBM Ph.D. Fellowship. He has published more than 200 papers in refereed journals and conferences. His research interests include big data management, data mining, data quality, and graph data management.

MOHAMED JAWARD BAH received the B.Eng. degree in electrical and electronics engineering from the University of Sierra Leone, in 2013, and the M.Sc. degree in computer science from the Nanjing University of Information Science and Technology, China, in 2016. He is currently pursuing the Ph.D. degree with the Harbin Institute of Technology, China. His research interests include data mining and outlier detection in data streams and big data.

MOHAMED HAMMAD received the M.Sc. degree from the Information Technology Department, Faculty of Computers and Information, Menoufia University, Egypt, in 2015. He is currently pursuing the Ph.D. degree with the School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China. He has been a Demonstrator and an Assistant Lecturer with the Faculty of Computers and Information, Menoufia University, since April 2012. His research interests include computer vision, machine learning, pattern recognition, and biometrics.