Official code and data repository of [UADB: Unsupervised Anomaly Detection Booster]. Please star, watch, and fork UADB to follow active updates!
UADB is a booster for unsupervised anomaly detection (UAD) on tabular tasks. UADB is not a universal winner on every tabular task; however, it is a model-agnostic framework that can generally enhance any UAD model on all types of tabular datasets in a unified way.
Prepare (create Results first)
mkdir Results
Select the tabular data and the source UAD model to be enhanced
- modify config.py
Run UADB
python main.py
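The layout of `config.py` is not shown in this README; the only setting visible here is `pseudo_model`, which `get_init_labels` reads to pick the source detector. As a rough sketch, assuming a hypothetical `UADBConfig` container and a hypothetical `dataset` field (neither is taken from the repository), a selection could be checked like this before launching `main.py`:

```python
from dataclasses import dataclass

# Detector keys as they appear in the pseudo_models dictionary of this README.
SUPPORTED_PSEUDO_MODELS = (
    "pca", "iforest", "hbos", "ocsvm", "lof", "cblof", "cof",
    "knn", "sod", "ecod", "deep_svdd", "loda", "copod", "gmm",
)


@dataclass
class UADBConfig:
    """Hypothetical stand-in for the settings kept in config.py."""

    pseudo_model: str = "iforest"  # source UAD model to be boosted
    dataset: str = "1_abalone"     # hypothetical field: dataset to load

    def validate(self) -> None:
        # Fail early on a misspelled detector name instead of a KeyError later.
        if self.pseudo_model not in SUPPORTED_PSEUDO_MODELS:
            raise ValueError(
                f"unknown pseudo_model {self.pseudo_model!r}; "
                f"expected one of {SUPPORTED_PSEUDO_MODELS}"
            )


config = UADBConfig(pseudo_model="knn", dataset="6_cardio")
config.validate()
```

Checking the key up front is only a convenience; the actual field names and defaults should be taken from the repository's `config.py`.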
- Isolation Forest (IForest) isolates observations by randomly selecting a feature and a splitting point;
- Histogram-based outlier detection (HBOS) assumes feature independence and calculates the degree of outlyingness by building histograms;
- Local Outlier Factor (LOF) measures the local deviation of the density of a sample with respect to its neighbors;
- K-Nearest Neighbors (KNN) views an instance's distance to its kth nearest neighbor as the outlying score;
- Principal Component Analysis (PCA) is a linear dimensionality reduction that uses Singular Value Decomposition of the data to project it to a lower dimensional space. In anomaly detection, it projects the data to the lower dimensional space and then reconstructs it; the reconstruction errors are viewed as the anomaly scores;
- One-class SVM (OCSVM) maximizes the margin between the abnormal and the normal samples, and uses the hyperplane that determines the margin for decision;
- Clustering Based Local Outlier Factor (CBLOF) classifies the samples into small clusters and large clusters and then uses the distance among clusters as anomaly scores;
- Connectivity-Based Outlier Factor (COF) uses the ratio of a data point's average chaining distance to the mean of the average chaining distances of its k nearest neighbors as the outlier score;
- Subspace Outlier Detection (SOD) detects outliers in varying subspaces of a high dimensional feature space;
- Empirical-Cumulative-distribution-based Outlier Detection (ECOD) is a parameter-free, highly interpretable outlier detection algorithm based on empirical CDF functions;
- Gaussian Mixture Models (GMM) fits k Gaussians to the data, then calculates for each data point the probabilities of belonging to each of the clusters, where lower probabilities indicate higher anomaly scores;
- Lightweight on-line detector of anomalies (LODA) is an ensemble detector that is particularly useful in domains where a large number of samples need to be processed in real time, or where the data stream is subject to concept drift and the detector needs to be updated online;
- Copula Based Outlier Detector (COPOD) is a parameter-free, highly interpretable outlier detection algorithm based on empirical copula models;
- Deep Support Vector Data Description (DeepSVDD) trains a neural network while minimizing the volume of a hypersphere that encloses the network representations of the data; the distance of the transformed embedding to the hypersphere's center is used to calculate the anomaly score.
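To make one of these scoring rules concrete, here is a minimal plain-Python sketch of the KNN idea described above: each point is scored by the distance to its kth nearest neighbor. The helper name `knn_scores`, the toy points, and `k=2` are illustrative assumptions; this is not PyOD's implementation, which has its own defaults and options.

```python
import math


def knn_scores(points, k=2):
    """Score each point by the distance to its k-th nearest neighbor."""
    scores = []
    for i, p in enumerate(points):
        # Distances from p to every other point, sorted ascending.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores


# Four points on a unit square plus one point far away from them.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (8.0, 8.0)]
scores = knn_scores(points, k=2)

# The index with the largest score is the most outlying point.
outlier_index = max(range(len(points)), key=scores.__getitem__)
print(outlier_index, scores)
```

The far-away point receives the largest score because even its closest neighbors are distant, which is the intuition behind using the kth-neighbor distance as an outlying score.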
For all source UAD models, we use the default parameters from their original papers (which have been fine-tuned to achieve the best performance). Please refer to PyOD for more information. The following code shows how to import the UAD models. Please see the table below for the complete list of source UAD models included in UADB and links to their parameter settings.
from pyod.models.iforest import IForest
from pyod.models.hbos import HBOS
from pyod.models.pca import PCA
from pyod.models.ocsvm import OCSVM
from pyod.models.lof import LOF
from pyod.models.cblof import CBLOF
from pyod.models.cof import COF
from pyod.models.knn import KNN
from pyod.models.sod import SOD
from pyod.models.ecod import ECOD
from pyod.models.deep_svdd import DeepSVDD
from pyod.models.loda import LODA
from pyod.models.copod import COPOD
from pyod.models.gmm import GMM
from sklearn.preprocessing import MinMaxScaler

def get_init_labels(self):
    # Map each config key to a PyOD detector with default parameters.
    pseudo_models = {'pca':PCA(), 'iforest':IForest(), 'hbos':HBOS(), 'ocsvm':OCSVM(), 'lof':LOF(), 'cblof':CBLOF(), 'cof':COF(), 'knn':KNN(), 'sod':SOD(), 'ecod':ECOD(), 'deep_svdd':DeepSVDD(), 'loda':LODA(), 'copod':COPOD(), 'gmm':GMM()}
    # model = IForest()
    model = pseudo_models[self.config.pseudo_model]
    # Fit the source detector and score the same inputs.
    model.fit(self.inputs)
    score = model.decision_function(self.inputs)
    # Rescale the raw anomaly scores to [0, 1].
    score = MinMaxScaler().fit_transform(score.reshape(-1, 1))
    return score
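The last step of `get_init_labels` rescales the raw detector scores to the range [0, 1]. The sketch below reproduces that rescaling in plain Python for a single column of scores. The helper `min_max_scale` and the sample values are hypothetical, and returning zeros for constant input is a choice made for this sketch.

```python
def min_max_scale(scores):
    """Linearly map scores so the smallest becomes 0 and the largest 1."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # All scores identical: no spread to rescale, so return zeros.
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


# Made-up raw anomaly scores; larger means more anomalous.
raw_scores = [-0.2, 0.1, 0.4, 1.0]
scaled = min_max_scale(raw_scores)
print(scaled)
```

Because the mapping is monotonic, the ranking of samples by anomaly score is unchanged; only the scale differs, which makes scores from different detectors comparable.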
Model | Source |
---|---|
IForest | link |
HBOS | link |
LOF | link |
KNN | link |
PCA | link |
OCSVM | link |
CBLOF | link |
COF | link |
SOD | link |
ECOD | link |
GMM | link |
LODA | link |
COPOD | link |
DeepSVDD | link |
For the default UADB setup (i.e., 3-layer MLP, hidden dimension=128, epochs=10, batch size=256, learning rate=0.001, training iterations=10), the average runtime on 84 tabular datasets is 49 seconds, the minimum runtime is 32 seconds, and the maximum runtime is 65 seconds (evaluated on an NVIDIA Tesla V100 GPU with 16 GiB RAM).
Dataset | Runtime (seconds) |
---|---|
1_abalone | 45.97332 |
2_ALOI | 62.09925 |
3_annthyroid | 55.46907 |
4_Arrhythmia | 39.06063 |
5_breastw | 36.12518 |
6_cardio | 33.07165 |
7_Cardiotocography | 33.20133 |
9_concrete | 31.79388 |
10_cover | 64.64015 |
11_fault | 39.39834 |
12_glass | 37.71293 |
13_HeartDisease | 37.38352 |
14_Hepatitis | 38.01874 |
15_http | 61.8857 |
16_imgseg | 40.50281 |
17_InternetAds | 45.73133 |
18_Ionosphere | 37.77196 |
19_landsat | 53.47792 |
20_letter | 40.62032 |
21_Lymphography | 38.29522 |
23_mammography | 64.00619 |
24_mnist | 59.77169 |
25_musk | 45.249 |
26_optdigits | 48.52583 |
27_PageBlocks | 48.49409 |
28_Parkinson | 37.96288 |
29_pendigits | 54.03328 |
30_Pima | 38.34605 |
31_satellite | 53.66807 |
32_satimage-2 | 50.83709 |
33_shuttle | 61.62736 |
34_skin | 63.52546 |
35_smtp | 61.39091 |
36_SpamBase | 45.18679 |
37_speech | 47.31526 |
38_Stamps | 37.86243 |
39_thyroid | 46.09605 |
40_vertebral | 36.9513 |
41_vowels | 38.88513 |
42_Waveform | 45.00404 |
43_WBC | 38.99096 |
44_WDBC | 35.97541 |
45_Wilt | 46.98276 |
46_wine | 37.96173 |
47_WPBC | 37.3959 |
48_yeast | 39.1033 |
49_CIFAR10_0 | 52.26063 |
49_CIFAR10_1 | 52.14188 |
49_CIFAR10_2 | 45.87855 |
49_CIFAR10_3 | 46.25881 |
49_CIFAR10_4 | 46.14659 |
49_CIFAR10_5 | 52.32802 |
49_CIFAR10_6 | 51.16338 |
49_CIFAR10_7 | 53.35524 |
49_CIFAR10_8 | 53.4575 |
49_CIFAR10_9 | 50.95251 |
50_FashionMNIST_0 | 54.4895 |
50_FashionMNIST_1 | 53.37795 |
50_FashionMNIST_2 | 45.93535 |
50_FashionMNIST_3 | 47.02001 |
50_FashionMNIST_4 | 45.6286 |
50_FashionMNIST_5 | 46.80384 |
50_FashionMNIST_6 | 43.90277 |
50_FashionMNIST_7 | 45.52983 |
50_FashionMNIST_8 | 49.52233 |
50_FashionMNIST_9 | 49.5379 |
51_SVHN_0 | 53.37439 |
51_SVHN_1 | 55.47349 |
51_SVHN_2 | 63.5158 |
51_SVHN_3 | 56.23896 |
51_SVHN_4 | 56.96233 |
51_SVHN_5 | 51.96182 |
51_SVHN_6 | 51.20783 |
51_SVHN_7 | 64.19061 |
51_SVHN_8 | 52.99753 |
51_SVHN_9 | 63.43625 |
52_agnews_0 | 58.6838 |
52_agnews_1 | 56.56641 |
52_agnews_2 | 56.18832 |
52_agnews_3 | 63.5042 |
53_amazon | 59.95886 |
54_imdb | 64.06704 |
55_yelp | 58.62 |
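The summary figures quoted above (average, minimum, maximum) can be recomputed from the table rows. The following is a small sketch, assuming the rows are pasted as text in the same `Dataset | seconds |` layout; `parse_runtimes` is a hypothetical helper, and only three rows from the table are included here to keep the example short.

```python
from statistics import mean

# Three rows copied from the runtime table; paste the full table to summarize it.
table_rows = """\
Dataset | Runtime (seconds) |
---|---|
1_abalone | 45.97332 |
9_concrete | 31.79388 |
10_cover | 64.64015 |
"""


def parse_runtimes(text):
    """Return {dataset: seconds}, skipping header and separator rows."""
    runtimes = {}
    for line in text.splitlines():
        cells = [cell.strip() for cell in line.split("|")]
        if len(cells) < 2 or not cells[0]:
            continue
        try:
            runtimes[cells[0]] = float(cells[1])
        except ValueError:
            continue  # not a data row
    return runtimes


runtimes = parse_runtimes(table_rows)
fastest = min(runtimes, key=runtimes.get)
slowest = max(runtimes, key=runtimes.get)
average = mean(runtimes.values())
print(f"{len(runtimes)} datasets, fastest={fastest}, slowest={slowest}")
```

The printed values describe only the three sample rows, not the full benchmark.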