Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
License: arXiv.org perpetual non-exclusive license
arXiv:2403.14233v1 [cs.CV] 21 Mar 2024

SoftPatch: Unsupervised Anomaly Detection with Noisy Data

Xi Jiang11{}^{1}start_FLOATSUPERSCRIPT 1 end_FLOATSUPERSCRIPT
jiangx2020@mail.sustech.edu.cn
Jianlin Liu22{}^{2}start_FLOATSUPERSCRIPT 2 end_FLOATSUPERSCRIPT
jenningsliu@tencent.com
Jinbao Wang11{}^{1}start_FLOATSUPERSCRIPT 1 end_FLOATSUPERSCRIPT11footnotemark: 1
wangjb@sustech.edu.cn
Qian Nie22{}^{2}start_FLOATSUPERSCRIPT 2 end_FLOATSUPERSCRIPT
stephennie@tencent.com
Kai Wu22{}^{2}start_FLOATSUPERSCRIPT 2 end_FLOATSUPERSCRIPT
lloydwu@tencent.com
Yong Liu22{}^{2}start_FLOATSUPERSCRIPT 2 end_FLOATSUPERSCRIPT
choasliu@tencent.com
Chengjie Wang22{}^{2}start_FLOATSUPERSCRIPT 2 end_FLOATSUPERSCRIPT
jasoncjwang@tencent.com
Feng Zheng11{}^{1}start_FLOATSUPERSCRIPT 1 end_FLOATSUPERSCRIPT
zhengf@sustech.edu.cn
\affiliations11{}^{1}start_FLOATSUPERSCRIPT 1 end_FLOATSUPERSCRIPTSouthern University of Science and Technology,
Department of Computer Science and Engineering
22{}^{2}start_FLOATSUPERSCRIPT 2 end_FLOATSUPERSCRIPTTencent, Youtu Lab
Corresponding Author
Abstract

Although mainstream unsupervised anomaly detection (AD) algorithms perform well in academic datasets, their performance is limited in practical application due to the ideal experimental setting of clean training data. Training with noisy data is an inevitable problem in real-world anomaly detection but is seldom discussed. This paper considers label-level noise in image sensory anomaly detection for the first time. To solve this problem, we proposed a memory-based unsupervised AD method, SoftPatch, which efficiently denoises the data at the patch level. Noise discriminators are utilized to generate outlier scores for patch-level noise elimination before coreset construction. The scores are then stored in the memory bank to soften the anomaly detection boundary. Compared with existing methods, SoftPatch maintains a strong modeling ability of normal data and alleviates the overconfidence problem in coreset. Comprehensive experiments in various noise scenes demonstrate that SoftPatch outperforms the state-of-the-art AD methods on the MVTecAD and BTAD benchmarks and is comparable to those methods under the setting without noise.

1 Introduction

Detecting anomalies by only nominal images without annotation is an appealing topic, especially in industrial applications where defects can be extremely tiny and hard to collect. Unsupervised sensory anomaly detection, also called covariate shift detection [1; 2], is proposed to solve this problem and has been largely explored. Recent deep learning methods [3; 4; 5; 6; 7] usually model the AD problem as a one-class learning problem and employ computer visual tricks to improve the perception where a clean nominal training set is provided to extract representative features. Most previous unsupervised AD methods have to measure the distance between the test sample and the standard dataset distribution to determine whether a sample differs from the standard dataset. Even though recent methods have achieved excellent performance, they all rely on the clean training set to extract nominal features for later comparison with anomalous features. Putting too much faith in training data can lead to pitfalls. If the standard normal dataset is polluted with noisy data, i.e., the defective samples, the estimated boundary will be unreliable, and the classification for abnormal data will have low accuracy. In general, current unsupervised AD methods are not designed for and are not robust to noisy data.

However, in real-world practice, it is inevitable that there are noises that sneak into the standard normal dataset, especially for industrial manufacturing, where a large number of products are produced daily. This noise usually comes from the inherent data shift or human misjudgment. Meanwhile, existing unsupervised AD methods [8; 9; 10] are susceptible to noisy data due to their exhaustive strategy to model the training set. As in Fig. 1, noisy samples easily misinform those overconfident AD algorithms, so algorithms misclassify similar anomaly samples in the test set and generate wrong locations. Additionally, AD with noisy data can be developed to a fully unsupervised setting, which discards the implicit supervised signal that the training set is all defect-free, compared with the previous unsupervised setting in AD. This setting helps to expand more industrial quality inspection scenarios, i.e., rapid deployment to new production lines without data filtration.

In this paper, we first point out the significance of studying noisy data problems in AD and especially in unsupervised sensory AD. Our solution is inspired by one of the recent state-of-the-art methods, PatchCore [8]. PatchCore proposed a method to subsample the original CNN features of the standard normal dataset with the nearest searching and establish a smaller coreset as a memory bank. However, the coreset selection and classification process are vulnerable to polluted data. In this regard, we propose a patch-level selection strategy to wipe off the noisy image patch of noisy samples. Compared to conventional sample-level denoising, the abnormal patches are separated, and the normal patches of a noise sample are exploited in coreset. Specifically, the denoising algorithm assigns an outlier factor to each patch to be selected into coreset. Based on the patch-level denoising, we propose a novel AD algorithm with better noise robustness named SoftPatch. Considering noisy samples are hard to be removed completely, SoftPatch utilizes the outlier factor to re-weight the coreset examples. Patch-level denoising and re-weighting the coreset samples are proved effective in revising misaligned knowledge and alleviating the overconfidence of coreset in inference. Extensive experiments in various noise scenes demonstrate that SoftPatch outperforms the state-of-the-art (SOTA) AD methods on MVTec Anomaly Detection (MVTecAD) [11] benchmark. Meanwhile, due to the noise in existing datasets, SoftPatch achieves optimal results on the original BTAD  [12] dataset. The code can be found in https://github.com/TencentYoutuResearch/AnomalyDetection-SoftPatch.

Our main contributions are summarized as follows:

  • \bullet

    To the best of our knowledge, we are the first to focus on the image sensory anomaly detection with noisy data, which is a more practical setting but seldom investigated. Existing image sensory AD methods fully trust the training set’s cleanliness, leading to their performance degradation in noise interference.

  • \bullet

    We propose a patch-level denoising strategy for coreset memory bank, which essentially improves the data usage rate compared to conventional sample-level denoising. Based on this strategy, we apply three noise discriminators which strengthen model robustness by combining the re-weighting of coreset.

  • \bullet

    We set a baseline for unsupervised AD with noisy data, which performs well in the settings with additional noisy data and the general settings without noise, providing a new view for further research.

Refer to caption
Figure 1: Illustration of SoftPatch. Unlike previous methods that construct coreset without considering the negative effect of noisy data, SoftPatch wipes off easy noisy data to formulate a clean training set and alleviates hard noisy data’s impact by soft-reweighting.

2 Related Work

2.1 Unsupervised Anomaly Detection

Training with agent tasks. Also known as self-supervised learning, agent tasks is a viable solution when there is no category and shape information of anomalies.  Sheynin et al. [13] employ transformations such as horizontal flip, shift, rotation, and gray-scale change after a multi-scale generative model to enhance the representation learning.  Li et al. [14] mention that naively applying existing self-supervised tasks is sub-optimal for detecting local defects and propose a novelty agent task named CutPaste, which simulates an abnormal sample by clipping a patch of a standard image and pasting it back at a random location. Similarity, DRAEM [15] synthesizes anomalies through Perlin Noise. Nevertheless, the inevitable discrepancy between the synthetic anomaly and the real anomaly disturbs the criteria of the model and limits the generalization performance. The gap between anomalies is usually larger than that between anomaly and normal. This is why AD methods deceived by some noisy samples can still work well when handling other kinds of anomalies.

Agnostic methods. Including knowledge distillation and image reconstruction, agnostic methods based on a theory that models that have never seen anomalies will behave differently in inference when inputting both normal and anomaly samples. Knowledge distillation is ingeniously used in anomaly detection.  Bergmann et al. [16] propose that the representations of unusual patches are different between a pretrained teacher model and a student model, which tried its best to simulate teacher output with an anomaly-free training set. Based on this theory,  Salehi et al. [17] propose that considering multiple intermediate outputs in distillation and using a smaller student network lead to a better result. Reverse distillation [18] uses a reverse flow that avoids the confusion caused by the same filters and prevents the propagation of anomaly perturbation to the student model, whose structure is similar to reconstruction networks. Image Reconstruction methods [7; 19; 20] utilize the assumption that the reconstruction network trained in the normal set can not reconstruct the anomaly part. A high-resolution result can be obtained by comparing the differences between the reconstructed and original images. However, all agnostic methods need long training stages, which limit their usage, i.e., the rapid deployment assumption in fully unsupervised learning.

Feature modeling. We specifically refer to the direct modeling of the output features of the extractor, including distribution estimation [21; 22], distribution transformation [23; 9], pre-trained model adaption [24; 25], and memory storage [26; 8]. PaDiM [21] utilizes multivariate Gaussian distributions to estimate the patch embedding of nominal data. In the inference stage, the embedding of irregular patches will be out of distribution. It is a simple but efficient method, but Gaussian distribution is inadequate for more complex data cases. So to enhance the estimation of density, DifferNet [23] and CFLOW [9] leverage the reversible normalizing flows based on multi-scale representation.  Hou et al. [26] proposed that the granularity of division on feature maps is closely related to the reconstruction capability of the model for both normal and abnormal samples. So a multi-scale block-wise memory bank is embedded into an autoencoder network as a model of past data. PatchCore [8] is a more explicit but valuable memory-based method, which stores the sub-sampled patch features in the memory bank and calculates the nearest neighbor distance between the test feature and the coreset as an anomaly score. Although PatchCore is outperformance in the typical setting, it is overconfident in the training set, which leads to poor noise robustness.

2.2 Learning with Noisy Data

Noisy label recognition is becoming an emerging topic for supervised learning but has rarely been explored in unsupervised anomaly detection because there is no apparent label. For classification, some research [27; 28] propose to filter noisy pseudo-labeled data with a high confidence threshold. Li et al. [29] selects noisy-labeled data with a mixture model and trains in a semi-supervised manner.  Kong et al. [30] relabel harmful training samples. For object detection, multi-augmentation [31], teacher-student [32], or contrastive learning [33] are adopted to alleviate noise with the help of the expert model’s knowledge. However, current noisy label recognition methods all rely on labeled data to co-rectify noisy data. In comparison, we target to improve the model’s noise robustness in an unsupervised manner without introducing labor annotations.

While there are some model robustness researches on unsupervised AD, their objects and tasks are distinguished from our work since “anomaly detection” is an overloaded term. A recent survey [34] explores the model robustness of 30 AD algorithms. Nevertheless, unsupervised methods are excluded from the annotation errors setting.  Pang et al. [35] deals with video anomaly without manually labeled data where information in consecutive frames can be exploited. While our work tackle anomaly detection from a single image. Other related papers [36; 37; 38] eliminate noisy and corrupted data in semantic anomaly detection. Unlike semantic anomaly detection, we focus on image sensory anomaly detection [1], which has recently raised much concern and contains a new task, anomaly localization. Although some existing methods [39; 40; 41; 42] treat covariate shift the same way they treat semantic shift and enhance model robustness with universal processes, their basic structures are poor compared with rapid-developed sensory AD methods, which leads to the robustness improvement insignificant. Noise in image sensory anomaly detection is more similar to the normal data and brings more challenges.

Refer to caption
Figure 2: Overview of the proposed method. In the training phase, the noises are distinguished at patch level at each position of the feature map by a noise discriminator. The deeper color a patch node has, the higher probability that it is a noise patch. After achieving outlier scores for all patches, the top τ𝜏\tauitalic_τ% patches with the highest outlier score are removed. The coreset is a subset of remaining patches after denoising. Different from other methods, our memory bank consists of the samples in coreset and their outlier scores which are stored as soft weights. Soft weights will be further utilized to re-weight the anomaly score in inference.

3 The Proposed Method

3.1 Overview

Patch-based unsupervised anomaly methods, such as PatchCore [8] and CFA  [25], have three main processes: feature extraction, coreset selection with memory bank construction, and anomaly detection. One of the important assumptions is that the training set only contains nominal images, and the coreset should have full coverage of the entire training data distribution. During the test, an incoming image will directly search in the memory bank for similar features, and the anomaly score is the dissimilarity with the nearest patches. The searching process may collapse if the assumed clean full coverage memory bank contains noise. Therefore, we propose SoftPatch, which filters noisy data by a noise discriminator before coreset construction and softens the searching process for down-weighting the hard unfiltered noisy samples.

The general denoising methods against the label contamination at the sample level are sub-optimal in image sensory anomaly detection. The abnormalities in image sensory AD, represented by industrial defect detection and medical image analysis, usually occupy only a tiny area of the image. At the sample level, noisy data is hard to distinguish, but the sample’s inherent deviation may be more remarkable. So we propose a patch-level denoising strategy that works on the feature space to judge the noisy patch better. First, the collected features are grouped according to position to narrow the domain of noise discrimination. Then we insert the patch-level noise discrimination process before the coreset sampling, which generates the noise score according to the feature distribution of each position. Since most areas of the noisy image are usually anomaly-free, we remove those noisy patches and retain the rest to maximize the use of data. At the same time, the rest denoising scores reflecting the behavior of clustering are used to scale the anomaly score in inference. The other parts of the algorithm, such as feature extraction, dimension reduction, coreset sampling, and nearest neighbor search, follow the baseline PatchCore[8]. Figure 2 shows the framework of SoftPatch.

The target of image-level denoising is to find 𝒳noisesubscript𝒳𝑛𝑜𝑖𝑠𝑒\mathcal{X}_{noise}caligraphic_X start_POSTSUBSCRIPT italic_n italic_o italic_i italic_s italic_e end_POSTSUBSCRIPT from 𝒳𝒳\mathcal{X}caligraphic_X, where 𝒳={xi:i[1,N],xiC×H×W}𝒳conditional-setsubscript𝑥𝑖formulae-sequence𝑖1𝑁subscript𝑥𝑖superscript𝐶𝐻𝑊\mathcal{X}=\{x_{i}:i\in[1,N],x_{i}\in\mathbb{R}^{C\times H\times W}\}caligraphic_X = { italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT : italic_i ∈ [ 1 , italic_N ] , italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_C × italic_H × italic_W end_POSTSUPERSCRIPT } denotes training images (channels C𝐶Citalic_C, height H𝐻Hitalic_H, width W𝑊Witalic_W). Following convention in existing work[8], we use ϕic*×h*×w*subscriptitalic-ϕ𝑖superscriptsuperscript𝑐superscriptsuperscript𝑤\phi_{i}\in\mathbb{R}^{c^{*}\times h^{*}\times w^{*}}italic_ϕ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_c start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT × italic_h start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT × italic_w start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT end_POSTSUPERSCRIPT as the feature map (channels c*superscript𝑐c^{*}italic_c start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT, height h*superscripth^{*}italic_h start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT, width w*superscript𝑤w^{*}italic_w start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT) of image xi𝒳subscript𝑥𝑖𝒳x_{i}\in\mathcal{X}italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∈ caligraphic_X, ϕi(h,w)c*subscriptitalic-ϕ𝑖𝑤superscriptsuperscript𝑐\phi_{i}(h,w)\in\mathbb{R}^{c^{*}}italic_ϕ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_h , italic_w ) ∈ blackboard_R start_POSTSUPERSCRIPT italic_c start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT end_POSTSUPERSCRIPT as the patch at (h,w)𝑤(h,w)( italic_h , italic_w ) on the aggregated feature map with dimension c𝑐citalic_c .

3.2 Noise Discriminative Coreset Selection

With increasing training images, the features memory can become exceedingly large and infeasible to discriminate noise by overall statistics. Therefore, we group all features by position and count their outlier scores. Then all the scores are aggregated to determine noise patches, after which we just remove the features with top τ𝜏\tauitalic_τ percent scores. We apply three noise reduction methods in total.

3.2.1 Nearest Neighbor

With the assumption that the amount of noisy samples Xnoisesubscript𝑋𝑛𝑜𝑖𝑠𝑒X_{noise}italic_X start_POSTSUBSCRIPT italic_n italic_o italic_i italic_s italic_e end_POSTSUBSCRIPT is much less than clean samples Xnominalsubscript𝑋𝑛𝑜𝑚𝑖𝑛𝑎𝑙X_{nominal}italic_X start_POSTSUBSCRIPT italic_n italic_o italic_m italic_i italic_n italic_a italic_l end_POSTSUBSCRIPT, we set Nearest neighbor distance as our baseline  [43] where a large distance means an outlier. Given a set of images, ϕN×c*×h*×w*italic-ϕsuperscript𝑁superscript𝑐superscriptsuperscript𝑤\phi\in\mathbb{R}^{N\times c^{*}\times h^{*}\times w^{*}}italic_ϕ ∈ blackboard_R start_POSTSUPERSCRIPT italic_N × italic_c start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT × italic_h start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT × italic_w start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT end_POSTSUPERSCRIPT represents all features. Each patch’s nearest neighbor distance 𝒲innsuperscriptsubscript𝒲𝑖𝑛𝑛\mathcal{W}_{i}^{nn}caligraphic_W start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n italic_n end_POSTSUPERSCRIPT is defined as:

𝒲inn(h,w)=minn[1,N],niϕi(h,w)ϕn(h,w)2,subscriptsuperscript𝒲𝑛𝑛𝑖𝑤subscriptformulae-sequence𝑛1𝑁𝑛𝑖subscriptnormsubscriptitalic-ϕ𝑖𝑤subscriptitalic-ϕ𝑛𝑤2\mathcal{W}^{nn}_{i}(h,w)=\min_{n\in[1,N],n\neq i}\|\phi_{i}(h,w)-\phi_{n}(h,w% )\|_{2},caligraphic_W start_POSTSUPERSCRIPT italic_n italic_n end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_h , italic_w ) = roman_min start_POSTSUBSCRIPT italic_n ∈ [ 1 , italic_N ] , italic_n ≠ italic_i end_POSTSUBSCRIPT ∥ italic_ϕ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_h , italic_w ) - italic_ϕ start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ( italic_h , italic_w ) ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , (1)

We first calculate the distances, then take the minimum among batch dimensions (neighbor) as Wnnsuperscript𝑊𝑛𝑛W^{nn}italic_W start_POSTSUPERSCRIPT italic_n italic_n end_POSTSUPERSCRIPT. Previous methods [8; 25] have proved that the minimum feature distance from a pretrained network can be an indicator to discriminate anomaly. This method can discriminate apparent outliers but suffer from uneven distribution of different clusters, where some clusters can have large inter-distance and lead to being mistakenly threshed as noisy data. To treat all clusters equally, we propose another multi-variate Gaussian method to calculate the outlier score without the interference of different clusters’ densities.

3.2.2 Multi-Variate Gaussian

With Gaussian’s normalizing effect, all clean images’ features can be treated equally. To apply Gaussian distribution on image characteristics dynamically, we calculate the inlier probabilities on the batch dimension for each patch ϕi(h,w)subscriptitalic-ϕ𝑖𝑤\phi_{i}(h,w)italic_ϕ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_h , italic_w ), similar to 3.2.1. The multi-variate Gaussian distribution (μh,w,Σh,w)subscript𝜇𝑤subscriptΣ𝑤\mathbb{N}(\mu_{h,w},\Sigma_{h,w})blackboard_N ( italic_μ start_POSTSUBSCRIPT italic_h , italic_w end_POSTSUBSCRIPT , roman_Σ start_POSTSUBSCRIPT italic_h , italic_w end_POSTSUBSCRIPT ) can be formulated that μh,wsubscript𝜇𝑤\mu_{h,w}italic_μ start_POSTSUBSCRIPT italic_h , italic_w end_POSTSUBSCRIPT is the batch mean of ϕi(h,w)subscriptitalic-ϕ𝑖𝑤\phi_{i}(h,w)italic_ϕ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_h , italic_w ) and sample covariance Σh,wsubscriptΣ𝑤{\Sigma}_{h,w}roman_Σ start_POSTSUBSCRIPT italic_h , italic_w end_POSTSUBSCRIPT is:

Σh,w=1N1n=1N(ϕn(h,w)μh,w)(ϕn(h,w)μh,w)T)+ϵI,{{\Sigma}_{h,w}}=\frac{1}{N-1}\sum_{n=1}^{N}{(\phi_{n}(h,w)-\mu_{h,w})(\phi_{n% }(h,w)-\mu_{h,w})^{T})+\epsilon I},roman_Σ start_POSTSUBSCRIPT italic_h , italic_w end_POSTSUBSCRIPT = divide start_ARG 1 end_ARG start_ARG italic_N - 1 end_ARG ∑ start_POSTSUBSCRIPT italic_n = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT ( italic_ϕ start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ( italic_h , italic_w ) - italic_μ start_POSTSUBSCRIPT italic_h , italic_w end_POSTSUBSCRIPT ) ( italic_ϕ start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ( italic_h , italic_w ) - italic_μ start_POSTSUBSCRIPT italic_h , italic_w end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT ) + italic_ϵ italic_I , (2)

where the regularization term ϵIitalic-ϵ𝐼\epsilon Iitalic_ϵ italic_I makes h,wsubscript𝑤{\sum}_{h,w}∑ start_POSTSUBSCRIPT italic_h , italic_w end_POSTSUBSCRIPT full rank and invertible [21]. Finally, with the estimated multi-variate Gaussian distribution 𝒩(μh,w,h,w)𝒩subscript𝜇𝑤subscript𝑤\mathcal{N}(\mu_{h,w},\sum_{h,w})caligraphic_N ( italic_μ start_POSTSUBSCRIPT italic_h , italic_w end_POSTSUBSCRIPT , ∑ start_POSTSUBSCRIPT italic_h , italic_w end_POSTSUBSCRIPT ). Notice that we do not exclude the xisubscript𝑥𝑖x_{i}italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT when calculating the 𝒩𝒩\mathcal{N}caligraphic_N, so all samples can share a joint distribution, which is much less computationally intensive than doing the calculations one by one. Mahalanobis distance is calculated as the noisy magnitude 𝒲imvg(h,w)superscriptsubscript𝒲𝑖𝑚𝑣𝑔𝑤\mathcal{W}_{i}^{mvg}(h,w)caligraphic_W start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_m italic_v italic_g end_POSTSUPERSCRIPT ( italic_h , italic_w ) of each patch:

𝒲imvg(h,w)=(ϕi(h,w)μ(h,w))TΣh,w1(ϕi(h,w)μ(h,w)).superscriptsubscript𝒲𝑖𝑚𝑣𝑔𝑤superscriptsubscriptitalic-ϕ𝑖𝑤subscript𝜇𝑤𝑇superscriptsubscriptΣ𝑤1subscriptitalic-ϕ𝑖𝑤subscript𝜇𝑤\mathcal{W}_{i}^{mvg}(h,w)=\sqrt{(\phi_{i}(h,w)-\mu_{(h,w)})^{T}{\Sigma}_{h,w}% ^{-1}(\phi_{i}(h,w)-\mu_{(h,w)})}.caligraphic_W start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_m italic_v italic_g end_POSTSUPERSCRIPT ( italic_h , italic_w ) = square-root start_ARG ( italic_ϕ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_h , italic_w ) - italic_μ start_POSTSUBSCRIPT ( italic_h , italic_w ) end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT roman_Σ start_POSTSUBSCRIPT italic_h , italic_w end_POSTSUBSCRIPT start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT ( italic_ϕ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_h , italic_w ) - italic_μ start_POSTSUBSCRIPT ( italic_h , italic_w ) end_POSTSUBSCRIPT ) end_ARG . (3)

A high Mahalanobis distance means a high outlier score. Even though Gaussian distribution normalizes and captures the essence of image characteristics, small feature clusters may be overwhelmed by large feature clusters. In the scenario of a prominent feature cluster and a small cluster in a batch, the small cluster may be out of 1-, 2- or 3-ΣΣ\Sigmaroman_Σ of calculated (μh,w,Σh,w)subscript𝜇𝑤subscriptΣ𝑤\mathbb{N}(\mu_{h,w},\Sigma_{h,w})blackboard_N ( italic_μ start_POSTSUBSCRIPT italic_h , italic_w end_POSTSUBSCRIPT , roman_Σ start_POSTSUBSCRIPT italic_h , italic_w end_POSTSUBSCRIPT ) and erroneously classified as outliers. After analyzing the above two methods, we need a method that can: 1. treat all image characteristics equally; 2. treat large and small clusters equally; 3. high dimension calculation applicable.

3.2.3 Local Outlier Factor (LOF)

LOF[44] is a local-density-based outlier detector used mainly on E-commerce for criminal activity detection. Inspired by LOF, we can solve above mentioned three questions in 3.2.2: 1. Calculating the relative density of each cluster can normalize different density clusters; 2. Using local k-distance as a metric to alleviate the overwhelming effect of large clusters; 3. Modeling distance as normalized feature distance can be used on high-dimensional patch features. Therefore, the k-distance-based absolute local reachability density lrdi(h,w)𝑙𝑟subscript𝑑𝑖𝑤lrd_{i}(h,w)italic_l italic_r italic_d start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_h , italic_w ) is first calculated as:

lrdi(h,w)=1/(b𝒩k(ϕi(h,w))distkreach(ϕi(h,w),ϕb(h,w))|𝒩k(ϕi(h,w))|),𝑙𝑟subscript𝑑𝑖𝑤1subscript𝑏subscript𝒩𝑘subscriptitalic-ϕ𝑖𝑤𝑑𝑖𝑠subscriptsuperscript𝑡𝑟𝑒𝑎𝑐𝑘subscriptitalic-ϕ𝑖𝑤subscriptitalic-ϕ𝑏𝑤subscript𝒩𝑘subscriptitalic-ϕ𝑖𝑤lrd_{i}(h,w)=1/(\frac{\sum_{b\in\mathcal{N}_{k}(\phi_{i}(h,w))}dist^{reach}_{k% }(\phi_{i}(h,w),\phi_{b}(h,w))}{|\mathcal{N}_{k}(\phi_{i}(h,w))|}),italic_l italic_r italic_d start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_h , italic_w ) = 1 / ( divide start_ARG ∑ start_POSTSUBSCRIPT italic_b ∈ caligraphic_N start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ( italic_ϕ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_h , italic_w ) ) end_POSTSUBSCRIPT italic_d italic_i italic_s italic_t start_POSTSUPERSCRIPT italic_r italic_e italic_a italic_c italic_h end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ( italic_ϕ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_h , italic_w ) , italic_ϕ start_POSTSUBSCRIPT italic_b end_POSTSUBSCRIPT ( italic_h , italic_w ) ) end_ARG start_ARG | caligraphic_N start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ( italic_ϕ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_h , italic_w ) ) | end_ARG ) , (4)
distkreach(ϕi(h,w),ϕb(h,w))=max(distk(ϕb(h,w)),d(ϕi(h,w),ϕb(h,w))),𝑑𝑖𝑠subscriptsuperscript𝑡𝑟𝑒𝑎𝑐𝑘subscriptitalic-ϕ𝑖𝑤subscriptitalic-ϕ𝑏𝑤𝑚𝑎𝑥𝑑𝑖𝑠subscript𝑡𝑘subscriptitalic-ϕ𝑏𝑤𝑑subscriptitalic-ϕ𝑖𝑤subscriptitalic-ϕ𝑏𝑤dist^{reach}_{k}(\phi_{i}(h,w),\phi_{b}(h,w))=max(dist_{k}(\phi_{b}(h,w)),d(% \phi_{i}(h,w),\phi_{b}(h,w))),italic_d italic_i italic_s italic_t start_POSTSUPERSCRIPT italic_r italic_e italic_a italic_c italic_h end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ( italic_ϕ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_h , italic_w ) , italic_ϕ start_POSTSUBSCRIPT italic_b end_POSTSUBSCRIPT ( italic_h , italic_w ) ) = italic_m italic_a italic_x ( italic_d italic_i italic_s italic_t start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ( italic_ϕ start_POSTSUBSCRIPT italic_b end_POSTSUBSCRIPT ( italic_h , italic_w ) ) , italic_d ( italic_ϕ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_h , italic_w ) , italic_ϕ start_POSTSUBSCRIPT italic_b end_POSTSUBSCRIPT ( italic_h , italic_w ) ) ) , (5)

where d(ϕi(h,w),ϕb(h,w))𝑑subscriptitalic-ϕ𝑖𝑤subscriptitalic-ϕ𝑏𝑤d(\phi_{i}(h,w),\phi_{b}(h,w))italic_d ( italic_ϕ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_h , italic_w ) , italic_ϕ start_POSTSUBSCRIPT italic_b end_POSTSUBSCRIPT ( italic_h , italic_w ) ) is L2-norm, distk(ϕi(h,w))𝑑𝑖𝑠subscript𝑡𝑘subscriptitalic-ϕ𝑖𝑤dist_{k}(\phi_{i}(h,w))italic_d italic_i italic_s italic_t start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ( italic_ϕ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_h , italic_w ) ) is the distance of kth-neighbor, 𝒩k(ϕi(h,w))subscript𝒩𝑘subscriptitalic-ϕ𝑖𝑤\mathcal{N}_{k}(\phi_{i}(h,w))caligraphic_N start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ( italic_ϕ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_h , italic_w ) ) is the set of k-nearest neighbors of ϕi(h,w)subscriptitalic-ϕ𝑖𝑤\phi_{i}(h,w)italic_ϕ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_h , italic_w ) and |𝒩k(ϕi(h,w))|subscript𝒩𝑘subscriptitalic-ϕ𝑖𝑤|\mathcal{N}_{k}(\phi_{i}(h,w))|| caligraphic_N start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ( italic_ϕ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_h , italic_w ) ) | is the number of the set which usually equal k when without repeated neighbors. With the local reachability density of each patch, the overwhelming effect of large clusters is largely reduced. To normalize local density to relative density for treating all clusters equally, the relative density 𝒲iLOFsuperscriptsubscript𝒲𝑖𝐿𝑂𝐹\mathcal{W}_{i}^{LOF}caligraphic_W start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_L italic_O italic_F end_POSTSUPERSCRIPT of image i𝑖iitalic_i is defined below:

𝒲iLOF(h,w)=b𝒩k(ϕi(h,w))Irdb((h,w))|𝒩k(ϕi(h,w))|Irdi(h,w).subscriptsuperscript𝒲𝐿𝑂𝐹𝑖𝑤subscript𝑏subscript𝒩𝑘subscriptitalic-ϕ𝑖𝑤𝐼𝑟subscript𝑑𝑏𝑤subscript𝒩𝑘subscriptitalic-ϕ𝑖𝑤𝐼𝑟subscript𝑑𝑖𝑤\mathcal{W}^{LOF}_{i}(h,w)=\frac{\sum_{b\in\mathcal{N}_{k}(\phi_{i}(h,w))}Ird_% {b}((h,w))}{|\mathcal{N}_{k}(\phi_{i}(h,w))|\cdot Ird_{i}(h,w)}.caligraphic_W start_POSTSUPERSCRIPT italic_L italic_O italic_F end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_h , italic_w ) = divide start_ARG ∑ start_POSTSUBSCRIPT italic_b ∈ caligraphic_N start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ( italic_ϕ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_h , italic_w ) ) end_POSTSUBSCRIPT italic_I italic_r italic_d start_POSTSUBSCRIPT italic_b end_POSTSUBSCRIPT ( ( italic_h , italic_w ) ) end_ARG start_ARG | caligraphic_N start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ( italic_ϕ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_h , italic_w ) ) | ⋅ italic_I italic_r italic_d start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_h , italic_w ) end_ARG . (6)

𝒲iLOF(h,w)subscriptsuperscript𝒲𝐿𝑂𝐹𝑖𝑤\mathcal{W}^{LOF}_{i}(h,w)caligraphic_W start_POSTSUPERSCRIPT italic_L italic_O italic_F end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_h , italic_w ) is the relative density of the neighbors over patch’s own, and represents as a patch’s confidence of inlier. Our experiments found that all three noise reduction methods above are helpful in data pre-selection before coreset construction, while LOF𝐿𝑂𝐹LOFitalic_L italic_O italic_F provides the best performance. However, after visualization of our cleaned training set, we found that hard noisy samples, which are similar to nominal samples, are still hidden in the dataset. To further alleviate the effect of noisy data, we propose a soft re-weighting method that can down-weight noisy samples according to outlier scores.

3.3 Anomaly Detection based on SoftPatch

Besides the construction of the Coreset, outlier factors of all the selected patches are stored as soft weights in the memory bank. With the denoised patch-level memory bank \mathcal{M}caligraphic_M as shown in figure 2, the image-level anomaly score s𝑠s\in\mathbb{R}italic_s ∈ blackboard_R can be calculated for a test sample xi𝒳testsubscript𝑥𝑖superscript𝒳𝑡𝑒𝑠𝑡x_{i}\in\mathcal{X}^{test}italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∈ caligraphic_X start_POSTSUPERSCRIPT italic_t italic_e italic_s italic_t end_POSTSUPERSCRIPT by nearest neighbor searching at patch level. Denoting the collection of patch features of a test sample as 𝒫(xi)𝒫subscript𝑥𝑖\mathcal{P}(x_{i})caligraphic_P ( italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ), for each patch ph,w𝒫xisubscript𝑝𝑤subscript𝒫subscript𝑥𝑖p_{h,w}\in\mathcal{P}_{x_{i}}italic_p start_POSTSUBSCRIPT italic_h , italic_w end_POSTSUBSCRIPT ∈ caligraphic_P start_POSTSUBSCRIPT italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT the nearest neighbor searching can be formulated as the following equation:

m*=argminmpm2.superscript𝑚subscript𝑚subscriptnorm𝑝𝑚2m^{*}=\mathop{\arg\min}\limits_{m\in\mathcal{M}}\|p-m\|_{2}.italic_m start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT = start_BIGOP roman_arg roman_min end_BIGOP start_POSTSUBSCRIPT italic_m ∈ caligraphic_M end_POSTSUBSCRIPT ∥ italic_p - italic_m ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT . (7)

After nearest searching, pairs of test patch and its corresponding nearest neighbor in \mathcal{M}caligraphic_M can be achieved as (p,m*)𝑝superscript𝑚(p,m^{*})( italic_p , italic_m start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT ). For each patch pi,j𝒫xisubscript𝑝𝑖𝑗subscript𝒫subscript𝑥𝑖p_{i,j}\in\mathcal{P}_{x_{i}}italic_p start_POSTSUBSCRIPT italic_i , italic_j end_POSTSUBSCRIPT ∈ caligraphic_P start_POSTSUBSCRIPT italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT, the patch-level anomaly score is calculated by

sh,w=𝒲m*ph,wm*2,subscript𝑠𝑤subscript𝒲superscript𝑚subscriptnormsubscript𝑝𝑤superscript𝑚2s_{h,w}=\mathcal{W}_{m^{*}}\|p_{h,w}-m^{*}\|_{2},italic_s start_POSTSUBSCRIPT italic_h , italic_w end_POSTSUBSCRIPT = caligraphic_W start_POSTSUBSCRIPT italic_m start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT end_POSTSUBSCRIPT ∥ italic_p start_POSTSUBSCRIPT italic_h , italic_w end_POSTSUBSCRIPT - italic_m start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , (8)

where 𝒲m*subscript𝒲superscript𝑚\mathcal{W}_{m^{*}}caligraphic_W start_POSTSUBSCRIPT italic_m start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT end_POSTSUBSCRIPT is the soft weight calculated by one noise discriminator. The image-level anomaly score is attained by finding the largest soft weights re-weighted patch-level anomaly score: s*=max(h,w)sh,wsuperscript𝑠subscript𝑤subscript𝑠𝑤s^{*}=\mathop{\max}\limits_{(h,w)}s_{h,w}italic_s start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT = roman_max start_POSTSUBSCRIPT ( italic_h , italic_w ) end_POSTSUBSCRIPT italic_s start_POSTSUBSCRIPT italic_h , italic_w end_POSTSUBSCRIPT.

Different from PatchCore which directly considers patches equally, SoftPatch softens anomaly scores by noise level from noise discriminater. The soft weights, i.e., local outlier factors, have considered the local relationship around the nearest node. Thus, a similar effect can be achieved as PatchCore but with more noise robustness and fewer searches. According to the image-level anomaly score, a sample is classified into a normal sample or an abnormal sample.

4 Experiments

4.1 Experimental Details

Datasets.

Our experiments are mainly conducted on the MVTecAD and BTAD benchmarks[11; 12]. MVTecAD contains 15 categories with 3629 training images and 1725 test images in total, and BTAD has three categories with 1799 images, where different classes of industry production mean a comprehensive challenge, such as object or texture and whether rotation. Since each category of MVTecAD is divided into nominal-only images and a test set with both nominal and anomalous samples, to create a noisy training set, we sample anomalous images randomly from the test set and mix them with the existing training images. Notice that the original normal number of samples in the training set remains unchanged compared with the noiseless case. In this setting(No overlap), the injected anomalous samples will not be evaluated, which is more likely the case in the real application. We also construct a different setting(Overlap) where the injected anomalous samples are also in the test set to demonstrate the risk that defects with similar appearance will severely exacerbate the performance of an anomaly detector trained with noisy data. Meanwhile, the overlap samples test the outlier detection performance of our algorithm. By controlling the proportion of negative samples being injected into the train set, we obtain several new datasets with different noise ratios dubbed MVTecAD-noise-n𝑛nitalic_n, where n𝑛nitalic_n refers to the ratio of noise. For BTAD, we just use the original fold.

Table 1: Anomaly detection performance on MVTecAD with noise. The results are evaluated on MVTecAD-noise-0.1. Overlap means the injected anomalous images are included in the test set. PaDiM* uses ResNet18 as the backbone. PatchCore-random uses 1% random subsampler instead of the default greedy subsampler. Gap row shows the performance gap between a noisy scene and a normal scene.
Noise=0.1 No overlap | Overlap
Category PaDiM CFLOW PatchCore
SoftPatch-
nearest
SoftPatch-
gaussian
SoftPatch-
lof
PaDiM* PatchCore
PatchCore-
random
SoftPatch-
lof
bottle 0.994 0.998 1.000 1.000 0.997 0.937 1.000 0.692 0.998 1.000
cable 0.873 0.925 0.982 0.935 0.952 0.995 0.680 0.756 0.920 0.994
capsule 0.920 0.947 0.976 0.916 0.662 0.963 0.796 0.783 0.779 0.955
carpet 0.999 0.961 0.996 0.995 0.999 0.991 0.890 0.681 0.973 0.993
grid 0.966 0.891 0.971 0.972 0.997 0.968 0.674 0.526 0.793 0.969
hazelnut 0.956 1.000 0.998 1.000 1.000 1.000 0.543 0.441 0.998 1.000
leather 1.000 1.000 1.000 1.000 1.000 1.000 0.964 0.739 1.000 1.000
metal_nut 0.987 0.959 0.999 0.994 0.997 0.999 0.820 0.765 0.969 1.000
pill 0.918 0.929 0.975 0.921 0.873 0.963 0.722 0.770 0.874 0.955
screw 0.838 0.784 0.966 0.862 0.475 0.960 0.567 0.710 0.462 0.923
tile 0.977 0.991 0.985 0.996 0.997 0.993 0.830 0.716 1.000 0.981
toothbrush 0.927 0.906 0.997 1.000 0.997 0.997 0.700 0.800 0.797 0.994
transistor 0.953 0.896 0.953 1.000 0.992 0.990 0.471 0.491 0.943 0.999
wood 0.991 0.972 0.984 0.984 0.997 0.987 0.831 0.579 0.980 0.986
zipper 0.852 0.928 0.981 0.976 0.979 0.978 0.679 0.792 0.950 0.974
Average 0.943 0.939 0.984 0.970 0.927 0.986 0.740 0.683 0.896 0.982
Gap -0.007 -0.03 -0.008 +0.002 -0.001 0.0 -0.151 -0.309 -0.015 -0.004
Evaluation Metrics.

We report both image-level and pixel-level AUROC for each category in MVTecAD and average them to get the average image/pixel level AUROC. In order to represent noise robustness, the performance gaps between noise-free data and noisy data are also displayed. When not otherwise stated, our method SoftPatch refers to SoftPatch-LOF that uses LOF in Section 3.2.3.

Implementation Details.

We test three SOTA AD algorithms, PatchCore [8], PaDim [21] and CFLOW [9] in noise scene and follow their main settings. In the absence of specific instructions, the backbone of the feature extractor is Wide-ResNet50, and the coreset sampling ratio of PatchCore and SoftPatch is 10%. For MVTecAD images, we only use 256×256256256256\times 256256 × 256 resolution and center crops them into 224×224224224224\times 224224 × 224 along with a normalization. For BTAD, we use 512×512512512512\times 512512 × 512 resolution. We train a separate model for each class. Notice that unlike many methods setting the hyperparameters according to the noise ratio, which is unknowable in reality, we set the threshold τ𝜏\tauitalic_τ in SoftPatch and the LOF-K to constant 0.15 and 6 for all noisy scenarios and classes. The effects of hyperparameters are studied in the ablation study. All our experiments are run on Nvidia V100 GPU and repeated three times to report the average results.

4.2 Anomaly Detection Performance with Noise

Table 2: Anomaly localization performance on MVTecAD with noise. The results are evaluated on MVTecAD-noise-0.1.
Noise=0.1 No overlap | Overlap
Category PaDiM CFLOW PatchCore
SoftPatch-
nearest
SoftPatch-
gaussian
SoftPatch-
lof
PaDiM* PatchCore
PatchCore-
random
SoftPatch-
lof
Average 0.972 0.969 0.956 0.971 0.977 0.979 0.955 0.654 0.951 0.969
Gap -0.007 -0.006 -0.025 -0.008 -0.001 -0.002 -0.013 -0.327 -0.021 -0.012

Experiments on MVTecAD. As indicated in Table 1 and Table 2, when 10% of anomalous samples are added to corrupt the train set, all existing methods have different extents of performance decrease, although not disastrously in No overlap setting. Compared to other methods, the proposed SoftPatch exhibits much stronger robustness against noisy data both in terms of anomaly detection and localization, no matter which noise discriminator is used. Among three variants of SoftPatch, SoftPatch-lof achieves the best overall performance with the highest accuracy and strongest robustness. Interestingly, PaDiM[21], CFLOW[9] and SoftPatch-gaussian show significantly less performance drop than PatchCore, which indicates that modeling feature as Gaussian distribution does help denoising. While modeling feature distribution at each spatial location as a single Gaussian distribution can’t handle misaligned images, such as screw class in MVTecAD, which explains the poor performance on these classes(see screw row). On the other hand, PatchCore’s greedy-sampling strategy is a double-edged sword with higher feature space coverage and higher sensitivity to noise. That’s why using random sampling in PatchCore is more robust with compromised performance(see PatchCore 1%-Random column). SoftPatch-nearest does a slightly better job in the misaligned cases. However, it doesn’t take feature distribution into account, which leads to inferior performance.

Table 3: Anomaly detection performance on BTAD without additional noise. The best results are in bold, and the second-best results are underlined. The last column lists the count of anomaly samples in the test set.
Category SPADE P-SVDD PatchCore PaDiM SoftPatch(ours) Anomaly samples
01 0.914 0.957 1.000 1.000 0.999 50
02 0.714 0.721 0.871 0.871 0.934 200
03 0.999 0.821 0.999 0.971 0.997 41
Mean 0.876 0.833 0.957 0.947 0.977 -

Experiments on BTAD. We also compare SoftPatch with other SOTA methods on another dataset, BTAD. Surprisingly, SoftPatch gives out a new SOTA result, even in the original setting that contains no additional noise (Table 3). By reviewing all the training samples, we find that there are already many noisy samples (usually small scratches) in the training set of category BTAD-02, which is more consistent with our setting and further demonstrates the necessity of our approach. The noisy images are provided in Appendix A.6 (Table 8). Moreover, the BTAD-02 contains more anomaly samples with similar appearance anomalies. In the category of BTAD-02, our method attains significant improvement compared to others. SoftPatch can also maintain the leading performance if the noise is added artificially(Appendix A.8).

Refer to caption
Figure 3: The comparison of anomaly detection performance under noisy training. no overlap means the injected anomalous images are removed from test set while overlap are not.

Performance trends. In order to explore how different methods behave with the increasing noise level, experiments are further performed on MVTecAD-noise-{00.15}similar-to00.15\{0\sim 0.15\}{ 0 ∼ 0.15 }. The results of the proposed methods are provided in Figure 3. Under the No overlap setting, as the noise ratio increases, PatchCore shows a pixel-level AUROC drop up to 3.7%. The performance decreases as the noise ratio rises. On the contrary, although the default performance is slightly poorer than PatchCore(about 0.006 and 0 decrease in image-level and pixel-level AUROC), the proposed SoftPatch-lof deteriorates much slower, which demonstrates better denoising ability. As for SoftPatch-nearest and SoftPatch-gaussian, they are also more robust, however, with worse base performance(see Figure 3 at noise ratio=0). The visualization of the coreset in Figure 5 also shows that random sampling avoids sampling the outlier but can not model normal adequately. Being consistent with the discussion above, under the Overlap setting, PatchCore’s performance is getting worse and worse catastrophically(up to 40% AUROC drop in both image and pixel level) as more noises are added. This is expectable since PatchCore uses a greedy strategy for coreset sampling, which favors outliers in feature space. SoftPatch-lof consistently outperforms other methods with no significant performance drop as the noise level goes up. Appendix A.3 (Figure 6 and 7) shows more comparison with others. The experimental results indicate that the risk is hidden by the fact that defects in MVTecAD have very different appearances. In this case, even if some anomalous features are added mistakenly to the coreset, they are unlikely to be retrieved during test time. However, the risk still exists and will be triggered when similar defects show up at test time.

More experiments can be found in appendix, such as the comparison of image-level and patch-level denoising(Appendix A.4), computational analysis(Appendix A.5) and an augmented overlap setting(Appendix A.7).

Table 4: The ablation study of soft weight. The performance scores are Image/pixel-level AUROC on MVTecAD.
No overlap | Overlap
Noise discriminator Soft weight Image level Pixel level Image level Pixel level
None 0.985 0.946 0.685 0.693
Gaussian 0.927 0.977 0.925 0.961
Gaussian 0.922 0.974 0.924 0.965
Nearest 0.970 0.971 0.966 0.944
Nearest 0.972 0.978 0.968 0.958
LOF 0.985 0.984 0.984 0.963
LOF 0.986 0.979 0.982 0.969

4.3 Ablation Study

4.3.1 Effectiveness of the Proposed Modules

We validated the effectiveness of two proposed modules noise discriminator and soft weight by removing them from the pipeline. As shown in Table 4, the noise discriminator significantly improves the noise robustness in terms of pixel-level AUROC. Among three decision choices of noise discriminator, LOF achieved the best balance between robustness and capacity, resulting in the most performance boost under all settings. We further analyzed the intermediate results by visualizing the sampled coreset of different methods, which shows that SoftPatch-LOF sampled much fewer anomalous features than the baseline(see Figure 5). Soft weight is used alongside the noise discriminator to further improve the final results. We only observed minor improvement for using Soft weight in SoftPatch-Nearest. We suspect that the other two kinds of noise discriminators are already robust against noise data.

Table 5: Image/pixel-level AUROC result for different LOF-K on two settings.
K 3 4 5 6 7 8 9
Overlap 0.983/0.955 0.982/0.951 0.983/0.959 0.982/0.975 0.981/0.973 0.982/0.968 0.980/0.968
No overlap 0.985/0.972 0.985/0.975 0.984/0.977 0.984/0.982 0.985/0.980 0.984/0.983 0.981/0.982
Refer to caption
Figure 4: Performance trend with the threshold τ𝜏\tauitalic_τ in SoftPatch-LOF. The results are evaluated on MVTecAD-noise-0.1.

4.3.2 Parameter Selection

To explore the impact of two parameters (LOF-k and threshold τ𝜏\tauitalic_τ) on the final performance, we perform parameters searching on our method. As in Table 5, our method achieves better performance when LOF-k is greater than 5, which suggests that our method is not sensitive to LOF-k, as long as it is not too small or too large. If LOF-k is too small, it fails to estimate the local density accurately because too few neighbors are considered. On the contrary, a large LOF-k may lead to undesirable cross-clusters connection that can not capture real data distribution.

Threshold τ𝜏\tauitalic_τ refers to the ratio of eliminated patch features when building coreset. Figure 4 indicates an increasing trend of AUROC as threshold τ𝜏\tauitalic_τ increases under Overlap setting, which is expected since a higher threshold means a more aggressive denoising strategy. In Overlap setting, the mistakenly sampled features are the direct reason for the drastic performance drop. Therefore more aggressive denoising improves the result significantly. However, In No Overlap setting, the effect of the noisy feature is less prominent. Although the best LOF-k and threshold τ𝜏\tauitalic_τ are changed according to the class and noise level, we simply use fixed values, 6 and 0.15, in all situations.

5 Conclusions

This paper emphasizes the practical value of investigating noisy data problems in unsupervised AD. Introducing a novel noisy setting on the previous task, we test the performance of existing methods and SoftPatch. For existing methods, despite no adaptation to noisy settings, some of them have a slight performance decrease in some scenes. However, the performance decrease could be more significant and catastrophic for other methods or in other scenes. For the proposed SoftPatch, it shows consistent performance in all noise settings, which outperforms other methods.

Industrial inspection systems are an important computer vision application that requires good robustness. The noise injected into the training set break with the naive assumption that the training samples were normal. Noise also gives the model early exposure to the distribution of anomalies. The unsupervised AD with noisy data needs more research in the future.

Acknowledgement

This work is supported by the National Natural Science Foundation of China under Grant No. 61972188, 62122035 and 62206122.

References

  • Yang et al. [2021] Jingkang Yang, Kaiyang Zhou, Yixuan Li, and Ziwei Liu. Generalized out-of-distribution detection: A survey. arXiv preprint arXiv:2110.11334, 2021.
  • Tao et al. [2022] Xian Tao, Xinyi Gong, Xin Zhang, Shaohua Yan, and Chandranath Adak. Deep learning for unsupervised anomaly localization in industrial images: A survey. IEEE Transactions on Instrumentation and Measurement, 2022.
  • Yang et al. [2022a] Minghui Yang, Peng Wu, Jing Liu, and Hui Feng. Memseg: A semi-supervised method for image surface defect detection using differences and commonalities. arXiv preprint arXiv:2205.00908, 2022a.
  • Zavrtanik et al. [2022] Vitjan Zavrtanik, Matej Kristan, and Danijel Skocaj. Dsr - a dual subspace re-projection network for surface anomaly detection. CoRR, abs/2208.01521, 2022.
  • Ding et al. [2022] Choubo Ding, Guansong Pang, and Chunhua Shen. Catching both gray and black swans: Open-set supervised anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7388–7398, 2022.
  • Ristea et al. [2022] Nicolae-Cătălin Ristea, Neelu Madan, Radu Tudor Ionescu, Kamal Nasrollahi, Fahad Shahbaz Khan, Thomas B Moeslund, and Mubarak Shah. Self-supervised predictive convolutional attentive block for anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13576–13586, 2022.
  • Yan et al. [2021] Xudong Yan, Huaidong Zhang, Xuemiao Xu, Xiaowei Hu, and Pheng-Ann Heng. Learning semantic context from normal samples for unsupervised anomaly detection. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 3110–3118, 2021.
  • Roth et al. [2021] Karsten Roth, Latha Pemula, Joaquin Zepeda, Bernhard Schölkopf, Thomas Brox, and Peter Gehler. Towards total recall in industrial anomaly detection. arXiv preprint arXiv:2106.08265, 2021.
  • Gudovskiy et al. [2022] Denis Gudovskiy, Shun Ishizaka, and Kazuki Kozuka. Cflow-ad: Real-time unsupervised anomaly detection with localization via conditional normalizing flows. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 98–107, 2022.
  • Zheng et al. [2022] Ye Zheng, Xiang Wang, Rui Deng, Tianpeng Bao, Rui Zhao, and Liwei Wu. Focus your distribution: Coarse-to-fine non-contrastive learning for anomaly detection and localization. In 2022 IEEE International Conference on Multimedia and Expo (ICME), pages 1–6. IEEE, 2022.
  • Bergmann et al. [2019] Paul Bergmann, Michael Fauser, David Sattlegger, and Carsten Steger. Mvtec ad–a comprehensive real-world dataset for unsupervised anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9592–9600, 2019.
  • Mishra et al. [2021] Pankaj Mishra, Riccardo Verk, Daniele Fornasier, Claudio Piciarelli, and Gian Luca Foresti. Vt-adl: A vision transformer network for image anomaly detection and localization. In 2021 IEEE 30th International Symposium on Industrial Electronics (ISIE), pages 01–06. IEEE, 2021.
  • Sheynin et al. [2021] Shelly Sheynin, Sagie Benaim, and Lior Wolf. A hierarchical transformation-discriminating generative model for few shot anomaly detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 8495–8504, October 2021.
  • Li et al. [2021] Chun-Liang Li, Kihyuk Sohn, Jinsung Yoon, and Tomas Pfister. Cutpaste: Self-supervised learning for anomaly detection and localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9664–9674, 2021.
  • Zavrtanik et al. [2021] Vitjan Zavrtanik, Matej Kristan, and Danijel Skočaj. Draem-a discriminatively trained reconstruction embedding for surface anomaly detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8330–8339, 2021.
  • Bergmann et al. [2020] Paul Bergmann, Michael Fauser, David Sattlegger, and Carsten Steger. Uninformed students: Student-teacher anomaly detection with discriminative latent embeddings. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4183–4192, 2020.
  • Salehi et al. [2021] Mohammadreza Salehi, Niousha Sadjadi, Soroosh Baselizadeh, Mohammad H Rohban, and Hamid R Rabiee. Multiresolution knowledge distillation for anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14902–14912, 2021.
  • Deng and Li [2022] Hanqiu Deng and Xingyu Li. Anomaly detection via reverse distillation from one-class embedding. arXiv preprint arXiv:2201.10703, 2022.
  • Zhou et al. [2020] Kang Zhou, Yuting Xiao, Jianlong Yang, Jun Cheng, Wen Liu, Weixin Luo, Zaiwang Gu, Jiang Liu, and Shenghua Gao. Encoding structure-texture relation with p-net for anomaly detection in retinal images. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XX 16, pages 360–377. Springer, 2020.
  • Liang et al. [2022] Yufei Liang, Jiangning Zhang, Shiwei Zhao, Runze Wu, Yong Liu, and Shuwen Pan. Omni-frequency channel-selection representations for unsupervised anomaly detection. arXiv preprint arXiv:2203.00259, 2022.
  • Defard et al. [2021] Thomas Defard, Aleksandr Setkov, Angelique Loesch, and Romaric Audigier. Padim: a patch distribution modeling framework for anomaly detection and localization. In International Conference on Pattern Recognition, pages 475–489. Springer, 2021.
  • Yi and Yoon [2020] Jihun Yi and Sungroh Yoon. Patch svdd: Patch-level svdd for anomaly detection and segmentation. In Proceedings of the Asian Conference on Computer Vision, 2020.
  • Rudolph et al. [2021] Marco Rudolph, Bastian Wandt, and Bodo Rosenhahn. Same same but differnet: Semi-supervised defect detection with normalizing flows. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 1907–1916, 2021.
  • Reiss et al. [2021] Tal Reiss, Niv Cohen, Liron Bergman, and Yedid Hoshen. Panda: Adapting pretrained features for anomaly detection and segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2806–2814, 2021.
  • Lee et al. [2022] Sungwook Lee, Seunghyun Lee, and Byung Cheol Song. Cfa: Coupled-hypersphere-based feature adaptation for target-oriented anomaly localization. arXiv preprint arXiv:2206.04325, 2022.
  • Hou et al. [2021] Jinlei Hou, Yingying Zhang, Qiaoyong Zhong, Di Xie, Shiliang Pu, and Hong Zhou. Divide-and-assemble: Learning block-wise memory for unsupervised anomaly detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8791–8800, 2021.
  • Hu et al. [2021] Zijian Hu, Zhengyu Yang, Xuefeng Hu, and Ram Nevatia. Simple: Similar pseudo label exploitation for semi-supervised classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 15099–15108, June 2021.
  • Sohn et al. [2020] Kihyuk Sohn, David Berthelot, Nicholas Carlini, Zizhao Zhang, Han Zhang, Colin A Raffel, Ekin Dogus Cubuk, Alexey Kurakin, and Chun-Liang Li. Fixmatch: Simplifying semi-supervised learning with consistency and confidence. Advances in Neural Information Processing Systems, 33:596–608, 2020.
  • Li et al. [2020] Junnan Li, Richard Socher, and Steven CH Hoi. Dividemix: Learning with noisy labels as semi-supervised learning. arXiv preprint arXiv:2002.07394, 2020.
  • Kong et al. [2021] Shuming Kong, Yanyan Shen, and Linpeng Huang. Resolving training biases via influence-based data relabeling. In International Conference on Learning Representations, 2021.
  • Xu et al. [2021] Mengde Xu, Zheng Zhang, Han Hu, Jianfeng Wang, Lijuan Wang, Fangyun Wei, Xiang Bai, and Zicheng Liu. End-to-end semi-supervised object detection with soft teacher. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3060–3069, 2021.
  • Liu et al. [2021a] Yen-Cheng Liu, Chih-Yao Ma, Zijian He, Chia-Wen Kuo, Kan Chen, Peizhao Zhang, Bichen Wu, Zsolt Kira, and Peter Vajda. Unbiased teacher for semi-supervised object detection. arXiv preprint arXiv:2102.09480, 2021a.
  • Yang et al. [2022b] Fan Yang, Kai Wu, Shuyi Zhang, Guannan Jiang, Yong Liu, Feng Zheng, Wei Zhang, Chengjie Wang, and Long Zeng. Class-aware contrastive semi-supervised learning. arXiv preprint arXiv:2203.02261, 2022b.
  • Han et al. [2022] Songqiao Han, Xiyang Hu, Hailiang Huang, Mingqi Jiang, and Yue Zhao. Adbench: Anomaly detection benchmark. arXiv preprint arXiv:2206.09426, 2022.
  • Pang et al. [2020] Guansong Pang, Cheng Yan, Chunhua Shen, Anton van den Hengel, and Xiao Bai. Self-trained deep ordinal regression for end-to-end video anomaly detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12173–12182, 2020.
  • Liu et al. [2021b] Boyang Liu, Ding Wang, Kaixiang Lin, Pang-Ning Tan, and Jiayu Zhou. Rca: A deep collaborative autoencoder approach for anomaly detection. In IJCAI, pages 1505–1511, 2021b.
  • Zhou and Paffenroth [2017] Chong Zhou and Randy C Paffenroth. Anomaly detection with robust deep autoencoders. In Proceedings of the 23rd ACM SIGKDD international conference on knowledge discovery and data mining, pages 665–674, 2017.
  • Wu et al. [2022] Shuang Wu, Jingyu Zhao, and GuangJian Tian. Understanding and mitigating data contamination in deep anomaly detection: A kernel-based approach. In IJCAI, pages 2319–2325, 2022.
  • Yoon et al. [2021] Jinsung Yoon, Kihyuk Sohn, Chun-Liang Li, Sercan O Arik, Chen-Yu Lee, and Tomas Pfister. Self-supervise, refine, repeat: Improving unsupervised anomaly detection. 2021.
  • Qiu et al. [2022] Chen Qiu, Aodong Li, Marius Kloft, Maja Rudolph, and Stephan Mandt. Latent outlier exposure for anomaly detection with contaminated data. arXiv preprint arXiv:2202.08088, 2022.
  • Cordier et al. [2022] Antoine Cordier, Benjamin Missaoui, and Pierre Gutierrez. Data refinement for fully unsupervised visual inspection using pre-trained networks. arXiv preprint arXiv:2202.12759, 2022.
  • Chen et al. [2022] Yuanhong Chen, Yu Tian, Guansong Pang, and Gustavo Carneiro. Deep one-class classification via interpolated gaussian descriptor. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 383–392, 2022.
  • Peterson [2009] Leif E Peterson. K-nearest neighbor. Scholarpedia, 4(2):1883, 2009.
  • Breunig et al. [2000] Markus M Breunig, Hans-Peter Kriegel, Raymond T Ng, and Jörg Sander. Lof: identifying density-based local outliers. In Proceedings of the 2000 ACM SIGMOD international conference on Management of data, pages 93–104, 2000.

Appendix A Appendix

A.1 Visualization of Coreset

Figure 5 shows the visualization of coreset in the memory bank. PatchCore reserve too many noisy features, which are obviously outliers. Though replacing the greedy sampling with random sampling, PatchCore avoids most noisy features but is poor at model training set and still misled by some noise. The coreset of SoftPatch is clean and decentralized. Our coreset saves some features from noisy samples because we believe that abnormal images also contain a large number of normal patches. So the features conforming to the normal distribution are reserved to enhance the model perception.

Refer to caption
Figure 5: Comparison between corsets of AD methods with same noisy train set, MVTecAD-Pill with noise-0.1. We use t-SNE for dimension reduction for visualization. The yellow dots represent patch features from noisy sample, while the purple dots are nominal. Compared with the other two, SoftPatch wipe off the noisy patch and model the nominal data properly.

A.2 Details of Experimental Results

Table 6: Anomaly localization performance details of all classes. The results are evaluated on MVTecAD-noise-0.1.
Noise=0.1 No overlap | Overlap
Category PaDiM CFLOW PatchCore
SoftPatch-
nearest
SoftPatch-
gaussian
SoftPatch-
lof
PaDiM* PatchCore
PatchCore-
random
SoftPatch-
lof
bottle 0.986 0.984 0.987 0.987 0.986 0.987 0.981 0.714 0.979 0.975
cable 0.916 0.958 0.843 0.915 0.981 0.983 0.946 0.670 0.969 0.971
capsule 0.986 0.985 0.986 0.988 0.977 0.990 0.984 0.883 0.984 0.989
carpet 0.992 0.989 0.992 0.992 0.993 0.992 0.980 0.765 0.951 0.989
grid 0.974 0.947 0.991 0.990 0.989 0.990 0.879 0.482 0.882 0.974
hazelnut 0.987 0.991 0.990 0.990 0.991 0.990 0.978 0.418 0.957 0.924
leather 0.994 0.994 0.991 0.994 0.994 0.993 0.992 0.683 0.987 0.993
metal_nut 0.933 0.956 0.842 0.894 0.964 0.984 0.911 0.779 0.938 0.983
pill 0.956 0.983 0.971 0.974 0.972 0.981 0.960 0.608 0.971 0.976
screw 0.989 0.977 0.995 0.991 0.969 0.994 0.974 0.745 0.953 0.969
tile 0.956 0.953 0.953 0.960 0.962 0.954 0.921 0.700 0.919 0.954
toothbrush 0.991 0.988 0.989 0.988 0.988 0.985 0.954 0.692 0.984 0.985
transistor 0.960 0.887 0.847 0.965 0.954 0.942 0.939 0.317 0.914 0.936
wood 0.973 0.964 0.969 0.947 0.946 0.939 0.946 0.522 0.896 0.929
zipper 0.986 0.978 0.986 0.989 0.988 0.988 0.978 0.823 0.975 0.986
Average 0.972 0.969 0.956 0.971 0.977 0.979 0.955 0.654 0.951 0.969
Gap -0.007 -0.006 -0.025 -0.008 -0.001 -0.002 -0.013 -0.327 -0.021 -0.012

A.3 Performance Trends in Noise

Figure  6 and  7 show the performance trends of SOTA AD methods and SoftPatch in different noisy scenes. Since overconfident in the training data and the greedy subsampling algorithm, PatchCore performance decreases most obviously with the noise increase. In contrast, CFLOW and PaDiM are also affected by noise, but the amplitudes are smaller. SoftPatch maintains a consistent level of performance at all noise levels. Unfortunately, SoftPatch is slightly weaker than PatchCore in noiseless scenes, which may be due to the excessively conservative threshold setting.

Refer to caption
Figure 6: Performance in different level of no overlap noise.
Refer to caption
Figure 7: Performance in different level of overlap noise.

A.4 Image-level Denoising V.S. Patch-level Denoising

A simple strategy to eliminate the noisy data is to delete the anomaly samples before training, which is an unsupervised outlier detection task. However, the existing outlier detection methods do not work well because the distance between abnormal and normal images is much smaller than the distance between different classes. Meanwhile, we found that some AD methods could also detect outliers in the training set. Following  [41], we apply PaDiM* as the image-level denoising method which give consideration to the costs and effects. PaDiM* is a simplified version of PaDiM, which uses ResNet18 as the backbone with faster computing speed. PaDiM* scores all training samples and then removes the pieces with high outliers based on the threshold. The comparison in Table  7 and Table  8 show that image-level denoising dramatically improves the performance of existing SOTA AD methods in the noisy scene. But there is still a gap when compared with SoftPatch.

Table 7: The anomaly detection performance of image-level denoising and patch-level denoising. The PaDiM*+PaDiM*, PaDiM*+CFLOW, and PaDiM*+PatchCore are AD methods with image-level denoising. PaDiM* is used in image level denoising, where we use the same threshold (0.15) as it in SoftPatch. And we also tried the tricky threshold-0.1 as the noise ratio, but it works worse. The results are evaluated on MVTecAD-noise-0.1 with overlap.
Category PaDiM*
PaDiM*+
PaDiM*
CFLOW
PaDiM*+
CFLOW
PatchCore
PaDiM*(threshold
-0.1)+PatchCore
PaDiM*+
PatchCore
SoftPatch-lof
bottle 0.937 0.994 1.000 1.000 0.692 0.984 1.000 1.000
cable 0.680 0.741 0.916 0.841 0.756 0.890 0.888 0.994
capsule 0.796 0.854 0.945 0.939 0.783 0.892 0.909 0.955
carpet 0.890 0.937 0.960 0.950 0.681 0.963 0.974 0.993
grid 0.674 0.765 0.799 0.830 0.526 0.850 0.870 0.969
hazelnut 0.543 0.725 0.999 0.990 0.441 0.871 0.929 1.000
leather 0.964 0.979 0.996 1.000 0.739 0.957 0.989 1.000
metal_nut 0.820 0.949 0.957 0.986 0.765 0.965 0.977 1.000
pill 0.722 0.745 0.897 0.924 0.770 0.898 0.913 0.955
screw 0.567 0.542 0.570 0.639 0.710 0.916 0.907 0.923
tile 0.830 0.906 0.980 0.981 0.716 0.939 0.957 0.981
toothbrush 0.700 0.869 0.878 0.928 0.800 0.981 0.997 0.994
transistor 0.471 0.770 0.872 0.788 0.491 0.777 0.825 0.999
wood 0.831 0.966 0.954 0.970 0.579 0.943 0.976 0.986
zipper 0.679 0.678 0.931 0.873 0.792 0.909 0.914 0.974
Average 0.740 0.828 0.910 0.909 0.683 0.916 0.935 0.982
Table 8: The anomaly localization performance of image-level denoising and patch-level denoising.
Category PaDiM*
PaDiM*+
PaDiM*
CFLOW
PaDiM*+
CFLOW
PatchCore
PaDiM*(threshold
-0.1)+PatchCore
PaDiM*+
PatchCore
SoftPatch-lof
bottle 0.981 0.983 0.984 0.986 0.714 0.984 0.985 0.975
cable 0.946 0.954 0.950 0.956 0.670 0.738 0.739 0.971
capsule 0.984 0.982 0.986 0.985 0.883 0.851 0.876 0.989
carpet 0.980 0.984 0.986 0.988 0.765 0.960 0.988 0.989
grid 0.879 0.876 0.961 0.948 0.482 0.797 0.818 0.974
hazelnut 0.978 0.977 0.982 0.987 0.418 0.798 0.825 0.924
leather 0.992 0.993 0.993 0.995 0.683 0.966 0.979 0.993
metal_nut 0.911 0.968 0.960 0.984 0.779 0.784 0.834 0.983
pill 0.960 0.956 0.976 0.984 0.608 0.706 0.713 0.976
screw 0.974 0.968 0.973 0.970 0.745 0.887 0.889 0.969
tile 0.921 0.927 0.945 0.946 0.700 0.924 0.968 0.954
toothbrush 0.954 0.986 0.984 0.983 0.692 0.977 0.986 0.985
transistor 0.939 0.965 0.834 0.908 0.317 0.932 0.945 0.936
wood 0.946 0.947 0.934 0.943 0.522 0.800 0.918 0.929
zipper 0.978 0.973 0.979 0.967 0.823 0.875 0.878 0.986
Average 0.955 0.963 0.962 0.969 0.654 0.865 0.889 0.969

A.5 Computational Analysis

SoftPatch does not require more runtime than PatchCore, according to theoretical analysis. The complexity of the greedy sampling process in PatchCore is 𝒪(N2h2w2)𝒪superscript𝑁2superscript2superscript𝑤2\mathcal{O}(N^{2}h^{2}w^{2})caligraphic_O ( italic_N start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_h start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_w start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ), which is most expensive part. The complexity of the noise discrimination process in SoftPatch-LOF is 𝒪(N2hw)𝒪superscript𝑁2𝑤\mathcal{O}(N^{2}hw)caligraphic_O ( italic_N start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_h italic_w ), since features are grouped before. So the computational complexity of SoftPatch is equal PatchCore by 𝒪(N2hw+N2h2w2)=𝒪(N2h2w2)𝒪superscript𝑁2𝑤superscript𝑁2superscript2superscript𝑤2𝒪superscript𝑁2superscript2superscript𝑤2\mathcal{O}(N^{2}hw+N^{2}h^{2}w^{2})=\mathcal{O}(N^{2}h^{2}w^{2})caligraphic_O ( italic_N start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_h italic_w + italic_N start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_h start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_w start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ) = caligraphic_O ( italic_N start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_h start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_w start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ). In fact, SoftPatch will be faster because it removes a part of the patch as noise.

Excluding the loading time of data, the comparison of the remaining time overhead between SoftPatch and PatchCore is shown in Figure 9. The GPU used in this experiment is RTX TITAN 24G. Both spend almost the same amount of time training and testing, which means that our patch-level denoising does not bring unacceptable overhead. On the contrary, the image-level denoising dramatically increases training time.

Table 9: Mean training and inference time per category on MVTecAD. The unit of time is second.
Training time Inference time
SoftPatch-LOF 21.2958 15.6146
PatchCore 21.3869 15.8763
PaDiM*+PatchCore 74.2912 15.5386

A.6 The Noise in Existing Datasets

Although existing research datasets are well organized, some abnormal samples are misclassified. Fig. 8 show anomaly samples in normal set in two wide-used datasets. In the actual production data, the noise interference will be more serious.

Refer to caption
Figure 8: Noisy examples in (a) MVTecAD dataset and (b) BTAD dataset.

A.7 Performance in Augmented Overlap Setting

We do another experiment where the overlap images are augmented in the train set to make them different from the images in the test set. We experiment with varying degrees of appearance and structural augmentation. The result in Table  10 shows that our method still presents better robustness when the overlap samples have been transformed, though the performance of PatchCore is improved.

Table 10: Performance on MVTecAD in augmented overlap setting.
Setting Overlap with gaussian noise Overlap with noise and blur Overlap with rotation Overlap with affine transformation
Method PatchCore / Ours PatchCore / Ours PatchCore / Ours PatchCore / Ours
Detection 0.760 / 0.984 0.848 / 0.984 0.950 / 0.984 0.933 / 0.984
Localization 0.790 / 0.969 0.864 / 0.970 0.924 / 0.978 0.915 / 0.978

A.8 Performance on BTAD with noise

The performance comparisons are provided in table  11 and  12. Since the anomaly samples in category BTAD-03 are not enough to meet the requirement of the number of noise samples, we experience the other two.

Table 11: Anomaly detection performance on BTAD-noise-0.1.
Noise = 0.1 No overlap | Overlap
Category PatchCore SoftPatch-LOF PatchCore SoftPatch-LOF
01 1.000 1.000 0.522 1.000
02 0.860 0.922 0.738 0.912
Mean 0.930 0.961 0.630 0.956
Table 12: Anomaly localization performance on BTAD-noise-0.1.
Noise = 0.1 No overlap | Overlap
Category PatchCore SoftPatch-LOF PatchCore SoftPatch-LOF
01 0.982 0.999 0.319 0.815
02 0.949 0.953 0.754 0.936
Mean 0.966 0.976 0.536 0.875