License: arXiv.org perpetual non-exclusive license
arXiv:2312.05767v2 [cs.CV] 22 Feb 2024

AnomalyDiffusion: Few-Shot Anomaly Image Generation with Diffusion Model

Teng Hu¹, Jiangning Zhang², Ran Yi¹, Yuzhen Du¹, Xu Chen², Liang Liu², Yabiao Wang², Chengjie Wang¹,²
¹Shanghai Jiao Tong University   ²Youtu Lab, Tencent
{hu-teng, ranyi, Haaaaaaaaaa}@sjtu.edu.cn;
{vtzhang, cxxuchen, leoneliu, caseywang, jasoncjwang}@tencent.com
Equal contributions. Corresponding author.
Abstract

Anomaly inspection plays an important role in industrial manufacturing. Existing anomaly inspection methods are limited in their performance due to insufficient anomaly data. Although anomaly generation methods have been proposed to augment the anomaly data, they suffer from either poor generation authenticity or inaccurate alignment between the generated anomalies and masks. To address these problems, we propose AnomalyDiffusion, a novel diffusion-based few-shot anomaly generation model, which uses the strong prior information of a latent diffusion model learned from a large-scale dataset to enhance generation authenticity under few-shot training data. First, we propose Spatial Anomaly Embedding, which consists of a learnable anomaly embedding and a spatial embedding encoded from an anomaly mask, disentangling the anomaly information into anomaly appearance and location information. Second, to improve the alignment between the generated anomalies and the anomaly masks, we introduce a novel Adaptive Attention Re-weighting Mechanism. Based on the disparities between the generated anomaly image and the normal sample, it dynamically guides the model to focus more on the areas with less noticeable generated anomalies, enabling the generation of accurately matched anomalous image-mask pairs. Extensive experiments demonstrate that our model significantly outperforms state-of-the-art methods in generation authenticity and diversity, and effectively improves the performance of downstream anomaly inspection tasks. The code and data are available at https://github.com/sjtuplayer/anomalydiffusion.

1 Introduction

Figure 1: Top: Our model generates extensive anomaly data, which supports the downstream Anomaly Detection (AD), Localization (AL) and Classification (AC) tasks, while previous methods mainly rely on unsupervised learning or few-shot supervised learning due to the limited anomaly data; Bottom: Generated anomaly results on hazelnut-crack and capsule-squeeze of our model and existing anomaly generation methods, where our results are the most authentic.

In recent years, industrial anomaly inspection algorithms, i.e., anomaly detection, localization, and classification, have played a crucial role in industrial manufacturing (Duan et al. 2023). However, in real-world industrial production, anomaly samples are scarce, posing a significant challenge for anomaly inspection (Fig. 1-top). To mitigate the scarcity of anomaly data, existing anomaly inspection methods mostly rely on unsupervised learning that uses only normal samples (Zavrtanik, Kristan, and Skočaj 2021; Li et al. 2021), or on few-shot supervised learning (Zhang et al. 2023a). Although these methods perform well in anomaly detection, they have limited performance in anomaly localization and cannot handle anomaly classification.

To cope with the problem of scarce anomaly samples, researchers have proposed anomaly generation methods to supplement the anomaly data, which can be divided into two types: 1) The model-free methods randomly crop and paste patches from existing anomalies or from an anomaly texture dataset onto normal samples (Li et al. 2021; Lin et al. 2021; Zavrtanik, Kristan, and Skočaj 2021). But such methods exhibit poor authenticity in the synthesized data (Fig. 1-bottom-a/b). 2) The GAN-based methods (Zhang et al. 2021; Niu et al. 2020; Duan et al. 2023) utilize Generative Adversarial Networks (GANs) (Goodfellow et al. 2014) to generate anomalies, but most of them require a large number of anomaly samples for training. The only few-shot generation model, DFMGAN (Duan et al. 2023), employs a StyleGAN2 (Karras et al. 2020) pretrained on normal samples, and then performs domain adaptation with a few anomaly samples. But the generated anomalies are not accurately aligned with the anomaly masks (Fig. 1-bottom-c). To sum up, when learning from few-shot anomaly data, the existing anomaly generation methods fail to generate either authentic anomalies or accurately aligned anomalous image-mask pairs, which limits their benefit to the downstream anomaly inspection tasks.

To address the above issues, we propose AnomalyDiffusion, a novel anomaly generation method based on the diffusion model, which generates anomalies on the input normal samples within the given anomaly masks. By leveraging the strong prior information of a pretrained LDM (Rombach et al. 2022) learned from a large-scale dataset (Schuhmann et al. 2021), we can extract a better anomaly representation using only a few anomaly images and boost the generation authenticity and diversity. To generate anomalies with a specified type and location, we propose Spatial Anomaly Embedding, which disentangles anomaly information into an anomaly embedding (a learned textual embedding to represent the appearance type of the anomaly) and a spatial embedding (encoded from an anomaly mask to indicate the locations). By disentangling anomaly location from appearance, we can generate anomalies in any desired position, which enables producing a large number of anomalous image-mask pairs for the downstream tasks. Moreover, we propose an Adaptive Attention Re-weighting Mechanism to allocate more attention to the areas with less noticeable generated anomalies, which dynamically adjusts the cross-attention maps based on disparities between the generated images and the input normal samples during the diffusion inference stage. This adaptive mechanism results in accurately aligned generated anomaly images and anomaly masks, which greatly facilitates downstream anomaly localization tasks.

Extensive qualitative and quantitative experiments and comparisons demonstrate that our AnomalyDiffusion outperforms state-of-the-art anomaly generation models in terms of generation authenticity and diversity. Moreover, our generated anomaly images can be effectively applied to downstream anomaly inspection tasks, yielding a pixel-level 99.1% AUROC and 81.4% AP score in anomaly localization on MVTec (Bergmann et al. 2019). The main contributions of this paper can be summarized as follows:

  • We propose AnomalyDiffusion, a few-shot diffusion-based anomaly generation method, which disentangles anomalies into anomaly embedding (for anomaly appearance) and spatial embedding (for anomaly location), and generates authentic and diverse anomaly images.

  • We design Adaptive Attention Re-weighting Mechanism, which adaptively allocates more attention to the areas with less noticeable generated anomalies, improving the alignment between the generated anomalies and masks.

  • Extensive experiments demonstrate the superiority of our model over the state-of-the-art competitors, and our generated anomaly data effectively improves the performance of downstream anomaly inspection tasks. The generated data will be released to facilitate future research.

2 Related Work

2.1 Generative Models

Generative models. VAEs (Kingma and Welling 2013) and GANs (Goodfellow et al. 2014) have achieved great progress in image generation. Recently, diffusion models (Nichol and Dhariwal 2021) have shown even greater potential for generating images across a wide range of domains. The Latent Diffusion Model (LDM) (Rombach et al. 2022) further improves generation ability by compressing the diffusion space, and obtains strong prior information by training on the LAION dataset (Schuhmann et al. 2021).

Few-shot image generation. Few-shot image generation aims to generate diverse images with limited training data. Early methods propose modifying network weights (Mo, Cho, and Shin 2020), using various regularization techniques (Li et al. 2020) and data augmentation (Tran et al. 2021) to prevent overfitting. To deal with extremely limited data (fewer than 10 images), recent works (Ojha et al. 2021; Wang et al. 2022; Hu et al. 2023) introduce cross-domain consistency losses to preserve the generated distribution. Textual Inversion (Gal et al. 2022) and Dreambooth (Ruiz et al. 2023) encode a few images into the textual space of a pre-trained LDM, but cannot control the generated locations accurately.

2.2 Anomaly Inspection

Anomaly inspection. The anomaly inspection task consists of anomaly detection, localization and classification. Some existing methods (Schlegl et al. 2017, 2019; Liang et al. 2023) rely on image reconstruction, comparing the differences between reconstructed images and anomaly images to achieve anomaly detection and localization. Moreover, deep feature modeling-based methods (Lee, Lee, and Song 2022; Cao et al. 2022; Roth et al. 2022; Gu et al. 2023; Wang et al. 2023) build a feature space for input images and then compare the differences between features to detect and localize anomalies. Additionally, some supervised learning-based methods (Zhang et al. 2023a) utilize a small number of anomaly samples to enhance the anomaly localization capabilities. Some studies conduct zero-/few-shot AD, using no anomaly samples or only a small number of them (Jeong et al. 2023; Cao et al. 2023; Chen, Han, and Zhang 2023; Chen et al. 2023; Zhang et al. 2023b; Huang et al. 2022). Although these methods have shown promising results in anomaly detection, their performance in anomaly localization is still limited due to the lack of anomaly data.

Figure 2: Overall framework of our AnomalyDiffusion: 1) The Spatial Anomaly Embedding $e$, consisting of an anomaly embedding $e_a$ (a learned textual embedding to represent anomaly appearance type) and a spatial embedding $e_s$ (encoded from an input anomaly mask $m$ to indicate anomaly locations), serves as the text condition to guide the anomaly generation process; 2) The Adaptive Attention Re-weighting Mechanism computes the weight map $w_m$ based on the difference between the denoised image $\hat{x}_0$ and the input normal sample $y$, and adaptively reweights the cross-attention map $m_c$ by the weight map $w_m$ to help the model focus more on the less noticeable anomaly areas during the denoising process.

Anomaly generation. The scarcity of anomaly data has sparked research interest in anomaly generation. DRAEM (Zavrtanik, Kristan, and Skočaj 2021), Cut-Paste (Li et al. 2021), Crop-Paste (Lin et al. 2021) and PRN (Zhang et al. 2023a) crop and paste unrelated textures or existing anomalies onto normal samples. But they either generate less realistic anomalies or offer limited diversity. The GAN-based models SDGAN (Niu et al. 2020) and Defect-GAN (Zhang et al. 2021) generate anomalies on normal samples by learning from anomaly data. But they require a large amount of anomaly data and cannot generate anomaly masks. DFMGAN (Duan et al. 2023) transfers a StyleGAN2 (Karras et al. 2020) pretrained on normal samples to the anomaly domain, but lacks generation authenticity and accurate alignment between generated anomalies and masks. In contrast, our model incorporates a spatial anomaly embedding and an adaptive attention re-weighting mechanism, and can generate anomalous image-mask pairs with great diversity and authenticity.

3 Method

Our AnomalyDiffusion aims to generate a large amount of anomaly data aligned with anomaly masks, by learning from a few anomaly samples. The inputs to our model include an anomaly-free sample $y$ and an anomaly mask $m$, and the output is an image with anomalies generated in the mask area, while the remaining region is consistent with the input anomaly-free sample.

As shown in Fig. 2, our AnomalyDiffusion is developed based on the Latent Diffusion Model (Rombach et al. 2022). To disentangle the anomaly location information from anomaly appearance, we propose Spatial Anomaly Embedding $e$, which consists of an anomaly embedding $e_a$ (for anomaly appearance) and a spatial embedding $e_s$ (for anomaly location). Moreover, to enhance the alignment between the generated anomalies and given masks, we introduce an Adaptive Attention Re-weighting Mechanism, which helps the model to allocate more attention to the areas with less noticeable generated anomalies (Fig. 3(c)).

Specifically, the anomaly embedding $e_a$ provides the anomaly appearance type information, with one $e_a$ corresponding to a certain type of anomaly (e.g., hazelnut-crack, capsule-squeeze), and is learned by our masked textual inversion (Sec. 3.2). The spatial embedding $e_s$ provides the anomaly location information, and is encoded from the input anomaly mask $m$ by a spatial encoder $E$ (shared among all anomalies). By combining the anomaly embedding $e_a$ with the spatial embedding $e_s$, the spatial anomaly embedding $e$ contains both the anomaly appearance and spatial information, and serves as the text condition in the diffusion model to guide the generation process. With the spatial anomaly embedding as condition, given a normal sample, we generate an anomaly image with the blended diffusion process (Avrahami, Lischinski, and Fried 2022):

$$x_{t-1} = p_\theta(x_{t-1} \mid x_t, e) \odot m + q(y_{t-1} \mid y_0) \odot (1 - m), \tag{1}$$

where $x_t$ is the generated anomaly image at timestep $t$, $y_0$ is the input normal sample, $m$ is the anomaly mask, and $q(\cdot)$ and $p_\theta(\cdot)$ are the forward and backward processes of diffusion, as described in Sec. 3.1.
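The masked blending in Eq. (1) can be illustrated with a minimal NumPy sketch. It assumes the two samples (one from the denoising step, one from noising the normal image) have already been drawn; the function name `blended_step` and the toy arrays are hypothetical and are not taken from the released code.

```python
import numpy as np

def blended_step(x_denoised, y_noised, mask):
    """Blend one denoised sample with the noised normal sample, as in Eq. (1).

    x_denoised: stand-in for a sample from p_theta(x_{t-1} | x_t, e), shape (H, W, C)
    y_noised:   stand-in for a sample from q(y_{t-1} | y_0), same shape
    mask:       binary anomaly mask m, shape (H, W), 1 inside the anomaly region
    """
    m = mask[..., None].astype(x_denoised.dtype)  # broadcast the mask over channels
    return x_denoised * m + y_noised * (1.0 - m)

# Toy usage with random stand-ins for the two samples (hypothetical data).
rng = np.random.default_rng(0)
x_prev = rng.normal(size=(4, 4, 3))
y_prev = rng.normal(size=(4, 4, 3))
mask = np.zeros((4, 4))
mask[1:3, 1:3] = 1
out = blended_step(x_prev, y_prev, mask)
```

Inside the mask the result follows the conditioned denoising branch, and outside it follows the noised normal sample, which is what keeps the background consistent with the input.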

3.1 Preliminaries

Denoising diffusion probabilistic models (DDPM) (Ho, Jain, and Abbeel 2020) have achieved significant success in image generation tasks. They employ a forward process to add noise to the data and then learn to denoise during the backward process, thereby fitting the training data distribution. Given a training image $x_0$, the forward process $q(\cdot)$ of the diffusion model is formulated as:

$$\begin{aligned} q(x_1, \ldots, x_T \mid x_0) &:= \prod_{t=1}^{T} q(x_t \mid x_{t-1}), \\ q(x_t \mid x_{t-1}) &:= \mathcal{N}\left(x_t; \sqrt{1 - \beta_t}\, x_{t-1}, \beta_t \mathbf{I}\right), \end{aligned} \tag{2}$$

where $\beta_t$ is the variance at timestep $t$.
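As a rough illustration of Eq. (2), the forward process can be sampled one step at a time. The sketch below is not from the paper: the linear `betas` schedule, the number of steps, and the function names are assumptions chosen only to make the example runnable.

```python
import numpy as np

def forward_step(x_prev, beta_t, rng):
    """Draw x_t ~ N(sqrt(1 - beta_t) * x_{t-1}, beta_t * I)."""
    noise = rng.normal(size=x_prev.shape)
    return np.sqrt(1.0 - beta_t) * x_prev + np.sqrt(beta_t) * noise

def forward_chain(x0, betas, rng):
    """Apply the forward steps of Eq. (2) in sequence and return x_T."""
    x = x0
    for beta_t in betas:
        x = forward_step(x, beta_t, rng)
    return x

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 1000)  # assumed linear schedule, not specified in the text
x0 = np.ones((8, 8))                   # toy "image"
xT = forward_chain(x0, betas, rng)
```

With a schedule like this, the contribution of `x0` shrinks at every step, so `xT` is close to pure Gaussian noise.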

The backward process is approximated by iteratively predicting the mean $\mu_\theta(x_t, t)$ and variance $\Sigma_\theta(x_t, t)$ (set as a constant in DDPM) of a Gaussian distribution:

$$p_\theta(x_{t-1} \mid x_t) := \mathcal{N}\left(x_{t-1}; \mu_\theta(x_t, t), \Sigma_\theta(x_t, t)\right). \tag{3}$$

Textual inversion (Gal et al. 2022) utilizes a pre-trained Latent Diffusion Model to extract the shared content information in few-shot input samples by optimizing text embeddings. With the refined text embeddings as condition $c$, textual inversion can generate novel images $x_0$ with contents similar to those of the input images by:

$$x_0 = \prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t, c), \quad x_T \sim \mathcal{N}(0, 1). \tag{4}$$

3.2 Spatial Anomaly Embedding

Disentangle spatial information from anomaly appearance. We aim at controllable anomaly generation with specified anomaly type and location. A direct solution is to control anomaly type by a textual embedding learned from textual inversion (Gal et al. 2022), and control anomaly location by the input mask. However, textual inversion tends to capture the location of anomalies along with the anomaly type information, which results in the generated anomalies being distributed only in specific locations. To address the issue, we propose to disentangle the textual embedding into two parts, where one part (the spatial embedding $e_s$) is directly encoded from the anomaly mask to indicate the anomaly location, leaving the rest (the anomaly embedding $e_a$) to only learn anomaly type information. We name our decomposed textual embedding Spatial Anomaly Embedding.

Anomaly embedding. The anomaly embedding is a learned textual embedding that represents the anomaly appearance type information. Unlike the textual inversion method, which learns the features of the entire image, our model only needs to focus on the anomaly areas and does not require information about the entire image. Therefore, we introduce masked textual inversion, where we mask out the irrelevant background and normal regions of the anomaly image, so that only the anomaly regions are visible to the model. We initialize the anomaly embedding $e_a$ with $k$ tokens and optimize it using the masked diffusion loss:

$$\mathcal{L}_{dif} = \left\| m \odot \left(\epsilon - \epsilon_\theta(z_t, t, \{e_a, e_s\})\right) \right\|_2^2, \tag{5}$$

where $\epsilon \sim \mathcal{N}(0, 1)$ and $z_t$ is the noised latent code of the input image $x$ at timestep $t$.
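A minimal NumPy sketch of the masked loss in Eq. (5) is given below. The array `eps_pred` stands in for the U-Net output $\epsilon_\theta(z_t, t, \{e_a, e_s\})$, and the mask is assumed to have already been resized to the latent resolution; both are illustrative assumptions rather than details stated in the text.

```python
import numpy as np

def masked_diffusion_loss(eps, eps_pred, mask):
    """Squared L2 norm of the noise-prediction error restricted to the mask (Eq. (5)).

    eps, eps_pred: true and predicted noise, shape (H, W, C)
    mask:          binary anomaly mask at latent resolution, shape (H, W)
    """
    m = mask[..., None]  # broadcast the mask over channels
    return float(np.sum((m * (eps - eps_pred)) ** 2))

# Toy usage: a unit error everywhere, but only 4 masked pixels x 2 channels count.
eps = np.ones((4, 4, 2))
eps_pred = np.zeros((4, 4, 2))
mask = np.zeros((4, 4))
mask[1:3, 1:3] = 1
loss = masked_diffusion_loss(eps, eps_pred, mask)
```

Errors outside the mask contribute nothing, so gradients only reflect how well the anomaly region is modelled.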

Spatial embedding. To provide accurate spatial information of the anomaly locations, we introduce a spatial encoder $E$ that encodes the input anomaly mask $m$ into the spatial embedding $e_s$, which is in the form of a textual embedding and contains precise location information from the mask. Specifically, we input the anomaly mask into ResNet-50 (He et al. 2016) to extract image features at different layers and fuse them with a Feature Pyramid Network (Lin et al. 2017). Finally, several fully-connected networks are employed to map the fused features into the textual embedding space, with each network predicting one text token, thereby outputting the final spatial embedding $e_s$ with $n$ tokens.

Overall training framework. For each anomaly type $i$, we employ an anomaly embedding $e_{a,i}$ to extract its appearance information, while all anomaly categories share a common spatial encoder $E$. For a set of image-mask pairs $(x_i, m_i)$ in the training data, we first input the anomaly mask $m_i$ into the spatial encoder $E$ to obtain the spatial embedding $e_s = E(m_i)$. Then, we concatenate the anomaly embedding $e_{a,i}$ and the spatial embedding $e_s$ to obtain our spatial anomaly embedding $e = \{e_a, e_s\}$. Finally, the concatenated textual embedding $e$ is used as the text condition to the diffusion model, and the training process can be formulated as:

$$e_a^*, E^* = \mathop{\arg\min}\limits_{e_a, E} \mathbb{E}_{z \sim \mathcal{E}(x_i), m_i, \epsilon, t}\, \mathcal{L}_{dif}, \tag{6}$$

where $\mathcal{E}(\cdot)$ is the image encoder of the latent diffusion model and $\epsilon \sim \mathcal{N}(0, 1)$.
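To make the shapes concrete, the assembly of the spatial anomaly embedding can be sketched as follows. This is only a toy under stated assumptions: the token counts `k` and `n`, the width `d`, and the mask size are made-up values, and the spatial encoder here simply flattens the mask in place of the ResNet-50 and Feature Pyramid Network features, keeping only the idea of one fully-connected head per output token.

```python
import numpy as np

rng = np.random.default_rng(0)
k, n, d = 8, 4, 768   # token counts and embedding width: assumed values
H = W = 16            # toy mask resolution

# Learnable anomaly embedding e_a for one anomaly type (random initialisation here).
e_a = rng.normal(scale=0.02, size=(k, d))

# Stand-in spatial encoder E: flattening replaces the ResNet-50 + FPN features,
# and each of the n linear heads predicts one text token.
heads = [rng.normal(scale=0.02, size=(H * W, d)) for _ in range(n)]

def spatial_encoder(mask):
    feat = mask.reshape(-1).astype(np.float64)   # (H*W,)
    return np.stack([feat @ w for w in heads])   # (n, d)

mask = np.zeros((H, W))
mask[4:9, 5:12] = 1
e_s = spatial_encoder(mask)

# Spatial anomaly embedding e = {e_a, e_s}: concatenation along the token axis.
e = np.concatenate([e_a, e_s], axis=0)           # (k + n, d)
```

In training, `e_a` and the parameters of the encoder would be the quantities optimised in Eq. (6), while the diffusion model itself supplies the loss.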

Figure 3: Comparison between the models w/ (Ours) and w/o Adaptive Attention Re-weighting (AAR). The model w/o AAR cannot generate anomalies to fill the entire mask.

3.3 Adaptive Attention Re-Weighting

With the spatial anomaly embedding $e$ as the text condition, we can guide the generation of anomaly images by Eq. (1). However, the generated anomaly images sometimes fail to fill the entire mask, especially when there are multiple anomaly regions in the mask or when the mask has an irregular shape (Fig. 3-a/c). In such cases, the generated anomalies are usually not well aligned with the mask, which limits the improvement in downstream anomaly localization tasks. To address this problem, we propose an adaptive attention re-weighting mechanism, which allocates more attention to the areas with less noticeable generated anomalies during the denoising process, thereby facilitating better alignment between the generated anomalies and the anomaly masks.

Adaptive attention weight map. Specifically, at the $t$-th denoising step, we calculate the corresponding $\hat{x}_0 = D(p_\theta(\hat{z}_0 \mid z_t, e))$ (where $D$ is the decoder of the LDM). Then, we calculate the pixel-level difference between $\hat{x}_0$ and the normal sample $y$ within the mask $m$. Based on the difference, we calculate the weight map $w_m$ by the Adaptive Scaling Softmax (ASS) operation:

$$w_m = \|m\|_1 \cdot \mathrm{Softmax}\left(f\left(\|m \odot y - m \odot \hat{x}_0\|_2^2\right)\right), \tag{7}$$

where $f(x) = \frac{1}{x}$ when $x \neq 0$ and $f(x) = -\infty$ otherwise. For the regions within the mask that are similar to normal samples, the generated anomalies in these regions are less noticeable. To enhance the anomaly generation effects, these regions are assigned higher weights by Eq. (7) and are allocated more attention by attention re-weighting.
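The Adaptive Scaling Softmax of Eq. (7) can be sketched in NumPy as below. The per-pixel difference is taken here as the squared L2 norm over channels and the softmax runs over all pixels; both choices, the function name, and the handling of the case where every difference is zero are assumptions made for illustration.

```python
import numpy as np

def adaptive_scaling_softmax(y, x0_hat, mask):
    """Weight map w_m of Eq. (7).

    y, x0_hat: normal sample and current denoised estimate, shape (H, W, C)
    mask:      binary anomaly mask m, shape (H, W)
    """
    m = mask[..., None]
    diff = np.sum((m * y - m * x0_hat) ** 2, axis=-1)  # per-pixel squared L2, (H, W)
    nonzero = diff != 0
    if not nonzero.any():
        # Edge case not covered by the text: every logit would be -inf.
        # Returning zeros is an assumption of this sketch.
        return np.zeros_like(diff)
    logits = np.full(diff.shape, -np.inf)
    logits[nonzero] = 1.0 / diff[nonzero]              # f(x) = 1/x, -inf where x == 0
    logits -= logits[nonzero].max()                    # stabilise the softmax
    weights = np.exp(logits)                           # exp(-inf) evaluates to 0
    weights /= weights.sum()
    return mask.sum() * weights                        # scale by ||m||_1

# Toy usage: two masked pixels, one changed slightly and one changed strongly.
y = np.zeros((2, 2, 1))
x0_hat = np.array([[[1.0], [2.0]], [[5.0], [5.0]]])
mask = np.array([[1.0, 1.0], [0.0, 0.0]])
w = adaptive_scaling_softmax(y, x0_hat, mask)
```

In the toy example the pixel that differs less from the normal sample receives the larger weight, pixels outside the mask receive zero, and the weights sum to the mask area.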

Category     DiffAug       CDC           Crop-Paste    SDGAN         Defect-GAN    DFMGAN        Ours
             IS↑    IC-L↑  IS↑    IC-L↑  IS↑    IC-L↑  IS↑    IC-L↑  IS↑    IC-L↑  IS↑    IC-L↑  IS↑    IC-L↑
bottle       1.59   0.03   1.52   0.04   1.43   0.04   1.57   0.06   1.39   0.07   1.62   0.12   1.58   0.19
cable        1.72   0.07   1.97   0.19   1.74   0.25   1.89   0.19   1.70   0.22   1.96   0.25   2.13   0.41
capsule      1.34   0.03   1.37   0.06   1.23   0.05   1.49   0.03   1.59   0.04   1.59   0.11   1.59   0.21
carpet       1.19   0.06   1.25   0.03   1.17   0.11   1.18   0.11   1.24   0.12   1.23   0.13   1.16   0.24
grid         1.96   0.06   1.97   0.07   2.00   0.12   1.95   0.10   2.01   0.12   1.97   0.13   2.04   0.44
hazel_nut    1.67   0.05   1.97   0.05   1.74   0.21   1.85   0.16   1.87   0.19   1.93   0.24   2.13   0.31
leather      2.07   0.06   1.80   0.07   1.47   0.14   2.04   0.12   2.12   0.14   2.06   0.17   1.94   0.41
metal nut    1.58   0.29   1.55   0.04   1.56   0.15   1.45   0.28   1.47   0.30   1.49   0.32   1.96   0.30
pill         1.53   0.05   1.56   0.06   1.49   0.11   1.61   0.07   1.61   0.10   1.63   0.16   1.61   0.26
screw        1.10   0.10   1.13   0.11   1.12   0.16   1.17   0.10   1.19   0.12   1.12   0.14   1.28   0.30
tile         1.93   0.09   2.10   0.12   1.83   0.20   2.53   0.21   2.35   0.22   2.39   0.22   2.54   0.55
toothbrush   1.33   0.06   1.63   0.06   1.30   0.08   1.78   0.03   1.85   0.03   1.82   0.18   1.68   0.21
transistor   1.34   0.05   1.61   0.13   1.39   0.15   1.76   0.13   1.47   0.13   1.64   0.25   1.57   0.34
wood         2.05   0.30   2.05   0.03   1.95   0.23   2.12   0.25   2.19   0.29   2.12   0.35   2.33   0.37
zipper       1.30   0.05   1.30   0.05   1.23   0.11   1.25   0.10   1.25   0.10   1.29   0.27   1.39   0.25
Average      1.58   0.09   1.65   0.07   1.51   0.14   1.71   0.13   1.69   0.15   1.72   0.20   1.80   0.32
Table 1: Comparison of IS and IC-LPIPS (IC-L) on the MVTec dataset. Our model generates the highest-quality and most diverse anomaly data, achieving the best average IS and IC-LPIPS. Bold and underline represent optimal and sub-optimal results, respectively.

Attention re-weighting. We employ the weight map $w_m$ to adaptively control the cross-attention, in order to guide our model to focus more on the areas with less noticeable generated anomalies. In our cross-attention calculation, Query is calculated from the latent code $z_t$, and Key and Value are calculated from our spatial anomaly embedding $e$:

Q=WQ(i)φi(zt),K=WK(i)e,V=WV(i)e,formulae-sequence𝑄superscriptsubscript𝑊𝑄𝑖subscript𝜑𝑖subscript𝑧𝑡formulae-sequence𝐾superscriptsubscript𝑊𝐾𝑖𝑒𝑉superscriptsubscript𝑊𝑉𝑖𝑒Q=W_{Q}^{(i)}\cdot\varphi_{i}\left(z_{t}\right),K=W_{K}^{(i)}\cdot e,V=W_{V}^{% (i)}\cdot e,italic_Q = italic_W start_POSTSUBSCRIPT italic_Q end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT ⋅ italic_φ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_z start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) , italic_K = italic_W start_POSTSUBSCRIPT italic_K end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT ⋅ italic_e , italic_V = italic_W start_POSTSUBSCRIPT italic_V end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT ⋅ italic_e , (8)

where φisubscript𝜑𝑖\varphi_{i}italic_φ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT is the intermediate representation of the U-Net (ϵθsubscriptitalic-ϵ𝜃\epsilon_{\theta}italic_ϵ start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT) and the W(i)superscript𝑊𝑖W^{(i)}italic_W start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPTs are the learnable projection matrices. The cross-attention calculation process is then formulated as Attn(Q,K,V)=mcV𝐴𝑡𝑡𝑛𝑄𝐾𝑉subscript𝑚𝑐𝑉Attn(Q,K,V)=m_{c}\cdot Vitalic_A italic_t italic_t italic_n ( italic_Q , italic_K , italic_V ) = italic_m start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT ⋅ italic_V, where mc=Softmax(QKTd)subscript𝑚𝑐𝑆𝑜𝑓𝑡𝑚𝑎𝑥𝑄superscript𝐾𝑇𝑑m_{c}=Softmax(\frac{QK^{T}}{\sqrt{d}})italic_m start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT = italic_S italic_o italic_f italic_t italic_m italic_a italic_x ( divide start_ARG italic_Q italic_K start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT end_ARG start_ARG square-root start_ARG italic_d end_ARG end_ARG ) is the cross-attention map.

Figure 4: Comparison on the generation results on MVTec. Our model generates high quality anomaly images that are accurately aligned with the anomaly masks.

Considering the cross-attention map $m_c$ controls the generated layout and effects, where higher attention leads to stronger generation effects (Hertz et al. 2022), we reweight the cross-attention map by our weight map: $m'_c = m_c \odot w_m$. The new cross-attention map $m'_c$ focuses more on the areas with less noticeable generated anomalies, thereby enhancing the alignment accuracy between the generated anomalies and the input anomaly masks. The re-weighted cross-attention is formulated as $RW\text{-}Attn(Q, K, V) = m'_c \cdot V$.
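The re-weighting step can be illustrated with a minimal NumPy sketch. It is not the released implementation: it uses a row-vector convention (features times projection matrix), and it assumes that the weight map holds one weight per spatial position and is broadcast over the embedding tokens. The function name `rw_attention` and all tensor sizes are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def rw_attention(phi, e, W_q, W_k, W_v, w_m):
    """Sketch of RW-Attn(Q, K, V) = (m_c * w_m) @ V.

    phi : (hw, c)   flattened U-Net features phi_i(z_t)
    e   : (n, c_e)  tokens of the spatial anomaly embedding
    w_m : (hw,)     one weight per spatial position (assumed layout)
    """
    Q = phi @ W_q                                   # (hw, d)
    K = e @ W_k                                     # (n, d)
    V = e @ W_v                                     # (n, d)
    d = Q.shape[-1]
    m_c = softmax(Q @ K.T / np.sqrt(d), axis=-1)    # cross-attention map, (hw, n)
    m_c_prime = m_c * w_m[:, None]                  # element-wise re-weighting
    return m_c_prime @ V, m_c

# Toy sizes, chosen only for illustration.
rng = np.random.default_rng(0)
hw, c, n, c_e, d = 16, 8, 12, 6, 4
phi = rng.normal(size=(hw, c))
e = rng.normal(size=(n, c_e))
W_q = rng.normal(size=(c, d))
W_k = rng.normal(size=(c_e, d))
W_v = rng.normal(size=(c_e, d))
w_m = np.ones(hw)
w_m[3] = 2.0                                        # emphasize one position
out, m_c = rw_attention(phi, e, W_q, W_k, W_v, w_m)
```

In this sketch, positions whose weight equals 1 receive the ordinary cross-attention output, while a larger weight scales the contribution of the embedding at that position.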

3.4 Mask Generation

Recall that our model requires anomaly masks as inputs. However, the number of real anomaly masks in the training datasets is very small, and the mask data lacks diversity even after augmentation, which motivates us to generate more anomaly masks by learning the real mask distribution. We employ textual inversion to learn a mask embedding $e_m$, which can be used as a text condition to generate extensive anomaly masks. Specifically, we initialize the mask embedding $e_m$ as $k'$ random tokens and optimize it by:

$$e_m^* = \mathop{\arg\min}\limits_{e_m} \mathbb{E}_{z \sim \mathcal{E}(m), \epsilon, t}\left[\left\| \epsilon - \epsilon_\theta\left(z_t, t, e_m\right) \right\|_2^2\right]. \tag{9}$$

With the learned mask embedding, we can generate extensive anomaly masks for each type of anomaly.
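As a rough illustration of the objective in Eq. (9), the sketch below computes a one-sample estimate of the denoising loss for a given embedding. It assumes the standard DDPM forward process $z_t = \sqrt{\bar{\alpha}_t} z_0 + \sqrt{1 - \bar{\alpha}_t}\,\epsilon$; the noise level `ALPHA_BAR` and the function `toy_eps_theta` are hypothetical stand-ins for the noise schedule and the frozen U-Net, not part of the actual model.

```python
import numpy as np

def denoising_loss(z0, e_m, eps_theta, alpha_bar_t, t, rng):
    """One-sample estimate of ||eps - eps_theta(z_t, t, e_m)||_2^2."""
    eps = rng.normal(size=z0.shape)
    # Standard DDPM forward process (assumed here).
    z_t = np.sqrt(alpha_bar_t) * z0 + np.sqrt(1.0 - alpha_bar_t) * eps
    return float(np.sum((eps - eps_theta(z_t, t, e_m)) ** 2))

ALPHA_BAR = 0.7  # arbitrary noise level, for illustration only

def toy_eps_theta(z_t, t, e_m):
    # Hypothetical frozen denoiser: treats e_m as a guess of the clean latent.
    return (z_t - np.sqrt(ALPHA_BAR) * e_m) / np.sqrt(1.0 - ALPHA_BAR)

rng = np.random.default_rng(0)
z0 = rng.normal(size=(4,))
# Only e_m changes between the two calls; the denoiser stays fixed.
loss_good = denoising_loss(z0, z0.copy(), toy_eps_theta, ALPHA_BAR, t=10, rng=rng)
loss_bad = denoising_loss(z0, np.zeros(4), toy_eps_theta, ALPHA_BAR, t=10, rng=rng)
```

The point of the toy setup is that the denoiser is held fixed and only the embedding varies, so minimizing the loss over `e_m` is the only degree of freedom, as in textual inversion.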

4 Experiments

4.1 Experiment Settings

Dataset. We conduct experiments on the widely used MVTec (Bergmann et al. 2019) dataset. We employ the one-third of the anomaly data with the lowest ID numbers as the training set, reserving the remaining two-thirds for testing.
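The split can be sketched as follows. This is only an illustration: the helper `split_by_id` is hypothetical, and rounding the training share down when the number of samples is not divisible by three is an assumption, since the rounding rule is not specified here.

```python
def split_by_id(sample_ids):
    """Lowest-ID third for training, the rest for testing (floor rounding assumed)."""
    ordered = sorted(sample_ids)
    n_train = len(ordered) // 3
    return ordered[:n_train], ordered[n_train:]

train_ids, test_ids = split_by_id([7, 2, 5, 0, 3, 8, 1, 6, 4])
```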

Implementation details. We assign $k=8$ tokens for anomaly embedding, $n=4$ tokens for spatial embedding, and $k'=4$ tokens for mask embedding. For each type of anomaly, we generate 1000 anomalous image-mask pairs for the downstream anomaly inspection tasks. More details are recorded in the supplementary material.

Metric. 1) For generation, due to the limited anomaly data, FID (Heusel et al. 2017) and KID (Bińkowski et al. 2018) are not reliable, since an overfitted model tends to yield better (even the best) scores (Duan et al. 2023). Therefore, we employ Inception Score (IS), which is independent of the given anomaly data, for a direct assessment of generation quality; we also introduce the intra-cluster pairwise LPIPS distance (IC-LPIPS) (Ojha et al. 2021) to measure the generation diversity. 2) For anomaly inspection, we utilize AUROC, Average Precision (AP), and the $F_1$-max score to evaluate the accuracy of anomaly detection and localization.
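For reference, the $F_1$-max score can be sketched as the best $F_1$ obtained over all decision thresholds. The helper below is a plain illustration that takes the thresholds from the scores themselves; it is not the evaluation code used in the experiments, and its quadratic cost makes it suitable only for small inputs.

```python
import numpy as np

def f1_max(scores, labels):
    """Maximum F1 over all thresholds taken from the scores themselves."""
    scores = np.asarray(scores, dtype=float).ravel()
    labels = np.asarray(labels, dtype=bool).ravel()
    best = 0.0
    for thr in np.unique(scores):
        pred = scores >= thr                  # predict "anomalous" at or above thr
        tp = np.sum(pred & labels)
        fp = np.sum(pred & ~labels)
        fn = np.sum(~pred & labels)
        denom = 2 * tp + fp + fn
        if denom > 0:
            best = max(best, 2 * tp / denom)  # F1 = 2TP / (2TP + FP + FN)
    return float(best)

scores = [0.1, 0.4, 0.35, 0.8]
labels = [0, 0, 1, 1]
best_f1 = f1_max(scores, labels)
```

The same function applies to pixel-level evaluation by passing flattened anomaly maps and ground-truth masks, and to image-level evaluation by passing one score and one label per image.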

Figure 5: Qualitative anomaly localization comparison with a U-Net trained on the data generated by DRAEM, DFMGAN and our model. It shows that our model achieves the best anomaly localization results.

Category DRAEM PRN DFMGAN Ours
AUC AP $F_1$-max AUC AP $F_1$-max AUC AP $F_1$-max AUC AP $F_1$-max
bottle 96.7 80.2 74.0 97.5 76.4 71.3 98.9 90.2 83.9 99.4 94.1 87.3
cable 80.3 21.8 28.3 94.5 64.4 61.0 97.2 81.0 75.4 99.2 90.8 83.5
capsule 76.2 25.5 32.1 95.6 45.7 47.9 79.2 26.0 35.0 98.8 57.2 59.8
carpet 92.6 43.0 41.9 96.4 69.6 65.6 90.6 33.4 38.1 98.6 81.2 74.6
grid 99.1 59.3 58.7 98.9 58.6 58.9 75.2 14.3 20.5 98.3 52.9 54.6
hazelnut 98.8 73.6 68.5 98.0 73.9 68.2 99.7 95.2 89.5 99.8 96.5 90.6
leather 98.5 67.6 65.0 99.4 58.1 54.0 98.5 68.7 66.7 99.8 79.6 71.0
metal nut 96.9 84.2 74.5 97.9 93.0 87.1 99.3 98.1 94.5 99.8 98.7 94.0
pill 95.8 45.3 53.0 98.3 55.5 72.6 81.2 67.8 72.6 99.8 97.0 90.8
screw 91.0 30.1 35.7 94.0 47.7 49.8 58.8 2.2 5.3 97.0 51.8 50.9
tile 98.5 93.2 87.8 98.5 91.8 84.4 99.5 97.1 91.6 99.2 93.9 86.2
toothbrush 93.8 29.5 28.4 96.1 46.4 46.2 96.4 75.9 72.6 99.2 76.5 73.4
transistor 76.5 31.7 24.2 94.9 68.6 68.4 96.2 81.2 77.0 99.3 92.6 85.7
wood 98.8 87.8 80.9 96.2 74.2 67.4 95.3 70.7 65.8 98.9 84.6 74.5
zipper 93.4 65.4 64.7 98.4 79.0 73.7 92.9 65.6 64.9 99.4 86.0 79.2
Average 92.2 54.1 53.1 96.9 66.2 64.7 90.0 62.7 62.1 99.1 81.4 76.3
Table 2: Comparison of pixel-level anomaly localization on the MVTec dataset, obtained by training a U-Net on the data generated by DRAEM, PRN, DFMGAN and our model.

4.2 Comparison in Anomaly Generation

Baseline. The compared anomaly generation methods can be classified into two groups: 1) the models (Crop&Paste (Lin et al. 2021), DRAEM (Zavrtanik, Kristan, and Skočaj 2021), PRN (Zhang et al. 2023a) and DFMGAN (Duan et al. 2023)) that can generate anomalous image-mask pairs, which are employed to compare anomaly detection and localization; 2) the models (DiffAug (Zhao et al. 2020), CDC (Ojha et al. 2021), Crop&Paste, SDGAN (Niu et al. 2020), Defect-GAN (Zhang et al. 2021) and DFMGAN) that can generate specific anomaly types, which are employed to compare anomaly generation quality and classification.

Category DRAEM PRN DFMGAN Ours
AUC AP $F_1$-max AUC AP $F_1$-max AUC AP $F_1$-max AUC AP $F_1$-max
bottle 99.3 99.8 98.9 94.9 98.4 94.1 99.3 99.8 97.7 99.8 99.9 98.9
cable 72.1 83.2 79.2 86.3 92.0 84.0 95.9 97.8 93.8 100 100 100
capsule 93.2 98.7 94.0 84.9 95.8 94.3 92.8 98.5 94.5 99.7 99.9 98.7
carpet 95.3 98.7 93.4 92.6 97.8 92.1 67.9 87.9 87.3 96.7 98.8 94.3
grid 99.8 99.9 98.8 96.6 98.9 95.0 73.0 90.4 85.4 98.4 99.5 98.7
hazelnut 100 100 100 93.6 96.0 94.1 99.9 100 99.0 99.8 99.9 98.9
leather 100 100 100 99.1 99.7 97.6 99.9 100 99.2 100 100 100
metal nut 97.8 99.6 97.6 97.8 99.5 96.9 99.3 99.8 99.2 100 100 100
pill 94.4 98.9 95.8 88.8 97.8 93.2 68.7 91.7 91.4 98.0 99.6 97.0
screw 88.5 96.3 89.3 84.1 94.7 87.2 22.3 64.7 85.3 96.8 97.9 95.5
tile 100 100 100 91.1 96.9 89.3 100 100 100 100 100 100
toothbrush 99.4 99.8 97.6 100 100 100 100 100 100 100 100 100
transistor 79.6 80.5 71.4 88.2 88.9 84.0 90.8 92.5 88.9 100 100 100
wood 100 100 100 77.5 92.7 86.7 98.4 99.4 98.8 98.4 99.4 98.8
zipper 100 100 100 98.7 99.7 97.6 99.7 99.9 99.4 99.9 100 99.4
Average 94.6 97.0 94.4 91.6 96.6 92.4 87.2 94.8 94.7 99.2 99.7 98.7
Table 3: Comparison on image-level anomaly detection.
Category DiffAug CDC Crop&Paste SDGAN Defect-GAN DFMGAN Ours
bottle 48.84 38.76 52.71 48.84 53.49 56.59 90.70
cable 21.36 39.06 32.81 21.88 21.36 45.31 67.19
capsule 34.67 28.89 32.89 30.22 32.00 37.23 66.67
carpet 35.48 25.27 27.96 21.50 29.03 47.31 58.06
grid 28.33 35.83 28.33 30.83 27.50 40.83 42.50
hazelnut 65.28 54.86 59.03 43.75 61.11 81.94 85.42
leather 40.74 43.38 34.39 38.10 42.33 49.73 61.90
metal nut 58.85 48.44 59.89 44.27 56.77 64.58 59.38
pill 29.86 21.88 26.74 20.49 28.47 29.52 59.38
screw 25.10 32.92 28.81 26.75 28.81 37.45 48.15
tile 59.65 48.54 68.42 42.69 26.90 74.85 84.21
transistor 38.09 29.76 41.67 32.14 35.72 52.38 60.71
wood 41.27 28.57 47.62 30.95 24.60 49.21 71.43
zipper 22.76 14.63 26.42 21.54 18.70 27.64 69.51
Average 39.31 35.06 40.55 32.43 34.77 49.61 66.09
Table 4: Comparison on anomaly classification accuracy trained on the generated data by the anomaly generation models with a ResNet-18.
Category Unsupervised Supervised
KDAD CFLOW DRAEM SSPCAB CFA RD4AD PatchCore DevNet DRA PRN Ours
bottle 94.7/50.5 98.8/49.9 99.1/88.5 98.9/88.6 98.9/50.9 98.8/51.0 97.6/75.0 96.7/67.9 91.7/41.5 99.4/92.3 99.3/94.1
cable 79.2/11.6 98.9/72.6 94.8/61.4 93.1/52.1 98.4/79.8 98.8/77.0 96.8/65.9 97.9/67.6 86.1/34.8 98.8/78.9 99.2/90.8
capsule 96.3/09.9 99.5/64.0 97.6/47.9 90.4/48.7 98.9/71.1 99.0/60.5 98.6/46.6 91.1/46.6 88.5/11.0 98.5/62.2 98.8/57.2
carpet 91.5/45.8 99.7/67.0 96.3/62.5 92.3/49.1 99.1/47.7 99.4/46.0 98.7/65.0 94.6/19.6 98.2/54.0 99.0/82.0 98.6/81.2
grid 89.0/07.6 99.1/87.8 99.5/53.2 99.6/58.2 98.6/82.9 98.0/75.4 97.2/23.6 90.2/44.9 86.2/28.6 98.4/45.7 98.3/52.9
hazelnut 95.0/34.2 97.9/67.2 99.5/88.1 99.6/94.5 98.5/80.2 94.2/57.2 97.6/55.2 76.9/46.8 88.8/20.3 99.7/93.8 99.8/96.5
leather 98.2/26.7 99.2/91.1 98.8/68.5 97.2/60.3 96.2/60.9 96.6/53.5 98.9/43.4 94.3/66.2 97.2/05.1 99.7/69.7 99.8/79.6
metal nut 81.7/30.6 98.8/78.2 98.7/91.6 99.3/95.1 98.6/74.6 97.3/53.8 97.5/86.6 93.3/57.4 80.3/30.6 99.7/98.0 99.8/98.7
pill 90.1/23.1 98.9/60.3 97.7/44.8 96.5/48.1 98.8/67.9 98.4/58.1 97.0/75.9 98.9/79.9 79.6/22.1 99.5/91.3 99.8/97.0
screw 95.4/05.9 98.8/45.7 99.7/72.9 99.1/62.0 98.7/61.4 99.1/51.8 98.7/34.2 66.5/21.1 51.0/05.1 97.5/44.9 97.0/51.8
tile 78.6/26.7 98.0/86.7 99.4/96.4 99.2/96.3 98.6/92.6 97.4/78.2 94.9/56.0 88.7/63.9 91.0/54.4 99.6/96.5 99.2/93.9
toothbrush 95.6/20.0 99.1/56.9 97.3/49.2 97.5/38.9 98.4/61.7 99.0/63.1 97.6/37.1 96.3/52.4 74.5/04.8 99.6/78.1 99.1/76.5
transistor 76.0/25.9 98.8/40.6 92.2/56.0 85.3/36.5 98.6/82.9 99.6/50.3 91.8/66.7 55.2/04.4 79.3/11.2 98.4/85.6 99.3/92.6
wood 88.3/24.7 98.9/47.2 97.6/81.6 97.2/77.1 97.6/25.6 99.3/39.1 95.7/54.3 93.1/47.9 82.9/21.0 97.8/82.6 98.9/84.6
zipper 95.1/30.5 96.5/63.9 98.6/73.6 98.1/78.2 95.9/53.9 99.7/52.7 98.5/63.1 92.4/53.1 96.8/42.3 98.8/77.6 99.4/86.0
Average 89.6/24.9 98.7/65.3 97.7/69.0 96.2/65.5 98.3/66.3 98.3/57.8 97.1/56.6 86.4/49.3 84.8/25.7 99.0/78.6 99.1/81.4
Table 5: Comparison on pixel-level anomaly localization (AUROC/AP) between the simple U-Net trained on our generated dataset and the existing anomaly detection methods with their official codes or pre-trained models.
Method Metric
SAE Masked $\mathcal{L}$ AAR AUROC AP $F_1$-max
– – – 81.3 31.1 46.5
✓ – – 90.3 51.2 60.7
✓ ✓ – 95.0 64.9 68.8
– ✓ ✓ 95.5 67.5 68.9
✓ ✓ ✓ 99.1 81.4 76.3
Table 6: Ablation study on our spatial anomaly embedding (SAE), masked diffusion loss (Masked $\mathcal{L}$) and adaptive attention re-weighting mechanism (AAR).

Anomaly generation quality. We compare our model with DiffAug, CDC, Crop&Paste, SDGAN, Defect-GAN and DFMGAN on anomaly generation quality and diversity in Tab. 1. Since DRAEM and PRN crop random textures to imitate anomalies, we cannot compute IC-LPIPS for them. For each anomaly category, we allocate one-third of the anomaly data for training and generate 1000 anomaly images to compute IS and IC-LPIPS. The results demonstrate that, on average, our model generates anomaly data with both the highest quality and the highest diversity.

Moreover, we exhibit the generated anomalies in Fig. 4. It can be seen that our model excels in producing high-quality authentic anomalies that accurately align with their corresponding masks. In contrast, CDC yields visually perplexing outcomes, particularly for structural anomaly categories like capsule-squeeze. SDGAN and Defect-GAN yield poor outputs, frequently encountering difficulties in generating anomalies such as pill-crack. The state-of-the-art model DFMGAN sometimes struggles to produce authentic anomalies and fails to keep the alignment between the generated anomalies and masks, as shown in metal nut-bent. More results are presented in the supplementary material.

Anomaly generation for anomaly detection and localization. We compare the performance of our approach with existing anomaly generation methods in downstream anomaly detection and localization. Due to the inability of DiffAug and SDGAN to generate anomaly masks, we only compare our method with Crop&Paste, DRAEM, PRN and DFMGAN. For each method, we generate 1000 images per anomaly category and train a U-Net (Ronneberger, Fischer, and Brox 2015) on them alongside normal samples for anomaly localization. The localization outcomes are aggregated using average pooling to derive confidence scores for image-level anomaly detection (the same as DRAEM). We compute pixel-level metrics including AUROC, AP and $F_1$-max. The results, as presented in Tab. 2, illustrate that our model outperforms other anomaly generation models in most cases. Furthermore, we also evaluate image-level AUROC, AP, and $F_1$-max scores in Tab. 3, which demonstrates that our model has the best average anomaly detection performance among the compared methods. We also compare the qualitative results on anomaly localization in Fig. 5, which shows our superior performance in localizing the anomalies.
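The aggregation from a pixel-level anomaly map to an image-level confidence score can be sketched as below. Two details are assumptions rather than statements from the text: the score is taken as the maximum of the average-pooled map, and the default window is 21×21 with zero padding. The function name `image_level_score` is hypothetical.

```python
import numpy as np

def image_level_score(anomaly_map, kernel=21):
    """Smooth a per-pixel anomaly map with average pooling, then take the maximum."""
    if kernel % 2 == 0:
        raise ValueError("kernel must be odd so the output keeps the input size")
    pad = kernel // 2
    padded = np.pad(anomaly_map, pad, mode="constant")      # zero padding (assumed)
    # Summed-area table: every k x k window sum in O(h * w).
    sat = np.pad(padded.cumsum(0).cumsum(1), ((1, 0), (1, 0)))
    win = (sat[kernel:, kernel:] - sat[:-kernel, kernel:]
           - sat[kernel:, :-kernel] + sat[:-kernel, :-kernel])
    pooled = win / (kernel * kernel)                        # same shape as the input
    return float(pooled.max())

# A 3x3 anomalous block fills a 3x3 window completely.
demo_map = np.zeros((8, 8))
demo_map[2:5, 2:5] = 1.0
demo_score = image_level_score(demo_map, kernel=3)
```

Averaging before taking the maximum makes the score less sensitive to isolated high-response pixels than a plain maximum over the raw map.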

Anomaly generation for anomaly classification. To further validate the generation quality of our model, we employ the generated anomalies to train a downstream anomaly classification model. Specifically, we adopt the experiment setting in DFMGAN, which trains a ResNet-34 (He et al. 2016) on the generated dataset and tests the classification accuracy on the remaining shared test dataset. The comparison results are shown in Tab. 4. It can be seen that our model outperforms all other models on almost all types of components, and its average accuracy (66.09%) surpasses that of the second-ranked DFMGAN (49.61%) by 16.48 percentage points.

4.3 Comparison with Anomaly Detection Models

To further validate the efficacy of our model, we conduct a comparative experiment with the state-of-the-art anomaly detection methods CFLOW (Gudovskiy, Ishizaka, and Kozuka 2022), DRAEM (Zavrtanik, Kristan, and Skočaj 2021), CFA (Lee, Lee, and Song 2022), RD4AD (Deng and Li 2022), PatchCore (Roth et al. 2022), DevNet (Pang et al. 2021), DRA (Ding, Pang, and Shen 2022) and PRN (Zhang et al. 2023a). We employ their official codes or pre-trained models and evaluate them on the same testing dataset that we use. Note that, since no open-source code is available for PRN, we use the results reported in its paper. The comparison results on pixel-level AUROC and AP are presented in Tab. 5. Although our model is only a simple U-Net, with the help of our generated anomaly data it performs well in anomaly localization, achieving the highest average AP (81.4%) and AUROC (99.1%), which indicates the value of our generated data for downstream anomaly inspection tasks.

4.4 Ablation Study

We evaluate the effectiveness of our components: spatial anomaly embedding (SAE), masked diffusion loss (Masked $\mathcal{L}$) and adaptive attention re-weighting mechanism (AAR). Note that the models without SAE employ only an anomaly embedding trained by textual inversion. We train 5 models: 1) with none of these components; 2) only SAE; 3) SAE + masked $\mathcal{L}$; 4) masked $\mathcal{L}$ + AAR and 5) the full model (ours). We employ these models to generate 1000 anomalous image-mask pairs and train a U-Net for anomaly localization. We compare the pixel-level localization results in Tab. 6. The results demonstrate that the omission of any of the proposed modules leads to a noticeable decline in the model's performance on anomaly localization, which validates the efficacy of the proposed modules. For more experiments, please refer to the supplementary material.

5 Conclusion

In this paper, we propose AnomalyDiffusion, a novel anomaly generation model which generates anomalous image-mask pairs. We disentangle anomaly information into anomaly appearance and location information, represented by an anomaly embedding and a spatial embedding in the textual space of LDM. Moreover, we introduce an adaptive attention re-weighting mechanism, which helps our model focus more on the areas with less noticeable generated anomalies, thus improving the alignment between the generated anomalies and masks. Extensive experiments show that our model outperforms the existing anomaly generation methods and that our generated anomaly data effectively improves the performance of the downstream anomaly inspection tasks. In future work, we will explore the application of a more potent diffusion model to enhance the resolution of the generated anomalies, which could further improve the performance.

Acknowledgments

This work was supported by National Natural Science Foundation of China (62302297, 72192821, 62272447), Young Elite Scientists Sponsorship Program by CAST (2022QNRC001), Shanghai Sailing Program (22YF1420300), Beijing Natural Science Foundation (L222117), the Fundamental Research Funds for the Central Universities (YG2023QNB17), Shanghai Municipal Science and Technology Major Project (2021SHZDZX0102), Shanghai Science and Technology Commission (21511101200), CCF-Tencent Open Research Fund (RAGR20220121).

References

  • Avrahami, Lischinski, and Fried (2022) Avrahami, O.; Lischinski, D.; and Fried, O. 2022. Blended diffusion for text-driven editing of natural images. In CVPR, 18208–18218.
  • Bergmann et al. (2019) Bergmann, P.; Fauser, M.; Sattlegger, D.; and Steger, C. 2019. MVTec AD–A comprehensive real-world dataset for unsupervised anomaly detection. In CVPR, 9592–9600.
  • Bińkowski et al. (2018) Bińkowski, M.; Sutherland, D. J.; Arbel, M.; and Gretton, A. 2018. Demystifying mmd gans. arXiv preprint arXiv:1801.01401.
  • Cao et al. (2022) Cao, Y.; Wan, Q.; Shen, W.; and Gao, L. 2022. Informative knowledge distillation for image anomaly segmentation. Knowledge-Based Systems, 248: 108846.
  • Cao et al. (2023) Cao, Y.; Xu, X.; Sun, C.; Cheng, Y.; Du, Z.; Gao, L.; and Shen, W. 2023. Segment Any Anomaly without Training via Hybrid Prompt Regularization. arXiv preprint arXiv:2305.10724.
  • Chen, Han, and Zhang (2023) Chen, X.; Han, Y.; and Zhang, J. 2023. A Zero-/Few-Shot Anomaly Classification and Segmentation Method for CVPR 2023 VAND Workshop Challenge Tracks 1&2: 1st Place on Zero-shot AD and 4th Place on Few-shot AD. arXiv preprint arXiv:2305.17382.
  • Chen et al. (2023) Chen, X.; Zhang, J.; Tian, G.; He, H.; Zhang, W.; Wang, Y.; Wang, C.; Wu, Y.; and Liu, Y. 2023. CLIP-AD: A Language-Guided Staged Dual-Path Model for Zero-shot Anomaly Detection. arXiv preprint arXiv:2311.00453.
  • Deng and Li (2022) Deng, H.; and Li, X. 2022. Anomaly detection via reverse distillation from one-class embedding. In CVPR, 9737–9746.
  • Ding, Pang, and Shen (2022) Ding, C.; Pang, G.; and Shen, C. 2022. Catching both gray and black swans: Open-set supervised anomaly detection. In CVPR, 7388–7398.
  • Duan et al. (2023) Duan, Y.; Hong, Y.; Niu, L.; and Zhang, L. 2023. Few-Shot Defect Image Generation via Defect-Aware Feature Manipulation. In AAAI, volume 37, 571–578.
  • Gal et al. (2022) Gal, R.; Alaluf, Y.; Atzmon, Y.; Patashnik, O.; Bermano, A. H.; Chechik, G.; and Cohen-Or, D. 2022. An image is worth one word: Personalizing text-to-image generation using textual inversion. arXiv preprint arXiv:2208.01618.
  • Goodfellow et al. (2014) Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; and Bengio, Y. 2014. Generative adversarial nets. NIPS, 27.
  • Gu et al. (2023) Gu, Z.; Liu, L.; Chen, X.; Yi, R.; Zhang, J.; Wang, Y.; Wang, C.; Shu, A.; Jiang, G.; and Ma, L. 2023. Remembering Normality: Memory-guided Knowledge Distillation for Unsupervised Anomaly Detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 16401–16409.
  • Gudovskiy, Ishizaka, and Kozuka (2022) Gudovskiy, D.; Ishizaka, S.; and Kozuka, K. 2022. Cflow-ad: Real-time unsupervised anomaly detection with localization via conditional normalizing flows. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 98–107.
  • He et al. (2016) He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In CVPR, 770–778.
  • Hertz et al. (2022) Hertz, A.; Mokady, R.; Tenenbaum, J.; Aberman, K.; Pritch, Y.; and Cohen-Or, D. 2022. Prompt-to-prompt image editing with cross attention control. arXiv preprint arXiv:2208.01626.
  • Heusel et al. (2017) Heusel, M.; Ramsauer, H.; Unterthiner, T.; Nessler, B.; and Hochreiter, S. 2017. Gans trained by a two time-scale update rule converge to a local nash equilibrium. NIPS, 30.
  • Ho, Jain, and Abbeel (2020) Ho, J.; Jain, A.; and Abbeel, P. 2020. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33: 6840–6851.
  • Ho and Salimans (2022) Ho, J.; and Salimans, T. 2022. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598.
  • Hu et al. (2023) Hu, T.; Zhang, J.; Liu, L.; Yi, R.; Kou, S.; Zhu, H.; Chen, X.; Wang, Y.; Wang, C.; and Ma, L. 2023. Phasic Content Fusing Diffusion Model with Directional Distribution Consistency for Few-Shot Model Adaption. In ICCV, 2406–2415.
  • Huang et al. (2022) Huang, C.; Guan, H.; Jiang, A.; Zhang, Y.; Spratling, M.; and Wang, Y.-F. 2022. Registration based few-shot anomaly detection. In ECCV, 303–319. Springer.
  • Jeong et al. (2023) Jeong, J.; Zou, Y.; Kim, T.; Zhang, D.; Ravichandran, A.; and Dabeer, O. 2023. Winclip: Zero-/few-shot anomaly classification and segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 19606–19616.
  • Karras et al. (2020) Karras, T.; Laine, S.; Aittala, M.; Hellsten, J.; Lehtinen, J.; and Aila, T. 2020. Analyzing and improving the image quality of stylegan. In CVPR, 8110–8119.
  • Kingma and Welling (2013) Kingma, D. P.; and Welling, M. 2013. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114.
  • Lee, Lee, and Song (2022) Lee, S.; Lee, S.; and Song, B. C. 2022. Cfa: Coupled-hypersphere-based feature adaptation for target-oriented anomaly localization. IEEE Access, 10: 78446–78454.
  • Li et al. (2021) Li, C.-L.; Sohn, K.; Yoon, J.; and Pfister, T. 2021. Cutpaste: Self-supervised learning for anomaly detection and localization. In CVPR, 9664–9674.
  • Li et al. (2020) Li, Y.; Zhang, R.; Lu, J.; and Shechtman, E. 2020. Few-shot image generation with elastic weight consolidation. arXiv preprint arXiv:2012.02780.
  • Liang et al. (2023) Liang, Y.; Zhang, J.; Zhao, S.; Wu, R.; Liu, Y.; and Pan, S. 2023. Omni-frequency channel-selection representations for unsupervised anomaly detection. IEEE Transactions on Image Processing.
  • Lin et al. (2021) Lin, D.; Cao, Y.; Zhu, W.; and Li, Y. 2021. Few-shot defect segmentation leveraging abundant defect-free training samples through normal background regularization and crop-and-paste operation. In ICME, 1–6. IEEE.
  • Lin et al. (2017) Lin, T.-Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; and Belongie, S. 2017. Feature pyramid networks for object detection. In CVPR, 2117–2125.
  • Mo, Cho, and Shin (2020) Mo, S.; Cho, M.; and Shin, J. 2020. Freeze the discriminator: a simple baseline for fine-tuning gans. arXiv preprint arXiv:2002.10964.
  • Nichol and Dhariwal (2021) Nichol, A. Q.; and Dhariwal, P. 2021. Improved denoising diffusion probabilistic models. In ICML, 8162–8171. PMLR.
  • Niu et al. (2020) Niu, S.; Li, B.; Wang, X.; and Lin, H. 2020. Defect image sample generation with GAN for improving defect recognition. IEEE Transactions on Automation Science and Engineering, 17(3): 1611–1622.
  • Ojha et al. (2021) Ojha, U.; Li, Y.; Lu, J.; Efros, A. A.; Lee, Y. J.; Shechtman, E.; and Zhang, R. 2021. Few-shot image generation via cross-domain correspondence. In CVPR, 10743–10752.
  • Pang et al. (2021) Pang, G.; Ding, C.; Shen, C.; and Hengel, A. v. d. 2021. Explainable deep few-shot anomaly detection with deviation networks. arXiv preprint arXiv:2108.00462.
  • Rombach et al. (2022) Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; and Ommer, B. 2022. High-resolution image synthesis with latent diffusion models. In CVPR, 10684–10695.
  • Ronneberger, Fischer, and Brox (2015) Ronneberger, O.; Fischer, P.; and Brox, T. 2015. U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18, 234–241. Springer.
  • Roth et al. (2022) Roth, K.; Pemula, L.; Zepeda, J.; Schölkopf, B.; Brox, T.; and Gehler, P. 2022. Towards total recall in industrial anomaly detection. In CVPR, 14318–14328.
  • Ruiz et al. (2023) Ruiz, N.; Li, Y.; Jampani, V.; Pritch, Y.; Rubinstein, M.; and Aberman, K. 2023. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In CVPR, 22500–22510.
  • Schlegl et al. (2019) Schlegl, T.; Seeböck, P.; Waldstein, S. M.; Langs, G.; and Schmidt-Erfurth, U. 2019. f-AnoGAN: Fast unsupervised anomaly detection with generative adversarial networks. Medical image analysis, 54: 30–44.
  • Schlegl et al. (2017) Schlegl, T.; Seeböck, P.; Waldstein, S. M.; Schmidt-Erfurth, U.; and Langs, G. 2017. Unsupervised anomaly detection with generative adversarial networks to guide marker discovery. In International conference on information processing in medical imaging, 146–157. Springer.
  • Schuhmann et al. (2021) Schuhmann, C.; Vencu, R.; Beaumont, R.; Kaczmarczyk, R.; Mullis, C.; Katta, A.; Coombes, T.; Jitsev, J.; and Komatsuzaki, A. 2021. Laion-400m: Open dataset of clip-filtered 400 million image-text pairs. arXiv preprint arXiv:2111.02114.
  • Tran et al. (2021) Tran, N.-T.; Tran, V.-H.; Nguyen, N.-B.; Nguyen, T.-K.; and Cheung, N.-M. 2021. On data augmentation for gan training. IEEE Transactions on Image Processing, 30: 1882–1897.
  • Wang et al. (2023) Wang, Y.; Peng, J.; Zhang, J.; Yi, R.; Wang, Y.; and Wang, C. 2023. Multimodal Industrial Anomaly Detection via Hybrid Fusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 8032–8041.
  • Wang et al. (2022) Wang, Y.; Yi, R.; Tai, Y.; Wang, C.; and Ma, L. 2022. Ctlgan: Few-shot artistic portraits generation with contrastive transfer learning. arXiv preprint arXiv:2203.08612.
  • Zavrtanik, Kristan, and Skočaj (2021) Zavrtanik, V.; Kristan, M.; and Skočaj, D. 2021. Draem-a discriminatively trained reconstruction embedding for surface anomaly detection. In ICCV, 8330–8339.
  • Zhang et al. (2021) Zhang, G.; Cui, K.; Hung, T.-Y.; and Lu, S. 2021. Defect-GAN: High-fidelity defect synthesis for automated defect inspection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2524–2534.
  • Zhang et al. (2023a) Zhang, H.; Wu, Z.; Wang, Z.; Chen, Z.; and Jiang, Y.-G. 2023a. Prototypical residual networks for anomaly detection and localization. In CVPR, 16281–16291.
  • Zhang et al. (2023b) Zhang, J.; Chen, X.; Xue, Z.; Wang, Y.; Wang, C.; and Liu, Y. 2023b. Exploring Grounding Potential of VQA-oriented GPT-4V for Zero-shot Anomaly Detection. arXiv preprint arXiv:2311.02612.
  • Zhao et al. (2020) Zhao, S.; Liu, Z.; Lin, J.; Zhu, J.-Y.; and Han, S. 2020. Differentiable augmentation for data-efficient gan training. NIPS, 33: 7559–7570.

Appendix A Overview

This supplementary material consists of:

  • Details of the data augmentation method (Sec. B).

  • More implementation details (Sec. C).

  • More ablation studies (Sec. D).

  • Comparison between our Spatial Anomaly Embedding and Prompt-to-Prompt (Sec. E).

  • More qualitative comparison results with the anomaly generation methods (Sec. F).

  • More quantitative comparison results with the anomaly generation methods (Sec. G).

Appendix B Data Augmentation

The number of samples for each anomaly category is limited: typically fewer than 10 images are available for training. This constraint makes it challenging for our anomaly embedding to completely eliminate spatial information, as it still tends to generate anomalies at the positions observed in the training images. Additionally, when the training data for the spatial encoder is scarce, the model becomes susceptible to overfitting, making it difficult to accurately generate anomalies at the correct positions.

To address these issues, we employ a data augmentation approach during training. For paired image-mask data, we perform random cropping, translation, and rotation on both the image and its corresponding mask. By recording the maximum and minimum coordinates of the anomaly region in the image, we ensure that the anomaly remains within the image during data augmentation. This data augmentation process effectively disrupts the spatial information within the training data, causing the anomaly embedding to lose its focus on recording positions and instead concentrate solely on the anomaly appearance. Simultaneously, the spatial encoder benefits from having enough augmented data for training, boosting its ability in position encoding.
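The translation part of this augmentation can be sketched as follows; cropping and rotation are omitted. The helper `random_translate` is hypothetical, and filling the uncovered border with zeros is an assumption. The shift range is derived from the minimum and maximum coordinates of the anomaly region, so the anomaly never leaves the image.

```python
import numpy as np

def random_translate(image, mask, rng):
    """Shift image and mask together so the anomaly bounding box stays in frame."""
    h, w = mask.shape
    ys, xs = np.nonzero(mask)
    y_min, y_max, x_min, x_max = ys.min(), ys.max(), xs.min(), xs.max()
    # Allowed shifts keep the bounding box inside [0, h) x [0, w); high is exclusive.
    dy = int(rng.integers(-y_min, h - y_max))
    dx = int(rng.integers(-x_min, w - x_max))

    def shift(a):
        out = np.zeros_like(a)                     # zero fill at the border (assumed)
        src = a[max(0, -dy):h - max(0, dy), max(0, -dx):w - max(0, dx)]
        top, left = max(0, dy), max(0, dx)
        out[top:top + src.shape[0], left:left + src.shape[1]] = src
        return out

    return shift(image), shift(mask), (dy, dx)

rng = np.random.default_rng(0)
mask = np.zeros((10, 10), dtype=np.uint8)
mask[1:3, 6:9] = 1                                 # toy anomaly region
image = np.arange(100, dtype=float).reshape(10, 10)
aug_image, aug_mask, (dy, dx) = random_translate(image, mask, rng)
```

Because the same shift is applied to the image and to its mask, the pair stays aligned while the anomaly position varies across training iterations.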

Appendix C Implementation Details

C.1 Training Details

Training spatial anomaly embedding. For each anomaly type, an anomaly embedding $e_a$ is assigned, while a shared spatial encoder $E$ is employed across all anomaly categories. Each anomaly embedding $e_a$ is composed of 8 tokens, and the spatial embedding $e_s$ comprises 4 tokens. The batch size is set to 4, and the learning rate is 0.005. During each training iteration, we randomly sample 4 anomalous image-mask pairs from all the anomaly categories. We train all the anomaly embeddings and the spatial encoder jointly for 300K iterations, which takes 3 days on an NVIDIA GeForce RTX 3090 24GB GPU.

Training mask embedding. For each anomaly type, we assign a mask embedding $e_m$. To enhance the diversity of the generated masks, each mask embedding consists of 2 tokens, preventing it from overfitting. With a batch size of 4 and a learning rate of 0.005, each mask embedding is trained for 30K iterations.

Mask Generation. With the trained mask embedding, we input it as a text condition to guide the generation process of the latent diffusion model (Rombach et al. 2022). Specifically, we employ classifier-free guidance (Ho and Salimans 2022) to generate masks:

$$\hat{\epsilon}_{\theta}\left(x_t \mid e_m\right) = \epsilon_{\theta}\left(x_t\right) + s \cdot \left(\epsilon_{\theta}\left(x_t, e_m\right) - \epsilon_{\theta}\left(x_t\right)\right), \tag{10}$$

where the guidance scale $s$ is set to 5 (the same as in Textual Inversion).
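As a small numeric illustration of Eq. (10), the guided noise estimate is the unconditional prediction pushed along the direction of the conditional one. The sketch below is an assumption-laden toy: noise predictions are flat lists of floats rather than latent tensors, and the function name `classifier_free_guidance` is hypothetical.

```python
def classifier_free_guidance(eps_uncond, eps_cond, scale=5.0):
    """Combine unconditional and conditional noise predictions as in Eq. (10)."""
    if len(eps_uncond) != len(eps_cond):
        raise ValueError("predictions must have the same length")
    # eps_hat = eps(x_t) + s * (eps(x_t, e_m) - eps(x_t)), element-wise.
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]

# Toy values standing in for the two denoiser outputs at one timestep.
eps_uncond = [0.0, 1.0, -2.0]
eps_cond = [1.0, 1.0, 0.0]
guided = classifier_free_guidance(eps_uncond, eps_cond, scale=5.0)
```

With `scale=1` the result reduces to the conditional prediction, and with `scale=0` to the unconditional one. Values above 1, such as the $s=5$ used here, extrapolate beyond the conditional prediction.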

C.2 Metrics

In the quantitative experiments, we employ the following metrics to measure the model performance.

  • Inception Score (IS) quantifies the quality and diversity of generated images by computing the exponential of the expected KL divergence between the conditional class distribution predicted by an Inception model for each generated image and the marginal class distribution over all generated images. A higher IS represents better generation quality and diversity.

  • Intra-cluster Pairwise LPIPS Distance (IC-LPIPS) (Ojha et al. 2021) clusters the generated images into $k$ groups based on their LPIPS distance to $k$ target samples, and then computes the mean LPIPS distance to the corresponding target sample within each cluster, averaged over all clusters. A higher IC-LPIPS indicates better generation diversity.

  • Area Under the Receiver Operating Characteristic curve (AUROC) measures the performance of a binary classification model as the area under the curve of true positive rate against false positive rate across all decision thresholds, reflecting how well the model separates anomalous from normal samples. A higher AUROC means better anomaly detection and localization performance.

  • Average Precision (AP, which is also known as PR-AUC) assesses the precision-recall curve for a classification model, calculating the average precision of the model across different recall levels, providing a summary of its overall performance. A higher AP means better anomaly detection and localization results.

  • $\mathbf{F_1}$-max is a variant of the $F_1$ score that reports the highest $F_1$ (the harmonic mean of precision and recall) obtained over all decision thresholds of a binary classification model. A higher $\mathbf{F_1}$-max represents better anomaly detection and localization results.
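Two of the metrics above can be made concrete with a short sketch. It is illustrative only and is not the evaluation code used for the reported numbers: the class probabilities that an Inception network would produce are passed in directly, and the names `inception_score` and `f1_max` are hypothetical.

```python
import math

def inception_score(probs):
    """exp of the mean KL(p(y|x) || p(y)) over images.

    probs: one list of class probabilities per generated image.
    """
    n, k = len(probs), len(probs[0])
    marginal = [sum(p[j] for p in probs) / n for j in range(k)]
    total_kl = 0.0
    for p in probs:
        # Terms with p(y|x) = 0 contribute nothing to the KL divergence.
        total_kl += sum(pj * math.log(pj / marginal[j])
                        for j, pj in enumerate(p) if pj > 0)
    return math.exp(total_kl / n)

def f1_max(scores, labels):
    """Highest F1 over all thresholds; a sample is positive if score >= threshold."""
    best = 0.0
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        fn = sum(1 for s, y in zip(scores, labels) if s < t and y == 1)
        if tp:
            best = max(best, 2 * tp / (2 * tp + fp + fn))
    return best
```

For IS, confident per-image predictions spread evenly across classes maximize the score, while identical predictions for every image give the minimum value of 1. For $F_1$-max, the same function applies to pixel-level localization by flattening the anomaly map and the ground-truth mask into `scores` and `labels`.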

Appendix D More Ablation Studies

D.1 Ablation on Spatial Anomaly Embedding

Figure 6: Comparison between Textual Inversion (Anomaly Embedding only) and our model (Spatial Anomaly Embedding). Textual Inversion tends to generate anomalies at the same locations as in the training samples.

We aim to find a text embedding that guides the latent diffusion model to generate anomalies within a given anomaly mask. However, Textual Inversion tends to capture the location of anomalies along with the anomaly type information, so the generated anomalies are distributed only in specific locations. Therefore, we propose the spatial anomaly embedding $e$, consisting of an anomaly embedding $e_a$ (for appearance) and a spatial embedding $e_s$ (for location), which disentangles the spatial information from the anomaly appearance. To validate this design, we directly employ Textual Inversion to train an anomaly embedding and use the trained embedding to generate anomalies within a given mask by blended latent diffusion (the same as our generation process). The generated results are shown in Fig. 6. Textual Inversion tends to generate anomalies at the same locations as in the training samples, which limits its application in anomaly generation, where anomalies can be located at arbitrary positions.

D.2 Ablation on the rate of anomalies

We conduct additional experiments with the rate of anomalies as 10%, 20%, 30% (Ours), 40%, and 50% and test the performance on anomaly localization measured by AUROC and AP. The performance of anomaly localization is shown in Tab. 7. It can be seen that AP decreases quickly when the anomaly rate falls below 30%. This is attributed to the limited availability of training data for most categories, often comprising only 1-2 instances, making it challenging for the model to capture the anomaly information. Conversely, when the rate exceeds 30%, the model performance is similar. This indicates that our model can effectively learn sufficient anomaly information without a heavy reliance on an abundance of training samples.

| Anomaly Rate | 10% | 20% | 30% (Ours) | 40% | 50% |
|---|---|---|---|---|---|
| AUROC ↑ | 96.2 | 98.1 | 99.1 | 99.0 | 98.7 |
| AP ↑ | 64.2 | 75.5 | 81.4 | 81.1 | 80.0 |

Table 7: Ablation study on the rate of anomalies.

D.3 Ablation on the hyperparameters

We conduct ablation studies on the length of the anomaly embedding $l_a$ and the spatial embedding $l_s$ in Tab. 8. Specifically, we train models with different $l_a$ and $l_s$ and then employ their generated data to train a U-Net to localize the anomalies. When $l_s$ and $l_a$ increase, the total number of model parameters rises, but the final performance on the downstream anomaly localization task remains similar, which demonstrates that our model is not sensitive to these hyperparameters.

| Model | AUROC ↑ | AP ↑ | $F_1$-max ↑ | PRO ↑ |
|---|---|---|---|---|
| $l_s=4, l_a=8$ (Ours) | 99.1 | 81.4 | 76.3 | 94.0 |
| $l_s=4, l_a=16$ | 98.8 | 80.6 | 75.1 | 93.2 |
| $l_s=8, l_a=8$ | 99.0 | 80.9 | 75.8 | 93.5 |
| $l_s=8, l_a=16$ | 99.1 | 81.2 | 75.9 | 93.8 |

Table 8: Ablation study on the lengths of the anomaly embedding $l_a$ and the spatial embedding $l_s$.

D.4 Ablation on SAE and AAR

We conduct further ablation studies on the effectiveness of our spatial anomaly embedding (SAE) and adaptive attention re-weighting mechanism (AAR) by adding SAE and AAR separately. We compare our model with 3 variants, 1) w/o AAR&SAE, 2) AAR only, and 3) SAE only, on generating glue anomalies on leather in Fig. 7. The model without AAR&SAE can neither generate authentic anomalies nor fill the anomaly mask. Adding SAE alone improves anomaly authenticity but does not fill the mask, whereas adding AAR alone fills the mask but sacrifices authenticity. In contrast, our model (SAE + AAR) generates authentic anomalies that fill the mask.

Figure 7: Ablation study on SAE and AAR.

Appendix E Comparison with Prompt-to-Prompt

Prompt-to-Prompt (Hertz et al. 2022) proposed a method that allows modifying generated images by altering corresponding text descriptions. For instance, when transforming "a cat sits on the street" to "a dog sits on the street," Prompt-to-Prompt replaces the cross-attention map of "dog" with that of "cat", which transforms the cat in the original image into a dog while maintaining nearly unchanged content in other regions, achieving controlled image generation with specific positions. However, Prompt-to-Prompt requires a text corresponding to the original image for generation, which is unavailable in anomaly generation.

Prompt-to-Prompt therefore appears to offer a way to control generation positions by manipulating cross-attention maps. A direct solution is to resize the mask $m$ to the resolution of the cross-attention map $m_c$ of the anomaly embedding and substitute it for $m_c$, thus controlling the generation location. However, even though the new cross-attention map $m'_c$ seemingly dictates the anomaly location, it could conflict with the values $V$ in the cross-attention module. Since $V$ is designed for the original cross-attention map $m_c$, the semantic information of $V$ under the newly enforced map $m'_c$ might not align with the semantics under the original map $m_c$, consequently leading to unstable generation results.
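The direct substitution discussed above can be sketched on a toy cross-attention layer. Everything here is an illustrative assumption rather than the procedure used in the experiments: attention is a list of per-pixel rows over tokens, the mask is resized with nearest-neighbor sampling, the overwritten rows are not renormalized, and the function names are hypothetical.

```python
def resize_nearest(mask, out_h, out_w):
    """Nearest-neighbor resize of a 2D mask to (out_h, out_w)."""
    in_h, in_w = len(mask), len(mask[0])
    return [[mask[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
            for r in range(out_h)]

def substitute_attention(attn, mask, token_idx, h, w):
    """Overwrite the attention column of one token with the resized mask.

    attn: h*w rows (one per spatial position), each a list over text tokens.
    Returns a new attention map; the input is left unchanged.
    """
    small = resize_nearest(mask, h, w)
    flat = [float(v) for row in small for v in row]
    new_attn = [row[:] for row in attn]
    for i, v in enumerate(flat):
        new_attn[i][token_idx] = v
    return new_attn

def attend(attn, values):
    """Cross-attention output: each position is an attention-weighted sum of V."""
    d = len(values[0])
    return [[sum(a * v[j] for a, v in zip(row, values)) for j in range(d)]
            for row in attn]
```

Note that `attend` still multiplies the substituted map by the same `values`, which were learned alongside the original attention pattern. This is the mismatch between $m'_c$ and $V$ that the paragraph identifies as a source of instability.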

To verify this, we conduct reconstruction experiments on real anomalies, comparing the results of Textual Inversion + Prompt-to-Prompt with our approach (Spatial Anomaly Embedding). Specifically, we sample a real anomaly image $I$ as ground truth and mask out the anomaly parts for generation. The results are shown in Fig. 8. Textual Inversion + Prompt-to-Prompt cannot generate anomalies as authentic as ours, and its generated anomalies are quite different from the ground truth, indicating that directly replacing the cross-attention map cannot generate satisfying anomalies. Moreover, we conduct a quantitative experiment, where we generate anomalous image-mask pairs to support the downstream anomaly localization task. We follow the experiment settings in the main paper, in which we train a U-Net on the generated data and compare the localization accuracy. The results are recorded in Tab. 9. Our model outperforms Textual Inversion + Prompt-to-Prompt significantly (e.g., 99.1 vs. 91.2 AUROC and 81.4 vs. 55.1 AP).

Figure 8: Comparison results between Textual Inversion + Prompt-to-Prompt and our model (Spatial Anomaly Embedding) in anomaly generation.

| Method | AUROC | AP | $F_1$-max | PRO |
|---|---|---|---|---|
| Textual Inversion + Prompt-to-Prompt | 91.2 | 55.1 | 64.4 | 73.5 |
| Ours | 99.1 | 81.4 | 76.4 | 94.0 |

Table 9: Comparison with Textual Inversion + Prompt-to-Prompt on anomaly localization.

Appendix F More qualitative experiments

We give a more comprehensive comparison with the existing anomaly generation methods DiffAug (Zhao et al. 2020), CDC (Ojha et al. 2021), Crop&Paste, SDGAN (Niu et al. 2020), Defect-GAN (Zhang et al. 2021) and DFMGAN (Duan et al. 2023). We show the generation results of all the anomaly generation methods across all categories in Fig. 9. Our model generates high-quality, authentic anomalies that are precisely aligned with the corresponding masks. In contrast, Crop&Paste exhibits limited diversity in generating various anomaly types. DiffAug displays evident overfitting to the training samples (the image in the lower-right corner). CDC yields visually perplexing results, particularly for structural anomaly categories like capsule-squeeze. SDGAN and Defect-GAN yield poor outputs, frequently failing to generate anomalies such as pill-crack. The state-of-the-art model DFMGAN occasionally struggles to create authentic anomalies and fails to maintain alignment between the generated anomalies and masks, as observed in the case of metal nut-bent. In comparison, our model generates anomalies with the highest diversity and authenticity, and the generated anomalies align with the masks accurately, which can effectively support the downstream anomaly inspection tasks.

Figure 9: Qualitative comparison on the anomaly generation quality. Note that the anomalies generated by DiffAug are the same as the training samples (images in the lower-right corner).

Appendix G More Quantitative experiments

More comparison with the anomaly generation models. In this section, we provide supplementary experiments to complement those presented in the main paper. Specifically, in addition to the methods covered in the main paper, we include Crop&Paste (Lin et al. 2021) for comparison and additionally introduce the Per-Region Overlap (PRO) metric to provide a more comprehensive evaluation of anomaly localization. The experiment settings are the same as those in the main paper, where we train a U-Net on the generated anomaly data. The pixel-level anomaly localization results are shown in Tab. 10 and the image-level anomaly detection results are shown in Tab. 11. The quantitative results demonstrate that, on average, our model outperforms all the other anomaly generation methods in both anomaly localization and detection, indicating the quality and diversity of our generated anomalies.

More comparison with the anomaly localization models. In this section, we further compare with anomaly detection methods using the $F_1$-max score on anomaly localization. The results are shown in Tab. 12. Our model achieves the best average $F_1$-max in anomaly localization.

Category Crop&Paste DRAEM PRN DFMGAN Ours
AUC AP $F_1$-max PRO AUC AP $F_1$-max PRO AUC AP $F_1$-max PRO AUC AP $F_1$-max PRO AUC AP $F_1$-max PRO
bottle 94.5 67.4 63.5 77.8 96.7 80.2 74.0 91.2 97.5 76.4 71.3 88.5 98.9 90.2 83.9 91.7 99.4 94.1 87.3 94.3
cable 96.0 75.3 69.3 87.1 80.3 21.8 28.3 58.2 94.5 64.4 61.0 79.7 97.2 81.0 75.4 84.9 99.2 90.8 83.5 95.0
capsule 95.3 49.2 51.1 89.5 76.2 25.5 32.1 81.1 95.6 45.7 47.9 89.7 79.2 26.0 35.0 66.1 98.8 57.2 59.8 95.4
carpet 83.7 36.6 39.7 62.9 92.6 43.0 41.9 80.0 96.4 69.6 65.6 90.6 90.6 33.4 38.1 76.5 98.6 81.2 74.6 91.6
grid 84.7 13.1 22.4 70.2 99.1 59.3 58.7 95.8 98.9 58.6 58.9 95.8 75.2 14.3 20.5 52.3 98.3 52.9 54.6 92.3
hazelnut 88.5 38.0 42.8 74.1 98.8 73.6 68.5 95.9 98.0 73.9 68.2 92.7 99.7 95.2 89.5 96.4 99.8 96.5 90.6 97.1
leather 97.5 76.0 70.8 95.7 98.5 67.6 65.0 96.7 99.4 58.1 54.0 97.5 98.5 68.7 66.7 96.0 99.8 79.6 71.0 98.2
metal nut 96.3 84.2 74.0 67.2 96.9 84.2 74.5 90.4 97.9 93.0 87.1 85.0 99.3 98.1 94.5 88.0 99.8 98.7 94.0 94.8
pill 81.5 17.8 24.3 57.4 95.8 45.3 53.0 83.7 98.3 55.5 72.6 88.2 81.2 67.8 72.6 56.5 99.8 97.0 90.8 97.3
screw 93.4 31.2 36.0 83.9 91.0 30.1 35.7 78.1 94.0 47.7 49.8 83.8 58.8 2.2 5.3 41.8 97.0 51.8 50.9 80.3
tile 94.0 79.3 74.5 79.2 98.5 93.2 87.8 95.3 98.5 91.8 84.4 91.3 99.5 97.1 91.6 97.5 99.2 93.9 86.2 96.1
toothbrush 89.3 30.9 34.6 66.6 93.8 29.5 28.4 75.1 96.1 46.4 46.2 83.1 96.4 75.9 72.6 74.3 99.2 76.5 73.4 91.4
transistor 85.9 52.5 52.1 64.5 76.5 31.7 24.2 54.3 94.9 68.6 68.4 70.0 96.2 81.2 77.0 65.5 99.3 92.6 85.7 96.2
wood 84.0 45.7 48.0 57.9 98.8 87.8 80.9 94.7 96.2 74.2 67.4 82.1 95.3 70.7 65.8 89.9 98.9 84.6 74.5 94.3
zipper 94.8 47.6 51.4 83.4 93.4 65.4 64.7 84.6 98.4 79.0 73.7 93.7 92.9 65.6 64.9 83.0 99.4 86.0 79.2 96.3
Average 90.4 48.4 49.4 74.3 92.2 54.1 53.1 83.1 96.9 66.2 64.7 87.4 90.0 62.7 62.1 76.3 99.1 81.4 76.3 94.0
Table 10: Comparison on pixel-level anomaly localization with AUC, AP, $F_1$-max and PRO metrics by training a U-Net on the generated datasets produced by Crop&Paste, DRAEM, PRN, DFMGAN and our model. Bold and underline represent optimal and sub-optimal results, respectively.
Category Crop&Paste DRAEM PRN DFMGAN Ours
AUC AP $F_1$-max AUC AP $F_1$-max AUC AP $F_1$-max AUC AP $F_1$-max AUC AP $F_1$-max
bottle 85.4 95.1 90.9 99.3 99.8 98.9 94.9 98.4 94.1 99.3 99.8 97.7 99.8 99.9 98.9
cable 93.3 96.1 91.6 72.1 83.2 79.2 86.3 92.0 84.0 95.9 97.8 93.8 100 100 100
capsule 77.1 94.1 90.4 93.2 98.7 94.0 84.9 95.8 94.3 92.8 98.5 94.5 99.7 99.9 98.7
carpet 57.7 84.3 87.3 95.3 98.7 93.4 92.6 97.8 92.1 67.9 87.9 87.3 96.7 98.8 94.3
grid 83.0 94.1 87.6 99.8 99.9 98.8 96.6 98.9 95.0 73.0 90.4 85.4 98.4 99.5 98.7
hazelnut 68.8 85.0 78.0 100 100 100 93.6 96.0 94.1 99.9 100 99.0 99.8 99.9 98.9
leather 91.9 97.5 90.9 100 100 100 99.1 99.7 97.6 99.9 100 99.2 100 100 100
metal nut 92.2 98.1 93.3 97.8 99.6 97.6 97.8 99.5 96.9 99.3 99.8 99.2 100 100 100
pill 51.7 87.1 91.4 94.4 98.9 95.8 88.8 97.8 93.2 68.7 91.7 91.4 98 99.6 97
screw 59.3 81.9 86.0 88.5 96.3 89.3 84.1 94.7 87.2 22.3 64.7 85.3 96.8 97.9 95.5
tile 73.8 91.1 83.8 100 100 100 91.1 96.9 89.3 100 100 100 100 100 100
toothbrush 81.2 91.0 88.9 99.4 99.8 97.6 100 100 100 100 100 100 100 100 100
transistor 85.9 81.8 80.0 79.6 80.5 71.4 88.2 88.9 84.0 90.8 92.5 88.9 100 100 100
wood 49.5 81.2 86.6 100 100 100 77.5 92.7 86.7 98.4 99.4 98.8 98.4 99.4 98.8
zipper 59.4 82.8 88.9 100 100 100 98.7 99.7 97.6 99.7 99.9 99.4 99.9 100 99.4
Average 74.0 89.4 87.7 94.6 97.0 94.4 91.6 96.6 92.4 87.2 94.8 94.7 99.2 99.7 98.7
Table 11: Comparison on image-level anomaly detection with AUC, AP and $F_1$-max metrics by training a U-Net on the generated datasets produced by Crop&Paste, DRAEM, PRN, DFMGAN and our model.
$F_1$-max KDAD CFLOW DRAEM SSPCAB CFA RD4AD PatchCore DevNet DRA Ours
bottle 50.9 9.5 83.0 80.3 75.9 82.1 78.6 64.6 53.5 87.3
cable 18.3 10.5 58.5 51.2 76.3 65.2 68.5 54.9 55.3 83.5
capsule 15.1 6.9 48.9 49.5 57.0 60.4 56.7 38.7 47.7 60.8
carpet 54.2 3.5 60.0 47.1 48.3 67.8 67.9 52.3 42.3 74.6
grid 10.9 3.2 56.3 58.4 32.2 59.9 49.1 42.9 50.1 54.6
hazelnut 37.5 3.9 80.6 88.9 61.4 70.0 68.1 22.4 47.2 90.6
leather 30.4 3.9 63.2 58.1 53.8 67.2 54.7 32.1 19.8 71.0
metal nut 34.2 30.7 84.4 87.8 87.1 77.0 86.0 65.2 64.6 94.0
pill 29.9 17.6 62.6 46.5 79.5 63.7 73.5 22.8 45.5 90.8
screw 8.3 0.9 66.9 63.8 37.8 58.7 47.2 14.8 0.7 50.9
tile 27.8 26.6 90.8 88.5 77.8 71.8 69.4 69.9 61.4 86.2
toothbrush 25.1 4.7 47.5 37.2 62.1 58.7 63.8 35.1 22.6 73.4
transistor 26.7 19.9 55.2 34.8 76.3 59.8 64.2 28.9 33.2 85.7
wood 25.2 10.0 75.1 68.7 48.6 61.3 60.3 51.9 49.9 74.5
zipper 26.8 4.5 68.2 73.7 65.8 69.4 70.0 45.6 56.9 79.2
average 28.1 10.4 66.7 62.3 61.4 66.2 65.2 42.8 43.4 76.4
Table 12: Comparison on anomaly localization with $\mathbf{F_1}$-max.