M3DM-NR: RGB-3D Noisy-Resistant Industrial Anomaly Detection via Multimodal Denoising

Chengjie Wang, Haokun Zhu, Jinlong Peng, Yue Wang, Ran Yi
Yunsheng Wu, Lizhuang Ma, Jiangning Zhang
C. Wang is with Shanghai Jiao Tong University and Youtu Lab, Shanghai, China. H. Zhu, Y. Wang, R. Yi, and L. Ma are with the Shanghai Jiao Tong University, Shanghai, China. J. Peng, Y. Wu, and J. Zhang are with Youtu Lab, Tencent, China. Corresponding author: Ran Yi
Abstract

Existing industrial anomaly detection methods primarily concentrate on unsupervised learning with pristine RGB images. Yet, both RGB and 3D data are crucial for anomaly detection, and the datasets are seldom completely clean in practical scenarios. To address the above challenges, this paper first delves into RGB-3D multi-modal noisy anomaly detection, proposing a novel noise-resistant M3DM-NR framework that leverages the strong multi-modal discriminative capabilities of CLIP. M3DM-NR consists of three stages: Stage-I introduces the Suspected References Selection module to filter a few normal samples from the training dataset, using the multimodal features extracted by the Initial Feature Extraction, and a Suspected Anomaly Map Computation module to generate a suspected anomaly map that focuses on abnormal regions as a reference. Stage-II uses the suspected anomaly maps of the reference samples as reference, and inputs image, point cloud, and text information to denoise the training samples through intra-modal comparison and multi-scale aggregation operations. Finally, Stage-III proposes the Point Feature Alignment, Unsupervised Feature Fusion, Noise Discriminative Coreset Selection, and Decision Layer Fusion modules to learn the pattern of the training dataset, enabling anomaly detection and segmentation while filtering out noise. Extensive experiments show that M3DM-NR outperforms state-of-the-art methods in 3D-RGB multi-modal noisy anomaly detection.

Index Terms:
Anomaly Detection, Multi-modal Learning, Noisy Learning, Unsupervised Learning

1 Introduction

Industrial anomaly detection aims to find the abnormal regions of products and plays an important role in industrial quality inspection. Most existing industrial anomaly detection methods [1, 2] primarily focus on RGB images [3, 4] and use a vast number of normal examples for training. Consequently, current industrial anomaly detection methods predominantly rely on unsupervised approaches, meaning they train exclusively on normal RGB examples and encounter defect examples only during inference. These two factors contribute to two significant issues (Fig. 1-Top-Left). First, during the quality inspection of industrial products, human inspectors rely on both 3D shape and color characteristics to assess product quality. The 3D shape information in particular is crucial for accurate defect detection, and identifying defects using only RGB images proves difficult. With advancements in 3D sensor technology, the recent MVTec 3D-AD dataset, which includes both 2D images and 3D point cloud data, was proposed to alleviate this problem and has bolstered research in multi-modal industrial anomaly detection (Fig. 1-Top-Middle). Second, the presence of noise in the normal dataset is an unavoidable issue in real-world applications, particularly in industrial manufacturing where products are mass-produced daily. Most existing unsupervised AD methods [5, 6, 7] are vulnerable to noisy data because they model the training set exhaustively. Noisy samples can easily mislead these overconfident AD algorithms, causing them to misclassify similar anomalous samples in the test set and to localize anomalies incorrectly. SoftPatch [8] is the first to introduce the setting for noisy industrial detection, but it explored only noisy industrial detection on RGB data.

Figure 1: Top: Intuitive diagram of different task settings. Middle: Representative PatchCore [5] for solving RGB images, our M3DM [9] (conference version) for solving multi-modal RGB+3D data, and the new M3DM-NR to tackle the more challenging and practical noisy setting. Bottom: Quantitative visualization results on the MVTec 3D-AD dataset [10]. Our M3DM-NR predicts noticeably more precise anomaly regions than PatchCore+FPFH [11] and M3DM [9].

For the first issue, the core idea of existing unsupervised anomaly detection is to find the difference between normal representations and anomalies. Current 2D industrial anomaly detection methods can be mainly divided into two categories: (1) Reconstruction-based methods. Image reconstruction tasks are widely used in anomaly detection methods [3, 12, 13, 14, 15, 16] to learn normal representation. Reconstruction-based methods are easy to implement for a single-modal input (2D image or 3D point cloud). But for multi-modal inputs, it is hard to find a reconstruction target. (2) Pretrained feature extractor-based methods. An intuitive way to utilize the feature extractor is to map the extracted feature to a normal distribution and find the out-of-distribution one as an anomaly. Normalizing flow-based methods [6, 17, 18] use an invertible transformation to directly construct the normal distribution, and memory bank-based methods [19, 5] store some representative features to implicitly construct the feature distribution. Compared with reconstruction-based methods, directly using a pretrained feature extractor does not involve the design of a multi-modal reconstruction target and is a better choice for the multi-modal task. Besides that, current multi-modal industrial anomaly detection methods [20, 18] directly concatenate the features of the two modalities. However, when the feature dimension is high, the disturbance between multi-modal features becomes severe and degrades performance.

Regarding the second issue of noisy anomaly detection, existing methods in noisy industrial detection have primarily focused on single-modality noisy anomaly detection using RGB images, with a lack of research on RGB-3D multi-modal noisy data. However, in practical industrial detection, noise often contaminates 3D data, and RGB-3D multi-modal data serve as an important reference for determining whether a sample is anomalous. The absence of exploration in RGB-3D multi-modal noisy data means that current methods are vulnerable to the multi-modal noisy data in real-world production environments. Furthermore, existing approaches employ a simplistic and naive strategy of patch-level denoising and sample re-weighting based on outlier-detection weights, leading to unsatisfying denoising effects and the persistence of noise in the dataset.

Figure 2: Overall pipeline of our M3DM-NR that comprises three stages: 1) selecting intra-modal reference samples, 2) denoising the dataset by comparing it with these samples, and 3) achieving multimodal anomaly detection through multimodal feature fusion.

To solve the problems mentioned above, in this paper, we first delve into the RGB-3D multi-modal noisy industrial detection problem (Fig. 1-Top-Right). To address the challenges of RGB-3D multi-modal noisy data, we propose a novel three-stage multi-modal noise-resistant framework termed M3DM-NR, which performs denoising at both the sample level and the patch level, as shown in Fig. 2. This framework utilizes pretrained CLIP [21] and Point-BIND [22] models to extract aligned text, RGB, and 3D point cloud features to denoise multi-modal data through both cross-modal comparison and intra-modality comparison. To the best of our knowledge, we are the first to employ a multi-modal learning approach based on pre-trained CLIP and Point-BIND to solve the RGB-3D multi-modal noisy industrial anomaly detection problem. In this framework, Stage I selects a few normal samples from the training dataset as intra-modal reference samples and computes the suspected anomaly map to focus on abnormal regions via the proposed Intra-Modal Reference Selection. In Stage II, recognizing that anomalies in industrial anomaly detection often constitute only a small fraction of the entire sample, we propose a novel Enhanced Multi-modal Denoising module to rank the anomalies of each training sample by performing multi-scale feature comparison and weighting with a suspected reference, enabling the filtering of anomalous samples. In Stage III, to address the above problems concerning multi-modal anomaly detection, we propose a novel Multimodal Anomaly Detection via Hybrid Fusion scheme to learn the pattern of the training dataset to conduct anomaly detection and segmentation while filtering out noise at the patch level. Different from the existing methods that directly concatenate the features of the two modalities, we propose a hybrid fusion scheme to reduce the disturbance between multi-modal features and encourage feature interaction.
We propose Unsupervised Feature Fusion (UFF) to fuse multi-modal features, which is trained using a patch-wise contrastive loss to learn the inherent relation between multi-modal feature patches at the same position. To encourage the anomaly detection model to keep the single domain inference ability, we construct three memory banks separately for RGB, 3D and fused features. For the final decision, we construct Decision Layer Fusion (DLF) to consider all memory banks for anomaly detection and segmentation. Besides, we further propose a Point Feature Alignment (PFA) operation to better align 3D and 2D features and Noise Discriminative Coreset Selection to filter out noise at patch-level.

To evaluate our method, we conduct extensive experiments on the MVTec 3D-AD [10] and Eyecandies [23] datasets, comparing our method with existing RGB, 3D, and RGB-3D based industrial detection methods. Moreover, to further highlight the robustness of our method, we follow the experiment setting in SoftPatch [8] and conduct experiments under the Non-Overlap and the more challenging Overlap settings. The extensive experimental results and metrics (I-AUROC, P-AUROC, AUPRO) demonstrate that our method surpasses existing state-of-the-art approaches. Additionally, we perform a comprehensive ablation study, thoroughly validating the effectiveness of all novel modules proposed.

This is an extension of the previous conference version (M3DM [9] in CVPR’23). In the conference paper, we mainly proposed M3DM, a novel multi-modal industrial anomaly detection method with hybrid feature fusion, which outperforms the state-of-the-art detection and segmentation precision on MVTec 3D-AD [10]. In this extended journal version, we make the following four contributions:

  • We study a new RGB-3D multi-modal noisy industrial anomaly detection task and have substantially broadened our research to this practical setting, proposing a novel three-stage multi-modal noise-resistant framework termed M3DM-NR. It addresses reference selection, denoising, and final anomaly detection and segmentation, ensuring systematic and hierarchical processing.

  • We design three novel modules in Stage I, namely Initial Feature Extraction, Suspected References Selection, and Suspected Anomaly Map Computation, to select a few normal samples from the training dataset as intra-modal reference samples and to generate suspected anomaly maps that focus on abnormal regions as the reference for the next stage.

  • To obtain cleaner training data, we propose an extra Stage II termed Enhanced Multi-modal Denoising to introduce multi-scale feature comparison and weighting methods to finely rank and denoise training samples.

  • We employ M3DM as Stage III to achieve final anomaly detection and segmentation. Extensive quantitative experiments across various settings show that our approach outperforms existing state-of-the-art methods in 3D-RGB multi-modal noisy anomaly detection. We also conduct an extensive ablation study to illustrate the effectiveness of each designed component.

2 Related Work

2.1 2D Industrial Anomaly Detection

Current anomaly detection methods can be mainly categorized into the following three groups: 1) Data augmentation based methods [24, 25, 14, 26, 27, 28] propose to introduce pseudo anomalies to normal samples with the aim of improving the system’s ability to identify such anomalies during training. 2) Reconstruction based methods [29, 12, 13, 16, 15, 30, 31, 32, 33, 34] leverage auto-encoders and generative adversarial networks. Although these reconstruction methods may not accurately recover anomalous regions, comparing the reconstructed image with the original can pinpoint anomalies and facilitate decision-making. 3) Feature embedding based methods [35, 36, 6, 17, 5, 37, 38, 39] depend on pre-trained feature extractors, with additional detection modules that learn to identify abnormal areas using the extracted features or representations. Drawing parallels between 2D and 3D anomaly detection, our work expands the application of the memory bank approach to 3D and multi-modal contexts, yielding impressive outcomes.

2.2 3D Industrial Anomaly Detection

The first public 3D industrial anomaly detection dataset is the MVTec 3D-AD dataset [10], which includes both RGB information and point position data for each instance. Current 3D anomaly detection methods can be mainly categorized into the following four groups: 1) Data augmentation-based methods [40, 41] draw inspiration from 2D anomaly detection strategies to generate pseudo RGB and 3D anomaly samples, enhancing the model’s capacity to identify anomalies. 2) Reconstruction-based methods [42, 40] utilize auto-encoders and generative adversarial networks trained to generate normal samples for both RGB and 3D data, irrespective of whether the input is normal or anomalous. This approach fails to reconstruct regions with anomalies effectively. By comparing these reconstructed samples with the originals, anomalies can be identified, thus aiding in decision-making. 3) Feature embedding-based methods [11, 9, 43, 44, 45, 46] rely on pre-trained feature extractors, supplemented with additional fusion modules that align and integrate multi-modal information. Detection modules then utilize these fused features or representations to identify abnormal areas, enhancing the system’s ability to detect anomalies. 4) Knowledge distillation-based methods [47, 18, 48] train a student network to reconstruct samples or extract features, where the disparity between the teacher and student networks serves as an indicator of anomalies. In our research, we adopt the feature embedding-based approach but diverge with a novel pipeline.

2.3 Learning with Noisy Data

Recognizing noisy labels is increasingly gaining attention in the realm of supervised learning. Yet, this concept has scarcely been ventured into within unsupervised anomaly detection, largely due to the absence of clear labels. In classification tasks, certain studies have suggested filtering pseudo-labeled data that carry a high confidence threshold to mitigate noise [49, 50]. Li et al. [51] employ a mixture model to identify noisy-labeled data, adopting a semi-supervised approach for training. In the field of object detection, strategies such as multi-augmentation [52], a teacher-student model [53], or contrastive learning [54] have been leveraged, drawing on the expertise of expert models to reduce noise. However, the prevailing methods for recognizing noisy labels depend heavily on labeled data for correcting inaccuracies. Our research diverges by aiming to enhance a model’s resistance to noise in an unsupervised manner, thereby eliminating the need for manual annotations. A recent review [55] examines the robustness of 30 AD algorithms, yet overlooks unsupervised approaches in the context of annotation errors. Pang et al. [56] address anomalies in video without relying on manually labeled data, exploiting information across consecutive frames, contrasting our focus on detecting anomalies in single images. Other studies [57, 58, 59] tackle the elimination of noisy and corrupted data in semantic anomaly detection. SoftPatch [8] proposed to filter out noise at patch-level using outlier detection, but the employed outlier detection method is rather naive and doesn’t produce very good results. In this paper, we introduce a method that utilizes a pretrained CLIP-based model to extract and align multi-modal information, enabling the effective filtration of noise at sample-level.

2.4 Multi-modal Learning

Among the recent successes of large pre-trained vision-language models (VLMs) [60, 61, 21], CLIP [21] stands out as the first to employ pre-training on web-scale image-text data, demonstrating unprecedented generality. Notable features include its language-driven zero-shot inference capabilities, which have significantly enhanced both effective robustness [62] and perceptual alignment [63]. Other studies [64, 65, 66] have also utilized the pre-trained CLIP model for downstream tasks, such as language-guided detection and segmentation, achieving promising results. Beyond aligning vision and language, Point-BIND [22] extends this alignment to include the 3D modality. Recently, some works have attempted to apply the multimodal CLIP model to the AD domain [67, 68, 69, 70, 71]. Specifically, WinCLIP [67] leverages the robust multi-modal capabilities of the pre-trained CLIP model for effective zero-shot 2D anomaly detection.

In this paper, we utilize Point-BIND’s aligned embedding space of image, language, and 3D modalities to effectively filter out noise at the sample level in the training set.

Figure 3: Overview framework of our M3DM-NR, which contains three stages to tackle the challenging noisy anomaly detection task: Stage I introduces a text prompt ensemble strategy $\varphi_T$, utilizing the pre-trained image encoder $E_I$, point cloud encoder $E_P$, and text encoder $E_T$ to extract initial features $\left\{F_{I_m}\right\}_{m=1}^{M}$, $\left\{F_{P_m}\right\}_{m=1}^{M}$, $f_T^{Nor}$, and $f_T^{Ano}$. These features are then used to select suspected reference samples $\left\{s_{R_n}\right\}_{n=1}^{N}$ through similarity measurement and to compute corresponding anomaly maps $\left\{W_n\right\}_{n=1}^{N}$. Based on the suspected samples, Stage II calculates the anomaly scores $\left\{\tilde{S}_m\right\}_{m=1}^{M}$ for each training sample using multi-scale and feature weighting methods, ultimately filtering out the top-$\tau$ samples to obtain a denoised training set. Stage III comprises four modules to achieve final anomaly detection and segmentation.

3 Methodology

As shown in Fig. 3, our proposed M3DM-NR framework takes RGB images and 3D point clouds as input to perform RGB-3D based multi-modal noisy anomaly detection and segmentation. Specifically, M3DM-NR consists of three stages to achieve this goal: 1) Intra-modal Reference Selection (Stage I in Sec. 3.1) selects a few normal samples from the training dataset as intra-modal reference samples, and the suspected anomaly map is computed to focus on abnormal regions. 2) Enhanced Multi-modal Denoising (Stage II in Sec. 3.2) ranks the anomalies of each training sample by performing multi-scale feature comparison and weighting with a suspected reference, enabling the filtering of anomalous samples. 3) Multimodal Anomaly Detection via Hybrid Fusion (Stage III in Sec. 3.3) learns the pattern of the training dataset to conduct anomaly detection and segmentation while filtering out noise at the patch level.
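The sample-level filtering that closes Stage II (ranking training samples by anomaly score and discarding the highest-ranked ones) can be sketched as below. This is a minimal illustration rather than the authors' implementation: the function name `denoise_training_set` is hypothetical, and treating the filtering ratio as a fraction of the training set (rather than an absolute count) is an assumption.

```python
def denoise_training_set(sample_ids, scores, tau):
    """Drop the fraction `tau` of samples with the highest anomaly scores.

    `tau` is assumed to be a ratio in [0, 1); the paper may define it differently.
    """
    if not 0.0 <= tau < 1.0:
        raise ValueError("tau must lie in [0, 1)")
    n_drop = int(len(sample_ids) * tau)
    # Rank samples from most to least anomalous.
    ranked = sorted(zip(sample_ids, scores), key=lambda pair: pair[1], reverse=True)
    dropped = {sample_id for sample_id, _ in ranked[:n_drop]}
    # Preserve the original ordering of the remaining samples.
    return [sample_id for sample_id in sample_ids if sample_id not in dropped]


# Toy example: four samples, two of which receive high anomaly scores.
kept = denoise_training_set(["a", "b", "c", "d"], [0.1, 0.9, 0.3, 0.8], tau=0.5)
```

In this toy run the two highest-scoring samples are removed and the remaining ones keep their original order.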

3.1 Stage I: Intra-modal Reference Selection

3.1.1 Initial Feature Extraction

Given $M$ image and point cloud pairs $\left\{I_m\right\}_{m=1}^{M}$ and $\left\{P_m\right\}_{m=1}^{M}$, RGB-3D anomaly detection requires three modalities of input information, so the feature pre-extraction consists of three parts:

Text prompt ensemble. The effectiveness of text descriptions is crucial for multimodal anomaly detection. Following APRIL-GAN [68], we employ a text prompt ensemble strategy $\varphi_T$ to fully explore the textual representation of defects. Specifically, the proposed strategy $\varphi_T$ includes several templates, each in the format “A photo of a state class”, where ‘state’ denotes predefined normal and abnormal state descriptions, and ‘class’ represents the class name. The output features are averaged using pooling to obtain the final descriptive features $f_T^{Nor} \in \mathbb{R}^{d}$ and $f_T^{Ano} \in \mathbb{R}^{d}$.
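The template-filling and averaging steps can be sketched as follows. This is only an illustration under stated assumptions: the state lists, the single template, and `toy_text_encoder` (a deterministic stand-in for the text encoder $E_T$) are hypothetical, and the class name "bagel" is used purely as an example. Only the structure (fill templates, encode, average) comes from the text.

```python
import hashlib
import math

# Hypothetical state descriptions; the actual lists are not given in this section.
NORMAL_STATES = ["flawless", "perfect"]
ABNORMAL_STATES = ["damaged", "defective"]
TEMPLATE = "A photo of a {state} {cls}"


def toy_text_encoder(text, d=8):
    """Stand-in for E_T: maps a string to a deterministic unit vector in R^d."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    vec = [byte / 255.0 - 0.5 for byte in digest[:d]]
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def ensemble_feature(states, cls, encoder=toy_text_encoder):
    """Encode one prompt per state and average the embeddings element-wise."""
    feats = [encoder(TEMPLATE.format(state=s, cls=cls)) for s in states]
    d = len(feats[0])
    return [sum(f[i] for f in feats) / len(feats) for i in range(d)]


f_T_nor = ensemble_feature(NORMAL_STATES, "bagel")    # plays the role of f_T^{Nor}
f_T_ano = ensemble_feature(ABNORMAL_STATES, "bagel")  # plays the role of f_T^{Ano}
```

With a real text encoder, `encoder` would be replaced by a call into the pretrained model; the averaging logic stays the same.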

Multi-scale image feature representation. For each image $I_m$ in the training dataset, we first use the pretrained image encoder $E_I$ of the CLIP model to extract the corresponding feature $F_{I_m}$:

$$F_{I_m} = E_I(I_m). \tag{1}$$

Then, a multi-scale segmentation operation $\mathcal{H}_I$ is used to segment $F_{I_m}$ into 3 different scales $F_{I_m}^{\sigma}, \sigma \in \{l, m, s\}$, denoted as:

$$f_{I_m}^{l}, F_{I_m}^{l}, F_{I_m}^{m}, F_{I_m}^{s} = \mathcal{H}_I\left(F_{I_m}\right), \tag{2}$$

where $f_{I_m}^{l}$ is the class token and $F_{I_m}^{\sigma}$ is obtained by the following equation:

$$\begin{aligned} F_{I_m}^{\sigma} &= \left\{f_{uv}^{\sigma}\right\}_{I_m} \\ &= F_{I_m} \odot \left\{M_{uv}^{\sigma}\right\} \\ &\quad \textit{s.t.}~\sigma \in \{l, m, s\}. \end{aligned} \tag{3}$$

$M = \left\{M_{uv}^{\sigma}\right\}$ is the multi-scale mask, where each $M_{uv}^{\sigma} \in \{0,1\}^{h \times w}$ is a binary mask that selects a $k \times k$ window centered at $(u, v)$, with $M_{uv}^{l}$ specifically selecting the entire point cloud. $F_{I_m}^{\sigma}$ is the set of image patches at the large, middle, or small scale, $uv$ indicates the coordinate of patches in the original image, and $\odot$ denotes element-wise multiplication.
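The masking operation can be illustrated with a small sketch in plain Python. It is not the authors' code: the helper names, the stride between window centres, the odd window size, and the single-channel toy feature map are all assumptions made for illustration; only the idea of a binary $h \times w$ mask selecting a $k \times k$ window and being applied by element-wise multiplication comes from the text.

```python
def window_mask(h, w, u, v, k):
    """Binary h x w mask that is 1 inside the k x k window centred at (u, v).

    Assumes an odd k so that the window is symmetric around its centre.
    """
    half = k // 2
    return [
        [1 if abs(r - u) <= half and abs(c - v) <= half else 0 for c in range(w)]
        for r in range(h)
    ]


def masks_for_scale(h, w, k, stride=1):
    """All masks of one scale, keyed by window centre (u, v); stride is assumed."""
    half = k // 2
    return {
        (u, v): window_mask(h, w, u, v, k)
        for u in range(half, h - half, stride)
        for v in range(half, w - half, stride)
    }


def apply_mask(feature_map, mask):
    """Element-wise product of a single-channel h x w map with a binary mask."""
    return [
        [f * m for f, m in zip(f_row, m_row)]
        for f_row, m_row in zip(feature_map, mask)
    ]


h, w = 4, 4
feature_map = [[r * w + c for c in range(w)] for r in range(h)]
masks_s = masks_for_scale(h, w, k=3)      # a small scale: 3 x 3 windows
mask_l = [[1] * w for _ in range(h)]      # the large scale keeps everything
patch = apply_mask(feature_map, masks_s[(1, 1)])
```

Here `patch` keeps the nine entries of the top-left 3 x 3 window and zeroes out the rest, while applying `mask_l` leaves the map unchanged.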

Figure 4: Visualization of Aligned Multi-Scale Point Cloud Feature Extraction (AMPCFE), which extracts local point cloud features aligned with the granularity of image patching, focusing more on local details and improving the efficacy of multi-modal anomaly detection.

Aligned multi-scale point cloud feature extraction. As previous work [9] has shown, many anomalies in the MVTec 3D-AD [10] dataset cannot be detected through RGB images alone. For example, in the ‘potato’ category, an anomaly type named ‘cut’ can only be identified using 3D point cloud data. Incorporating 3D point cloud data in the noise-filtering process is therefore crucial, so we use the 3D point cloud modality in noise detection.

However, we found during the experiments that relying solely on the whole point cloud is insufficient. In the MVTec 3D-AD dataset, defects often occupy only a small portion of the entire sample’s point cloud data, meaning that most areas of a sample are normal. Furthermore, existing works [72, 73, 22, 74] aligning point cloud encoders with CLIP focus on object classification tasks, which prioritize the global information of the object’s 3D point cloud data and overlook local details. Traditional multi-scale point cloud data segmentation based on FPS sampling (Fig. 4-Left) presents a full point cloud perspective with varying levels of sparsity but fails to specifically highlight local details. Yet, focusing on these details is crucial for detecting noise samples.

To address this problem, we propose a novel Aligned Multi-Scale Point Cloud Feature Extraction module, as shown in the right part of Fig. 4. This approach enhances localized noise detection by extracting local point cloud features aligned with the granularity of image patching. Specifically, for each point cloud $P_m \in \mathbb{R}^{h \times w \times 3}$ in the training dataset, we segment $P_m$ into three scales, mirroring the approach used for image segmentation. We also generate 3 sets of masks $\{M_{uv}^{l}\}$, $\{M_{uv}^{m}\}$, and $\{M_{uv}^{s}\}$, as in the image operation described above. By applying these three sets of masks to the entire point cloud, we obtain three distinct sets of point clouds at different scales:

$$\{P_{uv}^{\sigma}\}_m = P_m \odot \{M_{uv}^{\sigma}\},\; \sigma \in \{l, m, s\}. \tag{4}$$

Unlike images, in the point cloud modality only the points that do not fall on the backplane are meaningful. Consequently, some smaller patches of the point cloud may contain only a few meaningful points or none at all, making them insignificant or even obstructive for anomaly detection. To enhance efficiency, we identify and discard these non-contributory patches during segmentation. This process results in filtered sets of point clouds:

$$\{\hat{P}_{uv}^{\sigma}\}_{m}=\{P_{uv}^{\sigma}\mid \mathrm{Num}(P_{uv}^{\sigma})>\theta\}_{m},\;\sigma\in\{l,m,s\}, \tag{5}$$

where $\theta$ is a hyper-parameter: a point cloud patch is considered meaningful only if it contains more than $\theta$ points.
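The masking and filtering in Eqs. 4 and 5 can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the square sliding-window masks, and the convention that backplane points are stored as all-zero coordinates are assumptions made for the example.

```python
import numpy as np

def window_masks(h, w, k, stride):
    """Yield (u, v, mask) for k x k sliding windows; mask is a boolean (h, w) array.

    Hypothetical helper: the paper's exact window layout is not restated here.
    """
    for u in range(0, h - k + 1, stride):
        for v in range(0, w - k + 1, stride):
            mask = np.zeros((h, w), dtype=bool)
            mask[u:u + k, v:v + k] = True
            yield u, v, mask

def filtered_patches(P, k, stride, theta):
    """Apply each mask to P (Eq. 4) and keep patches with more than theta valid points (Eq. 5).

    Assumption: a point is 'valid' (off the backplane) iff its coordinates are not all zero.
    """
    h, w, _ = P.shape
    valid = np.any(P != 0, axis=-1)                   # (h, w) boolean validity map
    kept = {}
    for u, v, mask in window_masks(h, w, k, stride):
        num = int(np.count_nonzero(valid & mask))     # Num(P_uv)
        if num > theta:
            kept[(u, v)] = P * mask[..., None]        # P ⊙ M_uv
    return kept
```

Calling `filtered_patches` once per scale (with a different window size `k` each time) yields the three filtered patch sets; patches that lie mostly on the backplane never reach the encoder.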

These sets of point clouds constitute three distinct scales of point cloud representation. The granularity of these patches is aligned with that of image patches, enhancing the efficacy of subsequent multi-modal anomaly detection. We extract features from these multi-scale point cloud patches:

$$\begin{aligned}
f_{P_{m}}^{l},F_{P_{m}}^{l} &=\mathcal{H}_{P}\left(E_{P}(\{\hat{P}_{uv}^{l}\}_{m})\right)\\
F_{P_{m}}^{m} &=\mathcal{H}_{P}\left(E_{P}(\{\hat{P}_{uv}^{m}\}_{m})\right)\\
F_{P_{m}}^{s} &=\mathcal{H}_{P}\left(E_{P}(\{\hat{P}_{uv}^{s}\}_{m})\right)
\end{aligned} \tag{6}$$

where $f_{P_{m}}^{l}$ is the class token and $F_{P_{m}}^{\sigma}$ is the feature map of the $\sigma$-scale point cloud.

3.1.2 Suspected References Selection

We first tried to identify noise samples in the training dataset solely by comparing the class tokens of text and RGB images. However, we observed that certain samples in the MVTec 3D-AD [10] dataset cannot be straightforwardly classified using only cross-modal comparison, i.e., text and image class tokens. For example, the ‘Foam’ category in MVTec 3D-AD includes a defect type labeled ‘color’, which defies classification with our text templates and necessitates comparison with an RGB reference image of a normal sample. Consequently, a language-guided zero-shot approach falls short of comprehensive anomaly classification, as some defects are only identifiable through intra-modal references, not merely by cross-modal comparison. Since noise data constitute a relatively small fraction of the training set and the majority of samples are normal, we propose to select the $N$ samples that are most representative of normality from the training set in Stage I. These samples then serve as intra-modal references in Stage II to compensate for the shortcomings of cross-modal comparison. Specifically, the image class token $f_{I_{m}}^{l}$ is used to obtain a suspected anomaly score by computing its similarity with the text class tokens $f_{T}^{Nor}$ and $f_{T}^{Ano}$ as follows:

$$s_{I_{m}}=\frac{\langle f_{I_{m}}^{l},f_{T}^{Ano}\rangle}{\langle f_{I_{m}}^{l},f_{T}^{Ano}\rangle+\langle f_{I_{m}}^{l},f_{T}^{Nor}\rangle}, \tag{7}$$

where $\langle\cdot,\cdot\rangle$ denotes the cosine similarity. $s_{P_{m}}$ is calculated from $f_{P_{m}}^{l}$, $f_{T}^{Nor}$, and $f_{T}^{Ano}$ in the same way:

$$s_{P_{m}}=\frac{\langle f_{P_{m}}^{l},f_{T}^{Ano}\rangle}{\langle f_{P_{m}}^{l},f_{T}^{Ano}\rangle+\langle f_{P_{m}}^{l},f_{T}^{Nor}\rangle}. \tag{8}$$

The final suspected score $s_{ref}$ combines $s_{I_{m}}$ and $s_{P_{m}}$:

$$s_{ref}=s_{I_{m}}+s_{P_{m}}. \tag{9}$$

We select the $N$ samples with the smallest $s_{ref}$, i.e., those most likely to be normal, as intra-modal references for Stage II; they are denoted $\{R_{I_{n}}\}_{n=1}^{N}$ and $\{R_{P_{n}}\}_{n=1}^{N}$ in Fig. 3.
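The reference selection of Eqs. 7 to 9 can be sketched as below. The function names and the array layout (one class token per sample, one row per sample) are assumptions for illustration; the class tokens themselves are taken as given.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def suspected_score(f, f_nor, f_ano):
    """Eq. 7 / Eq. 8: share of the similarity assigned to the 'anomalous' text token."""
    s_ano = cosine(f, f_ano)
    s_nor = cosine(f, f_nor)
    return s_ano / (s_ano + s_nor)

def select_references(img_tokens, pc_tokens, f_nor, f_ano, n_refs):
    """Eq. 9, then pick the n_refs samples with the smallest combined score.

    img_tokens, pc_tokens: sequences of class tokens, one per training sample
    (hypothetical layout). Returns (indices of references, all s_ref values).
    """
    s_ref = np.array([
        suspected_score(fi, f_nor, f_ano) + suspected_score(fp, f_nor, f_ano)
        for fi, fp in zip(img_tokens, pc_tokens)
    ])
    return np.argsort(s_ref)[:n_refs], s_ref
```

Note that the ratio in Eqs. 7 and 8 is only well behaved when both cosine similarities are positive; this sketch does not guard against negative or near-zero denominators.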

3.1.3 Suspected Anomaly Map Computation

Furthermore, we have observed that in industrial anomaly detection tasks, anomalies typically occupy only a small fraction of a sample. Treating all small local patches with uniform attention therefore does not facilitate optimal noise sample detection. Consequently, we propose using the preliminary suspected anomaly map obtained in Stage I as the attention map in Noise-Focused Aggregation within Stage II. This strategy allows for differentiated attention across local patches, enabling our model to focus more precisely on the local patches that may contain noise. To generate the preliminary suspected anomaly map, we follow WinCLIP [67], using harmonic aggregation of windows and multi-scale aggregation to obtain the suspected anomaly map $W_{n}\in\mathbb{R}^{h\times w}$ ($n=1,\cdots,N$). These suspected anomaly maps $\{W_{n}\}_{n=1}^{N}$ serve as attention maps to enhance the denoising process in Stage II.
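As a rough illustration of what harmonic aggregation over overlapping windows means, the sketch below takes one score per window and returns, for every pixel, the harmonic mean of the scores of the windows covering it. It is an assumption-laden sketch: the function name, the input layout, and the choice to apply the mean directly to the given scores are illustrative, and the exact score convention used by WinCLIP is not restated in this section.

```python
import numpy as np

def harmonic_aggregate(window_scores, masks, eps=1e-12):
    """Per-pixel harmonic mean of window scores.

    window_scores: (K,) positive scores, one per window (hypothetical input).
    masks:         (K, h, w) binary masks marking the pixels each window covers.
    Returns an (h, w) map; pixels covered by no window are set to 0.
    """
    masks = masks.astype(float)
    count = masks.sum(axis=0)                                   # windows covering each pixel
    inv_sum = (masks / (window_scores[:, None, None] + eps)).sum(axis=0)
    out = np.zeros_like(count)
    covered = count > 0
    out[covered] = count[covered] / inv_sum[covered]            # n / sum(1 / s)
    return out
```

Because the harmonic mean is dominated by its smallest inputs, a pixel receives a high value only if all windows covering it score high.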

3.2 Stage II: Enhanced Multi-modal Denoising

In industrial anomaly detection tasks, anomalies often occupy only a small portion of the entire sample. Therefore, after segmenting the sample into multi-scale patches, some patches will contain anomalies while others will not. Naturally, we aim to focus more on those patches containing anomalies and less on those without when computing the suspected anomaly score through intra-modality comparison, to enhance the accuracy of anomaly detection. This is achieved by assigning a weight to each patch based on the suspected anomaly map computed in Sec. 3.1.3, thereby allowing differential attention to patches based on their likelihood of containing anomalies. Specifically, this process is divided into four steps:

Intra-modal comparison. With the $N$ intra-modal references selected during Stage I, we employ their image features $\{R_{I_{n}}\}_{n=1}^{N}$ and point cloud features $\{R_{P_{n}}\}_{n=1}^{N}$ for reference:

$$\begin{aligned}
r_{I_{n}}^{l},R_{I_{n}}^{l},R_{I_{n}}^{m},R_{I_{n}}^{s} &=R_{I_{n}}\\
r_{P_{n}}^{l},R_{P_{n}}^{l},R_{P_{n}}^{m},R_{P_{n}}^{s} &=R_{P_{n}},
\end{aligned} \tag{10}$$

where $r_{I_{n}}^{l}$ and $r_{P_{n}}^{l}$ are class tokens, while $R_{I_{n}}^{\sigma}=\{r_{uv}^{\sigma}\}_{I_{n}}$ and $R_{P_{n}}^{\sigma}=\{r_{uv}^{\sigma}\}_{P_{n}}$ are $\sigma$-scale feature maps. The intra-modality suspected anomaly score is determined by the cosine similarity between the feature vectors of the original query samples and those of the intra-modal references:

$$\begin{aligned}
\{\bar{s}_{uv}^{\sigma}\}_{I_{m}} &=\{1-\max\langle f_{uv}^{\sigma}|I_{m},\,r_{uv}^{\sigma}|I_{[1,N]}\rangle\}_{m}\\
\{\bar{s}_{uv}^{\sigma}\}_{P_{m}} &=\{1-\max\langle f_{uv}^{\sigma}|P_{m},\,r_{uv}^{\sigma}|P_{[1,N]}\rangle\}_{m},
\end{aligned} \tag{11}$$

where $\bar{s}_{I_{m}}=\{\bar{s}_{uv}^{\sigma}\}_{I_{m}}$, $\bar{s}_{P_{m}}=\{\bar{s}_{uv}^{\sigma}\}_{P_{m}}$, and $\sigma\in\{l,m,s\}$.
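A minimal sketch of the intra-modal comparison in Eq. 11 follows. It reads the equation as comparing each query patch with the reference patches at the same position $(u,v)$ and taking the maximum over the $N$ references; matching against reference patches at all positions would be another possible reading. The function name and array shapes are assumptions.

```python
import numpy as np

def intra_modal_scores(F_query, R_refs, eps=1e-12):
    """Eq. 11 for one modality and one scale.

    F_query: (K, d) patch features of the query sample.
    R_refs:  (N, K, d) patch features of the N references, same patch order.
    Returns (K,) scores: 1 - max_n cos(f_uv, r_uv^(n)).
    """
    q = F_query / (np.linalg.norm(F_query, axis=-1, keepdims=True) + eps)
    r = R_refs / (np.linalg.norm(R_refs, axis=-1, keepdims=True) + eps)
    sim = np.einsum("kd,nkd->nk", q, r)      # cosine similarity per reference and patch
    return 1.0 - sim.max(axis=0)             # closest reference determines the score
```

A patch that closely matches at least one reference gets a score near 0, so a single good reference is enough to mark it as normal.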

Compute weights for local patches. We first compute a weight for every local patch. Given the suspected anomaly map $W\in\mathbb{R}^{h\times w}$, we obtain individual suspected anomaly maps for distinct patches by applying the masks generated in Sec. 3.1 to the whole suspected anomaly map:

$$\{W_{uv}^{\sigma}\}_{n}=\{W_{n}\odot M_{uv}^{\sigma}\},\;\sigma\in\{l,m,s\}. \tag{12}$$

In this way, we can determine the weight for each local patch at both middle and small scales. For large scale, the entire suspected anomaly map can be directly used as the weight.

Figure 5: Detailed Explanation of multi-scale suspected anomaly score computation, which focuses more on the patches containing anomalies and less on those without when computing the intra-modal suspected anomaly scores to enhance the accuracy of anomaly detection.

Multi-scale anomaly score aggregation. For each local patch, the suspected anomaly score $\bar{s}^{\sigma}_{uv}$ is first distributed to every pixel of the local patch. Then, at each pixel in the whole point cloud, we aggregate multiple scores from all overlapping local patches to improve anomaly classification. In order to focus more on the patches that contain anomalies, we re-weight the score $\bar{s}^{\sigma}_{uv}$ using $W^{\sigma}_{uv}$ while aggregating multi-scale information. In this way, regions receive attention according to their likelihood of containing anomalies (Fig. 5-Left):

$$\begin{aligned}
\{\bar{\bar{s}}_{uv}^{\sigma}\}_{I_{m}} &=\left\{\frac{\sum_{p,q}(W_{pq}^{\sigma}\odot\bar{s}_{pq}^{\sigma})_{uv}}{\sum_{p,q}(M_{pq}^{\sigma})_{uv}}\right\}_{I_{m}}\\
\{\bar{\bar{s}}_{uv}^{\sigma}\}_{P_{m}} &=\left\{\frac{\sum_{p,q}(W_{pq}^{\sigma}\odot\bar{s}_{pq}^{\sigma})_{uv}}{\sum_{p,q}(M_{pq}^{\sigma})_{uv}}\right\}_{P_{m}}.
\end{aligned} \tag{13}$$
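The weighted aggregation of Eqs. 12 and 13 can be sketched for one scale as follows. The function name, the use of a single suspected anomaly map `W`, and the `eps` guard for pixels covered by no patch are assumptions for illustration.

```python
import numpy as np

def aggregate_scale(patch_scores, masks, W, eps=1e-12):
    """Weighted per-pixel aggregation at one scale (Eqs. 12 and 13).

    patch_scores: (K,) intra-modal scores, one per patch.
    masks:        (K, h, w) binary patch masks M_pq.
    W:            (h, w) suspected anomaly map used as attention weights.
    Returns an (h, w) map: sum_pq (W ⊙ M_pq) * s_pq / sum_pq M_pq.
    """
    masks = masks.astype(float)
    W_patches = W[None] * masks                                  # Eq. 12: W_pq = W ⊙ M_pq
    num = (W_patches * patch_scores[:, None, None]).sum(axis=0)  # weighted, distributed scores
    den = masks.sum(axis=0)                                      # number of covering patches
    return num / np.maximum(den, eps)                            # uncovered pixels give 0
```

Pixels where `W` is small contribute little even if a patch score is high, which is how the aggregation concentrates on regions already suspected of containing anomalies.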

Final suspected anomaly score computation. The final suspected image anomaly score $\tilde{s}_{I_{m}}$ is computed using both the cross-modal score $s_{I_{m}}$ calculated in Eq. 7 and the intra-modality scores $\{\bar{\bar{s}}_{uv}^{\sigma}\}_{I_{m}}=\{\{\bar{\bar{s}}_{uv}^{l}\}_{I_{m}},\{\bar{\bar{s}}_{uv}^{m}\}_{I_{m}},\{\bar{\bar{s}}_{uv}^{s}\}_{I_{m}}\}$ calculated in Eq. 13:

$$\tilde{s}_{I_{m}}=\frac{1}{3}\left(s_{I_{m}}+\max_{uv}\{\{\bar{\bar{s}}_{uv}^{m}\}_{I_{m}}+\{\bar{\bar{s}}_{uv}^{s}\}_{I_{m}}\}+\max_{uv}\{\bar{\bar{s}}_{uv}^{l}\}_{I_{m}}\right). \tag{14}$$

A detailed explanation can be viewed in the right part of Fig. 5-Left. The final suspected point cloud anomaly score $\tilde{s}_{P_{m}}$ is computed in the same way:

$$\tilde{s}_{P_{m}}=\frac{1}{3}\left(s_{P_{m}}+\max_{uv}\{\{\bar{\bar{s}}_{uv}^{m}\}_{P_{m}}+\{\bar{\bar{s}}_{uv}^{s}\}_{P_{m}}\}+\max_{uv}\{\bar{\bar{s}}_{uv}^{l}\}_{P_{m}}\right). \tag{15}$$

The final suspected anomaly score $\tilde{s_{I}}$ is then calculated as a weighted combination of $\tilde{s}_{I_{m}}$ and $\tilde{s}_{P_{m}}$:

$$\tilde{s_{I}}=\lambda_{I}\tilde{s}_{I_{m}}+\lambda_{P}\tilde{s}_{P_{m}}, \tag{16}$$

where $\lambda_I$ and $\lambda_P$ are hyper-parameters that weight the contributions of the RGB and point cloud modalities. Finally, we remove the training samples whose scores fall in the top $\tau$ percent.
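The fusion in Eq. 16 and the subsequent filtering step can be illustrated with a minimal sketch. The function names, the equal default weights, and the floor rounding of the number of removed samples are assumptions made for illustration; none of them is specified in the text.

```python
def fuse_scores(s_img, s_pc, lam_i=0.5, lam_p=0.5):
    """Weighted sum of per-sample RGB and point cloud scores (Eq. 16).

    lam_i and lam_p are placeholder values, not the paper's settings.
    """
    return [lam_i * a + lam_p * b for a, b in zip(s_img, s_pc)]


def drop_top_percent(scores, tau):
    """Return the indices kept after removing the top `tau` percent by score."""
    n_drop = int(len(scores) * tau / 100.0)  # floor rounding is an assumption
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    dropped = set(order[:n_drop])
    return [i for i in range(len(scores)) if i not in dropped]


fused = fuse_scores([0.1, 0.9, 0.2, 0.3, 0.1], [0.2, 0.8, 0.1, 0.4, 0.0])
kept = drop_top_percent(fused, tau=20)  # removes the single highest-scoring sample
```

With five samples and $\tau=20$, exactly one sample (the one with the highest fused score) is removed.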

3.3 Fused Anomaly Detection

Figure 6: Details of Unsupervised Feature Fusion (UFF), a unified module trained with all training data of MVTec 3D-AD. The patch-wise contrastive loss $\mathcal{L}_{con}$ encourages multimodal patch features at the same position to share the most mutual information, i.e., the diagonal elements of the contrastive matrix have the largest values.

As shown in Fig. 3, Stage III takes the dataset filtered by Stage I&II as input and learns its pattern to perform anomaly detection and segmentation. Stage III also filters out noise at the patch level in case some hard noisy samples remain in the training dataset.

3.3.1 Point Feature Alignment

Point Feature Interpolation. After the farthest point sampling (FPS) performed within the Point Transformer ($E'_P$), the center points of the point cloud are unevenly distributed, which leads to an imbalance in the density of point features. To address this, we interpolate the features back to the original point cloud. With $K$ point features $g_i$ corresponding to $K$ center points $c_i$, we employ inverse distance weighting to interpolate the feature for each point $p_j$ in the input point cloud:

$$p'_j=\sum_{i=1}^{K}\alpha_i g_i,\quad \alpha_i=\frac{\frac{1}{\|c_i-p_j\|_2+\epsilon}}{\sum_{k=1}^{K}\sum_{t=1}^{T}\frac{1}{\|c_k-p_t\|_2+\epsilon}}, \tag{17}$$

where $\epsilon$ is a small constant to prevent division by zero.
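As an illustration of inverse distance weighting, the sketch below interpolates center features onto arbitrary query points in plain Python. It normalizes the weights separately for each query point so that they sum to one, which is the common form of the technique; this is an assumption of the sketch and differs from the denominator printed in Eq. 17, which sums over all centers and all points. The function name and the default `eps` are also illustrative.

```python
import math


def idw_interpolate(centers, feats, points, eps=1e-8):
    """Interpolate center features onto each query point.

    centers: K center coordinates (3-tuples)
    feats:   K feature vectors, one per center
    points:  query point coordinates
    Weights are normalized per query point (assumption, see text above).
    """
    dim = len(feats[0])
    out = []
    for p in points:
        # Inverse distance from this point to every center.
        inv = [1.0 / (math.dist(c, p) + eps) for c in centers]
        total = sum(inv)
        out.append([
            sum(w * g[d] for w, g in zip(inv, feats)) / total
            for d in range(dim)
        ])
    return out


centers = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
feats = [[0.0], [10.0]]
interp = idw_interpolate(centers, feats, [(1.0, 0.0, 0.0), (0.0, 0.0, 0.0)])
```

A point halfway between two centers receives the average of their features, while a point that coincides with a center receives (almost exactly) that center's feature.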

Point Feature Projection. After interpolation, we project the interpolated point features $p'_j$ onto a 2D plane as $\hat{p}$ using the point coordinates and camera parameters. Because point clouds are sparse, we assign a value of 0 to any 2D position that has no corresponding point. The resulting projected feature map has the same size as the RGB image.
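The zero-filling behavior of the projection can be sketched as a simple scatter operation. The sketch assumes that the pixel location (row, column) of every point has already been computed from its coordinates and the camera parameters; that step is not shown, and the function name and argument layout are hypothetical.

```python
def project_features(pixel_rc, feats, height, width):
    """Scatter per-point features into an H x W grid; empty cells stay zero.

    pixel_rc: (row, col) of each point, assumed to be precomputed
    feats:    one feature vector per point
    """
    dim = len(feats[0])
    grid = [[[0.0] * dim for _ in range(width)] for _ in range(height)]
    for (r, c), f in zip(pixel_rc, feats):
        if 0 <= r < height and 0 <= c < width:  # ignore points outside the image
            grid[r][c] = list(f)
    return grid


feature_map = project_features([(0, 1), (1, 0)], [[1.0, 2.0], [3.0, 4.0]], 2, 2)
```

Positions without a point, such as `feature_map[0][0]` here, keep the all-zero feature.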

3.3.2 Unsupervised Feature Fusion

The interaction between multi-modal features can yield new information beneficial for industrial anomaly detection. For instance, as shown in Fig. 1, detecting a hole in a cookie necessitates the integration of both its black color and the shape depression. To decipher the intrinsic relationship between these modalities in the training data, we developed the Unsupervised Feature Fusion (UFF) module.

We introduce a patch-wise contrastive loss to train this module. Given RGB features fIsubscript𝑓𝐼f_{I}italic_f start_POSTSUBSCRIPT italic_I end_POSTSUBSCRIPT and point cloud features fPsubscript𝑓𝑃f_{P}italic_f start_POSTSUBSCRIPT italic_P end_POSTSUBSCRIPT, our goal is to promote a higher correlation of information between features from different modalities at identical spatial positions while minimizing this correlation for features at distinct positions.

The features of a sample are represented as $\{\{f_{uv}\}_{I_i},\{f_{uv}\}_{P_i}\}$, where $i$ denotes the index of the training sample and $u,v$ denotes the patch position. We employ MLPs $\{\chi_I,\chi_P\}$ to derive interaction information between the two modalities and fully connected layers $\{\sigma_I,\sigma_P\}$ to transform the processed features into query or key vectors, denoted as $\{\{h_{uv}\}_{I_i},\{h_{uv}\}_{P_i}\}$. For contrastive learning, we apply the InfoNCE loss:

$$\mathcal{L}_{con}=\frac{\{h_{uv}\}_{I_i}\cdot\{h_{uv}\}_{P_i}}{\sum_{t=1}^{N_b}\sum_{uv}\{h_{uv}\}^{t}_{I}\cdot\{h_{uv}\}^{t}_{P}}, \tag{18}$$

where $N_b$ is the batch size. The UFF module, trained jointly on the training data of all categories in MVTec 3D-AD, is depicted in Fig. 6.
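To make the role of the diagonal of the contrastive matrix concrete, the following sketch computes a patch-wise InfoNCE loss in its standard log-softmax form, where the point cloud patch at the same position is the positive and all other patches are negatives. This is a generic formulation that may differ in detail from Eq. 18; the function name and the `temperature` parameter are assumptions and do not come from the text.

```python
import math


def patch_infonce(h_img, h_pc, temperature=0.1):
    """Mean InfoNCE loss over N patches.

    h_img, h_pc: lists of N equal-length vectors (projected patch features).
    Patch i of h_pc is the positive for patch i of h_img.
    `temperature` is a hypothetical knob, not a value from the paper.
    """
    n = len(h_img)
    loss = 0.0
    for i in range(n):
        # Row i of the contrastive matrix: similarity to every point cloud patch.
        logits = [
            sum(a * b for a, b in zip(h_img[i], h_pc[j])) / temperature
            for j in range(n)
        ]
        m = max(logits)  # subtract the max for numerical stability
        log_denom = m + math.log(sum(math.exp(z - m) for z in logits))
        loss += log_denom - logits[i]  # -log softmax of the diagonal entry
    return loss / n


aligned = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]
low = patch_infonce(aligned, aligned)   # matching positions agree
high = patch_infonce(aligned, swapped)  # matching positions disagree
```

The loss is small when features at identical positions agree (`low`) and large when they do not (`high`), which is the behavior the contrastive objective is meant to encourage.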

During inference, the outputs of the MLP layers are concatenated to form a fused patch feature, denoted as $\{f_{uv}\}_{F_i}$.

3.3.3 Noise Discriminative Coreset Selection

In our experiments, we found that, although the training data are pre-processed to remove noise at the sample level, some noisy samples that closely resemble normal samples cannot be eliminated. To address this, we perform a second round of denoising at the patch level. Following SoftPatch [8], we discard noisy patches during coreset selection. We first calculate outlier scores for all patches with the Local Outlier Factor (LOF) method. These scores are then aggregated to identify the noisy patches, and we remove the patches whose scores fall in the top $\tau$ percent.

LOF is a local-density-based outlier detector. Inspired by SoftPatch, we use LOF in M3DM in two ways. First, we use LOF to rule out noisy patches so that the training dataset contains only normal samples. Second, we use the LOF score as a soft weight for patches to achieve more accurate anomaly detection.

The k-distance-based absolute local reachability density $lrd_{uv_i}$ is first calculated as:

$$\begin{gathered}
lrd_{uv_i}=\left(\frac{\sum_{b\in\mathcal{N}_k(f_{uv_i})}dist_k^{reach}(f_{uv_i},f^{b}_{uv})}{|\mathcal{N}_k(f_{uv_i})|}\right)^{-1},\\
dist_k^{reach}(f_{uv_i},f^{b}_{uv})=\max\left(dist_k(f^{b}_{uv}),\,d(f_{uv_i},f^{b}_{uv})\right),
\end{gathered} \tag{19}$$

where $d(f_{uv_i},f^{b}_{uv})$ is the L2 distance, $dist_k(f_{uv_i})$ is the distance to the $k$-th nearest neighbor, $\mathcal{N}_k(f_{uv_i})$ is the set of $k$-nearest neighbors of $f_{uv_i}$, and $|\mathcal{N}_k(f_{uv_i})|$ is the size of that set, which usually equals $k$ when there are no repeated neighbors. With the local reachability density of each patch, the overwhelming effect of large clusters is largely reduced. To normalize the local density to a relative density so that all clusters are treated equally, the relative density $\eta_{uv_i}$ of each patch is defined as:

$$\eta_{uv_i}=\frac{\sum_{b\in\mathcal{N}_k(f_{uv_i})}lrd^{b}_{uv}}{|\mathcal{N}_k(f_{uv_i})|\cdot lrd_{uv_i}}. \tag{20}$$

$\eta_{uv_i}$ is the ratio of the mean density of a patch's neighbors to the patch's own density and serves as the patch's outlier score: a larger value means the patch lies in a sparser region than its neighbors. Patches whose scores fall in the top $\tau$ percent are removed before coreset selection.
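The computation in Eqs. 19 and 20 and the removal step can be sketched in plain Python as follows. The sketch uses exactly $k$ neighbors per patch (ties at the $k$-th distance are broken by index), adds a small constant to avoid division by zero for duplicate features, and rounds the number of removed patches down; these choices, the function names, and the default `k` are assumptions for illustration. A practical implementation would use a vectorized or library routine on the full patch set.

```python
import math


def lof_scores(feats, k=2):
    """Local Outlier Factor of each feature vector (Eqs. 19 and 20)."""
    n = len(feats)
    # Pairwise Euclidean distances.
    d = [[math.dist(a, b) for b in feats] for a in feats]
    # k nearest neighbors of each point, excluding the point itself.
    knn = [
        sorted((j for j in range(n) if j != i), key=lambda j: d[i][j])[:k]
        for i in range(n)
    ]
    # k-distance: distance to the k-th nearest neighbor.
    kdist = [d[i][knn[i][-1]] for i in range(n)]
    # Local reachability density: inverse of the mean reachability distance.
    lrd = []
    for i in range(n):
        reach = [max(kdist[b], d[i][b]) for b in knn[i]]
        lrd.append(len(reach) / (sum(reach) + 1e-10))
    # Relative density: mean neighbor density over the point's own density.
    return [
        sum(lrd[b] for b in knn[i]) / (len(knn[i]) * lrd[i])
        for i in range(n)
    ]


def keep_after_lof(feats, tau, k=2):
    """Indices kept after dropping the top `tau` percent of LOF scores."""
    scores = lof_scores(feats, k)
    n_drop = int(len(feats) * tau / 100.0)  # floor rounding is an assumption
    order = sorted(range(len(feats)), key=lambda i: scores[i], reverse=True)
    dropped = set(order[:n_drop])
    return [i for i in range(len(feats)) if i not in dropped], scores


patches = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (10.0, 10.0)]
kept, eta = keep_after_lof(patches, tau=20)
```

In this toy example the four clustered patches receive a score close to 1, while the isolated patch at (10, 10) receives a much larger score and is the one removed.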

3.3.4 Decision Layer Fusion

As depicted in Fig. 1, certain industrial anomalies, such as the protruding part of a potato, manifest exclusively in a single domain, making the correlation between multi-modal features less evident. Additionally, despite the advantages of Feature Fusion in enhancing multi-modal feature interaction, we observed some loss of information during the fusion process. Furthermore, we observed that, despite undergoing denoising at both the image and patch levels, some hard noise patches remain within the dataset. These hard noise elements can adversely affect the precision of anomaly scores during the final inference stage.

To address these issues, we propose utilizing multiple memory banks to preserve the original color feature ($f_I$), point cloud feature ($f_P$), and fusion feature ($f_F$). These are denoted as $\mathcal{M}_I$, $\mathcal{M}_P$, and $\mathcal{M}_F$, respectively. In addition, we use $\eta_{uv_i}$ obtained in Sec. 3.3.3 to re-weight the anomaly score during inference, which down-weights noisy samples according to their outlier scores. During inference, each bank contributes to predicting an anomaly score and a segmentation map. Two learnable One-Class Support Vector Machines (OCSVMs), $\mathcal{D}_{image}$ and $\mathcal{D}_{pixel}$, are employed to produce the final anomaly score $S_{image}$ and the segmentation map $S_{pixel}$. This procedure is referred to as Decision Layer Fusion (DLF) and can be written as:

$$\begin{gathered}
S_{image}=\mathcal{D}_{image}(\phi(\mathcal{M}_I,f_I),\phi(\mathcal{M}_P,f_P),\phi(\mathcal{M}_F,f_F)),\\
S_{pixel}=\mathcal{D}_{pixel}(\psi(\mathcal{M}_I,f_I),\psi(\mathcal{M}_P,f_P),\psi(\mathcal{M}_F,f_F)),
\end{gathered} \tag{21}$$

where $\phi$ and $\psi$ are scoring functions, defined as follows:

$$\begin{gathered}
\phi(\mathcal{M},f)=\eta_{uv_i}\|f^{*}_{uv_i}-m^{*}\|_2,\\
\psi(\mathcal{M},f)=\left\{\min_{m\in\mathcal{M}}\|f_{uv_i}-m\|_2\,\Big|\,f_{uv_i}\in f\right\},\\
f^{*}_{uv_i},m^{*}=\arg\max_{f_{uv_i}\in f}\arg\min_{m\in\mathcal{M}}\|f_{uv_i}-m\|_2,
\end{gathered} \tag{22}$$

where $\mathcal{M}\in\{\mathcal{M}_I,\mathcal{M}_P,\mathcal{M}_F\}$, $f\in\{f_I,f_P,f_F\}$, and $\eta_{uv_i}$ is the weight parameter obtained in Sec. 3.3.3.
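The two scoring functions in Eq. 22 can be sketched for a single memory bank as follows. The sketch attaches the soft weight to the memory entry matched by the most anomalous test patch, which is one possible reading of how $\eta_{uv_i}$ enters $\phi$ and should be treated as an assumption; the function names and the `weights` argument are hypothetical, and the OCSVM fusion of Eq. 21 is not shown.

```python
import math


def psi(memory, feats):
    """Patch-level scores: distance from each test patch to its nearest memory entry."""
    return [min(math.dist(f, m) for m in memory) for f in feats]


def phi(memory, feats, weights):
    """Image-level score: the largest nearest-neighbor distance, scaled by a soft weight.

    weights[j] is the LOF-style weight of memory entry j. Tying the weight to the
    matched memory entry is an assumption of this sketch.
    """
    best_dist, best_idx = -1.0, 0
    for f in feats:
        dists = [math.dist(f, m) for m in memory]
        j = min(range(len(memory)), key=lambda t: dists[t])  # nearest memory entry
        if dists[j] > best_dist:  # keep the most anomalous test patch
            best_dist, best_idx = dists[j], j
    return weights[best_idx] * best_dist


memory_bank = [(0.0, 0.0), (1.0, 1.0)]
test_patches = [(0.0, 0.1), (3.0, 1.0)]
pixel_scores = psi(memory_bank, test_patches)
image_score = phi(memory_bank, test_patches, weights=[1.0, 1.5])
```

Here the second test patch is the farthest from the memory bank (distance 2 to the entry at (1, 1)), so the image-level score is that distance multiplied by the weight of the matched entry.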

4 Experiment

TABLE I: I-AUROC score for regular anomaly detection for all categories of MVTec 3D-AD. Our method maintains the regular anomaly detection ability. Baseline results are taken from [10, 20, 18, 75]. Optimal and sub-optimal results are in bold and underlined, respectively.
| Modality | Method | Bagel | Cable Gland | Carrot | Cookie | Dowel | Foam | Peach | Potato | Rope | Tire | Mean |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3D | 3D-ST [47] | 86.2 | 48.4 | 83.2 | 89.4 | 84.8 | 66.3 | 76.3 | 68.7 | 95.8 | 48.6 | 74.8 |
| | FPFH [20] | 82.5 | 55.1 | 95.2 | 79.7 | 88.3 | 58.2 | 75.8 | 88.9 | 92.9 | 65.3 | 78.2 |
| | AST [18] | 88.1 | 57.6 | 96.5 | 95.7 | 67.9 | 79.7 | 99.0 | 91.5 | 95.6 | 61.1 | 83.3 |
| | M3DM [9] | \ul{94.1} | 65.1 | 96.5 | \ul{96.9} | 90.5 | 76.0 | 88.0 | 97.4 | 92.6 | \ul{76.5} | 87.4 |
| | Ours | 94.2 | 66.1 | \ul{95.5} | 97.2 | \ul{90.4} | \ul{77.2} | 88.1 | \ul{96.4} | 91.6 | 78.5 | 87.4 |
| RGB | PADiM [19] | 97.5 | 77.5 | 69.8 | 58.2 | 95.9 | 66.3 | 85.8 | 53.5 | 83.2 | 76.0 | 76.4 |
| | PatchCore [5] | 87.6 | 88.0 | 79.1 | 68.2 | 91.2 | 70.1 | 69.5 | 61.8 | 84.1 | 70.2 | 77.0 |
| | STFPM [76] | 93.0 | 84.7 | 89.0 | 57.5 | 94.7 | 76.6 | 71.0 | 59.8 | 96.5 | 70.1 | 79.3 |
| | CS-Flow [6] | 94.1 | 93.0 | 82.7 | 79.5 | 99.0 | 88.6 | 73.1 | 47.1 | 98.6 | 74.5 | 83.0 |
| | AST [18] | 94.7 | 92.8 | 85.1 | 82.5 | 98.1 | 95.1 | 89.5 | 61.3 | 99.2 | 82.1 | 88.0 |
| | M3DM [9] | 94.4 | 91.8 | 89.6 | 74.9 | 95.9 | 76.7 | \ul{91.9} | \ul{64.8} | 93.8 | 76.7 | 85.0 |
| | Ours | 94.2 | 91.7 | \ul{89.4} | 73.9 | 96.1 | 77.8 | 93.3 | 64.9 | 92.8 | 77.7 | 85.1 |
| RGB + 3D | Voxel GAN [10] | 68.0 | 32.4 | 56.5 | 39.9 | 49.7 | 48.2 | 56.6 | 57.9 | 60.1 | 48.2 | 51.7 |
| | PatchCore + FPFH [20] | 91.8 | 74.8 | 96.7 | 88.3 | 93.2 | 58.2 | 89.6 | 91.2 | 92.1 | 88.6 | 86.5 |
| | AST [18] | 98.3 | 87.3 | \ul{97.6} | 97.1 | 93.2 | 88.5 | 97.4 | 98.1 | 100.0 | 79.7 | 93.7 |
| | M3DM [9] | 99.4 | \ul{90.9} | 97.2 | 97.6 | 96.0 | 94.2 | 97.3 | 89.9 | 97.2 | 85.0 | 94.5 |
| | Ours | \ul{99.3} | 91.1 | 97.7 | 97.6 | 96.0 | \ul{92.2} | 97.3 | 89.9 | 95.5 | 88.2 | 94.5 |
TABLE II: AUPRO score for regular anomaly segmentation for all categories of MVTec 3D-AD. Our method maintains the regular anomaly segmentation ability. Baseline results are taken from [10, 20, 75]. Optimal and sub-optimal results are in bold and underlined, respectively.
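| Modality | Method | Bagel | Cable Gland | Carrot | Cookie | Dowel | Foam | Peach | Potato | Rope | Tire | Mean |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3D | 3D-ST [47] | 95.0 | 48.3 | 98.6 | 92.1 | 90.5 | 63.2 | 94.5 | 98.8 | 97.6 | 54.2 | 83.3 |
| | FPFH [20] | 97.3 | 87.9 | 98.2 | 90.6 | 89.2 | 73.5 | 97.7 | 98.2 | 95.6 | 96.1 | 92.4 |
| | M3DM [9] | 94.3 | 81.8 | 97.7 | 88.2 | 88.1 | 74.3 | 95.8 | 97.4 | 95.0 | 92.9 | 90.6 |
| | Ours | 94.2 | 81.8 | 97.8 | 88.3 | 88.0 | 74.3 | 95.8 | 97.4 | 95.0 | 92.9 | 90.6 |
| RGB | CFlow [6] | 85.5 | 91.9 | 95.8 | 86.7 | 96.9 | 50.0 | 88.9 | 93.5 | 90.4 | 91.9 | 87.1 |
| | PatchCore [5] | 90.1 | 94.9 | 92.8 | 87.7 | 89.2 | 56.3 | 90.4 | 93.2 | 90.8 | 90.6 | 87.6 |
| | PADiM [19] | 98.0 | 94.4 | 94.5 | 92.5 | 96.1 | 79.2 | 96.6 | 94.0 | 93.7 | 91.2 | 93.0 |
| | M3DM [9] | 95.2 | 97.2 | 97.3 | 89.1 | 93.2 | 84.3 | 97.0 | 95.6 | 96.8 | 96.6 | 94.2 |
| | Ours | 95.4 | \ul{97.0} | 97.3 | 89.1 | 93.4 | 84.3 | 97.0 | 95.6 | 96.8 | 96.6 | 94.2 |
| RGB + 3D | Voxel GAN [10] | 66.4 | 62.0 | 76.6 | 74.0 | 78.3 | 33.2 | 58.2 | 79.0 | 63.3 | 48.3 | 63.9 |
| | PatchCore + FPFH [20] | 97.6 | 96.9 | 97.9 | 97.3 | 93.3 | 88.8 | 97.5 | 98.1 | 95.0 | 97.1 | 95.9 |
| | M3DM [9] | 97.0 | 97.1 | 97.9 | 95.0 | 94.1 | \ul{93.2} | \ul{97.7} | 97.1 | \ul{97.1} | 97.5 | \ul{96.4} |
| | Ours | 97.4 | 97.1 | 97.8 | 94.5 | \ul{93.8} | 94.7 | 97.8 | 97.1 | 97.2 | \ul{97.4} | 96.5 |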

4.1 Experimental Setup

Dataset. 3D industrial anomaly detection is still at an early stage, and MVTec 3D-AD [10] is the first dataset for this task. All of our experiments are performed on it. The dataset consists of 10 categories, with a total of 2,656 training samples and 1,137 testing samples. The 3D scans were acquired by an industrial sensor using structured light, and the position information is stored in 3-channel tensors representing the $x$, $y$, and $z$ coordinates. These 3-channel tensors can be mapped one-to-one to the corresponding point clouds. Additionally, RGB information is recorded for each point. Because all samples in the dataset are viewed from the same angle, the RGB information of each sample can be stored in a single image. In summary, each sample of the MVTec 3D-AD dataset contains a colored point cloud.

We conduct both regular anomaly detection (Sec. 4.2) and noisy anomaly detection (Sec. 4.3). For noisy anomaly detection, in order to generate a noisy training set, we randomly select 10% anomalous samples from the test set and add them to the existing training samples. Additionally, we establish two distinct settings, Overlap and Non-Overlap, to assess the robustness of our model. In the Overlap setting, the anomalous samples added to the training dataset are also kept in the test dataset, which exposes the risk that defects with a similar appearance severely degrade the performance of an anomaly detector trained with noisy data. Conversely, in the Non-Overlap setting, these samples are removed from the test set and are not tested again.
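The two settings can be illustrated with a small sketch of how such a split might be built. The text does not state what the 10% is measured against, so the sketch applies the ratio to the pool of anomalous test samples; this base, the function name, and the fixed seed are assumptions for illustration only.

```python
import random


def make_noisy_split(train, test_normal, test_anomalous,
                     ratio=0.10, overlap=True, seed=0):
    """Inject a random subset of anomalous test samples into the training set.

    The ratio is applied to the anomalous test pool (assumption, see text above).
    Samples must be hashable, e.g. file paths.
    """
    rng = random.Random(seed)
    n_noise = round(ratio * len(test_anomalous))
    noise = rng.sample(test_anomalous, n_noise)
    noisy_train = list(train) + noise
    if overlap:
        # Overlap: the injected samples are tested again.
        test = list(test_normal) + list(test_anomalous)
    else:
        # Non-Overlap: the injected samples are removed from the test set.
        chosen = set(noise)
        test = list(test_normal) + [s for s in test_anomalous if s not in chosen]
    return noisy_train, test


train = [f"good_{i}" for i in range(20)]
test_ok = [f"test_good_{i}" for i in range(5)]
test_bad = [f"defect_{i}" for i in range(10)]
train_ov, test_ov = make_noisy_split(train, test_ok, test_bad, overlap=True)
train_no, test_no = make_noisy_split(train, test_ok, test_bad, overlap=False)
```

With the same seed, both settings inject the same anomalous samples into the training set; only the test set differs.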

Data Pre-processing. Unlike 2D data, 3D data make it easier to remove background information. Following [20], we estimate the background plane with RANSAC [77] and remove any point within a distance of 0.005 from it. At the same time, we set the pixels of the RGB image that correspond to removed points to 0. This operation not only accelerates 3D feature processing during training and inference but also reduces background disturbance for anomaly detection. Finally, we resize both the position tensor and the RGB image to $224\times 224$, which matches the input size of the feature extractor.
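The background removal step can be sketched with a basic RANSAC plane fit in plain Python. Only the 0.005 distance threshold comes from the text; the number of iterations, the fixed seed, and the function names are assumptions, and a practical pipeline would typically rely on an optimized library routine instead.

```python
import math
import random


def fit_plane(p, q, r):
    """Plane through three points as (unit normal, offset); None if collinear."""
    u = [q[i] - p[i] for i in range(3)]
    v = [r[i] - p[i] for i in range(3)]
    n = [u[1] * v[2] - u[2] * v[1],
         u[2] * v[0] - u[0] * v[2],
         u[0] * v[1] - u[1] * v[0]]  # cross product u x v
    norm = math.sqrt(sum(c * c for c in n))
    if norm < 1e-12:
        return None
    n = [c / norm for c in n]
    return n, -sum(n[i] * p[i] for i in range(3))


def remove_background(points, threshold=0.005, iters=200, seed=0):
    """Indices of points farther than `threshold` from the best RANSAC plane."""
    rng = random.Random(seed)
    best_inliers = set()
    for _ in range(iters):
        plane = fit_plane(*rng.sample(points, 3))
        if plane is None:
            continue  # degenerate sample, try again
        n, d = plane
        inliers = {
            i for i, pt in enumerate(points)
            if abs(sum(n[k] * pt[k] for k in range(3)) + d) <= threshold
        }
        if len(inliers) > len(best_inliers):
            best_inliers = inliers
    return [i for i in range(len(points)) if i not in best_inliers]


# Toy scene: a flat 10 x 10 background at z = 0 plus ten raised object points.
plane_pts = [(0.1 * i, 0.1 * j, 0.0) for i in range(10) for j in range(10)]
object_pts = [(0.45 + 0.01 * i, 0.45, 0.1 + 0.02 * i) for i in range(10)]
foreground = remove_background(plane_pts + object_pts)
```

In the toy scene the dominant plane is the background, so only the indices of the ten object points remain in `foreground`. The corresponding RGB pixels would then be set to 0, as described above.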

Feature Extractors. In Stage I&II, we use the text and image encoders of the LAION-2B-based CLIP with ViT-H/14 and the point cloud encoder from Point-BIND. In Stage III, we use a ViT-B/8 pretrained on ImageNet [78] with DINO [79] as the RGB image encoder and a Point Transformer [80, 81] pretrained on the ShapeNet [82] dataset as the 3D point cloud encoder; the outputs of layers $\{3,7,11\}$ are used as our 3D point cloud feature.

Learnable Module Details. Stage I&II are training-free, and Stage III has two learnable modules: the Unsupervised Feature Fusion module and the Decision Layer Fusion module. 1) For UFF, $\chi_I$ and $\chi_P$ are two-layer MLPs whose hidden dimension is $4\times$ the input feature dimension. We use the AdamW optimizer with a learning rate of 0.003 and a cosine warm-up over 250 steps. The batch size is 16, and we report the best anomaly detection results within 750 UFF training steps. 2) For DLF, we use two linear OCSVMs [83] with SGD [84] optimizers; the learning rate is set to $1\times 10^{-4}$, and each class is trained for 1000 steps.

Evaluation Metrics. All evaluation metrics are exactly the same as in [10]. We evaluate image-level anomaly detection performance with the area under the receiver operating characteristic curve (I-AUROC); a higher I-AUROC means better image-level anomaly detection performance. For segmentation evaluation, we use the per-region overlap (AUPRO) metric, where per-region overlap is defined as the average relative overlap of the binary prediction with each connected component of the ground truth. Similar to I-AUROC, the receiver operating characteristic curve of pixel-level predictions can be used to calculate P-AUROC for evaluating segmentation performance.
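For reference, the AUROC underlying I-AUROC and P-AUROC can be computed with the rank-based (Mann-Whitney) formulation sketched below: it is the fraction of anomalous/normal pairs in which the anomalous sample receives the higher score. The function name is illustrative, and the quadratic pairwise loop is chosen for clarity rather than speed.

```python
def auroc(labels, scores):
    """Area under the ROC curve via pairwise comparisons.

    labels: 1 for anomalous, 0 for normal; scores: higher means more anomalous.
    A tie between an anomalous and a normal score counts as half a correct ordering.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one sample of each class")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


value = auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])  # 3 of 4 pairs ordered correctly
```

In the example, three of the four anomalous/normal pairs are ranked correctly, so the AUROC is 0.75.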

TABLE III: I-AUROC score for anomaly detection under the Overlap setting for all categories in MVTec 3D-AD. Our method clearly outperforms other methods in the 3D, RGB, and 3D + RGB settings, indicating its superior anomaly detection ability. We report the mean and standard deviation over 3 random seeds for each measurement. Optimal and sub-optimal results are in bold and underlined, respectively.
Method Bagel Cable Gland Carrot Cookie Dowel Foam Peach Potato Rope Tire Mean
3D SIFT 50.0±0.8 48.5±1.9 67.8±0.2 58.1±0.4 58.2±3.8 49.2±2.8 40.5±0.6 47.0±1.3 43.3±1.1 45.0±2.7 50.8±0.5
FPFH 53.4±2.8 40.9±3.2 71.4±1.2 62.7±0.8 64.5±2.4 38.5±0.3 46.8±2.6 45.3±1.5 52.2±1.5 51.5±4.2 52.7±0.3
AST 61.0±0.6 38.4±0.6 \ul{72.9±0.6} 75.2±0.6 47.8±0.6 55.7±0.6 \ul{66.9±0.6} 60.6±0.6 55.5±1.0 49.2±0.6 58.3±0.2
Shape-Guided 66.1±5.1 \ul{58.7±10.4} 71.4±6.0 \ul{76.4±1.4} 71.6±0.7 54.1±3.1 61.0±4.5 59.3±5.7 60.7±4.5 64.3±7.4 64.4±1.9
M3DM \ul{74.0±0.7} 56.7±1.8 72.2±1.7 74.5±0.6 \ul{77.4±0.7} \ul{62.3±0.6} 56.2±1.9 \ul{64.1±0.5} \ul{72.5±0.5} \ul{74.3±1.8} \ul{68.4±0.7}
Ours 93.5±1.6 71.8±1.3 93.8±0.7 91.1±2.3 78.0±2.7 67.2±3.2 79.9±1.4 79.9±2.2 87.9±0.4 79.8±3.5 82.3±0.4
RGB PaDim 70.8±0.7 57.3±2.6 54.7±0.5 43.2±1.6 72.1±0.3 55.4±2.2 61.7±0.3 36.8±1.3 74.8±2.5 55.2±1.5 58.2±0.4
PatchCore 64.9±0.7 71.4±0.9 71.5±1.5 52.5±2.2 73.3±1.2 56.5±2.9 46.6±1.1 36.8±0.4 54.2±1.3 57.2±1.3 58.5±0.4
AST 57.6±0.6 62.2±0.0 50.7±0.0 47.5±0.6 58.8±0.0 56.0±0.0 54.6±0.0 43.7±0.6 42.8±0.0 44.6±0.6 51.8±0.2
Shape-Guided 62.7±4.4 64.3±9.3 66.9±7.3 57.3±16.4 72.1±0.9 51.5±3.2 52.9±10.0 \ul{50.3±11.1} 50.5±9.4 58.2±9.3 58.7±5.8
SoftPatch \ul{88.8±1.1} \ul{87.3±2.2} \ul{84.9±1.3} \ul{63.3±1.2} 96.5±0.8 \ul{75.0±1.6} \ul{62.3±0.7} 43.6±2.1 \ul{89.3±1.4} \ul{71.0±0.9} \ul{76.2±0.3}
M3DM 64.1±1.4 62.1±2.1 65.5±0.9 53.6±2.1 70.7±0.9 57.0±1.2 54.7±2.0 42.1±2.3 53.8±1.1 58.3±0.9 58.2±0.5
Ours 90.3±plus-or-minus\pm±0.4 87.5±plus-or-minus\pm±3.4 86.5±plus-or-minus\pm±1.8 67.1±plus-or-minus\pm±4.6 \ul86.1±plus-or-minus\pm±0.6 79.2±plus-or-minus\pm±2.8 84.4±plus-or-minus\pm±2.3 54.6±plus-or-minus\pm±6.2 90.0±plus-or-minus\pm±2.2 73.1±plus-or-minus\pm±1.1 79.9±plus-or-minus\pm±0.4
3D+RGB PatchCore+FPFH 61.3±plus-or-minus\pm±2.7 58.3±plus-or-minus\pm±0.9 72.3±plus-or-minus\pm±0.4 69.0±plus-or-minus\pm±1.1 67.2±plus-or-minus\pm±1.0 47.1±plus-or-minus\pm±1.9 53.0±plus-or-minus\pm±2.0 52.1±plus-or-minus\pm±1.3 52.7±plus-or-minus\pm±1.0 68.2±plus-or-minus\pm±0.8 60.1±plus-or-minus\pm±0.4
AST 65.3±plus-or-minus\pm±0.6 \ul69.5±plus-or-minus\pm±0.6 73.8±plus-or-minus\pm±0.6 \ul83.1±plus-or-minus\pm±0.0 68.1±plus-or-minus\pm±0.6 \ul64.4±plus-or-minus\pm±0.6 \ul64.7±plus-or-minus\pm±0.6 \ul64.1±plus-or-minus\pm±0.6 49.7±plus-or-minus\pm±0.6 55.8±plus-or-minus\pm±0.0 65.8±plus-or-minus\pm±0.0
Shape-Guided 69.1±plus-or-minus\pm±0.7 67.2±plus-or-minus\pm±1.4 \ul76.3±plus-or-minus\pm±0.5 71.3±plus-or-minus\pm±0.8 71.8±plus-or-minus\pm±0.3 58.0±plus-or-minus\pm±0.3 62.0±plus-or-minus\pm±0.3 60.4±plus-or-minus\pm±0.7 55.3±plus-or-minus\pm±0.3 67.8±plus-or-minus\pm±0.6 65.9±plus-or-minus\pm±0.2
M3DM \ul72.5±plus-or-minus\pm±2.2 62.4±plus-or-minus\pm±0.8 69.6±plus-or-minus\pm±1.4 72.4±plus-or-minus\pm±2.1 \ul73.9±plus-or-minus\pm±0.9 64.3±plus-or-minus\pm±2.0 60.1±plus-or-minus\pm±0.3 54.0±plus-or-minus\pm±2.0 \ul62.1±plus-or-minus\pm±1.8 \ul71.4±plus-or-minus\pm±2.1 \ul66.3±plus-or-minus\pm±0.5
Ours 96.7±plus-or-minus\pm±2.1 86.2±plus-or-minus\pm±3.0 95.5±plus-or-minus\pm±1.3 90.3±plus-or-minus\pm±3.4 86.0±plus-or-minus\pm±3.0 79.1±plus-or-minus\pm±3.7 86.6±plus-or-minus\pm±3.7 72.2±plus-or-minus\pm±3.3 92.0±plus-or-minus\pm±0.5 81.3±plus-or-minus\pm±1.6 86.6±plus-or-minus\pm±1.3
TABLE IV: AUPRO score for anomaly segmentation under Overlap setting of all categories in MVTec 3D-AD. Our method clearly outperforms other methods in 3D, RGB, and 3D + RGB settings, indicating the superior anomaly segmentation ability of our method. We report the mean and standard deviation over 3 random seeds for each measurement. Optimal and sub-optimal results are in bold and underlined, respectively.
Method Bagel Cable Gland Carrot Cookie Dowel Foam Peach Potato Rope Tire Mean
3D SIFT 69.1±1.6 68.2±0.8 85.3±0.4 72.3±0.8 67.1±1.4 55.7±1.5 64.3±1.4 66.6±1.7 69.9±0.8 72.6±1.2 69.1±0.4
FPFH 70.5±1.6 73.7±0.6 88.5±0.2 72.6±0.8 72.6±2.7 56.7±2.4 66.7±1.6 75.0±2.2 65.5±1.8 77.2±1.3 71.9±0.4
Shape-Guided 74.6±0.6 83.7±2.2 98.1±0.1 \ul{81.9±5.4} 88.6±0.1 80.4±6.7 \ul{88.9±7.3} 88.2±0.0 88.7±3.6 93.7±5.5 \ul{86.7±1.7}
M3DM \ul{84.0±1.0} \ul{79.7±1.1} 95.8±0.4 79.6±1.3 \ul{85.5±0.6} \ul{68.3±1.6} 86.4±0.9 91.3±0.8 \ul{90.3±1.5} 88.7±0.4 85.0±0.4
Ours 95.0±1.3 78.8±0.8 \ul{97.2±0.1} 84.5±1.4 83.9±3.0 66.6±2.4 91.2±1.6 \ul{89.9±0.6} 92.7±0.5 \ul{89.9±0.7} 87.0±0.2
RGB PaDim 77.9±2.7 79.9±3.8 \ul{91.8±0.2} 72.2±1.3 \ul{90.0±0.7} 92.4±1.9 91.4±1.2 92.6±1.2 \ul{91.3±1.3} 92.2±0.8 \ul{87.2±0.7}
PatchCore 67.1±1.7 73.3±0.0 77.0±0.3 72.1±0.8 69.9±1.2 59.1±2.4 61.7±1.2 64.3±1.1 56.1±1.6 73.1±1.2 67.4±0.8
Shape-Guided 67.5±0.6 73.9±0.7 81.2±0.1 72.1±0.1 76.1±0.6 56.0±0.0 62.5±0.2 71.6±1.0 64.7±0.5 73.8±0.1 69.9±0.1
SoftPatch \ul{83.9±2.0} \ul{89.3±2.7} 91.4±0.5 \ul{79.2±0.7} 91.8±1.8 72.4±2.8 76.5±2.4 72.9±2.7 89.8±2.6 90.1±1.7 83.7±0.3
M3DM 68.6±1.7 72.7±0.8 77.4±0.3 70.5±0.6 68.6±1.3 59.8±1.4 64.9±1.4 65.0±1.4 57.0±0.8 75.1±1.2 68.0±0.7
Ours 93.1±1.6 91.9±1.3 96.1±0.4 82.1±1.8 81.5±5.6 \ul{73.9±1.0} \ul{90.4±2.1} \ul{84.3±1.4} 94.2±1.0 \ul{90.2±0.6} 87.8±0.5
3D+RGB PatchCore+FPFH 70.4±1.5 72.8±0.6 77.9±0.3 77.5±1.0 68.8±1.5 64.9±1.0 65.0±1.7 65.9±1.3 56.4±0.8 75.3±1.3 69.5±0.6
Shape-Guided \ul{74.6±0.6} \ul{80.9±0.5} \ul{93.6±0.3} \ul{79.3±0.9} 89.3±0.9 \ul{76.6±0.2} \ul{82.4±0.2} 94.0±0.3 \ul{86.6±0.1} 93.7±0.8 \ul{85.1±0.0}
M3DM 69.0±1.4 72.5±0.8 77.8±0.4 72.8±1.0 68.0±1.5 61.3±0.7 65.2±1.5 65.3±1.4 57.2±0.8 75.3±1.2 68.4±0.6
Ours 95.9±1.3 92.0±1.2 96.7±0.4 90.4±1.1 \ul{84.6±2.3} 83.4±1.7 91.9±2.7 \ul{85.8±1.7} 94.5±0.3 \ul{91.4±0.5} 90.7±0.2

4.2 Regular Anomaly Detection on MVTec 3D-AD

In the regular anomaly detection setting, we compare our method with several 3D-based, RGB-based, and hybrid multi-modal 3D/RGB methods on MVTec 3D-AD. Tabs. I and II report the anomaly detection results in I-AUROC and the segmentation results in AUPRO, respectively. The P-AUROC results are reported in the section “P-AUROC for regular anomaly segmentation on MVTec 3D-AD”. From Tabs. I and II, we conclude that M3DM-NR also maintains its anomaly detection ability in the regular setting.

4.3 Noisy Anomaly Detection on MVTec 3D-AD

TABLE V: I-AUROC score for anomaly detection under Non-Overlap setting of all categories in MVTec 3D-AD. Our method clearly outperforms other methods in 3D, RGB, and 3D + RGB settings, indicating the superior anomaly detection ability of our method. We report the mean and standard deviation over 3 random seeds for each measurement. Optimal and sub-optimal results are in bold and underlined, respectively.
Method Bagel Cable Gland Carrot Cookie Dowel Foam Peach Potato Rope Tire Mean
3D SIFT 68.8±1.1 65.0±2.6 86.1±0.3 72.9±0.6 79.7±5.2 69.1±3.9 61.3±1.0 69.7±1.9 74.6±1.9 59.3±3.6 70.7±0.6
FPFH 73.4±3.8 54.8±4.3 90.7±1.5 78.5±0.9 88.3±3.3 54.0±0.4 70.9±3.8 67.2±2.3 90.0±2.6 67.9±5.6 73.6±0.4
AST 82.8±0.6 51.9±0.6 91.3±0.6 \ul{92.3±1.2} 64.3±1.2 78.5±0.2 98.3±2.9 90.3±0.3 94.7±1.7 63.3±1.2 80.8±0.9
Shape-Guided \ul{90.2±0.9} 67.5±0.1 \ul{91.4±0.3} 92.1±1.2 80.8±10.1 67.7±4.1 \ul{86.5±7.4} 87.1±1.0 89.6±1.3 83.3±6.7 \ul{83.6±2.0}
M3DM 87.1±0.8 \ul{68.2±1.2} 79.4±3.1 87.8±1.3 83.8±2.8 \ul{73.0±2.5} 76.6±2.6 82.6±0.7 \ul{92.9±2.0} 80.0±1.6 81.1±0.8
Ours 94.5±0.6 74.4±2.4 94.8±0.9 93.7±0.8 \ul{83.8±1.1} 72.8±3.5 84.0±0.2 \ul{87.3±0.4} 89.8±1.3 \ul{82.2±1.2} 85.7±0.7
RGB PaDim 93.0±1.0 73.3±3.3 66.3±0.7 52.4±2.0 88.3±1.0 72.2±3.2 \ul{84.3±1.3} 50.7±2.2 91.9±2.7 68.6±2.2 74.1±0.6
PatchCore 89.2±0.9 95.2±1.4 90.8±1.9 65.9±2.8 97.5±1.0 77.4±4.7 70.6±1.7 54.6±0.6 93.5±2.2 75.4±1.7 81.0±0.7
AST 79.5±0.1 83.1±0.1 63.2±0.8 60.2±0.1 80.7±0.6 77.5±1.8 81.1±1.0 63.4±0.1 74.3±0.8 59.2±0.0 72.2±0.1
Shape-Guided 79.3±1.0 89.6±2.4 77.4±0.3 58.6±2.0 94.3±0.2 71.4±3.6 67.7±0.7 \ul{62.1±0.0} 72.0±1.6 66.5±0.3 73.9±0.8
SoftPatch 90.6±0.2 91.8±1.7 87.6±0.4 \ul{67.8±0.8} 98.0±0.6 \ul{78.0±4.8} 70.6±0.7 55.3±1.5 93.4±2.7 75.6±1.2 80.9±0.4
M3DM 87.7±2.3 83.0±2.7 83.1±1.1 66.4±1.7 \ul{96.7±1.4} 77.7±1.7 82.7±3.1 62.5±3.4 92.9±1.8 76.7±1.2 \ul{80.9±0.8}
Ours \ul{90.8±1.3} \ul{90.2±4.0} \ul{86.9±1.8} 68.0±3.6 91.0±3.6 83.2±1.8 88.7±2.1 57.7±6.7 \ul{93.3±1.1} \ul{75.9±1.6} 82.6±0.5
3D+RGB PatchCore+FPFH 81.1±4.0 77.8±1.4 91.7±0.5 84.5±1.6 91.8±1.3 64.8±2.6 79.5±3.1 77.3±1.9 90.9±1.6 89.8±1.1 82.9±0.8
AST 85.4±0.6 \ul{88.9±0.6} 91.3±0.6 95.6±0.6 89.2±1.0 85.9±0.6 \ul{92.8±0.6} 91.6±0.6 79.6±0.6 70.0±0.6 87.0±0.3
Shape-Guided 91.0±0.5 86.3±2.0 \ul{94.2±0.5} 86.4±1.0 \ul{94.2±0.1} 77.1±0.5 88.6±0.1 \ul{85.8±1.0} 88.3±0.1 \ul{85.1±0.2} \ul{87.7±0.3}
M3DM \ul{96.6±2.2} 85.7±1.9 88.4±2.5 86.4±3.1 96.1±1.3 \ul{86.3±5.4} 85.1±0.6 76.5±2.3 \ul{94.8±1.3} 79.3±2.4 87.5±0.5
Ours 98.1±0.8 91.0±2.6 96.8±0.8 \ul{94.2±2.0} 93.7±0.8 90.6±2.0 92.9±1.6 81.9±2.0 95.3±1.4 84.7±2.4 91.9±1.0
TABLE VI: AUPRO score for anomaly segmentation under Non-Overlap setting of all categories in MVTec 3D-AD. Our method clearly outperforms other methods in 3D and 3D + RGB settings, indicating the superior anomaly segmentation ability of our method. We report the mean and standard deviation over 3 random seeds for each measurement. Optimal and sub-optimal results are in bold and underlined, respectively.
Method Bagel Cable Gland Carrot Cookie Dowel Foam Peach Potato Rope Tire Mean
3D SIFT 86.4±0.0 70.2±0.0 90.3±0.0 86.1±0.0 90.6±0.0 60.3±0.0 85.0±0.0 95.3±0.0 93.8±0.0 86.3±0.0 84.4±0.0
FPFH 92.6±0.0 78.3±0.0 92.1±0.0 85.5±0.0 \ul{88.2±0.0} 68.3±0.0 90.5±0.0 94.3±0.0 92.1±0.0 90.3±0.0 87.2±0.0
Shape-Guided \ul{95.6±0.0} 80.3±0.0 98.1±0.0 89.5±0.0 \ul{88.2±0.0} 70.3±0.0 95.2±0.6 96.3±0.0 93.1±0.0 93.7±0.0 90.0±0.1
M3DM 93.7±0.5 \ul{81.1±0.3} \ul{97.6±0.2} 86.3±0.4 87.9±1.3 75.3±4.6 \ul{95.4±0.2} 96.9±0.4 94.6±0.4 92.7±0.3 \ul{90.1±0.6}
Ours 95.8±0.3 81.2±0.4 \ul{97.6±0.1} \ul{86.6±0.7} 88.0±1.1 \ul{73.0±4.0} 95.5±0.4 \ul{96.5±0.1} \ul{94.2±0.6} \ul{93.5±0.8} 90.2±0.5
RGB PaDim 93.0±2.4 87.5±2.6 93.7±0.4 86.8±0.9 92.7±1.3 93.3±7.0 94.9±0.5 95.0±1.0 92.4±0.6 94.9±0.6 92.4±0.5
PatchCore 90.9±0.6 97.0±0.1 96.2±0.5 \ul{88.4±0.5} 95.7±0.4 79.1±2.5 89.2±0.5 93.4±0.9 96.5±0.7 95.1±0.2 92.2±0.2
Shape-Guided 90.2±1.9 94.5±2.2 94.9±1.3 86.5±1.2 93.6±0.5 74.8±6.5 90.7±4.0 92.4±1.7 91.8±4.3 93.3±2.2 90.3±2.2
SoftPatch 93.2±0.3 96.1±0.1 96.4±0.1 89.7±0.7 \ul{95.3±0.5} 78.4±1.7 90.0±0.3 93.5±0.7 96.2±0.7 94.7±0.5 92.3±0.2
M3DM \ul{93.5±0.3} \ul{96.8±0.3} 96.9±0.5 86.0±0.6 93.8±0.8 79.2±1.6 96.2±0.4 94.8±0.6 96.8±0.4 96.9±0.1 93.1±0.1
Ours 93.7±0.9 96.0±0.6 \ul{96.8±0.3} 84.0±1.5 92.4±1.0 \ul{79.5±2.4} \ul{95.6±0.1} \ul{94.8±0.6} \ul{96.8±0.6} \ul{95.3±0.3} \ul{92.5±0.2}
3D+RGB PatchCore+FPFH \ul{96.6±0.4} 96.1±1.2 97.7±0.5 92.6±3.2 92.5±1.4 89.1±0.5 \ul{96.5±0.2} \ul{96.7±0.2} 95.3±1.1 97.2±0.1 95.0±0.4
Shape-Guided 93.5±0.1 94.0±0.2 97.5±0.3 93.0±0.3 95.5±0.1 93.1±0.8 95.3±0.1 97.9±0.1 95.6±0.1 \ul{97.2±0.2} \ul{95.2±0.1}
M3DM 94.3±0.8 96.5±0.3 97.4±0.5 89.2±0.2 92.7±0.9 82.8±1.0 96.4±0.3 95.4±0.6 97.2±0.4 96.7±0.3 93.9±0.2
Ours 96.9±0.3 \ul{96.3±0.2} \ul{97.6±0.0} \ul{92.7±0.5} \ul{93.9±0.4} \ul{91.8±1.3} 97.0±0.5 96.4±0.1 \ul{97.0±0.2} 96.5±0.1 95.6±0.1

In the noisy anomaly detection setting, we compare our method with several 3D-based, RGB-based, and hybrid multi-modal 3D/RGB methods on MVTec 3D-AD. Tabs. III and V report the anomaly detection results in I-AUROC under the Overlap and Non-Overlap settings, respectively. Tabs. IV and VI report the segmentation results in AUPRO under the Overlap and Non-Overlap settings, respectively. The P-AUROC results are reported in the section “P-AUROC for noisy anomaly segmentation on MVTec 3D-AD”.

Overlap and Non-Overlap Analysis. Our margin over the baseline methods is considerably larger in the Overlap setting than in the Non-Overlap setting, especially for anomaly detection (I-AUROC). Specifically, in the Overlap setting our approach exceeds the second-best method by 13.9, 3.7, and 20.3 percentage points in mean I-AUROC for the 3D, RGB, and 3D+RGB settings, respectively. This indicates the effectiveness of the sample-level denoising in Stages I & II: most baseline methods struggle when anomalies exist in both the training and test datasets, including approaches like SoftPatch [8], which only perform denoising at the patch level, whereas our method remains largely unaffected. This demonstrates the enhanced robustness of our proposed Stages I & II, especially when defects with similar appearances exist in both the training and test datasets, a common scenario in real-world industrial settings.
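The margins quoted above can be reproduced from the Mean column of Tab. III. The short sketch below copies those mean I-AUROC values and computes the gap between our method and the best baseline in each setting; the helper name `margin_over_runner_up` is hypothetical, and the gap is an absolute difference in percentage points.

```python
# Mean I-AUROC (%) under the Overlap setting, copied from the Mean column of Tab. III.
overlap_i_auroc = {
    "3D": {"SIFT": 50.8, "FPFH": 52.7, "AST": 58.3,
           "Shape-Guided": 64.4, "M3DM": 68.4, "Ours": 82.3},
    "RGB": {"PaDim": 58.2, "PatchCore": 58.5, "AST": 51.8, "Shape-Guided": 58.7,
            "SoftPatch": 76.2, "M3DM": 58.2, "Ours": 79.9},
    "3D+RGB": {"PatchCore+FPFH": 60.1, "AST": 65.8, "Shape-Guided": 65.9,
               "M3DM": 66.3, "Ours": 86.6},
}


def margin_over_runner_up(results, ours="Ours"):
    """Return (best baseline name, our score minus its score), rounded to one decimal."""
    name, score = max(
        ((n, v) for n, v in results.items() if n != ours),
        key=lambda item: item[1],
    )
    return name, round(results[ours] - score, 1)


for setting, results in overlap_i_auroc.items():
    baseline, gap = margin_over_runner_up(results)
    print(f"{setting}: +{gap} points over {baseline}")
```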

3D-Based. On pure 3D anomaly detection, we obtain the highest mean I-AUROC, outperforming M3DM [9] by 13.9% in Overlap and Shape-Guided [44] by 2.1% in Non-Overlap. For segmentation, we obtain the best mean AUPRO, outperforming Shape-Guided by 0.3% in Overlap and M3DM by 0.1% in Non-Overlap. This shows that our method improves detection substantially and segmentation slightly over previous methods, and that, with our PFA, the Point Transformer is a better 3D feature extractor for this task.

RGB-Based. Our I-AUROC in the RGB domain is 3.7% higher than SoftPatch in Overlap and 1.7% higher than SoftPatch and M3DM in Non-Overlap. For segmentation, we obtain the highest AUPRO in Overlap, 0.6% higher than PaDim, and the second-best score in Non-Overlap.

Hybrid 3D/RGB. On multi-modal 3D/RGB anomaly detection, we obtain the highest I-AUROC, outperforming M3DM by 20.3% in Overlap and Shape-Guided by 4.2% in Non-Overlap. For segmentation, we obtain the best AUPRO, outperforming Shape-Guided by 5.6% in Overlap (90.7 vs. 85.1 in Tab. IV) and by 0.4% in Non-Overlap. We attribute these results to our fusion strategy and the strong 3D anomaly detection results.

Figure 7: Heatmap of our anomaly segmentation results (multi-modal inputs) under Overlap setting. Compared with existing methods, our method remains unaffected by noise and outputs a more accurate segmentation region.
TABLE VII: Main ablation study of M3DM-NR. Stage I&II indicates removing stage I&II, $R$ indicates removing Intra-modality Reference Selection, $\mathcal{H}_{P}$ indicates removing Aligned Multi-scale Point Cloud Extraction and $W$ indicates removing Noise-focused Aggregation. Noise-level refers to the percentage of noise data in the entire training set after denoising in stage I&II. We report the mean and standard deviation over 3 random seeds for each measurement. Optimal and sub-optimal results are in bold and underlined, respectively.
Stage I&II $R$ $\mathcal{H}_{P}$ $W$ Overlap Non-Overlap Noise-level ↓
I-AUROC ↑ P-AUROC ↑ AUPRO ↑ I-AUROC ↑ P-AUROC ↑ AUPRO ↑
66.4±0.4 72.9±0.9 66.5±3.4 87.7±0.5 98.7±0.1 94.5±0.2 9.09±0.00
79.7±1.1 89.2±1.1 84.5±0.6 88.6±0.6 98.8±0.1 94.9±0.3 5.13±0.13
82.6±0.7 92.7±0.5 87.8±0.3 89.2±0.8 \ul{98.7±0.0} 94.9±0.1 3.87±0.08
\ul{86.2±0.5} \ul{94.3±0.4} \ul{90.3±0.5} \ul{91.3±0.2} 98.9±0.1 \ul{95.4±0.0} \ul{2.79±0.18}
86.6±1.3 94.6±0.3 90.7±0.2 91.9±1.0 98.9±0.0 95.6±0.1 2.73±0.05

4.4 Visualization Results

In this section, we visualize anomaly segmentation results for all categories of the MVTec 3D-AD dataset under the Overlap setting. Fig. 7 shows the heatmaps of our method, PatchCore + FPFH [20], M3DM [9], and Shape-Guided [44] with multi-modal inputs. Our method produces more accurate segmentation maps and is more resilient to dataset noise, whereas the earlier approaches are often confounded by noisy samples within the dataset. This is particularly noticeable in the Cable Gland, Dowel, Foam, and Peach results for PatchCore + FPFH, as well as the Foam and Rope results for Shape-Guided. More visualization results under the Non-Overlap setting are shown in the section “Visualization results of Non-Overlap setting”.

4.5 Ablation Study

We conduct an ablation study on the main components introduced in Sec. 3, namely the two-stage sample-level denoising of Stages I & II, Intra-modality Reference Selection, Aligned Multi-Scale Point Cloud Feature Extraction, and Noise-Focused Aggregation. The results are displayed in Tab. VII. Incrementally including each component generally improves I-AUROC, P-AUROC, and AUPRO under both the Overlap and Non-Overlap settings, with the largest gains under the more challenging Overlap setting. Besides these metrics, the Noise-level metric also shows that the model's capability for sample-level denoising increases progressively with the addition of each module.
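To make the Noise-level metric concrete, the following minimal sketch computes it as defined in the caption of Tab. VII, i.e., the percentage of noisy (anomalous) samples among the training samples that remain after Stage I & II denoising. The function name `noise_level` and the per-sample flag representation are hypothetical, and the sample counts in the usage example are illustrative only; they are not the actual dataset sizes.

```python
def noise_level(is_anomalous, kept):
    """Percentage of anomalous samples among the training samples kept after denoising.

    is_anomalous: one 0/1 flag per training sample (1 = noisy/anomalous sample).
    kept:         one boolean per training sample (True = sample survives denoising).
    """
    remaining = [a for a, k in zip(is_anomalous, kept) if k]
    if not remaining:
        raise ValueError("no training samples remain after denoising")
    return 100.0 * sum(remaining) / len(remaining)


# Illustrative counts: 100 normal samples plus 10 anomalous ones.
flags = [0] * 100 + [1] * 10

# Without any denoising every sample is kept.
print(round(noise_level(flags, [True] * 110), 2))

# If denoising removes 7 of the 10 anomalous samples and no normal ones:
kept = [True] * 100 + [False] * 7 + [True] * 3
print(round(noise_level(flags, kept), 2))
```

Because the denominator is the number of remaining samples, removing normal samples by mistake also raises the metric, so a lower value reflects both more noise removed and fewer clean samples discarded.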

Different Scales. We also conduct an ablation study on the feature scales extracted in the Aligned Multi-Scale Point Cloud Feature Extraction, with results presented in Tab. VIII. The model performance varies across different scale configurations. Notably, when incorporating all scales, all performance metrics peaked, demonstrating that multi-scale consideration can enhance model performance. When the small scale is excluded, our model performs nearly as well as the full configuration, indicating that omitting small-scale processing has a relatively minor impact. This could be attributed to small-scale patches often containing too few point cloud points, many of which might be deemed insignificant and discarded during segmentation.

TABLE VIII: Ablation Study on Aligned Multi-scale Point Cloud Extraction. w/o multi-scale represents removing all big, mid, and small scales.
Methods w/o multi-scale w/o big-scale w/o mid-scale w/o small-scale Full
Over I-AUROC ↑ 82.6±0.7 84.6±1.0 83.7±1.2 \ul{85.3±0.4} 86.6±1.3
P-AUROC ↑ 92.7±0.5 94.0±0.3 93.6±0.2 \ul{94.2±0.4} 94.6±0.3
AUPRO ↑ 87.8±0.3 89.8±0.3 89.4±0.4 \ul{90.2±0.2} 90.7±0.2
N-Over I-AUROC ↑ 89.2±0.8 89.6±0.6 89.1±0.9 \ul{89.8±0.1} 91.9±1.0
P-AUROC ↑ 98.7±0.0 98.7±0.1 \ul{98.8±0.1} \ul{98.8±0.1} 98.9±0.0
AUPRO ↑ 94.9±0.1 95.0±0.2 \ul{95.4±0.2} \ul{95.4±0.2} 95.6±0.1
Noise-level ↓ 3.87±0.08 2.77±0.18 3.18±0.20 \ul{2.76±0.07} 2.73±0.05
TABLE IX: Exploring Aligned Multi-scale Point Cloud Extraction Setting. $\theta$ represents the threshold for the minimum number of points required in a point cloud patch to be considered meaningful.
$\theta$ 128 256 512 1024
Over I-AUROC ↑ 86.6±1.3 \ul{86.2±0.6} 85.9±0.4 84.0±0.5
P-AUROC ↑ 94.6±0.3 94.3±0.7 \ul{94.4±0.1} 93.4±0.4
AUPRO ↑ 90.7±0.2 \ul{90.3±0.4} 90.2±0.4 89.0±0.5
N-Over I-AUROC ↑ 91.9±1.0 \ul{91.4±0.9} 91.0±0.2 89.4±0.5
P-AUROC ↑ 98.9±0.0 \ul{98.9±0.0} \ul{98.9±0.1} 98.7±0.3
AUPRO ↑ 95.5±0.1 \ul{95.5±0.1} 95.4±0.1 95.0±0.3
Noise-level ↓ 2.73±0.05 2.73±0.05 2.75±0.13 3.46±0.10

Point Cloud Threshold. We also perform an ablation study on the hyper-parameter $\theta$, the threshold for the minimum number of points required in a point cloud patch for it to be considered meaningful. The experimental results are shown in Tab. IX. Given that the point cloud encoder used in our experiments has a minimum group size of 128, we start our testing from this threshold. The findings indicate that for most metrics a threshold of 128 points is the most appropriate. This aligns with expectations, as a lower threshold means that more patches are considered when computing the anomaly score, potentially leading to better accuracy. Therefore, after balancing computational complexity and the accuracy of RGB-3D multi-modal anomaly detection, we set the threshold $\theta$ to 128 in this paper.
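The role of the threshold can be illustrated with a minimal sketch, assuming each point cloud patch is represented as a list of 3D points and that a patch is kept when it contains at least $\theta$ points. The function name `meaningful_patches` and this data layout are hypothetical and are not taken from our implementation.

```python
def meaningful_patches(patches, theta=128):
    """Keep only the point cloud patches that contain at least `theta` points."""
    return [patch for patch in patches if len(patch) >= theta]


# Toy patches with 100, 128, 300, and 600 points, respectively.
patches = [[(0.0, 0.0, 0.0)] * n for n in (100, 128, 300, 600)]

# A lower threshold keeps more patches for anomaly scoring.
for theta in (128, 256, 512, 1024):
    print(theta, len(meaningful_patches(patches, theta)))
```

With the toy sizes above, raising the threshold progressively discards patches, which mirrors the trade-off discussed in the text: fewer patches reduce computation but leave less evidence for the anomaly score.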

$\lambda_{I}$ and $\lambda_{P}$.

TABLE X: Exploring RGB and Point Cloud Integration Setting. $\lambda_{rgb}$ and $\lambda_{pc}$ are hyper-parameters controlling the extent to which RGB and point cloud modalities are integrated.
($\lambda_{rgb}$, $\lambda_{pc}$) (1.0, 1.3) (1.0, 1.4) (1.0, 1.5) (1.0, 1.6) (1.0, 1.7)
Over I-AUROC ↑ 86.1±0.7 85.6±0.5 86.6±1.3 86.1±1.0 86.1±1.0
P-AUROC ↑ 94.3±0.7 94.2±0.7 94.6±0.3 94.2±0.7 94.2±0.0
AUPRO ↑ 90.3±0.4 90.2±0.3 90.7±0.3 90.3±0.4 90.3±0.3
N-Over I-AUROC ↑ 91.3±0.5 90.7±0.8 91.9±1.0 91.2±1.1 91.1±0.8
P-AUROC ↑ 98.9±0.1 98.9±0.1 98.9±0.0 98.9±0.0 98.9±0.1
AUPRO ↑ 95.4±0.2 95.4±0.1 95.5±0.1 95.4±0.2 95.5±0.2
Noise-level ↓ 2.74±0.09 2.75±0.07 2.71±0.19 2.72±0.04 2.75±0.06

TABLE XI: Exploring the Number of Intra-modal Reference Samples. Ref Num represents the number of intra-modal reference samples selected.

| Setting | Metric | Ref Num 0 | Ref Num 1 | Ref Num 2 | Ref Num 3 | Ref Num 4 |
|---|---|---|---|---|---|---|
| Over | I-AUROC ↑ | 80.7±0.9 | 84.8±0.7 | 85.6±1.5 | 86.1±0.5 | 86.6±1.3 |
| Over | P-AUROC ↑ | 89.4±1.3 | 93.5±0.4 | 93.8±0.3 | 93.9±0.2 | 94.6±0.3 |
| Over | AUPRO ↑ | 85.5±0.7 | 89.3±0.1 | 89.8±0.3 | 90.0±0.4 | 90.7±0.3 |
| N-Over | I-AUROC ↑ | 88.7±0.9 | 90.6±0.4 | 91.0±0.9 | 91.5±0.4 | 91.9±1.0 |
| N-Over | P-AUROC ↑ | 98.8±0.1 | 98.8±0.1 | 98.9±0.0 | 98.8±0.1 | 98.9±0.0 |
| N-Over | AUPRO ↑ | 94.9±0.4 | 95.5±0.2 | 95.5±0.1 | 95.4±0.1 | 95.5±0.1 |
| | Noise-level ↓ | 5.07±0.13 | 3.20±0.04 | 2.88±0.20 | 2.82±0.19 | 2.71±0.19 |

To assess the extent to which the RGB and point cloud modalities should be integrated, we conducted experiments with the hyper-parameters $\lambda_I$ and $\lambda_P$ (listed as $\lambda_{rgb}$ and $\lambda_{pc}$ in Tab. X), which control the level of integration. The results of these experiments are presented in Tab. X. We observed that the model achieves optimal performance across all metrics for both anomaly detection and segmentation with $\lambda_I = 1.0$ and $\lambda_P = 1.5$. This indicates that strengthening the integration of the 3D point cloud modality can further improve performance. This outcome aligns with the findings reported in Secs. 4.2 and 4.3, where most methods performed better using purely 3D data than using solely RGB data. This suggests that the 3D point cloud data in the MVTec 3D-AD dataset [10] contains richer information and facilitates more effective anomaly detection than the RGB data within the same dataset.
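The role of the two weights can be illustrated with a small sketch. This passage does not give the exact formula by which $\lambda_I$ and $\lambda_P$ enter the model, so the normalized weighted sum below is only one plausible reading, and `fuse_scores` is a hypothetical name. The default values 1.0 and 1.5 are the best-performing setting reported in Tab. X.

```python
from typing import List


def fuse_scores(rgb: List[float], pc: List[float],
                lambda_i: float = 1.0, lambda_p: float = 1.5) -> List[float]:
    """Combine per-sample RGB and point cloud scores with modality weights.

    Assumption: a normalized weighted sum. The actual integration rule
    used by M3DM-NR may differ.
    """
    if len(rgb) != len(pc):
        raise ValueError("rgb and pc must have the same length")
    total = lambda_i + lambda_p
    return [(lambda_i * r + lambda_p * p) / total for r, p in zip(rgb, pc)]


# Toy scores for two samples (illustrative values, not from the paper).
fused = fuse_scores([0.2, 0.8], [0.4, 0.6])
print(fused)
```

With $\lambda_P > \lambda_I$, the fused score moves closer to the point cloud score, matching the observation that the 3D modality carries more information on this dataset.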

Number of Intra-Modal Reference Samples. To determine the appropriate number of intra-modal reference samples in Stage I, we conducted an ablation study on the quantity of these samples. The results are shown in Tab. XI. We conclude that increasing the number of intra-modal reference samples enhances the model’s performance. This improvement is logical, as more reference samples mean more normal cases for the model to learn from, naturally boosting performance. However, selecting too many intra-modal reference samples can lead to the inclusion of noise samples and increase computational complexity. Therefore, in practical implementation, we opted for 4 intra-modal reference samples, striking a balance between model performance and computational efficiency.
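A minimal sketch of choosing a fixed number of reference samples follows. It assumes each training sample already carries a scalar suspected-anomaly score and that the lowest-scoring samples are taken as references. Both the scoring interface and the name `select_references` are hypothetical, and the actual Suspected References Selection module may rank samples differently.

```python
from typing import Dict, List


def select_references(scores: Dict[str, float], ref_num: int = 4) -> List[str]:
    """Return the `ref_num` sample ids with the lowest suspected-anomaly score.

    Assumption: a lower score means the sample is more likely to be normal.
    """
    if ref_num < 0:
        raise ValueError("ref_num must be non-negative")
    ranked = sorted(scores.items(), key=lambda item: item[1])
    return [sample_id for sample_id, _ in ranked[:ref_num]]


# Toy scores (illustrative values only).
scores = {"a": 0.05, "b": 0.40, "c": 0.10, "d": 0.90, "e": 0.20, "f": 0.15}
print(select_references(scores, ref_num=4))  # ['a', 'c', 'f', 'e']
```

A larger `ref_num` adds more normal evidence but also reaches further down the ranking, which is where noisy samples are more likely to slip in.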

5 Conclusion

In this paper, we investigate the RGB-3D multi-modal noisy anomaly detection problem and introduce a novel framework, M3DM-NR, to address this challenging industrial task. Our approach systematically tackles reference selection, denoising, and final anomaly detection and segmentation through a three-stage process. In Stage I, we develop the Initial Feature Extraction, Suspected References Selection, and Suspected Anomaly Map Computation modules to filter normal samples and generate suspected anomaly maps, providing a robust foundation for the subsequent stages. Stage II, termed Enhanced Multi-modal Denoising, leverages multi-scale feature comparison and weighting methods to refine and denoise the training samples, ensuring cleaner data for model training. Finally, Stage III integrates Point Feature Alignment, Unsupervised Feature Fusion, Noise Discriminative Coreset Selection, and Decision Layer Fusion to achieve precise anomaly detection and segmentation while effectively filtering out noise at the patch level. Extensive experiments demonstrate that M3DM-NR significantly outperforms existing state-of-the-art methods in both detection and segmentation precision for 3D-RGB multi-modal noisy anomaly detection. The ablation studies further validate the effectiveness of each component within our framework, highlighting the importance of our systematic and hierarchical approach.

Future Works. Our work not only advances the field of industrial anomaly detection but also sets a new benchmark for handling noisy multi-modal data. Future research can build upon our framework to explore additional modalities and further enhance the robustness and accuracy of anomaly detection systems in practical industrial settings. First, future work could consider more realistic methods of injecting noise into the training set. The current approach of using anomalous samples from the test set as noise in the training set is rather naive. Future research could explore how noise naturally occurs among normal samples in real industrial production environments and attempt to construct new multi-modal noisy industrial detection datasets. Second, future efforts could look into fine-tuning the CLIP model to better handle multi-modal noisy industrial anomaly detection. The current method is training-free: the pre-trained CLIP model used in M3DM-NR is trained on a large-scale image dataset containing all categories of images. Subsequent work could fine-tune the CLIP model on specific industrial detection datasets before using it for multi-modal noisy industrial anomaly detection.

References

  • [1] Y. Cao, X. Xu, J. Zhang, Y. Cheng, X. Huang, G. Pang, and W. Shen, “A survey on visual anomaly detection: Challenge, approach, and prospect,” arXiv preprint arXiv:2401.16402, 2024.
  • [2] J. Liu, G. Xie, J. Wang, S. Li, C. Wang, F. Zheng, and Y. Jin, “Deep industrial image anomaly detection: A survey,” Machine Intelligence Research, vol. 21, no. 1, pp. 104–135, 2024.
  • [3] P. Bergmann, M. Fauser, D. Sattlegger, and C. Steger, “Mvtec ad–a comprehensive real-world dataset for unsupervised anomaly detection,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 9592–9600.
  • [4] C. Wang, W. Zhu, B.-B. Gao, Z. Gan, J. Zhang, Z. Gu, S. Qian, M. Chen, and L. Ma, “Real-iad: A real-world multi-view dataset for benchmarking versatile industrial anomaly detection,” in CVPR, 2024.
  • [5] K. Roth, L. Pemula, J. Zepeda, B. Schölkopf, T. Brox, and P. Gehler, “Towards total recall in industrial anomaly detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 14 318–14 328.
  • [6] D. Gudovskiy, S. Ishizaka, and K. Kozuka, “Cflow-ad: Real-time unsupervised anomaly detection with localization via conditional normalizing flows,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2022, pp. 98–107.
  • [7] Y. Zheng, X. Wang, R. Deng, T. Bao, R. Zhao, and L. Wu, “Focus your distribution: Coarse-to-fine non-contrastive learning for anomaly detection and localization,” in 2022 IEEE International Conference on Multimedia and Expo (ICME).   IEEE, 2022, pp. 1–6.
  • [8] X. Jiang, J. Liu, J. Wang, Q. Nie, K. Wu, Y. Liu, C. Wang, and F. Zheng, “Softpatch: Unsupervised anomaly detection with noisy data,” Advances in Neural Information Processing Systems, vol. 35, pp. 15 433–15 445, 2022.
  • [9] Y. Wang, J. Peng, J. Zhang, R. Yi, Y. Wang, and C. Wang, “Multimodal industrial anomaly detection via hybrid fusion,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 8032–8041.
  • [10] P. Bergmann, X. Jin, D. Sattlegger, and C. Steger, “The mvtec 3d-ad dataset for unsupervised 3d anomaly detection and localization,” in Proceedings of the 17th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications, VISIGRAPP 2022, Volume 5: VISAPP, Online Streaming, February 6-8, 2022, G. M. Farinella, P. Radeva, and K. Bouatouch, Eds.   SCITEPRESS, 2022, pp. 202–213. [Online]. Available: https://doi.org/10.5220/0010865000003124
  • [11] E. Horwitz and Y. Hoshen, “Back to the feature: classical 3d features are (almost) all you need for 3d anomaly detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 2967–2976.
  • [12] D. Gong, L. Liu, V. Le, B. Saha, M. R. Mansour, S. Venkatesh, and A. v. d. Hengel, “Memorizing normality to detect anomaly: Memory-augmented deep autoencoder for unsupervised anomaly detection,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 1705–1714.
  • [13] V. Zavrtanik, M. Kristan, and D. Skočaj, “Reconstruction by inpainting for visual anomaly detection,” Pattern Recognition, vol. 112, p. 107706, 2021.
  • [14] ——, “Draem-a discriminatively trained reconstruction embedding for surface anomaly detection,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 8330–8339.
  • [15] H. Deng and X. Li, “Anomaly detection via reverse distillation from one-class embedding,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 9737–9746.
  • [16] P. Perera, R. Nallapati, and B. Xiang, “Ocgan: One-class novelty detection using gans with constrained latent representations,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 2898–2906.
  • [17] J. Yu, Y. Zheng, X. Wang, W. Li, Y. Wu, R. Zhao, and L. Wu, “Fastflow: Unsupervised anomaly detection and localization via 2d normalizing flows,” arXiv preprint arXiv:2111.07677, 2021.
  • [18] M. Rudolph, T. Wehrbein, B. Rosenhahn, and B. Wandt, “Asymmetric student-teacher networks for industrial anomaly detection,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2023, pp. 2592–2602.
  • [19] T. Defard, A. Setkov, A. Loesch, and R. Audigier, “Padim: a patch distribution modeling framework for anomaly detection and localization,” in International Conference on Pattern Recognition.   Springer, 2021, pp. 475–489.
  • [20] E. Horwitz and Y. Hoshen, “An empirical investigation of 3d anomaly detection and segmentation,” arXiv preprint arXiv:2203.05550, 2022.
  • [21] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., “Learning transferable visual models from natural language supervision,” in International conference on machine learning.   PMLR, 2021, pp. 8748–8763.
  • [22] Z. Guo, R. Zhang, X. Zhu, Y. Tang, X. Ma, J. Han, K. Chen, P. Gao, X. Li, H. Li et al., “Point-bind & point-llm: Aligning point cloud with multi-modality for 3d understanding, generation, and instruction following,” arXiv preprint arXiv:2309.00615, 2023.
  • [23] L. Bonfiglioli, M. Toschi, D. Silvestri, N. Fioraio, and D. De Gregorio, “The eyecandies dataset for unsupervised multimodal anomaly detection and localization,” in Proceedings of the Asian Conference on Computer Vision, 2022, pp. 3586–3602.
  • [24] C.-L. Li, K. Sohn, J. Yoon, and T. Pfister, “Cutpaste: Self-supervised learning for anomaly detection and localization,” in CVPR, 2021.
  • [25] G. Zhang, K. Cui, T.-Y. Hung, and S. Lu, “Defect-gan: High-fidelity defect synthesis for automated defect inspection,” in WACV, 2021.
  • [26] Z. Liu, Y. Zhou, Y. Xu, and Z. Wang, “Simplenet: A simple network for image anomaly detection and localization,” in CVPR, 2023.
  • [27] M. Yang, P. Wu, and H. Feng, “Memseg: A semi-supervised method for image surface defect detection using differences and commonalities,” Engineering Applications of Artificial Intelligence, 2023.
  • [28] T. D. Tien, A. T. Nguyen, N. H. Tran, T. D. Huy, S. Duong, C. D. T. Nguyen, and S. Q. Truong, “Revisiting reverse distillation for anomaly detection,” in CVPR, 2023.
  • [29] L. Chen, Z. You, N. Zhang, J. Xi, and X. Le, “Utrad: Anomaly detection and localization with u-transformer,” Neural Networks, 2022.
  • [30] Y. Liang, J. Zhang, S. Zhao, R. Wu, Y. Liu, and S. Pan, “Omni-frequency channel-selection representations for unsupervised anomaly detection,” TIP, 2023.
  • [31] J. Zhang, X. Chen, Y. Wang, C. Wang, Y. Liu, X. Li, M.-H. Yang, and D. Tao, “Exploring plain vit reconstruction for multi-class unsupervised anomaly detection,” arXiv preprint arXiv:2312.07495, 2023.
  • [32] H. He, Y. Bai, J. Zhang, Q. He, H. Chen, Z. Gan, C. Wang, X. Li, G. Tian, and L. Xie, “Mambaad: Exploring state space models for multi-class unsupervised anomaly detection,” arXiv, 2024.
  • [33] J. Zhang, X. Li, G. Tian, Z. Xue, Y. Liu, G. Pang, and D. Tao, “Learning feature inversion for multi-class unsupervised anomaly detection under general-purpose coco-ad benchmark,” arXiv, 2024.
  • [34] H. He, J. Zhang, H. Chen, X. Chen, Z. Li, X. Chen, Y. Wang, C. Wang, and L. Xie, “Diad: A diffusion-based framework for multi-class anomaly detection,” arXiv preprint arXiv:2312.06607, 2023.
  • [35] Q. Wan, L. Gao, X. Li, and L. Wen, “Unsupervised image anomaly detection and segmentation based on pretrained feature mapping,” TII, 2022.
  • [36] Y. Cao, X. Xu, Z. Liu, and W. Shen, “Collaborative discrepancy optimization for reliable image anomaly localization,” TII, 2023.
  • [37] J. Lei, X. Hu, Y. Wang, and D. Liu, “Pyramidflow: High-resolution defect contrastive localization using pyramid normalizing flow,” in CVPR, 2023.
  • [38] M. Salehi, N. Sadjadi, S. Baselizadeh, M. H. Rohban, and H. R. Rabiee, “Multiresolution knowledge distillation for anomaly detection,” in CVPR, 2021.
  • [39] Y. Cao, Q. Wan, W. Shen, and L. Gao, “Informative knowledge distillation for image anomaly segmentation,” KBS, 2022.
  • [40] R. Chen, G. Xie, J. Liu, J. Wang, Z. Luo, J. Wang, and F. Zheng, “Easynet: An easy network for 3d industrial anomaly detection,” in Proceedings of the 31st ACM International Conference on Multimedia, 2023, pp. 7038–7046.
  • [41] V. Zavrtanik, M. Kristan, and D. Skočaj, “Keep dræming: Discriminative 3d anomaly detection through anomaly simulation,” Pattern Recognition Letters, 2024.
  • [42] W. Li and X. Xu, “Towards scalable 3d anomaly detection and localization: A benchmark via 3d anomaly synthesis and a self-supervised learning network,” arXiv preprint arXiv:2311.14897, 2023.
  • [43] Y. Cao, X. Xu, and W. Shen, “Complementary pseudo multimodal feature for point cloud anomaly detection,” arXiv preprint arXiv:2303.13194, 2023.
  • [44] Y.-M. Chu, L. Chieh, T.-I. Hsieh, H.-T. Chen, and T.-L. Liu, “Shape-guided dual-memory learning for 3d anomaly detection,” 2023.
  • [45] Y. Tu, B. Zhang, L. Liu, Y. Li, C. Xu, J. Zhang, Y. Wang, C. Wang, and C. R. Zhao, “Self-supervised feature adaptation for 3d industrial anomaly detection,” arXiv preprint arXiv:2401.03145, 2024.
  • [46] B. Zhao, Q. Xiong, X. Zhang, J. Guo, Q. Liu, X. Xing, and X. Xu, “Pointcore: Efficient unsupervised point cloud anomaly detector using local-global features,” arXiv preprint arXiv:2403.01804, 2024.
  • [47] P. Bergmann and D. Sattlegger, “Anomaly detection in 3d point clouds using deep geometric descriptors,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2023, pp. 2613–2623.
  • [48] Z. Gu, J. Zhang, L. Liu, X. Chen, J. Peng, Z. Gan, G. Jiang, A. Shu, Y. Wang, and L. Ma, “Rethinking reverse distillation for multi-modal anomaly detection,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, no. 8, 2024, pp. 8445–8453.
  • [49] Z. Hu, Z. Yang, X. Hu, and R. Nevatia, “Simple: Similar pseudo label exploitation for semi-supervised classification,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 15 099–15 108.
  • [50] K. Sohn, D. Berthelot, N. Carlini, Z. Zhang, H. Zhang, C. A. Raffel, E. D. Cubuk, A. Kurakin, and C.-L. Li, “Fixmatch: Simplifying semi-supervised learning with consistency and confidence,” Advances in neural information processing systems, vol. 33, pp. 596–608, 2020.
  • [51] J. Li, R. Socher, and S. C. Hoi, “Dividemix: Learning with noisy labels as semi-supervised learning,” arXiv preprint arXiv:2002.07394, 2020.
  • [52] M. Xu, Z. Zhang, H. Hu, J. Wang, L. Wang, F. Wei, X. Bai, and Z. Liu, “End-to-end semi-supervised object detection with soft teacher,” in Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 3060–3069.
  • [53] Y.-C. Liu, C.-Y. Ma, Z. He, C.-W. Kuo, K. Chen, P. Zhang, B. Wu, Z. Kira, and P. Vajda, “Unbiased teacher for semi-supervised object detection,” arXiv preprint arXiv:2102.09480, 2021.
  • [54] F. Yang, K. Wu, S. Zhang, G. Jiang, Y. Liu, F. Zheng, W. Zhang, C. Wang, and L. Zeng, “Class-aware contrastive semi-supervised learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 14 421–14 430.
  • [55] S. Han, X. Hu, H. Huang, M. Jiang, and Y. Zhao, “Adbench: Anomaly detection benchmark,” Advances in Neural Information Processing Systems, vol. 35, pp. 32 142–32 159, 2022.
  • [56] G. Pang, C. Yan, C. Shen, A. v. d. Hengel, and X. Bai, “Self-trained deep ordinal regression for end-to-end video anomaly detection,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 12 173–12 182.
  • [57] B. Liu, D. Wang, K. Lin, P.-N. Tan, and J. Zhou, “Rca: A deep collaborative autoencoder approach for anomaly detection,” in IJCAI: proceedings of the conference, vol. 2021.   NIH Public Access, 2021, p. 1505.
  • [58] C. Zhou and R. C. Paffenroth, “Anomaly detection with robust deep autoencoders,” in Proceedings of the 23rd ACM SIGKDD international conference on knowledge discovery and data mining, 2017, pp. 665–674.
  • [59] S. Wu, J. Zhao, and G. Tian, “Understanding and mitigating data contamination in deep anomaly detection: A kernel-based approach.” in IJCAI, 2022, pp. 2319–2325.
  • [60] J.-B. Alayrac, J. Donahue, P. Luc, A. Miech, I. Barr, Y. Hasson, K. Lenc, A. Mensch, K. Millican, M. Reynolds et al., “Flamingo: a visual language model for few-shot learning,” Advances in neural information processing systems, vol. 35, pp. 23 716–23 736, 2022.
  • [61] C. Jia, Y. Yang, Y. Xia, Y.-T. Chen, Z. Parekh, H. Pham, Q. Le, Y.-H. Sung, Z. Li, and T. Duerig, “Scaling up visual and vision-language representation learning with noisy text supervision,” in International conference on machine learning.   PMLR, 2021, pp. 4904–4916.
  • [62] R. Taori, A. Dave, V. Shankar, N. Carlini, B. Recht, and L. Schmidt, “Measuring robustness to natural distribution shifts in image classification,” Advances in Neural Information Processing Systems, vol. 33, pp. 18 583–18 599, 2020.
  • [63] G. Goh, N. Cammarata, C. Voss, S. Carter, M. Petrov, L. Schubert, A. Radford, and C. Olah, “Multimodal neurons in artificial neural networks,” Distill, vol. 6, no. 3, p. e30, 2021.
  • [64] Y. Rao, W. Zhao, G. Chen, Y. Tang, Z. Zhu, G. Huang, J. Zhou, and J. Lu, “Denseclip: Language-guided dense prediction with context-aware prompting,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 18 082–18 091.
  • [65] Y. Zhong, J. Yang, P. Zhang, C. Li, N. Codella, L. H. Li, L. Zhou, X. Dai, L. Yuan, Y. Li et al., “Regionclip: Region-based language-image pretraining,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 16 793–16 803.
  • [66] C. Zhou, C. C. Loy, and B. Dai, “Extract free dense labels from clip,” in European Conference on Computer Vision.   Springer, 2022, pp. 696–712.
  • [67] J. Jeong, Y. Zou, T. Kim, D. Zhang, A. Ravichandran, and O. Dabeer, “Winclip: Zero-/few-shot anomaly classification and segmentation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 19 606–19 616.
  • [68] X. Chen, Y. Han, and J. Zhang, “A zero-/few-shot anomaly classification and segmentation method for cvpr 2023 vand workshop challenge tracks 1&2: 1st place on zero-shot ad and 4th place on few-shot ad,” arXiv preprint arXiv:2305.17382, 2023.
  • [69] Y. Cao, X. Xu, C. Sun, Y. Cheng, Z. Du, L. Gao, and W. Shen, “Segment any anomaly without training via hybrid prompt regularization,” arXiv preprint arXiv:2305.10724, 2023.
  • [70] X. Chen, J. Zhang, G. Tian, H. He, W. Zhang, Y. Wang, C. Wang, Y. Wu, and Y. Liu, “Clip-ad: A language-guided staged dual-path model for zero-shot anomaly detection,” arXiv preprint arXiv:2311.00453, 2023.
  • [71] J. Zhang, X. Chen, Z. Xue, Y. Wang, C. Wang, and Y. Liu, “Exploring grounding potential of vqa-oriented gpt-4v for zero-shot anomaly detection,” arXiv preprint arXiv:2311.02612, 2023.
  • [72] R. Zhang, Z. Guo, W. Zhang, K. Li, X. Miao, B. Cui, Y. Qiao, P. Gao, and H. Li, “Pointclip: Point cloud understanding by clip,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 8552–8562.
  • [73] X. Zhu, R. Zhang, B. He, Z. Guo, Z. Zeng, Z. Qin, S. Zhang, and P. Gao, “Pointclip v2: Prompting clip and gpt for powerful 3d open-world learning,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 2639–2650.
  • [74] L. Xue, M. Gao, C. Xing, R. Martín-Martín, J. Wu, C. Xiong, R. Xu, J. C. Niebles, and S. Savarese, “Ulip: Learning a unified representation of language, images, and point clouds for 3d understanding,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 1179–1189.
  • [75] Y. Zheng, X. Wang, Y. Qi, W. Li, and L. Wu, “Benchmarking unsupervised anomaly detection and localization,” 2022.
  • [76] G. Wang, S. Han, E. Ding, and D. Huang, “Student-teacher feature pyramid matching for anomaly detection,” in 32nd British Machine Vision Conference 2021, BMVC 2021, Online, November 22-25, 2021.   BMVA Press, 2021, p. 306. [Online]. Available: https://www.bmvc2021-virtualconference.com/assets/papers/1273.pdf
  • [77] M. A. Fischler and R. C. Bolles, “Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography,” Communications of the ACM, vol. 24, no. 6, pp. 381–395, 1981.
  • [78] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in 2009 IEEE conference on computer vision and pattern recognition.   Ieee, 2009, pp. 248–255.
  • [79] M. Caron, H. Touvron, I. Misra, H. Jégou, J. Mairal, P. Bojanowski, and A. Joulin, “Emerging properties in self-supervised vision transformers,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 9650–9660.
  • [80] H. Zhao, L. Jiang, J. Jia, P. H. Torr, and V. Koltun, “Point transformer,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 16 259–16 268.
  • [81] Y. Pang, W. Wang, F. E. H. Tay, W. Liu, Y. Tian, and L. Yuan, “Masked autoencoders for point cloud self-supervised learning,” 2022.
  • [82] A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su et al., “Shapenet: An information-rich 3d model repository,” arXiv preprint arXiv:1512.03012, 2015.
  • [83] B. Schölkopf, J. C. Platt, J. Shawe-Taylor, A. J. Smola, and R. C. Williamson, “Estimating the support of a high-dimensional distribution,” Neural computation, vol. 13, no. 7, pp. 1443–1471, 2001.
  • [84] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga et al., “Pytorch: An imperative style, high-performance deep learning library,” Advances in neural information processing systems, vol. 32, 2019.

Overview

The appendix provides additional sections below to enhance the main manuscript:

  • We report the P-AUROC for regular anomaly segmentation on MVTec 3D-AD in Tab. A1.

  • We report the P-AUROC for noisy anomaly segmentation on MVTec 3D-AD in Tabs. A2 and A3.

  • We show the visualization results of noisy anomaly segmentation under the Non-Overlap setting in Fig. A1.

  • We report the experimental results on the Eyecandies [23] dataset in Tabs. A4, A5, A6, A7, A8 and A9.

  • We report the experimental results when injecting different percentages of noise into the training set in Tabs. A10, A11, A12, A13, A14 and A15.

P-AUROC for regular anomaly segmentation on MVTec 3D-AD

TABLE A1: P-AUROC score for regular anomaly segmentation of all categories of the MVTec 3D-AD [10] dataset. Our method maintains the regular anomaly segmentation ability. Baseline results are from [10, 20, 75]. Optimal and sub-optimal results are in bold and underlined, respectively.

| Modality | Method | Bagel | Cable Gland | Carrot | Cookie | Dowel | Foam | Peach | Potato | Rope | Tire | Mean |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3D | FPFH [20] | 99.4 | 96.6 | 99.9 | 94.6 | 96.6 | 92.7 | 99.6 | 99.9 | 99.6 | 99.0 | 97.8 |
| 3D | M3DM [9] | \ul{98.1} | 94.9 | 99.7 | \ul{93.2} | \ul{95.9} | \ul{92.5} | \ul{98.9} | 99.5 | \ul{99.4} | \ul{98.1} | 97.0 |
| 3D | Ours | \ul{98.1} | \ul{95.0} | 99.6 | \ul{93.2} | \ul{95.9} | 92.4 | \ul{98.9} | 99.6 | \ul{99.4} | \ul{98.1} | 97.0 |
| RGB | PatchCore [5] | 98.3 | 98.4 | 98.0 | 97.4 | 97.2 | 84.9 | 97.6 | 98.3 | 98.7 | 97.7 | 96.7 |
| RGB | M3DM [9] | 99.2 | 99.0 | 99.4 | 97.7 | \ul{98.3} | 95.5 | 99.4 | 99.0 | 99.5 | \ul{99.4} | 98.7 |
| RGB | Ours | \ul{99.1} | 99.0 | 99.4 | 97.7 | 98.4 | 95.5 | \ul{99.3} | 99.0 | 99.5 | 99.5 | 98.7 |
| RGB+3D | AST [18] | - | - | - | - | - | - | - | - | - | - | 97.6 |
| RGB+3D | PatchCore + FPFH [20] | 99.6 | 99.2 | 99.7 | 99.4 | 98.1 | 97.4 | 99.6 | 99.8 | 99.4 | 99.5 | 99.2 |
| RGB+3D | M3DM [9] | 99.5 | 99.3 | 99.7 | \ul{98.5} | 98.5 | \ul{98.4} | 99.6 | 99.4 | 99.7 | 99.6 | 99.2 |
| RGB+3D | Ours | 99.6 | 99.3 | 99.7 | 97.9 | 98.5 | 98.9 | 99.6 | \ul{99.5} | 99.7 | 99.6 | 99.2 |

In the regular anomaly segmentation setting, we compare our method with several 3D-based, RGB-based, and hybrid multi-modal 3D/RGB methods on MVTec 3D-AD. Tab. A1 reports the segmentation results measured by P-AUROC, from which we conclude that M3DM-NR maintains its regular anomaly segmentation ability.

P-AUROC for noisy anomaly segmentation on MVTec 3D-AD

In the main paper, we report the AUPRO score for anomaly segmentation. In this section, we report the P-AUROC score under Overlap and Non-Overlap settings to further verify the segmentation performance of our method, as shown in Tab. A2 and Tab. A3.
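For readers unfamiliar with the metric, the following is a minimal sketch of a pixel-level AUROC computed with the rank-sum (Mann-Whitney U) formulation over all pixels of a set of anomaly maps. The function names and the nested-list map representation are assumptions for illustration, and the evaluation code used for the tables may differ in detail.

```python
from typing import List, Sequence


def pixel_auroc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AUROC over flattened pixels; labels are 1 (anomalous) or 0 (normal)."""
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present")
    pairs = sorted(zip(scores, labels))
    rank_sum_pos = 0.0
    i = 0
    while i < len(pairs):
        # Find the run of tied scores and give each the average rank.
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # 1-based ranks i+1 .. j
        rank_sum_pos += avg_rank * sum(lab for _, lab in pairs[i:j])
        i = j
    u_stat = rank_sum_pos - n_pos * (n_pos + 1) / 2.0
    return u_stat / (n_pos * n_neg)


def flatten(maps: List[List[List[float]]]) -> List[float]:
    """Flatten a list of 2D maps into one list of pixel values."""
    return [v for m in maps for row in m for v in row]


# One toy 2x2 anomaly map and its ground-truth mask (illustrative values).
anomaly_maps = [[[0.1, 0.4], [0.35, 0.8]]]
gt_masks = [[[0, 0], [1, 1]]]
print(pixel_auroc(flatten(anomaly_maps), [int(v) for v in flatten(gt_masks)]))  # 0.75
```

Because the statistic pools every pixel, large normal regions dominate it, which is one reason the main paper also reports AUPRO.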

3D. On pure 3D anomaly segmentation, we obtain the highest P-AUROC, outperforming Shape-Guided [44] by 0.8% in the Overlap setting and M3DM [9] by 0.1% in the Non-Overlap setting. This shows that our method segments better than previous methods and is more resistant to noise in the training dataset, and that, with our PFA, the Point Transformer is the better 3D feature extractor for this task.

RGB. Our P-AUROC in the RGB domain matches that of SoftPatch [8] in the Overlap setting and that of M3DM in the Non-Overlap setting. However, our method has a lower standard deviation, which indicates that it is more robust.

3D+RGB. On 3D + RGB multi-modal anomaly segmentation, we obtain the best P-AUROC, outperforming Shape-Guided by 0.6% in the Overlap setting and PatchCore+FPFH [20] by 0.1% in the Non-Overlap setting. These results are attributable to our novel three-stage multi-modal noise-resistant framework.
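The noisy-setting tables report each entry as a mean and standard deviation over 3 random seeds. A minimal sketch of that aggregation follows. The name `summarize` is hypothetical, the input values are made up for illustration, and the use of the population standard deviation is an assumption, since the text does not say whether the population or sample form is used.

```python
from statistics import mean, pstdev
from typing import Sequence


def summarize(values: Sequence[float], digits: int = 1) -> str:
    """Format per-seed metric values as 'mean±std'.

    Assumption: population standard deviation (pstdev). Swap in
    statistics.stdev for the sample form.
    """
    return f"{mean(values):.{digits}f}±{pstdev(values):.{digits}f}"


# Made-up P-AUROC values from three seeds, not results from the paper.
print(summarize([94.0, 94.5, 95.0]))  # 94.5±0.4
```

With only three seeds the two conventions differ noticeably (0.4 versus 0.5 in this example), so small gaps between standard deviations should be read with care.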

TABLE A2: P-AUROC score for anomaly segmentation under the Overlap setting of all categories of MVTec 3D-AD. Our method clearly outperforms other methods in 3D, RGB, and 3D + RGB settings, indicating the superior anomaly detection ability of our method. We report the mean and standard deviation over 3 random seeds for each measurement. Optimal and sub-optimal results are in bold and underlined, respectively.

| Modality | Method | Bagel | Cable Gland | Carrot | Cookie | Dowel | Foam | Peach | Potato | Rope | Tire | Mean |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3D | SIFT | 69.8±4.6 | 80.6±1.4 | 95.4±0.5 | 78.2±0.9 | 70.6±2.0 | 77.1±1.6 | 66.6±1.5 | 76.4±10.0 | 91.1±0.3 | 75.9±1.7 | 78.2±1.6 |
| 3D | FPFH | 84.5±2.7 | 92.6±0.2 | 96.5±0.4 | 85.8±0.6 | 86.3±2.2 | 84.5±1.4 | 87.8±1.2 | 87.4±2.0 | 83.3±0.7 | 91.8±0.7 | 88.0±0.3 |
| 3D | AST | 89.5±0.6 | 90.2±0.0 | 96.9±0.0 | 85.7±0.6 | 86.8±0.0 | 86.4±0.0 | 93.5±0.0 | 97.0±0.6 | 89.6±0.6 | 89.9±0.6 | 90.6±0.2 |
| 3D | Shape-Guided | 93.5±1.7 | \ul{94.2±1.5} | 99.4±0.6 | 92.4±1.2 | 88.1±6.5 | 91.0±3.1 | 94.6±0.8 | 92.5±3.8 | 97.1±1.9 | 91.2±1.2 | \ul{93.4±0.6} |
| 3D | M3DM | \ul{94.3±1.1} | 94.2±0.9 | 98.9±0.2 | 90.6±0.9 | \ul{89.8±6.7} | 87.3±2.8 | \ul{95.1±1.0} | 91.9±5.1 | \ul{98.0±0.5} | \ul{92.6±3.8} | 93.3±0.9 |
| 3D | Ours | 96.6±1.7 | 94.3±0.3 | \ul{99.3±0.3} | \ul{91.8±0.4} | 90.2±4.9 | \ul{88.8±1.8} | 95.7±1.2 | \ul{92.6±3.2} | 98.7±0.7 | 94.3±2.6 | 94.2±0.7 |
| RGB | PaDim | \ul{93.4±0.9} | \ul{93.9±0.9} | \ul{97.3±0.4} | \ul{90.6±1.3} | \ul{93.5±6.1} | \ul{88.4±0.5} | 91.8±4.5 | 89.3±1.2 | \ul{98.5±0.2} | 93.8±3.8 | 93.1±0.1 |
| RGB | PatchCore | 75.2±3.2 | 73.6±6.2 | 80.0±4.0 | 80.2±3.4 | 71.1±5.5 | 75.4±9.5 | 68.9±7.8 | 72.3±9.3 | 64.9±17.3 | 75.3±6.8 | 73.7±1.4 |
| RGB | AST | 67.8±0.0 | 74.2±0.0 | 54.2±0.0 | 65.8±0.6 | 68.9±0.0 | 63.4±0.6 | 57.5±0.6 | 61.1±0.6 | 57.2±0.0 | 69.3±0.6 | 63.9±0.1 |
| RGB | Shape-Guided | 78.0±3.5 | 91.2±1.4 | 93.1±1.1 | 84.7±0.3 | 90.1±0.4 | 73.8±1.6 | 82.8±1.1 | 89.3±0.8 | 88.6±0.2 | 88.8±0.3 | 86.0±0.6 |
| RGB | SoftPatch | 90.4±1.7 | 91.9±4.1 | 96.9±1.1 | 87.7±2.2 | 94.8±4.6 | 96.5±4.9 | 94.4±0.5 | 90.9±0.7 | 96.7±1.6 | 97.3±0.8 | 93.8±0.5 |
| RGB | M3DM | 68.8±5.0 | 77.0±1.8 | 77.2±2.6 | 77.1±0.4 | 71.8±2.0 | 68.9±2.3 | 65.8±1.7 | 65.8±3.8 | 60.5±2.3 | 75.2±1.4 | 70.8±1.1 |
| RGB | Ours | 98.5±0.5 | 95.8±1.6 | 98.7±0.4 | 95.0±1.1 | 88.5±5.9 | 85.9±1.7 | \ul{93.4±2.6} | \ul{89.5±1.0} | 98.6±0.3 | \ul{94.6±0.4} | 93.8±0.7 |
| 3D+RGB | PatchCore+FPFH | 69.1±4.8 | 77.0±1.8 | 77.4±2.6 | 78.4±0.4 | 71.5±2.1 | 69.3±1.5 | 66.0±1.7 | 65.8±3.8 | 60.5±2.3 | 75.2±1.4 | 71.0±0.9 |
| 3D+RGB | AST | 90.7±0.6 | 94.3±0.6 | 97.5±0.0 | 89.4±0.0 | 90.6±0.6 | \ul{89.4±0.0} | 93.3±0.6 | 96.9±0.6 | 90.6±0.6 | 93.6±0.0 | 92.6±0.2 |
| 3D+RGB | Shape-Guided | \ul{91.0±1.7} | \ul{94.7±0.4} | \ul{98.1±0.2} | \ul{90.9±0.1} | 91.6±5.3 | 90.8±1.6 | 95.3±0.3 | \ul{95.8±4.6} | \ul{96.0±0.3} | 95.5±2.7 | \ul{94.0±1.0} |
| 3D+RGB | M3DM | 69.8±4.7 | 77.0±2.0 | 77.4±2.6 | 79.2±0.5 | 71.9±3.1 | 74.0±2.4 | 66.2±1.8 | 66.2±3.8 | 61.8±2.5 | 75.6±1.3 | 71.9±1.2 |
| 3D+RGB | Ours | 99.1±0.5 | 95.8±1.7 | 99.0±0.5 | 95.8±1.0 | \ul{90.7±2.8} | 88.1±2.5 | \ul{93.8±2.8} | 89.8±1.1 | 98.8±0.2 | \ul{94.9±0.5} | 94.6±0.3 |

TABLE A3: P-AUROC score for anomaly segmentation under Non-Overlap setting of all categories of MVTec 3D-AD. Our method clearly outperforms other methods in 3D, RGB, and 3D + RGB settings, indicating the superior anomaly detection ability of our method. We report the mean and standard deviation over 3 random seeds for each measurement. Optimal and sub-optimal results are in bold and underlined, respectively.

| Modality | Method | Bagel | Cable Gland | Carrot | Cookie | Dowel | Foam | Peach | Potato | Rope | Tire | Mean |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 3D | SIFT | 94.0±3.1 | 94.2±3.0 | 93.9±4.9 | \ul{93.0±1.9} | \ul{95.7±1.3} | 92.3±2.9 | 96.0±2.8 | 98.1±2.9 | \ul{99.2±0.7} | \ul{98.6±0.7} | 95.5±0.6 |
|  | FPFH | 97.7±0.5 | 93.8±2.4 | 95.2±4.5 | 94.4±0.4 | 96.5±0.5 | \ul{92.6±1.4} | 96.1±1.1 | \ul{99.1±1.2} | 98.9±1.2 | 99.1±0.1 | 96.3±0.5 |
|  | AST | 96.4±0.6 | 91.3±0.6 | 98.3±0.6 | 91.9±0.6 | 86.4±0.6 | 94.0±0.6 | 98.9±0.6 | 99.3±0.6 | 92.9±0.0 | 93.8±0.0 | 94.3±0.3 |
|  | Shape-Guided | \ul{98.4±0.5} | 94.4±1.5 | 98.8±1.0 | 93.0±1.7 | 95.5±0.6 | 90.9±4.0 | \ul{98.7±1.2} | 97.9±2.0 | 98.0±0.6 | 97.7±0.1 | 96.3±0.6 |
|  | M3DM | 97.9±0.3 | 94.8±0.3 | 99.6±0.1 | 91.9±0.9 | 94.8±2.0 | 91.5±3.1 | 97.5±2.2 | \ul{99.1±0.1} | 99.3±0.1 | 97.5±1.0 | \ul{96.4±0.7} |
|  | Ours | 98.6±0.2 | \ul{94.6±0.2} | 99.6±0.1 | 92.4±0.6 | 95.4±0.9 | 90.8±2.9 | 98.1±1.1 | 98.2±1.6 | \ul{99.2±0.3} | 97.7±0.6 | 96.5±0.7 |
| RGB | PaDim | 97.5±1.2 | 96.1±0.9 | 97.9±0.2 | 95.1±0.2 | 97.8±0.4 | \ul{99.6±0.3} | \ul{99.1±0.2} | \ul{98.6±0.3} | 98.8±0.4 | 99.2±0.2 | 98.0±0.2 |
|  | PatchCore | 96.0±0.2 | 98.9±0.0 | 98.1±1.9 | \ul{96.7±0.4} | \ul{98.9±0.1} | 99.9±0.0 | 98.1±0.1 | 96.3±2.3 | 98.8±0.8 | \ul{99.2±0.6} | 98.1±0.5 |
|  | AST | 88.5±0.6 | 92.7±0.6 | 65.8±0.6 | 79.4±1.0 | 96.0±0.6 | 80.6±1.0 | 84.4±0.6 | 80.0±0.0 | 89.1±0.6 | 85.6±0.6 | 84.2±0.2 |
|  | Shape-Guided | 94.5±0.4 | 97.2±0.4 | 98.3±0.2 | 95.0±0.6 | 98.1±0.1 | 87.8±0.8 | 95.1±0.2 | 96.1±0.3 | 97.3±1.0 | 97.5±0.5 | 95.7±0.1 |
|  | SoftPatch | 96.3±0.5 | 98.5±0.3 | \ul{99.2±0.1} | 96.8±0.4 | 98.9±0.1 | 98.9±1.0 | 98.3±0.3 | 97.1±1.3 | 98.2±0.4 | 98.5±1.0 | 98.1±0.1 |
|  | M3DM | \ul{98.8±0.3} | \ul{98.9±0.6} | 99.0±0.6 | 96.6±0.3 | 98.4±0.4 | 93.9±0.8 | 99.1±0.1 | 98.7±0.3 | 99.5±0.1 | 99.4±0.1 | 98.2±0.2 |
|  | Ours | 99.0±0.2 | \ul{98.9±0.2} | 99.2±0.1 | 96.4±0.3 | 97.7±0.8 | 94.6±0.4 | 98.9±0.1 | 98.4±0.5 | \ul{99.4±0.2} | 98.9±0.1 | 98.2±0.0 |
| 3D+RGB | PatchCore+FPFH | 99.4±0.1 | 98.8±0.5 | 99.3±0.6 | 98.1±1.6 | 98.1±0.5 | \ul{97.5±0.2} | \ul{99.3±0.1} | 98.6±0.1 | 99.5±0.1 | 99.1±0.6 | \ul{98.8±0.1} |
|  | AST | 97.4±0.6 | 97.1±0.6 | \ul{99.5±0.6} | 94.0±0.0 | 91.3±0.6 | 97.1±0.6 | 98.7±0.0 | 98.7±0.6 | 93.2±0.6 | 96.9±0.0 | 96.4±0.1 |
|  | Shape-Guided | 97.6±0.1 | 98.2±0.3 | 99.5±0.1 | 97.0±0.3 | 98.9±0.1 | 97.2±0.2 | 98.6±0.1 | \ul{99.1±1.0} | 98.9±0.5 | 99.6±0.2 | 98.5±0.2 |
|  | M3DM | 98.9±0.2 | 99.1±0.1 | 99.3±0.6 | 96.8±0.3 | 97.5±0.9 | 96.0±0.3 | 99.2±0.1 | 99.0±0.3 | 99.7±0.1 | \ul{99.3±0.1} | 98.5±0.1 |
|  | Ours | 99.4±0.1 | \ul{99.0±0.1} | 99.5±0.1 | \ul{97.2±0.2} | \ul{98.2±0.4} | 98.1±0.4 | 99.3±0.1 | 99.2±0.0 | \ul{99.6±0.1} | \ul{99.2±0.1} | 98.9±0.0 |

Visualization results of Non-Overlap setting

In this section, we visualize anomaly segmentation results for all categories of the MVTec 3D-AD dataset under the Non-Overlap setting. Fig. A1 shows the heatmaps produced by our method, PatchCore + FPFH [20], M3DM [9], and Shape-Guided [44] with multi-modal inputs. Compared with previous methods, our method produces more accurate segmentation maps.

Figure A1: Heatmaps of our anomaly segmentation results (multi-modal inputs) under the Non-Overlap setting. Compared with existing methods, our method remains unaffected by noise and outputs more accurate segmentation regions.

Eyecandies

We note that a recent dataset, Eyecandies [23], provides multimodal information for 10 categories of candies. Each category contains 1000 samples for training, 50 labeled samples for public testing, and 400 unlabeled samples for private testing. For each sample, the source dataset provides 6 RGB images captured under different lighting conditions, a depth map, and a normal map. In this section, we convert the Eyecandies dataset to the format supported by M3DM-NR. Specifically, we use the environment-light image as the input RGB data; for the 3D data, we first convert the depth image to a point cloud using the camera intrinsic parameters and then remove the background points based on their coordinates. For computational efficiency, we use no more than 400 samples from each category for training.

Because the public test set contains only 25 normal and 25 anomalous samples, which is fewer than 10% of the size of the training dataset, we implement the Overlap and Non-Overlap settings differently. For the Overlap setting, we only conduct experiments with 5% noise, selecting 400 images from the training dataset and 20 images from the public test dataset to form the noisy training dataset. For the Non-Overlap setting, since the private test dataset contains 200 normal samples and 200 anomalous samples mixed together, we randomly select 80 samples from it and treat them as 40 normal samples and 40 anomalous samples. These 80 samples, together with 320 normal samples selected from the training dataset, make up the noisy training dataset. We report the mean and standard deviation over 3 random seeds for each measurement.
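The depth-to-point-cloud conversion mentioned above can be illustrated with a minimal pinhole-camera sketch. It is not the authors' preprocessing code: the function names, the intrinsics `fx`, `fy`, `cx`, `cy`, and the depth threshold `z_max` used to drop background points are all assumptions made for illustration, and the text only states that background points are removed using their coordinates.

```python
import numpy as np


def depth_to_point_cloud(depth, fx, fy, cx, cy):
    """Back-project an (H, W) depth map into an (H, W, 3) organized point cloud.

    Assumes a pinhole camera with focal lengths (fx, fy) and principal
    point (cx, cy), all expressed in pixels (hypothetical values).
    """
    h, w = depth.shape
    # u indexes columns (x direction), v indexes rows (y direction).
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth.astype(np.float64)
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=-1)


def remove_background(points, z_max):
    """Keep points with 0 < z < z_max.

    Assumption: the background lies farther from the camera than z_max.
    """
    flat = points.reshape(-1, 3)
    keep = (flat[:, 2] > 0) & (flat[:, 2] < z_max)
    return flat[keep]


# Toy example: a 4x4 depth map with a nearer 2x2 object in the centre.
depth = np.full((4, 4), 2.0)
depth[1:3, 1:3] = 1.0
cloud = depth_to_point_cloud(depth, fx=2.0, fy=2.0, cx=1.5, cy=1.5)
foreground = remove_background(cloud, z_max=1.5)
```

In the toy example only the four central pixels survive the threshold. A real pipeline would read the intrinsics from the dataset metadata and may use a different background criterion, such as plane fitting.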

As illustrated in Tabs. A4, A5, A6, A7, A8 and A9, we report the best I-AUROC, AUPRO and P-AUROC scores under both the Overlap and Non-Overlap settings.
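As a reference for how such scores are typically obtained, the sketch below computes AUROC from the rank-sum (Mann-Whitney U) statistic and formats per-seed results as mean ± standard deviation. It is an illustrative assumption rather than the authors' evaluation code: the helper names are hypothetical, and the population standard deviation (NumPy's default) is assumed because the text does not say which estimator is used. The same `auroc` function applies to image-level scores (I-AUROC) and to flattened pixel-level scores (P-AUROC); AUPRO needs per-region overlap integration and is not covered here.

```python
import numpy as np


def auroc(labels, scores):
    """AUROC via the Mann-Whitney U statistic; tied scores get average ranks."""
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=np.float64)
    order = np.argsort(scores, kind="mergesort")
    sorted_scores = scores[order]
    ranks = np.empty(len(scores), dtype=np.float64)
    i = 0
    while i < len(scores):
        j = i
        # Extend j to the end of the current group of tied scores.
        while j + 1 < len(scores) and sorted_scores[j + 1] == sorted_scores[i]:
            j += 1
        # Average 1-based rank of positions i..j.
        ranks[order[i:j + 1]] = 0.5 * (i + j) + 1.0
        i = j + 1
    n_pos = labels.sum()
    n_neg = len(labels) - n_pos
    u = ranks[labels].sum() - n_pos * (n_pos + 1) / 2.0
    return u / (n_pos * n_neg)


def summarize(per_seed_scores):
    """Format scores in [0, 1] from several seeds as 'mean±std' in percent."""
    values = np.asarray(per_seed_scores, dtype=np.float64) * 100.0
    return f"{values.mean():.1f}±{values.std():.1f}"


# Toy usage: one AUROC value, then a summary over three hypothetical seeds.
score = auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
summary = summarize([0.90, 0.92, 0.94])
```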

TABLE A4: I-AUROC score for anomaly detection under Overlap setting of all categories in Eyecandies [23]. Our method clearly outperforms other methods in 3D + RGB settings, indicating the superior anomaly detection ability of our method. We report the mean and standard deviation over 3 random seeds for each measurement. Optimal and sub-optimal results are in bold and underlined, respectively.

| Modality | Method | Candy Cane | Chocolate Cookie | Chocolate Praline | Confetto | Gummy Bear | Hazelnut Truffle | Licorice Sandwich | Lollipop | Marshmallow | Peppermint Candy | Mean |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 3D+RGB | PatchCore+FPFH | 11.4±2.8 | 19.2±3.6 | 20.9±1.6 | 19.7±0.9 | 25.1±5.9 | \ul{20.8±4.7} | 17.6±1.2 | 24.5±3.6 | 24.8±1.4 | 19.1±1.3 | 20.3±0.3 |
|  | AST | 8.0±0.6 | 13.8±0.6 | 6.7±0.6 | 10.9±0.6 | 16.7±0.6 | 10.9±0.6 | 18.4±0.6 | 24.0±1.0 | 9.4±0.0 | 13.7±0.0 | 13.4±0.2 |
|  | Shape-Guided | 9.1±4.5 | 18.5±1.0 | 15.3±2.5 | 24.7±2.2 | 15.5±3.0 | 11.8±2.4 | 15.8±0.6 | 25.7±1.2 | 25.9±1.3 | 23.6±3.1 | 18.6±0.8 |
|  | M3DM | \ul{17.0±3.6} | \ul{30.5±4.2} | \ul{39.6±2.7} | \ul{41.9±1.6} | \ul{39.4±3.4} | 20.7±3.8 | \ul{28.2±2.3} | \ul{33.1±3.4} | \ul{54.6±0.4} | \ul{50.9±0.9} | \ul{35.6±0.9} |
|  | Ours | 33.5±3.4 | 74.9±4.5 | 76.9±5.5 | 89.3±3.0 | 55.8±6.1 | 48.0±5.7 | 79.4±5.2 | 65.0±4.9 | 98.9±1.0 | 70.5±2.4 | 69.2±1.9 |

TABLE A5: I-AUROC score for anomaly detection under Non-Overlap setting of all categories in Eyecandies [23]. Our method clearly outperforms other methods in 3D + RGB settings, indicating the superior anomaly detection ability of our method. We report the mean and standard deviation over 3 random seeds for each measurement. Optimal and sub-optimal results are in bold and underlined, respectively.

| Modality | Method | Candy Cane | Chocolate Cookie | Chocolate Praline | Confetto | Gummy Bear | Hazelnut Truffle | Licorice Sandwich | Lollipop | Marshmallow | Peppermint Candy | Mean |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 3D+RGB | PatchCore+FPFH | 55.4±0.8 | 86.4±2.3 | 72.2±2.1 | 94.3±1.9 | 71.5±3.5 | 49.2±5.3 | 80.9±1.0 | 82.0±1.2 | 99.1±0.8 | 85.8±4.7 | 77.7±0.6 |
|  | AST | 47.7±0.6 | \ul{93.4±1.0} | 78.3±0.6 | 93.9±0.0 | 74.7±0.6 | 66.2±1.0 | 83.1±0.6 | 87.3±0.0 | 99.4±0.6 | 92.9±0.6 | 81.7±0.2 |
|  | Shape-Guided | 49.4±0.5 | 94.8±1.3 | 77.5±2.2 | 93.9±1.1 | 74.8±0.9 | \ul{64.9±2.0} | \ul{83.3±0.4} | \ul{86.0±1.6} | \ul{99.6±0.1} | 92.6±1.3 | 81.7±0.7 |
|  | M3DM | 53.9±5.0 | 90.1±0.6 | 89.4±0.8 | 98.4±0.4 | \ul{81.5±1.0} | 52.3±1.8 | 78.4±1.1 | 83.3±1.7 | 99.5±0.2 | 99.4±0.2 | \ul{82.6±0.5} |
|  | Ours | \ul{54.5±7.7} | 85.6±0.5 | \ul{88.9±2.1} | \ul{97.2±0.7} | 82.2±6.1 | 54.3±2.5 | 86.8±0.2 | 85.6±1.2 | 99.8±0.1 | \ul{98.6±0.7} | 83.3±0.6 |

TABLE A6: AUPRO score for anomaly segmentation under Overlap setting of all categories in Eyecandies [23]. Our method clearly outperforms other methods in 3D + RGB settings, indicating the superior anomaly detection ability of our method. We report the mean and standard deviation over 3 random seeds for each measurement. Optimal and sub-optimal results are in bold and underlined, respectively.

| Modality | Method | Candy Cane | Chocolate Cookie | Chocolate Praline | Confetto | Gummy Bear | Hazelnut Truffle | Licorice Sandwich | Lollipop | Marshmallow | Peppermint Candy | Mean |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 3D+RGB | PatchCore+FPFH | 16.7±1.9 | 20.5±2.1 | 15.6±1.4 | 18.7±3.2 | 22.2±4.8 | 18.3±2.6 | 17.3±1.9 | 25.8±6.2 | 19.0±1.1 | 19.6±0.5 | 19.4±0.6 |
|  | Shape-Guided | 65.6±0.6 | \ul{44.1±0.9} | \ul{21.1±0.9} | \ul{57.8±4.2} | \ul{52.8±2.2} | 20.7±1.7 | \ul{34.3±2.0} | 84.0±3.2 | \ul{59.1±3.0} | 57.6±2.2 | \ul{49.7±1.1} |
|  | M3DM | 21.7±3.2 | 21.0±2.3 | 18.3±0.2 | 18.8±3.2 | 23.3±5.1 | \ul{21.5±2.1} | 17.6±2.3 | 26.7±4.7 | 19.1±1.2 | 20.2±0.0 | 20.8±0.7 |
|  | Ours | \ul{50.5±2.5} | 82.1±2.9 | 66.8±2.2 | 89.7±2.7 | 60.7±4.0 | 59.3±2.2 | 80.8±1.6 | \ul{70.3±2.9} | 94.1±2.6 | \ul{55.9±3.4} | 71.0±0.8 |

TABLE A7: AUPRO score for anomaly segmentation under Non-Overlap setting of all categories in Eyecandies [23]. Our method clearly outperforms other methods in 3D + RGB settings, indicating the superior anomaly detection ability of our method. We report the mean and standard deviation over 3 random seeds for each measurement. Optimal and sub-optimal results are in bold and underlined, respectively.

| Modality | Method | Candy Cane | Chocolate Cookie | Chocolate Praline | Confetto | Gummy Bear | Hazelnut Truffle | Licorice Sandwich | Lollipop | Marshmallow | Peppermint Candy | Mean |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 3D+RGB | PatchCore+FPFH | 83.5±1.5 | 89.9±0.7 | 67.0±0.8 | \ul{96.4±0.0} | 81.9±0.8 | 51.6±1.2 | 86.7±0.6 | 89.9±0.3 | 94.6±0.6 | 88.6±0.7 | 83.0±0.3 |
|  | Shape-Guided | 84.9±0.5 | \ul{91.0±0.1} | 69.8±0.4 | 95.5±0.3 | 84.6±0.7 | 61.1±0.9 | \ul{90.5±0.8} | 95.1±0.2 | \ul{96.4±0.2} | 93.8±0.3 | 86.3±0.2 |
|  | M3DM | \ul{88.0±1.1} | 90.4±1.2 | \ul{80.6±0.2} | 96.1±3.6 | \ul{87.4±1.2} | \ul{65.7±1.3} | 86.4±1.4 | 91.2±0.2 | 96.2±0.6 | 96.2±0.8 | \ul{87.8±0.3} |
|  | Ours | 89.8±0.6 | 91.6±0.3 | 77.6±1.8 | 98.1±0.1 | 86.6±2.0 | 65.2±1.1 | 85.8±1.4 | 90.8±0.6 | 96.9±0.3 | \ul{96.1±0.8} | 87.8±0.2 |

TABLE A8: P-AUROC score for anomaly segmentation under Overlap setting of all categories in Eyecandies [23]. Our method clearly outperforms other methods in 3D + RGB settings, indicating the superior anomaly detection ability of our method. We report the mean and standard deviation over 3 random seeds for each measurement. Optimal and sub-optimal results are in bold and underlined, respectively.

| Modality | Method | Candy Cane | Chocolate Cookie | Chocolate Praline | Confetto | Gummy Bear | Hazelnut Truffle | Licorice Sandwich | Lollipop | Marshmallow | Peppermint Candy | Mean |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 3D+RGB | PatchCore+FPFH | 21.7±2.3 | 21.4±2.5 | 28.9±3.0 | 25.0±2.0 | 34.6±5.5 | 35.5±3.7 | 20.6±2.0 | 25.6±21.0 | 22.3±3.3 | 26.8±7.9 | 26.2±1.1 |
|  | AST | 48.3±0.6 | 49.3±0.6 | 48.3±0.6 | 48.6±0.6 | 78.1±1.0 | 49.0±1.0 | 76.1±1.0 | 48.7±1.0 | 77.0±0.6 | 49.0±0.0 | 57.2±0.5 |
|  | Shape-Guided | 89.7±0.4 | \ul{82.4±0.8} | \ul{71.6±1.2} | \ul{86.0±1.5} | \ul{78.1±1.5} | \ul{67.6±2.4} | \ul{78.4±0.7} | 94.1±2.0 | \ul{81.0±0.6} | 65.5±3.1 | \ul{79.5±1.0} |
|  | M3DM | 37.5±2.6 | 24.2±1.8 | 30.2±3.9 | 22.7±2.1 | 34.8±4.9 | 39.7±3.0 | 21.6±2.6 | 26.5±21.1 | 19.6±3.6 | 19.0±1.3 | 27.6±1.2 |
|  | Ours | \ul{57.0±2.5} | 87.4±6.0 | 78.0±2.4 | 91.6±3.9 | 70.7±3.3 | 82.0±4.0 | 90.2±2.3 | \ul{81.8±6.4} | 98.5±1.2 | \ul{60.3±8.3} | 79.8±0.7 |

TABLE A9: P-AUROC score for anomaly segmentation under Non-Overlap setting of all categories in Eyecandies [23]. Our method clearly outperforms other methods in 3D + RGB settings, indicating the superior anomaly detection ability of our method. We report the mean and standard deviation over 3 random seeds for each measurement. Optimal and sub-optimal results are in bold and underlined, respectively.

| Modality | Method | Candy Cane | Chocolate Cookie | Chocolate Praline | Confetto | Gummy Bear | Hazelnut Truffle | Licorice Sandwich | Lollipop | Marshmallow | Peppermint Candy | Mean |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 3D+RGB | PatchCore+FPFH | 95.7±0.2 | 97.4±0.1 | 91.7±0.3 | 99.4±0.0 | 92.9±0.2 | 87.4±0.5 | 96.9±0.2 | 98.1±0.2 | 99.2±0.1 | 97.3±0.2 | 95.6±0.1 |
|  | AST | 95.1±0.6 | 98.3±1.0 | 91.4±0.6 | 99.3±0.6 | 92.0±0.6 | 88.2±0.6 | 96.0±0.6 | 95.9±0.6 | 98.8±0.6 | 97.0±0.6 | 95.2±0.2 |
|  | Shape-Guided | 95.8±0.1 | 98.3±0.0 | 92.7±0.0 | 99.0±0.1 | 91.9±0.3 | 89.0±0.2 | 97.9±0.2 | 98.5±0.1 | 99.5±0.1 | 98.4±0.1 | 96.1±0.1 |
|  | M3DM | \ul{96.4±0.3} | \ul{98.3±0.3} | \ul{95.2±1.9} | 99.8±0.0 | 97.5±0.3 | 93.3±0.2 | 95.5±3.1 | 98.9±0.0 | \ul{99.6±0.1} | 99.4±0.1 | \ul{97.4±0.5} |
|  | Ours | 96.9±0.3 | 98.4±0.0 | 95.5±0.7 | \ul{99.8±0.1} | \ul{96.7±0.5} | \ul{92.8±0.7} | \ul{97.1±0.2} | \ul{98.7±0.1} | 99.7±0.0 | \ul{99.3±0.3} | 97.5±0.0 |

Experiments on different noise levels

To further validate the robustness of our method against noise in the training dataset, we conduct experiments with different proportions of injected noise. Specifically, we inject 20% and 30% noisy data into the training dataset. The results are presented in Tabs. A10, A11, A12, A13, A14 and A15. Comparing the results at 10%, 20% and 30% noise, we conclude that our method is considerably more robust to noise in the training dataset than previous methods. For example, under the Overlap setting (Tab. A10), the mean I-AUROC of our method decreases only from 79.3 to 76.8 when the noise level rises from 20% to 30%, whereas that of M3DM, the second-best method, drops from 42.9 to 27.9.
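The noise-injection protocol can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' data pipeline: the function name and arguments are hypothetical, the noise ratio is taken relative to the number of clean training samples (consistent with the Eyecandies example of 20 injected images for 400 training images at 5% noise), and the Overlap setting is assumed to keep the injected anomalous samples in the test set while the Non-Overlap setting removes them.

```python
import random


def build_noisy_split(normal_train, anomalous_test, noise_ratio, overlap, seed=0):
    """Inject anomalous test samples into the training set (illustrative sketch).

    Assumptions: noise_ratio is relative to len(normal_train); with
    overlap=True the injected samples also stay in the test set, and with
    overlap=False they are removed from it.
    """
    rng = random.Random(seed)
    n_noise = min(len(anomalous_test), round(noise_ratio * len(normal_train)))
    injected = rng.sample(anomalous_test, n_noise)
    train = list(normal_train) + injected
    rng.shuffle(train)
    if overlap:
        test_anomalies = list(anomalous_test)
    else:
        injected_set = set(injected)
        test_anomalies = [s for s in anomalous_test if s not in injected_set]
    return train, test_anomalies


# Toy usage with sample identifiers instead of real images.
normal = [f"n{i}" for i in range(10)]
anomalies = [f"a{i}" for i in range(5)]
train_no, test_no = build_noisy_split(normal, anomalies, 0.2, overlap=False)
train_ov, test_ov = build_noisy_split(normal, anomalies, 0.2, overlap=True)
```

With these toy inputs, two anomalous samples are injected in both cases; only the Non-Overlap split removes them from the test anomalies.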

TABLE A10: I-AUROC score for anomaly detection under Overlap setting of all categories in MVTec 3D-AD. We inject 20% and 30% noise into the training dataset. Our method outperforms other methods, indicating the superior anomaly detection ability of our method. We report the mean and standard deviation over 3 random seeds for each measurement. Optimal and sub-optimal results are in bold and underlined, respectively.

| Noise | Method | Bagel | Cable Gland | Carrot | Cookie | Dowel | Foam | Peach | Potato | Rope | Tire | Mean |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 20% | PatchCore+FPFH | 42.0±1.6 | 40.8±2.8 | 49.5±0.3 | 53.0±0.6 | 44.1±1.3 | 28.2±2.1 | 27.3±1.2 | 25.9±1.5 | 13.2±1.3 | 45.1±2.0 | 36.9±0.3 |
|  | AST | 37.3±1.0 | 44.8±0.6 | 50.3±0.6 | \ul{59.5±0.0} | 43.2±0.6 | 33.2±0.6 | 29.4±1.0 | \ul{31.5±1.0} | 12.4±0.6 | 38.1±0.6 | 38.0±0.1 |
|  | Shape-Guided | 42.3±1.1 | 45.1±1.6 | \ul{53.2±0.3} | 50.6±0.5 | 44.6±1.3 | 32.8±0.7 | 29.4±0.1 | 30.1±0.5 | 14.0±0.7 | 45.9±1.3 | 38.8±0.3 |
|  | M3DM | \ul{45.0±1.1} | \ul{47.3±1.0} | 47.6±1.0 | 56.8±1.9 | \ul{51.4±1.0} | \ul{41.3±0.5} | \ul{32.7±0.7} | 27.9±1.5 | \ul{25.5±1.4} | \ul{53.8±1.2} | \ul{42.9±0.5} |
|  | Ours | 92.8±1.5 | 76.4±1.8 | 93.0±0.5 | 85.7±0.9 | 82.4±0.7 | 71.4±5.2 | 67.7±5.0 | 60.2±2.9 | 90.2±1.5 | 73.3±2.3 | 79.3±1.0 |
| 30% | PatchCore+FPFH | 18.6±1.5 | 22.2±1.8 | 30.8±0.8 | 39.7±3.4 | 18.2±1.2 | 13.4±2.0 | 4.2±0.4 | 4.1±0.4 | 7.0±0.3 | 24.9±1.3 | 18.3±0.7 |
|  | AST | 14.6±0.6 | 21.4±1.0 | 28.7±0.6 | 38.4±0.0 | 16.4±0.0 | 9.3±1.0 | 4.3±0.6 | 5.6±0.6 | 6.8±0.0 | 20.2±1.0 | 16.6±0.1 |
|  | Shape-Guided | 15.7±0.6 | 22.3±1.2 | \ul{32.8±1.0} | 31.3±0.2 | 18.3±0.3 | 9.7±0.9 | 4.2±0.1 | 4.7±0.8 | 7.2±0.1 | 24.7±1.5 | 17.1±0.3 |
|  | M3DM | \ul{30.4±1.6} | \ul{27.4±1.9} | 32.5±0.8 | \ul{40.7±1.4} | \ul{36.7±2.4} | \ul{25.5±3.1} | \ul{16.0±1.4} | \ul{12.2±1.2} | \ul{19.9±2.0} | \ul{37.9±1.3} | \ul{27.9±0.8} |
|  | Ours | 89.7±1.3 | 69.1±1.8 | 93.7±0.7 | 83.7±2.0 | 78.8±2.1 | 69.9±4.9 | 67.1±3.4 | 55.3±2.0 | 90.5±0.9 | 70.0±2.1 | 76.8±0.6 |

TABLE A11: I-AUROC score for anomaly detection under Non-Overlap setting of all categories in MVTec 3D-AD. We inject 20% and 30% noise into the training dataset. Our method outperforms other methods, indicating the superior anomaly detection ability of our method. We report the mean and standard deviation over 3 random seeds for each measurement. Optimal and sub-optimal results are in bold and underlined, respectively.

| Noise | Method | Bagel | Cable Gland | Carrot | Cookie | Dowel | Foam | Peach | Potato | Rope | Tire | Mean |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 20% | PatchCore+FPFH | 84.0±1.4 | 84.0±0.8 | 87.5±0.1 | 79.5±2.5 | 93.0±0.5 | 56.9±3.6 | 82.6±3.7 | 73.0±4.8 | 90.3±8.1 | \ul{84.8±3.3} | 81.6±0.1 |
|  | AST | 82.1±1.0 | 91.6±0.6 | \ul{87.6±0.6} | 92.8±1.0 | 93.4±0.6 | \ul{79.7±1.0} | 91.1±0.6 | 90.1±0.6 | 88.1±0.6 | 72.3±0.6 | \ul{86.9±0.3} |
|  | Shape-Guided | 82.8±2.3 | 81.8±2.9 | 86.6±0.5 | 79.0±0.9 | 86.2±1.3 | 69.1±1.5 | 74.1±0.3 | 72.8±1.2 | 60.3±3.0 | 79.8±2.3 | 77.3±0.7 |
|  | M3DM | \ul{92.6±3.4} | 76.8±2.1 | 82.6±1.2 | 82.4±3.1 | 95.2±0.8 | 75.3±0.6 | 83.0±4.1 | 74.1±4.2 | \ul{98.0±2.4} | 84.3±2.1 | 84.4±1.0 |
|  | Ours | 97.4±0.3 | \ul{85.0±4.2} | 95.1±0.3 | \ul{90.6±0.9} | \ul{94.0±1.9} | 88.1±1.9 | \ul{87.4±1.4} | \ul{79.8±2.4} | 98.1±1.0 | 85.5±0.9 | 90.1±0.7 |
| 30% | PatchCore+FPFH | 78.2±2.3 | 81.5±2.9 | \ul{86.5±2.4} | 80.7±3.6 | 95.4±2.7 | 62.0±5.8 | 74.1±3.6 | 74.6±6.8 | \ul{96.7±3.2} | 88.5±4.5 | 81.8±1.5 |
|  | AST | 73.4±0.6 | 88.8±0.6 | 81.8±0.6 | 96.6±0.6 | \ul{94.4±1.0} | 74.0±0.0 | 96.6±0.6 | 94.4±1.0 | 73.7±0.6 | 85.3±0.6 | \ul{85.7±0.8} |
|  | Shape-Guided | 60.2±2.2 | 69.2±3.7 | 77.3±2.3 | 68.5±0.5 | 65.9±1.1 | 45.5±4.1 | 28.0±0.4 | 31.0±5.1 | 41.4±0.5 | 69.2±4.0 | 55.6±0.9 |
|  | M3DM | \ul{90.6±3.5} | \ul{85.7±7.6} | 78.5±2.1 | 82.4±0.9 | 93.2±0.9 | 84.8±3.8 | 87.2±2.3 | 71.5±21.3 | 95.8±4.1 | \ul{85.1±5.3} | 85.5±3.1 |
|  | Ours | 97.9±0.9 | 80.7±6.2 | 95.6±1.0 | \ul{89.7±1.4} | 94.1±2.0 | \ul{83.8±1.5} | \ul{90.2±3.5} | \ul{78.5±4.7} | 98.6±1.0 | 83.8±6.8 | 89.3±0.9 |

TABLE A12: AUPRO score for anomaly segmentation under Overlap setting of all categories in MVTec 3D-AD. We inject 20% and 30% noise into the training dataset. Our method outperforms other methods, indicating the superior anomaly detection ability of our method. We report the mean and standard deviation over 3 random seeds for each measurement. Optimal and sub-optimal results are in bold and underlined, respectively.

| Noise | Method | Bagel | Cable Gland | Carrot | Cookie | Dowel | Foam | Peach | Potato | Rope | Tire | Mean |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 20% | PatchCore+FPFH | 46.3±1.6 | 48.9±1.1 | 56.0±0.3 | 58.5±1.3 | 43.3±1.0 | 35.3±1.4 | 31.9±0.5 | 35.3±2.8 | 13.7±0.7 | 48.2±0.6 | 41.7±0.2 |
|  | Shape-Guided | \ul{68.5±0.6} | \ul{69.2±1.3} | \ul{90.0±1.2} | \ul{64.4±1.3} | 85.1±0.9 | \ul{60.5±1.6} | 82.7±1.1 | 92.4±0.5 | \ul{82.4±0.4} | 90.4±1.0 | \ul{78.6±0.1} |
|  | M3DM | 45.7±1.1 | 48.8±1.4 | 55.9±0.4 | 56.1±2.3 | 43.0±0.7 | 36.3±1.3 | 32.3±0.2 | 35.7±2.9 | 13.7±0.8 | 48.2±0.6 | 41.6±0.3 |
|  | Ours | 93.0±0.8 | 85.5±1.6 | 95.2±0.6 | 86.3±0.5 | \ul{78.3±2.1} | 76.8±2.7 | \ul{76.0±5.0} | \ul{74.6±3.1} | 90.3±0.6 | \ul{81.3±2.9} | 83.7±0.5 |
| 30% | PatchCore+FPFH | 18.1±1.0 | 23.6±1.3 | 35.2±0.6 | 38.3±0.9 | 17.2±0.1 | 11.7±2.7 | 5.3±1.3 | 6.2±1.0 | 7.0±0.8 | 25.0±0.1 | 18.8±0.3 |
|  | Shape-Guided | \ul{70.9±0.3} | \ul{64.9±1.9} | \ul{89.1±0.3} | \ul{55.3±1.4} | 83.2±0.1 | \ul{56.6±2.2} | 85.6±0.5 | 93.7±0.3 | \ul{82.6±0.4} | 89.7±1.3 | \ul{77.2±0.1} |
|  | M3DM | 18.7±1.0 | 24.0±1.0 | 35.3±0.6 | 39.2±0.6 | 17.7±0.2 | 18.2±1.7 | 5.7±1.4 | 7.1±0.7 | 7.6±0.6 | 25.1±0.2 | 19.9±0.2 |
|  | Ours | 90.7±1.2 | 81.5±1.4 | 94.8±0.3 | 84.5±1.5 | \ul{75.4±2.0} | 76.5±3.4 | \ul{75.2±1.8} | \ul{71.4±1.8} | 90.4±0.6 | \ul{80.6±2.8} | 82.1±0.4 |

TABLE A13: AUPRO score for anomaly segmentation under Non-Overlap setting of all categories in MVTec 3D-AD. We inject 20% and 30% noise into the training dataset. Our method outperforms other methods, indicating the superior anomaly detection ability of our method. We report the mean and standard deviation over 3 random seeds for each measurement. Optimal and sub-optimal results are in bold and \ulunderlined, respectively.
| Noise | Method | Bagel | Cable Gland | Carrot | Cookie | Dowel | Foam | Peach | Potato | Rope | Tire | Mean |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 20% | PatchCore+FPFH | 97.0±0.2 | 96.8±0.6 | \ul{96.8±0.0} | 94.8±1.6 | 91.6±0.9 | \ul{89.7±0.4} | 96.6±0.5 | \ul{95.6±0.2} | 96.6±1.6 | 95.5±1.1 | \ul{95.1±0.3} |
| 20% | Shape-Guided | 91.6±1.2 | 89.4±0.7 | 96.0±0.5 | 88.2±0.8 | 93.1±0.8 | 84.9±6.0 | 90.1±1.5 | 95.1±1.0 | 84.4±4.7 | 96.0±1.2 | 90.9±1.2 |
| 20% | M3DM | 93.8±1.3 | \ul{95.6±0.8} | 96.5±0.1 | 88.1±1.3 | \ul{92.6±2.1} | 80.0±0.9 | \ul{97.1±0.2} | 95.3±0.7 | 97.9±0.5 | 97.0±0.5 | 93.4±0.3 |
| 20% | Ours | \ul{96.5±0.6} | 95.6±0.2 | 97.7±0.1 | \ul{92.2±0.5} | 92.6±1.7 | 90.1±0.7 | 97.3±0.1 | 96.0±0.2 | \ul{97.6±1.0} | \ul{96.6±0.7} | 95.2±0.3 |
| 30% | PatchCore+FPFH | \ul{96.6±0.9} | \ul{96.3±1.9} | \ul{96.8±1.0} | 94.6±0.9 | \ul{93.1±1.2} | \ul{87.9±4.0} | 97.0±0.6 | 92.3±7.5 | 97.5±1.4 | 97.6±0.3 | \ul{95.0±0.3} |
| 30% | Shape-Guided | 73.7±3.3 | 79.4±1.9 | 93.6±0.3 | 82.4±2.1 | 88.4±2.6 | 69.3±0.2 | 72.6±3.4 | 88.7±3.3 | 81.0±5.7 | 93.7±1.9 | 82.3±0.7 |
| 30% | M3DM | 94.3±2.7 | 97.2±0.7 | 96.4±0.9 | 87.5±0.4 | 92.5±1.6 | 83.6±6.5 | \ul{97.4±0.1} | \ul{93.3±5.5} | 97.6±1.2 | \ul{96.9±1.1} | 93.7±0.6 |
| 30% | Ours | \ul{96.6±0.7} | 95.0±0.4 | 97.7±0.1 | \ul{92.3±0.8} | 93.9±0.5 | 89.5±2.6 | 97.7±0.4 | 95.4±0.4 | \ul{97.5±1.3} | 96.1±1.2 | 95.2±0.2 |
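Each entry in these tables is reported as mean ± standard deviation over 3 random seeds. A minimal sketch of that aggregation is given below. The helper name `summarize` and the per-seed values are hypothetical, and the use of the population standard deviation is an assumption, since the text does not state which estimator is used.

```python
from statistics import mean, pstdev


def summarize(scores, digits=1):
    """Format per-seed scores (in percent) as 'mean±std'.

    Assumption: population standard deviation (pstdev). Swap in
    statistics.stdev for the sample estimator if that is what is intended.
    """
    m = mean(scores)
    s = pstdev(scores)
    return f"{m:.{digits}f}±{s:.{digits}f}"


# Hypothetical per-seed scores for one method/category cell (3 seeds).
seed_scores = [95.0, 95.2, 95.4]
print(summarize(seed_scores))  # 95.2±0.2
```

With only 3 seeds, the choice of estimator noticeably changes the reported spread, so it is worth fixing one convention for all tables.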
TABLE A14: P-AUROC score for anomaly segmentation under the Overlap setting for all categories of MVTec 3D-AD. We inject 20% and 30% noise into the training dataset. Our method attains the highest mean score at both noise levels. We report the mean and standard deviation over 3 random seeds for each measurement. Optimal and sub-optimal results are in bold and underlined, respectively.
| Noise | Method | Bagel | Cable Gland | Carrot | Cookie | Dowel | Foam | Peach | Potato | Rope | Tire | Mean |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 20% | PatchCore+FPFH | 50.2±2.1 | 52.4±2.6 | 55.8±3.0 | 62.3±2.0 | 46.3±0.6 | 39.8±1.0 | 32.9±2.1 | 36.3±2.1 | 18.3±6.3 | 49.4±0.8 | 44.4±0.8 |
| 20% | AST | 83.1±0.0 | 91.9±0.6 | 95.8±1.0 | 83.6±1.0 | 89.1±0.6 | \ul{84.6±0.6} | 88.8±0.6 | 88.0±0.6 | 89.0±0.6 | 88.9±1.0 | 88.8±0.2 |
| 20% | Shape-Guided | \ul{89.9±0.3} | 91.0±0.5 | 97.0±0.1 | \ul{86.5±0.2} | 85.1±0.5 | 86.4±0.6 | \ul{84.9±0.3} | \ul{81.3±5.8} | \ul{94.7±0.4} | 91.1±5.6 | \ul{88.8±0.5} |
| 20% | M3DM | 52.4±2.3 | 53.0±3.1 | 56.1±2.7 | 65.8±1.9 | 47.3±1.3 | 51.3±0.7 | 34.4±2.3 | 37.0±2.1 | 18.7±6.0 | 50.7±0.4 | 46.7±0.8 |
| 20% | Ours | 97.8±0.4 | \ul{91.4±2.1} | \ul{96.4±0.6} | 94.3±0.1 | \ul{85.5±2.4} | 80.8±2.8 | 81.4±2.1 | 78.4±3.0 | 97.8±0.4 | 86.7±1.9 | 89.0±0.4 |
| 30% | PatchCore+FPFH | 24.0±4.5 | 26.8±1.3 | 34.5±2.8 | 40.6±3.4 | 21.3±2.6 | 17.4±0.9 | 8.8±1.6 | 8.0±2.1 | 8.2±3.2 | 25.8±2.7 | 21.6±1.4 |
| 30% | AST | 15.3±0.0 | 21.4±0.0 | 29.3±0.6 | 37.8±0.6 | 16.4±1.0 | 8.9±0.6 | 3.6±1.0 | 5.3±0.0 | 6.8±0.0 | 19.9±0.6 | 16.5±0.1 |
| 30% | Shape-Guided | \ul{90.7±0.7} | 89.4±0.5 | 96.6±0.3 | \ul{83.0±1.0} | \ul{80.9±5.8} | 80.8±4.8 | 90.0±5.1 | 81.6±5.7 | \ul{94.5±0.2} | 87.8±0.5 | \ul{87.5±0.9} |
| 30% | M3DM | 26.3±4.5 | 27.3±1.7 | 35.0±2.3 | 48.3±5.0 | 22.6±3.1 | 35.4±1.6 | 9.9±1.4 | 9.0±1.9 | 7.9±3.5 | 28.4±2.4 | 25.0±1.4 |
| 30% | Ours | 96.6±0.5 | \ul{89.3±1.5} | \ul{96.5±0.3} | 92.7±1.5 | 81.9±2.8 | \ul{79.9±2.7} | \ul{81.9±0.7} | \ul{74.8±1.3} | 97.8±0.4 | \ul{86.8±1.9} | 87.8±0.4 |
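Assuming P-AUROC denotes the pixel-level AUROC, i.e. the AUROC computed over all pixels of the test anomaly maps against the ground-truth masks, the metric can be sketched with a rank-based (Mann-Whitney U) formulation. The function name `pixel_auroc` and the toy anomaly map below are hypothetical and only illustrate the computation; this is not the evaluation code used for the tables.

```python
def pixel_auroc(scores, labels):
    """AUROC from flattened pixel scores and binary labels (1 = anomalous).

    Uses the Mann-Whitney U statistic; tied scores receive average ranks.
    """
    pairs = sorted(zip(scores, labels))
    n = len(pairs)
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Find the end of the current group of tied scores.
        j = i
        while j + 1 < n and pairs[j + 1][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1
    n_pos = sum(1 for _, y in pairs if y == 1)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need both normal and anomalous pixels")
    rank_sum_pos = sum(r for r, (_, y) in zip(ranks, pairs) if y == 1)
    u = rank_sum_pos - n_pos * (n_pos + 1) / 2
    return u / (n_pos * n_neg)


# Toy 2x2 anomaly map and ground-truth mask (hypothetical values).
anomaly_map = [[0.1, 0.4], [0.35, 0.8]]
gt_mask = [[0, 0], [1, 1]]
flat_scores = [s for row in anomaly_map for s in row]
flat_labels = [y for row in gt_mask for y in row]
print(pixel_auroc(flat_scores, flat_labels))  # 0.75
```

The result is a fraction in [0, 1]; multiplying by 100 gives the percentage scale used in the tables.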
TABLE A15: P-AUROC score for anomaly segmentation under the Non-Overlap setting for all categories of MVTec 3D-AD. We inject 20% and 30% noise into the training dataset. Our method attains the highest mean score at both noise levels. We report the mean and standard deviation over 3 random seeds for each measurement. Optimal and sub-optimal results are in bold and underlined, respectively.
| Noise | Method | Bagel | Cable Gland | Carrot | Cookie | Dowel | Foam | Peach | Potato | Rope | Tire | Mean |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 20% | PatchCore+FPFH | 99.5±0.0 | 99.1±0.3 | 98.2±0.1 | 98.7±0.1 | 98.0±0.8 | 97.6±0.1 | 99.3±0.1 | 98.0±0.3 | 99.7±0.2 | \ul{99.5±0.1} | \ul{98.8±0.0} |
| 20% | AST | 97.9±0.6 | 97.8±0.0 | \ul{99.4±1.0} | 92.2±0.6 | 93.1±0.0 | 99.2±0.6 | 99.5±1.0 | 99.8±0.6 | 97.5±0.0 | 98.6±1.0 | 97.5±0.1 |
| 20% | Shape-Guided | 97.5±1.7 | 97.5±0.5 | 98.9±0.3 | 95.2±0.3 | 97.8±0.4 | 93.3±3.3 | 97.1±0.3 | 98.9±0.2 | 95.6±1.4 | 99.2±0.7 | 97.1±0.4 |
| 20% | M3DM | 99.0±0.2 | 98.8±0.3 | 98.1±0.1 | 96.8±0.1 | 97.8±1.0 | 95.4±0.1 | 99.4±0.1 | 99.0±0.2 | 99.8±0.1 | 99.5±0.2 | 98.4±0.0 |
| 20% | Ours | 99.5±0.1 | \ul{99.0±0.1} | 99.6±0.1 | \ul{97.3±0.3} | \ul{97.9±0.6} | \ul{97.7±0.2} | \ul{99.5±0.1} | \ul{99.1±0.1} | \ul{99.8±0.2} | 99.3±0.2 | 98.9±0.1 |
| 30% | PatchCore+FPFH | 99.6±0.1 | 99.3±0.1 | 98.2±1.3 | 98.5±0.2 | \ul{98.1±1.3} | 98.3±0.5 | 99.5±0.2 | 95.3±7.8 | 99.6±0.6 | 99.6±0.1 | \ul{98.6±0.7} |
| 30% | AST | 91.0±1.0 | 96.3±0.6 | \ul{99.1±1.0} | 92.4±0.6 | 95.6±0.0 | \ul{97.4±0.6} | 99.7±0.6 | 100.2±0.6 | 97.7±0.6 | 98.6±1.0 | 96.8±0.2 |
| 30% | Shape-Guided | 93.3±1.3 | 93.8±1.0 | 98.2±0.3 | 90.5±1.2 | 97.0±1.1 | 85.6±2.8 | 92.7±0.0 | 96.7±0.5 | 92.8±1.9 | 98.6±0.7 | 93.9±0.2 |
| 30% | M3DM | 99.1±0.2 | \ul{99.3±0.4} | 98.0±1.4 | 96.0±0.8 | 97.6±1.1 | 96.6±1.9 | \ul{99.6±0.1} | 98.2±2.0 | 99.7±0.5 | \ul{99.5±0.3} | 98.4±0.2 |
| 30% | Ours | \ul{99.5±0.2} | 98.7±0.3 | 99.6±0.1 | \ul{97.1±0.6} | 98.5±0.3 | 97.4±1.3 | \ul{99.6±0.1} | \ul{98.8±0.3} | \ul{99.6±0.6} | 99.2±0.4 | 98.8±0.0 |