3.1. ViT-Query
Feature extraction: Whereas PatchCore uses a WideResNet50 [39] pre-trained on ImageNet [40], ViT-Query utilizes a ViT trained on a text–image matching task to extract features from normal input images. Training on this task equips the model to discriminate higher-level semantic information. We divide the ViT into four stages, each comprising six layers. This structure allows more specialized processing and feature extraction at different levels of the transformer, enhancing the model's ability to capture and exploit hierarchical semantic information. We then concatenate the features from every stage, as shown in Figure 3, where $I$ is the normal image, $f_i$ represents the feature extracted by stage $i$, and $L$ is the list of features of the image. The process can be represented as follows:

$$f_i = \mathrm{Stage}_i(I), \qquad L = \mathrm{Concat}(f_1, f_2, f_3, f_4)$$
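As a concrete illustration, the staged extraction might look like the sketch below. It assumes a timm-style 24-layer ViT whose transformer blocks are exposed as `vit.blocks` and whose embedding helpers are `patch_embed` and `_pos_embed`; the six-blocks-per-stage split follows the text, but the code itself is our illustration, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class StagedViTExtractor(nn.Module):
    """Splits a 24-layer ViT into 4 stages of 6 blocks each and collects
    the patch tokens emitted at the end of every stage."""
    def __init__(self, vit: nn.Module, layers_per_stage: int = 6):
        super().__init__()
        self.vit = vit
        self.layers_per_stage = layers_per_stage

    @torch.no_grad()
    def forward(self, images: torch.Tensor) -> list[torch.Tensor]:
        # Tokenize: patch embedding + positional encoding (timm-style ViT).
        x = self.vit.patch_embed(images)
        x = self.vit._pos_embed(x)
        features = []  # L: the list of per-stage features f_1..f_4
        for i, block in enumerate(self.vit.blocks):
            x = block(x)
            if (i + 1) % self.layers_per_stage == 0:
                features.append(x)  # f_i for stage i
        return features
```

A caller would then concatenate the returned list along the channel axis, e.g. `torch.cat(StagedViTExtractor(vit)(batch), dim=-1)`, mirroring the concatenation in the formula above.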
Neighbor feature aggregation: Using patch tokens with multiple aggregation degrees, instead of a single patch size, represents anomalies of varying sizes more effectively and yields higher-quality anomaly scores. Inspired by Li et al. [41], and unlike PatchCore's direct use of a single-size kernel for aggregation, for the $i$-th patch token $t_{i,j}$ extracted by the chosen ViT stage $j$, we utilize a $k \times k$ kernel to obtain the aggregated neighborhood patch token $\tilde{t}_{i,j}$ through the average pooling function and upsample the aggregated patch token to the same size as the initial one, as shown in Figure 3. $C$, $H$, and $W$ are the channel, height, and width of the initial features, respectively. Then, we concatenate the aggregated patch tokens; $L_a$ is the list of aggregated features. The process can be represented as follows:

$$\tilde{t}_{i,j} = \mathrm{Upsample}\big(\mathrm{AvgPool}_{k \times k}(t_{i,j})\big) \in \mathbb{R}^{C \times H \times W}, \qquad L_a = \mathrm{Concat}\big(\{\tilde{t}_{i,j}\}_k\big)$$
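A minimal sketch of this aggregation step, assuming square average-pooling kernels of sizes 3/5/7 (illustrative values; the paper's exact sizes are given in Figure 3) applied to patch tokens reshaped into a (B, C, H, W) grid:

```python
import torch
import torch.nn.functional as F

def aggregate_neighborhood(tokens: torch.Tensor, kernel_sizes=(3, 5, 7)) -> torch.Tensor:
    """tokens: (B, C, H, W) patch tokens reshaped to a spatial grid.
    Average-pools with several kernel sizes, upsamples each result back
    to (H, W), and concatenates along the channel axis."""
    _, _, h, w = tokens.shape
    aggregated = []
    for k in kernel_sizes:
        pooled = F.avg_pool2d(tokens, kernel_size=k, stride=1, padding=k // 2)
        # With stride 1 and same-padding the spatial size is preserved, so
        # this interpolate is a no-op; it mirrors the paper's explicit
        # upsampling step, which matters for strided variants.
        pooled = F.interpolate(pooled, size=(h, w), mode="bilinear", align_corners=False)
        aggregated.append(pooled)
    return torch.cat(aggregated, dim=1)
```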
Coreset sampling: As the feature map size and count grow, the memory bank requirements expand and the inference time rises significantly. ViT-Query addresses this issue by employing greedy coreset subsampling, which reduces the number of stored features. Coreset subsampling produces a subset $\mathcal{M}_C \subset \mathcal{M}$ by utilizing the iterative greedy approximation [42].
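The iterative greedy approximation [42] is the standard k-center greedy selection: repeatedly add the feature farthest from the current coreset. A sketch, in which the sampling ratio and the random seed point are illustrative choices rather than the paper's settings:

```python
import torch

def greedy_coreset(features: torch.Tensor, ratio: float = 0.1) -> torch.Tensor:
    """k-center greedy subsampling. features: (N, D); returns (m, D)."""
    n = features.shape[0]
    m = max(1, int(n * ratio))
    selected = [torch.randint(n, (1,)).item()]      # random seed point
    # min_dist[i] = distance from feature i to its nearest selected point
    min_dist = torch.cdist(features, features[selected]).squeeze(1)
    for _ in range(m - 1):
        idx = int(torch.argmax(min_dist))           # farthest remaining point
        selected.append(idx)
        new_dist = torch.cdist(features, features[idx:idx + 1]).squeeze(1)
        min_dist = torch.minimum(min_dist, new_dist)
    return features[selected]
```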
Anomaly detection: With the downsampled patch-level feature memory bank $\mathcal{M}$, ViT-Query queries the anomaly score $s$ via the maximum distance $s^*$ between the input patch features in its patch collection $\mathcal{P}(x^{test})$ and their nearest neighbors $m^*$ in $\mathcal{M}$:

$$m^{test,*}, m^* = \arg\max_{m^{test} \in \mathcal{P}(x^{test})} \; \arg\min_{m \in \mathcal{M}} \left\| m^{test} - m \right\|_2, \qquad s^* = \left\| m^{test,*} - m^* \right\|_2$$

During testing, there may be rare cases where a normal feature $m^*$ in memory bank $\mathcal{M}$ is close to the test feature $m^{test,*}$ but has low similarity to the other normal features. Therefore, the anomaly score $s$ is adjusted using the following Formula (7), proposed by PatchCore [27]:

$$s = \left(1 - \frac{\exp\left\| m^{test,*} - m^* \right\|_2}{\sum_{m \in \mathcal{N}_b(m^*)} \exp\left\| m^{test,*} - m \right\|_2}\right) \cdot s^* \tag{7}$$

where $\mathcal{N}_b(m^*)$ is the set of $b$ nearest patch features in $\mathcal{M}$ for the test patch feature $m^{test,*}$.
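The scoring translates directly into a pair of nearest-neighbor queries. The sketch below assumes $b = 3$ and reads $\mathcal{N}_b$ as the $b$ nearest memory features to the maximal test patch, one common implementation of the PatchCore reweighting; it is not the authors' code.

```python
import torch

def patchcore_score(test_patches: torch.Tensor, memory: torch.Tensor, b: int = 3) -> torch.Tensor:
    """test_patches: (P, D) patch features of one test image,
    memory: (M, D) coreset memory bank. Returns the image-level score s."""
    dists = torch.cdist(test_patches, memory)       # (P, M) pairwise distances
    nn_dist, _ = dists.min(dim=1)                   # nearest-neighbor distance per patch
    p_star = int(nn_dist.argmax())                  # patch achieving the max distance s*
    s_star = nn_dist[p_star]
    # Reweighting (Formula (7)): softmax-like weight over the b nearest
    # memory features N_b for the maximal test patch.
    nb = dists[p_star].topk(b, largest=False).values
    s = (1 - torch.exp(s_star) / torch.exp(nb).sum()) * s_star
    return s
```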
3.2. Anomaly Generator
On steel surfaces with complex textures, defects occur in various forms: typically small areas of irregular defects and large areas of regular defects. Moreover, data collection cannot cover all defect types, and the collected samples are limited, which significantly restricts the use of supervised learning methods for modeling. We therefore designed a more effective strategy, an anomaly generator, to composite defective samples and introduce them during training. These synthetic defective samples, based on random noise and random shape generation, exhibit randomness in the specific locations of defects. The proposed anomaly synthesis strategy consists of three steps: generating a mask, adding representation information, and generating synthetic defect images, as shown in Figure 4.
Generate mask: In creating anomalies, we start by generating two-dimensional Perlin noise [43], which is then binarized using a threshold $T$ to produce a binary mask. This mask contains random peaks and valleys, which allows the model to extract the features of continuous regions or blocks within the image. To composite small-scale irregular defects on steel surfaces, further processing is applied to the binarized Perlin-noise mask: multiple connected-domain blocks within the mask are retained, ensuring that each preserved block has a minimum area of $S_{min}$. These regions are also randomly located, owing to the randomness of the noise, and each is scaled by a random scale factor to alter its size. The preserved regions serve as labels for forged defects, effectively mimicking the small-scale irregularities commonly encountered in industrial environments. For regular defects, a different approach is needed: we generate binary masks with a regular-shaped area, as shown in Figure 4. The masks generated during this process are defined as $M$.
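A sketch of the connected-domain filtering, assuming the Perlin noise itself is produced elsewhere (e.g., by a DRAEM-style generator) and using SciPy's `ndimage.label` for region extraction; the threshold and minimum area below are placeholder values, not the paper's settings:

```python
import numpy as np
from scipy import ndimage

def make_defect_mask(noise: np.ndarray, threshold: float = 0.5, min_area: int = 64) -> np.ndarray:
    """noise: 2-D Perlin noise in [0, 1]. Binarizes it with `threshold`,
    then keeps only connected components whose area is at least `min_area`."""
    mask = (noise > threshold).astype(np.uint8)
    labeled, num = ndimage.label(mask)          # label connected regions
    out = np.zeros_like(mask)
    for region in range(1, num + 1):
        component = labeled == region
        if component.sum() >= min_area:
            out[component] = 1                  # retain sufficiently large blobs
    return out
```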
Add representation information: In generating the mask image $I_n$ carrying the defect information, two methods are introduced to add that information: filling the masked region with a value $v$ that lies outside the normal distribution, or filling it with a reference image $I_r$ from the texture dataset DTD [44]. We assume that the grayscale of a normal image of steel with complex textures conforms to a Gaussian distribution. The value $v$ is calculated as follows:

$$\mu = \mathrm{Mean}\big(G(I)\big), \qquad \sigma^2 = \mathrm{Var}\big(G(I)\big), \qquad v = \mu + \sigma \cdot \Phi^{-1}(p), \quad p \sim U(a, b)$$

where $\mu$ and $\sigma^2$ are the mean and variance of the normal input image, respectively; $G(x, y)$ is the function used to obtain the grayscale value at coordinate $(x, y)$; $\mathrm{Mean}$ and $\mathrm{Var}$ are the functions used to calculate the mean and variance of the image's grayscale values, respectively; $\Phi^{-1}(p)$ is the value of the inverse of the standard normal distribution function at $p$; and $U$ is a uniform distribution function. The $I_n$ in this method is as follows:

$$I_n = M \odot v$$

where $\odot$ is the element-wise multiplication operation. For the method utilizing texture to add the defect information, the image $I_r$ is a random image drawn from the chosen DTD images, and $I_n$ is expressed as follows:

$$I_n = M \odot I_r$$
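Under the Gaussian assumption, the out-of-distribution grayscale value can be sampled as below; the tail range $(p_{low}, p_{high})$ is an assumed choice rather than the paper's exact interval:

```python
import numpy as np
from scipy.stats import norm

def ood_gray_value(image: np.ndarray, p_low: float = 0.995, p_high: float = 0.9999) -> float:
    """Samples a grayscale value from the tail of the Gaussian fitted to the
    normal image's intensities: v = mu + sigma * Phi^{-1}(p), p ~ U(p_low, p_high)."""
    mu, sigma = image.mean(), image.std()
    p = np.random.uniform(p_low, p_high)
    v = mu + sigma * norm.ppf(p)                # inverse standard normal CDF
    return float(np.clip(v, 0, 255))            # keep within the 8-bit range
```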
Generate synthetic defect images: This step inverts the binarized mask $M$ to $\bar{M} = 1 - M$. It then computes the element-wise product of $\bar{M}$ and the original image $I$ to obtain the image $I'$ and finally obtains the augmented image $I_A$. The process is based on the following formula:

$$I' = \bar{M} \odot I, \qquad I_A = I' + M \odot I_n$$
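The composition step is a straightforward masked blend; a minimal sketch, assuming float arrays with a binary 0/1 mask:

```python
import numpy as np

def composite_defect(image: np.ndarray, mask: np.ndarray, filler: np.ndarray) -> np.ndarray:
    """image: normal image I; mask: binary defect mask M;
    filler: defect content I_n (a constant OOD value map or a DTD texture).
    Returns the augmented image I_A = (1 - M) * I + M * I_n."""
    inv_mask = 1 - mask
    background = inv_mask * image               # I': defect region blanked out
    return background + mask * filler           # paste defect content into the hole
```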
Using the above strategy, we obtain synthetic anomaly images from the perspectives of both texture and structure. Because defect types and positions on steel with complex textures are random, our synthetic defects may appear anywhere in the image, maximizing the similarity between synthetic and real anomaly samples. Compared to RGB images, which have $256^3$ possible colors, grayscale images have only 256 grayscale values. Owing to this characteristic, when we use out-of-distribution grayscale values to fill the image and create defects, the resulting images are more likely to resemble real defect images, as shown in Figure 5.
3.3. GalvaNet
In this section, we introduce a novel network architecture, GalvaNet, which consists of two stages: aggregated attention and the classification-aided module, as shown in Figure 1. The model takes the synthetic defective images and the anomaly maps derived from ViT-Query as inputs. GalvaNet aims to leverage the valuable global information from ViT-Query and enhance the accuracy of defect segmentation and localization; essentially, it augments the model's capacity to acquire local information while performing segmentation tasks [45].
Image weighting scheme (IWS): During the aggregated attention stage, GalvaNet adopts the synthetic defective images and the anomaly maps obtained from ViT-Query as its inputs. Manually setting the weight ratio of the two inputs is cumbersome and inefficient. To address this, GalvaNet employs a channel attention mechanism to integrate the input images with the anomaly maps, rather than using it to re-weight features. The mechanism learns the importance of each channel of the input feature map; by dynamically adjusting channel weights, the model can focus on task-relevant information while suppressing irrelevant or noisy channels [46].
In this mechanism, the input feature maps $X$ and the anomaly maps $A$ obtained from ViT-Query serve as the basis for channel attention. The channel attention mechanism integrates these inputs through two fully connected layers and an average pooling operation to perform channel-level weighted fusion. This fusion generates a new feature map $F$ with enhanced feature representation, emphasizing important channels and suppressing less-relevant ones. Compared with directly weighting the input image and the anomaly map, the model can adaptively assign weights to the different inputs to improve segmentation performance. The channel attention expression is as follows:

$$w = \sigma\Big(W_2\,\delta\big(W_1\,\mathrm{AvgPool}\big(\mathrm{Concat}(X, A)\big)\big)\Big)$$

where $\sigma$ represents the sigmoid activation function, $\delta$ represents the ReLU activation function, $\mathrm{AvgPool}$ represents the average pooling operation, $X$ represents the synthetic defect image, $A$ represents the anomaly map, and $W_1$ and $W_2$ are the weight matrices of the two fully connected layers. The input images weighted by channel attention can be represented as follows:

$$F = w \odot \mathrm{Concat}(X, A)$$
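A minimal sketch of the IWS fusion as SE-style channel attention over the concatenated image and anomaly map; the two-channel layout assumes grayscale inputs, and the layer sizes are illustrative:

```python
import torch
import torch.nn as nn

class ImageWeightingScheme(nn.Module):
    """Learns fusion weights over [image, anomaly map] channels,
    replacing a hand-tuned weight ratio between the two inputs."""
    def __init__(self, channels: int = 2):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels),  # W1
            nn.ReLU(inplace=True),          # delta
            nn.Linear(channels, channels),  # W2
            nn.Sigmoid(),                   # sigma
        )

    def forward(self, image: torch.Tensor, anomaly_map: torch.Tensor) -> torch.Tensor:
        x = torch.cat([image, anomaly_map], dim=1)   # (B, 2, H, W) for grayscale input
        w = x.mean(dim=(2, 3))                       # global average pooling -> (B, 2)
        w = self.fc(w).unsqueeze(-1).unsqueeze(-1)   # per-channel weights
        return x * w                                 # re-weighted fused input F
```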
Multi-scale biaxial cross-attention (MBCA): Inspired by EMA and MCA, we design a novel method named multi-scale biaxial cross-attention, which integrates multi-scale features and axial features to better segment defect regions of various sizes and shapes. As illustrated in Figure 6, MBCA utilizes convolutional kernels of different sizes and shapes to capture local information from various defect regions as comprehensively as possible, thereby enhancing the model's ability to represent defect features. The MBCA module divides the input feature map $F$ into $g$ groups, and each group passes through three branches. The specific feature extraction of the multi-scale $x$-axis convolution and multi-scale $y$-axis convolution is illustrated in Figure 7. When performing feature extraction with axial cross-attention, each input feature map undergoes LayerNorm, followed by three strip convolutions of different shapes and then a $1 \times 1$ convolution operation. The formulation can be written as follows:

$$F_x = \mathrm{Conv1D}\Big(\sum_k \mathrm{Conv}_{1 \times k}\big(\mathrm{LN}(F)\big)\Big), \qquad F_y = \mathrm{Conv1D}\Big(\sum_k \mathrm{Conv}_{k \times 1}\big(\mathrm{LN}(F)\big)\Big)$$

where $F_x$ and $F_y$ denote the outputs of the multi-scale axis convolutions and $\mathrm{Conv1D}$ is a 1D convolution operation. The specific kernel sizes are shown in Figure 7. Then, $F_x$ and $F_y$ are utilized to calculate the multi-head cross-attention between them. The calculation process can be written as follows:

$$A_x = \mathrm{MA}(F_x, F_y), \qquad A_y = \mathrm{MA}(F_y, F_x)$$

where $\mathrm{MA}$ is multi-head cross-attention. When $A_x$ and $A_y$ are obtained, the output of biaxial cross-attention is calculated as follows:

$$F_{xy} = A_x + A_y$$
We represent the feature map obtained through the $3 \times 3$ convolutional kernel as $F_{3 \times 3}$. We apply the $\mathrm{Softmax}$ and $\mathrm{AvgPool}$ operations to obtain their respective attention matrices as follows:

$$W_{xy} = \mathrm{Softmax}\big(\mathrm{AvgPool}(F_{xy})\big), \qquad W_{3 \times 3} = \mathrm{Softmax}\big(\mathrm{AvgPool}(F_{3 \times 3})\big)$$

After that, we use cross multiplication to obtain the feature maps enhanced by the attention matrices, as shown in the following formula:

$$\hat{F}_{xy} = W_{3 \times 3} \cdot F_{xy}, \qquad \hat{F}_{3 \times 3} = W_{xy} \cdot F_{3 \times 3}$$

The feature map improved by MBCA attention can be computed through the following process:

$$F_{MBCA} = \sigma\big(\hat{F}_{xy} + \hat{F}_{3 \times 3}\big) \odot F$$

Since the re-weighting matrix is calculated per group and there is no communication between groups, we introduce global feature information by concatenating the input matrix with the group outputs to perform global feature fusion, as follows:

$$F_{out} = \mathrm{Conv}_{1 \times 1}\big(\mathrm{Concat}(F, F_{MBCA})\big)$$
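One MBCA branch, the multi-scale axis convolution, might be sketched as follows. The strip lengths (7/11/21) and the GroupNorm stand-in for LayerNorm are our assumptions; the paper's exact kernels are given in Figure 7.

```python
import torch
import torch.nn as nn

class MultiScaleAxisConv(nn.Module):
    """LayerNorm, three strip convolutions along one axis, then a 1x1 conv."""
    def __init__(self, channels: int, lengths=(7, 11, 21), axis: str = "x"):
        super().__init__()
        self.norm = nn.GroupNorm(1, channels)   # LayerNorm-like over (C, H, W)
        self.strips = nn.ModuleList([
            nn.Conv2d(channels, channels,
                      kernel_size=(1, k) if axis == "x" else (k, 1),
                      padding=(0, k // 2) if axis == "x" else (k // 2, 0),
                      groups=channels)          # depthwise strip convolution
            for k in lengths
        ])
        self.proj = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.norm(x)
        out = sum(conv(x) for conv in self.strips)   # fuse multi-scale strips
        return self.proj(out)                        # 1x1 projection
```

Applying one instance with `axis="x"` and one with `axis="y"` to each group yields the $F_x$ and $F_y$ fed into the multi-head cross-attention step above.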
Coordinate spatial attention (CSA): After obtaining the feature map, we introduce coordinate spatial attention, built on coordinate attention and spatial attention, to improve GalvaNet's perception of coordinates and space. Through the spatial attention mechanism, the model dynamically adjusts the weight of each position in the feature map, allowing it to focus more on important information, such as object boundaries and texture [35]. Additionally, coordinate attention attends to position information in the features by computing a coordinate encoding for each position in the feature map [36], thereby enhancing the model's ability to utilize positional information and improving its generalization capability. Our proposed attention fusion module therefore helps the model better focus on the spatial and positional information in the feature map, enhancing the segmentation performance of GalvaNet. The process of CSA is shown in Figure 8. First, we obtain the attention matrix of the feature map at the coordinate level:

$$F_p = \mathrm{Concat}\big(\mathrm{MaxPool}(F),\; \mathrm{AvgPool}(F)\big) \tag{29}$$

$$F_c = \mathrm{CA}\big(\mathrm{DWConv}(F_p)\big) \in \mathbb{R}^{C \times H \times W} \tag{30}$$

In Formula (29), $\mathrm{MaxPool}$ and $\mathrm{AvgPool}$ represent the max pooling and average pooling of the input feature maps, respectively, and the $\mathrm{Concat}$ operation concatenates the obtained feature maps along the channel dimension. In Formula (30), after passing through the deep convolutional module $\mathrm{DWConv}$, the feature map that focuses more on image position information, obtained through coordinate attention $\mathrm{CA}$, is denoted $F_c$. This process computes the coordinate encoding of each position, such as row and column indices, to generate a unique weight matrix, where $H$, $W$, and $C$ represent the height, width, and number of channels of the input feature map, respectively. Then, $F_s$ denotes the feature map with the enhanced spatial information. CSA also uses average pooling to compute the attention matrix of the input feature map at the spatial level, enhancing the model's ability to capture spatial features, as follows:

$$F_s = \sigma\big(\mathrm{AvgPool}(F)\big) \odot F \tag{31}$$

We believe that galvanized steel surfaces with complex textures contain many overlapping features, so GhostConv [47] is introduced to obtain more non-redundant features $F_g$, as shown in Formula (32):

$$F_g = \mathrm{GhostConv}(F_s) \tag{32}$$

Finally, we multiply the obtained attention matrix by the input feature map and concatenate the product with the input feature map along the channel dimension to obtain the feature map $F_{CSA}$. The feature map, enhanced by the coordinate spatial attention mechanism, is mapped back to a single channel as the segmentation mask output:

$$F_{CSA} = \mathrm{Concat}\big(F_c \odot F_g,\; F\big), \qquad \mathrm{Mask} = \mathrm{Conv}_{1 \times 1}(F_{CSA}) \tag{33}$$
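A condensed sketch of the spatial half of CSA, in the CBAM style the text describes; the coordinate-attention and GhostConv branches are omitted for brevity, and the 7×7 kernel is an assumed choice:

```python
import torch
import torch.nn as nn

class SpatialAttentionBranch(nn.Module):
    """Max/avg pooling across channels -> conv -> sigmoid spatial map,
    multiplied onto the input and concatenated with it."""
    def __init__(self, channels: int):
        super().__init__()
        self.spatial = nn.Sequential(
            nn.Conv2d(2, 1, kernel_size=7, padding=3),
            nn.Sigmoid(),
        )
        self.fuse = nn.Conv2d(channels * 2, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        pooled = torch.cat([x.max(dim=1, keepdim=True).values,
                            x.mean(dim=1, keepdim=True)], dim=1)  # (B, 2, H, W)
        attn = self.spatial(pooled)                               # spatial weight map
        weighted = x * attn
        return self.fuse(torch.cat([weighted, x], dim=1))         # stitch and fuse
```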
The remaining network structure details of aggregated attention are shown in
Table 1.
Classification-aided module (CAM): To give GalvaNet better interpretability and more accurate anomaly maps, the classification-aided module is designed to improve segmentation performance by incorporating a classification task. The fundamental requirement for a high-quality anomaly map is that its users can recognize anomalies clearly, which in turn improves segmentation performance. The classification-aided module is shown in Table 2.
The feature processing and classification pipeline begins with the fusion of the single-channel mask output from the aggregated attention part and the 1024-channel feature map obtained from the deep convolutional modules. This fusion, conducted along the channel dimension, yields a comprehensive 1025-channel feature map, which then undergoes convolutional dimensionality reduction into a more manageable 32-channel feature map. Next, both the 32-channel feature map and the single-channel mask pass through global max pooling (GMP) and global average pooling (GAP) operations. Applied to the 32-channel feature map, these pooling operations yield a 64-dimensional feature representation, ensuring that global image features are effectively captured.

Simultaneously, the single-channel mask, representing segmentation details, passes through the same pooling operations, generating a 2-dimensional feature representation that encapsulates segmentation-specific information.

Following the pooling operations, the 64-dimensional representation from the feature map is concatenated with the 2-dimensional segmentation feature, resulting in a 66-dimensional feature vector that integrates global image features with localized segmentation details. Finally, this 66-dimensional feature vector is input into a multi-layer perceptron (MLP) network for classification.
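The CAM pipeline maps cleanly onto a small module. In the sketch below, the 3×3 reduction kernel, the MLP widths, and the two-class output are our assumptions, while the 1025 → 32 channel reduction and the 64 + 2 = 66-dimensional vector follow the text:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ClassificationAidedModule(nn.Module):
    """Fuses the 1024-channel deep features with the 1-channel mask,
    reduces to 32 channels, pools, and classifies with an MLP."""
    def __init__(self):
        super().__init__()
        self.reduce = nn.Conv2d(1024 + 1, 32, kernel_size=3, padding=1)
        self.mlp = nn.Sequential(nn.Linear(66, 128), nn.ReLU(inplace=True),
                                 nn.Linear(128, 2))   # normal vs. defective

    def forward(self, feat: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        fused = self.reduce(torch.cat([feat, mask], dim=1))       # 1025 -> 32 channels
        # GMP + GAP on the 32-channel features -> 64-dim vector
        f = torch.cat([F.adaptive_max_pool2d(fused, 1),
                       F.adaptive_avg_pool2d(fused, 1)], dim=1).flatten(1)
        # GMP + GAP on the single-channel mask -> 2-dim vector
        m = torch.cat([F.adaptive_max_pool2d(mask, 1),
                       F.adaptive_avg_pool2d(mask, 1)], dim=1).flatten(1)
        return self.mlp(torch.cat([f, m], dim=1))                 # 66-dim -> logits
```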