The StarCAN network structure is extremely simple, consisting of an initial Stem layer followed by four feature extraction stages. An input image passes sequentially through the Stem layer and the four stages, which progressively extract features and output a feature map at each stage, enabling efficient image classification.
First, the 3-channel input image passes through the Stem layer, where initial feature extraction is performed by a convolutional layer and a ReLU activation function, producing a feature map with 32 channels and halved spatial resolution. In each stage, downsampling is performed by a convolutional layer with a stride of 2, which halves the spatial resolution of the feature map while doubling the number of channels. Each stage also includes a StarCAA Block module. Each Block consists of depthwise separable convolution, a regular convolution layer, batch normalization, a ReLU activation function, the Context Anchor Attention (CAA) mechanism, and Drop Path for stochastic depth regularization. These Blocks progressively extract and process feature maps, converting low-level features into high-level features and thus providing rich representations for the final classification task. The complete structure of the StarCAN network is shown in Figure 4.
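To make the layout concrete, the following is a minimal PyTorch sketch of the pipeline just described. The class and parameter names, the per-stage block counts, and the classification head are illustrative assumptions; only the 32-channel halving Stem, the stride-2 downsampling, and the channel doubling follow the text.

```python
import torch
import torch.nn as nn


class StarCAABlock(nn.Module):
    """Placeholder for the block sketched in Section 3.1.1."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Identity()

    def forward(self, x):
        return self.body(x)


class StarCAN(nn.Module):
    def __init__(self, num_classes=1000, stem_channels=32,
                 blocks_per_stage=(1, 1, 1, 1)):
        super().__init__()
        # Stem: conv + ReLU, 3 -> 32 channels, spatial resolution halved.
        self.stem = nn.Sequential(
            nn.Conv2d(3, stem_channels, 3, stride=2, padding=1),
            nn.ReLU(inplace=True),
        )
        stages, c = [], stem_channels
        for n in blocks_per_stage:
            # Downsampling conv: stride 2 halves resolution, channels doubled.
            layers = [nn.Conv2d(c, c * 2, 3, stride=2, padding=1)]
            layers += [StarCAABlock(c * 2) for _ in range(n)]
            stages.append(nn.Sequential(*layers))
            c *= 2
        self.stages = nn.Sequential(*stages)
        # Classification head (an assumption; the text stops at the feature maps).
        self.head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                  nn.Linear(c, num_classes))

    def forward(self, x):
        x = self.stem(x)    # 1/2 resolution, 32 channels
        x = self.stages(x)  # each stage halves resolution, doubles channels
        return self.head(x)


model = StarCAN(num_classes=10)
print(model(torch.randn(1, 3, 224, 224)).shape)  # torch.Size([1, 10])
```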
3.1.1. StarCAA Blocks
StarCAA Blocks are the core modules of the StarCAN network; they highlight the roles of element-wise multiplication and the CAA attention mechanism while incorporating multiple convolution operations. The process begins with a depthwise separable convolution (DWConv) that efficiently extracts initial features while reducing computational complexity. Next, two parallel convolution layers transform the features, and a ReLU activation function applies a nonlinear mapping. Element-wise multiplication then merges the outputs of the two convolution branches, producing more representative features. Following this, another depthwise separable convolution and a pointwise convolution (g) further refine the feature map. The Context Anchor Attention (CAA) mechanism is then introduced, adaptively recalibrating feature responses by assigning different weights to each channel, thereby highlighting important features and significantly enhancing the discriminative power of the feature map. In addition, Drop Path improves the network's robustness and generalization ability. Finally, the input feature map is added to the processed feature map to form a residual connection, further enhancing network performance.
By integrating these components, the Block module gradually extracts and enhances relevant features, improving the network's overall feature representation and classification performance. This multi-layer stacking implicitly maps the input features into an exceedingly high-dimensional, nonlinear feature space while the network operates in a low-dimensional space.
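A minimal PyTorch sketch of this composition follows, assuming the branch layout described above; the kernel sizes, expansion ratio, and exact layer ordering are illustrative assumptions, and CAA stands in for the module detailed in Section 3.1.2.

```python
import torch
import torch.nn as nn

CAA = nn.Identity  # placeholder; the real module is sketched in Section 3.1.2


class StarCAABlock(nn.Module):
    def __init__(self, dim, expansion=4, drop_prob=0.0):
        super().__init__()
        self.dwconv1 = nn.Conv2d(dim, dim, 7, padding=3, groups=dim)  # depthwise conv
        self.bn = nn.BatchNorm2d(dim)
        self.f1 = nn.Conv2d(dim, dim * expansion, 1)  # two parallel conv branches
        self.f2 = nn.Conv2d(dim, dim * expansion, 1)
        self.act = nn.ReLU()
        self.g = nn.Conv2d(dim * expansion, dim, 1)   # pointwise convolution "g"
        self.dwconv2 = nn.Conv2d(dim, dim, 7, padding=3, groups=dim)
        self.caa = CAA(dim)                           # Context Anchor Attention
        self.drop_prob = drop_prob

    def forward(self, x):
        identity = x
        h = self.bn(self.dwconv1(x))
        h = self.act(self.f1(h)) * self.f2(h)      # element-wise multiplication
        h = self.dwconv2(self.g(h))
        h = self.caa(h)
        if self.training and self.drop_prob > 0:   # Drop Path: randomly drop the branch
            keep = (torch.rand(h.shape[0], 1, 1, 1, device=h.device)
                    >= self.drop_prob).float()
            h = h * keep / (1.0 - self.drop_prob)
        return identity + h                        # residual connection
```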
3.1.2. The Context Anchor Attention (CAA)
The Context Anchor Attention (CAA) mechanism is integrated within the StarCAA Blocks, as illustrated in Figure 4, sharing the input parameters of the backbone network. The CAA mechanism uses global average pooling and one-dimensional depthwise separable convolutions to capture relationships between distant pixels and enhance features in the central region. The specific structure of this attention module is depicted in Figure 4. The attention mechanism starts with the input feature map $X$, which is processed through an average pooling layer to generate local region features $F_{\text{pool}}$:

$$F_{\text{pool}} = \mathrm{AvgPool}_{7 \times 7}(X)$$
The pooling layer is configured with a kernel size of 7, a stride of 1, and a padding of 3. Within the average pooling layer, the input feature maps are aggregated and dimensionality-reduced. Aggregation computes the mean of local regions to summarize features, smoothing the feature maps and reducing the impact of noise, thus extracting more robust features. This mitigates the model's tendency to overfit local noise and detail, enhancing its generalization capability, which is particularly beneficial in noisy environments such as autonomous driving. Dimensionality reduction decreases the feature map dimensions, reducing data volume and computational load and thereby improving the efficiency of subsequent convolution operations. The padding and stride settings ensure that the pooled feature map retains the same size as the input, preserving global information; this enables the model to consider global context when processing local features. After average pooling, the feature maps contain less redundant information, which reduces computational complexity and enhances the model's efficiency. Lowering the model's computational resource demands is crucial for the subsequent calculation of the attention factor.
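A quick check of this configuration in PyTorch (the tensor sizes are illustrative): with kernel 7, stride 1, and padding 3, the pooled map keeps the input's spatial dimensions.

```python
import torch
import torch.nn as nn

# Average pooling as configured above: kernel 7, stride 1, padding 3.
pool = nn.AvgPool2d(kernel_size=7, stride=1, padding=3)

x = torch.randn(1, 32, 56, 56)  # hypothetical feature map (N, C, H, W)
f_pool = pool(x)
print(f_pool.shape)             # torch.Size([1, 32, 56, 56]) -- same size as the input
```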
Next, the pooled feature map $F_{\text{pool}}$ undergoes a 1 × 1 convolution operation, resulting in an intermediate feature map $F_1$:

$$F_1 = \mathrm{Conv}_{1 \times 1}(F_{\text{pool}})$$
Subsequently, the intermediate feature map $F_1$ sequentially passes through depthwise separable convolutions in the horizontal and vertical directions to capture contextual information from different orientations, generating the feature map $F_2$:

$$F_2 = \mathrm{DWConv}_{k_v \times 1}\!\left(\mathrm{DWConv}_{1 \times k_h}(F_1)\right)$$
Typically, depthwise separable convolutions (DWConv) decompose a standard convolution into two simpler operations: depthwise convolution and pointwise convolution. We define the parameter count of a standard convolution as $K \times K \times C_{\mathrm{in}} \times C_{\mathrm{out}}$. The computational complexity of a depthwise separable convolution is $(K \times K \times C_{\mathrm{in}} + C_{\mathrm{in}} \times C_{\mathrm{out}}) \times H \times W$. Here, $K$ represents the kernel size, $C_{\mathrm{in}}$ and $C_{\mathrm{out}}$ are the numbers of input and output channels, respectively, and $H$ and $W$ are the height and width of the feature map. This design significantly reduces the number of parameters, especially when $C_{\mathrm{in}}$ and $C_{\mathrm{out}}$ are large. The parameter count after decomposition into depthwise and pointwise convolutions, $K \times K \times C_{\mathrm{in}} + C_{\mathrm{in}} \times C_{\mathrm{out}}$, is much lower than that of a standard convolution, which reduces the model size and lowers the risk of overfitting. Depthwise separable convolutions can significantly improve the model's inference and training speed while maintaining similar or even higher accuracy.
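To make the saving concrete, here is a short sketch evaluating both parameter-count formulas; the specific values of $K$, $C_{\mathrm{in}}$, and $C_{\mathrm{out}}$ are illustrative assumptions, not taken from the paper.

```python
# Parameter-count comparison from the formulas above.
def standard_conv_params(k, c_in, c_out):
    return k * k * c_in * c_out

def dw_separable_params(k, c_in, c_out):
    return k * k * c_in + c_in * c_out  # depthwise + pointwise

k, c_in, c_out = 3, 128, 256
std = standard_conv_params(k, c_in, c_out)  # 294,912
sep = dw_separable_params(k, c_in, c_out)   # 33,920
print(f"standard: {std}, separable: {sep}, ratio: {std / sep:.1f}x")
```

For a 3 × 3 kernel with 128 input and 256 output channels, the decomposition uses roughly 8.7× fewer parameters.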
The two depthwise separable convolutions in the CAA attention mechanism are designed to be lightweight while recognizing long-distance pixel correlations. Unlike a traditional $k \times k$ 2D depthwise convolution, they use a pair of 1D depthwise convolution kernels to achieve an effect similar to a standard large-kernel depthwise convolution while reducing both parameters and computational load. The horizontal ($k_h$) and vertical ($k_v$) kernels capture relevant information in the horizontal and vertical directions of the input feature map, respectively, as shown in the sketch below. This operation extracts edge and shape information along different directions of the feature map more effectively, strengthening the ability to establish relationships between distant pixels without a significant increase in computational cost, thanks to the dual 1D depthwise convolution design.
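A small PyTorch sketch of this decomposition; the kernel size of 11 and the channel count are illustrative assumptions. A $1 \times k$ plus $k \times 1$ depthwise pair covers a $k \times k$ receptive field with $2k$ weights per channel instead of $k^2$:

```python
import torch
import torch.nn as nn

c, k = 64, 11  # illustrative channel count and kernel size

# Standard 2D depthwise convolution: k*k weights per channel.
dw_2d = nn.Conv2d(c, c, (k, k), padding=(k // 2, k // 2), groups=c)

# Pair of 1D depthwise convolutions (horizontal then vertical): 2*k weights per channel.
dw_h = nn.Conv2d(c, c, (1, k), padding=(0, k // 2), groups=c)
dw_v = nn.Conv2d(c, c, (k, 1), padding=(k // 2, 0), groups=c)

x = torch.randn(1, c, 56, 56)
assert dw_2d(x).shape == dw_v(dw_h(x)).shape  # same output size

n2d = sum(p.numel() for p in dw_2d.parameters())                        # 64 * (121 + 1) = 7,808
n1d = sum(p.numel() for p in (*dw_h.parameters(), *dw_v.parameters()))  # 64 * (11 + 1) * 2 = 1,536
print(n2d, n1d)
```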
Next, the feature map $F_2$ undergoes a second 1 × 1 convolution operation to further extract high-level contextual features, resulting in an enhanced feature map $F_3$:

$$F_3 = \mathrm{Conv}_{1 \times 1}(F_2)$$

This enhanced feature map is then passed through a Sigmoid activation function to generate the attention factor $A$. Finally, the input feature map $X$ is multiplied element-wise by the attention factor $A$, producing the enhanced output feature map $Y$:

$$A = \sigma(F_3), \qquad Y = X \odot A$$

By weighting the original input feature map, important features are enhanced while unimportant ones are suppressed, improving the model's ability to focus on crucial parts.
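Putting the steps together, the following is a minimal PyTorch sketch of the CAA module as described above; the kernel sizes $k_h = k_v = 11$ and the omission of normalization layers between stages are assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn


class CAA(nn.Module):
    """Context Anchor Attention, sketched from the description above."""
    def __init__(self, channels, kh=11, kv=11):
        super().__init__()
        self.pool = nn.AvgPool2d(7, stride=1, padding=3)  # local region features F_pool
        self.conv1 = nn.Conv2d(channels, channels, 1)     # F1 = Conv1x1(F_pool)
        self.dw_h = nn.Conv2d(channels, channels, (1, kh),
                              padding=(0, kh // 2), groups=channels)  # horizontal 1D depthwise
        self.dw_v = nn.Conv2d(channels, channels, (kv, 1),
                              padding=(kv // 2, 0), groups=channels)  # vertical 1D depthwise
        self.conv2 = nn.Conv2d(channels, channels, 1)     # F3 = Conv1x1(F2)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        a = self.pool(x)
        a = self.conv1(a)
        a = self.dw_v(self.dw_h(a))      # F2: horizontal then vertical context
        a = self.sigmoid(self.conv2(a))  # attention factor A
        return x * a                     # Y = X (*) A, element-wise


x = torch.randn(1, 64, 56, 56)
y = CAA(64)(x)
print(y.shape)  # torch.Size([1, 64, 56, 56])
```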
Through the aforementioned multi-stage processing and weighting mechanism, the CAA module effectively integrates multi-scale contextual information, significantly enhancing the overall performance of the neural network in various computer vision tasks.