3.2. C2f_KAN Module
In human pose estimation tasks, standard convolution modules can underutilize computational resources because of a large number of redundant features, and they are prone to internal covariate shift during training, both of which can result in inadequate image feature extraction. To address these problems, we introduce the C2f_KAN module into the backbone. This module incorporates the idea of instance normalization and offers significant advantages in feature extraction and information fusion, making it particularly suitable for handling fine-grained features in human pose estimation. Its structure is shown in
Figure 3.
The introduction of the KAN attracted considerable attention. In contrast to traditional MLPs, which use fixed activation functions at the nodes ("neurons"), a KAN places learnable activation functions on the edges ("weights"). Specifically, the weight parameters in a KAN are replaced by univariate functions parameterized as spline curves, thereby eliminating linear weights. This design allows KANs to excel in small-scale AI and scientific tasks, yielding gains in both accuracy and interpretability. The underlying mathematical principle, the Kolmogorov–Arnold representation theorem, can be written as follows:

$$f(\mathbf{x}) = f(x_1, \ldots, x_n) = \sum_{q=1}^{2n+1} \Phi_q\!\left(\sum_{p=1}^{n} \varphi_{q,p}(x_p)\right),$$

where $\varphi_{q,p}:[0,1]\to\mathbb{R}$ and $\Phi_q:\mathbb{R}\to\mathbb{R}$ are univariate functions.
To optimize computation, KAN convolution incorporates a basis function, $b(x)$, in the design of the activation function. The activation function $\varphi(x)$ is therefore the sum of the basis function $b(x)$ and the spline function $\mathrm{spline}(x)$. Specifically, their mathematical formulas can be written as follows:

$$\varphi(x) = w\left(b(x) + \mathrm{spline}(x)\right),$$
$$b(x) = \mathrm{silu}(x) = \frac{x}{1+e^{-x}},$$
$$\mathrm{spline}(x) = \sum_{i} c_i B_i(x),$$

where $w$ is a weight that controls the overall size of the activation function, $c_i$ denotes the coefficients that are optimized during training, and $B_i(x)$ is a B-spline basis function defined on the grid; the grid points define the interval in which each basis function $B_i(x)$ is active, significantly affecting its shape and smoothness. During training, the parameters of these splines (i.e., the coefficients $c_i$ of the basis functions $B_i(x)$) are optimized to minimize the loss function, thereby adjusting the shape of the spline to best fit the training data.
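To make the formulation above concrete, the following minimal PyTorch sketch implements a single learnable edge activation $\varphi(x) = w\,(b(x) + \mathrm{spline}(x))$ with a SiLU basis and B-spline coefficients evaluated by the Cox–de Boor recursion. The class name, grid range, spline degree, and initialization are illustrative assumptions, not the exact configuration used inside C2f_KAN.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def bspline_basis(x, grid, degree):
    """Evaluate all degree-`degree` B-spline bases on `grid` at the points `x` (Cox-de Boor)."""
    x = x.unsqueeze(-1)                                        # (..., 1)
    bases = ((x >= grid[:-1]) & (x < grid[1:])).to(x.dtype)    # degree-0 indicator functions
    for d in range(1, degree + 1):
        left = (x - grid[:-(d + 1)]) / (grid[d:-1] - grid[:-(d + 1)]) * bases[..., :-1]
        right = (grid[d + 1:] - x) / (grid[d + 1:] - grid[1:-d]) * bases[..., 1:]
        bases = left + right
    return bases                                               # (..., num_bases)

class KANEdgeActivation(nn.Module):
    """phi(x) = w * (silu(x) + sum_i c_i B_i(x)) -- one learnable edge activation (illustrative)."""
    def __init__(self, grid_min=-2.0, grid_max=2.0, num_intervals=5, degree=3):
        super().__init__()
        h = (grid_max - grid_min) / num_intervals
        # extend the grid by `degree` knots on each side, as is standard for B-splines
        grid = torch.arange(-degree, num_intervals + degree + 1, dtype=torch.float32) * h + grid_min
        self.register_buffer("grid", grid)
        self.degree = degree
        self.coeffs = nn.Parameter(0.1 * torch.randn(num_intervals + degree))  # spline coefficients c_i
        self.w = nn.Parameter(torch.ones(()))                                   # overall scale w

    def forward(self, x):
        spline = (bspline_basis(x, self.grid, self.degree) * self.coeffs).sum(-1)
        return self.w * (F.silu(x) + spline)

# usage: phi = KANEdgeActivation(); y = phi(torch.linspace(-1.5, 1.5, 8))
```

During training, only `coeffs` and `w` are updated, which is exactly the mechanism by which the spline shape is fitted to the data.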
The C2f_KAN module, shown in Algorithms 1–3 and inspired by the KAN, retains the efficient feature extraction of the original C2f module while significantly enhancing its representational power and adaptability. By integrating KAN convolution and non-linear activation functions, this module strengthens the convolutional kernels, enabling them to capture more intricate and subtle features. The C2f_KAN module is used in the P4 and P5 layers of the backbone network.
To balance parameter reduction with feature extraction performance, multiple Bottleneck_KAN modules are used to progressively extract and refine features, ensuring that the final output feature map is both rich and of high quality. Moreover, when the features from different layers are concatenated, a 1 × 1 convolution is applied for dimensionality reduction and fusion, ensuring that the feature map's dimensions and information content are suitable for further processing. The convolution function in the C2f_KAN module can be written as follows:
$$Y_{i,j,k} = \sum_{m=1}^{M}\sum_{n=1}^{N}\sum_{c=1}^{C} x_{i+m-1,\,j+n-1,\,c}\; W_{m,n,c,k} + b_k,$$

where $x$ represents the input feature map with dimensions $H \times W \times C$ ($H$ is the height, $W$ is the width, and $C$ is the number of channels), $W$ is the convolution kernel with dimensions $M \times N \times C \times K$ ($M$ is the kernel height, $N$ is the kernel width, $C$ is the number of input channels, and $K$ is the number of output channels), $b_k$ (corresponding to output channel $k$ of the $K$ output channels) denotes the bias term, and $Y_{i,j,k}$ is the output feature map (where $i$, $j$, and $k$ are the indices for the height, width, and channel dimensions of the output feature map, respectively) with dimensions $H' \times W' \times K$ ($H'$ and $W'$ are the height and width of the output feature map, and $K$ is the number of output channels).
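As a quick sanity check of the formula above, the short sketch below computes $Y_{i,j,k}$ with explicit loops and compares it against PyTorch's built-in convolution; the tensor sizes are arbitrary illustrative values, not those used in the network.

```python
import torch
import torch.nn.functional as F

H, W, C, M, N, K = 6, 7, 3, 3, 3, 4
x = torch.randn(H, W, C)          # input feature map, H x W x C
w = torch.randn(M, N, C, K)       # kernel, M x N x C x K
b = torch.randn(K)                # one bias per output channel

Hp, Wp = H - M + 1, W - N + 1     # output size with stride 1 and no padding
y = torch.zeros(Hp, Wp, K)
for i in range(Hp):
    for j in range(Wp):
        for k in range(K):
            # Y[i,j,k] = sum_{m,n,c} x[i+m, j+n, c] * W[m,n,c,k] + b[k]
            y[i, j, k] = (x[i:i + M, j:j + N, :] * w[..., k]).sum() + b[k]

# cross-check against F.conv2d (which expects NCHW input and a K x C x M x N kernel)
ref = F.conv2d(x.permute(2, 0, 1).unsqueeze(0), w.permute(3, 2, 0, 1), bias=b)
assert torch.allclose(y, ref.squeeze(0).permute(1, 2, 0), atol=1e-5)
```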
Algorithm 1 PyTorch-like Code for C2f_KAN Algorithm
1: Input: input tensor x
2: Output: output tensor after processing through all Bottleneck_KAN layers
3: Initialize: super()
4: Create a list of Bottleneck_KAN modules: self.m ← [Bottleneck_KAN(…) for each of the n blocks]
5: for each layer in self.m do
6:   x ← layer(x)
7: end for
8: Return x
Algorithm 2 PyTorch-like Code for Bottleneck_KAN Module
1: Input: input tensor x
2: Output: output tensor after applying bottleneck operations
3: Initialize: super()
4: Compute: the hidden channel width from the output channels and the expansion ratio
5: Define: self.cv1 ← choose_kan(…)
6: Define: self.cv2 ← choose_kan(…)
7: Compute: self.add ← shortcut and (input channels = output channels)
8: if not self.add then
9:   Return self.cv2(self.cv1(x))
10: else
11:   Return x + self.cv2(self.cv1(x))
12: end if
Algorithm 3 PyTorch-like Code for choose_kan Function
1: Input: KAN layer name, input channels, output channels, and kernel size
2: Output: selected KAN convolutional layer
3: if the name matches the first supported KAN convolution variant then
4:   kan ← the corresponding KAN convolutional layer
5: else if the name matches the second supported variant then
6:   kan ← the corresponding KAN convolutional layer
7: else if the name matches the third supported variant then
8:   kan ← the corresponding KAN convolutional layer
9: else if the name matches the fourth supported variant then
10:   kan ← the corresponding KAN convolutional layer
11: else if the name matches the fifth supported variant then
12:   kan ← the corresponding KAN convolutional layer
13: end if
14: Return kan
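The following self-contained PyTorch sketch mirrors the structure of Algorithms 1–3. Because the concrete KAN convolution variants are not listed here, `choose_kan` maps each name to a stand-in `nn.Conv2d`; in the actual module these entries would be the corresponding KAN convolutional layers. The variant names, channel widths, and expansion ratio are illustrative assumptions.

```python
import torch
import torch.nn as nn

def choose_kan(name, c1, c2, k=3):
    """Select a KAN convolutional layer by name (Algorithm 3).
    Stand-in: every variant is represented by a plain Conv2d here; in the real
    module each branch would construct the corresponding KAN convolution."""
    variants = {f"kan_variant_{i}": nn.Conv2d for i in range(1, 6)}  # hypothetical names
    if name not in variants:
        raise ValueError(f"unknown KAN variant: {name}")
    return variants[name](c1, c2, kernel_size=k, padding=k // 2)

class Bottleneck_KAN(nn.Module):
    """Bottleneck with two KAN convolutions and an optional residual add (Algorithm 2)."""
    def __init__(self, c1, c2, shortcut=True, e=0.5, kan_name="kan_variant_1"):
        super().__init__()
        c_ = int(c2 * e)                          # hidden channel width (assumed expansion)
        self.cv1 = choose_kan(kan_name, c1, c_)
        self.cv2 = choose_kan(kan_name, c_, c2)
        self.add = shortcut and c1 == c2

    def forward(self, x):
        y = self.cv2(self.cv1(x))
        return x + y if self.add else y

class C2f_KAN(nn.Module):
    """C2f-style block whose bottlenecks use KAN convolutions (Algorithm 1, simplified)."""
    def __init__(self, c1, c2, n=2, kan_name="kan_variant_1"):
        super().__init__()
        self.c = c2 // 2
        self.cv1 = nn.Conv2d(c1, 2 * self.c, kernel_size=1)          # split into two branches
        self.cv2 = nn.Conv2d((2 + n) * self.c, c2, kernel_size=1)    # 1x1 fusion after concatenation
        self.m = nn.ModuleList(Bottleneck_KAN(self.c, self.c, kan_name=kan_name) for _ in range(n))

    def forward(self, x):
        y = list(self.cv1(x).chunk(2, dim=1))     # two halves of the split
        for layer in self.m:
            y.append(layer(y[-1]))                # each bottleneck refines the previous output
        return self.cv2(torch.cat(y, dim=1))      # concatenate all branches and fuse

# usage: C2f_KAN(64, 128, n=2)(torch.randn(1, 64, 40, 40)).shape -> (1, 128, 40, 40)
```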
3.3. SOEP Module
Detecting small targets on the P3, P4, and P5 layers during human pose estimation can be challenging. Traditional methods often add an extra detection layer before the P3 layer to enhance small-target detection. However, this increases computational complexity and prolongs post-processing. To address these challenges, we improve upon the CSP approach and OmniKernel [24] by creating the CSPOmniKernel, which we refer to as the small object enhance pyramid (SOEP). We apply depthwise convolutions with a kernel size of $k \times k$ to pursue a large receptive field and use $1 \times k$ and $k \times 1$ depthwise convolutions to obtain context information. To avoid introducing a large amount of computational overhead, we place the module in the neck part. Then, following the exploration of large kernels in image restoration, we gradually increase k and reasonably expand the convolution kernel (receptive field) toward the feature size while avoiding a large number of parameters. Instead of adding a detection layer, we process the data through SPDConv, which extracts features enriched with small-target information. These features are then integrated into the P3 layer. This approach effectively learns feature representations from global to local scales, ultimately improving the detection performance for small targets. The structure of this method is illustrated in
Figure 4.
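To illustrate the two ingredients described above, the sketch below combines an OmniKernel-style branch (a large $k \times k$ depthwise convolution plus $1 \times k$ and $k \times 1$ strip depthwise convolutions) with a space-to-depth step followed by a convolution, in the spirit of SPDConv. The channel counts and kernel sizes are illustrative assumptions, not the exact SOEP configuration.

```python
import torch
import torch.nn as nn

class LargeKernelBranch(nn.Module):
    """Depthwise k x k, 1 x k, and k x 1 convolutions whose outputs are summed,
    giving a large receptive field with few parameters (OmniKernel-style sketch)."""
    def __init__(self, c, k=31):
        super().__init__()
        self.dw_square = nn.Conv2d(c, c, (k, k), padding=(k // 2, k // 2), groups=c)
        self.dw_row = nn.Conv2d(c, c, (1, k), padding=(0, k // 2), groups=c)
        self.dw_col = nn.Conv2d(c, c, (k, 1), padding=(k // 2, 0), groups=c)

    def forward(self, x):
        return self.dw_square(x) + self.dw_row(x) + self.dw_col(x)

class SPDConv(nn.Module):
    """Space-to-depth: move each 2x2 spatial block into channels, then fuse with a conv,
    so fine spatial detail is preserved while the resolution is halved."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.conv = nn.Conv2d(4 * c_in, c_out, kernel_size=3, padding=1)

    def forward(self, x):
        x = torch.cat([x[..., ::2, ::2], x[..., 1::2, ::2],
                       x[..., ::2, 1::2], x[..., 1::2, 1::2]], dim=1)  # (B, 4C, H/2, W/2)
        return self.conv(x)

# usage: LargeKernelBranch(64)(torch.randn(1, 64, 40, 40)).shape -> (1, 64, 40, 40)
#        SPDConv(64, 128)(torch.randn(1, 64, 80, 80)).shape      -> (1, 128, 40, 40)
```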
3.4. LSCB Module
The layered shared convolution with batch normalization (LSCB) detection head is designed to receive feature maps from different scales (P3, P4, P5) using a multi-scale feature fusion method. By merging multi-scale features, it processes the feature maps from different scales to include information of varying granularity, thus capturing more contextual information. This helps the model to more accurately locate keypoints. Moreover, the LSCB module achieves the goal of reducing the number of parameters and improving computational efficiency through the use of shared convolutional layers, ensuring the real-time performance of the model. The resulting diagram is shown in
Figure 5.
High-precision keypoint detection is essential for human pose estimation tasks. In the LSCB module, the keypoint feature processing layers consist of three successive convolution layers, which are designed to handle feature maps at different scales and to extract keypoint features effectively. The LSCB module processes multi-scale (P3’, P4’, P5’) features using shared convolution layers, thereby capturing more contextual information from feature maps at various scales.

Given the statistical differences between features at different levels, normalization is necessary. Directly incorporating batch normalization (BN) in the shared-parameter detection head can cause errors due to moving-average discrepancies. While group normalization (GN) can mitigate this issue, it increases the computational overhead during inference. Therefore, we compute BN independently for each shared convolution layer, avoiding the moving-average errors associated with shared BN. This approach also significantly reduces the number of parameters. Moreover, to handle the inconsistent target scales detected by each detection head, we employ a Scale layer to adjust the features, ensuring that each detection head can adapt to different target scales.

Specifically, after the three feature layers output by the neck enter the detection head, each branch first passes through a convolution layer that adjusts the number of channels, unifying the channel counts of the three input feature layers to that of the middle layer. All feature layers are then fed into a shared convolution module for feature extraction. The use of shared convolution reduces the number of parameters and the computational cost of the model. Finally, the regression branch and the classification branch are separated. In the regression branch, a convolution layer is used to predict the coordinate offsets of the bounding box.

However, when a scene contains objects with significant differences in size and shape, shared convolution may not fully adapt to targets of varying scales and shapes, potentially limiting the representational capacity of the feature map. To compensate for this limitation in integrating multi-scale features, architectures such as Feature Pyramid Networks (FPNs) or Path Aggregation Networks (PANs) can be introduced on top of shared convolution. These techniques help the model capture features at different scales without significantly increasing the number of parameters. Specifically, in LSCB, to address the inconsistent target scales detected by each detection head, the output of the regression branch passes through the Scale layer, which scales the features to adjust the target scale and thus enables localization of targets of different sizes. A similar mechanism can be observed in architectures such as FPN, where multi-scale feature fusion improves detection performance for objects of various sizes. The introduction of the Scale layer is a further optimization of existing multi-scale fusion methods, adding flexibility and controllability. This enhances the applicability of shared convolution in multi-scale target detection and mitigates its negative impact on the quality of feature maps.
In the classification branch, a convolution layer is used to predict the probability of each category. The convolution layer weights of the two branches are independent, such that the model can learn the positioning and classification tasks separately. Finally, the predicted keypoint features are decoded into actual keypoint coordinates, ensuring the accuracy of keypoint locations and improving detection precision.
Overall, the shared-weight parameter design effectively reduces the number of parameters and computations, improving the running speed of the model. It also allows features of different scales to be processed at the same time, capturing information of various sizes in the image; this helps the model to better understand the relationships between objects in the image and thus improves recognition accuracy.
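As a minimal sketch of the shared-convolution idea with per-level normalization and a learnable Scale layer, the following code processes three feature maps (P3’, P4’, P5’) with one shared convolution, separate BN statistics per level, and a per-level scaling factor on the regression output. The channel counts, kernel size, and output dimensions are illustrative assumptions rather than the exact LSCB configuration.

```python
import torch
import torch.nn as nn

class Scale(nn.Module):
    """Learnable per-level scalar used to adapt the regression output to the target scale."""
    def __init__(self, init=1.0):
        super().__init__()
        self.scale = nn.Parameter(torch.tensor(init))

    def forward(self, x):
        return x * self.scale

class SharedHead(nn.Module):
    """One convolution shared across P3'/P4'/P5', with independent BN per level (sketch)."""
    def __init__(self, in_channels=(128, 256, 512), mid=128, reg_out=4, cls_out=17):
        super().__init__()
        # 1x1 convs unify each level's channels to the middle width (per-level, not shared)
        self.reduce = nn.ModuleList(nn.Conv2d(c, mid, 1) for c in in_channels)
        # the shared feature-extraction convolution (weights reused for every level)
        self.shared = nn.Conv2d(mid, mid, 3, padding=1, bias=False)
        # BN is computed independently for each level to avoid shared moving-average errors
        self.bn = nn.ModuleList(nn.BatchNorm2d(mid) for _ in in_channels)
        self.reg = nn.Conv2d(mid, reg_out, 1)     # regression branch (offset prediction)
        self.cls = nn.Conv2d(mid, cls_out, 1)     # classification branch
        self.scales = nn.ModuleList(Scale() for _ in in_channels)

    def forward(self, feats):
        outputs = []
        for i, x in enumerate(feats):
            x = torch.relu(self.bn[i](self.shared(self.reduce[i](x))))
            outputs.append((self.scales[i](self.reg(x)), self.cls(x)))
        return outputs

# usage: head = SharedHead()
#        feats = [torch.randn(1, c, s, s) for c, s in zip((128, 256, 512), (80, 40, 20))]
#        preds = head(feats)   # list of (regression, classification) pairs, one per level
```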