1. Introduction
Image-based object counting is a prominent research challenge, with diverse applications in fields such as video surveillance, traffic monitoring, and beyond. Counting objects swiftly, especially in scenes with dense or unevenly distributed objects, is difficult because of limitations of the human visual system. Consequently, the development of a dependable object counting system has emerged as a recent focal point in computer vision research.
Following a comprehensive review of prior research, it is evident that the majority of studies assume the presence of a single class in the training dataset, such as populations, cells, animals, or cars. Examples are illustrated in
Figure 1a. This approach is commonly referred to as class-specific counting (CSC). Consequently, when deploying the model, it can only handle task objectives that belong to the same class as the training data. Mainstream approaches typically perform object counting using a series of convolutional neural networks (CNNs) with fixed-size convolution kernels. However, they still require a substantial amount of training data corresponding to specific semantic classes for effective learning.
The few-shot counting (FSC) algorithm, initially proposed by Lu et al. [
1], addresses the class-agnostic counting (CAC) problem, as exemplified in
Figure 1b. Unlike class-specific counting models, FSC is trained on data encompassing various semantic categories. During the inference process, FSC can effectively capture the visual features of a query image containing new categories by leveraging examples from the corresponding classes, enabling it to count the instances of these categories in a new scene. As articulated by Min in BMNet [
2], CAC represents a promising direction in object counting: a shift from merely learning to count objects of a specific class to learning a general counting method that can be applied to various categories.
Specifically, FSC is designed to determine the quantity of salient objects in an image, regardless of their semantic class, by utilizing a user-provided set of “exemplars” that refer to the specific objects to be counted. Typically, existing FSC methods consist of two main components: feature extraction and matching. Initially, they extract visual features from the exemplars and then compare these features to those extracted from the query image. The outcome of this similarity matching serves as an intermediate representation for inferring object counts.
In essence, FSC comprises two crucial elements: expressive features and a similarity metric. A recent study, BMNet [
2], focused on enhancing the similarity metric by applying predefined rules, to identify self-similarity within images. Another approach, SAFECount [
3], categorized CAC algorithms into two groups: feature-based and similarity-based approaches. It proposed a fusion of the strengths of both methods, to achieve superior target counting results. As depicted in
Figure 2c, this paper extends the principles of SAFECount [3] to the CAC task, improving its similarity comparison and feature enhancement techniques. The result is a CAC network with improved counting performance and more robust generalization capabilities.
This paper presents three key contributions:
Introduction of ACECount: We introduce ACECount, a versatile visual object counting architecture that combines Transformer [
4,
5] and CNN models, featuring multiple attention mechanisms. ACECount utilizes local and global attention, to explicitly capture features within and between image patches. It incorporates a cross-attention mechanism for similarity to user-provided few-shot examples. Additionally, ACECount employs densely connected multi-scale regression heads, to capture feature similarity at different scales. This comprehensive approach enhances its counting performance.
Cross-Dataset Generality: We thoroughly investigated ACECount’s ability to generalize across different datasets. In the zero-shot scenario, we utilized two learnable tensors to replace the exemplar features as inputs. This adaptation enabled the network to disregard irrelevant input portions and act as a single-class counting network. ACECount demonstrated remarkable performance on both the crowd counting datasets and the FSC-147 dataset under zero-shot conditions, highlighting the effectiveness of our multi-scale regression head for diverse counting tasks.
Ablation Study on CAC Dataset: This study highlights the significant performance improvement achieved by our core module, the feature interaction enhancement module. We also emphasize our careful network structure design and hyperparameter selection, which collectively contribute to ACECount’s outstanding performance in FSC scenarios, establishing it as one of the top-performing solutions.
The remainder of this paper is organized as follows: in
Section 2, we discuss related work, covering the research background and commonly used algorithms in both class-specific object counting and class-agnostic counting tasks; in
Section 3, we introduce the ACECount framework and elaborate on the principles and individual roles of each module within the network; in
Section 4, we outline the specific experimental program, encompassing an introduction to the dataset, technical algorithm details, comparisons of various methods, and ablation experiments; in
Section 5, we provide visualization results for the FSC-147 dataset; in
Section 6, we offer a comprehensive summary of the text, highlighting the algorithm’s strengths and weaknesses, along with potential future solutions and directions.
3. Proposed Method
Overview. Few-shot counting (FSC) aims to determine the number of objects in a query image that belong to specific classes. Unlike class-specific counting (CSC), FSC’s primary goal is to learn the capability to count new classes based on existing classes, rather than solely counting objects within predefined classes. In CSC, training data only consist of pixel-level point annotations for object classes. In FSC, object classes are split into two parts: one for training and the other for testing.
The training classes and the inference classes encompass different semantic classes of counted objects, and the two sets are disjoint. During network training, for each query image drawn from the training classes, we provide a certain number of box coordinates representing the exemplars in the image, which are used to calculate the count of the corresponding category. Additionally, a ground-truth density map is provided for supervised network training.
For query images drawn from the inference classes, only a certain number of exemplars are provided, and the network must estimate the count of the corresponding category. FSC's objective is to use a minimal number of exemplars to count the objects in the query image. When the user specifies the number of exemplars in a query image as K, the task is referred to as K-shot FSC.
Framework Architecture. The ACECount framework comprises six components: the query encoder, the exemplar encoder, the pyramid feature aggregation module (PFA), the similarity comparison module (SCM), the feature attention enhancement module (FAEM), and a multi-scale dense counting regression head (MDCR). The SCM and FAEM collectively form the core part of the framework, known as the feature interaction enhancement module (FIEM).
In broad strokes, the query encoder and exemplar encoder are responsible for extracting features from input queries and exemplars, respectively. The PFA aggregates low-level and high-level features. The SCM integrates image regions with arbitrary given regions through cross-attention, facilitating the comparison of image regions to any given shots and the calculation of their similarity. The FAEM employs spatial and channel attention to enhance the input feature map, augmenting the overall feature representation. The MDCR utilizes densely connected dilated convolutions to expand the receptive fields of the regression head, enhancing multi-scale object recognition and extending the network's generality for counting different categories. The overall framework is illustrated in Figure 3.
The pipeline of ACECount involves several key steps. Queries and exemplars are encoded by two different encoders, a Twins-based encoder and a CNN-based encoder, respectively. Twins reshapes the sequence of vectors back into a 2D feature map at each downsampling stage. These feature maps from different stages are sent to the pyramid feature aggregation module (PFA) for feature fusion, yielding a query feature map that contains global information. Meanwhile, the CNN-based encoder transforms the exemplars into exemplar features, which serve as key and value (K and V) in the similarity comparison module (SCM), and into support features. These, together with the query features from the PFA, are fed into the feature interaction enhancement module. In the feature attention enhancement module (FAEM), the weights of similar features in the similarity map R produced by the SCM are adjusted using the support features, to obtain a reshaped similarity map. This map is then used to enhance the original query features, and the enhanced features are finally input into the multi-scale dense counting regression head (MDCR), to obtain the density map and counting results. To sum up, for the i-th query image and k user-given shots, the network regresses a density map whose sum gives the predicted count. In the next section, we will introduce these six modules in detail.
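To make the data flow concrete, here is a minimal, runnable PyTorch sketch of this pipeline. All module internals (a single strided convolution for the query encoder, a generic cross-attention layer for the SCM, plain convolutions elsewhere) are stand-ins of our own choosing, not the layers described in the remainder of this section; only the order of operations mirrors the text.

```python
import torch
import torch.nn as nn

class ACECountSketch(nn.Module):
    """Toy illustration of the ACECount data flow; internals are simplified stand-ins."""
    def __init__(self, dim=256):
        super().__init__()
        self.query_encoder = nn.Conv2d(3, dim, 3, stride=8, padding=1)       # stands in for Twins-SVT
        self.exemplar_encoder = nn.Sequential(                                # stands in for the CNN branch
            nn.Conv2d(3, dim, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1))
        self.pfa = nn.Identity()                                              # pyramid feature aggregation
        self.scm = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)  # cross-attention similarity
        self.faem = nn.Conv2d(dim, dim, 3, padding=1)                         # attention-based enhancement
        self.mdcr = nn.Conv2d(dim, 1, 1)                                      # regression head

    def forward(self, query, exemplars):
        fq = self.pfa(self.query_encoder(query))             # query feature map (B, C, h, w)
        b, c, h, w = fq.shape
        fe = self.exemplar_encoder(exemplars.flatten(0, 1))  # (B*k, C, 1, 1) exemplar features
        fe = fe.view(b, -1, c)                               # (B, k, C) keys/values
        q = fq.flatten(2).transpose(1, 2)                    # (B, h*w, C) queries
        r, _ = self.scm(q, fe, fe)                           # similarity-aware representation
        r = r.transpose(1, 2).reshape(b, c, h, w)
        enhanced = fq + self.faem(r)                         # enhance the original query features
        density = self.mdcr(enhanced).relu()                 # one-channel density map
        return density, density.sum(dim=(1, 2, 3))           # density map and predicted count

model = ACECountSketch()
query = torch.randn(1, 3, 256, 256)
exemplars = torch.randn(1, 3, 3, 64, 64)    # (B, k shots, 3, H, W)
density_map, count = model(query, exemplars)
```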
3.1. Visual Encoder
Our model’s encoder comprises two components: one following the Twins structure [
25], and the other following the CNN structure. These two distinct encoders process two types of input images derived from the same source image: the former is responsible for extracting features from the query images, while the latter focuses on extracting features from the exemplars.
The input image has three channels, with H and W denoting its height and width, respectively. The input exemplars consist of N patches (one per shot), each with three channels and a fixed height and width. In the following sections, we will describe these two components separately.
3.1.1. Query Encoder
As transformer technology [
5] continues to advance in natural language processing, an increasing number of researchers are exploring its applications in computer vision. The Vision Transformer (ViT) [
26] was introduced to address the computational complexity of self-attention mechanisms in image processing. Since its introduction, ViT has gradually extended its reach to various computer vision tasks, achieving outstanding results. ViT divides images into fixed-size patches and embeds them into 1D vector sequences, which serve as input for the self-attention mechanism. These sequences are then linearly transformed, to produce the
K,
Q, and
V matrices, which are used in the self-attention operation. The self-attention calculation process is as follows:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d}}\right)V,$$
where $\mathrm{softmax}(\cdot)$ denotes the softmax function, Q denotes the query, K denotes the key, V denotes the value, and d denotes the length of the vector.
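As a quick illustration of this formula (the function name and shapes below are our own, not tied to any particular library implementation):

```python
import torch

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d)) V, with d the key dimension."""
    d = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d ** 0.5   # pairwise similarity scores
    return torch.softmax(scores, dim=-1) @ V      # similarity-weighted sum of the values

# example: 16 query tokens attending to 4 key/value tokens of dimension 64
out = scaled_dot_product_attention(torch.randn(16, 64), torch.randn(4, 64), torch.randn(4, 64))
```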
We utilize Twins-SVT [
25], a pyramidal ViT variant, as the query encoder. Twins introduces a novel attention mechanism that combines locally grouped self-attention (LSA) and global sub-sampled attention (GSA), alternating the two and stacking them across multiple modules. This configuration is referred to as spatially separable self-attention (SSSA).
To elaborate, in a given layer l of Twins, the feature maps are first divided into fixed-size sub-windows and projected into tokens, using a multi-layer perceptron (MLP). At this stage, information exchange occurs only within each sub-window, and the resulting small receptive field could impair the network's performance. GSA overcomes this limitation of purely local processing by enabling information flow between different sub-windows. In the GSA stage, each sub-window produces a single representative through convolutional operations. This representative aggregates the critical information from its corresponding sub-window. Subsequently, a self-attention calculation is applied to the representatives of all the sub-windows.
This approach enables sub-windows to communicate with one another through their representatives, capturing global features. Formally, SSSA alternates two attention operations: LSA, the locally grouped self-attention computed within each sub-window, and GSA, the global sub-sampled attention obtained by interacting with the representative keys (generated by the sub-sampling function) from each sub-window.
Twins capture fine-grained and short-distance information in the image through the attention operation within the specified sub-window, using LSA. They also capture the global and long-distance information of the image by fusing the attention from various sub-windows through GSA. This mode effectively captures image features while being less computationally and parametrically demanding than the standard self-attention operation, making it easy to deploy.
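The following sketch illustrates the LSA/GSA idea in isolation. The window size and head count are arbitrary example values, and residual connections, layer norms, and feed-forward blocks are omitted, so this is a conceptual sketch rather than the Twins-SVT implementation.

```python
import torch
import torch.nn as nn

class SSSASketch(nn.Module):
    """Conceptual sketch of spatially separable self-attention (LSA + GSA)."""
    def __init__(self, dim=256, heads=8, window=8):
        super().__init__()
        self.window = window
        self.lsa = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gsa = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.subsample = nn.Conv2d(dim, dim, kernel_size=window, stride=window)

    def forward(self, x):                     # x: (B, C, H, W), H and W divisible by window
        b, c, h, w = x.shape
        s = self.window
        # LSA: self-attention restricted to each s x s sub-window
        t = x.view(b, c, h // s, s, w // s, s).permute(0, 2, 4, 3, 5, 1)   # (B, nh, nw, s, s, C)
        t = t.reshape(-1, s * s, c)                                        # (B*nWin, s*s, C)
        t, _ = self.lsa(t, t, t)
        t = t.view(b, h // s, w // s, s, s, c).permute(0, 5, 1, 3, 2, 4).reshape(b, c, h, w)
        # GSA: every sub-window is reduced to one representative token
        rep = self.subsample(t).flatten(2).transpose(1, 2)                 # (B, nh*nw, C)
        q = t.flatten(2).transpose(1, 2)                                   # (B, H*W, C)
        out, _ = self.gsa(q, rep, rep)                                     # global exchange via representatives
        return out.transpose(1, 2).reshape(b, c, h, w)
```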
We incorporate Twins-SVT Large with pre-trained weights from ImageNet, consisting of four stages with 11 layers of GSA and LSA. It is worth noting that the input image is downsampled to 1/32 of its original size at the final stage, while the channel dimension expands to 1024. From this architecture, we extract the output feature maps from stages 1, 2, and 3, to be used as inputs for the PFA component.
3.1.2. Exemplars Encoder
In the exemplar encoder, when performing K-shot counting (K = 0, 1, 2, 3), we employ a standard convolutional structure for feature extraction from the exemplars. Specifically, the exemplar encoder comprises four convolutional layers with ReLU activation and average pooling. The object scale remains nearly unchanged, as the exemplar images are resized to fixed dimensions. Consequently, we refrain from applying additional operations to the exemplars and instead focus on extracting features and mapping them onto high-dimensional feature maps, using a few shallow CNN layers. The output feature map from the penultimate convolutional layer serves as the support feature input for enhancement in the FAEM. In scenarios where zero-shot counting is performed (i.e., when no exemplar is provided), to meet the input requirements of the SCM (K, Q, and V), we incorporate a conditional mechanism that skips this step and replaces the feature vector from the feature extractor with a learnable token, which then serves as the key and value inputs to the SCM.
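A minimal sketch of this branch, including the zero-shot fallback, is given below; the channel widths, pooling schedule, and number of learnable tokens are assumptions rather than the exact configuration used in the paper.

```python
import torch
import torch.nn as nn

class ExemplarEncoderSketch(nn.Module):
    """Shallow CNN for exemplar features, with a learnable token for zero-shot counting."""
    def __init__(self, dim=256, num_tokens=1):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(), nn.AvgPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.AvgPool2d(2),
            nn.Conv2d(128, dim, 3, padding=1), nn.ReLU(), nn.AvgPool2d(2),
            nn.Conv2d(dim, dim, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
        )
        self.zero_shot_token = nn.Parameter(torch.randn(1, num_tokens, dim))

    def forward(self, exemplars=None):
        if exemplars is None:                        # zero-shot: learnable token as key/value
            return self.zero_shot_token
        b, k = exemplars.shape[:2]
        feats = self.convs(exemplars.flatten(0, 1))  # (B*k, dim, 1, 1)
        return feats.flatten(1).view(b, k, -1)       # (B, k, dim) keys/values for the SCM
```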
3.2. Pyramid Feature Aggregation Module
PFA accepts the feature maps from the final three stages of the Twins network, which are downsampled to progressively smaller fractions of the original image size. Additionally, PFA incorporates a top-down feature fusion operation.
As the Twins network becomes deeper, the downsampling factor increases, resulting in high-level features with richer semantic information but at the cost of losing fine-grained details present in low-level features. Simply upsampling with interpolation cannot effectively recover these lost details, including positional, textural, and color information. This limitation makes it challenging to accurately capture the target’s boundary and location solely based on the feature map output of the deep network. This issue is particularly problematic in tasks involving dense object counting.
Numerous approaches have been developed to address this challenge. For instance, the feature pyramid network (FPN) [
27] was introduced, to enhance localization accuracy by integrating high-level semantic and low-level features through top-down pathways. PAN [
27] further extended FPN, with a bottom-up pathway, to reduce the information path lengths for both low-level and top-level features, facilitating the propagation of accurate signals in low-level features. These architectures have found widespread use in object detection tasks (e.g., the YOLO series [
28,
29,
30,
31]), demonstrating the efficacy of multi-scale feature fusion in target detection. Consequently, many object counting approaches have incorporated pyramidal feature fusion modules. However, most of these methods merely perform basic feature concatenation between high-level and low-level features and do not fully exploit the multi-scale information derived from the encoder. This limitation hinders their effectiveness in counting small objects.
This concept draws inspiration from the remarkable success of YOLOv6 v3.0 [
28] in object detection. We have devised a multi-scale aggregation mechanism for integrating information across different scales. Our approach is grounded in the utilization of SIMSPPF [
30] and RepBlock [
30] as foundational components. An illustration of one of the SIMSPPF structures is provided in
Figure 4b, which represents an enhanced iteration of the SPP [
29] module. It incorporates the concept of a spatial pyramid, seamlessly merging local and global features, substantially enhancing the feature map’s information representation capacity. This approach effectively mitigates performance degradation, particularly when dealing with substantial size variations in target counting tasks.
Furthermore, RepBlock is employed within the network, featuring a multi-branch topology during training. However, during inference, this multi-branch structure is seamlessly fused into a single convolutional layer. The architectural arrangement is illustrated in
Figure 4a. Specifically, this module takes as its input the multi-scale feature maps from the query encoder. To prevent excessive computational overhead, we first reduce their channel dimensions with a convolutional layer. The reduced feature maps then enter the SIMSPPF module, where they pass through pooling layers of varying kernel sizes before the results are amalgamated. This step aims to mitigate image distortion stemming from cropping and scaling operations during data preprocessing, a critical consideration for small object localization and recognition.
The next step involves upsampling the feature maps, using a convolutional layer and bilinear interpolation, so that their spatial dimensions match those of the shallower level. These up-sampled feature maps are then concatenated along the channel dimension. Following the concatenation, the PFA passes the result through a sequence of layers, including RepBlock, convolutional layers, and upsampling layers, and finally combines it with the low-level feature map. This process culminates in the final output of the PFA component. To summarize briefly, the PFA fuses the multi-scale encoder features into a single query feature map that retains both global semantics and fine detail.
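As a rough illustration of this fusion flow, consider the sketch below; the channel widths are assumptions, and the SIMSPPF and RepBlock blocks are replaced by plain pooling and convolution layers for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PFASketch(nn.Module):
    """Top-down multi-scale fusion in the spirit of the PFA (simplified)."""
    def __init__(self, channels=(128, 256, 512), out_dim=256):
        super().__init__()
        self.reduce = nn.ModuleList(nn.Conv2d(c, out_dim, 1) for c in channels)
        self.spp_pools = nn.ModuleList(nn.MaxPool2d(k, stride=1, padding=k // 2) for k in (5, 9, 13))
        self.spp_fuse = nn.Conv2d(out_dim * 4, out_dim, 1)
        self.smooth = nn.ModuleList(nn.Conv2d(out_dim, out_dim, 3, padding=1) for _ in channels[:-1])

    def forward(self, feats):                  # feats: three encoder stages, shallow to deep
        f1, f2, f3 = (r(f) for r, f in zip(self.reduce, feats))
        # pooling pyramid (SPP-like) on the deepest map, merging local and global context
        f3 = self.spp_fuse(torch.cat([f3] + [p(f3) for p in self.spp_pools], dim=1))
        # top-down pathway: upsample and merge with the next shallower level
        f2 = self.smooth[1](f2 + F.interpolate(f3, size=f2.shape[-2:], mode="bilinear", align_corners=False))
        f1 = self.smooth[0](f1 + F.interpolate(f2, size=f1.shape[-2:], mode="bilinear", align_corners=False))
        return f1                              # fused query feature map at the shallowest scale

feats = [torch.randn(1, 128, 64, 64), torch.randn(1, 256, 32, 32), torch.randn(1, 512, 16, 16)]
fused = PFASketch()(feats)                     # (1, 256, 64, 64)
```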
3.3. Feature Interaction Enhancement Module
As shown in
Figure 5a, FSC is typically approached through two mainstream methods: the feature-based approach and the similarity-based approach. In the similarity-based approach, query features are compared pointwise with exemplar features, to obtain a feature similarity map, which is then used for regressing a density map via a trained regression head. Examples of such methods include GAM [
1], CAFOCNet [
21], FAMNet [
20], and SAFECount [
3].
However, it is important to note that the information contained in similarity maps is often much less comprehensive compared to that in raw features. This limitation makes it challenging to obtain detailed information about the target, hindering the network’s ability to distinguish precise object boundaries. Moreover, the feature-based representation method may lose spatial information from the exemplars, due to the presence of pooling layers, which can adversely impact localization.
In this paper, we propose a feature interaction enhancement module that seeks to combine the strengths of both approaches. Specifically, we enhance the original features F from the query image by adjusting the feature weights based on the similarity information R. Subsequently, these enhanced features are used for density map regression. This module comprises the following two components.
3.3.1. Similarity Comparison Module
Lu et al. proposed in GAM [
1] that the counting problem can be transformed into a matching problem by leveraging the self-similarity property inherent in images, which naturally occurs in object counting tasks. This approach allows for counting arbitrary objects in a class-agnostic manner.
For this paper, we employed a standard two-layer transformer decoder structure to assess the self-similarity of images. Specifically, we treated the query image features as queries and used an MLP to transform the exemplar features into two distinct projections: key and value. The cross-attention mechanism in the transformer decoder computed the similarity between the queries and keys, comparing arbitrary regions in the image to the user-defined number of exemplar shots. Each pixel of the query feature map was projected into a token sequence, which entered an embedding layer with a hidden dimension of 512. After passing through the SCM, the vector sequences were converted back into 2D feature maps.
By comparing queries and keys in the similarity space, the model automatically selected the relevant information from the features and, with the help of the few exemplar features given by the user, computed the similarity score R for the user's regions of interest.
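In code, a comparable cross-attention step can be expressed with a standard transformer decoder; the 512-dimensional embedding follows the text, while the token count and shot count below are arbitrary example values.

```python
import torch
import torch.nn as nn

# Two decoder layers whose cross-attention compares query-image tokens (queries)
# with projected exemplar features (keys/values).
decoder_layer = nn.TransformerDecoderLayer(d_model=512, nhead=8, batch_first=True)
scm = nn.TransformerDecoder(decoder_layer, num_layers=2)

query_tokens = torch.randn(1, 32 * 32, 512)     # one token per feature-map pixel
exemplar_kv = torch.randn(1, 3, 512)            # projected 3-shot exemplar features
similarity = scm(tgt=query_tokens, memory=exemplar_kv)                 # (1, 1024, 512)
similarity_map = similarity.transpose(1, 2).reshape(1, 512, 32, 32)    # back to a 2D map
```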
3.3.2. Feature Attention Enhancement Module
Channel attention involves compressing the feature map into a 1D representation along the spatial dimension and performing attention operations along its channel dimension. This module enables the model to prioritize regions of interest while suppressing attention in non-interesting regions. For instance, CBAM [
32] employs a large-scale convolutional kernel to compute attention maps across both the channel and spatial dimensions, and then multiplies the feature map with these attention maps, to adaptively refine the features. Coordinate attention, by contrast, captures long-range dependencies along one spatial direction while preserving precise location information along the other; the resulting feature maps are transformed into direction-aware and position-sensitive attention maps, which can be complementarily applied to the input feature maps, to enhance the representation of the object of interest.
More specifically, given the query features F and the support features obtained from the two encoders, along with the feature similarity map R generated by the SCM, the proposed FAEM enhances the original features. FAEM initially processes the support features through the GAM module; leveraging spatial and channel attention, the network emphasizes the salient objects in the image (i.e., the exemplars). The feature similarity map R is then dimensionally reduced, using a convolutional layer, to limit computational complexity. Subsequently, the attention regions of the feature map are reshaped, and the coordinate attention module derives the reshaped similarity map. This reshaped map is combined with the query feature map F, to strengthen the feature dimensions relevant to the k-shot exemplars provided by the user. In short, the attention distribution is adjusted according to the weights of the support features in the spatial and channel dimensions, using the two attention modules CA and GAM, respectively. This component effectively blends the CNN with the attention mechanism, making the network more focused on the task objectives.
The realization of this part can be divided into two steps:
Similar attention enhancement: In this step, we aggregate the support features by incorporating coordinate attention (CA) [33] and the global attention mechanism (GAM) [34]. The structures are illustrated in Figure 5b,c. Essentially, in relation to the exemplar features provided by the user, regions with higher similarity in F should be assigned greater weights in both the spatial and channel dimensions. The spatial and channel attention mechanisms accomplish this weighted aggregation.
Original image feature enhancement: In this step, we incorporate the enhanced similarity map into the original query features F, using a shallow convolutional layer. This process increases the weights of the feature maps that are close to the target class given by the exemplars. The enhancement is accomplished through a straightforward matrix addition before entering the convolutional layer.
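The sketch below captures these two steps in simplified form; the CA and GAM modules are replaced by a generic channel-plus-spatial attention stand-in, so it should be read as an outline of the idea rather than the published module.

```python
import torch
import torch.nn as nn

class FAEMSketch(nn.Module):
    """Re-weight the similarity map, then add it back onto the query features."""
    def __init__(self, dim=256):
        super().__init__()
        self.channel_att = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                         nn.Conv2d(dim, dim // 8, 1), nn.ReLU(),
                                         nn.Conv2d(dim // 8, dim, 1), nn.Sigmoid())
        self.spatial_att = nn.Sequential(nn.Conv2d(dim, 1, 7, padding=3), nn.Sigmoid())
        self.fuse = nn.Conv2d(dim, dim, 3, padding=1)

    def forward(self, f_query, sim_map):
        # step 1: similar attention enhancement (channel, then spatial re-weighting)
        r = sim_map * self.channel_att(sim_map)
        r = r * self.spatial_att(r)
        # step 2: original image feature enhancement via matrix addition + shallow conv
        return self.fuse(f_query + r)
```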
3.4. Multi-Scale Dense Counting Regression Head
We have obtained global and multi-scale information through PFA and enhanced the features corresponding to exemplars using FAEM, which employs feature similarity and attention mechanisms. Next, we require a robust regression head, to accurately estimate the density map. The structure of this regression head is depicted in
Figure 6a.
In general, a standard CNN followed by a pooling layer can be employed for counting tasks, and this architecture is often stacked multiple times, to generate a density map in the decoder part (e.g., SAFECount). However, this approach continuously reduces the spatial resolution of the feature map, leading to the loss of detailed information and a diminished ability to detect small targets.
To make better use of the information within the feature map, it is essential to expand the receptive field of the regression head. Directly using larger fixed-size convolutional kernels would significantly increase the computational demands of the network. Dilated convolutional (DConv) layers [
35] provide an excellent solution for enlarging the receptive field without increasing computational complexity. A common practice is to stack DConv layers together [
35]. However, this can lead to the gridding effect of DConv layers, where, after passing through multiple convolutional layers, some pixels in the feature map may be missed in subsequent convolutions, as illustrated in
Figure 7b. Therefore, the dilation rate of DConv layers must be carefully chosen, to ensure that the convolution captures a larger receptive field while avoiding the gridding effect and the loss of information at a distance.
In ACECount, ensuring the density map's resolution requires keeping the final feature map at a fixed fraction of the input image's size. When dealing with class-agnostic counting, where target scale variations can be substantial and a diverse range of targets is encountered, having a large receptive field becomes essential. To address this, we employ a dense ASPP structure with the smallest feasible dilation rates. Specifically, the multi-scale dense counting regression head (MDCR) module consists of three parallel DConv branches (i = 1, 2, 3), accompanied by a convolutional branch functioning as a shortcut. The output of each of these dilated convolutions is passed through subsequent convolutional layers before being concatenated and processed through DConvs with varying dilation rates (R = 1, 2, 3). The overall structure of this module can be simplified, to resemble
Figure 6b, and is equivalent to a sequence of cascaded dilated convolutions. Subsequently, we reduce the feature map's channel count through two straightforward convolutional layers, resulting in a one-channel density heatmap. The number of targets is obtained by simply summing the heatmap values. The computational cost and receptive field of this spatial processing component are determined by the convolution kernel size (here 3) and the dilation rate d of each branch.
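A simplified sketch of such a regression head is shown below. The dilation rates 1, 2, and 3 follow the text, while the channel widths are assumptions, and the dense inter-branch connections of the actual MDCR are omitted for brevity.

```python
import torch
import torch.nn as nn

class MDCRSketch(nn.Module):
    """Parallel dilated-convolution branches plus a shortcut, fused into a density map."""
    def __init__(self, dim=256, mid=64):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Sequential(nn.Conv2d(dim, mid, 3, padding=d, dilation=d), nn.ReLU())
            for d in (1, 2, 3))
        self.shortcut = nn.Conv2d(dim, mid, 1)
        self.head = nn.Sequential(nn.Conv2d(mid * 4, mid, 3, padding=1), nn.ReLU(),
                                  nn.Conv2d(mid, 1, 1), nn.ReLU())

    def forward(self, x):
        feats = [branch(x) for branch in self.branches] + [self.shortcut(x)]
        density = self.head(torch.cat(feats, dim=1))   # one-channel density heatmap
        return density, density.sum(dim=(1, 2, 3))     # heatmap and predicted count
```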
3.5. Zero-Shot Counting
In the context of zero-shot counting, where the training data either lack exemplar annotations or do not utilize them, we replace the exemplar features and support features with two learnable tensors. The first is used as both the key and value inputs within the SCM component, while the second serves as the support feature input for the FAEM. This approach enables the network to seamlessly adapt to different values of k in k-shot counting scenarios, thereby enhancing its performance in zero-shot counting tasks.
3.6. Loss Function
We use a loss function that is common in crowd counting, adopted from DM-Count [
16]; it is a weighted sum of a count loss, an optimal transport (OT) loss, and a total variation (TV) loss. Count loss: the L1 norms of the predicted density map and of the ground truth give the predicted count and the actual count, respectively; the count loss requires these two values to be as close as possible and is defined simply as the absolute value of their difference.
Optimal transport loss: the predicted density map z and the ground-truth density map are both unnormalized density functions, which can be converted into probability density functions by dividing each by its respective sum. The OT loss measures the similarity between the two density maps and provides an effective gradient for training the network to minimize the distribution gap between the predicted density map and the ground truth.
Total variation loss: Wang et al. (2020) [
16] pointed out that OT loss approximates dense target regions well but produces inaccurate results for regions with low target density. In this regard, the TV loss is used to stabilize the supervised signal and increase the stability of the training process.
Finally, the DM loss function is a weighted sum of the count loss, the OT loss, and the TV loss, in which z and the ground-truth map denote the estimated and true density maps, respectively, and the adjustable hyperparameters λ1 and λ2 weight the OT and TV terms.
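For reference, these terms can be written out in the form used by DM-Count [16], restated here on the assumption that ACECount adopts them unchanged, with λ1 and λ2 weighting the OT and TV terms:

```latex
% z: predicted density map, \hat{z}: ground-truth density map (notation as in the text);
% \mathcal{W} is the optimal transport (Wasserstein) cost between the normalized maps.
\begin{align*}
\ell_{\mathrm{count}}(z,\hat{z}) &= \bigl|\,\lVert z\rVert_{1} - \lVert\hat{z}\rVert_{1}\,\bigr| \\
\ell_{\mathrm{OT}}(z,\hat{z})    &= \mathcal{W}\!\left(\frac{z}{\lVert z\rVert_{1}},\ \frac{\hat{z}}{\lVert\hat{z}\rVert_{1}}\right) \\
\ell_{\mathrm{TV}}(z,\hat{z})    &= \frac{1}{2}\,\Bigl\lVert \frac{z}{\lVert z\rVert_{1}} - \frac{\hat{z}}{\lVert\hat{z}\rVert_{1}} \Bigr\rVert_{1} \\
\ell(z,\hat{z})                  &= \ell_{\mathrm{count}} + \lambda_{1}\,\ell_{\mathrm{OT}} + \lambda_{2}\,\ell_{\mathrm{TV}}
\end{align*}
% DM-Count additionally scales the TV term by the ground-truth count \lVert\hat{z}\rVert_{1}.
```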
4. Experiments
4.1. Dataset
We conducted experiments using our framework on five benchmark datasets: FSC-147 [
20], CARPK [
34], ShanghaiTech Part A [
17], ShanghaiTech Part B [
17], and UCF-QNRF [
16]. These experiments included class-agnostic counting on FSC-147, class-specific crowd counting on ShanghaiTech and UCF-QNRF, and class-specific car counting on CARPK. These datasets vary in terms of counting classes, image resolutions, the number of objects, object density, and color spaces.
For the FSC-147 dataset, during training we used all the bounding box annotations as exemplars in the 3-shot case. In the i-shot case (where i = 1, 2), i boxes were randomly selected from the three annotated boxes as exemplar inputs.
The ShanghaiTech and UCF-QNRF datasets do not provide bounding box annotations. In these cases, we trained in a zero-shot setting, using learnable tensors instead of exemplar inputs.
In the case of CARPK, which provides bounding box annotations, we obtained the point annotation information by computing the centroid coordinates of the bounding boxes, to generate the true value density map. Exemplars were obtained by randomly selecting the bounding boxes, and the remaining settings were kept consistent with the training of the FSC-147 dataset.
FSC-147 is one of the few available datasets for few-shot counting (FSC). It features a multi-category setup with 147 classes and 6135 images. Each image includes three bounding boxes identifying exemplar instances of the category to be counted. The dataset's data distribution is as follows: the training set comprises 89 classes and 3659 images, while the validation and test sets consist of 1286 and 1190 images, respectively, each covering 29 additional disjoint classes. The object count per image ranges from 7 to 3701, with an average of 56, and images containing over 1000 objects are relatively rare. The FSC-147 dataset presents two primary challenges. First, most prior algorithms were designed to count a specific target class, whereas FSC-147 encompasses diverse categories and necessitates the ability to handle unseen categories during inference. Second, the dataset spans various scenarios, demanding that algorithms effectively filter out context irrelevant to the task goal.
The ShanghaiTech dataset comprises two parts, A and B. Part A includes images collected from the internet, exhibiting a wide range of population density variations. The data distribution for Part A consists of 300 images in the training set and 128 images in the test set. By contrast, Part B contains images captured by surveillance cameras on Shanghai streets, featuring significant intrascene scale and density variations. Part B’s data distribution comprises 400 images in the training set and 316 images in the test set.
UCF-QNRF images were sourced from various websites and exhibit diversity in scenes and sizes. These images primarily contain objects at a small scale. The dataset consists of a total of 1535 images, with 1201 in the training set and 334 in the test set. No validation set is specified.
CARPK serves as a class-specific car counting benchmark commonly used to assess the generalizability of FSC network datasets. It includes 1448 images of cars captured by drones in four different parking lots, encompassing a total of 90,000 cars.
4.2. Implementation Detail
For data augmentation, we uniformly applied random cropping to each image: the FSC-147 and ShanghaiTech images were randomly cropped to one fixed patch size, and a separate fixed crop size was used for the UCF-QNRF images. Random horizontal flipping was also applied to all the datasets above.
In the network’s MDCR, the significant increase in feature map channels due to dense connectivity required strategies to manage model size and computational load. To tackle this, we used multiple convolutional layers to reduce the number of feature channels in the intermediate stage, thereby effectively reducing the overall model’s parameter count.
During training, we froze the weights of the query encoder part of the network and only updated the parameters of the remaining components. After 50 epochs of training, the weights of the encoder part were unfrozen. This strategy preserved the generic semantic information obtained from pre-training on ImageNet, which was loaded into the encoder part of the network. We employed the AdamW [
36] optimizer, a well-suited choice for transformer-based models, with a batch size of 16. A small initial learning rate was used, and L2 regularization of 0.0001 was employed, to prevent overfitting. For the class-specific counting datasets, we initialized the network parameters with weights trained on the FSC-147 dataset and subsequently fine-tuned them on the corresponding CSC dataset, keeping the weights of the backbone frozen.
We conducted our experiments within the PyTorch framework, employing an NVIDIA GeForce RTX 3090 environment.
4.3. Evaluation Metrics
Reviewing previous work in object counting, we adopted the two most commonly used metrics for evaluation: mean absolute error (MAE) and root mean squared error (RMSE). They are defined as follows:
$$\mathrm{MAE}=\frac{1}{N}\sum_{i=1}^{N}\left|\hat{c}_{i}-c_{i}\right|,\qquad \mathrm{RMSE}=\sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(\hat{c}_{i}-c_{i}\right)^{2}},$$
where $\hat{c}_{i}$ denotes the number of people predicted by ACECount, $c_{i}$ denotes the actual number of people in each image, and N indicates the total number of images.
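These metrics are straightforward to compute from per-image counts; a small helper (names are ours) could look like this:

```python
import numpy as np

def counting_metrics(pred_counts, gt_counts):
    """Return (MAE, RMSE) over N images."""
    err = np.asarray(pred_counts, dtype=float) - np.asarray(gt_counts, dtype=float)
    return float(np.abs(err).mean()), float(np.sqrt((err ** 2).mean()))

mae, rmse = counting_metrics([12, 48, 305], [10, 50, 300])   # toy example: MAE = 3.0
```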
4.4. Compared to Different Methods
4.4.1. Class-Agnostic Counting
Training Configuration on Class-Agnostic Counting. The query images and exemplar images were each resized to fixed dimensions before being fed to their respective encoders. As presented in
Table 1, ACECount demonstrated strong performance in both zero-shot and few-shot counting, surpassing previous methods, particularly on the validation set. The visualization results are displayed in
Figure 8.
Comparing Methods. We evaluated and compared the proposed ACECount model to existing CAC methods on the FSC-147 dataset: Under the few-shot training setting, we compared CFOCNet [
16], FSOD [
37], Counting-DETR [
38], BMNet [
2], and FamNet [
20]. For BMNet [
2], we used the variant BMNet+ [
2] that implements both the core components, i.e., the self-similarity module and the dynamic similarity metric. Additionally, for more comparisons, we also tested SAFECount [
3], which applies a similar design to our work, i.e., combining feature-based and similarity-based approaches. We also compared RepRPN-C [
39] and RCC [
40] under the zero-shot training setting, demonstrating that our method can perform well under zero-shot counting as well.
Performance on FSC-147.
Table 1 lists the quantitative results for FSC-147. Specifically, ACECount demonstrated a clear advantage over some methods (CFOCNet, FSOD, Counting-DETR, BMNet+, and FamNet) that are either based solely on features or rely on a single similarity metric. Compared to SAFECount, which is also based on similarity enhancement, ACECount was able to more accurately distinguish the boundary contours of the target object and adapt to scale variations in target objects, identifying smaller targets. ACECount achieved an MAE of 14.93 and an RMSE of 50.11 on the validation set, and an MAE of 14.06 and an RMSE of 83.51 on the test set. It is important to note that SAFECount and BMNet+ already served as a strong baseline for ACECount. This performance improvement was mainly attributed to our network’s enhanced FIEM, which effectively leveraged the attention mechanism and combined similarity and raw feature information, to optimize the use of the encoder’s initial features.
Table 1.
Comparison with the CAC algorithm on the FSC-147 dataset.
| Method | Venue | Shots | Val MAE | Val MSE | Test MAE | Test MSE |
|---|---|---|---|---|---|---|
| RepRPN-C [39] | ACCV23 | 0 | 31.69 | 100.31 | 28.32 | 128.76 |
| RCC [40] | Arxiv22 | 0 | 20.39 | 64.62 | 21.64 | 103.47 |
| ACECount (ours) | 2023 | 0 | 17.92 | 68.31 | 16.16 | 105.23 |
| CFOCNet [16] | WACV21 | 3 | 21.19 | 61.41 | 22.10 | 112.71 |
| FSOD [37] | CVPR20 | 3 | 36.36 | 115.00 | 32.53 | 140.65 |
| FamNet [20] | CVPR21 | 3 | 23.75 | 69.07 | 22.08 | 99.54 |
| BMNet+ [2] | CVPR22 | 3 | 15.74 | 58.53 | 14.62 | 91.83 |
| Counting-DETR [38] | ECCV22 | 3 | - | - | 16.79 | 123.56 |
| SAFECount [3] | WACV23 | 3 | 15.28 | 47.20 | 14.32 | 83.54 |
| ACECount (ours) | 2023 | 3 | 14.93 | 50.11 | 14.06 | 83.51 |
4.4.2. Class-Specific Object Counting
Training Configuration on Class-Specific Counting. In our experiments on the crowd counting datasets, we utilized a zero-shot counting training configuration. We compared our method to two categories of competitors: single-class crowd counting methods and FSC methods. For the vehicle counting dataset, by contrast, we adopted a three-shot training configuration. Our results demonstrated that our method outperforms all FSC methods in crowd counting and matches the performance of state-of-the-art single-class crowd counting methods. Furthermore, our approach excels in vehicle counting. It is worth noting that our model was not originally designed for car counting or crowd counting specifically. The visualization results are displayed in
Figure 9.
Performance on ShanghaiTech. Existing FSC approaches lack the ability to handle multi-scale targets and struggle to detect small-scale crowd targets when transferring from the FSC-147 dataset to the crowd counting datasets. Thanks to our MDCR module, our network efficiently identifies and locates targets that occupy only a few pixels. ACECount outperforms all FSC methods on this CSC task and is on a par with state-of-the-art single-class crowd counting methods.
Performance on UCF-QNRF. The results are presented in
Table 2. Our method consistently outperformed other CAC methods. This is attributable to the detailed information contained within our PFA, facilitating the detection of small objects. Furthermore, our MDCR excelled in capturing multi-scale features and global contextual information, enabling accurate crowd size regression. In comparison to class-specific crowd counting algorithms, our MAE and RMSE were surpassed only by the previous top-performing algorithms.
Performance on CARPK. The model, initially trained on the FSC-147 dataset, underwent fine-tuning on CARPK and was then compared to dedicated car counting models and other general counting models. The results are presented in
Table 3.
4.5. Ablation Study
For this section, we performed comprehensive ablation experiments on the proposed network structure, using the FSC-147 dataset. These experiments aimed to demonstrate the effectiveness of our proposed modules and to assess each module’s contribution to the network. Additionally, we conducted extensive ablation studies, to select hyperparameters, determine the optimal size and number of regression head convolutions, investigate the impact of weight freezing during training, and evaluate the effectiveness of the core modules, SCM and FAEM.
4.5.1. Pyramid Feature Aggregation Module
In
Table 4, we present the network's performance when it received features from different encoder stages, aiming to assess the effectiveness of the pyramid feature aggregation (PFA) module. In the deeper layers of the encoder, the network tended to lose intricate details, retaining primarily the image's high-level semantic features, which could be detrimental to our ultimate goal.
The three PFA inputs were derived from the output features of the three stages of the backbone network. Simultaneously leveraging features from all three stages enabled the aggregation of comprehensive information across various scales, resulting in improved counting performance. The experimental findings highlighted that not only does the incorporation of multi-stage features impact network performance, but the organization of these features also plays a pivotal role. Our designed PFA module consistently outperformed both the straightforward concatenation of the three output features (as indicated in line 4) and the feature fusion using the ASPP approach (line 5).
4.5.2. Multi-Scale Columns
As shown in
Table 5, we assessed the impact on counting performance of varying the number of columns in the DConv layer structure within the MDCR. The three columns denote the successive dilated convolutions depicted in
Figure 6a. By default, all the experiments included a shortcut path comprising a single convolution.
Intriguingly, this structural design primarily aimed to enhance generalization across diverse datasets. Notably, on the FSC dataset, employing a single column also yielded favorable results for the network. However, a more practical approach involved the comprehensive aggregation of all columns, leading to enhanced network performance. We achieved an MAE and an MSE of 14.9 and 50.1, respectively, on the validation set. These results compare favorably to the best outcomes of 16.7 and 62.1, obtained using only a single column of DConv layers, reflecting improvements of −1.8 and −12, respectively. This underscores the indispensability of MDCR.
Simultaneously, we explored different stacking strategies for DConv layers. While stacking three identical DConv layers with a dilation rate of 2 offered some performance improvement, it introduced the gridding issues discussed in Section 3.4, ultimately resulting in performance degradation compared to the approach outlined in this paper. Stacking DConvs with varying dilation rates in depth mitigated this problem, albeit with negligible improvements in the context of object counting tasks.
4.5.3. Freezing Backbone
As we mentioned in the training details in
Section 4.2, we first froze the encoder weights and unfroze them later during training. This aimed to preserve, as much as possible, the high-level semantic information about various categories learned on ImageNet and contained in the pre-trained weights, so that the network remained more general and could count effectively across different categories. We investigated the effect of freezing the weights on the network's performance, and the results are shown in
Table 6. It can be seen intuitively that the operation of freezing weights did bring some performance improvement to our network. The MAE at the time of testing was reduced from 16.3 to 14.0.
4.5.4. Component Analysis of ACECount
We comprehensively assessed the network's designed structures on the FSC-147 dataset, to gauge each component's influence on overall performance. We measured counting performance while incrementally integrating the components, namely PFA, SCM, FAEM, and MDCR. The findings are summarized in
Table 7, revealing notable performance enhancements attributed to incorporating the PFA, SCM, FAEM, and MDCR structures; PFA, SCM, and FAEM were the most influential in augmenting the model's performance. This demonstrates the usefulness of the core modules in our design for the target counting task.