2. Related Work
In recent years, some scholars have tried to improve the robustness of object tracking through deep learning methods, enlarging the training sets to strengthen feature extraction or to obtain a more general tracker. Other scholars use filters to compare partially extracted object features and then update the filter dynamically.
Many scholars have proposed methods to handle motion blur during tracking. In 2021, Qing Guo et al. proposed a new generic scheme based on a generative adversarial network (GAN). They used a fine-tuned discriminator as an adaptive blur evaluator to enable selective frame deblurring during tracking, improving the robustness of tracking blurred objects [
13]. In 2021, Zhongjie Mao et al. introduced image quality assessment (IQA) and deblurring components into the basic D3S (discriminative single-shot segmentation tracker) framework to enhance context patches, thereby improving the accuracy of tracking blurred objects [
14]. In 2021, Zehan Tan et al. used the circle algorithm to calculate the neighborhood of the offset estimation and then applied short-term learning based on SiamRPN++ to achieve online tracking of high-speed targets, mitigating the impact of motion blur [
15]. In 2021, Zhongguan Zhai et al. proposed a space-time memory planar object tracking network (STMPOT) that classifies the pixels of the current frame as object or background by remembering object and background information from each frame, mitigating the effects of motion blur [
16]. In 2022, Chao Liang et al. built a new one-shot multi-object tracking (MOT) tracker. They used a global embedding search to propagate previous trajectories to the current frame and extended the role of ID embedding from data association to motion prediction, improving on trackers that rely only on single-frame detections to predict candidate bounding boxes and thereby addressing the impact of motion blur [
17]. In 2022, Jeongseok Hyun et al. proposed a novel joint detection and tracking (JDT) model that recovers missed detections by learning object-level spatio-temporal coherence from edge features in a graph neural network (GNN) while associating detection candidates across consecutive frames, reducing the effect of motion blur [
18].
Many scholars have proposed methods to handle object deformation during tracking. In 2019, Wenxi Liu et al. proposed a deformable convolutional layer that enriches object appearance representations in a tracking-by-detection framework by adaptively enhancing the original features. They argue that the rich feature representation obtained through deformable convolution helps the convolutional neural network (CNN) classifier distinguish the target object from the background, reducing the influence of deformation on tracking [
19]. In 2019, Detian Huang et al. proposed an improved algorithm that strengthens feature extraction by incorporating multi-domain training, redesigns both the selection criteria for the optimal action and the reward function, and uses an effective online adaptive update strategy to adapt to object deformation during tracking [
20]. In 2020, Jianming Zhang et al. proposed using an offline pre-trained ResNet-101 to extract mid-level and high-level features, combined with correlation filters, to improve the tracking of deforming objects [
21]. In 2021, Shiyong Lan et al. proposed a new approach that embeds an occlusion perception block into the model update stage to adaptively adjust the update according to the degree of occlusion, and that uses relatively stable color statistics to handle appearance and shape changes in large targets, computing histogram response scores as a complement to the final correlation response to mitigate appearance deformation [
22]. In 2022, Xuesong Gao et al. proposed a novel deformed sample generator to obtain a more general classifier without requiring larger training datasets. The classifier and the deformed sample generator are learned jointly, improving the robustness of tracking deformed objects [
23].
Many scholars have proposed methods to handle illumination changes during tracking. In 2017, Yijun Yan et al. proposed foreground detection in visible and thermal images to reduce the sensitivity of red-green-blue (RGB) color to lighting noise and shadows, lessening the impact of illumination variations on tracking moving objects [
24]. In 2018, Shuai Liu et al. proposed an optimized discriminative correlation filter (DCF) tracker that improves accuracy under illumination changes by performing multiple-region detection and using alternate templates (MRAT), preserving the alternate templates through a template update mechanism [
25]. In 2021, Jieming Yang et al. proposed a neural network that expands the training data with the target's historical locations and trains the appearance feature extraction module with a metric loss over the target's historical appearance features, improving extraction performance and addressing the effect of lighting changes on tracking moving objects [
26]. In 2022, Yuxin Zhou and Yi Zhang proposed SiamET, a Siamese-based network using ResNet-50 as its backbone with an enhanced template module. They address the effect of illumination variations on tracking moving objects by building templates from all historical frames [
27].
Many scholars have proposed methods to handle occlusion during tracking. In 2019, Wei Feng et al. proposed a new dynamic saliency-aware regularized correlation filter tracking (DSAR-CF) scheme that defines a simple and efficient energy function to guide the online update of the regularization weight map, addressing the effect of occlusion [
28]. In 2020, Yue Yuan et al. proposed a scale-adaptive object-tracking method to reduce the impact of occlusion. They extracted features from different layers of ResNet to produce response maps fused with the AdaBoost algorithm, prevented the filters from updating when occlusion occurred, and used a scale filter to estimate the target scale [
29]. In 2020, Di Yuan et al. designed a mask set to generate local filters that capture the local structures of the target and adopted an adaptive weighting fusion strategy for these local filters to adapt to changes in the target's appearance, effectively enhancing the robustness of the tracker [
30]. In 2021, Yuan Tai et al. constructed a subspace from image patches of the search window in previous frames. When the object's appearance is occluded, the original image patch used to learn the filter is replaced by a reconstructed patch so that the filter learns from the object instead of the background, reducing the effect of occlusion [
31]. In 2022, Jinkun Cao et al. demonstrated observation-centric SORT (OC-SORT), a simple motion model that reduces the error accumulated by linear motion models while the target is lost, thereby reducing the effect of occlusion on tracking moving objects [
32].
However, most of these scholars propose solutions to only one of the problems of deformation, illumination variation, motion blur, and occlusion in tracking moving objects. A solution to a single phenomenon is not enough to track moving objects correctly and stably. When the surface features of the tracked object change drastically because several factors such as deformation, illumination variation, motion blur, and occlusion occur at once, a tracker that handles only a single phenomenon still suffers from low tracking accuracy or tracking loss. Unfortunately, when tracking moving objects, these four phenomena are encountered almost simultaneously. For these reasons, we propose an adaptive dynamic multi-template object tracker (ADMTCF) in this paper, which can simultaneously overcome the difficulty of tracking moving objects under deformation, motion blur, illumination variation, and occlusion. We believe a strong tracker must have the following capabilities to overcome the four problems of motion blur, deformation, illumination variation, and occlusion simultaneously when tracking:
The template of the tracked object must capture sufficient characteristics of the target;
The template of the tracked object must not be sensitive to illumination variations;
The tracker must maintain more than one set of templates;
The templates must be dynamically updatable.
Adel Bibi and Bernard Ghanem proposed similar concepts. They argued that using multiple, multi-scale templates can improve tracking accuracy and overcome the shortcoming of KCF’s single template size [
33]. We differ from them in two ways. First, they evaluate multiple templates simultaneously, whereas we compute and rank the similarities of multiple templates and then choose the most similar one to track the target in the next frame. Second, they update the tracker’s scale by maximizing over the posterior distribution of a grid of scales, whereas we take the most similar template and then adaptively adjust its size to further improve the similarity between the template and the target.
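A minimal sketch of this select-then-resize idea is shown below. The function names, the use of normalized cross-correlation as the similarity measure, and the candidate scale set are illustrative assumptions, not the exact formulation of our tracker:

```python
import cv2
import numpy as np

def best_template(search_region, templates):
    """Rank a time-ordered template pool by similarity to the search region
    and return the most similar template. Similarity here is normalized
    cross-correlation (an assumption made for illustration); each template
    is assumed to fit inside the search region."""
    scores = [cv2.matchTemplate(search_region, tpl,
                                cv2.TM_CCOEFF_NORMED).max()
              for tpl in templates]
    best = int(np.argmax(scores))
    return templates[best], scores[best]

def refine_scale(search_region, template, scales=(0.95, 1.0, 1.05)):
    """Adaptively resize the chosen template over a few candidate scales
    (illustrative values) and keep the size with the highest similarity."""
    best_tpl, best_score = template, -1.0
    for s in scales:
        h, w = template.shape[:2]
        resized = cv2.resize(template, (max(1, int(w * s)), max(1, int(h * s))))
        if (resized.shape[0] > search_region.shape[0]
                or resized.shape[1] > search_region.shape[1]):
            continue  # a template larger than the search region cannot be matched
        score = cv2.matchTemplate(search_region, resized,
                                  cv2.TM_CCOEFF_NORMED).max()
        if score > best_score:
            best_tpl, best_score = resized, score
    return best_tpl, best_score
```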
The adaptive dynamic multi-template object tracker we propose in this paper has several characteristics. First, we convert the image of the selected object from RGB to HSV color space and then perform local binary pattern (LBP) conversion on the luminance channel to obtain a sample of the object’s image features [
34,
35]. After the HSV color space conversion and LBP conversion, the image features do not change drastically with illumination variation, so our tracker does not lose stability or accuracy when the ambient light changes. Second, unlike a typical tracker, our tracker not only keeps multiple sets of templates sorted by time but also dynamically updates or adds templates while tracking. A template is updated or added when one of two conditions is met: a smaller change threshold is chosen when the target’s characteristics are changing greatly, and a larger change threshold is chosen when they are changing little. Since our tracker has multiple time-sorted template sets that can be updated dynamically, it can overcome the tracking loss caused by deformation or motion blur. Third, our tracker’s template can be resized as the surface features of the moving object change. Compared with our adaptively adjusted template, the traditional use of a single threshold to judge the similarity between the moving object and the template is too rigid and prone to tracking loss. Therefore, during the tracking process, our tracker exhibits excellent robustness even when the target is occluded.
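As a concrete illustration of the feature extraction step, the sketch below converts a BGR frame to HSV and encodes the luminance (V) channel with the classic 3 × 3, 8-neighbor LBP; this specific LBP variant is an assumption, as the exact configuration may differ:

```python
import cv2
import numpy as np

def lbp_luminance(image_bgr):
    """Convert a BGR image to HSV and encode the V (luminance) channel
    with a classic 3x3 local binary pattern (LBP). The 8-neighbor,
    clockwise-weighted LBP used here is the textbook variant."""
    hsv = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2HSV)
    v = hsv[:, :, 2].astype(np.int16)

    # Offsets of the 8 neighbors, starting top-left and going clockwise.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    h, w = v.shape
    lbp = np.zeros((h - 2, w - 2), dtype=np.uint8)
    center = v[1:-1, 1:-1]
    for bit, (dy, dx) in enumerate(offsets):
        # Set the bit where the neighbor is at least as bright as the center.
        neighbor = v[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]
        lbp |= ((neighbor >= center).astype(np.uint8) << bit)
    return lbp
```

Because the LBP code depends only on the sign of local brightness differences, a global change in ambient light leaves the code map largely unchanged, which is what makes the resulting templates insensitive to illumination variation.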
In this paper, we make two main contributions. The first is that our proposed ADMTCF algorithm maintains tracking accuracy and robustness even when the object encounters deformation, illumination change, motion blur, and occlusion. The second is that we propose an evaluation method with a penalty factor, which can objectively reflect the accuracy of various algorithms in estimating the object’s size.
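Purely for illustration, one plausible shape for such a size-aware score is an overlap measure discounted by the relative size-estimation error; the formula and the penalty weight below are our assumptions, not the definition used in this paper:

```python
def penalized_accuracy(iou, pred_area, gt_area, lam=0.5):
    """Hypothetical size-penalized accuracy score (illustrative only).

    iou       : intersection-over-union of predicted and ground-truth boxes
    pred_area : area of the predicted bounding box in pixels
    gt_area   : area of the ground-truth bounding box in pixels
    lam       : penalty weight (assumed value, not taken from the paper)
    """
    size_error = abs(pred_area - gt_area) / gt_area  # relative size error
    return iou / (1.0 + lam * size_error)  # larger size error lowers the score
```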
4. Results
In this section, we used the original KCF, four state-of-the-art tracking algorithms proposed recently, and our adaptive dynamic multi-template object tracker to perform the tracking experiments. In the experiments, the tracked objects suffered from motion blur, deformation, illumination variation, and occlusion. We used four scenes to represent objects encountering illumination variation, deformation, motion blur, and occlusion, respectively. First, we used the first scene to test the trackers on an object in a light-changing environment, as shown in
Figure 9. Second, we used the second scene to test the trackers on a deforming object, as shown in
Figure 10. Third, we used the third scene to test the trackers on a motion-blurred object, as shown in
Figure 11. Finally, we used the fourth scene to test the trackers on an occluded object, as shown in
Figure 12.
The algorithms we used as the experimental control group were KCF, MCCTH, MKCFup, LDES, and SiamRPN++ [
36,
37,
38,
39,
40]. According to João F. Henriques et al., the KCF using HOG features for object tracking, has the advantage of high tracking accuracy rather than using grayscale features or color features [
36]. According to Ning Wang et al., MCCTH uses an adaptively updated multi-cue analysis framework for object tracking and offers good tracking robustness [
37]. According to Ming Tang et al., MKCFup introduces a new type of multi-kernel learning (MKL) that exploits the powerful discriminability of nonlinear kernels and can track high-speed moving objects [
38]. According to Yang Li et al., LDES can cope with scale changes, rotation, and large displacements of the target during tracking [
39]. Bo Li et al. used a training dataset in which object locations are offset from the image center to deepen the SiamRPN++ network and proposed a new model architecture that performs layer-wise and depth-wise aggregations to improve tracking accuracy [
40]. Tracking experiments were performed for scenarios 1, 2, 3, and 4. Screenshots of the tracking results using ADMTCF, KCF, MCCTH, MKCFup, LDES, and SiamRPN++ are shown in
Figure 9,
Figure 10,
Figure 11 and
Figure 12, respectively. The experimental data are shown in
Table 1 and
Table 2.
In scenario 1, the target is a red car driving out of a tunnel. Inside the tunnel, due to the dim light, the target appears dark red, nearly black. When the car exits the tunnel into the outside light, it regains its original red appearance. This video tests the tracker’s ability to follow an object whose color changes, in part or in whole, as the illumination varies. According to
Table 1, the experimental results show that KCF, MKCFup, LDES, and our tracker exhibited good tracking ability under illumination variation.
In scenario 2, the target is a fish moving from left to right. During its movement, the fish gradually turns from side-on to facing the camera, and its image features, size, and outline differ between the side and front views. This video tests the tracker’s ability to handle changes in the target’s appearance and shape. According to
Table 1, the experimental results show that MCCTH, LDES, and KCF exhibited good tracking ability in the case of deformation.
In scenario 3, the target is a pedestrian moving from the left side of the frame to the right. During the movement, the pedestrian’s outline appears blurred. This video tests the tracker’s ability to track objects with a blurred appearance. According to
Table 1, the experimental results show that ADMTCF, KCF, and MCCTH exhibited good tracking ability in the case of motion blur.
In scenario 4, the target is a pedestrian moving from right to left. During the pedestrian’s movement, the target is occluded by a white car. This video tests the tracker’s ability to handle occlusion of the target. According to
Table 1, the experimental results show that MKCFup, LDES, and MCCTH exhibited good tracking ability in the case of occlusion.
In scenarios 1 and 2, because the camera’s position and angle were fixed, the coordinates of the target did not shift due to camera movement or angular swing. Therefore, we marked the bounding boxes of the various trackers and the tracked trajectories in the images of scenarios 1 and 2, as shown in
Figure 10 and
Figure 11. The trajectories are the lines connecting the center points of the bounding boxes across frames.
According to
Table 1 and
Figure 13, we can observe the tracking performance of the various algorithms evaluated with Metric 1. Using HOG features, KCF performed best in scenario 1 (illumination variation). Thanks to its adaptively updated multi-cue analysis framework, MCCTH performed best in scenario 2 (deformation), with LDES second. Our adaptive dynamic multi-template CF performed best in scenario 3, and MKCFup was second in scenario 4.
According to
Table 2 and
Figure 14, we can observe the tracking performance of the various algorithms evaluated with Metric 2. SiamRPN++ performed best in scenario 1, MCCTH performed best in scenario 2, and our adaptive dynamic multi-template CF performed best in scenarios 3 and 4.
We took the average of the AVG accuracy over the four scenarios in
Table 3 and
Table 4 to indicate each algorithm’s tracking ability when illumination variation, deformation, motion blur, and occlusion are encountered at the same time, and plotted the results in
Figure 15a,b. From
Figure 15a, it can be seen that our proposed adaptive dynamic multi-template object tracker achieved third place while maintaining good tracking stability and robustness under illumination variation, deformation, motion blur, and occlusion. From
Figure 15b, it can be seen that our proposed ADMTCF achieved the highest tracking accuracy of all trackers.
Finally, we evaluated the running time of the algorithms. For objectivity, we measured the running time on two different CPUs, an Intel Celeron N2940 (1.83 GHz) and an Intel Core i5-8250U (3.4 GHz). We recorded the frame rates in
Table 5 and
Table 6, respectively. For a fair comparison, we resized all scenario images to a resolution of 720 × 480. In
Table 5 and
Table 6, we found that our algorithm was faster than SiamRPN++. This means that ADMTCF not only runs faster but also reduces the influence of illumination variation, deformation, motion blur, and occlusion when tracking objects. In addition, on the i5-8250U CPU, the best frame rate of our ADMTCF exceeded 30 fps, meeting the requirements of real-time applications without a power-hungry GPU.
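For reference, a minimal sketch of how such per-algorithm frame rates can be measured follows; the tracker_update callable is a hypothetical placeholder interface, not the API of any of the compared implementations:

```python
import time
import cv2

def measure_fps(tracker_update, video_path, size=(720, 480)):
    """Measure the average frames per second of a tracker over one video.

    tracker_update: callable taking one frame and returning a bounding box
    (a hypothetical interface used only for this sketch).
    """
    cap = cv2.VideoCapture(video_path)
    frames, start = 0, time.perf_counter()
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frame = cv2.resize(frame, size)  # normalize resolution as in the experiments
        tracker_update(frame)            # run one tracking step on the frame
        frames += 1
    cap.release()
    elapsed = time.perf_counter() - start
    return frames / elapsed if elapsed > 0 else 0.0
```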