
Spatial feature embedding for robust visual object tracking

Published: 20 December 2023

Abstract

Recently, the offline-trained Siamese pipeline has drawn wide attention due to its outstanding tracking performance. However, existing Siamese trackers rely on offline training to extract 'universal' features, which are insufficient to effectively distinguish the target from fluctuating interference when embedding the information of the two branches, leading to inaccurate classification and localisation. In addition, Siamese trackers crop the search candidate region at a pre-defined scale based on the previous frame's result, which easily introduces redundant background noise (clutter, similar objects etc.), affecting the tracker's robustness. To solve these problems, the authors propose two novel sub-networks for spatial feature embedding for robust object tracking. Specifically, the proposed spatial remapping (SRM) network enhances the feature discrepancy between target and distractor categories by online remapping, improving the discriminative ability of the tracker in the embedding space. Model-agnostic meta-learning (MAML) is used to optimise the SRM network to ensure its adaptability to complex tracking scenarios. Moreover, a temporal information proposal-guided (TPG) network is introduced that utilises a GRU model to dynamically predict the search scale from temporal motion states, reducing potential background interference. The two proposed networks are integrated into two popular trackers, SiamFC++ and TransT, yielding SiamSRMC and SiamSRMT, respectively, which achieve superior performance on six challenging benchmarks: OTB100, VOT2019, UAV123, GOT10K, TrackingNet and LaSOT. Moreover, the proposed trackers obtain competitive performance compared with state-of-the-art trackers on the background-clutter and similar-object attributes, validating the effectiveness of the method.
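
To make the TPG idea above concrete, the following is a minimal sketch, not the authors' implementation: a small GRU that maps a short history of inter-frame motion states to a strictly positive scale factor for cropping the next search region. All names, dimensions and the softplus head are illustrative assumptions.

```python
# Hypothetical sketch of a GRU-based search-scale predictor in the spirit
# of the TPG network; all names and dimensions are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScalePredictorGRU(nn.Module):
    def __init__(self, state_dim=4, hidden_dim=64):
        super().__init__()
        # Each motion state could be, e.g., (dx, dy, dw, dh) between frames.
        self.gru = nn.GRU(state_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, 1)

    def forward(self, motion_states):
        # motion_states: (batch, T, state_dim) history over the last T frames.
        _, h = self.gru(motion_states)           # h: (1, batch, hidden_dim)
        # Softplus keeps the predicted crop scale strictly positive.
        return F.softplus(self.head(h[-1]))      # (batch, 1)

# Usage: predict the next crop scale from a 10-frame motion history.
model = ScalePredictorGRU()
history = torch.randn(1, 10, 4)
scale = model(history)
```

At inference time, such a predicted factor would replace the fixed, pre-defined crop scale that the abstract identifies as a source of redundant background noise.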

Graphical Abstract

The anchor-free Siamese tracking method is prone to 'target-like' classification responses in areas with background clutter and similar distractors, affecting tracking accuracy. The authors propose a spatial remapping network that provides more discriminative metric features for accurate classification and localisation of similar objects, enhancing the tracker's ability to handle distractor regions.
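
As an illustration of the spatial remapping idea, here is a minimal, hypothetical sketch (not the authors' implementation) of an online remapping layer adapted with a single MAML-style gradient step so that pooled target and distractor embeddings are pushed apart; the 1x1 convolution, the margin of 1.0 and the labelling scheme are all assumptions made for this example.

```python
# Hypothetical sketch of online feature remapping with one MAML-style
# inner-loop step; not the paper's actual SRM network.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialRemap(nn.Module):
    """1x1 conv that remaps backbone features into a new embedding space."""
    def __init__(self, channels=256):
        super().__init__()
        self.remap = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, feats):
        return self.remap(feats)

def inner_adapt(remap, feats, labels, lr=0.1):
    """One inner step: push the distractor prototype away from the target
    prototype in the remapped space, then update the remap weights."""
    emb = remap(feats).mean(dim=(2, 3))           # (N, C) pooled embeddings
    target = emb[labels == 1].mean(0)             # target prototype
    distractor = emb[labels == 0].mean(0)         # distractor prototype
    # Margin loss: prototypes closer than the margin incur a penalty.
    loss = F.relu(1.0 - (target - distractor).norm())
    grads = torch.autograd.grad(loss, list(remap.parameters()))
    with torch.no_grad():
        for p, g in zip(remap.parameters(), grads):
            p -= lr * g
    return remap

# Usage: adapt on a few labelled feature crops from early frames.
remap = SpatialRemap()
feats = torch.randn(8, 256, 7, 7)                 # 8 candidate regions
labels = torch.tensor([1, 1, 0, 0, 0, 0, 0, 0])   # 1 = target, 0 = distractor
remap = inner_adapt(remap, feats, labels)
```

In the meta-learning view, an outer loop (omitted here) would train the initial remap weights so that this single inner step already yields a well-separated embedding, which is what MAML optimises for.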



Published In

IET Computer Vision, Volume 18, Issue 4
June 2024
122 pages
EISSN: 1751-9640
DOI: 10.1049/cvi2.v18.4
This is an open access article under the terms of the Creative Commons Attribution License, which permits use, distribution and reproduction in any medium, provided the original work is properly cited.

Publisher

John Wiley & Sons, Inc.

United States


Author Tags

  1. computer vision
  2. distance learning
  3. image motion analysis
  4. object tracking

Qualifiers

  • Research-article
