Abstract
Existing single-stream tracking pipelines achieve strong performance by performing feature extraction and interaction jointly. They establish a bidirectional information flow between the template frame and the search frame, exploiting the correlation and dynamic changes between the two to improve the modeling and representation of the target, thereby increasing tracking accuracy and robustness. However, these pipelines use only the highest-level semantic information from the encoder; low-level features merely serve to compute subsequent activations, which cannot meet the fine-grained requirements of the tracking task. To address this issue, we propose a new approach named multi-local guided tracker (MLGT), which merges features extracted at various depths to strengthen the interaction between different levels of semantic information. Specifically, we divide the single-stream pipeline into fixed output stages, each responsible for extracting and processing features at a different level. We then pass the stage outputs into an enhanced fusion module (EFM), which comprises a shared encoder and a concatenation operation: the encoder further extracts information from the joint features, and the concatenation operation fuses the features from the different output stages. We conduct extensive evaluations on five datasets; among other results, we achieve 70.5% SUC on LaSOT, 1.4 points higher than the existing single-stream tracker OSTrack.
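The abstract describes the EFM only at a high level. As a rough illustration, the following PyTorch sketch shows one plausible realization of fusing multi-stage features with a shared encoder and concatenation; the module structure, dimensions, and all names (EnhancedFusionModule, proj, etc.) are our assumptions for illustration, not the authors' actual implementation.

```python
import torch
import torch.nn as nn

class EnhancedFusionModule(nn.Module):
    """Illustrative sketch of an EFM-style block: one encoder layer shared
    across stages refines the joint template-search tokens from each output
    stage, and the refined features are fused by concatenation followed by
    a linear projection. Names and hyperparameters are assumptions."""

    def __init__(self, dim=768, num_stages=3, num_heads=12):
        super().__init__()
        # A single encoder layer reused for every stage ("shared encoder").
        self.shared_encoder = nn.TransformerEncoderLayer(
            d_model=dim, nhead=num_heads, batch_first=True)
        # Project the concatenated multi-stage features back to `dim`.
        self.proj = nn.Linear(dim * num_stages, dim)

    def forward(self, stage_feats):
        # stage_feats: list of (B, N, dim) token maps, one per output stage.
        refined = [self.shared_encoder(f) for f in stage_feats]
        fused = torch.cat(refined, dim=-1)   # (B, N, dim * num_stages)
        return self.proj(fused)              # (B, N, dim)

# Toy usage: three stages of joint template+search tokens from a ViT backbone.
feats = [torch.randn(2, 320, 768) for _ in range(3)]
out = EnhancedFusionModule()(feats)
print(out.shape)  # torch.Size([2, 320, 768])
```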
References
Javed, S., Danelljan, M., Khan, F.S., Khan, M.H., Felsberg, M., Matas, J.: Visual object tracking with discriminative filters and Siamese networks: a survey and outlook. IEEE Trans. Pattern Anal. Mach. Intell. 45(5), 6552–6574 (2022)
Li, B., Wu, W., Wang, Q., Zhang, F., Xing, J., Yan, J.: SiamRPN++: evolution of Siamese visual tracking with very deep networks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4282–4291 (2019)
Yu, Y., Xiong, Y., Huang, W., Scott, M.R.: Deformable Siamese attention networks for visual object tracking. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6728–6737 (2020)
Choi, S., Lee, J., Lee, Y., Hauptmann, A.: Robust long-term object tracking via improved discriminative model prediction. In: Computer Vision–ECCV 2020 Workshops: Glasgow, UK, August 23–28, 2020, Proceedings, Part V 16, pp. 602–617 (2020). Springer
Zheng, Y., Zhong, B., Liang, Q., Tang, Z., Ji, R., Li, X.: Leveraging local and global cues for visual tracking via parallel interaction network. IEEE Trans. Circuits Syst. Video Technol. 33(4), 1671–1683 (2022)
Zhao, M., Okada, K., Inaba, M.: TrTr: visual tracking with transformer. arXiv preprint arXiv:2105.03817 (2021)
Gao, S., Zhou, C., Ma, C., Wang, X., Yuan, J.: AiATrack: attention in attention for transformer visual tracking. In: European Conference on Computer Vision, pp. 146–164 (2022). Springer
Wang, N., Zhou, W., Wang, J., Li, H.: Transformer meets tracker: exploiting temporal context for robust visual tracking. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1571–1580 (2021)
Chen, B., Li, P., Bai, L., Qiao, L., Shen, Q., Li, B., Gan, W., Wu, W., Ouyang, W.: Backbone is all your need: a simplified architecture for visual object tracking. In: European Conference on Computer Vision, pp. 375–392 (2022). Springer
Cui, Y., Jiang, C., Wang, L., Wu, G.: MixFormer: end-to-end tracking with iterative mixed attention. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13608–13618 (2022)
Ye, B., Chang, H., Ma, B., Shan, S., Chen, X.: Joint feature learning and relation modeling for tracking: a one-stream framework. In: European Conference on Computer Vision, pp. 341–357 (2022). Springer
Gao, S., Zhou, C., Zhang, J.: Generalized relation modeling for transformer tracking. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 18686–18695 (2023)
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Advances in neural information processing systems 30 (2017)
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
Danelljan, M., Bhat, G., Khan, F.S., Felsberg, M.: ATOM: accurate tracking by overlap maximization. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4660–4669 (2019)
Ahmed, I., Jeon, G.: A real-time person tracking system based on SiamMask network for intelligent video surveillance. J. Real-Time Image Proc. 18, 1803–1814 (2021)
Chen, X., Yan, B., Zhu, J., Wang, D., Yang, X., Lu, H.: Transformer tracking. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8126–8135 (2021)
Lin, L., Fan, H., Zhang, Z., Xu, Y., Ling, H.: SwinTrack: a simple and strong baseline for transformer tracking. Adv. Neural. Inf. Process. Syst. 35, 16743–16754 (2022)
Mayer, C., Danelljan, M., Bhat, G., Paul, M., Paudel, D.P., Yu, F., Van Gool, L.: Transforming model prediction for tracking. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8731–8740 (2022)
Yan, B., Peng, H., Fu, J., Wang, D., Lu, H.: Learning spatio-temporal transformer for visual tracking. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10448–10457 (2021)
Xie, F., Wang, C., Wang, G., Cao, Y., Yang, W., Zeng, W.: Correlation-aware deep tracking. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8751–8760 (2022)
Tang, C., Hu, Q., Zhou, G., Yao, J., Zhang, J., Huang, Y., Ye, Q.: Transformer sub-patch matching for high-performance visual object tracking. IEEE Trans. Intell. Transport. Syst. (2023)
Wang, W., Zhang, K., Su, Y., Wang, J., Wang, Q.: Learning cross-attention discriminators via alternating time–space transformers for visual tracking. IEEE Trans. Neural Netw. Learn. Syst. (2023)
Wang, J., Chen, D., Wu, Z., Luo, C., Dai, X., Yuan, L., Jiang, Y.-G.: OmniTracker: unifying object tracking by tracking-with-detection. arXiv preprint arXiv:2303.12079 (2023)
Paul, M., Danelljan, M., Mayer, C., Van Gool, L.: Robust visual tracking by segmentation. In: European Conference on Computer Vision, pp. 571–588 (2022). Springer
Yan, B., Jiang, Y., Sun, P., Wang, D., Yuan, Z., Luo, P., Lu, H.: Towards grand unification of object tracking. In: European Conference on Computer Vision, pp. 733–751 (2022). Springer
Song, Z., Yu, J., Chen, Y.-P.P., Yang, W.: Transformer tracking with cyclic shifting window attention. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8791–8800 (2022)
Mayer, C., Danelljan, M., Paudel, D.P., Van Gool, L.: Learning target candidate association to keep track of what not to track. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 13444–13454 (2021)
Zhang, Z., Liu, Y., Wang, X., Li, B., Hu, W.: Learn to match: automatic matching network design for visual tracking. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 13339–13348 (2021)
Yan, B., Zhang, X., Wang, D., Lu, H., Yang, X.: Alpha-refine: boosting tracking performance by precise bounding box estimation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5289–5298 (2021)
Danelljan, M., Gool, L.V., Timofte, R.: Probabilistic regression for visual tracking. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7183–7192 (2020)
Dai, K., Zhang, Y., Wang, D., Li, J., Lu, H., Yang, X.: High-performance long-term tracking with meta-updater. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6298–6307 (2020)
Tang, C., Wang, X., Bai, Y., Wu, Z., Zhang, J., Huang, Y.: Learning spatial-frequency transformer for visual object tracking. IEEE Trans. Circuits Syst. Video Technol. (2023)
Lin, Y.-E., Li, M., Liang, X., Xia, C.: SiamLight: lightweight networks for object tracking via attention mechanisms and pixel-level cross-correlation. J. Real-Time Image Proc. 20(2), 31 (2023)
Acknowledgements
This research was supported by the Research Foundation of the Institute of Environment-friendly Materials and Occupational Health (Wuhu), Anhui University of Science and Technology under Grant ALW2021YF04, and the Science and Technology Research Project of Wuhu City under Grant 2020yf48.
About this article
Cite this article
Liang, X., Chen, M. & Liu, E. MLGT: multi-local guided tracker for visual object tracking. J Real-Time Image Proc 21, 54 (2024). https://doi.org/10.1007/s11554-024-01418-8