Abstract
Existing single-stream tracking pipelines achieve strong performance by performing feature extraction and interaction jointly. They establish a bidirectional information flow between the template frame and the search frame, exploiting the correlation and dynamic changes between the two to improve the modeling and representation of the target, thereby increasing tracking accuracy and robustness. However, these pipelines use only the highest-level semantic information from the encoder; low-level features merely serve to compute subsequent activations, which cannot meet the fine-grained requirements of the tracking task. To address this issue, we propose a new approach named multi-local guided tracker (MLGT), which merges features extracted at various depths to strengthen the interaction between different levels of semantic information. Specifically, we divide the single-stream pipeline into fixed output stages, each responsible for extracting and processing features at a different level. We then pass the stage outputs into an enhanced fusion module (EFM), which comprises a shared encoder and a concatenation operation: the encoder further extracts information from the joint features, and the concatenation operation fuses the features from the different output stages. We conduct extensive evaluations on five datasets; among other results, we achieve 70.5% SUC on LaSOT, 1.4 points higher than the existing single-stream tracker OSTrack.
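The abstract describes the EFM only at a high level. As a rough illustration, the following PyTorch sketch shows one plausible realization of fusing multi-stage features with a shared encoder and concatenation; the module structure, dimensions, and all names (EnhancedFusionModule, proj, etc.) are our assumptions for illustration, not the authors' actual implementation.

```python
import torch
import torch.nn as nn

class EnhancedFusionModule(nn.Module):
    """Illustrative sketch of an EFM-style block: one encoder layer shared
    across stages refines the joint template-search tokens from each output
    stage, and the refined features are fused by concatenation followed by
    a linear projection. Names and hyperparameters are assumptions."""

    def __init__(self, dim=768, num_stages=3, num_heads=12):
        super().__init__()
        # A single encoder layer reused for every stage ("shared encoder").
        self.shared_encoder = nn.TransformerEncoderLayer(
            d_model=dim, nhead=num_heads, batch_first=True)
        # Project the concatenated multi-stage features back to `dim`.
        self.proj = nn.Linear(dim * num_stages, dim)

    def forward(self, stage_feats):
        # stage_feats: list of (B, N, dim) token maps, one per output stage.
        refined = [self.shared_encoder(f) for f in stage_feats]
        fused = torch.cat(refined, dim=-1)   # (B, N, dim * num_stages)
        return self.proj(fused)              # (B, N, dim)

# Toy usage: three stages of joint template+search tokens from a ViT backbone.
feats = [torch.randn(2, 320, 768) for _ in range(3)]
out = EnhancedFusionModule()(feats)
print(out.shape)  # torch.Size([2, 320, 768])
```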
References
Javed, S., Danelljan, M., Khan, F.S., Khan, M.H., Felsberg, M., Matas, J.: Visual object tracking with discriminative filters and Siamese networks: a survey and outlook. IEEE Trans. Pattern Anal. Mach. Intell. 45(5), 6552–6574 (2022)
Li, B., Wu, W., Wang, Q., Zhang, F., Xing, J., Yan, J.: SiamRPN++: evolution of Siamese visual tracking with very deep networks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4282–4291 (2019)
Yu, Y., Xiong, Y., Huang, W., Scott, M.R.: Deformable Siamese attention networks for visual object tracking. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6728–6737 (2020)
Choi, S., Lee, J., Lee, Y., Hauptmann, A.: Robust long-term object tracking via improved discriminative model prediction. In: Computer Vision–ECCV 2020 Workshops: Glasgow, UK, August 23–28, 2020, Proceedings, Part V 16, pp. 602–617 (2020). Springer
Zheng, Y., Zhong, B., Liang, Q., Tang, Z., Ji, R., Li, X.: Leveraging local and global cues for visual tracking via parallel interaction network. IEEE Trans. Circuits Syst. Video Technol. 33(4), 1671–1683 (2022)
Zhao, M., Okada, K., Inaba, M.: TrTr: visual tracking with transformer. arXiv preprint arXiv:2105.03817 (2021)
Gao, S., Zhou, C., Ma, C., Wang, X., Yuan, J.: AiATrack: attention in attention for transformer visual tracking. In: European Conference on Computer Vision, pp. 146–164 (2022). Springer
Wang, N., Zhou, W., Wang, J., Li, H.: Transformer meets tracker: exploiting temporal context for robust visual tracking. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1571–1580 (2021)
Chen, B., Li, P., Bai, L., Qiao, L., Shen, Q., Li, B., Gan, W., Wu, W., Ouyang, W.: Backbone is all your need: a simplified architecture for visual object tracking. In: European Conference on Computer Vision, pp. 375–392 (2022). Springer
Cui, Y., Jiang, C., Wang, L., Wu, G.: MixFormer: end-to-end tracking with iterative mixed attention. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13608–13618 (2022)
Ye, B., Chang, H., Ma, B., Shan, S., Chen, X.: Joint feature learning and relation modeling for tracking: a one-stream framework. In: European Conference on Computer Vision, pp. 341–357 (2022). Springer
Gao, S., Zhou, C., Zhang, J.: Generalized relation modeling for transformer tracking. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 18686–18695 (2023)
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Advances in neural information processing systems 30 (2017)
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
Danelljan, M., Bhat, G., Khan, F.S., Felsberg, M.: ATOM: accurate tracking by overlap maximization. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4660–4669 (2019)
Ahmed, I., Jeon, G.: A real-time person tracking system based on SiamMask network for intelligent video surveillance. J. Real-Time Image Proc. 18, 1803–1814 (2021)
Chen, X., Yan, B., Zhu, J., Wang, D., Yang, X., Lu, H.: Transformer tracking. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8126–8135 (2021)
Lin, L., Fan, H., Zhang, Z., Xu, Y., Ling, H.: SwinTrack: a simple and strong baseline for transformer tracking. Adv. Neural. Inf. Process. Syst. 35, 16743–16754 (2022)
Mayer, C., Danelljan, M., Bhat, G., Paul, M., Paudel, D.P., Yu, F., Van Gool, L.: Transforming model prediction for tracking. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8731–8740 (2022)
Yan, B., Peng, H., Fu, J., Wang, D., Lu, H.: Learning spatio-temporal transformer for visual tracking. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10448–10457 (2021)
Xie, F., Wang, C., Wang, G., Cao, Y., Yang, W., Zeng, W.: Correlation-aware deep tracking. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8751–8760 (2022)
Tang, C., Hu, Q., Zhou, G., Yao, J., Zhang, J., Huang, Y., Ye, Q.: Transformer sub-patch matching for high-performance visual object tracking. IEEE Trans. Intell. Transport. Syst. (2023)
Wang, W., Zhang, K., Su, Y., Wang, J., Wang, Q.: Learning cross-attention discriminators via alternating time–space transformers for visual tracking. IEEE Trans. Neural Netw. Learn. Syst. (2023)
Wang, J., Chen, D., Wu, Z., Luo, C., Dai, X., Yuan, L., Jiang, Y.-G.: OmniTracker: unifying object tracking by tracking-with-detection. arXiv preprint arXiv:2303.12079 (2023)
Paul, M., Danelljan, M., Mayer, C., Van Gool, L.: Robust visual tracking by segmentation. In: European Conference on Computer Vision, pp. 571–588 (2022). Springer
Yan, B., Jiang, Y., Sun, P., Wang, D., Yuan, Z., Luo, P., Lu, H.: Towards grand unification of object tracking. In: European Conference on Computer Vision, pp. 733–751 (2022). Springer
Song, Z., Yu, J., Chen, Y.-P.P., Yang, W.: Transformer tracking with cyclic shifting window attention. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8791–8800 (2022)
Mayer, C., Danelljan, M., Paudel, D.P., Van Gool, L.: Learning target candidate association to keep track of what not to track. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 13444–13454 (2021)
Zhang, Z., Liu, Y., Wang, X., Li, B., Hu, W.: Learn to match: automatic matching network design for visual tracking. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 13339–13348 (2021)
Yan, B., Zhang, X., Wang, D., Lu, H., Yang, X.: Alpha-refine: boosting tracking performance by precise bounding box estimation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5289–5298 (2021)
Danelljan, M., Gool, L.V., Timofte, R.: Probabilistic regression for visual tracking. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7183–7192 (2020)
Dai, K., Zhang, Y., Wang, D., Li, J., Lu, H., Yang, X.: High-performance long-term tracking with meta-updater. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6298–6307 (2020)
Tang, C., Wang, X., Bai, Y., Wu, Z., Zhang, J., Huang, Y.: Learning spatial-frequency transformer for visual object tracking. IEEE Trans. Circuits Syst. Video Technol. (2023)
Lin, Y.-E., Li, M., Liang, X., Xia, C.: SiamLight: lightweight networks for object tracking via attention mechanisms and pixel-level cross-correlation. J. Real-Time Image Proc. 20(2), 31 (2023)
Acknowledgements
This research was supported by the Research Foundation of the Institute of Environment-friendly Materials and Occupational Health (Wuhu), Anhui University of Science and Technology under Grant ALW2021YF04, and the Science and Technology Research Project of Wuhu City under Grant 2020yf48.
About this article
Cite this article
Liang, X., Chen, M. & Liu, E. MLGT: multi-local guided tracker for visual object tracking. J Real-Time Image Proc 21, 54 (2024). https://doi.org/10.1007/s11554-024-01418-8