Visual Tracking via Dynamic Memory Networks

Published: 01 January 2021

Abstract

Template-matching methods for visual tracking have recently gained popularity due to their good performance and fast speed. However, they lack effective ways to adapt to changes in the target object's appearance, so their tracking accuracy is still far from the state of the art. In this paper, we propose a dynamic memory network that adapts the template to the target's appearance variations during tracking. The reading and writing of the external memory are controlled by an LSTM network that takes the search feature map as input. Because the location of the target is initially unknown, a spatial attention mechanism is applied to concentrate the LSTM input on the potential target region. To prevent overly aggressive model adaptation, we apply gated residual template learning to control how much of the retrieved memory is combined with the initial template. To alleviate the drift problem, we also design a “negative” memory unit that stores templates for distractors, which are used to cancel out wrong responses from the object template. To further boost tracking performance, an auxiliary classification loss is added after the feature extractor. Unlike tracking-by-detection methods, where the object's information is maintained by the weight parameters of a neural network and expensive online fine-tuning is required for adaptation, our tracker runs completely feed-forward and adapts to the target's appearance changes by updating the external memory. Moreover, the capacity of our model is not determined by the network size as in other trackers; it can easily be enlarged as the memory requirements of a task increase, which is favorable for memorizing long-term object information. Extensive experiments on the OTB and VOT datasets demonstrate that our trackers perform favorably against state-of-the-art methods while running at real-time speed.
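
As a concrete illustration of the update mechanism sketched in the abstract, below is a minimal NumPy sketch of the three ideas it names: a soft read from the external memory, gated residual combination of the retrieved template with the initial template, and a response map in which a negative (distractor) template cancels out wrong responses. All names, shapes, and values here (read_memory, gated_residual_template, xcorr, the toy gate, and so on) are hypothetical stand-ins and are not taken from the paper's implementation.

    # Illustrative sketch only: hypothetical names and shapes, not the paper's code.
    import numpy as np

    def xcorr(search, template):
        # Dense cross-correlation of a template over a search feature map
        # (both channels-last), as in Siamese template matching.
        H, W, _ = search.shape
        h, w, _ = template.shape
        out = np.zeros((H - h + 1, W - w + 1))
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                out[i, j] = np.sum(search[i:i + h, j:j + w] * template)
        return out

    def read_memory(memory, read_weights):
        # Soft read: weighted sum over stored template slots.
        # memory: (num_slots, h, w, c); read_weights: (num_slots,).
        return np.tensordot(read_weights, memory, axes=1)

    def gated_residual_template(initial_template, retrieved, gate):
        # Gated residual template learning: only a gated fraction of the
        # retrieved memory is added to the fixed initial template,
        # limiting overly aggressive model updates.
        return initial_template + gate * retrieved

    def response_map(search_feat, pos_template, neg_template):
        # The negative ("distractor") template cancels out wrong responses
        # produced by the object template.
        return xcorr(search_feat, pos_template) - xcorr(search_feat, neg_template)

    # Toy usage with random features.
    rng = np.random.default_rng(0)
    search_feat = rng.standard_normal((22, 22, 32))
    initial_template = rng.standard_normal((6, 6, 32))
    memory = rng.standard_normal((8, 6, 6, 32))    # 8 stored templates
    read_weights = np.full(8, 1.0 / 8)             # uniform soft read weights
    gate = 0.3 * np.ones_like(initial_template)    # stand-in residual gate

    pos_template = gated_residual_template(
        initial_template, read_memory(memory, read_weights), gate)
    neg_template = rng.standard_normal((6, 6, 32)) # stand-in distractor template
    score = response_map(search_feat, pos_template, neg_template)
    print(score.shape)                             # (17, 17)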




Published In

IEEE Transactions on Pattern Analysis and Machine Intelligence, Volume 43, Issue 1, January 2021, 374 pages

Publisher

IEEE Computer Society, United States

    Qualifiers

    • Research-article

    Contributors

    Other Metrics

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months)0
    • Downloads (Last 6 weeks)0
    Reflects downloads up to 10 Nov 2024

    Other Metrics

    Citations

    Cited By

    View all
    • (2024)Recursive Least-Squares Estimator-Aided Online Learning for Visual TrackingIEEE Transactions on Pattern Analysis and Machine Intelligence10.1109/TPAMI.2022.315697746:3(1881-1897)Online publication date: 1-Mar-2024
    • (2023)Moving Towards Centers: Re-Ranking With Attention and Memory for Re-IdentificationIEEE Transactions on Multimedia10.1109/TMM.2022.316118925(3456-3468)Online publication date: 1-Jan-2023
    • (2022)AAP-MIT: Attentive Atrous Pyramid Network and Memory Incorporated Transformer for Multisentence Video DescriptionIEEE Transactions on Image Processing10.1109/TIP.2022.319564331(5559-5569)Online publication date: 1-Jan-2022
    • (2021)Learning complementary Siamese networks for real-time high-performance visual trackingJournal of Visual Communication and Image Representation10.1016/j.jvcir.2021.10329980:COnline publication date: 30-Dec-2021
    • (2020)Reliable correlation tracking via dual-memory selection modelInformation Sciences: an International Journal10.1016/j.ins.2020.01.015518:C(238-255)Online publication date: 1-May-2020

    View Options

    View options

    Get Access

    Login options

    Media

    Figures

    Other

    Tables

    Share

    Share

    Share this Publication link

    Share on social media