Visual Tracking via Dynamic Memory Networks

Published: 01 January 2021

Abstract

Template-matching methods for visual tracking have recently gained popularity due to their good performance and fast speed. However, they lack effective ways to adapt to changes in the target object's appearance, so their tracking accuracy is still far from the state of the art. In this paper, we propose a dynamic memory network that adapts the template to the target's appearance variations during tracking. The reading and writing of the external memory are controlled by an LSTM network that takes the search feature map as input. Because the location of the target is initially unknown, a spatial attention mechanism is applied to concentrate the LSTM input on the potential target region. To prevent overly aggressive model adaptation, we apply gated residual template learning to control how much of the retrieved memory is combined with the initial template. To alleviate the drift problem, we also design a “negative” memory unit that stores templates for distractors, which are used to cancel out wrong responses from the object template. To further boost tracking performance, an auxiliary classification loss is added after the feature extractor. Unlike tracking-by-detection methods, where the object's information is maintained by the weight parameters of a neural network and expensive online fine-tuning is required for adaptation, our tracker runs completely feed-forward and adapts to the target's appearance changes by updating the external memory. Moreover, the capacity of our model is not determined by the network size as in other trackers; it can easily be enlarged as the memory requirements of a task increase, which is favorable for memorizing long-term object information. Extensive experiments on the OTB and VOT datasets demonstrate that our trackers perform favorably against state-of-the-art methods while running at real-time speed.
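
As a concrete illustration of the update mechanism sketched in the abstract, below is a minimal NumPy sketch of the three ideas it names: a soft read from the external memory, gated residual combination of the retrieved template with the initial template, and a response map in which a negative (distractor) template cancels out wrong responses. All names, shapes, and values here (read_memory, gated_residual_template, xcorr, the toy gate, and so on) are hypothetical stand-ins and are not taken from the paper's implementation.

    # Illustrative sketch only: hypothetical names and shapes, not the paper's code.
    import numpy as np

    def xcorr(search, template):
        # Dense cross-correlation of a template over a search feature map
        # (both channels-last), as in Siamese template matching.
        H, W, _ = search.shape
        h, w, _ = template.shape
        out = np.zeros((H - h + 1, W - w + 1))
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                out[i, j] = np.sum(search[i:i + h, j:j + w] * template)
        return out

    def read_memory(memory, read_weights):
        # Soft read: weighted sum over stored template slots.
        # memory: (num_slots, h, w, c); read_weights: (num_slots,).
        return np.tensordot(read_weights, memory, axes=1)

    def gated_residual_template(initial_template, retrieved, gate):
        # Gated residual template learning: only a gated fraction of the
        # retrieved memory is added to the fixed initial template,
        # limiting overly aggressive model updates.
        return initial_template + gate * retrieved

    def response_map(search_feat, pos_template, neg_template):
        # The negative ("distractor") template cancels out wrong responses
        # produced by the object template.
        return xcorr(search_feat, pos_template) - xcorr(search_feat, neg_template)

    # Toy usage with random features.
    rng = np.random.default_rng(0)
    search_feat = rng.standard_normal((22, 22, 32))
    initial_template = rng.standard_normal((6, 6, 32))
    memory = rng.standard_normal((8, 6, 6, 32))    # 8 stored templates
    read_weights = np.full(8, 1.0 / 8)             # uniform soft read weights
    gate = 0.3 * np.ones_like(initial_template)    # stand-in residual gate

    pos_template = gated_residual_template(
        initial_template, read_memory(memory, read_weights), gate)
    neg_template = rng.standard_normal((6, 6, 32)) # stand-in distractor template
    score = response_map(search_feat, pos_template, neg_template)
    print(score.shape)                             # (17, 17)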




Published In

IEEE Transactions on Pattern Analysis and Machine Intelligence, Volume 43, Issue 1, January 2021, 374 pages

Publisher

IEEE Computer Society, United States

    Qualifiers

    • Research-article

    Contributors

    Other Metrics

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months)0
    • Downloads (Last 6 weeks)0
    Reflects downloads up to 10 Nov 2024

    Other Metrics

    Citations

    Cited By

    View all
    • (2024)Recursive Least-Squares Estimator-Aided Online Learning for Visual TrackingIEEE Transactions on Pattern Analysis and Machine Intelligence10.1109/TPAMI.2022.315697746:3(1881-1897)Online publication date: 1-Mar-2024
    • (2023)Moving Towards Centers: Re-Ranking With Attention and Memory for Re-IdentificationIEEE Transactions on Multimedia10.1109/TMM.2022.316118925(3456-3468)Online publication date: 1-Jan-2023
    • (2022)AAP-MIT: Attentive Atrous Pyramid Network and Memory Incorporated Transformer for Multisentence Video DescriptionIEEE Transactions on Image Processing10.1109/TIP.2022.319564331(5559-5569)Online publication date: 1-Jan-2022
    • (2021)Learning complementary Siamese networks for real-time high-performance visual trackingJournal of Visual Communication and Image Representation10.1016/j.jvcir.2021.10329980:COnline publication date: 30-Dec-2021
    • (2020)Reliable correlation tracking via dual-memory selection modelInformation Sciences: an International Journal10.1016/j.ins.2020.01.015518:C(238-255)Online publication date: 1-May-2020

    View Options

    View options

    Get Access

    Login options

    Media

    Figures

    Other

    Tables

    Share

    Share

    Share this Publication link

    Share on social media