
Tracking Persons-of-Interest via Unsupervised Representation Adaptation

Published: 01 January 2020

Abstract

Multi-face tracking in unconstrained videos is a challenging problem, as the faces of one person can appear drastically different across shots due to significant variations in scale, pose, expression, illumination, and make-up. Existing multi-target tracking methods often use low-level features that are not sufficiently discriminative for identifying faces with such large appearance variations. In this paper, we tackle this problem by learning discriminative, video-specific face representations using convolutional neural networks (CNNs). Unlike existing CNN-based approaches, which are trained only offline on large-scale face image datasets, we automatically generate a large number of training samples from the contextual constraints of a given video and adapt the pre-trained face CNN to the characters in that video using the discovered samples. The embedding feature space is fine-tuned so that Euclidean distance in the space corresponds to semantic face similarity. To this end, we devise a symmetric triplet loss function that optimizes the network more effectively than the conventional triplet loss. With the learned discriminative features, we apply an EM clustering algorithm to link tracklets across multiple shots into the final trajectories. We extensively evaluate the proposed algorithm on two sets of TV sitcoms and YouTube music videos, analyze the contribution of each component, and demonstrate significant performance improvements over existing techniques.
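The symmetric-triplet idea from the abstract can be sketched as follows. This is an illustrative formulation under our own assumptions, not the paper's exact loss: whereas the conventional triplet loss uses only the anchor as a reference point, a symmetric variant treats the anchor and positive interchangeably, pushing the negative away from both members of the matching pair.

```python
import numpy as np

def symmetric_triplet_loss(anchor, positive, negative, margin=1.0):
    """Illustrative symmetric triplet loss on embedding vectors.

    Conventional triplet loss: max(0, d(a, p) - d(a, n) + margin),
    which references only the anchor. The symmetric variant below
    also penalizes a small positive-to-negative distance, so the
    negative is pushed away from both the anchor and the positive.
    """
    d_ap = np.sum((anchor - positive) ** 2)   # pull matching pair together
    d_an = np.sum((anchor - negative) ** 2)   # push negative from anchor
    d_pn = np.sum((positive - negative) ** 2) # push negative from positive
    return max(0.0, 2.0 * d_ap - (d_an + d_pn) + margin)
```

In a training loop, triplets would be mined from the video-specific constraints the abstract describes (faces in one tracklet as positives, co-occurring tracklets as negatives), and the loss minimized over the CNN embedding.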


Cited By

  • (2024)BoT-FaceSORT: Bag-of-Tricks for Robust Multi-face Tracking in Unconstrained VideosComputer Vision – ACCV 202410.1007/978-981-96-0901-7_17(278-294)Online publication date: 8-Dec-2024
  • (2023)Online Multi-Face Tracking With Multi-Modality Cascaded MatchingIEEE Transactions on Circuits and Systems for Video Technology10.1109/TCSVT.2022.322469933:6(2738-2752)Online publication date: 1-Jun-2023
  • (2022)Harmonious Multi-branch Network for Person Re-identification with Harder Triplet LossACM Transactions on Multimedia Computing, Communications, and Applications10.1145/350140518:4(1-21)Online publication date: 4-Mar-2022


          Published In

          International Journal of Computer Vision  Volume 128, Issue 1
          Jan 2020
          259 pages

          Publisher

          Kluwer Academic Publishers

          United States

          Publication History

          Published: 01 January 2020
          Accepted: 12 August 2019
          Received: 13 May 2019

          Author Tags

          1. Face tracking
          2. Transfer learning
          3. Convolutional neural networks
          4. Triplet loss

          Qualifiers

          • Research-article

          Funding Sources

          • National Science Foundation
