
Tracking Persons-of-Interest via Unsupervised Representation Adaptation

Published: 01 January 2020

Abstract

Multi-face tracking in unconstrained videos is a challenging problem, as the faces of one person can appear drastically different across shots due to significant variations in scale, pose, expression, illumination, and make-up. Existing multi-target tracking methods often use low-level features that are not sufficiently discriminative for identifying faces with such large appearance variations. In this paper, we tackle this problem by learning discriminative, video-specific face representations using convolutional neural networks (CNNs). Unlike existing CNN-based approaches, which are trained only offline on large-scale face image datasets, we automatically generate a large number of training samples from the contextual constraints of a given video and adapt the pre-trained face CNN to the characters in that video using the discovered samples. The embedding feature space is fine-tuned so that Euclidean distance in the space corresponds to semantic face similarity. To this end, we devise a symmetric triplet loss function that optimizes the network more effectively than the conventional triplet loss. With the learned discriminative features, we apply an EM clustering algorithm to link tracklets across multiple shots into the final trajectories. We extensively evaluate the proposed algorithm on two sets of TV sitcoms and YouTube music videos, analyze the contribution of each component, and demonstrate significant performance improvements over existing techniques.
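The symmetric-triplet idea from the abstract can be sketched as follows. This is an illustrative formulation under our own assumptions, not the paper's exact loss: whereas the conventional triplet loss uses only the anchor as a reference point, a symmetric variant treats the anchor and positive interchangeably, pushing the negative away from both members of the matching pair.

```python
import numpy as np

def symmetric_triplet_loss(anchor, positive, negative, margin=1.0):
    """Illustrative symmetric triplet loss on embedding vectors.

    Conventional triplet loss: max(0, d(a, p) - d(a, n) + margin),
    which references only the anchor. The symmetric variant below
    also penalizes a small positive-to-negative distance, so the
    negative is pushed away from both the anchor and the positive.
    """
    d_ap = np.sum((anchor - positive) ** 2)   # pull matching pair together
    d_an = np.sum((anchor - negative) ** 2)   # push negative from anchor
    d_pn = np.sum((positive - negative) ** 2) # push negative from positive
    return max(0.0, 2.0 * d_ap - (d_an + d_pn) + margin)
```

In a training loop, triplets would be mined from the video-specific constraints the abstract describes (faces in one tracklet as positives, co-occurring tracklets as negatives), and the loss minimized over the CNN embedding.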


Cited By

  • (2024)BoT-FaceSORT: Bag-of-Tricks for Robust Multi-face Tracking in Unconstrained VideosComputer Vision – ACCV 202410.1007/978-981-96-0901-7_17(278-294)Online publication date: 8-Dec-2024
  • (2023)Online Multi-Face Tracking With Multi-Modality Cascaded MatchingIEEE Transactions on Circuits and Systems for Video Technology10.1109/TCSVT.2022.322469933:6(2738-2752)Online publication date: 1-Jun-2023
  • (2022)Harmonious Multi-branch Network for Person Re-identification with Harder Triplet LossACM Transactions on Multimedia Computing, Communications, and Applications10.1145/350140518:4(1-21)Online publication date: 4-Mar-2022


          Published In

          International Journal of Computer Vision  Volume 128, Issue 1
          Jan 2020
          259 pages

          Publisher

          Kluwer Academic Publishers

          United States

          Publication History

          Published: 01 January 2020
          Accepted: 12 August 2019
          Received: 13 May 2019

          Author Tags

          1. Face tracking
          2. Transfer learning
          3. Convolutional neural networks
          4. Triplet loss

          Qualifiers

          • Research-article

          Funding Sources

          • National Science Foundation
