BoT-FaceSORT: Bag-of-Tricks for Robust Multi-face Tracking in Unconstrained Videos

Authors:

Dong-Ho Lee

Computer Vision – ACCV 2024: 17th Asian Conference on Computer Vision, Hanoi, Vietnam, December 8–12, 2024, Proceedings, Part II

Pages 278 - 294

https://doi.org/10.1007/978-981-96-0901-7_17

Published: 08 December 2024

Abstract

Multi-face tracking (MFT) is a subtask of multi-object tracking (MOT) that focuses on detecting and tracking multiple faces across video frames. Modern MOT trackers adopt the Kalman filter (KF), a linear model that estimates current motions based on previous observations. However, these KF-based trackers struggle to predict motions in unconstrained videos with frequent shot changes, occlusions, and appearance variations. To address these limitations, we propose BoT-FaceSORT, a novel MFT framework that integrates shot change detection, shared feature memory, and an adaptive cascade matching strategy for robust tracking. It detects shot changes by comparing the color histograms of adjacent frames and resets KF states to handle discontinuities. Additionally, we introduce MovieShot, a new benchmark of challenging movie clips to evaluate MFT performance in unconstrained scenarios. We also demonstrate the superior performance of our method compared to existing methods on three benchmarks, while an ablation study validates the effectiveness of each component in handling unconstrained videos.
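The shot change step described in the abstract (compare the color histograms of adjacent frames, then reset the KF states) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the chi-square distance, the `threshold=0.5` value, the 8-bin quantization, the toy `Track` class, and the reset policy of zeroing velocities and widening the variances. Frames are plain nested lists of `(r, g, b)` tuples so the example stays dependency-free; a real pipeline would more likely use NumPy or OpenCV arrays.

```python
from collections import Counter


def color_histogram(frame, bins=8):
    """Normalized joint RGB histogram.

    frame: rows of (r, g, b) tuples with values in 0..255.
    bins: bins per channel; assumed to divide 256 evenly.
    """
    step = 256 // bins
    counts = Counter(
        (r // step, g // step, b // step) for row in frame for r, g, b in row
    )
    total = sum(counts.values())
    return {key: n / total for key, n in counts.items()}


def chi_square_distance(h1, h2):
    """Symmetric chi-square distance: 0 for identical, 1 for disjoint histograms."""
    d = 0.0
    for key in set(h1) | set(h2):
        a, b = h1.get(key, 0.0), h2.get(key, 0.0)
        d += (a - b) ** 2 / (a + b)  # a + b > 0 for every key in the union
    return 0.5 * d


def is_shot_change(prev_frame, frame, threshold=0.5):
    """Flag a shot change when adjacent frames differ by more than threshold."""
    d = chi_square_distance(color_histogram(prev_frame), color_histogram(frame))
    return d > threshold


class Track:
    """Toy track with a constant-velocity state [cx, cy, w, h, vx, vy, vw, vh]."""

    def __init__(self, box):
        self.mean = [float(v) for v in box] + [0.0] * 4
        self.var = [1.0] * 8  # diagonal of the state covariance

    def reset_motion(self, pos_var=10.0, vel_var=100.0):
        """Assumed reset policy: forget velocity, widen the uncertainty."""
        self.mean[4:] = [0.0] * 4
        self.var = [pos_var] * 4 + [vel_var] * 4


def handle_frame(tracks, prev_frame, frame):
    """Reset the motion state of every track if a shot change is detected."""
    changed = is_shot_change(prev_frame, frame)
    if changed:
        for track in tracks:
            track.reset_motion()
    return changed
```

Calling `handle_frame(tracks, prev_frame, frame)` once per frame, before the usual KF prediction step, keeps stale velocity estimates from one shot from being carried into the next.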

Index Terms

1. Computing methodologies
  1. Artificial intelligence
    1. Computer vision
  2. Computer graphics
    1. Animation
      1. Motion capture
      2. Motion processing

Index terms have been assigned to the content through auto-classification.

Published In

Computer Vision – ACCV 2024: 17th Asian Conference on Computer Vision, Hanoi, Vietnam, December 8–12, 2024, Proceedings, Part II

Dec 2024, 520 pages

ISBN: 978-981-96-0900-0

DOI: 10.1007/978-981-96-0901-7

Editors:
Minsu Cho, Pohang University of Science and Technology (POSTECH), Pohang, Korea (Republic of)
Ivan Laptev, Mohamed bin Zayed University of Artificial Intelligence, Abu Dhabi, United Arab Emirates
Du Tran, Google, Mountain View, CA, USA
Angela Yao, National University of Singapore, Singapore, Singapore
Hongbin Zha, Peking University, Beijing, China

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2025.

Publisher

Springer-Verlag

Berlin, Heidelberg
