Abstract
Video face recognition (VFR) has attracted significant attention at the intersection of computer vision and artificial intelligence, transforming identity authentication and verification. Unlike traditional image-based methods, VFR exploits the temporal dimension of video to extract more comprehensive and accurate facial information. However, VFR demands substantial computing power and robust handling of noisy frames to achieve strong recognition performance. This paper introduces TempoViT, a length-adaptive VFR framework built on a Vision Transformer with a recurrent mechanism. TempoViT efficiently captures both spatial and temporal information from face videos, enabling accurate and reliable recognition while mitigating the high GPU memory cost of video processing. By reusing hidden states from previous frames, the framework establishes recurrent links between frames and thereby models long-term dependencies. Experiments show that TempoViT achieves state-of-the-art performance on video face recognition benchmarks including iQIYI-ViD, YTF, IJB-C, and Honda/UCSD.
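To make the recurrence idea concrete, below is a minimal sketch (not the authors' code) of a transformer block that caches the hidden states of the previous frame and attends over them together with the current frame's patch tokens, in the spirit of Transformer-XL-style recurrence. All names here (RecurrentFrameBlock, mem) are illustrative assumptions; TempoViT's actual layer layout, memory length, and training details are not specified in the abstract.

# Sketch of per-frame recurrence with hidden-state reuse (PyTorch).
# Assumption: queries come from the current frame, keys/values from the
# concatenation [cached previous-frame states; current frame tokens].
import torch
import torch.nn as nn


class RecurrentFrameBlock(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x, mem=None):
        # x:   (B, N, D) patch tokens of the current frame
        # mem: (B, M, D) cached hidden states from the previous frame
        kv = x if mem is None else torch.cat([mem, x], dim=1)
        attn_out, _ = self.attn(self.norm1(x), self.norm1(kv), self.norm1(kv))
        x = x + attn_out
        x = x + self.mlp(self.norm2(x))
        # Detach the new states before caching so gradients do not flow
        # across frames and memory stays bounded for long videos.
        return x, x.detach()


if __name__ == "__main__":
    block = RecurrentFrameBlock(dim=256)
    mem = None
    for _ in range(10):  # process a stream of 10 frames, one at a time
        frame_tokens = torch.randn(2, 196, 256)  # 196 patch tokens per frame
        out, mem = block(frame_tokens, mem)
    print(out.shape)  # torch.Size([2, 196, 256])

Detaching the cached states is one plausible way to realize the length-adaptive property claimed above: each frame is processed with constant GPU memory while the reused states keep a recurrent link to earlier frames.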
Acknowledgments
This work was supported by the National Natural Science Foundation of China (Grant Nos. 62376003, 62306003, 62372004, 62302005).
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Zhang, H. et al. (2024). A Video Face Recognition Leveraging Temporal Information Based on Vision Transformer. In: Liu, Q., et al. (eds.) Pattern Recognition and Computer Vision. PRCV 2023. Lecture Notes in Computer Science, vol. 14429. Springer, Singapore. https://doi.org/10.1007/978-981-99-8469-5_3
DOI: https://doi.org/10.1007/978-981-99-8469-5_3
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-8468-8
Online ISBN: 978-981-99-8469-5
eBook Packages: Computer Science, Computer Science (R0)