
STAN: spatiotemporal attention network for video-based facial expression recognition

  • Original article

The Visual Computer

Abstract

Video-based facial expression recognition is a challenging task. The expression features extracted by a standard ResNet18 are not rich enough, and a classical LSTM applied to the whole expression video may fail to capture effective temporal features when emotion intensity is weak. This paper proposes a spatiotemporal attention network (STAN) that extracts more diverse spatial features and more effective temporal relationships. First, a spatial attention module enhances the expression features extracted by ResNet18 and removes redundancy, and multiple levels of information are combined to obtain richer expression features. Second, the video stream is divided into a series of clips with a sliding window, a simple but effective partitioning scheme, and temporal features are extracted from each clip by an LSTM. Third, because the features extracted from each window contribute unequally to recognition over the whole video, an attention-based score fusion module fuses the expression information from all windows. We perform comprehensive experiments on in-the-wild FER benchmarks (AFEW8.0 and HUST-MM). Quantitative and qualitative analyses demonstrate the effectiveness of the proposed method.
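To make the pipeline concrete, here is a minimal PyTorch sketch of the architecture the abstract describes. It is an illustration under stated assumptions, not the authors' implementation: the 1x1-convolution spatial attention, the window size and stride, the LSTM width, and all names (STANSketch, window_att, etc.) are hypothetical placeholders, and the paper's multi-level feature combination is omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet18


class STANSketch(nn.Module):
    """Illustrative pipeline: per-frame ResNet18 features with a simple
    spatial attention, sliding-window clips summarized by an LSTM, and
    attention-weighted fusion of per-window class scores."""

    def __init__(self, num_classes=7, window=8, stride=4, hidden=256):
        super().__init__()
        self.window, self.stride = window, stride
        backbone = resnet18(weights=None)
        # All layers up to (not including) global pooling: per-frame feature maps.
        self.cnn = nn.Sequential(*list(backbone.children())[:-2])
        # Placeholder spatial attention: a 1x1 conv producing a per-pixel weight.
        self.spatial_att = nn.Conv2d(512, 1, kernel_size=1)
        self.lstm = nn.LSTM(512, hidden, batch_first=True)
        self.classifier = nn.Linear(hidden, num_classes)
        # Scalar importance score per window, used for score fusion.
        self.window_att = nn.Linear(hidden, 1)

    def forward(self, frames):                        # frames: (B, T, 3, H, W)
        B, T = frames.shape[:2]
        fmap = self.cnn(frames.flatten(0, 1))         # (B*T, 512, h, w)
        att = torch.sigmoid(self.spatial_att(fmap))   # (B*T, 1, h, w)
        feat = (fmap * att).mean(dim=(2, 3))          # attended global pooling
        feat = feat.view(B, T, -1)                    # (B, T, 512)

        scores, weights = [], []
        # Slide a fixed-size window over the frame sequence (assumes T >= window).
        for start in range(0, T - self.window + 1, self.stride):
            clip = feat[:, start:start + self.window]
            _, (h, _) = self.lstm(clip)               # temporal summary of the clip
            h = h[-1]                                 # (B, hidden)
            scores.append(self.classifier(h))         # per-window class scores
            weights.append(self.window_att(h))        # per-window importance
        scores = torch.stack(scores, dim=1)                      # (B, W, C)
        weights = F.softmax(torch.stack(weights, dim=1), dim=1)  # (B, W, 1)
        return (weights * scores).sum(dim=1)          # fused video-level scores


# Example: a batch of two 16-frame face videos at 224x224 resolution.
logits = STANSketch()(torch.randn(2, 16, 3, 224, 224))
print(logits.shape)  # torch.Size([2, 7])
```

The softmax over per-window scalar scores is what realizes attention-based score fusion in this sketch: windows whose features carry stronger expression evidence receive larger weights and dominate the video-level prediction.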



Acknowledgements

This study was supported by the National Key R&D Program of China under Grant 2020YFC0833102.

Author information

Corresponding author

Correspondence to Yiping Xu.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest. Data cannot be made available.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Yi, Y., Xu, Y., Ye, Z. et al. STAN: spatiotemporal attention network for video-based facial expression recognition. Vis Comput 39, 6205–6220 (2023). https://doi.org/10.1007/s00371-022-02721-w

