
STAN: spatiotemporal attention network for video-based facial expression recognition

  • Original article

The Visual Computer

Abstract

Video-based facial expression recognition is a challenging task. The expression features extracted by a standard ResNet18 are not rich enough, and a classical LSTM applied to the whole expression video may fail to capture effective temporal features when emotion intensity is weak. This paper proposes a spatiotemporal attention network (STAN) that extracts more diverse spatial features and more effective temporal relationships. First, a spatial attention module enhances the expression features extracted by ResNet18 and removes redundancy, and multiple levels of information are combined to obtain richer expression features. Second, the video stream is divided into a series of clips with a sliding window, a simple but effective partitioning scheme, and temporal features are extracted from each clip by an LSTM. Third, because the features extracted from each window contribute unequally to recognition over the whole video, an attention-based score fusion module fuses the expression information from all windows. We perform comprehensive experiments on in-the-wild FER benchmarks (AFEW8.0 and HUST-MM). Quantitative and qualitative analyses demonstrate the effectiveness of the proposed method.
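To make the pipeline concrete, here is a minimal PyTorch sketch of the architecture the abstract describes. It is an illustration under stated assumptions, not the authors' implementation: the 1x1-convolution spatial attention, the window size and stride, the LSTM width, and all names (STANSketch, window_att, etc.) are hypothetical placeholders, and the paper's multi-level feature combination is omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet18


class STANSketch(nn.Module):
    """Illustrative pipeline: per-frame ResNet18 features with a simple
    spatial attention, sliding-window clips summarized by an LSTM, and
    attention-weighted fusion of per-window class scores."""

    def __init__(self, num_classes=7, window=8, stride=4, hidden=256):
        super().__init__()
        self.window, self.stride = window, stride
        backbone = resnet18(weights=None)
        # All layers up to (not including) global pooling: per-frame feature maps.
        self.cnn = nn.Sequential(*list(backbone.children())[:-2])
        # Placeholder spatial attention: a 1x1 conv producing a per-pixel weight.
        self.spatial_att = nn.Conv2d(512, 1, kernel_size=1)
        self.lstm = nn.LSTM(512, hidden, batch_first=True)
        self.classifier = nn.Linear(hidden, num_classes)
        # Scalar importance score per window, used for score fusion.
        self.window_att = nn.Linear(hidden, 1)

    def forward(self, frames):                        # frames: (B, T, 3, H, W)
        B, T = frames.shape[:2]
        fmap = self.cnn(frames.flatten(0, 1))         # (B*T, 512, h, w)
        att = torch.sigmoid(self.spatial_att(fmap))   # (B*T, 1, h, w)
        feat = (fmap * att).mean(dim=(2, 3))          # attended global pooling
        feat = feat.view(B, T, -1)                    # (B, T, 512)

        scores, weights = [], []
        # Slide a fixed-size window over the frame sequence (assumes T >= window).
        for start in range(0, T - self.window + 1, self.stride):
            clip = feat[:, start:start + self.window]
            _, (h, _) = self.lstm(clip)               # temporal summary of the clip
            h = h[-1]                                 # (B, hidden)
            scores.append(self.classifier(h))         # per-window class scores
            weights.append(self.window_att(h))        # per-window importance
        scores = torch.stack(scores, dim=1)                      # (B, W, C)
        weights = F.softmax(torch.stack(weights, dim=1), dim=1)  # (B, W, 1)
        return (weights * scores).sum(dim=1)          # fused video-level scores


# Example: a batch of two 16-frame face videos at 224x224 resolution.
logits = STANSketch()(torch.randn(2, 16, 3, 224, 224))
print(logits.shape)  # torch.Size([2, 7])
```

The softmax over per-window scalar scores is what realizes attention-based score fusion in this sketch: windows whose features carry stronger expression evidence receive larger weights and dominate the video-level prediction.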



Acknowledgements

This study was supported by the National Key R&D Program of China under Grant 2020YFC0833102.

Author information

Corresponding author

Correspondence to Yiping Xu.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest. Data cannot be made available.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Yi, Y., Xu, Y., Ye, Z. et al. STAN: spatiotemporal attention network for video-based facial expression recognition. Vis Comput 39, 6205–6220 (2023). https://doi.org/10.1007/s00371-022-02721-w

