Abstract
Video classification is an important research field, with applications ranging from human action recognition for video surveillance to emotion recognition for human-computer interaction. This paper proposes a new method called One-Shot Only (OSO) for real-time video classification, with a case study in facial emotion recognition. Instead of using 3D convolutional neural networks (CNNs) or multiple 2D CNNs with decision fusion, as in previous studies, the OSO method treats video classification as a single image classification problem: it spatially rearranges video frames, chosen by frame selection or clustering strategies, into a simple representative storyboard that fuses spatio-temporal video information. Because it uses a single 2D CNN, the method can be optimised end-to-end directly for classification accuracy. Experimental results on the AFEW 7.0 dataset show that the OSO method outperformed multiple 2D CNNs with decision fusion by a large margin in classification accuracy (by up to 13%). It is also very fast, up to ten times faster than the commonly used 2D CNN architectures for video classification.
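The storyboard idea can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the selection rule or the grid layout, so the uniform frame sampling, the 3x3 grid, and the function names `select_frames` and `build_storyboard` are all assumptions made for illustration. The sketch only shows how a clip of shape (T, H, W, C) might be turned into one image that a single 2D CNN could then classify.

```python
import numpy as np


def select_frames(num_frames: int, k: int) -> list:
    """Pick k frame indices spread evenly over a clip.

    Uniform sampling is an assumption; the paper also mentions clustering.
    """
    if num_frames <= 0:
        raise ValueError("clip has no frames")
    # linspace includes the first and last frame; short clips repeat indices.
    return np.linspace(0, num_frames - 1, num=k).round().astype(int).tolist()


def build_storyboard(frames: np.ndarray, rows: int, cols: int) -> np.ndarray:
    """Tile rows*cols selected frames into one image (hypothetical helper).

    frames:  array of shape (T, H, W, C)
    returns: array of shape (rows*H, cols*W, C), frames in row-major
             temporal order (left to right, then top to bottom).
    """
    idx = select_frames(len(frames), rows * cols)
    picked = frames[idx]                        # (rows*cols, H, W, C)
    _, h, w, c = picked.shape
    grid = picked.reshape(rows, cols, h, w, c)  # one cell per selected frame
    grid = grid.transpose(0, 2, 1, 3, 4)        # (rows, H, cols, W, C)
    return grid.reshape(rows * h, cols * w, c)


# Toy clip: 30 frames of 4x4 RGB, each frame filled with its own index.
clip = np.stack([np.full((4, 4, 3), t, dtype=np.uint8) for t in range(30)])
board = build_storyboard(clip, rows=3, cols=3)
print(board.shape)  # (12, 12, 3)
```

In this toy run the top-left cell holds the first frame and the bottom-right cell holds the last one, so temporal order is encoded as spatial position. A real pipeline would presumably resize the storyboard to the input size of the chosen 2D CNN before training.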
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Basbrain, A., Gan, J.Q. (2020). One-Shot Only Real-Time Video Classification: A Case Study in Facial Emotion Recognition. In: Analide, C., Novais, P., Camacho, D., Yin, H. (eds) Intelligent Data Engineering and Automated Learning – IDEAL 2020. IDEAL 2020. Lecture Notes in Computer Science, vol 12489. Springer, Cham. https://doi.org/10.1007/978-3-030-62362-3_18
DOI: https://doi.org/10.1007/978-3-030-62362-3_18
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-62361-6
Online ISBN: 978-3-030-62362-3
eBook Packages: Computer Science, Computer Science (R0)