Abstract
In this work, we propose a hybrid system that combines deep networks with hidden Markov model graph decoding to classify kitchen activities in the Actions for Cooking Eggs data set. We use and compare two deep learning architectures: a deep convolutional neural network (CNN) alone, and a long short-term memory (LSTM) network built on top of a CNN. We address the video classification problem both at the level of actions performed in individual frames and at the level of the full-length video. Our proposed system detects a sequence of cooking actions and outputs a menu class for the entire video. Our approach achieves the highest reported accuracy on the data set for identifying cooking actions, with an overall accuracy of 81% compared to the state of the art of 76%, and it assigns a menu label to a sequence of cooking actions with an accuracy of 100%, compared to the 10–30% range reported in previous work. We also explore the effects of processing only a subset of the available frames and of imposing a state occupancy constraint during decoding. Our best results are achieved when using a common-sense dictionary grammar expansion, processing one frame out of every 35, and requiring each decoded state to persist for at least five consecutive frames.
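As a concrete illustration of the state occupancy constraint mentioned above, the following is a minimal sketch, not the paper's implementation: Viterbi decoding over per-frame action scores in which a transition out of a state is permitted only after that state has been held for `min_dur` consecutive frames. The function name, the duration-counter encoding of the constraint, and the assumption that the network supplies per-frame log-posteriors (`emis`) and the cooking grammar supplies log transition scores (`trans`) are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def viterbi_min_duration(emis, trans, min_dur=5):
    """Viterbi decoding with a minimum state-occupancy constraint:
    switching to a new action is allowed only after the current action
    has been held for at least `min_dur` consecutive frames.

    emis : (T, S) per-frame action log-scores (e.g. CNN log-posteriors)
    trans: (S, S) log transition scores between actions
    Assumes min_dur >= 2. Returns a length-T list of action indices.
    """
    T, S = emis.shape
    D, NEG = min_dur, -1e18
    # delta[s, d]: best score of a path ending in state s that has been
    # held for d + 1 frames (the counter saturates at d = D - 1)
    delta = np.full((S, D), NEG)
    delta[:, 0] = emis[0]
    back = np.zeros((T, S, D, 2), dtype=np.int64)  # backpointers (prev s, prev d)

    for t in range(1, T):
        new = np.full((S, D), NEG)
        for s in range(S):
            # stay in s: the duration counter advances and saturates at D - 1
            for d in range(1, D):
                best, src = delta[s, d - 1], (s, d - 1)
                if d == D - 1 and delta[s, D - 1] > best:  # saturated self-loop
                    best, src = delta[s, D - 1], (s, D - 1)
                new[s, d] = best + emis[t, s]
                back[t, s, d] = src
            # enter s from another state: legal only once that state is saturated
            scores = delta[:, D - 1] + trans[:, s]
            scores[s] = NEG
            s_prev = int(np.argmax(scores))
            new[s, 0] = scores[s_prev] + emis[t, s]
            back[t, s, 0] = (s_prev, D - 1)
        delta = new

    # backtrack from the best final cell (the last segment is allowed to
    # end short of min_dur at the video boundary)
    s, d = np.unravel_index(int(np.argmax(delta)), delta.shape)
    path = [int(s)]
    for t in range(T - 1, 0, -1):
        s, d = back[t, s, d]
        path.append(int(s))
    return path[::-1]
```

In this encoding, each action state is split into `min_dur` duration-tagged copies, which keeps the search a standard Viterbi pass at the cost of a factor-of-`min_dur` larger state space; frame subsampling (e.g. one frame in every 35) would simply shorten `emis` before decoding.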
Data availability and materials
The data sets analyzed during the current study are available through the ICPR 2012 Contest on Kitchen Scene Context-based Gesture Recognition, online at http://www.murase.m.is.nagoya-u.ac.jp/KSCGR/
Abbreviations
- HMM: Hidden Markov model
- CNN: Convolutional neural network
- LSTM: Long short-term memory
- SVM: Support vector machines
- NN: Neural networks
- SBR: Symbolic behavior recognition
- HOGV: Histogram of oriented gradient variation
Funding
This manuscript was prepared during MR’s work toward her self-funded PhD degree.
Author information
Contributions
MR processed and analyzed the data and results. MR was the major contributor in writing the manuscript. AE provided advice and guidance through the study. Both authors read and approved the final manuscript.
Ethics declarations
Conflict of interest
The authors declare that they have no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
About this article
Cite this article
Ramadan, M., El-Jaroudi, A. Action detection and classification in kitchen activities videos using graph decoding. Vis Comput 39, 799–812 (2023). https://doi.org/10.1007/s00371-021-02346-5