In this work, we propose a hybrid system that combines a deep network with graph decoding based on a hidden Markov model (HMM) to classify kitchen activities in the Actions for Cooking Eggs data set. We compare two deep learning architectures: a deep convolutional neural network (CNN) alone and a long short-term memory (LSTM) network built on top of a CNN. We address the video classification problem both at the level of the actions performed in individual frames and at the level of the full-length video. The proposed system detects a sequence of cooking actions and outputs a menu class for the entire video. Our approach achieves the highest reported accuracy on the data set for identifying cooking actions, with an overall accuracy of 81% compared with the previous state of the art of 76%, and it assigns a menu label to a sequence of cooking actions with an accuracy of 100%, compared with the 10–30% range reported in previous work. We also explore the effects of processing only a subset of the available frames and of imposing a state occupancy constraint during decoding. Our best results are achieved with a common-sense dictionary grammar expansion, processing one out of every 35 frames, and requiring a state to be occupied for at least five consecutive frames before a transition is allowed.
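To make the two decoding ideas mentioned above concrete, the following is a minimal sketch of frame subsampling and of Viterbi decoding under a minimum state occupancy constraint. It is not the implementation used in this work: the function names `subsample` and `viterbi_min_duration`, the run-length bookkeeping, and the choice to apply the constraint to the final segment as well are all assumptions made for illustration, and the per-frame log scores would in practice come from the CNN or LSTM outputs.

```python
import math


def subsample(frames, step=35):
    """Keep one frame out of every `step` frames (hypothetical helper)."""
    return frames[::step]


def viterbi_min_duration(log_emit, log_trans, log_prior, min_dur=5):
    """Best state path in which every run of a state lasts >= min_dur frames.

    log_emit[t][s]  : log score of state s at frame t (assumed classifier output)
    log_trans[r][s] : log transition score from state r to state s
    log_prior[s]    : log initial score of state s
    """
    T, S, D = len(log_emit), len(log_prior), min_dur
    if D < 1 or T < D:
        raise ValueError("need min_dur >= 1 and at least min_dur frames")
    NEG = -math.inf
    # score[t][s][k]: best log score of a path ending at frame t in state s
    # whose current run of s has lasted k + 1 frames (k is capped at D - 1).
    score = [[[NEG] * D for _ in range(S)] for _ in range(T)]
    back = [[[None] * D for _ in range(S)] for _ in range(T)]
    for s in range(S):
        score[0][s][0] = log_prior[s] + log_emit[0][s]
    for t in range(1, T):
        for s in range(S):
            for k in range(D):
                cands = []
                if k == 0:
                    # Enter s only from a different state whose run is long enough.
                    for r in range(S):
                        if r != s:
                            cands.append((score[t - 1][r][D - 1] + log_trans[r][s], (r, D - 1)))
                else:
                    # Extend the current run of s by one frame.
                    cands.append((score[t - 1][s][k - 1] + log_trans[s][s], (s, k - 1)))
                if k == D - 1:
                    # Run already satisfies the constraint: plain self-loop.
                    cands.append((score[t - 1][s][D - 1] + log_trans[s][s], (s, D - 1)))
                if not cands:
                    continue
                best, prev = max(cands, key=lambda c: c[0])
                score[t][s][k] = best + log_emit[t][s]
                back[t][s][k] = prev
    # Assumption: the last segment must also satisfy the occupancy constraint.
    s = max(range(S), key=lambda j: score[T - 1][j][D - 1])
    if score[T - 1][s][D - 1] == NEG:
        raise ValueError("no path satisfies the occupancy constraint")
    k, path = D - 1, [s]
    for t in range(T - 1, 0, -1):
        s, k = back[t][s][k]
        path.append(s)
    return path[::-1]
```

With `min_dur=1` the routine reduces to ordinary Viterbi decoding, so isolated single-frame label flips survive; a larger `min_dur` forces the decoder to absorb such flips into the surrounding action segment.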

The data sets analyzed during the current study are available through the ICPR 2012 Contest (Kitchen Scene Context-based Gesture Recognition), online at http://www.murase.m.is.nagoya-u.ac.jp/KSCGR/
