Abstract
An important aspect of designing interactive, action-based interfaces is recognizing actions reliably and with minimal latency. High latency causes the system’s feedback to lag behind user actions and thus significantly degrades the interactivity of the user experience. This paper presents algorithms for reducing latency when recognizing actions. We use a latency-aware learning formulation to train a logistic regression-based classifier that automatically determines distinctive canonical poses from data and uses these to robustly recognize actions in the presence of ambiguous poses. We introduce a novel (publicly released) dataset for the purpose of our experiments. Comparisons of our method against both a Bag of Words and a Conditional Random Field (CRF) classifier show improved recognition performance for both pre-segmented and online classification tasks. Additionally, we employ GentleBoost to reduce our feature set and further improve our results. We then present experiments that explore the accuracy/latency trade-off over a varying number of actions. Finally, we evaluate our algorithm on two existing datasets.
Notes
See Sect. 5 for more details on the data gathering process.
The dataset has been made publicly available at http://www.cs.ucf.edu/~smasood/datasets/UCFKinect.zip.
The optimal value of the threshold T was found for each value of γ using the training set.
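The paper's latency-aware training formulation is not reproduced here, but the role of the threshold T in online classification can be illustrated with a minimal sketch. The sketch below assumes a per-frame softmax (multinomial logistic regression) classifier with hypothetical weights `W` and bias `b`, and fires a decision as soon as any class posterior reaches T; the function name `online_classify` and the frame representation are illustrative, not the authors' implementation.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over class logits.
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def online_classify(frames, W, b, T):
    """Scan frames in temporal order and fire as soon as the posterior
    for any action class reaches the threshold T.

    Returns (predicted_label, latency_in_frames), or
    (None, len(frames)) if the threshold is never reached."""
    for t, x in enumerate(frames):
        p = softmax(W @ x + b)      # per-frame class posteriors
        k = int(np.argmax(p))
        if p[k] >= T:
            return k, t + 1         # decision made after t+1 frames
    return None, len(frames)

# Toy example: an ambiguous first frame followed by a distinctive one.
W = np.array([[5.0, 0.0], [0.0, 5.0]])
b = np.zeros(2)
frames = [np.array([0.1, 0.1]),     # ambiguous pose: posteriors ~[0.5, 0.5]
          np.array([1.0, 0.0])]     # distinctive pose for class 0

label, latency = online_classify(frames, W, b, T=0.9)
```

A higher T forces the classifier to wait for a distinctive pose (here, deciding only at frame 2), while a lower T yields an earlier but riskier decision on the ambiguous frame; this is the accuracy/latency trade-off that γ and T jointly control in the paper.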
Acknowledgements
Marshall F. Tappen, Syed Z. Masood and Chris Ellis were supported by NSF grants IIS-0905387 and IIS-0916868. Joseph J. LaViola Jr. was supported by NSF CAREER award IIS-0845921 and NSF awards IIS-0856045 and CCF-1012056.
Additional information
S.Z. Masood and C. Ellis contributed equally towards this paper.
Cite this article
Ellis, C., Masood, S.Z., Tappen, M.F. et al. Exploring the Trade-off Between Accuracy and Observational Latency in Action Recognition. Int J Comput Vis 101, 420–436 (2013). https://doi.org/10.1007/s11263-012-0550-7