Abstract
Activity recognition has shown impressive progress in recent years. However, the challenges of detecting fine-grained activities and understanding how they are combined into composite activities have been largely overlooked. In this work, we address both tasks and present a dataset that provides detailed annotations for them. The first challenge is to detect fine-grained activities, which are defined by low inter-class variability and are typically characterized by fine-grained body motions. We explore how human pose and hands can help with this challenge by comparing two pose-based and two hand-centric features with state-of-the-art holistic features. To address the second challenge, recognizing composite activities, we leverage the fact that these activities are compositional and that their essential components can be obtained from textual descriptions or scripts. We show the benefits of our hand-centric approach for fine-grained activity classification and detection. For composite activity recognition, we find that decomposition into attributes allows sharing information across composites and is essential for this hard task. Using script data, we can recognize novel composites without any training data for them.
Acknowledgments
This work was supported by a fellowship within the FITweltweit-Program of the German Academic Exchange Service (DAAD), by the Cluster of Excellence “Multimodal Computing and Interaction” of the German Excellence Initiative and the Max Planck Center for Visual Computing and Communication.
Additional information
Communicated by Ivan Laptev, Josef Sivic, and Deva Ramanan.
About this article
Cite this article
Rohrbach, M., Rohrbach, A., Regneri, M. et al. Recognizing Fine-Grained and Composite Activities Using Hand-Centric Features and Script Data. Int J Comput Vis 119, 346–373 (2016). https://doi.org/10.1007/s11263-015-0851-8
DOI: https://doi.org/10.1007/s11263-015-0851-8