Abstract
With the rapid proliferation of modern mobile devices, users can capture a wide variety of videos anytime, anywhere. This explosive growth of mobile video makes categorization and management increasingly difficult. In this paper, we propose a novel approach to annotating group activities in mobile videos: each person is tagged with an activity label, helping users efficiently manage their uploaded videos. To extract rich context information, we jointly model three co-existing cues: the activity duration, the individual action feature, and the context shared through person-to-person interactions. These appearance and context cues are combined in a structured learning framework, in which inference is performed by a greedy forward search. The model infers the group activity labels of all persons together with their activity durations, even when multiple group activities co-exist. Experimental results on a mobile video dataset show that the proposed approach achieves strong performance on group activity classification and annotation.
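The inference step described above can be illustrated with a minimal sketch. This is not the authors' implementation; the function names, the toy scoring terms, and the data layout are all illustrative assumptions. It shows the general shape of a greedy forward search that jointly assigns an activity label to each person by maximizing a structured score combining unary (individual action) terms and pairwise (interaction context) terms, labeling one person per step.

```python
def greedy_forward_search(persons, labels, unary, pairwise):
    """Greedily assign an activity label to each person.

    At every step, pick the (person, label) pair that most increases
    the total score, where the score of a candidate label is its unary
    term plus the pairwise context shared with already-labeled persons.

    persons  : list of person identifiers
    labels   : list of candidate activity labels
    unary    : dict person -> dict label -> float (individual action score)
    pairwise : function (label_a, label_b) -> float (interaction score)
    """
    assignment = {}
    remaining = list(persons)
    while remaining:
        best = None  # (gain, person, label)
        for p in remaining:
            for l in labels:
                gain = unary[p][l] + sum(
                    pairwise(l, assignment[q]) for q in assignment
                )
                if best is None or gain > best[0]:
                    best = (gain, p, l)
        _, p, l = best
        assignment[p] = l
        remaining.remove(p)
    return assignment


# Toy usage: two persons, two candidate activities; the pairwise term
# rewards persons sharing the same group activity label.
unary = {
    "person_1": {"walking": 2.0, "talking": 0.0},
    "person_2": {"walking": 0.0, "talking": 1.0},
}
same_label_bonus = lambda a, b: 1.0 if a == b else 0.0
result = greedy_forward_search(
    ["person_1", "person_2"], ["walking", "talking"], unary, same_label_bonus
)
```

Note that a greedy search of this kind is only an approximation to exact inference over the joint label space; its appeal is that each step costs O(persons x labels) rather than enumerating all label combinations.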
Acknowledgments
This work was supported by the 863 Program (2014AA015104) and by National Natural Science Foundation of China Grants 61273034 and 61332016.
Zhao, C., Wang, J., Li, J. et al. Automatic group activity annotation for mobile videos. Multimedia Systems 23, 667–677 (2017). https://doi.org/10.1007/s00530-016-0514-9