research-article

Detecting video events based on action recognition in complex scenes using spatio-temporal descriptor

Authors:

Yihong Gong

MM '09: Proceedings of the 17th ACM international conference on Multimedia

Pages 165 - 174

https://doi.org/10.1145/1631272.1631297

Published: 19 October 2009

Abstract

Event detection plays an essential role in video content analysis and remains a challenging open problem. In particular, few studies address the detection of human-related video events in complex scenes that contain both crowds of people and dynamic motion. In this paper, we investigate detecting video events that involve elementary human actions, e.g., making a cellphone call, putting an object down, and pointing to something, in complex scenes using a novel approach based on a spatio-temporal descriptor. The proposed descriptor temporally integrates the statistics of a set of response maps of low-level features, e.g., image gradients and optical flow, within a space-time cube, and thereby captures the appearance and motion patterns of actions. Based on these descriptors, the bag-of-words method is used to describe a human figure as a concise feature vector. These features are then used to train SVM classifiers at multiple spatial pyramid levels to distinguish different actions. Finally, Gaussian-kernel-based temporal filtering is applied to segment sequences of events from a video stream, taking the temporal consistency of actions into account. The proposed approach tolerates the spatial layout variations and local deformations of human actions that arise from diverse view angles and rough human figure alignment in complex scenes. Extensive experiments on the 50-hour video dataset of the TRECVid 2008 event detection task demonstrate that our approach outperforms well-known SIFT-descriptor-based methods and effectively detects video events in challenging real-world conditions.
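
To illustrate the final step of the pipeline described in the abstract, the following minimal sketch smooths per-frame classifier scores with a Gaussian kernel and thresholds the result to obtain event segments. It is not the authors' implementation: the function names, the kernel width `sigma`, the threshold, the edge padding, and the toy scores are all assumptions made for this example, since the abstract does not specify them.

```python
import numpy as np


def gaussian_kernel(sigma, radius=None):
    """Return a 1-D Gaussian kernel normalized to sum to 1."""
    if radius is None:
        radius = int(3 * sigma + 0.5)  # cover roughly +/- 3 sigma
    t = np.arange(-radius, radius + 1, dtype=float)
    k = np.exp(-0.5 * (t / sigma) ** 2)
    return k / k.sum()


def smooth_scores(scores, sigma=2.0):
    """Smooth per-frame scores over time; edges are padded by replication."""
    scores = np.asarray(scores, dtype=float)
    k = gaussian_kernel(sigma)
    r = len(k) // 2
    padded = np.pad(scores, r, mode="edge")
    # 'valid' on the padded signal yields one output per input frame.
    return np.convolve(padded, k, mode="valid")


def segment_events(smoothed, threshold=0.5, min_length=1):
    """Return (start, end) frame pairs, end exclusive, where score >= threshold."""
    active = np.asarray(smoothed) >= threshold
    padded = np.concatenate(([False], active, [False]))
    diff = np.diff(padded.astype(int))
    starts = np.flatnonzero(diff == 1)   # False -> True transitions
    ends = np.flatnonzero(diff == -1)    # True -> False transitions
    return [(int(s), int(e)) for s, e in zip(starts, ends) if e - s >= min_length]


# Toy per-frame action scores (hypothetical): one true event in frames 10..19,
# an isolated false alarm at frame 3, and a missed detection at frame 15.
scores = np.zeros(30)
scores[10:20] = 1.0
scores[3] = 1.0
scores[15] = 0.0

smoothed = smooth_scores(scores, sigma=2.0)
segments = segment_events(smoothed, threshold=0.5)
print(segments)
```

On this toy input, the smoothing suppresses the isolated false alarm and bridges the single-frame dropout, so one contiguous segment is reported. In practice `sigma` and the threshold would have to be tuned on validation data.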



Index Terms

Detecting video events based on action recognition in complex scenes using spatio-temporal descriptor
1. Computing methodologies
  1. Artificial intelligence
    1. Computer vision
  2. Computer graphics
    1. Animation
      1. Motion capture
      2. Motion processing


Published In


MM '09: Proceedings of the 17th ACM international conference on Multimedia

October 2009

1202 pages

ISBN:9781605586083

DOI:10.1145/1631272

General Chairs: Wen Gao (Peking University, China), Yong Rui (Microsoft, China), Alan Hanjalic (Delft University of Technology, The Netherlands)

Program Chairs: Changsheng Xu (Institute of Automation, Chinese Academy of Sciences, China), Eckehard Steinbach (Technical University of Munich, Germany), Abdulmotaleb El Saddik (University of Ottawa, Canada), Michelle Zhou (IBM T. J. Watson Research Center, USA)

Copyright © 2009 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGMM: ACM Special Interest Group on Multimedia

Publisher

Association for Computing Machinery

New York, NY, United States


Conference

MM09

Sponsor:

SIGMM

MM09: ACM Multimedia Conference

October 19 - 24, 2009

Beijing, China

Acceptance Rates

Overall Acceptance Rate 995 of 4,171 submissions, 24%


Bibliometrics & Citations

Article Metrics

Total Citations: 38
Total Downloads: 972
Downloads (Last 12 months): 13
Downloads (Last 6 weeks): 4

Reflects downloads up to 04 Oct 2024

Cited By

LI Y, CHU Z, XIN Y (2020) Posture Recognition Technology Based on Kinect. IEICE Transactions on Information and Systems E103.D:3 (621-630). Online publication date: 1-Mar-2020
https://doi.org/10.1587/transinf.2019EDP7221
Wang D, Ou F, Zhou Y (2018) Action Recognition Based on Multi-feature Depth Motion Maps. IECON 2018 - 44th Annual Conference of the IEEE Industrial Electronics Society (2683-2688). Online publication date: Oct-2018
https://doi.org/10.1109/IECON.2018.8591591
Wang Z, Liu S, Zhang J, Chen S, Guan Q (2017) A Spatio-Temporal CRF for Human Interaction Understanding. IEEE Transactions on Circuits and Systems for Video Technology 27:8 (1647-1660). Online publication date: Aug-2017
https://doi.org/10.1109/TCSVT.2016.2539699
Niu L, Li W, Xu D (2016) Exploiting Privileged Information from Web Data for Action and Event Recognition. International Journal of Computer Vision 118:2 (130-150). Online publication date: 1-Jun-2016
https://dl.acm.org/doi/10.1007/s11263-015-0862-5
Gao C, Meng D, Tong W, Yang Y, Cai Y, Shen H, Liu G, Xu S, Hauptmann A, Kankanhalli M, Rueger S, Manmatha R, Jose J, van Rijsbergen K (2014) Interactive Surveillance Event Detection through Mid-level Discriminative Representation. Proceedings of International Conference on Multimedia Retrieval (305-312). Online publication date: 1-Apr-2014
https://dl.acm.org/doi/10.1145/2578726.2578765
Bhattacharya S, Yu F, Chang S, Kankanhalli M, Rueger S, Manmatha R, Jose J, van Rijsbergen K (2014) Minimally Needed Evidence for Complex Event Recognition in Unconstrained Videos. Proceedings of International Conference on Multimedia Retrieval (105-112). Online publication date: 1-Apr-2014
https://dl.acm.org/doi/10.1145/2578726.2578740
Wang F, Sun Z, Jiang Y, Ngo C (2014) Video Event Detection Using Motion Relativity and Feature Selection. IEEE Transactions on Multimedia 16:5 (1303-1315). Online publication date: Aug-2014
https://doi.org/10.1109/TMM.2014.2315780
Tavassolipour M, Karimian M, Kasaei S (2014) Event Detection and Summarization in Soccer Videos Using Bayesian Network and Copula. IEEE Transactions on Circuits and Systems for Video Technology 24:2 (291-304). Online publication date: 1-Feb-2014
https://dl.acm.org/doi/10.1109/TCSVT.2013.2243640
Subramanyam M, Nallaperumal K, Subban R, Perumalsamy P, Durairaj S, Gayathri Devi S, Selva Kumar S (2014) Real-Time Hard and Soft Shadow Compensation with Adaptive Patch Gradient Pairs. Advances in Signal Processing and Intelligent Recognition Systems (245-252). Online publication date: 2014
https://doi.org/10.1007/978-3-319-04960-1_22
Tian Y, Cao L, Liu Z, Zhang Z (2013) Action Detection by Fusing Hierarchically Filtered Motion with Spatiotemporal Interest Point Features. Human Behavior Recognition Technologies (249-267). Online publication date: 2013
https://doi.org/10.4018/978-1-4666-3682-8.ch012
