DOI: 10.1145/1631272.1631297
research-article

Detecting video events based on action recognition in complex scenes using spatio-temporal descriptor

Published: 19 October 2009

Abstract

Event detection plays an essential role in video content analysis and remains a challenging open problem. In particular, research on detecting human-related video events in complex scenes, which contain both crowds of people and dynamic motion, is still limited. In this paper, we investigate detecting video events that involve elementary human actions, e.g. making a cellphone call, putting an object down, and pointing to something, in complex scenes using a novel approach based on a spatio-temporal descriptor. The proposed descriptor temporally integrates the statistics of a set of response maps of low-level features, e.g. image gradients and optical flow, in a space-time cube, and thereby captures the characteristics of actions in terms of their appearance and motion patterns. Based on these descriptors, the bag-of-words method is used to describe a human figure as a concise feature vector. These features are then employed to train SVM classifiers at multiple spatial pyramid levels to distinguish different actions. Finally, Gaussian kernel based temporal filtering is applied to segment event sequences from a video stream, taking into account the temporal consistency of actions. The proposed approach tolerates the spatial layout variations and local deformations of human actions that arise from diverse view angles and rough human figure alignment in complex scenes. Extensive experiments on the 50-hour video dataset of the TRECVid 2008 event detection task demonstrate that our approach outperforms the well-known SIFT descriptor based methods and effectively detects video events in challenging real-world conditions.
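
The final stage of the pipeline, Gaussian kernel based temporal filtering followed by segmentation, can be illustrated with a minimal sketch. The abstract does not give the kernel width, the decision threshold, or the minimum event length, so `sigma`, `threshold`, and `min_length` below are assumed values, and the function names are hypothetical. The per-frame scores stand in for classifier outputs; this is not the authors' implementation.

```python
import math


def gaussian_kernel(sigma, radius=None):
    """Return a normalized 1-D Gaussian kernel of length 2*radius + 1."""
    if radius is None:
        # Assumption: truncate the kernel at three standard deviations.
        radius = max(1, int(math.ceil(3 * sigma)))
    weights = [math.exp(-(k * k) / (2.0 * sigma * sigma))
               for k in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def smooth_scores(scores, sigma=2.0):
    """Smooth per-frame scores with a Gaussian kernel.

    Near the sequence borders only the in-range taps are used and the
    result is renormalized, so a constant input stays constant.
    """
    kernel = gaussian_kernel(sigma)
    radius = len(kernel) // 2
    smoothed = []
    for t in range(len(scores)):
        acc = 0.0
        norm = 0.0
        for k, w in enumerate(kernel):
            idx = t + k - radius
            if 0 <= idx < len(scores):
                acc += w * scores[idx]
                norm += w
        smoothed.append(acc / norm)
    return smoothed


def segment_events(scores, threshold=0.5, min_length=3):
    """Return (start, end) frame ranges, end exclusive, with score >= threshold.

    Runs shorter than min_length frames are discarded.
    """
    segments = []
    start = None
    for t, s in enumerate(scores):
        if s >= threshold and start is None:
            start = t
        elif s < threshold and start is not None:
            if t - start >= min_length:
                segments.append((start, t))
            start = None
    if start is not None and len(scores) - start >= min_length:
        segments.append((start, len(scores)))
    return segments
```

Calling `segment_events(smooth_scores(scores))` on a score sequence suppresses isolated single-frame responses while keeping sustained ones, which is one simple way to exploit the temporal consistency of actions.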

Information & Contributors

Information

Published In

MM '09: Proceedings of the 17th ACM international conference on Multimedia
October 2009
1202 pages
ISBN: 9781605586083
DOI: 10.1145/1631272
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. action recognition
  2. event detection
  3. motion representation
  4. semantic analysis

Qualifiers

  • Research-article

Conference

MM09: ACM Multimedia Conference
October 19 - 24, 2009
Beijing, China

Acceptance Rates

Overall Acceptance Rate 995 of 4,171 submissions, 24%

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (last 12 months): 13
  • Downloads (last 6 weeks): 4
Reflects downloads up to 04 Oct 2024

Citations

Cited By

  • (2020) Posture Recognition Technology Based on Kinect. IEICE Transactions on Information and Systems, E103.D:3 (621-630). DOI: 10.1587/transinf.2019EDP7221. Online publication date: 1-Mar-2020
  • (2018) Action Recognition Based on Multi-feature Depth Motion Maps. IECON 2018 - 44th Annual Conference of the IEEE Industrial Electronics Society (2683-2688). DOI: 10.1109/IECON.2018.8591591. Online publication date: Oct-2018
  • (2017) A Spatio-Temporal CRF for Human Interaction Understanding. IEEE Transactions on Circuits and Systems for Video Technology, 27:8 (1647-1660). DOI: 10.1109/TCSVT.2016.2539699. Online publication date: Aug-2017
  • (2016) Exploiting Privileged Information from Web Data for Action and Event Recognition. International Journal of Computer Vision, 118:2 (130-150). DOI: 10.1007/s11263-015-0862-5. Online publication date: 1-Jun-2016
  • (2014) Interactive Surveillance Event Detection through Mid-level Discriminative Representation. Proceedings of International Conference on Multimedia Retrieval (305-312). DOI: 10.1145/2578726.2578765. Online publication date: 1-Apr-2014
  • (2014) Minimally Needed Evidence for Complex Event Recognition in Unconstrained Videos. Proceedings of International Conference on Multimedia Retrieval (105-112). DOI: 10.1145/2578726.2578740. Online publication date: 1-Apr-2014
  • (2014) Video Event Detection Using Motion Relativity and Feature Selection. IEEE Transactions on Multimedia, 16:5 (1303-1315). DOI: 10.1109/TMM.2014.2315780. Online publication date: Aug-2014
  • (2014) Event Detection and Summarization in Soccer Videos Using Bayesian Network and Copula. IEEE Transactions on Circuits and Systems for Video Technology, 24:2 (291-304). DOI: 10.1109/TCSVT.2013.2243640. Online publication date: 1-Feb-2014
  • (2014) Real-Time Hard and Soft Shadow Compensation with Adaptive Patch Gradient Pairs. Advances in Signal Processing and Intelligent Recognition Systems (245-252). DOI: 10.1007/978-3-319-04960-1_22. Online publication date: 2014
  • (2013) Action Detection by Fusing Hierarchically Filtered Motion with Spatiotemporal Interest Point Features. Human Behavior Recognition Technologies (249-267). DOI: 10.4018/978-1-4666-3682-8.ch012. Online publication date: 2013
