DOI: 10.1145/2733373.2806218
Searching Persuasively: Joint Event Detection and Evidence Recounting with Limited Supervision

Published: 13 October 2015

Abstract

Multimedia event detection (MED) and multimedia event recounting (MER) are fundamental tasks in managing large amounts of unconstrained web video, and both have attracted considerable attention in recent years. Most existing systems perform MER as a post-processing step on top of the MED results. To leverage the mutual benefits of the two tasks, we propose a joint framework that simultaneously detects high-level events and localizes the concepts indicative of those events. Our premise is that a good recounting algorithm should not only explain the detection result, but should also be able to assist detection in the first place. Coupled in a joint optimization framework, recounting improves detection by pruning irrelevant noisy concepts, while detection directs recounting to the most discriminative evidence. To better exploit the powerful and interpretable semantic video representation, we segment each video into shots and exploit the rich temporal structure at the shot level. The resulting computational challenge is carefully addressed through a significant improvement of the standard ADMM algorithm which, by eliminating all inner loops and deriving novel closed-form solutions for all intermediate steps, enables us to efficiently process extremely large video corpora. We test the proposed method on the large-scale TRECVID MEDTest 2014 and MEDTest 2013 datasets, and obtain very promising results for both MED and MER.
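The abstract credits the method's scalability to an ADMM variant in which every intermediate step has a closed-form solution and no inner loops remain. The paper's actual objective is not reproduced here; purely as an illustration of that design principle, the sketch below applies the standard ADMM pattern to a lasso problem, where both subproblem updates are closed-form (a cached Cholesky solve and soft-thresholding), so each iteration is a fixed amount of direct computation:

```python
import numpy as np

def soft_threshold(v, k):
    """Closed-form proximal operator of the l1 norm."""
    return np.sign(v) * np.maximum(np.abs(v) - k, 0.0)

def admm_lasso(A, b, lam, rho=1.0, iters=300):
    """ADMM for: min_x 0.5*||Ax - b||^2 + lam*||x||_1,
    split as x/z with the constraint x = z. Both subproblem
    updates are closed-form, so no inner iterative solver is needed."""
    n = A.shape[1]
    # Factor (A^T A + rho*I) once; the factor is reused every iteration.
    L = np.linalg.cholesky(A.T @ A + rho * np.eye(n))
    Atb = A.T @ b
    x = z = u = np.zeros(n)
    for _ in range(iters):
        # x-update: one pair of triangular solves (closed form).
        x = np.linalg.solve(L.T, np.linalg.solve(L, Atb + rho * (z - u)))
        # z-update: soft-thresholding (closed form).
        z = soft_threshold(x + u, lam / rho)
        # Dual ascent on the scaled multiplier.
        u = u + x - z
    return z
```

Because the factorization is computed once up front, each iteration costs only two triangular solves plus element-wise operations; this "no inner loops, all closed-form steps" structure is the same property the abstract claims for its ADMM improvement, applied there to a different objective.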



    Published In

    MM '15: Proceedings of the 23rd ACM International Conference on Multimedia
    October 2015, 1402 pages
    ISBN: 9781450334594
    DOI: 10.1145/2733373

    Publisher

    Association for Computing Machinery, New York, NY, United States


    Author Tags

    1. multimedia event detection (MED)
    2. multimedia event recounting (MER)
    3. video analysis

    Qualifiers

    • Research-article

    Funding Sources

    • DECRA project
    • U.S. Department of Defense / U.S. Army Research Office
    • Intelligence Advanced Research Projects Activity (IARPA)

    Conference

    MM '15: ACM Multimedia Conference
    October 26-30, 2015
    Brisbane, Australia

    Acceptance Rates

    MM '15 Paper Acceptance Rate: 56 of 252 submissions, 22%
    Overall Acceptance Rate: 2,145 of 8,556 submissions, 25%


    Cited By

    • (2023) Temporal Dynamic Concept Modeling Network for Explainable Video Event Recognition. ACM Transactions on Multimedia Computing, Communications, and Applications, 19(6):1-22. DOI: 10.1145/3568312. Online publication date: 12-Jul-2023.
    • (2020) Towards More Explainability: Concept Knowledge Mining Network for Event Recognition. Proceedings of the 28th ACM International Conference on Multimedia, 3857-3865. DOI: 10.1145/3394171.3413954. Online publication date: 12-Oct-2020.
    • (2020) Modeling Temporal Concept Receptive Field Dynamically for Untrimmed Video Analysis. Proceedings of the 28th ACM International Conference on Multimedia, 3798-3806. DOI: 10.1145/3394171.3413618. Online publication date: 12-Oct-2020.
    • (2020) Recurrent Compressed Convolutional Networks for Short Video Event Detection. IEEE Access, 8:114162-114171. DOI: 10.1109/ACCESS.2020.3003939. Online publication date: 2020.
    • (2019) Video Imprint. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(12):3086-3099. DOI: 10.1109/TPAMI.2018.2866114. Online publication date: 1-Dec-2019.
    • (2019) SHAD: Privacy-Friendly Shared Activity Detection and Data Sharing. 2019 IEEE 16th International Conference on Mobile Ad Hoc and Sensor Systems (MASS), 109-117. DOI: 10.1109/MASS.2019.00022. Online publication date: Nov-2019.
    • (2018) Extracting Key Segments of Videos for Event Detection by Learning From Web Sources. IEEE Transactions on Multimedia, 20(5):1088-1100. DOI: 10.1109/TMM.2017.2763322. Online publication date: May-2018.
    • (2018) You Are What You Eat. IEEE Transactions on Multimedia, 20(4):950-964. DOI: 10.1109/TMM.2017.2759499. Online publication date: 1-Apr-2018.
    • (2018) User session level diverse reranking of search results. Neurocomputing, 274:66-79. DOI: 10.1016/j.neucom.2016.05.087. Online publication date: 24-Jan-2018.
    • (2018) Evolutionary under-sampling based bagging ensemble method for imbalanced data classification. Frontiers of Computer Science, 12(2):331-350. DOI: 10.1007/s11704-016-5306-z. Online publication date: 1-Apr-2018.
