DOI: 10.1145/2733373.2806218
Searching Persuasively: Joint Event Detection and Evidence Recounting with Limited Supervision

Published: 13 October 2015

Abstract

Multimedia event detection (MED) and multimedia event recounting (MER) are fundamental tasks in managing large amounts of unconstrained web video, and both have attracted considerable attention in recent years. Most existing systems perform MER as a post-processing step on top of the MED results. To leverage the mutual benefits of the two tasks, we propose a joint framework that simultaneously detects high-level events and localizes the concepts indicative of those events. Our premise is that a good recounting algorithm should not only explain the detection result, but should also be able to assist detection in the first place. Coupled in a joint optimization framework, recounting improves detection by pruning irrelevant noisy concepts, while detection directs recounting to the most discriminative evidence. To better exploit the powerful and interpretable semantic video representation, we segment each video into shots and exploit the rich temporal structure at the shot level. The resulting computational challenge is carefully addressed through a significant improvement of the standard ADMM algorithm which, by eliminating all inner loops and deriving novel closed-form solutions for all intermediate steps, enables us to efficiently process extremely large video corpora. We test the proposed method on the large-scale TRECVID MEDTest 2014 and MEDTest 2013 datasets, and obtain very promising results for both MED and MER.
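The abstract credits the method's scalability to an ADMM variant in which every intermediate step has a closed-form solution and no inner loops remain. The paper's actual objective is not reproduced here; purely as an illustration of that design principle, the sketch below applies the standard ADMM pattern to a lasso problem, where both subproblem updates are closed-form (a cached Cholesky solve and soft-thresholding), so each iteration is a fixed amount of direct computation:

```python
import numpy as np

def soft_threshold(v, k):
    """Closed-form proximal operator of the l1 norm."""
    return np.sign(v) * np.maximum(np.abs(v) - k, 0.0)

def admm_lasso(A, b, lam, rho=1.0, iters=300):
    """ADMM for: min_x 0.5*||Ax - b||^2 + lam*||x||_1,
    split as x/z with the constraint x = z. Both subproblem
    updates are closed-form, so no inner iterative solver is needed."""
    n = A.shape[1]
    # Factor (A^T A + rho*I) once; the factor is reused every iteration.
    L = np.linalg.cholesky(A.T @ A + rho * np.eye(n))
    Atb = A.T @ b
    x = z = u = np.zeros(n)
    for _ in range(iters):
        # x-update: one pair of triangular solves (closed form).
        x = np.linalg.solve(L.T, np.linalg.solve(L, Atb + rho * (z - u)))
        # z-update: soft-thresholding (closed form).
        z = soft_threshold(x + u, lam / rho)
        # Dual ascent on the scaled multiplier.
        u = u + x - z
    return z
```

Because the factorization is computed once up front, each iteration costs only two triangular solves plus element-wise operations; this "no inner loops, all closed-form steps" structure is the same property the abstract claims for its ADMM improvement, applied there to a different objective.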



    Published In

    MM '15: Proceedings of the 23rd ACM International Conference on Multimedia
    October 2015, 1402 pages
    ISBN: 9781450334594
    DOI: 10.1145/2733373

    Publisher

    Association for Computing Machinery, New York, NY, United States


    Author Tags

    1. multimedia event detection (MED)
    2. multimedia event recounting (MER)
    3. video analysis

    Qualifiers

    • Research-article

    Funding Sources

    • DECRA project
    • U.S. Department of Defense / U.S. Army Research Office
    • Intelligence Advanced Research Projects Activity (IARPA)

    Conference

    MM '15: ACM Multimedia Conference
    October 26-30, 2015
    Brisbane, Australia

    Acceptance Rates

    MM '15 Paper Acceptance Rate: 56 of 252 submissions, 22%
    Overall Acceptance Rate: 2,145 of 8,556 submissions, 25%


    Cited By

    • (2023) Temporal Dynamic Concept Modeling Network for Explainable Video Event Recognition. ACM Transactions on Multimedia Computing, Communications, and Applications, 19(6):1-22. DOI: 10.1145/3568312. Online publication date: 12-Jul-2023.
    • (2020) Towards More Explainability: Concept Knowledge Mining Network for Event Recognition. Proceedings of the 28th ACM International Conference on Multimedia, 3857-3865. DOI: 10.1145/3394171.3413954. Online publication date: 12-Oct-2020.
    • (2020) Modeling Temporal Concept Receptive Field Dynamically for Untrimmed Video Analysis. Proceedings of the 28th ACM International Conference on Multimedia, 3798-3806. DOI: 10.1145/3394171.3413618. Online publication date: 12-Oct-2020.
    • (2020) Recurrent Compressed Convolutional Networks for Short Video Event Detection. IEEE Access, 8:114162-114171. DOI: 10.1109/ACCESS.2020.3003939. Online publication date: 2020.
    • (2019) Video Imprint. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(12):3086-3099. DOI: 10.1109/TPAMI.2018.2866114. Online publication date: 1-Dec-2019.
    • (2019) SHAD: Privacy-Friendly Shared Activity Detection and Data Sharing. 2019 IEEE 16th International Conference on Mobile Ad Hoc and Sensor Systems (MASS), 109-117. DOI: 10.1109/MASS.2019.00022. Online publication date: Nov-2019.
    • (2018) Extracting Key Segments of Videos for Event Detection by Learning From Web Sources. IEEE Transactions on Multimedia, 20(5):1088-1100. DOI: 10.1109/TMM.2017.2763322. Online publication date: May-2018.
    • (2018) You Are What You Eat. IEEE Transactions on Multimedia, 20(4):950-964. DOI: 10.1109/TMM.2017.2759499. Online publication date: 1-Apr-2018.
    • (2018) User session level diverse reranking of search results. Neurocomputing, 274:66-79. DOI: 10.1016/j.neucom.2016.05.087. Online publication date: 24-Jan-2018.
    • (2018) Evolutionary under-sampling based bagging ensemble method for imbalanced data classification. Frontiers of Computer Science, 12(2):331-350. DOI: 10.1007/s11704-016-5306-z. Online publication date: 1-Apr-2018.
