
Learning From Web Videos for Event Classification

Published: 01 October 2018

Abstract

Traditional approaches for classifying event videos rely on a manually curated training data set. While this paradigm has achieved excellent results on benchmarks, such as TrecVid multimedia event detection (MED) challenge data sets, it is restricted by the effort involved in careful annotation. Recent approaches have attempted to address the need for annotation by automatically extracting images from the web, or generating queries to retrieve videos. In the former case, they fail to exploit additional cues provided by video data, while in the latter, they still require some manual annotation to generate relevant queries. We take an alternate approach in this paper, leveraging the synergy between visual video data and the associated textual metadata, to learn event classifiers without manually annotating any videos. Specifically, we first collect a video data set with queries constructed automatically from textual description of events, prune irrelevant videos with text and video data, and then learn the corresponding event classifiers. We evaluate this approach in the challenging setting where no manually annotated training set is available, i.e., EK0 in the TrecVid challenge, and show state-of-the-art results on MED 2011 and 2013 data sets.
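The pipeline described above (automatic query construction from an event description, pruning of irrelevant retrieved videos, then training a per-event classifier) can be sketched as follows. This is a minimal, hypothetical illustration, not the authors' exact method: it assumes TF-IDF cosine similarity between event descriptions and video metadata for pruning, and a linear SVM over pooled video-level features for classification; all function names, thresholds, and the toy data are illustrative.

```python
# Hypothetical sketch of the abstract's pipeline: prune web-retrieved videos
# whose textual metadata does not match the event description, then train a
# per-event classifier on the surviving videos. Not the paper's exact method.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.svm import LinearSVC


def prune_by_metadata(event_description, video_metadata, keep_ratio=0.5):
    """Rank retrieved videos by TF-IDF cosine similarity between their
    metadata (title/description/tags) and the event description, and keep
    only the top fraction as pseudo-positive training videos."""
    vectorizer = TfidfVectorizer(stop_words="english")
    vectors = vectorizer.fit_transform([event_description] + video_metadata)
    scores = cosine_similarity(vectors[0], vectors[1:]).ravel()
    n_keep = max(1, int(keep_ratio * len(video_metadata)))
    return np.argsort(scores)[::-1][:n_keep]


def train_event_classifier(pos_features, neg_features):
    """Fit a linear SVM on pooled video-level features (e.g., averaged
    CNN frame descriptors), treating pruned web videos as positives."""
    X = np.vstack([pos_features, neg_features])
    y = np.array([1] * len(pos_features) + [0] * len(neg_features))
    return LinearSVC(C=1.0).fit(X, y)


if __name__ == "__main__":
    # Toy stand-ins for retrieved web videos: metadata strings plus random
    # vectors in place of real video features.
    event = "birthday party: people celebrate with cake, candles and singing"
    metadata = [
        "kids birthday party cake and candles",
        "how to change a car tire",
        "surprise birthday celebration singing happy birthday",
    ]
    rng = np.random.default_rng(0)
    features = rng.normal(size=(len(metadata), 128))
    background = rng.normal(size=(10, 128))  # videos from other events

    kept = prune_by_metadata(event, metadata)
    clf = train_event_classifier(features[kept], background)
    print("kept video indices:", kept)
    print("score on first video:", clf.decision_function(features[:1]))
```

In practice the pruning step would also use visual cues from the video frames (as the abstract notes, both text and video data are used), but the text-only version above conveys the overall structure.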

Cited By

  • Multi-Grained Gradual Inference Model for Multimedia Event Extraction, IEEE Transactions on Circuits and Systems for Video Technology, vol. 34, no. 10, part 2, pp. 10507–10520, Oct. 2024. DOI: 10.1109/TCSVT.2024.3402242

Published In

IEEE Transactions on Circuits and Systems for Video Technology, Volume 28, Issue 10, October 2018, 658 pages

Publisher: IEEE Press
Published: 01 October 2018
Type: Research-article
