
Learning From Web Videos for Event Classification

Published: 01 October 2018

Abstract

Traditional approaches for classifying event videos rely on a manually curated training data set. While this paradigm has achieved excellent results on benchmarks, such as TrecVid multimedia event detection (MED) challenge data sets, it is restricted by the effort involved in careful annotation. Recent approaches have attempted to address the need for annotation by automatically extracting images from the web, or generating queries to retrieve videos. In the former case, they fail to exploit additional cues provided by video data, while in the latter, they still require some manual annotation to generate relevant queries. We take an alternate approach in this paper, leveraging the synergy between visual video data and the associated textual metadata, to learn event classifiers without manually annotating any videos. Specifically, we first collect a video data set with queries constructed automatically from textual description of events, prune irrelevant videos with text and video data, and then learn the corresponding event classifiers. We evaluate this approach in the challenging setting where no manually annotated training set is available, i.e., EK0 in the TrecVid challenge, and show state-of-the-art results on MED 2011 and 2013 data sets.
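The pipeline described above (automatic query construction from an event description, pruning of irrelevant retrieved videos, then training a per-event classifier) can be sketched as follows. This is a minimal, hypothetical illustration, not the authors' exact method: it assumes TF-IDF cosine similarity between event descriptions and video metadata for pruning, and a linear SVM over pooled video-level features for classification; all function names, thresholds, and the toy data are illustrative.

```python
# Hypothetical sketch of the abstract's pipeline: prune web-retrieved videos
# whose textual metadata does not match the event description, then train a
# per-event classifier on the surviving videos. Not the paper's exact method.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.svm import LinearSVC


def prune_by_metadata(event_description, video_metadata, keep_ratio=0.5):
    """Rank retrieved videos by TF-IDF cosine similarity between their
    metadata (title/description/tags) and the event description, and keep
    only the top fraction as pseudo-positive training videos."""
    vectorizer = TfidfVectorizer(stop_words="english")
    vectors = vectorizer.fit_transform([event_description] + video_metadata)
    scores = cosine_similarity(vectors[0], vectors[1:]).ravel()
    n_keep = max(1, int(keep_ratio * len(video_metadata)))
    return np.argsort(scores)[::-1][:n_keep]


def train_event_classifier(pos_features, neg_features):
    """Fit a linear SVM on pooled video-level features (e.g., averaged
    CNN frame descriptors), treating pruned web videos as positives."""
    X = np.vstack([pos_features, neg_features])
    y = np.array([1] * len(pos_features) + [0] * len(neg_features))
    return LinearSVC(C=1.0).fit(X, y)


if __name__ == "__main__":
    # Toy stand-ins for retrieved web videos: metadata strings plus random
    # vectors in place of real video features.
    event = "birthday party: people celebrate with cake, candles and singing"
    metadata = [
        "kids birthday party cake and candles",
        "how to change a car tire",
        "surprise birthday celebration singing happy birthday",
    ]
    rng = np.random.default_rng(0)
    features = rng.normal(size=(len(metadata), 128))
    background = rng.normal(size=(10, 128))  # videos from other events

    kept = prune_by_metadata(event, metadata)
    clf = train_event_classifier(features[kept], background)
    print("kept video indices:", kept)
    print("score on first video:", clf.decision_function(features[:1]))
```

In practice the pruning step would also use visual cues from the video frames (as the abstract notes, both text and video data are used), but the text-only version above conveys the overall structure.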

Cited By

  • Multi-Grained Gradual Inference Model for Multimedia Event Extraction, IEEE Transactions on Circuits and Systems for Video Technology, vol. 34, no. 10, part 2, pp. 10507–10520, Oct. 2024. DOI: 10.1109/TCSVT.2024.3402242

Published In

IEEE Transactions on Circuits and Systems for Video Technology, Volume 28, Issue 10, October 2018, 658 pages

Publisher: IEEE Press
Published: 01 October 2018
Type: Research-article
