Reliable shot identification for complex event detection via visual-semantic embedding

Published: 01 December 2021

Abstract

Multimedia event detection is the task of detecting a specific event of interest in a user-generated video on the web. The most fundamental challenges of this task are the widely varying quality of such videos and the inherently high-level semantic abstraction of events. In this paper, we decompose each video into several segments and model complex event detection as a multiple instance learning problem by representing each video as a “bag” of segments, where each segment is referred to as an instance. Instead of treating the instances equally, we associate each instance with a reliability variable indicating its importance and then select reliable instances for training. To measure the reliability of the varying instances precisely, we propose a visual-semantic guided loss that exploits low-level visual features together with high-level semantic features based on instance-event similarity. Motivated by curriculum learning, we introduce a negative elastic-net regularization term that starts training the classifier with instances of high reliability and gradually takes instances of relatively low reliability into consideration. An alternating optimization algorithm is developed to solve the resulting challenging non-convex, non-smooth problem. Experimental results on the standard TRECVID MEDTest 2013 and TRECVID MEDTest 2014 datasets demonstrate the effectiveness and superiority of the proposed method over baseline algorithms.
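The bag-of-instances view with per-instance reliability can be sketched as follows. This is a toy illustration, not the paper's exact formulation: the weighting `alpha`, the choice of cosine similarity for the semantic term, and the negative squared loss for the visual term are all assumptions made here for concreteness.

```python
import numpy as np

def visual_semantic_reliability(instance_feats, event_embedding, clf_scores, alpha=0.5):
    """Toy per-instance (per-segment) reliability score for one video "bag".

    Combines a low-level visual term (here: negative squared classifier loss,
    so well-classified segments score higher) with a high-level semantic term
    (cosine similarity between each segment's embedding and the event's
    embedding). Both terms and the mixing weight `alpha` are illustrative
    assumptions, not the paper's exact visual-semantic guided loss.
    """
    # Semantic term: cosine similarity of each instance to the event embedding.
    sims = instance_feats @ event_embedding
    sims /= (np.linalg.norm(instance_feats, axis=1)
             * np.linalg.norm(event_embedding) + 1e-12)
    # Visual term: smaller classification loss -> higher reliability.
    visual = -(1.0 - clf_scores) ** 2
    return alpha * visual + (1.0 - alpha) * sims

rng = np.random.default_rng(0)
feats = rng.normal(size=(5, 8))   # a "bag" of 5 segments, 8-dim features each
event = rng.normal(size=8)        # hypothetical event embedding
scores = rng.uniform(size=5)      # classifier scores per segment
rel = visual_semantic_reliability(feats, event, scores)
print(rel.shape)                  # one reliability value per instance: (5,)
```

Segments with the highest reliability would then be the ones selected first for training the event classifier.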

Highlights

A visual-semantic guided loss is proposed to measure the reliability of instances for event detection.
Training begins with high-reliability instances and gradually adds instances of low reliability.
Promising experimental results show the effectiveness and superiority of the proposed method.
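The curriculum schedule in the second highlight can be sketched as a self-paced subproblem. The regularizer below is one illustrative negative elastic-net form; the paper's exact regularizer, thresholds, and update schedule may differ.

```python
import numpy as np

def select_reliable(losses, lam1, lam2):
    """Closed-form instance weights for the illustrative subproblem

        min_{w in [0,1]^n}  sum_i w_i * l_i  -  lam1 * ||w||_1  -  (lam2/2) * ||w||_2^2

    The negative elastic-net term makes the objective concave in each w_i,
    so each minimizer lies at an endpoint: instance i is selected (w_i = 1)
    exactly when g(1) = l_i - lam1 - lam2/2 < g(0) = 0, i.e. when its loss
    falls below lam1 + lam2/2. Growing lam1, lam2 over iterations admits
    progressively less reliable instances.
    """
    threshold = lam1 + lam2 / 2.0
    return (losses < threshold).astype(float)

losses = np.array([0.1, 0.4, 0.9, 1.5])
# Small threshold: only the most reliable (lowest-loss) instances train first.
print(select_reliable(losses, 0.2, 0.2))   # [1. 0. 0. 0.]
# A larger threshold gradually takes low-reliability instances into account.
print(select_reliable(losses, 0.8, 0.4))   # [1. 1. 1. 0.]
```

Alternating between this weight update and retraining the classifier on the selected instances mirrors the alternating optimization scheme described in the abstract.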



Published In

Computer Vision and Image Understanding, Volume 213, Issue C, Dec 2021, 94 pages

Publisher

Elsevier Science Inc., United States


          Author Tags

          1. Machine learning
          2. Complex event detection
          3. Visual-semantic guidance
          4. Reliable shot identification

          Qualifiers

          • Research-article
