Reliable shot identification for complex event detection via visual-semantic embedding

Published: 01 December 2021

Abstract

Multimedia event detection is the task of detecting a specific event of interest in a user-generated video on the web. The most fundamental challenges of this task are the widely varying quality of such videos and the inherently high-level semantic abstraction of events. In this paper, we decompose each video into several segments and model complex event detection as a multiple instance learning problem by representing each video as a “bag” of segments, where each segment is referred to as an instance. Instead of treating the instances equally, we associate each instance with a reliability variable indicating its importance and then select reliable instances for training. To measure the reliability of the varying instances precisely, we propose a visual-semantic guided loss that exploits low-level visual features together with high-level semantic features based on instance-event similarity. Motivated by curriculum learning, we introduce a negative elastic-net regularization term that starts training the classifier with instances of high reliability and gradually takes instances of relatively low reliability into consideration. An alternating optimization algorithm is developed to solve the resulting challenging non-convex, non-smooth problem. Experimental results on the standard TRECVID MEDTest 2013 and TRECVID MEDTest 2014 datasets demonstrate the effectiveness and superiority of the proposed method over baseline algorithms.
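The bag-of-instances view with per-instance reliability can be sketched as follows. This is a toy illustration, not the paper's exact formulation: the weighting `alpha`, the choice of cosine similarity for the semantic term, and the negative squared loss for the visual term are all assumptions made here for concreteness.

```python
import numpy as np

def visual_semantic_reliability(instance_feats, event_embedding, clf_scores, alpha=0.5):
    """Toy per-instance (per-segment) reliability score for one video "bag".

    Combines a low-level visual term (here: negative squared classifier loss,
    so well-classified segments score higher) with a high-level semantic term
    (cosine similarity between each segment's embedding and the event's
    embedding). Both terms and the mixing weight `alpha` are illustrative
    assumptions, not the paper's exact visual-semantic guided loss.
    """
    # Semantic term: cosine similarity of each instance to the event embedding.
    sims = instance_feats @ event_embedding
    sims /= (np.linalg.norm(instance_feats, axis=1)
             * np.linalg.norm(event_embedding) + 1e-12)
    # Visual term: smaller classification loss -> higher reliability.
    visual = -(1.0 - clf_scores) ** 2
    return alpha * visual + (1.0 - alpha) * sims

rng = np.random.default_rng(0)
feats = rng.normal(size=(5, 8))   # a "bag" of 5 segments, 8-dim features each
event = rng.normal(size=8)        # hypothetical event embedding
scores = rng.uniform(size=5)      # classifier scores per segment
rel = visual_semantic_reliability(feats, event, scores)
print(rel.shape)                  # one reliability value per instance: (5,)
```

Segments with the highest reliability would then be the ones selected first for training the event classifier.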

Highlights

A visual-semantic guided loss is proposed to measure the reliability of instances for event detection.
Training begins with high-reliability instances and gradually adds instances of low reliability.
Promising experimental results show the effectiveness and superiority of the proposed method.
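The curriculum schedule in the second highlight can be sketched as a self-paced subproblem. The regularizer below is one illustrative negative elastic-net form; the paper's exact regularizer, thresholds, and update schedule may differ.

```python
import numpy as np

def select_reliable(losses, lam1, lam2):
    """Closed-form instance weights for the illustrative subproblem

        min_{w in [0,1]^n}  sum_i w_i * l_i  -  lam1 * ||w||_1  -  (lam2/2) * ||w||_2^2

    The negative elastic-net term makes the objective concave in each w_i,
    so each minimizer lies at an endpoint: instance i is selected (w_i = 1)
    exactly when g(1) = l_i - lam1 - lam2/2 < g(0) = 0, i.e. when its loss
    falls below lam1 + lam2/2. Growing lam1, lam2 over iterations admits
    progressively less reliable instances.
    """
    threshold = lam1 + lam2 / 2.0
    return (losses < threshold).astype(float)

losses = np.array([0.1, 0.4, 0.9, 1.5])
# Small threshold: only the most reliable (lowest-loss) instances train first.
print(select_reliable(losses, 0.2, 0.2))   # [1. 0. 0. 0.]
# A larger threshold gradually takes low-reliability instances into account.
print(select_reliable(losses, 0.8, 0.4))   # [1. 1. 1. 0.]
```

Alternating between this weight update and retraining the classifier on the selected instances mirrors the alternating optimization scheme described in the abstract.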



Published In

Computer Vision and Image Understanding, Volume 213, Issue C, Dec 2021, 94 pages

Publisher

Elsevier Science Inc., United States


          Author Tags

          1. Machine learning
          2. Complex event detection
          3. Visual-semantic guidance
          4. Reliable shot identification

          Qualifiers

          • Research-article
