
Multimodal Attentive Representation Learning for Micro-video Multi-label Classification

Published: 08 March 2024

Abstract

As a representative type of user-generated content (UGC) on social platforms, micro-videos have become increasingly popular in daily life. Although micro-videos naturally exhibit multimodal features rich enough to support representation learning, the complex correlations across modalities make valuable information difficult to integrate. In this paper, we introduce a multimodal attentive representation network (MARNET) to learn complete and robust representations that benefit micro-video multi-label classification. To address the common missing-modality issue, we present a multimodal information aggregation module that integrates multimodal information, obtaining latent common representations by modeling complementarity and consistency over visual-centered modality groupings rather than single modalities. For the label correlation issue, we design an attentive graph neural network module that adaptively learns the label correlation matrix and label representations for better compatibility with the training data. In addition, a cross-modal multi-head attention module makes the learned common representations label-aware for multi-label classification. Experiments on two micro-video datasets demonstrate the superior performance of MARNET compared with state-of-the-art methods.
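To make the cross-modal attention step concrete, the following is a minimal NumPy sketch of how label embeddings (as queries) can attend over a fused multimodal representation (as keys and values) to produce label-aware features. All shapes, variable names, and the random projections standing in for learned weights are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(labels, common, num_heads=4):
    """Multi-head attention: label embeddings (C, d) query the fused
    multimodal representation (T, d), yielding one label-aware feature
    vector per label.  Weights are random stand-ins for learned ones."""
    C, d = labels.shape
    T, _ = common.shape
    dh = d // num_heads
    rng = np.random.default_rng(0)
    Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
    # Project, then split the feature dimension into heads: (H, C_or_T, dh).
    Q = (labels @ Wq).reshape(C, num_heads, dh).transpose(1, 0, 2)
    K = (common @ Wk).reshape(T, num_heads, dh).transpose(1, 0, 2)
    V = (common @ Wv).reshape(T, num_heads, dh).transpose(1, 0, 2)
    # Scaled dot-product attention per head: (H, C, T).
    attn = softmax(Q @ K.transpose(0, 2, 1) / np.sqrt(dh))
    # Merge heads back into a (C, d) label-aware representation.
    return (attn @ V).transpose(1, 0, 2).reshape(C, d)

# 5 hypothetical labels, 10 fused feature tokens, dimension 16.
label_aware = cross_modal_attention(np.ones((5, 16)), np.ones((10, 16)))
print(label_aware.shape)  # (5, 16)
```

In MARNET the query-side label embeddings would come from the attentive graph neural network module and the key/value side from the learned common representations; this sketch only illustrates the attention mechanics.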




    Published In

    ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 20, Issue 6
    June 2024, 715 pages
    EISSN: 1551-6865
    DOI: 10.1145/3613638
    Editor: Abdulmotaleb El Saddik

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 08 March 2024
    Online AM: 06 February 2024
    Accepted: 27 January 2024
    Revised: 04 January 2024
    Received: 02 June 2023
    Published in TOMM Volume 20, Issue 6


    Author Tags

    1. Micro-video
    2. multimodal representations
    3. multi-label
    4. graph network

    Qualifiers

    • Research-article

    Funding Sources

    • National Natural Science Foundation of China
    • Guangxi Key Laboratory of Big Data in Finance and Economics
    • Doctor Start-up Funds

