Sentiment-Oriented Transformer-Based Variational Autoencoder Network for Live Video Commenting

Published: 11 January 2024

Abstract

Automatic live video commenting is attracting increasing attention due to its significance in narration generation, topic explanation, and related tasks. However, current methods do not account for the diverse sentiments of the generated comments. Sentiment is critical in interactive commenting, yet it has received little research attention so far. In this article, we therefore propose a Sentiment-oriented Transformer-based Variational Autoencoder (So-TVAE) network, which consists of a sentiment-oriented diversity encoder module and a batch attention module, to generate diverse video comments with multiple sentiments and multiple semantics. Specifically, the sentiment-oriented diversity encoder combines a VAE with a random mask mechanism to achieve semantic diversity under sentiment guidance, and its output is fused with cross-modal features to generate live video comments. The batch attention module alleviates the problem of missing sentiment samples caused by the data imbalance that is common in live videos, as video popularity varies widely. Extensive experiments on the Livebot and VideoIC datasets demonstrate that So-TVAE outperforms state-of-the-art methods in both the quality and the diversity of generated comments. Code is available at https://github.com/fufy1024/So-TVAE.
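
To make the abstract's description more concrete, below is a minimal, illustrative sketch of a sentiment-conditioned VAE encoder with a random token mask, loosely following the idea of the sentiment-oriented diversity encoder. It is not the authors' implementation (see the linked repository for that); the PyTorch framework choice, module names, and dimensions are all assumptions for illustration only.

import torch
import torch.nn as nn

class SentimentVAEEncoder(nn.Module):
    """Hypothetical sketch: sentiment-guided VAE encoder with a random mask."""
    def __init__(self, vocab_size=30000, d_model=512, n_sentiments=3,
                 latent_dim=128, mask_prob=0.15):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.sent_embed = nn.Embedding(n_sentiments, d_model)   # sentiment label embedding
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.to_mu = nn.Linear(d_model, latent_dim)
        self.to_logvar = nn.Linear(d_model, latent_dim)
        self.mask_prob = mask_prob

    def forward(self, tokens, sentiment):
        # tokens: (B, T) comment token ids; sentiment: (B,) sentiment class ids
        x = self.embed(tokens)
        if self.training:
            # Random mask: drop a fraction of token embeddings so that different
            # latent samples must account for different parts of the comment,
            # encouraging semantic diversity in the latent space.
            keep = torch.rand(x.shape[:2], device=x.device) > self.mask_prob
            x = x * keep.unsqueeze(-1)
        # Prepend the sentiment embedding as a conditioning token.
        s = self.sent_embed(sentiment).unsqueeze(1)
        h = self.encoder(torch.cat([s, x], dim=1))
        pooled = h[:, 0]                                   # summary at the sentiment position
        mu, logvar = self.to_mu(pooled), self.to_logvar(pooled)
        # Reparameterization trick: z = mu + sigma * eps keeps sampling differentiable.
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        return z, mu, logvar

# Toy usage: each sampled z would then be fused with cross-modal video/context
# features before decoding a comment (fusion and decoder are omitted here).
enc = SentimentVAEEncoder()
tokens = torch.randint(0, 30000, (4, 20))      # a batch of 4 comments, 20 tokens each
sentiment = torch.randint(0, 3, (4,))          # e.g. negative / neutral / positive
z, mu, logvar = enc(tokens, sentiment)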


Cited By

  • (2024) Improving Radiology Report Generation with D2-Net: When Diffusion Meets Discriminator. ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2215-2219. https://doi.org/10.1109/ICASSP48485.2024.10448326. Online publication date: 14-Apr-2024.
  • (2024) Improving radiology report generation with multi-grained abnormality prediction. Neurocomputing, 600, 128122. https://doi.org/10.1016/j.neucom.2024.128122. Online publication date: Oct-2024.
  • (2024) PLIClass: Weakly Supervised Text Classification with Iterative Training and Denoisy Inference. Artificial Neural Networks and Machine Learning – ICANN 2024, 292-305. https://doi.org/10.1007/978-3-031-72350-6_20. Online publication date: 17-Sep-2024.


      Published In

      ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 20, Issue 4
      April 2024
      676 pages
      EISSN: 1551-6865
      DOI: 10.1145/3613617
      Editor: Abdulmotaleb El Saddik

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 11 January 2024
      Online AM: 18 November 2023
      Accepted: 08 November 2023
      Revised: 10 September 2023
      Received: 18 March 2023
      Published in TOMM Volume 20, Issue 4


      Author Tags

      1. Automatic live video commenting
      2. multi-modal learning
      3. variational autoencoder
      4. batch attention mechanism

      Qualifiers

      • Research-article

      Funding Sources

      • National Science Fund for Excellent Young Scholars
      • National Natural Science Foundation of China


