DOI: 10.1145/3503161.3548007

Temporal Sentiment Localization: Listen and Look in Untrimmed Videos

Published: 10 October 2022

    Abstract

    Video sentiment analysis aims to uncover the underlying attitudes of viewers, and it has a wide range of real-world applications. Existing works simply classify a video into a single sentiment category, ignoring the fact that sentiment in untrimmed videos may appear in multiple segments with varying lengths and unknown locations. To address this, we propose a challenging task, i.e., Temporal Sentiment Localization (TSL), to find which parts of a video convey sentiment. To systematically investigate the fully- and weakly-supervised settings for TSL, we first build a benchmark dataset named TSL-300, consisting of 300 videos with a total length of 1,291 minutes. Each video is labeled in two ways: frame-by-frame annotation for the fully-supervised setting, and single-frame annotation, i.e., only a single frame with strong sentiment labeled per segment, for the weakly-supervised setting. Because densely annotating a dataset is costly, we propose TSL-Net, which employs single-frame supervision to localize sentiment in videos. In detail, we generate pseudo labels for unlabeled frames using a greedy search strategy, and fuse the affective features of the visual and audio modalities to predict the temporal sentiment distribution. Here, a reverse mapping strategy is designed for feature fusion, and a contrastive loss is utilized to maintain consistency between the original feature and the reverse prediction. Extensive experiments show the superiority of our method over state-of-the-art approaches.
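    The abstract names two concrete training ingredients: greedy generation of pseudo labels from single-frame annotations, and a contrastive loss tying each frame's fused feature to its reverse-mapped prediction. The sketch below is a minimal PyTorch illustration of both under our own assumptions, not the authors' exact formulation (which is in the paper and the linked repository); the function names, the cosine-similarity stopping rule, and the temperature value are all hypothetical.

```python
import torch
import torch.nn.functional as F


def expand_single_frame_labels(features, labeled_idx, label, sim_threshold=0.8):
    """Greedily grow pseudo labels outward from one annotated frame.

    features: (T, D) per-frame embeddings; labeled_idx: index of the
    single annotated frame; label: its sentiment class. Neighboring
    frames are absorbed while their cosine similarity to the anchor
    frame stays above sim_threshold (a hypothetical stand-in for the
    paper's greedy search criterion).
    """
    T = features.size(0)
    anchor = F.normalize(features[labeled_idx], dim=0)   # (D,)
    sims = F.normalize(features, dim=1) @ anchor         # (T,) cosine similarities
    pseudo = torch.full((T,), -1, dtype=torch.long)      # -1 marks still-unlabeled frames
    pseudo[labeled_idx] = label
    for idx in range(labeled_idx - 1, -1, -1):           # expand to the left
        if sims[idx] < sim_threshold:
            break
        pseudo[idx] = label
    for idx in range(labeled_idx + 1, T):                # expand to the right
        if sims[idx] < sim_threshold:
            break
        pseudo[idx] = label
    return pseudo


def consistency_loss(orig_feat, reverse_pred, temperature=0.07):
    """InfoNCE-style contrastive loss: each frame's original fused
    feature should match its own reverse-mapped prediction (the
    positive pair), with the other frames of the same video acting
    as negatives. orig_feat, reverse_pred: (T, D).
    """
    orig = F.normalize(orig_feat, dim=1)
    rev = F.normalize(reverse_pred, dim=1)
    logits = orig @ rev.t() / temperature                # (T, T) similarity matrix
    targets = torch.arange(orig.size(0))                 # positive = matching frame index
    return F.cross_entropy(logits, targets)


# Toy usage: 50 frames of 128-D fused features, frame 20 labeled class 1.
feats = torch.randn(50, 128)
pseudo = expand_single_frame_labels(feats, labeled_idx=20, label=1)
loss = consistency_loss(feats, feats + 0.01 * torch.randn_like(feats))
```

    In a training loop, the pseudo labels would serve as dense targets for a per-frame sentiment classifier over the fused visual-audio features, while the consistency term would be added to the total loss as a regularizer.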

    Supplementary Material

    MP4 File (mm22-fp1067.mp4)
    Presentation video. In this work, we aim to better understand the sentiment conveyed in untrimmed videos, at frame granularity, for applications in real-world scenarios. We propose a novel task, i.e., Temporal Sentiment Localization (TSL), to locate and classify sentiment simultaneously. Then, we present a video sentiment analysis dataset for fully- and weakly-supervised settings. To tackle the challenges, we propose a weakly-supervised temporal sentiment localization framework. The dataset and code are available at https://github.com/nku-zhichengzhang/TSL300.


    Cited By

    • (2023) Ordinal Label Distribution Learning. 2023 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 23424-23434. DOI: 10.1109/ICCV51070.2023.02146. Online publication date: 1-Oct-2023.
    • (2023) Weakly Supervised Video Emotion Detection and Prediction via Cross-Modal Temporal Erasing Network. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 18888-18897. DOI: 10.1109/CVPR52729.2023.01811. Online publication date: Jul-2023.
    • (2023) DIP: Dual Incongruity Perceiving Network for Sarcasm Detection. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2540-2550. DOI: 10.1109/CVPR52729.2023.00250. Online publication date: Jul-2023.
    • (2023) An End-to-End Transformer with Progressive Tri-Modal Attention for Multi-modal Emotion Recognition. Pattern Recognition and Computer Vision, pp. 396-408. DOI: 10.1007/978-981-99-8540-1_32. Online publication date: 25-Dec-2023.


    Published In

    MM '22: Proceedings of the 30th ACM International Conference on Multimedia
    October 2022
    7537 pages
    ISBN: 9781450392037
    DOI: 10.1145/3503161


    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. dataset
    2. video sentiment analysis
    3. weakly-supervised learning

    Qualifiers

    • Research-article

    Conference

    MM '22

    Acceptance Rates

    Overall Acceptance Rate 995 of 4,171 submissions, 24%


