
Multimodal Emotion Recognition with Factorized Bilinear Pooling and Adversarial Learning

Published: 07 December 2021
DOI: 10.1145/3487075.3487164
    Abstract

    With the rapid development of social networks, the massive growth of multimodal data such as images and text has created higher demands for processing information from an emotional perspective. Emotion recognition requires a computer to simulate high-level visual perception and understanding, yet existing methods often focus on a single modality. In this work, we propose a multimodal model based on factorized bilinear pooling (FBP) and adversarial learning for emotion recognition. In our model, a multimodal feature fusion network encodes inter-modality features under the guidance of FBP, so that the visual and textual feature representations learn from each other interactively. Beyond that, we propose an adversarial network that introduces two discriminative classification tasks: emotion recognition and multimodal fusion prediction. The entire method can be implemented end-to-end in a deep neural network framework. Experimental results indicate that the proposed model achieves competitive performance on the extended FI dataset, and further comparisons demonstrate its advantage over both single- and multi-modality approaches.
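    The fusion operator named in the abstract can be made concrete. Below is a minimal PyTorch sketch of factorized bilinear pooling in the style of multi-modal factorized bilinear (MFB) pooling (Yu et al., ICCV 2017): both modalities are projected into a shared low-rank space, multiplied element-wise to approximate the full bilinear interaction, sum-pooled over k factors, and normalized. The feature dimensions (a 2048-d ResNet-style image embedding, a 300-d GloVe-style text embedding), the factor count k, and the dropout rate are illustrative assumptions, not the configuration reported in the paper.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class FBPFusion(nn.Module):
        # Fuses an image feature and a text feature with factorized bilinear pooling.
        def __init__(self, img_dim=2048, txt_dim=300, out_dim=1000, k=5):
            super().__init__()
            self.out_dim, self.k = out_dim, k
            # Low-rank projections that factorize the full bilinear weight tensor.
            self.proj_img = nn.Linear(img_dim, out_dim * k)
            self.proj_txt = nn.Linear(txt_dim, out_dim * k)
            self.dropout = nn.Dropout(p=0.1)

        def forward(self, img_feat, txt_feat):
            # Element-wise product of the projections approximates the pairwise
            # bilinear interactions between the two modalities.
            joint = self.dropout(self.proj_img(img_feat) * self.proj_txt(txt_feat))
            # Sum-pool each group of k factors: (B, out_dim * k) -> (B, out_dim).
            joint = joint.view(-1, self.out_dim, self.k).sum(dim=2)
            # Signed square root and L2 normalization stabilize the fused feature.
            joint = torch.sign(joint) * torch.sqrt(joint.abs() + 1e-12)
            return F.normalize(joint, dim=1)

    # Example: fuse a batch of four image/text feature pairs into 1000-d vectors.
    fusion = FBPFusion()
    fused = fusion(torch.randn(4, 2048), torch.randn(4, 300))  # shape: (4, 1000)

    In the architecture the abstract describes, a fused vector like this would then feed both the emotion classifier and the adversarial branch's fusion-prediction discriminator; the exact adversarial formulation is not specified here, so that part is left out of the sketch.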


    Cited By

    • (2024) Navigating the Multimodal Landscape: A Review on Integration of Text and Image Data in Machine Learning Architectures. Machine Learning and Knowledge Extraction, 6(3), 1545-1563. DOI: 10.3390/make6030074. Online publication date: 9 July 2024.
    • (2023) Scoping Review on Image-Text Multimodal Machine Learning Models. 2023 International Conference on Computational Science and Computational Intelligence (CSCI), 186-192. DOI: 10.1109/CSCI62032.2023.00035. Online publication date: 13 December 2023.


    Published In

    CSAE '21: Proceedings of the 5th International Conference on Computer Science and Application Engineering
    October 2021
    660 pages
    ISBN:9781450389853
    DOI:10.1145/3487075
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

    Publisher

    Association for Computing Machinery

    New York, NY, United States



    Author Tags

    1. Adversarial learning
    2. Factorized bilinear pooling
    3. Multimodal emotion recognition

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Conference

    CSAE 2021

    Acceptance Rates

    Overall Acceptance Rate 368 of 770 submissions, 48%

