
Multimodal Emotion Recognition with Factorized Bilinear Pooling and Adversarial Learning

Published: 07 December 2021
DOI: 10.1145/3487075.3487164
    Abstract

    With the rapid development of social networks, the massive growth of multimodal data such as images and text has created higher demands for processing information from an emotional perspective. Emotion recognition requires a computer to simulate high-level visual perception and understanding, yet existing methods often focus on a single modality. In this work, we propose a multimodal model based on factorized bilinear pooling (FBP) and adversarial learning for emotion recognition. In our model, a multimodal feature fusion network encodes inter-modality features under the guidance of FBP, so that the visual and textual feature representations learn from each other interactively. Beyond that, we propose an adversarial network that introduces two discriminative classification tasks: emotion recognition and multimodal fusion prediction. The entire method can be implemented end-to-end in a deep neural network framework. Experimental results indicate that the proposed model achieves competitive performance on the extended FI dataset, and further comparisons demonstrate its advantage over both single- and multi-modality approaches.
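    The fusion operator named in the abstract can be made concrete. Below is a minimal PyTorch sketch of factorized bilinear pooling in the style of multi-modal factorized bilinear (MFB) pooling (Yu et al., ICCV 2017): both modalities are projected into a shared low-rank space, multiplied element-wise to approximate the full bilinear interaction, sum-pooled over k factors, and normalized. The feature dimensions (a 2048-d ResNet-style image embedding, a 300-d GloVe-style text embedding), the factor count k, and the dropout rate are illustrative assumptions, not the configuration reported in the paper.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class FBPFusion(nn.Module):
        # Fuses an image feature and a text feature with factorized bilinear pooling.
        def __init__(self, img_dim=2048, txt_dim=300, out_dim=1000, k=5):
            super().__init__()
            self.out_dim, self.k = out_dim, k
            # Low-rank projections that factorize the full bilinear weight tensor.
            self.proj_img = nn.Linear(img_dim, out_dim * k)
            self.proj_txt = nn.Linear(txt_dim, out_dim * k)
            self.dropout = nn.Dropout(p=0.1)

        def forward(self, img_feat, txt_feat):
            # Element-wise product of the projections approximates the pairwise
            # bilinear interactions between the two modalities.
            joint = self.dropout(self.proj_img(img_feat) * self.proj_txt(txt_feat))
            # Sum-pool each group of k factors: (B, out_dim * k) -> (B, out_dim).
            joint = joint.view(-1, self.out_dim, self.k).sum(dim=2)
            # Signed square root and L2 normalization stabilize the fused feature.
            joint = torch.sign(joint) * torch.sqrt(joint.abs() + 1e-12)
            return F.normalize(joint, dim=1)

    # Example: fuse a batch of four image/text feature pairs into 1000-d vectors.
    fusion = FBPFusion()
    fused = fusion(torch.randn(4, 2048), torch.randn(4, 300))  # shape: (4, 1000)

    In the architecture the abstract describes, a fused vector like this would then feed both the emotion classifier and the adversarial branch's fusion-prediction discriminator; the exact adversarial formulation is not specified here, so that part is left out of the sketch.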


    Cited By

    • (2024) Navigating the Multimodal Landscape: A Review on Integration of Text and Image Data in Machine Learning Architectures. Machine Learning and Knowledge Extraction, 6(3), 1545-1563. DOI: 10.3390/make6030074. Online publication date: 9 July 2024.
    • (2023) Scoping Review on Image-Text Multimodal Machine Learning Models. 2023 International Conference on Computational Science and Computational Intelligence (CSCI), 186-192. DOI: 10.1109/CSCI62032.2023.00035. Online publication date: 13 December 2023.


    Published In

    CSAE '21: Proceedings of the 5th International Conference on Computer Science and Application Engineering
    October 2021
    660 pages
    ISBN:9781450389853
    DOI:10.1145/3487075
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

    Publisher

    Association for Computing Machinery

    New York, NY, United States



    Author Tags

    1. Adversarial learning
    2. Factorized bilinear pooling
    3. Multimodal emotion recognition

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Conference

    CSAE 2021

    Acceptance Rates

    Overall Acceptance Rate 368 of 770 submissions, 48%

