Search | arXiv e-print repository

Deep Imbalanced Learning for Multimodal Emotion Recognition in Conversations

Authors: Tao Meng, Yuntao Shou, Wei Ai, Nan Yin, Keqin Li

Abstract: The main task of Multimodal Emotion Recognition in Conversations (MERC) is to identify the emotions in modalities, e.g., text, audio, image and video, which is a significant development direction for realizing machine intelligence. However, many data in MERC naturally exhibit an imbalanced distribution of emotion categories, and researchers ignore the negative impact of imbalanced data on emotion… ▽ More The main task of Multimodal Emotion Recognition in Conversations (MERC) is to identify the emotions in modalities, e.g., text, audio, image and video, which is a significant development direction for realizing machine intelligence. However, many data in MERC naturally exhibit an imbalanced distribution of emotion categories, and researchers ignore the negative impact of imbalanced data on emotion recognition. To tackle this problem, we systematically analyze it from three aspects: data augmentation, loss sensitivity, and sampling strategy, and propose the Class Boundary Enhanced Representation Learning (CBERL) model. Concretely, we first design a multimodal generative adversarial network to address the imbalanced distribution of {emotion} categories in raw data. Secondly, a deep joint variational autoencoder is proposed to fuse complementary semantic information across modalities and obtain discriminative feature representations. Finally, we implement a multi-task graph neural network with mask reconstruction and classification optimization to solve the problem of overfitting and underfitting in class boundary learning, and achieve cross-modal emotion recognition. We have conducted extensive experiments on the IEMOCAP and MELD benchmark datasets, and the results show that CBERL has achieved a certain performance improvement in the effectiveness of emotion recognition. Especially on the minority class fear and disgust emotion labels, our model improves the accuracy and F1 value by 10% to 20%. △ Less

Submitted 11 December, 2023; originally announced December 2023.

Comments: 16 pages, 9 figures

arXiv:2309.07927 [pdf, ps, other]

Kid-Whisper: Towards Bridging the Performance Gap in Automatic Speech Recognition for Children VS. Adults

Authors: Ahmed Adel Attia, Jing Liu, Wei Ai, Dorottya Demszky, Carol Espy-Wilson

Abstract: Recent advancements in Automatic Speech Recognition (ASR) systems, exemplified by Whisper, have demonstrated the potential of these systems to approach human-level performance given sufficient data. However, this progress doesn't readily extend to ASR for children due to the limited availability of suitable child-specific databases and the distinct characteristics of children's speech. A recent st… ▽ More Recent advancements in Automatic Speech Recognition (ASR) systems, exemplified by Whisper, have demonstrated the potential of these systems to approach human-level performance given sufficient data. However, this progress doesn't readily extend to ASR for children due to the limited availability of suitable child-specific databases and the distinct characteristics of children's speech. A recent study investigated leveraging the My Science Tutor (MyST) children's speech corpus to enhance Whisper's performance in recognizing children's speech. They were able to demonstrate some improvement on a limited testset. This paper builds on these findings by enhancing the utility of the MyST dataset through more efficient data preprocessing. We reduce the Word Error Rate (WER) on the MyST testset 13.93% to 9.11% with Whisper-Small and from 13.23% to 8.61% with Whisper-Medium and show that this improvement can be generalized to unseen datasets. We also highlight important challenges towards improving children's ASR performance. The results showcase the viable and efficient integration of Whisper for effective children's speech recognition. △ Less

Submitted 15 May, 2024; v1 submitted 12 September, 2023; originally announced September 2023.

arXiv:2105.14678 [pdf, other]

Image-to-Video Generation via 3D Facial Dynamics

Authors: Xiaoguang Tu, Yingtian Zou, Jian Zhao, Wenjie Ai, Jian Dong, Yuan Yao, Zhikang Wang, Guodong Guo, Zhifeng Li, Wei Liu, Jiashi Feng

Abstract: We present a versatile model, FaceAnime, for various video generation tasks from still images. Video generation from a single face image is an interesting problem and usually tackled by utilizing Generative Adversarial Networks (GANs) to integrate information from the input face image and a sequence of sparse facial landmarks. However, the generated face images usually suffer from quality loss, im… ▽ More We present a versatile model, FaceAnime, for various video generation tasks from still images. Video generation from a single face image is an interesting problem and usually tackled by utilizing Generative Adversarial Networks (GANs) to integrate information from the input face image and a sequence of sparse facial landmarks. However, the generated face images usually suffer from quality loss, image distortion, identity change, and expression mismatching due to the weak representation capacity of the facial landmarks. In this paper, we propose to "imagine" a face video from a single face image according to the reconstructed 3D face dynamics, aiming to generate a realistic and identity-preserving face video, with precisely predicted pose and facial expression. The 3D dynamics reveal changes of the facial expression and motion, and can serve as a strong prior knowledge for guiding highly realistic face video generation. In particular, we explore face video prediction and exploit a well-designed 3D dynamic prediction network to predict a 3D dynamic sequence for a single face image. The 3D dynamics are then further rendered by the sparse texture mapping algorithm to recover structural details and sparse textures for generating face frames. Our model is versatile for various AR/VR and entertainment applications, such as face video retargeting and face video prediction. Superior experimental results have well demonstrated its effectiveness in generating high-fidelity, identity-preserving, and visually pleasant face video clips from a single source face image. △ Less

Submitted 30 May, 2021; originally announced May 2021.

arXiv:2011.09078 [pdf, other]

Vertical-Horizontal Structured Attention for Generating Music with Chords

Authors: Yizhou Zhao, Liang Qiu, Wensi Ai, Feng Shi, Song-Chun Zhu

Abstract: In this paper, we propose a lightweight music-generating model based on variational autoencoder (VAE) with structured attention. Generating music is different from generating text because the melodies with chords give listeners distinguished polyphonic feelings. In a piece of music, a chord consisting of multiple notes comes from either the mixture of multiple instruments or the combination of mul… ▽ More In this paper, we propose a lightweight music-generating model based on variational autoencoder (VAE) with structured attention. Generating music is different from generating text because the melodies with chords give listeners distinguished polyphonic feelings. In a piece of music, a chord consisting of multiple notes comes from either the mixture of multiple instruments or the combination of multiple keys of a single instrument. We focus our study on the latter. Our model captures not only the temporal relations along time but the structure relations between keys. Experimental results show that our model has a better performance than baseline MusicVAE in capturing notes in a chord. Besides, our method accords with music theory since it maintains the configuration of the circle of fifths, distinguishes major and minor keys from interval vectors, and manifests meaningful structures between music phrases. △ Less

Submitted 17 November, 2020; originally announced November 2020.

arXiv:2005.10455 [pdf, other]

Single Image Super-Resolution via Residual Neuron Attention Networks

Authors: Wenjie Ai, Xiaoguang Tu, Shilei Cheng, Mei Xie

Abstract: Deep Convolutional Neural Networks (DCNNs) have achieved impressive performance in Single Image Super-Resolution (SISR). To further improve the performance, existing CNN-based methods generally focus on designing deeper architecture of the network. However, we argue blindly increasing network's depth is not the most sensible way. In this paper, we propose a novel end-to-end Residual Neuron Attenti… ▽ More Deep Convolutional Neural Networks (DCNNs) have achieved impressive performance in Single Image Super-Resolution (SISR). To further improve the performance, existing CNN-based methods generally focus on designing deeper architecture of the network. However, we argue blindly increasing network's depth is not the most sensible way. In this paper, we propose a novel end-to-end Residual Neuron Attention Networks (RNAN) for more efficient and effective SISR. Structurally, our RNAN is a sequential integration of the well-designed Global Context-enhanced Residual Groups (GCRGs), which extracts super-resolved features from coarse to fine. Our GCRG is designed with two novelties. Firstly, the Residual Neuron Attention (RNA) mechanism is proposed in each block of GCRG to reveal the relevance of neurons for better feature representation. Furthermore, the Global Context (GC) block is embedded into RNAN at the end of each GCRG for effectively modeling the global contextual information. Experiments results demonstrate that our RNAN achieves the comparable results with state-of-the-art methods in terms of both quantitative metrics and visual quality, however, with simplified network architecture. △ Less

Submitted 21 May, 2020; originally announced May 2020.

Comments: 6 pages, 4 figures, Accepted by IEEE ICIP 2020

arXiv:1703.01107 [pdf, other]

An intracardiac electrogram model to bridge virtual hearts and implantable cardiac devices

Authors: Weiwei Ai, Nitish Patel, Partha Roop, Avinash Malik, Nathan Allen, Mark L. Trew

Abstract: Virtual heart models have been proposed to enhance the safety of implantable cardiac devices through closed loop validation. To communicate with a virtual heart, devices have been driven by cardiac signals at specific sites. As a result, only the action potentials of these sites are sensed. However, the real device implanted in the heart will sense a complex combination of near and far-field extra… ▽ More Virtual heart models have been proposed to enhance the safety of implantable cardiac devices through closed loop validation. To communicate with a virtual heart, devices have been driven by cardiac signals at specific sites. As a result, only the action potentials of these sites are sensed. However, the real device implanted in the heart will sense a complex combination of near and far-field extracellular potential signals. Therefore many device functions, such as blanking periods and refractory periods, are designed to handle these unexpected signals. To represent these signals, we develop an intracardiac electrogram (IEGM) model as an interface between the virtual heart and the device. The model can capture not only the local excitation but also far-field signals and pacing afterpotentials. Moreover, the sensing controller can specify unipolar or bipolar electrogram (EGM) sensing configurations and introduce various oversensing and undersensing modes. The simulation results show that the model is able to reproduce clinically observed sensing problems, which significantly extends the capabilities of the virtual heart model in the context of device validation. △ Less

Submitted 3 March, 2017; originally announced March 2017.

arXiv:1603.05315 [pdf, ps, other]

Towards the Emulation of the Cardiac Conduction System for Pacemaker Testing

Authors: Eugene Yip, Sidharta Andalam, Partha S. Roop, Avinash Malik, Mark Trew, Weiwei Ai, Nitish Patel

Abstract: The heart is a vital organ that relies on the orchestrated propagation of electrical stimuli to coordinate each heart beat. Abnormalities in the heart's electrical behaviour can be managed with a cardiac pacemaker. Recently, the closed-loop testing of pacemakers with an emulation (real-time simulation) of the heart has been proposed. An emulated heart would provide realistic reactions to the pacem… ▽ More The heart is a vital organ that relies on the orchestrated propagation of electrical stimuli to coordinate each heart beat. Abnormalities in the heart's electrical behaviour can be managed with a cardiac pacemaker. Recently, the closed-loop testing of pacemakers with an emulation (real-time simulation) of the heart has been proposed. An emulated heart would provide realistic reactions to the pacemaker as if it were a real heart. This enables developers to interrogate their pacemaker design without having to engage in costly or lengthy clinical trials. Many high-fidelity heart models have been developed, but are too computationally intensive to be simulated in real-time. Heart models, designed specifically for the closed-loop testing of pacemakers, are too abstract to be useful in the testing of physical pacemakers. In the context of pacemaker testing, this paper presents a more computationally efficient heart model that generates realistic continuous-time electrical signals. The heart model is composed of cardiac cells that are connected by paths. Significant improvements were made to an existing cardiac cell model to stabilise its activation behaviour and to an existing path model to capture the behaviour of continuous electrical propagation. We provide simulation results that show our ability to faithfully model complex re-entrant circuits (that cause arrhythmia) that existing heart models can not. △ Less

Submitted 17 March, 2016; v1 submitted 16 March, 2016; originally announced March 2016.

Showing 1–7 of 7 results for author: Ai, W