DOI: 10.1145/3382507.3418830

LDNN: Linguistic Knowledge Injectable Deep Neural Network for Group Cohesiveness Understanding

Published: 22 October 2020

Abstract

Group cohesiveness reflects the level of intimacy that people feel with each other, and the development of a dialogue robot that can understand group cohesiveness will help promote human communication. However, group cohesiveness is a complex concept that is difficult to predict from image pixels alone. Inspired by the fact that humans intuitively associate linguistic knowledge accumulated in the brain with the visual images they see, we propose a linguistic knowledge injectable deep neural network (LDNN) that builds a visual model (visual LDNN) for predicting group cohesiveness, one that can automatically associate the linguistic knowledge hidden behind images. LDNN consists of a visual encoder and a language encoder, and applies domain adaptation and linguistic knowledge transition mechanisms to transfer linguistic knowledge from a language model into the visual LDNN. We train LDNN by adding descriptions to the training and validation sets of the Group AFfect Dataset 3.0 (GAF 3.0), and test the visual LDNN without any descriptions. Comparing the visual LDNN with various fine-tuned DNN models and three state-of-the-art models on the test set, the results demonstrate that the visual LDNN not only improves the performance of the fine-tuned DNN model, achieving an MSE very close to that of the state-of-the-art models, but is also a practical and efficient method that requires relatively little preprocessing. Furthermore, ablation studies confirm that LDNN is an effective way to inject linguistic knowledge into visual models.
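The abstract describes the architecture only at a high level (a visual encoder, a language encoder, and a knowledge transition mechanism), so the following is a minimal PyTorch-style sketch of the general idea rather than the authors' implementation: a visual regressor whose intermediate embedding lives in a space where it can later be aligned with a language embedding, while the cohesiveness score is predicted from the visual branch alone. The module name VisualLDNN and the dimensions feat_dim and embed_dim are illustrative assumptions.

```python
# Minimal sketch (PyTorch), NOT the authors' released code: a visual model that
# exposes both a cohesiveness score and an intermediate embedding, so that
# linguistic knowledge can be injected during training via an alignment loss.
import torch
import torch.nn as nn

class VisualLDNN(nn.Module):
    def __init__(self, feat_dim: int = 2048, embed_dim: int = 768):
        super().__init__()
        # project pooled visual backbone features into a shared embedding space
        self.project = nn.Sequential(nn.Linear(feat_dim, embed_dim), nn.ReLU())
        # regression head that predicts the group cohesiveness score
        self.head = nn.Linear(embed_dim, 1)

    def forward(self, visual_feats: torch.Tensor):
        z = self.project(visual_feats)    # shared-space visual embedding
        score = self.head(z).squeeze(-1)  # scalar cohesiveness prediction
        return score, z
```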

Supplementary Material

MP4 File (3382507.3418830.mp4)
In this presentation video, we introduce the paper "LDNN: Linguistic Knowledge Injectable Deep Neural Network for Group Cohesiveness Understanding". The contributions of this paper can be summarized as follows: 1) we propose LDNN, which transforms the linguistic knowledge distilled from a language model and transfers it into a single visual model; 2) we train a linguistic-knowledge-injected visual model using both language and visual modalities in the training phase and only the visual modality in the inference phase; 3) we expand the existing GAF 3.0 dataset by adding a description to each video and show performance comparable to that of state-of-the-art models that use multimodal information, while using only a single modality.
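The video summary emphasizes the training/inference asymmetry: descriptions are available only during training, and the visual model runs alone at test time. The hedged example below illustrates one simple way such a setup could look, assuming text_embed is a precomputed, frozen sentence embedding of the description (e.g., from BERT) with the same dimensionality as the visual embedding produced by the VisualLDNN sketch above; the loss weight alpha and helper names are assumptions for illustration only.

```python
# Hypothetical training step and visual-only inference; uses the VisualLDNN sketch above.
import torch
import torch.nn as nn

mse = nn.MSELoss()

def train_step(model, optimizer, visual_feats, text_embed, target, alpha=0.5):
    score, z = model(visual_feats)
    # supervised regression loss plus an alignment term that pulls the visual
    # embedding toward the language embedding (one simple way to "inject" knowledge)
    loss = mse(score, target) + alpha * mse(z, text_embed)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

@torch.no_grad()
def predict(model, visual_feats):
    # test time: no description is required, matching the single-modality setting
    score, _ = model(visual_feats)
    return score
```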


Cited By

  • VideoAdviser: Video Knowledge Distillation for Multimodal Transfer Learning. IEEE Access, vol. 11, pp. 51229-51240, 2023. DOI: 10.1109/ACCESS.2023.3280187
  • Using Valence Emotion to Predict Group Cohesion’s Dynamics: Top-down and Bottom-up Approaches. In 2021 9th International Conference on Affective Computing and Intelligent Interaction (ACII), pp. 1-8, 28 September 2021. DOI: 10.1109/ACII52823.2021.9597429



      Published In

      ICMI '20: Proceedings of the 2020 International Conference on Multimodal Interaction
      October 2020
      920 pages
      ISBN: 9781450375818
      DOI: 10.1145/3382507

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 22 October 2020


      Author Tags

      1. affective computing
      2. human interaction
      3. machine learning for multimodal interaction
      4. multimodal fusion and representation

      Qualifiers

      • Research-article

      Conference

      ICMI '20: International Conference on Multimodal Interaction
      October 25-29, 2020
      Virtual Event, Netherlands

      Acceptance Rates

      Overall Acceptance Rate 453 of 1,080 submissions, 42%
