DOI: 10.1145/3382507.3418830

LDNN: Linguistic Knowledge Injectable Deep Neural Network for Group Cohesiveness Understanding

Published: 22 October 2020

Abstract

Group cohesiveness reflects the level of intimacy that people feel with each other, and the development of a dialogue robot that can understand group cohesiveness will help promote human communication. However, group cohesiveness is a complex concept that is difficult to predict from image pixels alone. Inspired by the fact that humans intuitively associate linguistic knowledge accumulated in the brain with the visual images they see, we propose a linguistic knowledge injectable deep neural network (LDNN) that builds a visual model (visual LDNN) for predicting group cohesiveness, one that can automatically associate the linguistic knowledge hidden behind images. LDNN consists of a visual encoder and a language encoder, and applies domain adaptation and linguistic knowledge transition mechanisms to transfer linguistic knowledge from a language model into the visual LDNN. We train LDNN by adding descriptions to the training and validation sets of the Group AFfect Dataset 3.0 (GAF 3.0), and test the visual LDNN without any descriptions. Comparing the visual LDNN with various fine-tuned DNN models and three state-of-the-art models on the test set, the results demonstrate that the visual LDNN not only improves the performance of the fine-tuned DNN model, achieving an MSE very close to that of the state-of-the-art models, but is also a practical and efficient method that requires relatively little preprocessing. Furthermore, ablation studies confirm that LDNN is an effective way to inject linguistic knowledge into visual models.
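The abstract describes the architecture only at a high level (a visual encoder, a language encoder, and a knowledge transition mechanism), so the following is a minimal PyTorch-style sketch of the general idea rather than the authors' implementation: a visual regressor whose intermediate embedding lives in a space where it can later be aligned with a language embedding, while the cohesiveness score is predicted from the visual branch alone. The module name VisualLDNN and the dimensions feat_dim and embed_dim are illustrative assumptions.

```python
# Minimal sketch (PyTorch), NOT the authors' released code: a visual model that
# exposes both a cohesiveness score and an intermediate embedding, so that
# linguistic knowledge can be injected during training via an alignment loss.
import torch
import torch.nn as nn

class VisualLDNN(nn.Module):
    def __init__(self, feat_dim: int = 2048, embed_dim: int = 768):
        super().__init__()
        # project pooled visual backbone features into a shared embedding space
        self.project = nn.Sequential(nn.Linear(feat_dim, embed_dim), nn.ReLU())
        # regression head that predicts the group cohesiveness score
        self.head = nn.Linear(embed_dim, 1)

    def forward(self, visual_feats: torch.Tensor):
        z = self.project(visual_feats)    # shared-space visual embedding
        score = self.head(z).squeeze(-1)  # scalar cohesiveness prediction
        return score, z
```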

Supplementary Material

MP4 File (3382507.3418830.mp4)
In this presentation video, we introduce the paper "LDNN: Linguistic Knowledge Injectable Deep Neural Network for Group Cohesiveness Understanding". The contributions of this paper can be summarized as follows: 1) we propose LDNN, which transforms the linguistic knowledge distilled from a language model and transfers it into a single visual model; 2) we train a linguistic-knowledge-injected visual model using both language and visual modalities in the training phase and only the visual modality in the inference phase; 3) we expand the existing GAF 3.0 dataset by adding a description to each video and show performance comparable to that of state-of-the-art models that use multimodal information, while using only a single modality.
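The video summary emphasizes the training/inference asymmetry: descriptions are available only during training, and the visual model runs alone at test time. The hedged example below illustrates one simple way such a setup could look, assuming text_embed is a precomputed, frozen sentence embedding of the description (e.g., from BERT) with the same dimensionality as the visual embedding produced by the VisualLDNN sketch above; the loss weight alpha and helper names are assumptions for illustration only.

```python
# Hypothetical training step and visual-only inference; uses the VisualLDNN sketch above.
import torch
import torch.nn as nn

mse = nn.MSELoss()

def train_step(model, optimizer, visual_feats, text_embed, target, alpha=0.5):
    score, z = model(visual_feats)
    # supervised regression loss plus an alignment term that pulls the visual
    # embedding toward the language embedding (one simple way to "inject" knowledge)
    loss = mse(score, target) + alpha * mse(z, text_embed)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

@torch.no_grad()
def predict(model, visual_feats):
    # test time: no description is required, matching the single-modality setting
    score, _ = model(visual_feats)
    return score
```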


Cited By

  • VideoAdviser: Video Knowledge Distillation for Multimodal Transfer Learning. IEEE Access, vol. 11, pp. 51229-51240, 2023. DOI: 10.1109/ACCESS.2023.3280187
  • Using Valence Emotion to Predict Group Cohesion’s Dynamics: Top-down and Bottom-up Approaches. In 2021 9th International Conference on Affective Computing and Intelligent Interaction (ACII), pp. 1-8, 28 September 2021. DOI: 10.1109/ACII52823.2021.9597429



      Published In

      ICMI '20: Proceedings of the 2020 International Conference on Multimodal Interaction
      October 2020
      920 pages
      ISBN: 9781450375818
      DOI: 10.1145/3382507

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 22 October 2020


      Author Tags

      1. affective computing
      2. human interaction
      3. machine learning for multimodal interaction
      4. multimodal fusion and representation

      Qualifiers

      • Research-article

      Conference

      ICMI '20: International Conference on Multimodal Interaction
      October 25-29, 2020
      Virtual Event, Netherlands

      Acceptance Rates

      Overall Acceptance Rate 453 of 1,080 submissions, 42%
