Research Article
DOI: 10.1145/3678957.3685752

SEMPI: A Database for Understanding Social Engagement in Video-Mediated Multiparty Interaction

Published: 04 November 2024

Abstract

We present a database for automatic understanding of Social Engagement in MultiParty Interaction (SEMPI). Social engagement is an important social signal characterizing an interlocutor's level of participation in a conversation; it involves maintaining attention and establishing connection and rapport. Machine understanding of social engagement can enable an autonomous agent to better assess the state of human participation and involvement and to select optimal actions in human-machine social interaction. Recently, video-mediated interaction platforms, e.g., Zoom, have become very popular. The ease of use and increased accessibility of video calls have made them a preferred medium for multiparty conversations, including support groups and group therapy sessions. To create this dataset, we first collected a set of publicly available video calls posted on YouTube. We then segmented the videos by speech turn and cropped them to generate single-participant videos. We developed a questionnaire for assessing listeners' level of social engagement in a conversation, probing the relevant nonverbal behaviors, including back-channeling, gaze, and expressions. Using Prolific, a crowd-sourcing platform, we had each of the 3,505 videos of 76 listeners annotated by three raters, reaching a moderate to high inter-rater agreement of 0.693. The result is a database with engagement scores aggregated across annotators. We also developed a baseline multimodal pipeline that uses state-of-the-art pre-trained models to track the level of engagement, achieving a concordance correlation coefficient (CCC) of 0.454. The results demonstrate the utility of the database for future applications in video-mediated human-machine interaction and human-human social skill assessment. Our dataset and code are available at https://github.com/ihp-lab/SEMPI.
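
For reference, the concordance correlation coefficient (CCC) reported above quantifies agreement between predicted and annotator-provided engagement scores. Below is a minimal sketch of how CCC can be computed with NumPy; the function name and the toy score arrays are illustrative and are not taken from the SEMPI codebase.

```python
import numpy as np

def concordance_correlation_coefficient(y_true, y_pred):
    """Lin's concordance correlation coefficient between two 1-D score arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mean_true, mean_pred = y_true.mean(), y_pred.mean()
    var_true, var_pred = y_true.var(), y_pred.var()  # population variances
    covariance = np.mean((y_true - mean_true) * (y_pred - mean_pred))
    # CCC = 2*cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y))^2)
    return 2 * covariance / (var_true + var_pred + (mean_true - mean_pred) ** 2)

# Toy usage with illustrative engagement scores (not real SEMPI annotations).
truth = [0.2, 0.5, 0.7, 0.9, 0.4]
pred = [0.3, 0.4, 0.8, 0.8, 0.5]
print(f"CCC = {concordance_correlation_coefficient(truth, pred):.3f}")
```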

Published In

ICMI '24: Proceedings of the 26th International Conference on Multimodal Interaction
November 2024, 725 pages
ISBN: 9798400704628
DOI: 10.1145/3678957

Publisher

Association for Computing Machinery, New York, NY, United States


Author Tags

1. Dataset
2. Engagement
3. Machine Learning
4. Multiparty Interaction


Conference

ICMI '24: International Conference on Multimodal Interaction
November 4-8, 2024
San Jose, Costa Rica

Acceptance Rates

Overall Acceptance Rate: 453 of 1,080 submissions, 42%
