DOI: 10.1145/3652988.3696198
demonstration

Estuary: A Framework For Building Multimodal Low-Latency Real-Time Socially Interactive Agents

Published: 26 December 2024

Abstract

The rise in capability and ubiquity of generative artificial intelligence (AI) technologies has enabled their application to the field of Socially Interactive Agents (SIAs). Despite significant interest in leveraging modern AI-powered components in real-time SIA research, substantial friction remains due to the absence of a standardized and universal SIA framework. To address this, we developed Estuary: a multimodal (text, audio, and soon video) framework that facilitates the development of low-latency, real-time SIAs. Estuary seeks to reduce repeated work between studies and to provide a flexible platform that can run entirely off-cloud to maximize configurability, controllability, reproducibility of studies, and speed of agent response times. We achieve this by constructing a robust multimodal framework that incorporates current and future components seamlessly into a modular and interoperable architecture.

Supplemental Material

MP4 File
5 Minute Technical Demo Video



Published In

IVA '24: Proceedings of the 24th ACM International Conference on Intelligent Virtual Agents
September 2024
337 pages
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

  • Demonstration
  • Research
  • Refereed limited

Conference

IVA '24: ACM International Conference on Intelligent Virtual Agents
September 16 - 19, 2024
Glasgow, United Kingdom

Acceptance Rates

Overall Acceptance Rate 53 of 196 submissions, 27%

Article Metrics

  • Total Citations: 0
  • Total Downloads: 42
  • Downloads (Last 12 months): 42
  • Downloads (Last 6 weeks): 15

Reflects downloads up to 25 Feb 2025
