demonstration

Estuary: A Framework For Building Multimodal Low-Latency Real-Time Socially Interactive Agents

Authors:

Caitlín Sullivan,

Scott Fisher

IVA '24: Proceedings of the 24th ACM International Conference on Intelligent Virtual Agents

Article No.: 50, Pages 1 - 3

https://doi.org/10.1145/3652988.3696198

Published: 26 December 2024

Abstract

The growing capability and ubiquity of generative artificial intelligence (AI) technologies have enabled their application to the field of Socially Interactive Agents (SIAs). Despite significant interest in leveraging modern AI-powered components in real-time SIA research, substantial friction remains due to the absence of a standardized and universal SIA framework. To address this, we developed Estuary, a multimodal (text, audio, and soon video) framework that facilitates the development of low-latency, real-time SIAs. Estuary seeks to reduce repeated work between studies and to provide a flexible platform that can be run entirely off-cloud to maximize configurability, controllability, reproducibility of studies, and speed of agent response times. We achieve this by constructing a robust multimodal framework that incorporates current and future components into a modular and interoperable architecture.
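
The abstract does not describe Estuary's interfaces, so the following is only an illustrative sketch of what a "modular and interoperable" agent pipeline could look like in general. Every name here (`Pipeline`, `add`, `replace`, `run`, and the stand-in `fake_asr`, `fake_llm`, `fake_tts` stages) is hypothetical and is not taken from Estuary. The sketch assumes each component can be modeled as a function from string to string, chains the components in order, and records per-stage latency, since response time is one of the stated goals.

```python
from dataclasses import dataclass, field
from time import perf_counter
from typing import Callable, Dict, List, Tuple

# Hypothetical type: a stage is any callable that maps one payload to the next.
# This is an assumption for illustration, not Estuary's actual API.
Stage = Callable[[str], str]


@dataclass
class Pipeline:
    """Ordered chain of named, swappable stages (for example ASR -> LLM -> TTS)."""

    stages: List[Tuple[str, Stage]] = field(default_factory=list)

    def add(self, name: str, stage: Stage) -> "Pipeline":
        """Append a stage and return self so calls can be chained."""
        self.stages.append((name, stage))
        return self

    def replace(self, name: str, stage: Stage) -> None:
        """Swap the implementation of an existing stage without touching the others."""
        for i, (existing, _) in enumerate(self.stages):
            if existing == name:
                self.stages[i] = (name, stage)
                return
        raise KeyError(name)

    def run(self, payload: str) -> Tuple[str, Dict[str, float]]:
        """Pass the payload through every stage, timing each one in seconds."""
        timings: Dict[str, float] = {}
        for name, stage in self.stages:
            start = perf_counter()
            payload = stage(payload)
            timings[name] = perf_counter() - start
        return payload, timings


# Stand-in components; real ones would wrap speech recognition,
# a language model, and speech synthesis.
def fake_asr(audio: str) -> str:
    return audio.replace("audio:", "", 1)


def fake_llm(text: str) -> str:
    return f"You said: {text}"


def fake_tts(text: str) -> str:
    return f"speech:{text}"


pipeline = Pipeline().add("asr", fake_asr).add("llm", fake_llm).add("tts", fake_tts)
reply, timings = pipeline.run("audio:hello")
```

Under these assumptions, swapping one component (say, a different language model) is a single `pipeline.replace("llm", other_model)` call, and the `timings` dictionary shows which stage dominates the response time.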

Supplemental Material

MP4 File

5 Minute Technical Demo Video

Published In

IVA '24: Proceedings of the 24th ACM International Conference on Intelligent Virtual Agents

September 2024

337 pages

ISBN:9798400706257

DOI:10.1145/3652988

Copyright © 2024 Owner/Author.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Sponsors

SIGAI: ACM Special Interest Group on Artificial Intelligence

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Demonstration
Research
Refereed limited

Conference

IVA '24

Sponsor:

SIGAI

IVA '24: ACM International Conference on Intelligent Virtual Agents

September 16 - 19, 2024

Glasgow, United Kingdom

Acceptance Rates

Overall Acceptance Rate 53 of 196 submissions, 27%
