* Denotes equal contribution.

This manuscript is a pre-print version of a paper accepted for publication at the ACM Intelligent Virtual Agents (IVA) 2024 Conference [DOI: 10.1145/3652988.3696198] [ACM ISBN: 979-8-4007-0625-7/24/09].

Estuary: A Framework For Building Multimodal Low-Latency Real-Time Socially Interactive Agents

Spencer Lin*, Basem Rizk*, Miru Jun*, Andy Artze, Caitlín Sullivan, Sharon Mozgai, Scott Fisher

University of Southern California

Abstract

The rise in capability and ubiquity of generative artificial intelligence (AI) technologies has enabled their application to the field of Socially Interactive Agents (SIAs). Despite rising interest in modern AI-powered components used for real-time SIA research, substantial friction remains due to the absence of a standardized and universal SIA framework. To address this gap, we developed Estuary: a multimodal (text, audio, and soon video) framework which facilitates the development of low-latency, real-time SIAs. Estuary seeks to reduce repeat work between studies and to provide a flexible platform that can be run entirely off-cloud to maximize configurability, controllability, reproducibility of studies, and speed of agent response times. We achieve this by constructing a robust multimodal framework which incorporates current and future components seamlessly into a modular and interoperable architecture.

Figure 1: Estuary’s various capabilities in Augmented Reality and Virtual Reality environments.

1 Background

Rapid advancements in AI have catalyzed the development of SIAs, which integrate complex technologies to facilitate nuanced human-computer interactions across various domains, producing affective virtual agents [15], multi-agent simulations of human behavior [20, 27, 13], and more [16, 17, 25]. To build effective SIAs, multiple microservices and components need to be integrated and managed, including Automatic Speech Recognition (ASR) and Text-To-Speech (TTS). Implementing these features requires significant effort. A modern, comprehensive SIA framework is useful to streamline efforts and reduce redundant work [18] and to provide a scalable and interoperable standard that can support new models and microservices.

1.1 SIA & Conversational AI Frameworks.

Several existing toolkits help streamline the process of building SIAs. The Virtual Human Toolkit (VHToolkit) [14] and Greta [22] are two such examples; however, they do not support generative AI models such as Large Language Models (LLMs). Moreover, a recent wave of lightweight, high-performing models (Speech-To-Text (STT) [24, 21], LLM [5], and TTS [9]) can be run on-edge to overcome the privacy concerns associated with cloud-based services. These gaps and improvements in current tools further motivate the creation of a cohesive framework tailored for SIAs.

Recent projects such as Pipecat [12] and NVIDIA ACE [3] provide frameworks for building conversational agents by connecting AI microservices. However, Pipecat is not tailored for building SIAs and lacks integration with game engines, which SIA research heavily relies upon. NVIDIA ACE, at the time of writing, is not open-source and restricts users to only running NVIDIA-approved AI microservices. This may be a critical limitation for research involving custom AI microservices and pipelines. Furthermore, NVIDIA ACE requires a cost-prohibitive enterprise plan if developers would like to host their microservices off-cloud or on-prem.

1.2 Limitations of Current Approaches.

Currently, several factors hinder SIA research: 1) computational limitations of devices (e.g., head-mounted displays (HMDs)) which prohibit running advanced AI models on a standalone device, 2) hardware architecture incompatibilities with AI models which necessitate an inferencing server, and 3) high latency of cloud-based microservices. To address these shortcomings, we developed Estuary, a distributed framework for low-latency real-time SIAs. Estuary simplifies the development process by seamlessly integrating user-defined modules for ASR, TTS, dialogue management, and LLMs into pipelines for SIAs. This allows it to support low-latency real-time voice interactions, understand the physical world, and facilitate reproducible and highly configurable experiments.

2 System

Estuary brings five core value propositions: 1) an interoperable microservice architecture, 2) multi-platform support, 3) off-cloud capabilities, 4) support for multimodal input and analysis, and 5) an open-source nature that opens it to community contributions.

2.1 Microservice Architecture.

We use a modular design that wraps any local model or online API service within a Stage. Stages are asynchronous and parallelizable, as denoted in Figure 3. A Stage component wraps the ML inference logic and hosts it on a child process, aggregating inputs and dispatching outputs. A selection of Stages (e.g., Whisper [23], GPT-3.5 [7], gTTS [11]) is connected into a $Pipeline$ according to the flow of choice. The $Pipeline$ internally orchestrates the flow of a standardized data type, $DataWindow$, which consists of one or more $DataPacket$s (e.g., $AudioPacket$, $VisionPacket$, $TextPacket$). Each $DataPacket$, tagged with its source and creation timestamp, acts as an identifiable placeholder for a Stage outcome. This microservice architecture allows us to plug and play new models rapidly by implementing them as Stages. Out of the box, a $Pipeline$ runs as a background process on a server that can communicate with any client through the SocketIO protocol [4].
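To make the Stage and Pipeline relationship concrete, the following is a minimal sketch and not Estuary's actual API. The class names mirror the terms used above, but every field, method, and signature here is an assumption chosen for illustration. For brevity the sketch runs synchronously in a single process, whereas the framework described above hosts each Stage asynchronously on a child process; the three stand-in stages are plain string functions rather than real ASR, LLM, or TTS models.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Iterable, List


@dataclass
class DataPacket:
    """One unit of data, tagged with its source and creation timestamp."""
    source: str
    payload: str
    timestamp: float = field(default_factory=time.time)


@dataclass
class DataWindow:
    """One or more packets handed from one stage to the next."""
    packets: List[DataPacket]


class Stage:
    """Hypothetical wrapper that hides a model call behind a uniform interface."""

    def __init__(self, name: str, fn: Callable[[str], str]) -> None:
        self.name = name
        self.fn = fn

    def process(self, window: DataWindow) -> DataWindow:
        # Apply the wrapped function to every packet and tag the output
        # with this stage's name so downstream consumers know its origin.
        return DataWindow(
            [DataPacket(self.name, self.fn(p.payload)) for p in window.packets]
        )


class Pipeline:
    """Hypothetical orchestrator that passes a window through stages in order."""

    def __init__(self, stages: Iterable[Stage]) -> None:
        self.stages = list(stages)

    def run(self, window: DataWindow) -> DataWindow:
        for stage in self.stages:
            window = stage.process(window)
        return window


# Stand-in stages: simple string transforms instead of real models.
pipeline = Pipeline([
    Stage("asr", lambda audio: audio.replace("<audio:", "").rstrip(">")),
    Stage("llm", lambda text: f"You said: {text}"),
    Stage("tts", lambda text: f"<speech:{text}>"),
])

result = pipeline.run(DataWindow([DataPacket("client", "<audio:hello>")]))
print(result.packets[0].source, result.packets[0].payload)
```

Under these assumptions, swapping a model amounts to replacing one Stage in the list, which is the plug-and-play property the architecture aims for.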

2.2 Multi-platform.

Estuary is a distributed framework that uses the SocketIO protocol to establish a connection between a client device and a host device running our framework. It can support any platform, including those supported by the Unity game engine, provided that the hardware has a microphone and a speaker and can communicate using SocketIO. This hardware-agnostic support makes HMDs and other devices usable as clients by overcoming their compute limitations or hardware architecture incompatibilities.
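The thin-client idea can be illustrated with a short sketch of how a client might frame captured audio before sending it to the host. This is an assumption-laden illustration, not Estuary's client code: the event name `audio_chunk`, the message fields, and the 3200-byte chunk size (100 ms of 16 kHz, 16-bit mono audio) are all hypothetical. The transport is injected as a callable so the sketch stays self-contained; with the python-socketio package, the `emit` method of a connected `socketio.Client` could plausibly be passed in, although that pairing is also an assumption.

```python
import time
from typing import Any, Callable, Dict, Iterator


def chunk_audio(pcm: bytes, chunk_bytes: int) -> Iterator[bytes]:
    """Split a raw PCM buffer into fixed-size chunks (the last may be shorter)."""
    for start in range(0, len(pcm), chunk_bytes):
        yield pcm[start:start + chunk_bytes]


def stream_audio(
    pcm: bytes,
    emit: Callable[[str, Dict[str, Any]], None],
    chunk_bytes: int = 3200,      # hypothetical: 100 ms at 16 kHz, 16-bit mono
    event: str = "audio_chunk",   # hypothetical event name
) -> int:
    """Send each chunk through `emit` with a sequence number and timestamp."""
    sent = 0
    for seq, chunk in enumerate(chunk_audio(pcm, chunk_bytes)):
        emit(event, {"seq": seq, "timestamp": time.time(), "audio": chunk})
        sent += 1
    return sent


# Usage with an in-memory stand-in for the network transport.
outbox = []
count = stream_audio(b"\x00" * 8000, lambda ev, data: outbox.append((ev, data)))
print(count, [len(data["audio"]) for _, data in outbox])
```

Because the client only captures, frames, and plays back audio, the heavy inference work stays on the host, which is what lets compute-limited devices participate.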

2.3 Off-Cloud.

In addition to leveraging cloud APIs, Estuary can be hosted locally and/or entirely with off-cloud microservices. This reduces latency by eliminating multiple cloud endpoints, improves reproducibility, and adds privacy and security as no data ever enters the cloud. As shown in our video demo, it takes approximately 1.2 to 2.5 seconds (compared to an average 2.8 second latency from ChatGPT-4o [19]) from the end of a user’s dialogue to the first utterance from the SIA’s TTS module through a FasterWhisperBase.EN [2] $\rightarrow$ GPT-3.5 API [7] $\rightarrow$ XTTS [9] pipeline on a desktop with an RTX 4090 graphics card. This is made possible through several optimizations such as simultaneously streaming the LLM and TTS response. Furthermore, off-cloud microservices ensure reproducibility, which is of utmost importance especially in fields relating to psychology [10]. In Estuary, the exact same versions of LLM, ASR, TTS, and other microservices can be maintained and loaded for future use, whereas cloud-based services are not guaranteed to remain unchanged over time.
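One common way to overlap LLM generation with speech synthesis is to cut the token stream into sentences and hand each sentence to TTS as soon as it is complete, so the first audio can start before the full reply exists. The sketch below illustrates that general technique under stated assumptions; it is not taken from Estuary's source, the function names are hypothetical, and the punctuation-based sentence boundary is deliberately naive (it would split on decimals such as "1.2").

```python
from typing import Callable, Iterable, Iterator

SENTENCE_END = (".", "!", "?")


def sentences_from_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences as soon as they appear in a token stream."""
    buffer = ""
    for token in tokens:
        buffer += token
        # Naive boundary rule: flush whenever the buffer ends in punctuation.
        if buffer.rstrip().endswith(SENTENCE_END):
            yield buffer.strip()
            buffer = ""
    if buffer.strip():
        # Flush any trailing text that never reached a sentence boundary.
        yield buffer.strip()


def speak_streaming(
    tokens: Iterable[str], synthesize: Callable[[str], str]
) -> Iterator[str]:
    """Synthesize each sentence immediately instead of waiting for the full reply."""
    for sentence in sentences_from_tokens(tokens):
        yield synthesize(sentence)


# Stand-in for a streamed LLM reply and a TTS call (both hypothetical).
llm_tokens = ["Hi", " there", ".", " How", " are", " you", "?", " Bye"]
audio_clips = list(speak_streaming(llm_tokens, lambda s: f"<speech:{s}>"))
print(audio_clips)
```

With this arrangement the time to first utterance depends on the length of the first sentence rather than on the length of the whole response.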

2.4 Multimodal Input And Analysis.

Multimodal input [19] is critical for agents to understand the physical world, and combining multiple modalities produces a better cognitive model [6, 8]. Estuary integrates with Unity and packages like ARKit [26] to empower embodied agents with basic semantic understanding of the physical world and pathfinding capabilities in Augmented Reality (AR). Currently, Estuary supports text and audio datastreams. We plan to expand to video as well, which will shift reliance away from hardware-specific packages.

2.5 Open-Source.

Estuary is open-source and promotes growth from community contribution. By nature, Estuary is flexible and hugely extensible to support integration of microservices now and into the future. Researchers and developers have the freedom to choose what models they use and what data is collected from anywhere in the pipeline, all without limitations imposed by paid services. Estuary is built as a robust, modern framework that can empower SIA research for years to come.

3 Demonstration

Our first scenario consists of a computer hosting Estuary interfaced with an HMD over a local network to demonstrate an advanced embodied conversational agent that can classify objects in its physical surroundings and interact accordingly. The second scenario consists of a computer hosting both Estuary and a desktop frontend to demonstrate the versatility of our framework. Estuary’s source code can be found on our GitHub (https://github.com/Al-Estuary), and a demo video is available on our website (https://estuary-ai.github.io/).

4 Acknowledgments

We would like to thank our teammates and advisors from the 2022 and 2023 NASA SUITS Challenge whose efforts contributed to the success of this project. We would also like to thank all open-source developers and creatives whose work we respectfully used.

References

[1]
Faster Whisper [n. d.] Faster Whisper. [n. d.]. https://github.com/SYSTRAN/faster-whisper?tab=readme-ov-file
NVIDIA ACE [n. d.] NVIDIA ACE. [n. d.]. https://developer.nvidia.com/ace
Socket.IO [n. d.] Socket.IO. [n. d.]. https://socket.io/
Abdin et al. [2024] Marah Abdin, Sam Ade Jacobs, Ammar Ahmad Awan, Jyoti Aneja, Ahmed Awadallah, Hany Awadalla, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Harkirat Behl, et al. 2024. Phi-3 technical report: A highly capable language model locally on your phone. arXiv preprint arXiv:2404.14219 (2024).
Baltrusaitis et al. [2018] Tadas Baltrusaitis, Amir Zadeh, Yao Chong Lim, and Louis-Philippe Morency. 2018. OpenFace 2.0: Facial Behavior Analysis Toolkit. In 2018 13th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2018). 59–66. https://doi.org/10.1109/FG.2018.00019
Brown et al. [2020] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language Models are Few-Shot Learners. CoRR abs/2005.14165 (2020). arXiv:2005.14165 https://arxiv.org/abs/2005.14165
Cao et al. [2019] Zhe Cao, Gines Hidalgo, Tomas Simon, Shih-En Wei, and Yaser Sheikh. 2019. OpenPose: Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields. arXiv:1812.08008
Casanova et al. [2024] Edresson Casanova, Kelly Davis, Eren Gölge, Görkem Göknar, Iulian Gulea, Logan Hart, Aya Aljafari, Joshua Meyer, Reuben Morais, Samuel Olayemi, and Julian Weber. 2024. XTTS: a Massively Multilingual Zero-Shot Text-to-Speech Model. arXiv:2406.04904 [eess.AS]
Collaboration [2015] Open Science Collaboration. 2015. Estimating the reproducibility of psychological science. Science 349, 6251 (2015), aac4716.
Durette [2024] Pierre Nicolas Durette. 2024. gTTS. https://github.com/pndurette/gTTS.
Flaqué et al. [2024] Aleix Conchillo Flaqué, Moishe Lettvin, Kwindla Hultman Kramer, chadbailey59, Jon Taylor, Thomas B., Liza, James Hush, and Rahul Nair. 2024. pipecat-ai/pipecat. https://github.com/pipecat-ai/pipecat
Gao et al. [2023] Chen Gao, Xiaochong Lan, Zhihong Lu, Jinzhu Mao, Jinghua Piao, Huandong Wang, Depeng Jin, and Yong Li. 2023. S3: Social-network Simulation System with Large Language Model-Empowered Agents. arXiv preprint arXiv:2307.14984 (2023).
Hartholt et al. [2013] Arno Hartholt, David Traum, Stacy C Marsella, Ari Shapiro, Giota Stratou, Anton Leuski, Louis-Philippe Morency, and Jonathan Gratch. 2013. All together now: Introducing the virtual human toolkit. In Intelligent Virtual Agents: 13th International Conference, IVA 2013, Edinburgh, UK, August 29-31, 2013. Proceedings 13. Springer, 368–381.
Hasan et al. [2023] Masum Hasan, Cengiz Ozel, Sammy Potter, and Ehsan Hoque. 2023. SAPIEN: affective virtual agents powered by large language models. In 2023 11th International Conference on Affective Computing and Intelligent Interaction Workshops and Demos (ACIIW). IEEE, 1–3.
Li et al. [n. d.] Nian Li, Chen Gao, Mingyu Li, Yong Li, and Qingmin Liao. [n. d.]. EconAgent: Large Language Model-Empowered Agents for Simulating Macroeconomic Activities.
Liang et al. [2023] Tian Liang, Zhiwei He, Wenxiang Jiao, Xing Wang, Yan Wang, Rui Wang, Yujiu Yang, Zhaopeng Tu, and Shuming Shi. 2023. Encouraging divergent thinking in large language models through multi-agent debate. arXiv preprint arXiv:2305.19118 (2023).
Lugrin et al. [2022] Birgit Lugrin, Catherine Pelachaud, and David Traum. 2022. The Handbook on Socially Interactive Agents: 20 Years of Research on Embodied Conversational Agents, Intelligent Virtual Agents, and Social Robotics Volume 2: Interactivity, Platforms, Application. ACM.
OpenAI [2024] OpenAI. 2024. Hello GPT-4o. https://openai.com/index/hello-gpt-4o/
Park et al. [2023] Joon Sung Park, Joseph O’Brien, Carrie Jun Cai, Meredith Ringel Morris, Percy Liang, and Michael S Bernstein. 2023. Generative agents: Interactive simulacra of human behavior. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology. 1–22.
Peinl et al. [2020] René Peinl, Basem Rizk, and Robert Szabad. 2020. Open source speech recognition on edge devices. In 2020 10th International Conference on Advanced Computer Information Technologies (ACIT). IEEE, 441–445.
Poggi et al. [2005] Isabella Poggi, Catherine Pelachaud, Fiorella de Rosis, Valeria Carofiglio, and Berardina De Carolis. 2005. Greta. a believable embodied conversational agent. In Multimodal intelligent information presentation. Springer, 3–25.
Radford et al. [2023] Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. 2023. Robust speech recognition via large-scale weak supervision. In International Conference on Machine Learning. PMLR, 28492–28518.
Rizk [2019] Basem Rizk. 2019. Evaluation of state of art open-source ASR engines with local inferencing. In Evaluation of State Of Art Open-source ASR Engines with Local Inferencing. Vol. 8.
Shaikh et al. [2024] Omar Shaikh, Valentino Emil Chai, Michele Gelfand, Diyi Yang, and Michael S Bernstein. 2024. Rehearsal: Simulating conflict to teach conflict resolution. In Proceedings of the CHI Conference on Human Factors in Computing Systems. 1–20.
Unity [2024] Unity. 2024. UnityARKitDocumentation. https://docs.unity3d.com/Packages/com.unity.xr.arkit@5.1/manual/
Wang et al. [2023] Zhilin Wang, Yu Ying Chiu, and Yu Cheung Chiu. 2023. Humanoid agents: Platform for simulating human-like generative agents. arXiv preprint arXiv:2310.05418 (2023).