
Putting the Smarts into Robot Bodies

Fan Wang and Shaoshan Liu provide guidance for the development of embodied AI systems.
Published: 20 February 2025
Fan Wang and Shaoshan Liu
Building Foundation Models for Embodied Artificial Intelligence
July 15, 2024
Embodied Artificial Intelligence (EAI) involves embedding artificial intelligence into tangible entities, such as robots, equipping them with the capacity to perceive, learn from, and engage dynamically with their surroundings. In this article we delve into the key tradeoffs of building foundation models for EAI systems.

Foundation Models for Embodied AI

We have previously outlined three guiding principles for developing embodied artificial intelligence (EAI) systems [1]. First, EAI systems should not depend on predefined, complex logic to handle specific scenarios; instead, they must incorporate evolutionary learning mechanisms that enable continuous adaptation to their operational environments. Second, the environment significantly influences not only physical behaviors but also cognitive structures. While the third principle focuses on simulation, the first two emphasize building EAI foundation models capable of learning from the EAI systems’ operating environments.
A common approach for EAI foundation models is to directly utilize pretrained large models. For example, pretrained GPT models can serve as a baseline, followed by fine-tuning and in-context learning (ICL) to enhance performance [9]. These large models typically possess a substantial number of parameters to encode extensive world knowledge and feature a small context window for fast response times. This extensive pre-encoding allows these models to deliver excellent zero-shot performance. However, their limited context windows pose challenges for continuous learning from the EAI systems’ operating environments and connecting various usage scenarios.
Alternatively, another approach leverages models with significantly fewer parameters but a larger context window. These models, rather than encoding comprehensive world knowledge, focus on learning how to learn, or meta-learning [2]. With large context windows, these models can perform general-purpose in-context learning (GPICL), enabling continuous learning from their operating environments and establishing connections across a broad context.
The Figure below illustrates these two different approaches. The meta-training + GPICL approach, while exhibiting poorer zero-shot performance and having a smaller model size, excels in continuously learning from its environment, eventually specializing EAI systems for specific tasks. In contrast, the pretraining + fine-tuning + ICL approach, characterized by a larger model size and smaller context windows, offers superior zero-shot performance but inferior learning capabilities.
Figure. Foundation Model Options for EAI. Credit: Fan Wang
Empirical evidence supporting this can be found in the GPT-3 paper, where a 7B few-shot model outperforms a 175B zero-shot model [3]. If few-shot learning is replaced by a long context window that enables EAI systems to learn from their operating environments, performance may improve further.
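To make the distinction concrete, the short Python sketch below (our own illustration; the task, the interaction log, and the `model.generate` call are hypothetical placeholders rather than any specific EAI system's API) contrasts a zero-shot prompt with a GPICL-style prompt that packs a robot's own environment interactions into a long context while the model's parameters stay frozen.

```python
# Illustrative only: the task, the interaction log, and `model.generate` are
# hypothetical placeholders, not part of any specific EAI system or API.

# Zero-shot usage of a pretrained model: all knowledge must already be in the weights.
zero_shot_prompt = "Instruction: stack the red block on the blue block.\nPlan:"

# GPICL-style usage: the robot's own recent interactions are packed into a long
# context window, so adaptation happens without any parameter update.
interaction_log = [
    "grasp red block with 20 N force -> slipped",
    "grasp red block with 35 N force -> success",
    "release block from 5 cm above blue block -> toppled",
    "release block from 1 cm above blue block -> success",
]
gpicl_prompt = (
    "Instruction: stack the red block on the blue block.\n"
    "Previous interactions in this environment:\n"
    + "\n".join(f"- {step}" for step in interaction_log)
    + "\nPlan:"
)

print(gpicl_prompt)
# plan = model.generate(gpicl_prompt)  # same frozen model, better-informed behavior
```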
We envision an ideal foundation model for EAI that meets several critical criteria. First, it should be capable of universally learning from complex instructions, demonstrations, and feedback without relying on hand-crafted optimization techniques. Second, it should demonstrate high sample efficiency in its learning and adaptation processes. Third, it must be able to learn continuously through contextual information, avoiding catastrophic forgetting. These criteria point to the meta-training + GPICL approach. Before committing to it, however, let us examine the tradeoffs between the two approaches.

Key Tradeoffs

In this section, we review the tradeoffs between pretrained large models and meta-training + GPICL as foundation models for EAI [4]. The results are summarized in the table below.
Table. Tradeoffs of pretrained large models vs. meta-training + GPICL. Credit: Fan Wang

| Comparison | Pretraining + Fine-Tuning + ICL | Meta-Training + GPICL |
| --- | --- | --- |
| Zero-shot capability | High | Low |
| Generalizability | In-distribution tasks; rudimentary out-of-distribution tasks | Diverse and complex out-of-distribution tasks |
| Knowledge carrier | Parameters | Memory / hidden states |
| Scalability enhancement approach | Scaling up parameters and pre-training datasets | Scaling up meta-training tasks, context length, memories, and hidden states |
| Methodology of task adaptation | Data collection (fine-tuning, inefficient); rudimentary instruction and prompt (ICL) | Very complex instruction; explore and exploit automatically |
| Emphasis of pre-training / meta-training stage | World knowledge, knowledge regarding the hardware | The capability of learning, memorization, and abstraction |
| Emphasis of post-training stage | Human alignment, task-specific knowledge | World knowledge, human alignment, task-specific knowledge |
| Inference latency | Low | High |
| Memory size | Small | Large |
For zero-shot capability, the Pretraining + Fine-Tuning + ICL approach [9] offers high performance, allowing models to generalize well to new tasks without any task-specific fine-tuning. In contrast, the Meta-Training + GPICL approach exhibits low zero-shot capability, as it focuses on learning to adapt to a wide variety of tasks using in-context learning rather than zero-shot generalization.
In terms of generalizability, the Pretraining + Fine-Tuning + ICL approach performs well on in-distribution tasks but has rudimentary capabilities for out-of-distribution tasks. Meta-Training + GPICL, on the other hand, exhibits diverse and complex generalization capabilities for out-of-distribution tasks due to its emphasis on meta-training over varied contexts.
The scalability enhancement approach for Pretraining + Fine-Tuning + ICL involves scaling up parameters and pre-training datasets to improve performance. Meta-Training + GPICL enhances scalability by scaling up meta-training tasks, context length, memories, and hidden states to improve the model’s adaptability.
Regarding task adaptation, Pretraining + Fine-Tuning + ICL relies on data collection and fine-tuning, which can be inefficient. In contrast, Meta-Training + GPICL utilizes very complex instructions and learns from diverse contexts automatically.
During the pre-training or meta-training stage, Pretraining + Fine-Tuning + ICL focuses on world knowledge and understanding the hardware. Meta-Training + GPICL emphasizes the capability of learning, memorization, and abstraction over a wide variety of tasks.
In the post-training stage, Pretraining + Fine-Tuning + ICL involves aligning the model to specific human-centric tasks, emphasizing human-alignment and task-specific knowledge. Meta-Training + GPICL continues to emphasize world knowledge, human-alignment, and task-specific knowledge.
Inference latency is generally low for Pretraining + Fine-Tuning + ICL as the model parameters are fixed after training. However, for Meta-Training + GPICL, inference can be slower due to the need to utilize and update memory and hidden states dynamically.
Memory size requirements for Pretraining + Fine-Tuning + ICL are small, as most knowledge is embedded in fixed model parameters. Conversely, Meta-Training + GPICL requires significant memory to handle complex instructions, extended context, and hidden states.
Meta-Training + GPICL offers the advantage of enabling the system to continuously learn various tasks through contexts, i.e., learning to continually learn [7]. This essentially requires the system to learn new tasks without forgetting the old ones, which typically poses a great challenge for gradient-based fine-tuning (catastrophic forgetting [8]) but is less of a challenge with in-context learning.
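As a minimal sketch of this point (our own illustration, not code from the cited works), the context can be treated as an append-only store of per-task demonstrations: adding a new task never overwrites whatever encodes the old ones, which is precisely what gradient updates cannot guarantee.

```python
# Minimal sketch (our illustration, not code from the cited papers) of why
# in-context continual learning sidesteps catastrophic forgetting: new tasks
# extend the context, they never overwrite whatever encodes the old tasks.

from collections import OrderedDict


class InContextTaskMemory:
    """Append-only store of per-task demonstrations packed into one context."""

    def __init__(self, max_chars=32_000):
        self.tasks = OrderedDict()   # task name -> list of demonstration strings
        self.max_chars = max_chars   # crude stand-in for the model's context limit

    def add_demonstration(self, task, demo):
        # Adding a task touches nothing that encodes earlier tasks.
        self.tasks.setdefault(task, []).append(demo)

    def build_context(self):
        lines = []
        for task, demos in self.tasks.items():
            lines.append(f"Task: {task}")
            lines.extend(f"  demo: {d}" for d in demos)
        context = "\n".join(lines)
        # If the window overflows, we drop the oldest characters -- a graceful
        # degradation, unlike gradient updates that silently corrupt old skills.
        return context[-self.max_chars:]


memory = InContextTaskMemory()
memory.add_demonstration("open-drawer", "pull handle 10 cm, then release")
memory.add_demonstration("wipe-table", "apply 5 N downward force, sweep left to right")
print(memory.build_context())
```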

Overcoming the Computing and Memory Bottlenecks

From the above comparison, it is evident that meta-training combined with GPICL offers superior adaptability and generalization across diverse and complex tasks. However, this approach demands higher resources, posing a challenge for most EAI systems, which are often real-time edge devices with limited computational capabilities and memory. The large context windows required for this approach can significantly increase inference time and memory footprint, potentially hindering its feasibility for EAI foundation models.
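A back-of-envelope calculation illustrates the problem. Assuming a generic 7B-class Transformer (32 layers, hidden size 4,096, 16-bit key/value entries; these numbers are our assumptions, not figures from the article), the key-value cache of standard attention grows linearly with context length:

```python
# Back-of-envelope sketch (our own illustration) of why naive long contexts are
# hard on edge devices: the key/value cache of plain attention grows linearly
# with context length. Model numbers are assumptions for a generic 7B-class
# Transformer, not figures from the article.

def kv_cache_bytes(num_layers: int, hidden_size: int, context_len: int,
                   bytes_per_value: int = 2) -> int:
    """Size of the KV cache for standard multi-head attention (keys + values)."""
    return 2 * num_layers * hidden_size * context_len * bytes_per_value

for tokens in (4_096, 128_000, 1_000_000):
    gib = kv_cache_bytes(num_layers=32, hidden_size=4096, context_len=tokens) / 2**30
    print(f"{tokens:>9,} tokens -> {gib:8.1f} GiB of KV cache")
# A 1M-token context needs hundreds of GiB, which is why the bounded-memory
# attention schemes discussed next matter for EAI hardware.
```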
Fortunately, recent advancements have introduced innovative solutions to scale Transformer-based large language models (LLMs) for processing infinitely long inputs while maintaining bounded memory and computational efficiency. A notable innovation is the Infini-attention mechanism [5], which integrates masked local attention and long-term linear attention within a single Transformer block. This enables efficient processing of both short- and long-range contextual dependencies. Additionally, its compressive memory system allows the model to maintain and retrieve information with bounded storage and computation costs, reusing old key-value (KV) states to enhance memory efficiency and enable fast streaming inference. Experimental results demonstrate that the Infini-attention model outperforms baseline models on long-context language modeling benchmarks, showing superior performance on tasks involving extremely long input sequences (up to 1 million tokens) and significant improvements in memory efficiency and perplexity.
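The following simplified, single-head NumPy sketch conveys the core recurrence described in the Infini-attention paper [5]; multi-head structure, the optional delta-rule memory update, the projection and output layers, and all training details are omitted, and the gate is a fixed scalar rather than a learned parameter.

```python
# Simplified single-head sketch of the Infini-attention idea from [5]:
# masked local attention within a segment, plus a compressive memory that is
# read with linear attention and updated after each segment.

import numpy as np

def elu_plus_one(x):
    return np.where(x > 0, x + 1.0, np.exp(x))      # sigma(x) = ELU(x) + 1

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def infini_attention_segment(Q, K, V, M, z, beta):
    """Process one segment; return outputs and the updated compressive memory."""
    d = Q.shape[-1]

    # 1) Standard masked local attention over the current segment only.
    scores = Q @ K.T / np.sqrt(d)
    scores = np.where(np.tril(np.ones_like(scores)) == 1, scores, -np.inf)
    A_local = softmax(scores) @ V

    # 2) Read long-range context from the compressive memory (linear attention).
    sQ = elu_plus_one(Q)
    A_mem = (sQ @ M) / (sQ @ z[:, None] + 1e-8)

    # 3) Gate the two streams (a learned scalar in the paper; fixed here).
    g = 1.0 / (1.0 + np.exp(-beta))
    A = g * A_mem + (1.0 - g) * A_local

    # 4) Fold this segment's keys/values into the fixed-size memory.
    sK = elu_plus_one(K)
    M = M + sK.T @ V
    z = z + sK.sum(axis=0)
    return A, M, z

# Toy usage: stream three segments through one attention layer.
rng = np.random.default_rng(0)
d_model, seg_len = 16, 8
M = np.zeros((d_model, d_model))   # compressive memory: size is independent of history
z = np.zeros(d_model)              # normalization term
for _ in range(3):
    Q, K, V = (rng.standard_normal((seg_len, d_model)) for _ in range(3))
    out, M, z = infini_attention_segment(Q, K, V, M, z, beta=0.0)
```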
Similarly, the StreamingLLM framework [6] enables large models trained with a finite attention window to generalize to infinite sequence lengths without fine-tuning. This is achieved by preserving the key and value (KV) states of the initial tokens as attention sinks, along with those of the most recent tokens, which stabilizes attention computation and maintains performance over extended texts. StreamingLLM can model texts of up to 4 million tokens and provides a speedup of up to 22.2 times.
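The cache policy behind this is simple enough to sketch directly. The snippet below is our own illustration of the eviction rule, assuming a generic token-by-token KV cache; the positional re-indexing inside the cache that StreamingLLM also performs is omitted, and this is not the authors' implementation.

```python
# Minimal sketch of the StreamingLLM cache policy described in [6]: keep the
# key/value entries of a few initial "attention sink" tokens plus a rolling
# window of the most recent tokens, evicting everything in between.

from collections import deque

class SinkCache:
    def __init__(self, n_sink: int = 4, window: int = 1024):
        self.n_sink = n_sink
        self.sink = []                      # KV of the first n_sink tokens, kept forever
        self.recent = deque(maxlen=window)  # rolling window of the latest tokens' KV

    def append(self, kv_entry):
        if len(self.sink) < self.n_sink:
            self.sink.append(kv_entry)
        else:
            self.recent.append(kv_entry)    # deque silently evicts the oldest entry

    def current_cache(self):
        # Attention at each step runs over at most n_sink + window entries,
        # so memory stays bounded no matter how long the stream gets.
        return self.sink + list(self.recent)

cache = SinkCache(n_sink=4, window=8)
for token_position in range(100):
    cache.append(("key", "value", token_position))
print(len(cache.current_cache()))   # 12, not 100
```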

Conclusion

In conclusion, we believe that learning from the environment is the essential feature of EAI systems, and thus the meta-training + GPICL approach is promising for building EAI foundation models because it provides better long-term adaptability and generalization. Although this approach currently faces significant challenges in computing and memory usage, we believe that innovations such as Infini-attention and StreamingLLM will soon make it viable for real-time, resource-constrained environments.
Fan Wang is a Distinguished Architect at Baidu working on AI systems. He holds a Master of Science degree from the engineering school at the University of Colorado at Boulder, and a Bachelor of Science degree from the University of Science and Technology of China. Fan specializes in Reinforcement Learning, Natural Language Processing, AI for Sciences, and Robotics.
Shaoshan Liu is a member of the ACM U.S. Technology Policy Committee, and a member of the U.S. National Academy of Public Administration’s Technology Leadership Panel Advisory Group. His educational background includes a Ph.D. in Computer Engineering from the University of California Irvine, and a master’s degree in public administration from Harvard Kennedy School.

References

[1] A Brief History of Embodied Artificial Intelligence, and its Outlook. Communications of the ACM; https://cacm.acm.org/blogcacm/a-brief-history-of-embodied-artificial-intelligence-and-its-future-outlook/
[2] Kirsch, L., Harrison, J., Sohl-Dickstein, J., and Metz, L. General-purpose in-context learning by meta-learning transformers. arXiv preprint arXiv:2212.04458 (2022).
[3] Brown, T. et al. Language models are few-shot learners. Advances in Neural Information Processing Systems 33 (2020), 1877–1901.
[4] Wang, F., Lin, C., Cao, Y., and Kang, Y. Benchmarking general purpose in-context learning. arXiv preprint arXiv:2405.17234 (2024).
[5] Munkhdalai, T., Faruqui, M., and Gopal, S. Leave no context behind: Efficient infinite context transformers with Infini-attention. arXiv preprint arXiv:2404.07143 (2024).
[6] Xiao, G. et al. Efficient streaming language models with attention sinks. arXiv preprint arXiv:2309.17453 (2023).
[7] Beaulieu, S. et al. Learning to continually learn. ECAI 2020. IOS Press (2020), 992–1001.
[8] French, R.M. Catastrophic forgetting in connectionist networks. Trends in Cognitive Sciences 3, 4 (1999), 128–135.
[9] Ouyang, L. et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems 35 (2022), 27730–27744.

Published in Communications of the ACM, Volume 68, Issue 3 (March 2025), Association for Computing Machinery, New York, NY. EISSN 1557-7317; DOI: 10.1145/3719036. Published: 21 February 2025; Online First: 20 February 2025.

Check for updates

Qualifiers

  • Opinion

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • 0
    Total Citations
  • 1,504
    Total Downloads
  • Downloads (Last 12 months)1,504
  • Downloads (Last 6 weeks)1,504
Reflects downloads up to 28 Feb 2025

Other Metrics

Citations

View Options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Digital Edition

View this article in digital edition.

Digital Edition

Magazine Site

View this article on the magazine site (external)

Magazine Site

Login options

Full Access

Figures

Tables

Media

Share

Share

Share this Publication link

Share on social media