MOKA: Open-Vocabulary Robotic Manipulation through Mark-Based Visual Prompting

Liu, Fangchen; Fang, Kuan; Abbeel, Pieter; Levine, Sergey

arXiv:2403.03174v2 (cs)

[Submitted on 5 Mar 2024 (v1), revised 19 Aug 2024 (this version, v2), latest version 4 Sep 2024 (v3)]

Abstract: Open-world generalization requires robotic systems to have a profound understanding of the physical world and the user command to solve diverse and complex tasks. While the recent advancement in vision-language models (VLMs) has offered unprecedented opportunities to solve open-world problems, how to leverage their capabilities to control robots remains a grand challenge. In this paper, we present MOKA (Marking Open-vocabulary Keypoint Affordances), an approach that employs VLMs to solve robotic manipulation tasks specified by free-form language instructions. Central to our approach is a compact point-based representation of affordance, which bridges the VLM's predictions on observed images and the robot's actions in the physical world. By prompting the pre-trained VLM, our approach utilizes the VLM's commonsense knowledge and concept understanding acquired from broad data sources to predict affordances and generate motions. To facilitate the VLM's reasoning in zero-shot and few-shot manners, we propose a visual prompting technique that annotates marks on images, converting affordance reasoning into a series of visual question-answering problems that are solvable by the VLM. We further explore methods to enhance performance with robot experiences collected by MOKA through in-context learning and policy distillation. We evaluate and analyze MOKA's performance on various table-top manipulation tasks including tool use, deformable body manipulation, and object rearrangement.

Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Cite as: arXiv:2403.03174 [cs.RO] (or arXiv:2403.03174v2 [cs.RO] for this version)
DOI: https://doi.org/10.48550/arXiv.2403.03174

Submission history

From: Fangchen Liu
[v1] Tue, 5 Mar 2024 18:08:45 UTC (11,673 KB)
[v2] Mon, 19 Aug 2024 21:47:42 UTC (11,794 KB)
[v3] Wed, 4 Sep 2024 01:18:13 UTC (11,794 KB)
