LangNav: Language as a Perceptual Representation for Navigation

Pan, Bowen; Panda, Rameswar; Jin, SouYoung; Feris, Rogerio; Oliva, Aude; Isola, Phillip; Kim, Yoon

arXiv:2310.07889 (cs)

[Submitted on 11 Oct 2023 (v1), last revised 30 Mar 2024 (this version, v2)]

Abstract: We explore the use of language as a perceptual representation for vision-and-language navigation (VLN), with a focus on low-data settings. Our approach uses off-the-shelf vision systems for image captioning and object detection to convert an agent's egocentric panoramic view at each time step into natural language descriptions. We then finetune a pretrained language model to select an action, based on the current view and the trajectory history, that would best fulfill the navigation instructions. In contrast to the standard setup which adapts a pretrained language model to work directly with continuous visual features from pretrained vision models, our approach instead uses (discrete) language as the perceptual representation. We explore several use cases of our language-based navigation (LangNav) approach on the R2R VLN benchmark: generating synthetic trajectories from a prompted language model (GPT-4) with which to finetune a smaller language model; domain transfer where we transfer a policy learned on one simulated environment (ALFRED) to another (more realistic) environment (R2R); and combining both vision- and language-based representations for VLN. Our approach is found to improve upon baselines that rely on visual features in settings where only a few expert trajectories (10-100) are available, demonstrating the potential of language as a perceptual representation for navigation.
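
As a rough illustration of the idea of turning a panoramic view into text, the sketch below assembles a text prompt from per-direction captions and detected object names, together with the instruction and the trajectory history. It is only a minimal sketch under stated assumptions: the names `ViewDescription`, `describe_view`, and `build_prompt`, the prompt layout, and the direction labels are all hypothetical and are not taken from the paper. The captioning, object detection, and language model steps are not shown; their outputs are assumed to be plain strings.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ViewDescription:
    """Text stand-in for one direction of the panorama (hypothetical structure)."""
    heading: str                                       # e.g. "front", "left"
    caption: str                                       # assumed output of a captioning model
    objects: List[str] = field(default_factory=list)  # assumed output of an object detector


def describe_view(view: ViewDescription) -> str:
    """Render one view direction as a single line of text."""
    line = f"Direction {view.heading}: {view.caption}"
    if view.objects:
        line += " Objects: " + ", ".join(view.objects) + "."
    return line


def build_prompt(instruction: str,
                 history: List[str],
                 views: List[ViewDescription]) -> str:
    """Combine instruction, trajectory history, and current view into one prompt.

    The layout is an assumption made for illustration, not the paper's format.
    """
    parts = [f"Instruction: {instruction}"]
    for step, past in enumerate(history, start=1):
        parts.append(f"Step {step}: {past}")
    parts.append("Current view:")
    parts.extend(describe_view(v) for v in views)
    parts.append("Next action:")
    return "\n".join(parts)
```

A language model finetuned on such prompts would then be asked to continue the text after "Next action:" with one of the available moves. How candidate actions are encoded and scored is left open here.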

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Robotics (cs.RO)
Cite as:	arXiv:2310.07889 [cs.CV]
	(or arXiv:2310.07889v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2310.07889

Submission history

From: Bowen Pan
[v1] Wed, 11 Oct 2023 20:52:30 UTC (1,830 KB)
[v2] Sat, 30 Mar 2024 22:00:22 UTC (1,591 KB)
