Is a 3D-Tokenized LLM the Key to Reliable Autonomous Driving?

Bai, Yifan; Wu, Dongming; Liu, Yingfei; Jia, Fan; Mao, Weixin; Zhang, Ziheng; Zhao, Yucheng; Shen, Jianbing; Wei, Xing; Wang, Tiancai; Zhang, Xiangyu

arXiv:2405.18361v1 (cs)

[Submitted on 28 May 2024]

Abstract: Rapid advancements in Autonomous Driving (AD) tasks have driven a significant shift toward end-to-end approaches, particularly the use of vision-language models (VLMs) that integrate robust logical reasoning and cognitive abilities to enable comprehensive end-to-end planning. However, these VLM-based approaches tend to pair 2D vision tokenizers with a large language model (LLM) for ego-car planning, and therefore lack the 3D geometric priors that are a cornerstone of reliable planning. This observation raises a critical question: can a 2D-tokenized LLM accurately perceive the 3D environment? Our evaluation of current VLM-based methods across 3D object detection, vectorized map construction, and environment captioning suggests that the answer is, unfortunately, no. In other words, a 2D-tokenized LLM fails to provide reliable autonomous driving. In response, we introduce DETR-style 3D perceptrons as 3D tokenizers, which are connected to the LLM through a one-layer linear projector. This simple yet elegant strategy, termed Atlas, harnesses the inherent priors of the 3D physical world, enabling it to simultaneously process high-resolution multi-view images and employ spatiotemporal modeling. Despite its simplicity, Atlas demonstrates superior performance in both 3D detection and ego planning tasks on the nuScenes dataset, proving that a 3D-tokenized LLM is the key to reliable autonomous driving. The code and datasets will be released.
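
To make the "one-layer linear projector" idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes the 3D tokenizer emits a fixed number of query vectors and that the projected tokens are simply placed before the text embeddings. All names, dimensions, the weight initialization, and the token ordering are hypothetical choices for illustration; the abstract specifies none of them.

```python
import random


def make_linear_projector(in_dim, out_dim, seed=0):
    """Create weight and bias for a single linear layer (hypothetical init)."""
    rng = random.Random(seed)
    scale = 1.0 / in_dim ** 0.5
    weight = [[rng.uniform(-scale, scale) for _ in range(in_dim)]
              for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weight, bias


def project_tokens(tokens, weight, bias):
    """Apply y = W x + b to each token independently."""
    return [
        [sum(w * x for w, x in zip(row, token)) + b
         for row, b in zip(weight, bias)]
        for token in tokens
    ]


def build_llm_input(projected_3d_tokens, text_embeddings):
    """Concatenate projected 3D tokens with text embeddings (assumed ordering)."""
    return projected_3d_tokens + text_embeddings


# Toy sizes, chosen only for illustration.
QUERY_DIM = 4    # width of each 3D query token (assumption)
LLM_DIM = 6      # width of the LLM embedding space (assumption)
NUM_QUERIES = 3  # number of 3D tokens from the tokenizer (assumption)

rng = random.Random(1)
# Stand-in for the output of a DETR-style 3D tokenizer.
query_tokens = [[rng.gauss(0.0, 1.0) for _ in range(QUERY_DIM)]
                for _ in range(NUM_QUERIES)]

weight, bias = make_linear_projector(QUERY_DIM, LLM_DIM)
projected = project_tokens(query_tokens, weight, bias)

# Stand-in for two embedded text tokens.
text_embeddings = [[0.0] * LLM_DIM for _ in range(2)]
sequence = build_llm_input(projected, text_embeddings)

print(len(sequence), len(sequence[0]))  # prints: 5 6
```

The point of the sketch is only that a single matrix multiply plus bias is enough to change the token width from the tokenizer's dimension to the LLM's dimension; everything upstream (the 3D tokenizer) and downstream (the LLM) is replaced here by placeholder data.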

Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2405.18361 [cs.CV]
	(or arXiv:2405.18361v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2405.18361

Submission history

From: Yifan Bai
[v1] Tue, 28 May 2024 16:57:44 UTC (19,872 KB)
