Is a 3D-Tokenized LLM the Key to Reliable Autonomous Driving?

Yifan Bai¹† Dongming Wu²*† Yingfei Liu³ Fan Jia³ Weixin Mao³ Ziheng Zhang³
Yucheng Zhao³ Jianbing Shen⁴‡ Xing Wei¹‡ Tiancai Wang³‡ Xiangyu Zhang³
¹ Xi’an Jiaotong University, ² Beijing Institute of Technology,
³ MEGVII Technology, ⁴ SKL-IOTSC, University of Macau
yfbai@stu.xjtu.edu.cn, {wudongming97, shenjianbingcg}@gmail.com,
weixing@mail.xjtu.edu.cn, {liuyingfei,wangtiancai}@megvii.com
*Equal contributions. †This work was done during an internship at MEGVII Technology. ‡Corresponding authors. The work was supported by the National Science and Technology Major Project of China (2023ZD0121300), the CAAI-MindSpore Open Fund, and the FDCT grants 0102/2023/RIA2 and 0154/2022/A3.

Abstract

Rapid advancements in Autonomous Driving (AD) tasks have driven a significant shift toward end-to-end approaches, particularly vision-language models (VLMs) that integrate robust logical reasoning and cognitive abilities to enable comprehensive end-to-end planning. However, these VLM-based approaches tend to combine 2D vision tokenizers with a large language model (LLM) for ego-car planning, and therefore lack the 3D geometric priors that are a cornerstone of reliable planning. Naturally, this observation raises a critical concern: Can a 2D-tokenized LLM accurately perceive the 3D environment? Our evaluation of current VLM-based methods across 3D object detection, vectorized map construction, and environmental captioning suggests that the answer is, unfortunately, NO. In other words, a 2D-tokenized LLM fails to provide reliable autonomous driving. In response, we introduce DETR-style 3D perceptrons as 3D tokenizers, which are connected to the LLM through a one-layer linear projector. This simple yet elegant strategy, termed Atlas, harnesses the inherent priors of the 3D physical world, enabling it to simultaneously process high-resolution multi-view images and employ spatiotemporal modeling. Despite its simplicity, Atlas demonstrates superior performance in both 3D detection and ego planning tasks on the nuScenes dataset, proving that a 3D-tokenized LLM is the key to reliable autonomous driving. The code and datasets will be released.

1 Introduction

Figure 1: Comparison among end-to-end methods. (a) Modular BEV-based methods have three sequential modules for perception, prediction, and planning, but they cannot provide multiple potential trajectories or environment reasoning. (b) A 2D-tokenized VLM projects distorted 2D images into tokens, which lack the 3D priors needed for reliable autonomous driving. (c) Our 3D-tokenized LLM-based method utilizes 3D perceptrons as 3D tokenizers, which provide potential trajectories and rich 3D priors for reliable driving.

Autonomous Driving (AD) is a sophisticated system that integrates perception, reasoning, and planning pomerleau1988alvinn ; janai2020computer ; chen2023end . Perception serves as the initial stage, capturing details of the surrounding environment. This information then feeds into the reasoning component, facilitating a deeper understanding, and ultimately guiding informed decision-making through the planning process. Recently, the incorporation of perception, reasoning, and planning to construct end-to-end models has become prevalent. It can be broadly categorized into two distinct methodologies: modular bird’s-eye view (BEV) based approaches and large vision-language model (VLM) based methods.

The modular BEV-based approaches are meticulously engineered, comprising custom-tailored modules, including 3D perception, trajectory prediction, and ego-car planning liang2020pnpnet ; casas2021mp3 ; chen2022learning ; zhang2022beverse ; hu2022st ; gu2023vip3d ; hu2023planning , as shown in Figure 1(a). While BEV representation enhances environmental perception, these methods may encounter difficulty stemming from their limited reasoning abilities. Specifically, these models tend to mimic established expert trajectories and struggle to predict multiple potential motion trajectories when confronted with novel scenarios. To tackle this challenge, VLM-based methods mark a significant turning point. They usually employ a 2D vision tokenizer (e.g., ViT-CLIP radford2021learning ) with a Large Language Model (LLM) to interpret distorted images and produce navigational commands xu2023drivegpt4 ; tian2024drivevlm ; wang2023drivemlm ; shao2023lmdrive ; jia2023adriver ; sima2023drivelm . Benefiting from the robust logical reasoning and cognitive abilities of the VLM agent, the model can generate rational decisions and dialogues.

Despite the success of VLM-based algorithms, the perceptual capabilities within this paradigm are barely studied. While we argue that the perception sub-task may not be essential for end-to-end driving, the capacity to perceive the environment remains a cornerstone of reliable planning. Since VLM-based methods rely on 2D vision tokenizers for environmental perception without incorporating 3D geometric priors, an intuitive question arises: Can a 2D-tokenized LLM accurately perceive the 3D environment? To answer this question, we design experiments that evaluate the perception performance of prevalent VLM-based systems on three tasks: 3D object detection, 3D lane detection, and environmental captioning. Our findings reveal that, despite extensive pre-training and large parameter counts, mainstream VLM solutions typically lag in precision behind specialized models designed for these tasks. This glaring gap highlights the limitations of 2D tokenizers in perceiving 3D environments.

To address this issue, we ask whether 3D vision tokenizers hold the key. We find that the existing DETR-style BEV framework can naturally serve as a 3D visual compression tokenizer. Therefore, we opt for the advanced StreamPETR wang2023exploring and TopoMLP wu2023topomlp as our 3D visual tokenizers, forgoing the traditional use of ViT-CLIP radford2021learning . This strategy brings three advantages: 1) The innate priors of the 3D physical world are naturally encoded within the visual tokens through position encodings. 2) It can handle high-resolution images of any aspect ratio without distorting them. 3) Video frames can be processed in a streaming manner, benefiting from DETR-style query propagation. Through evaluation on the nuScenes dataset, we demonstrate that our 3D-tokenized LLM approach achieves performance on par with specialized algorithms in tasks such as 3D object detection and lane detection.

Beyond that, we need to answer another question: Is a 3D-tokenized LLM the key to reliable autonomous driving? Following BEV-Planner li2023ego , we extend our exploration to the open-loop planning on the nuScenes dataset. By leveraging the 3D tokenizers for enhanced perception capabilities, our model not only comprehends the environment around the vehicle but also utilizes the LLM to formulate driving recommendations and plan the ego-car trajectory in an end-to-end manner. Remarkably, this approach eschews hand-crafted designs and achieves state-of-the-art performance on the nuScenes planning task.

In summary, our work highlights the importance of proper vision tokenizers in VLM-based AD and introduces the 3D-tokenized LLM as a solution. We showcase its superiority across multiple tasks within autonomous driving systems: 3D perception, vectorized map construction, environmental captioning, and planning. Our model demonstrates superior performance in both benchmark evaluations and practical downstream applications, demonstrating its reliability and versatility. Furthermore, our framework paves the way for pioneering end-to-end LLM-driven solutions in autonomous driving, potentially transforming how these systems are developed.

2 Can a 2D-Tokenized LLM Accurately Perceive 3D Environment?

Current VLM-based methods xu2023drivegpt4 ; tian2024drivevlm ; wang2023drivemlm ; shao2023lmdrive ; jia2023adriver in AD tend to employ 2D vision tokenizers. They operate without incorporating geometric 3D priors, raising concerns about their capability to accurately perceive and describe 3D environments, which is crucial for reliable planning. In this section, we provide insightful analysis and reveal the limitations of relying solely on 2D tokenizers for understanding 3D driving scenes, including 3D perception and visual captioning.

2.1 2D-Tokenized LLM for Perception

To investigate the 3D understanding capability of current VLM-based approaches, we first conduct experiments on traditional perception tasks: 3D object detection and 3D lane detection. In this part, we introduce datasets, models, and metrics.

Datasets. We design datasets tailored for VLM methods, built upon the popular multi-view benchmark nuScenes caesar2020nuscenes , as shown in Figure 2. For the 3D detection task, we construct question-and-answer (QA) pairs that focus on pinpointing the locations of objects surrounding the ego vehicle. Each question prompts the model to extract spatial information about the target objects from six views. The corresponding answers require the model to identify both the category and the 3D coordinates of objects. Similarly, the dataset for 3D lane detection also comprises QA pairs, whose answers are lane points borrowed from OpenLane-V2 subset-B wang2024openlane . Here, each road is depicted using four consecutive points describing the road centerline. More details can be found in the supplementary material.
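To make the QA format concrete, the following is a minimal sketch of how one detection sample could be serialized. The prompt wording, the `Box3D` type, the nearest-first ordering, and the one-decimal rounding are all illustrative assumptions; the actual templates are described in the paper's supplementary material.

```python
from dataclasses import dataclass
import math


@dataclass
class Box3D:
    """Hypothetical container for one annotated object in the ego frame."""
    category: str
    x: float  # metres
    y: float  # metres
    z: float  # metres


# Hypothetical question text; not the template used by the authors.
QUESTION = ("Six camera views surround the ego vehicle. "
            "List the category and 3D center of every object.")


def format_answer(boxes, ndigits=1):
    """Serialize boxes as 'category: [x, y, z]' entries, nearest object first."""
    ordered = sorted(boxes, key=lambda b: math.hypot(b.x, b.y))
    parts = [
        f"{b.category}: [{b.x:.{ndigits}f}, {b.y:.{ndigits}f}, {b.z:.{ndigits}f}]"
        for b in ordered
    ]
    return "; ".join(parts) if parts else "No objects."


def make_qa(boxes):
    """Build one question-answer pair for the 3D detection task."""
    return {"question": QUESTION, "answer": format_answer(boxes)}


qa = make_qa([Box3D("car", 12.34, -3.0, 0.5), Box3D("pedestrian", 2.0, 1.0, 0.0)])
print(qa["answer"])
```

Because the answer is plain text, the same parser can be applied to any LLM output regardless of the tokenizer behind it, which keeps the comparison between 2D and 3D tokenizers uniform.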

Models. All 2D-tokenized LLMs in our study adhere to a uniform architecture, which consists of three main components: a 2D tokenizer, a projector, and a large language model. The 2D tokenizer follows ViT-CLIP radford2021learning to extract visual features from the multi-view images. For the projection module, we use a single convolutional layer to bridge the 2D tokenizer and the LLM. We utilize diverse pre-trained LLMs, such as LLaMA touvron2023llama , LLaVA liu2024visual , and Vicuna chiang2023vicuna , to process the visual information and generate the perception of the environment; evaluating several LLMs under the same architecture keeps our exploration consistent and fair. Additionally, Merlin yu2023merlin , another available VLM-based model that is pre-trained on 2D object detection, is also evaluated.

Metrics. In this study, we employ the F1 score as the main evaluation metric. The choice of the F1 score is motivated by two primary considerations: First, VLMs cannot deliver the necessary predictive confidence for metrics such as mean Average Precision (mAP). Second, traditional perceptual metrics commonly encourage numerous redundant predictions, which can clutter the model output. In contrast, VLMs are designed to generate more targeted and focused predictions, making the F1 score a better fit for assessing these models. In this work, for 3D detection, we choose threshold distances of 0.5, 1.0, 2.0, and 4.0 meters to define positive predictions, similar to the discrimination levels used in detection mAP calculations. As for 3D lane detection, we follow OpenLane-V2 evaluation protocol wang2024openlane to compute the F1 score.
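The thresholded F1 score described above can be sketched as follows. The matching rule is not spelled out in this section, so the sketch assumes greedy one-to-one matching between predictions and ground truth of the same category, using the 2D center distance on the ground plane; both choices are assumptions rather than the authors' exact protocol.

```python
import math


def match_f1(preds, gts, threshold):
    """Precision, recall and F1 for one distance threshold.

    preds and gts are lists of (category, x, y) tuples in metres.
    Assumed rule: a prediction is a true positive if it is greedily
    matched to an unused ground-truth object of the same category
    whose center lies within `threshold` metres.
    """
    candidates = []
    for i, (p_cat, px, py) in enumerate(preds):
        for j, (g_cat, gx, gy) in enumerate(gts):
            dist = math.hypot(px - gx, py - gy)
            if p_cat == g_cat and dist <= threshold:
                candidates.append((dist, i, j))
    candidates.sort()  # closest pairs are matched first

    used_preds, used_gts = set(), set()
    for _, i, j in candidates:
        if i not in used_preds and j not in used_gts:
            used_preds.add(i)
            used_gts.add(j)

    tp = len(used_preds)
    precision = tp / len(preds) if preds else 0.0
    recall = tp / len(gts) if gts else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom > 0 else 0.0
    return precision, recall, f1


preds = [("car", 10.0, 0.0), ("car", 20.0, 5.0), ("pedestrian", 3.0, 3.0)]
gts = [("car", 10.3, 0.0), ("car", 21.5, 5.0)]
for k in (0.5, 1.0, 2.0, 4.0):
    p, r, f1 = match_f1(preds, gts, k)
    print(f"k={k}: P={p:.2f} R={r:.2f} F1={f1:.2f}")
```

No confidence score is needed anywhere in this computation, which is the property that makes F1 usable for text-generating models.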

Table 1: Comparisons with task-specific and VLM-based methods on the 3D object detection task using our proposed dataset. The bold numbers represent the highest accuracy achieved in each category. P_k, R_k, and F1_k denote the precision, recall, and F1 score, respectively, with k (in meters) as the threshold distance used to define a positive prediction. Spe. denotes a task-specialist model.

	Method	Tokenizers	P_0.5	R_0.5	F1_0.5	P_1.0	R_1.0	F1_1.0	P_2.0	R_2.0	F1_2.0	P_4.0	R_4.0	F1_4.0
Spe.	PETR liu2022petr	-	12.4	21.5	15.8	20.0	30.5	24.1	27.5	37.7	31.8	33.8	42.6	37.7
Spe.	StreamPETR wang2023exploring	-	22.7	41.3	29.3	31.6	49.5	38.6	38.1	54.2	44.7	42.5	56.9	48.7
VLM	LLaMA touvron2023llama	2D	0.3	1.1	0.4	0.6	2.6	1.0	1.5	5.8	2.4	3.5	12.8	5.5
	LLaVA liu2024visual	2D	2.0	20.3	3.0	3.6	35.7	6.5	6.5	50.3	11.6	10.9	62.8	18.9
	Vicuna chiang2023vicuna	2D	2.0	20.1	2.5	2.9	35.6	5.4	5.9	51.1	10.1	9.4	63.8	16.4
	Merlin yu2023merlin	2D	3.0	22.5	5.3	4.1	36.1	7.4	6.6	52.6	11.7	12.1	64.3	20.4
	Atlas(Ours)	3D	15.0	61.2	24.1	27.2	74.0	39.8	36.2	79.2	49.7	41.2	81.2	54.6

3D Object Detection. In this study, we conduct extensive experiments to evaluate the performance of VLMs on 3D detection, as listed in Table 1. As a comparison, Table 1 also includes task-specific models such as PETR liu2022petr and StreamPETR wang2023exploring . Among these, the state-of-the-art detector StreamPETR achieves an F1_4.0 score of 48.7. Despite the rich contextual knowledge and extensive parameters, 2D-tokenized LLM methods exhibit a considerable performance drop in both precision and recall, leading to surprisingly low F1 scores. These methods struggle with detecting objects in the vicinity of the ego vehicle, highlighting a considerable disparity in 3D object detection capabilities between VLM-based methods and dedicated task-specific approaches.

Table 2: 3D lane detection.

Method	Tokenizers	P	R	F1
TopoMLP wu2023topomlp	-	50.6	55.7	53.0
LLaVA liu2024visual	2D	10.4	9.8	10.0
Vicuna chiang2023vicuna	2D	11.7	10.3	10.9
Merlin yu2023merlin	2D	22.1	22.4	22.2
Atlas(ours)	3D	45.7	39.1	42.2

3D Lane Detection. Vectorized maps provide the driving routes for the ego car, which makes their construction a crucial perception task for autonomous driving. We evaluate a state-of-the-art task-specific model, TopoMLP wu2023topomlp , and several of the aforementioned 2D-tokenized LLM methods on lane detection. The main results are shown in Table 2. As with object detection, the 2D-tokenized LLM methods fall far short of the task-specific model and struggle with 3D lane detection.

2.2 2D-Tokenized LLM for Captioning

In addition to basic environmental perception tasks, LLMs can be adapted to perform more complex tasks, such as extracting and interpreting key features from visual input to caption environments. This capability extends the utility of LLMs in practical applications and leverages their world knowledge and reasoning ability, particularly in scenarios requiring detailed environmental understanding.

To explore whether a 2D-tokenized LLM could serve as an effective perceptron, we develop a specialized version of the model for environmental captioning. This variant utilizes Vicuna chiang2023vicuna as its underlying LLM and is tasked with capturing and describing the operational environment of a vehicle. The description covers various elements, such as the location and quantity of nearby vehicles and pedestrians, the traffic dynamics, and the surrounding lanes, pedestrian crossings, and roads.

Despite the advanced capabilities of VLMs in generating natural language descriptions, as illustrated in Figure 3, our findings indicate that the 2D-tokenized LLM struggles with accurate environmental perception. The model frequently produces erroneous or "hallucinated" descriptions, which suggests that it still falls short of reliable perception in practical applications. This underscores the challenges and limitations inherent in deploying LLMs for complex perceptual tasks in dynamic environments.

Remark. To sum up, the experiments above reveal a significant limitation in the perception capabilities of LLMs that rely on 2D visual tokenizers. This limitation poses serious challenges for reliable ego vehicle planning. We claim that the primary reason for this limitation lies in the inability of 2D visual tokenizers to effectively integrate 3D spatial priors. To address the limitation, we introduce advanced pre-trained 3D perception models as 3D tokenizers in the following section.

3 3D-Tokenized LLM for Reliable Autonomous Driving

3.1 3D-Tokenized LLM

Distinct from the 2D-tokenized LLM, we introduce 3D tokenizers founded upon a DETR-inspired architecture into the LLM, formulating a 3D-tokenized LLM framework named Atlas. Specifically, Atlas consists of three primary components. First, the model employs 3D tokenizers, StreamPETR wang2023exploring and TopoMLP wu2023topomlp , to process multi-view images into DETR-style query representations. Next, these queries are passed through a single linear layer, functioning as a projector, to align them with the LLM. The final component of Atlas is an LLM, instantiated as Vicuna chiang2023vicuna . This approach brings significant benefits in incorporating innate 3D priors, handling high resolution, and facilitating temporal propagation, as previously elaborated.
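The projector step can be illustrated with a dependency-free sketch of a single linear layer that maps each DETR-style query to the width of the LLM embedding space. The dimensions, the random initialization, and the function names are illustrative assumptions, not values from the paper.

```python
import random


def make_linear(in_dim, out_dim, seed=0):
    """Create toy weights for one linear layer (random init, zero bias)."""
    rng = random.Random(seed)
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
              for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weight, bias


def project(queries, weight, bias):
    """Apply y = W q + b to every query, producing one LLM token per query."""
    return [
        [sum(w * q for w, q in zip(row, query)) + b
         for row, b in zip(weight, bias)]
        for query in queries
    ]


# Toy sizes chosen for readability only.
num_queries, query_dim, llm_dim = 4, 8, 16
queries = [[float(i + j) for j in range(query_dim)] for i in range(num_queries)]
weight, bias = make_linear(query_dim, llm_dim)
tokens = project(queries, weight, bias)
print(len(tokens), len(tokens[0]))
```

The number of visual tokens equals the number of queries, so it does not grow with image resolution or with the number of camera views.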

3D Environment Perception. The performance of Atlas is evaluated on standard datasets tailored to the tasks of 3D object detection and 3D lane detection, as reported in Table 1 and Table 2. The results demonstrate that 3D-tokenized LLM achieves remarkable performance across both tasks. Besides, 3D-tokenized LLM performs better than 2D-tokenized LLM on driving environment captioning, as shown in Figure 3, thereby affirming the significant advantages of utilizing 3D tokenizers. In addition to representing 3D environment, our ultimate goal is to achieve reliable autonomous driving. In the following, we will evaluate the performance of 3D-tokenized LLM on ego-car planning.

3.2 Implementation

The whole model is trained on 8 Tesla A100 GPUs, with a total training time of approximately 100 hours.

Dataset. We employ the nuScenes planning dataset caesar2020nuscenes in our experiments on reliable autonomous driving. As illustrated in Figure 2, we reformat the planning data into a question-answer format to facilitate our analysis. Previous research bevplanner has established that the "ego states" (sensor-provided data on the autonomous vehicle such as velocity, acceleration, yaw angle, and historical trajectory) play a crucial role in open-loop planning. Additionally, to aid navigation, especially at intersections, it is essential to incorporate a high-level command (e.g., go straight, turn left, turn right) that provides directional guidance. Building on these insights, we design question-and-answer pairs that require the model to first predict the future velocity and acceleration based on the current state and then generate planning waypoints for the ego car, prompted by a high-level command. This process, called chain-of-thought wei2022chain , enhances not only the interpretability of the model’s reasoning process but also its reliability. A typical example is shown in Figure 2, and additional details about the dataset are available in the supplementary materials.
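A planning sample in this chain-of-thought format might be assembled as in the sketch below. The question wording, the field labels, and the `chain` argument (which orders the velocity, acceleration, and planning steps) are hypothetical and only illustrate the idea of predicting ego states before waypoints.

```python
def fmt_vec(values):
    """Format a sequence of floats as '[a, b, ...]' with two decimals."""
    return "[" + ", ".join(f"{v:.2f}" for v in values) + "]"


def make_planning_qa(command, velocity, acceleration, waypoints, chain="V-A-P"):
    """Build one planning QA pair.

    `chain` lists the answer steps in order: V = velocity,
    A = acceleration, P = planned waypoints. The wording below is
    illustrative, not the template used in the paper.
    """
    question = (f"The high-level command is: {command}. "
                "Predict the ego state, then plan the future waypoints.")
    fields = {
        "V": f"Velocity: {fmt_vec(velocity)}.",
        "A": f"Acceleration: {fmt_vec(acceleration)}.",
        "P": "Waypoints: " + ", ".join(fmt_vec(p) for p in waypoints) + ".",
    }
    answer = " ".join(fields[step] for step in chain.split("-"))
    return {"question": question, "answer": answer}


qa = make_planning_qa("turn left", (4.0, 0.1), (0.2, 0.0),
                      [(0.0, 2.0), (-0.1, 4.1)])
print(qa["answer"])
```

Passing `chain="P"` or `chain="P-V-A"` yields the other orderings, which is convenient for comparing different chains under otherwise identical data.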

Metrics. We adhere to standard practices by utilizing the implementation provided by ST-P3 stp3 to assess planning over time horizons of 1s, 2s, and 3s. We assess the performance with two widely accepted metrics: the L2 error calculated by comparing the predicted trajectories of the ego vehicle with the ground-truth trajectories at corresponding waypoints, and the collision rate calculated by checking for any intersections between the ego vehicle and other entities within the scene.
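For intuition, the L2 metric can be sketched as below. This is not the ST-P3 code: it assumes waypoints sampled at 2 Hz and reports, for each horizon, the mean displacement over all waypoints up to that horizon. The collision rate additionally requires the scene's occupancy or bounding boxes and is omitted here.

```python
import math


def l2_at_horizons(pred, gt, hz=2, horizons=(1, 2, 3)):
    """Mean L2 displacement (metres) up to each horizon in seconds.

    pred and gt are equally long lists of (x, y) waypoints sampled at
    `hz` waypoints per second. Averaging over all waypoints up to the
    horizon is an assumption about the evaluation convention.
    """
    errors = [math.hypot(px - gx, py - gy)
              for (px, py), (gx, gy) in zip(pred, gt)]
    result = {}
    for h in horizons:
        n = h * hz
        if n > len(errors):
            raise ValueError(f"need {n} waypoints for a {h}s horizon")
        result[h] = sum(errors[:n]) / n
    return result


# Toy trajectory: the prediction drifts sideways by 0.1 m per waypoint.
gt = [(0.0, float(i)) for i in range(1, 7)]
pred = [(0.1 * i, float(i)) for i in range(1, 7)]
scores = l2_at_horizons(pred, gt)
print({h: round(v, 2) for h, v in scores.items()})
```

Some benchmarks instead report the error at the final waypoint of each horizon, so the convention should be checked before comparing numbers across papers.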

3.3 Main Results

Table 3: Comparisons on planning. For a fair comparison, we refer to the reproduced results in BEV-Planner bevplanner . The bold numbers represent the highest accuracy.

Method	High-level Command	Ego States in BEV	Ego States in Planner	L2 1s (m)	L2 2s (m)	L2 3s (m)	L2 Avg. (m)	Col. 1s (%)	Col. 2s (%)	Col. 3s (%)	Col. Avg. (%)
FF FF	✘	✔	✔	0.55	1.20	2.54	1.43	0.06	0.17	1.07	0.43
ST-P3 stp3	✔	✘	✘	1.59	2.64	3.73	2.65	0.69	3.62	8.39	4.23
ST-P3 stp3	✔	✔	✔	1.33	2.11	2.90	2.11	0.23	0.62	1.27	0.71
UniAD hu2023planning	✔	✘	✘	0.59	1.01	1.48	1.03	0.16	0.51	1.64	0.77
UniAD hu2023planning	✔	✔	✔	0.20	0.42	0.75	0.46	0.02	0.25	0.84	0.37
VAD-Base VAD	✔	✘	✘	0.69	1.22	1.83	1.25	0.06	0.68	2.52	1.09
VAD-Base VAD	✔	✔	✔	0.17	0.34	0.60	0.37	0.04	0.27	0.67	0.33
Ego-MLP admlp	✔	✘	✔	0.15	0.32	0.59	0.35	0.00	0.27	0.85	0.37
BEV-Planner bevplanner	✔	✘	✘	0.30	0.52	0.83	0.55	0.10	0.37	1.30	0.59
BEV-Planner bevplanner	✔	✔	✔	0.16	0.32	0.57	0.35	0.00	0.29	0.73	0.34
LLaVA liu2024visual	✔	✘	✘	1.04	1.74	2.57	1.79	0.58	1.17	1.74	1.16
Vicuna chiang2023vicuna	✔	✘	✘	1.06	1.80	2.54	1.80	0.60	1.21	1.78	1.20
Merlin yu2023merlin	✔	✘	✘	1.03	1.71	2.40	1.71	0.48	1.05	1.77	1.10
Atlas	✘	✘	✘	1.69	1.89	2.25	1.94	0.51	0.85	1.44	0.93
	✔	✘	✘	0.52	0.97	1.53	1.00	0.15	0.31	0.70	0.38
	✔	✔	✔	0.18	0.21	0.26	0.21	0.12	0.13	0.16	0.13

In this section, we evaluate the performance of our proposed method, Atlas, by comparing it against existing state-of-the-art (SoTA) BEV-based planners, as detailed in Table 3. Our experimental results reveal that Atlas achieves substantial improvements over the SoTA methods, reducing the average L2 metric by 40.0% and the average Collision metric by 60.6%. These significant enhancements corroborate the effectiveness of the 3D-tokenized LLMs, which we consider as the key to reliable autonomous driving.
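The two percentages can be checked with a short calculation. The baselines used here are an inference from Table 3 rather than something stated in the text: the lowest prior average L2 among methods that use ego states is 0.35 m (Ego-MLP and BEV-Planner), and the lowest prior average collision rate is 0.33% (VAD-Base), against 0.21 m and 0.13% for Atlas.

```python
def relative_reduction(baseline, ours):
    """Percentage by which `ours` is lower than `baseline`."""
    return 100.0 * (baseline - ours) / baseline


# Values read from Table 3 (setting with ego states); the choice of
# baselines is an assumption inferred from the reported percentages.
l2_gain = relative_reduction(0.35, 0.21)
col_gain = relative_reduction(0.33, 0.13)
print(f"L2: {l2_gain:.1f}%  Collision: {col_gain:.1f}%")
```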

Further, to ascertain whether the performance improvements are solely attributable to the inclusion of ego state information, a frequent topic of discussion within the community, we conduct additional experiments that remove the ego state data during both training and testing. In this setting, Atlas clearly outperforms the prevailing VLM-based methods, which validates the effectiveness of 3D tokenizers. Atlas also continues to outperform the BEV-based methods in terms of collision rate, while its performance on the L2 metric is only comparable to theirs. We hypothesize that this outcome may stem from the inherent capability of the LLM to predict multiple potential motion trajectories and make rational decisions, which, when confronted with novel scenarios, may deviate from the ground truth.

3.4 Ablation Study

To avoid unnecessary misunderstandings, our ablation does not introduce any ego states.

Table 4: A set of ablative studies on 3D object detection and ego-car planning. The adopted algorithm designs and hyper-parameter settings are marked in bold. See §3.4 for details.

	3D Detection		Planning
	F1_1.0	F1_2.0	Avg. L2	Avg. Col.
Vicuna	5.4	10.1	2.19	2.75
+ QR	30.7	41.2	1.22	0.62
+ RP	34.6	46.5	1.10	0.44
+ MQ	39.8	49.7	1.00	0.38

(a) Component Effect. QR, RP, and MQ mean Query Representation, Reference Point embedding, and Memory Queue.

PT	SP	TM	Avg. L2	Avg. Col.
✔	-	-	1.51	1.05
-	✔	-	1.06	0.41
-	✔	✔	1.00	0.38

(b) Effect of different 3D tokenizers. PT, SP and TM represent PETR, StreamPETR and TopoMLP.

Resolution	Avg. L2	Avg. Col.
336 × 336	1.66	0.94
320 × 800	1.41	0.58
800 × 1600	1.00	0.38

(c) Input image resolution.

RP. emb.	Avg. L2	Avg. Col.
none	1.18	0.57
sin-cos	1.21	0.57
learned	1.19	0.56
RP	1.00	0.38

(d) Reference point embeddings.

LLMs	Avg. L2	Avg. Col.
LLaMA touvron2023llama	1.14	0.47
LLaVA liu2024visual	1.03	0.39
Vicuna chiang2023vicuna	1.00	0.38
Merlin yu2023merlin	0.99	0.42

(e) Different pretrained LLMs.

Component Effect. We conduct ablation studies to analyze our proposed 3D-tokenized LLM, considering several key components: query representation, reference point embedding, and memory queue, each decoupled from StreamPETR and evaluated on 3D detection and planning. The results are summarized in Table 4(a). Our experiments demonstrate that each component progressively enhances performance in both tasks. Furthermore, improvements in detection are consistently accompanied by improvements in planning, supporting our claim that the capacity to perceive the environment remains a cornerstone of reliable planning.
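As a rough illustration of the memory queue component, the sketch below keeps the highest-scoring queries of the most recent frames in a first-in-first-out buffer and exposes them as temporal context for the current frame. The class name, the top-k selection, and the queue length are assumptions for illustration; details of the real StreamPETR design, such as ego-motion compensation, are not modeled.

```python
from collections import deque


class MemoryQueue:
    """FIFO buffer of per-frame queries (simplified, hypothetical sketch)."""

    def __init__(self, max_frames, top_k):
        self.frames = deque(maxlen=max_frames)  # oldest frame is dropped first
        self.top_k = top_k

    def push(self, queries, scores):
        """Store the `top_k` highest-scoring queries of the current frame."""
        ranked = sorted(zip(scores, queries), key=lambda pair: pair[0],
                        reverse=True)
        self.frames.append([query for _, query in ranked[:self.top_k]])

    def context(self):
        """Return all stored queries, oldest frame first."""
        return [query for frame in self.frames for query in frame]


queue = MemoryQueue(max_frames=2, top_k=2)
queue.push(["a", "b", "c"], [0.1, 0.9, 0.5])  # keeps b, c
queue.push(["d", "e"], [0.3, 0.2])            # keeps d, e
queue.push(["f"], [1.0])                      # evicts the first frame
print(queue.context())
```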

3D Tokenizers. We investigate the effectiveness of various 3D tokenizers for ego-car planning, which are the central enhancements introduced in our study. The results are shown in Table 4(b). The tokenizers we evaluate include PETR (PT) liu2022petr , StreamPETR (SP) wang2023exploring , and TopoMLP (TM) wu2023topomlp . Our incorporation of progressively advanced 3D perceptrons into LLM demonstrates a notable improvement in planning performance, underscoring the significance of 3D perception in achieving robust autonomous driving. Furthermore, we integrate TopoMLP to provide supplementary lane line information. This addition results in a modest enhancement in performance, suggesting the potential benefits of incorporating contextual roadway features into the motion planning process.

Resolution. Our approach integrates 3D tokenizers with adjustable image resolution capabilities, which aligns well with real-world applications in autonomous driving. As Table 4(c) presents, we observe that increasing the image resolution leads to a noticeable improvement in performance. This evidence indicates that our method holds significant advantages over traditional VLM techniques, particularly in terms of flexibility and efficacy in handling diverse image resolutions.

Reference Point Embeddings. Atlas introduces an important concept: 3D tokenizers equipped with reference point embeddings, following the setting of StreamPETR wang2023exploring and TopoMLP wu2023topomlp . Here, we evaluate the model with the reference point embedding decoupled from the query embedding. Our initial approach relied solely on query representations (i.e., "none" in Table 4(d)), which overlooks the crucial 3D spatial context carried by the reference points. However, as shown in Table 4(d), simply applying conventional embedding techniques carion2020end , such as sin-cos position embedding and learned position embedding, to the 3D queries does not markedly influence performance. This outcome underscores the unique advantages of reference points. To exploit them, we map the reference points through a single-layer projector and incorporate the resulting reference point embeddings into the 3D query representation (i.e., "RP" in Table 4(d)). Notably, this method achieves remarkable improvements in accuracy, underscoring its effectiveness.
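One plausible reading of the "RP" variant is sketched below: each 3D reference point is mapped by its own single linear layer and the result is added to the projected query. The additive combination, the dimensions, and the hand-picked weights are assumptions for illustration only.

```python
def linear(x, weight, bias):
    """Compute W x + b for one input vector."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def tokenize(query, ref_point, w_q, b_q, w_rp, b_rp):
    """Combine a projected query with its reference point embedding.

    Assumption: the two projections are summed element-wise.
    """
    query_token = linear(query, w_q, b_q)
    rp_embedding = linear(ref_point, w_rp, b_rp)
    return [a + b for a, b in zip(query_token, rp_embedding)]


# Tiny hand-picked weights so the result is easy to verify by eye.
w_q, b_q = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]              # 2 -> 2
w_rp, b_rp = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]], [0.0, 0.0]  # 3 -> 2
token = tokenize([1.0, 2.0], [0.5, -1.0, 2.0], w_q, b_q, w_rp, b_rp)
print(token)
```

Unlike a sin-cos or learned index embedding, this embedding depends on where the query sits in 3D space, which is the property the ablation attributes the gain to.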

Pretrained LLMs. In our experiments, we evaluate different LLMs that vary in their pre-training methodologies, as detailed in Table 4(e). Our results show that LLMs pre-trained with methods that align text and images outperform the others in planning tasks. We attribute this enhanced performance to the multimodal nature of their training. Additionally, our analysis reveals that models pre-trained with various visual-language data exhibit no significant differences in planning performance. We believe this is due to the absence of 3D data in their pre-training processes, suggesting that the inclusion of 3D data in pre-training, as 3D tokenizers do, is necessary.

Table 5: Effect of the chain of thought.

chain	Avg. L2	Avg. Col.
P	1.33	0.79
V-P	1.21	0.61
V-A-P	1.00	0.38
V-A-Y-P	1.15	0.55
V-A-T-P	1.40	0.81
P-V-A	1.01	0.40

Chain of Thought. In the realm of autonomous driving, recent works admlp ; bevplanner converge on a key insight: the state of the ego vehicle is a pivotal factor in shaping open-loop planning strategies. To this end, we divide the ego states into four distinct yet interrelated dimensions: velocity (V), acceleration (A), yaw angle (Y), and the historical trajectory (T). To evaluate the influence of each dimension, we conduct ego planning based on the predictions of these parameters. The experimental findings are outlined in Table 5, where "P" denotes "Planning". Notably, our results diverge from prevailing research, indicating that the yaw angle and historical trajectory do not enhance the efficacy of the planning process. This counterintuitive outcome is likely a consequence of the inherent difficulty of precisely forecasting these variables wei2023autoregressive ; Bai_2024_CVPR . Moreover, we discover an interesting aspect of our model’s robustness: the order in which these parameters are predicted has little impact on performance (compare V-A-P with P-V-A in Table 5). Altering the order of prediction (e.g., reversing it) also does not increase computational time.

3.5 Qualitative Results

We also conduct a qualitative analysis by visualizing the trajectory predictions made by Atlas, as shown in Figure 4. We execute the 3D-tokenized LLM five times to produce five depicted planning trajectories. The results demonstrate that Atlas is capable of generating multiple feasible plans for autonomous driving that are not only practical but also adhere to safety standards. Specifically, Atlas successfully devises various potential routes tailored to distinct driving scenarios, including following other vehicles, lane changing, and overtaking. Importantly, Atlas effectively identifies and avoids pedestrians and cars, showcasing its robust capability in ensuring road safety.

4 Related Work

DETR-style BEV Perception. DETR carion2020end was initially proposed to address the challenge of end-to-end detection and has since been extensively applied in BEV perception liu2022petr ; li2022bevformer ; liu2023sparsebev ; lin2022sparse4d , thereby significantly advancing its development. DETR3D wang2022detr3d is a pioneering work that introduces the concept of 3D object queries, which interact with multi-view image features to produce sparse yet informative object representations. Further, PETR liu2022petr ; liu2022petrv2 introduces the concept of 3D position encoding, and BEVFormer li2022bevformer brings BEV temporal modeling. StreamPETR wang2023exploring and Sparse4Dv2 lin2023sparse4d use object queries as a vessel for temporal modeling, effectively propagating temporal information while achieving SoTA performance with commendable efficiency. A notable finding of StreamPETR is that additional multi-frame image feature interactions do not enhance performance, suggesting that the highly compressed object queries are sufficiently expressive to encapsulate all necessary information for BEV perception. Moreover, the DETR framework has been extended to map queries by works such as MapTR liao2022maptr , TopoNet li2023topology , and TopoMLP wu2023topomlp , which are instrumental in the construction of vectorized map representations.

BEV-based End-to-end Driving. Traditional autonomous driving systems have often relied on manual rules for planning, which can be cumbersome and complex and struggle to cover the numerous corner cases. In recent years, there has been a pronounced shift towards end-to-end autonomous driving approaches, which have demonstrated significant progress in simplifying and streamlining the autonomous driving pipeline. UniAD hu2023planning is a pioneering work that introduces an end-to-end framework encompassing tasks such as perception, prediction, and planning, with these tasks executed sequentially to ultimately produce control outputs. Building upon this framework, VAD alexanian1990vad further streamlines the pipeline, enhancing efficiency and reducing complexity. However, AD-MLP zhai2023rethinking and BEV-Planner li2023ego have observed that existing end-to-end methods can achieve high performance on open-loop benchmarks like nuScenes caesar2020nuscenes by simply fitting to the ego status of the autonomous vehicle. This finding suggests that the integration of planning and control in these models may not yet fully capture the complexities of real-world driving scenarios. Subsequent works, such as Think-Twice jia2023think and VADv2 chen2024vadv2 , have made substantial advancements in closed-loop simulators like CARLA dosovitskiy2017carla . Following BEV-Planner li2023ego , we present results with and without the ego status to address the open-loop challenges on the nuScenes caesar2020nuscenes benchmark.

VLM-Agent for Autonomous Driving. The visual-language model (VLM) demonstrates promising results in the fields of visual-language understanding and logical reasoning, and has been extended to autonomous driving xu2023drivegpt4 ; tian2024drivevlm ; wang2023drivemlm ; shao2023lmdrive ; jia2023adriver ; xie2023sed . DriveGPT4 xu2023drivegpt4 employs a VLM to predict driving commands and provide rational explanations for its decisions. DriveLM sima2023drivelm excels at conversing about environmental information, while ADriver-I jia2023adriver focuses on predicting low-level vehicle signals. Furthermore, DriveMLM wang2023drivemlm and LMDrive shao2023lmdrive have implemented end-to-end autonomous driving solutions and validated their effectiveness on CARLA dosovitskiy2017carla closed-loop benchmarks, showcasing the potential of VLM-based agents. Despite impressive progress, no work explores how a 3D-tokenized LLM influences real-life autonomous driving.

5 Conclusion and Limitations

In this paper, we explored VLM-based methods increasingly used in autonomous driving, focusing first on perception. We found large gaps between task-specific methods and 2D-tokenized LLM-based methods in environmental perception, which is essential for reliable autonomous driving. To address these gaps, we introduced Atlas, a system combining DETR-style 3D perceptrons with LLMs. This approach integrates 3D priors for better depth perception and supports high-resolution multi-view images as well as temporal modeling through query propagation. Our evaluation of Atlas on the nuScenes dataset revealed substantial improvements in 3D detection and planning, surpassing established methods. This confirms our belief that a 3D-tokenized LLM is the key to reliable autonomous driving.

Limitations. This paper aims to demonstrate the effectiveness of the 3D tokenizer for VLM-based autonomous driving. Although our method has demonstrated outstanding performance in open-loop planning, it has not yet been tested on a closed-loop benchmark. However, existing closed-loop benchmarks (e.g., CARLA dosovitskiy2017carla ) lack realism and therefore cannot verify our motivation. Moreover, this paper lacks a performance comparison with VLM-based AD methods, because the code for these methods is proprietary.

References

[1] Dean A Pomerleau. Alvinn: An autonomous land vehicle in a neural network. NeurIPS, 1988.
[2] Joel Janai, Fatma Güney, Aseem Behl, Andreas Geiger, et al. Computer vision for autonomous vehicles: Problems, datasets and state of the art. Foundations and Trends® in Computer Graphics and Vision, 2020.
[3] Li Chen, Penghao Wu, Kashyap Chitta, Bernhard Jaeger, Andreas Geiger, and Hongyang Li. End-to-end autonomous driving: Challenges and frontiers. arXiv preprint arXiv:2306.16927, 2023.
[4] Ming Liang, Bin Yang, Wenyuan Zeng, Yun Chen, Rui Hu, Sergio Casas, and Raquel Urtasun. Pnpnet: End-to-end perception and prediction with tracking in the loop. In CVPR, 2020.
[5] Sergio Casas, Abbas Sadat, and Raquel Urtasun. Mp3: A unified model to map, perceive, predict and plan. In CVPR, 2021.
[6] Dian Chen and Philipp Krähenbühl. Learning from all vehicles. In CVPR, 2022.
[7] Yunpeng Zhang, Zheng Zhu, Wenzhao Zheng, Junjie Huang, Guan Huang, Jie Zhou, and Jiwen Lu. Beverse: Unified perception and prediction in birds-eye-view for vision-centric autonomous driving. arXiv preprint arXiv:2205.09743, 2022.
[8] Shengchao Hu, Li Chen, Penghao Wu, Hongyang Li, Junchi Yan, and Dacheng Tao. St-p3: End-to-end vision-based autonomous driving via spatial-temporal feature learning. In ECCV, 2022.
[9] Junru Gu, Chenxu Hu, Tianyuan Zhang, Xuanyao Chen, Yilun Wang, Yue Wang, and Hang Zhao. Vip3d: End-to-end visual trajectory prediction via 3d agent queries. In CVPR, 2023.
[10] Yihan Hu, Jiazhi Yang, Li Chen, Keyu Li, Chonghao Sima, Xizhou Zhu, Siqi Chai, Senyao Du, Tianwei Lin, Wenhai Wang, et al. Planning-oriented autonomous driving. In CVPR, 2023.
[11] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, 2021.
[12] Zhenhua Xu, Yujia Zhang, Enze Xie, Zhen Zhao, Yong Guo, Kenneth KY Wong, Zhenguo Li, and Hengshuang Zhao. Drivegpt4: Interpretable end-to-end autonomous driving via large language model. arXiv preprint arXiv:2310.01412, 2023.
[13] Xiaoyu Tian, Junru Gu, Bailin Li, Yicheng Liu, Chenxu Hu, Yang Wang, Kun Zhan, Peng Jia, Xianpeng Lang, and Hang Zhao. Drivevlm: The convergence of autonomous driving and large vision-language models. arXiv preprint arXiv:2402.12289, 2024.
[14] Wenhai Wang, Jiangwei Xie, ChuanYang Hu, Haoming Zou, Jianan Fan, Wenwen Tong, Yang Wen, Silei Wu, Hanming Deng, Zhiqi Li, et al. Drivemlm: Aligning multi-modal large language models with behavioral planning states for autonomous driving. arXiv preprint arXiv:2312.09245, 2023.
[15] Hao Shao, Yuxuan Hu, Letian Wang, Steven L Waslander, Yu Liu, and Hongsheng Li. Lmdrive: Closed-loop end-to-end driving with large language models. In CVPR, 2024.
[16] Fan Jia, Weixin Mao, Yingfei Liu, Yucheng Zhao, Yuqing Wen, Chi Zhang, Xiangyu Zhang, and Tiancai Wang. Adriver-i: A general world model for autonomous driving. arXiv preprint arXiv:2311.13549, 2023.
[17] Chonghao Sima, Katrin Renz, Kashyap Chitta, Li Chen, Hanxue Zhang, Chengen Xie, Ping Luo, Andreas Geiger, and Hongyang Li. Drivelm: Driving with graph visual question answering. arXiv preprint arXiv:2312.14150, 2023.
[18] Shihao Wang, Yingfei Liu, Tiancai Wang, Ying Li, and Xiangyu Zhang. Exploring object-centric temporal modeling for efficient multi-view 3d object detection. In ICCV, 2023.
[19] Dongming Wu, Jiahao Chang, Fan Jia, Yingfei Liu, Tiancai Wang, and Jianbing Shen. Topomlp: A simple yet strong pipeline for driving topology reasoning. In ICLR, 2024.
[20] Zhiqi Li, Zhiding Yu, Shiyi Lan, Jiahan Li, Jan Kautz, Tong Lu, and Jose M Alvarez. Is ego status all you need for open-loop end-to-end autonomous driving? In CVPR, 2024.
[21] Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuscenes: A multimodal dataset for autonomous driving. In CVPR, 2020.
[22] Huijie Wang, Tianyu Li, Yang Li, Li Chen, Chonghao Sima, Zhenbo Liu, Bangjun Wang, Peijin Jia, Yuting Wang, Shengyin Jiang, et al. Openlane-v2: A topology reasoning benchmark for unified 3d hd mapping. NeurIPS, 2024.
[23] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
[24] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. NeurIPS, 2024.
[25] Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E Gonzalez, et al. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality. See https://vicuna.lmsys.org (accessed 14 April 2023), 2023.
[26] En Yu, Liang Zhao, Yana Wei, Jinrong Yang, Dongming Wu, Lingyu Kong, Haoran Wei, Tiancai Wang, Zheng Ge, Xiangyu Zhang, et al. Merlin: Empowering multimodal llms with foresight minds. arXiv preprint arXiv:2312.00589, 2023.
[27] Yingfei Liu, Tiancai Wang, Xiangyu Zhang, and Jian Sun. Petr: Position embedding transformation for multi-view 3d object detection. In ECCV, 2022.
[28] Zhiqi Li, Zhiding Yu, Shiyi Lan, Jiahan Li, Jan Kautz, Tong Lu, and Jose M Alvarez. Is ego status all you need for open-loop end-to-end autonomous driving? arXiv preprint arXiv:2312.03031, 2023.
[29] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. NeurIPS, 2022.
[30] Shengchao Hu, Li Chen, Penghao Wu, Hongyang Li, Junchi Yan, and Dacheng Tao. St-p3: End-to-end vision-based autonomous driving via spatial-temporal feature learning. In ECCV, 2022.
[31] Peiyun Hu, Aaron Huang, John Dolan, David Held, and Deva Ramanan. Safe local motion planning with self-supervised freespace forecasting. In CVPR, 2021.
[32] Bo Jiang, Shaoyu Chen, Qing Xu, Bencheng Liao, Jiajie Chen, Helong Zhou, Qian Zhang, Wenyu Liu, Chang Huang, and Xinggang Wang. Vad: Vectorized scene representation for efficient autonomous driving. In ICCV, 2023.
[33] Jiang-Tian Zhai, Ze Feng, Jinhao Du, Yongqiang Mao, Jiang-Jiang Liu, Zichang Tan, Yifu Zhang, Xiaoqing Ye, and Jingdong Wang. Rethinking the open-loop evaluation of end-to-end autonomous driving in nuscenes. arXiv preprint arXiv:2305.10430, 2023.
[34] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In ECCV, 2020.
[35] Xing Wei, Yifan Bai, Yongchao Zheng, Dahu Shi, and Yihong Gong. Autoregressive visual tracking. In CVPR, 2023.
[36] Yifan Bai, Zeyang Zhao, Yihong Gong, and Xing Wei. Artrackv2: Prompting autoregressive tracker where to look and how to describe. In CVPR, 2024.
[37] Zhiqi Li, Wenhai Wang, Hongyang Li, Enze Xie, Chonghao Sima, Tong Lu, Qiao Yu, and Jifeng Dai. Bevformer: Learning bird’s-eye-view representation from multi-camera images via spatiotemporal transformers. In ECCV, 2022.
[38] Haisong Liu, Yao Teng, Tao Lu, Haiguang Wang, and Limin Wang. Sparsebev: High-performance sparse 3d object detection from multi-camera videos. In ICCV, 2023.
[39] Xuewu Lin, Tianwei Lin, Zixiang Pei, Lichao Huang, and Zhizhong Su. Sparse4d: Multi-view 3d object detection with sparse spatial-temporal fusion. arXiv preprint arXiv:2211.10581, 2022.
[40] Yue Wang, Guizilini Vitor Campagnolo, Tianyuan Zhang, Hang Zhao, and Justin Solomon. Detr3d: 3d object detection from multi-view images via 3d-to-2d queries. In CoRL, 2022.
[41] Yingfei Liu, Junjie Yan, Fan Jia, Shuailin Li, Qi Gao, Tiancai Wang, Xiangyu Zhang, and Jian Sun. Petrv2: A unified framework for 3d perception from multi-camera images. In ICCV, 2023.
[42] Xuewu Lin, Tianwei Lin, Zixiang Pei, Lichao Huang, and Zhizhong Su. Sparse4d v2: Recurrent temporal fusion with sparse model. arXiv preprint arXiv:2305.14018, 2023.
[43] Bencheng Liao, Shaoyu Chen, Xinggang Wang, Tianheng Cheng, Qian Zhang, Wenyu Liu, and Chang Huang. Maptr: Structured modeling and learning for online vectorized hd map construction. In ICLR, 2023.
[44] Tianyu Li, Li Chen, Xiangwei Geng, Huijie Wang, Yang Li, Zhenbo Liu, Shengyin Jiang, Yuting Wang, Hang Xu, Chunjing Xu, et al. Topology reasoning for driving scenes. arXiv preprint arXiv:2304.05277, 2023.
[45] Raymond Alexanian, Bart Barlogie, and Susan Tucker. Vad-based regimens as primary treatment for multiple myeloma. American journal of hematology, 1990.
[46] Jiang-Tian Zhai, Ze Feng, Jinhao Du, Yongqiang Mao, Jiang-Jiang Liu, Zichang Tan, Yifu Zhang, Xiaoqing Ye, and Jingdong Wang. Rethinking the open-loop evaluation of end-to-end autonomous driving in nuscenes. arXiv preprint arXiv:2305.10430, 2023.
[47] Xiaosong Jia, Penghao Wu, Li Chen, Jiangwei Xie, Conghui He, Junchi Yan, and Hongyang Li. Think twice before driving: Towards scalable decoders for end-to-end autonomous driving. In CVPR, 2023.
[48] Shaoyu Chen, Bo Jiang, Hao Gao, Bencheng Liao, Qing Xu, Qian Zhang, Chang Huang, Wenyu Liu, and Xinggang Wang. Vadv2: End-to-end vectorized autonomous driving via probabilistic planning. arXiv preprint arXiv:2402.13243, 2024.
[49] Alexey Dosovitskiy, German Ros, Felipe Codevilla, Antonio Lopez, and Vladlen Koltun. Carla: An open urban driving simulator. In CoRL, 2017.
[50] Bin Xie, Jiale Cao, Jin Xie, Fahad Shahbaz Khan, and Yanwei Pang. Sed: A simple encoder-decoder for open-vocabulary semantic segmentation. In CVPR, 2024.
[51] Yuxin Fang, Quan Sun, Xinggang Wang, Tiejun Huang, Xinlong Wang, and Yue Cao. Eva-02: A visual representation for neon genesis. arXiv preprint arXiv:2303.11331, 2023.

Appendix

Appendix A Datasets Details

To align with the requirements of VLM-based models, a necessary step is to transform all evaluation datasets into a textual format, specifically structured as question-answer pairs. In this section, we describe each dataset in detail: 3D object detection (§A.1), 3D lane detection (§A.2), driving captioning (§A.3), and ego planning (§A.4).

A.1 3D Object Detection

The 3D object detection dataset used to evaluate VLM-based methods (2D/3D visual tokenizers with an LLM) is built on nuScenes caesar2020nuscenes . To adapt to the inputs and outputs of the LLM, we convert the detection task into a text-format question-answer task. The question is randomly sampled from the pool listed in Table 6. As shown there, we set a special token ‘<query>’ to accept tokens from the 3D tokenizers. If the inputs are six-view images, we replace the text ‘They are uniformly represented as queries embeddings<query>’ in the question with ‘They represent left rear image<query>, left front image<query>, direct front image<query>, right front image<query>, right rear image<query>, and direct rear image<query>.’. As for the answer, we use the category name and 3D center point of each bounding box, as shown in Figure 5. To facilitate more efficient localization, we discretize the bird’s-eye view (BEV) space ranging from -50 meters to +50 meters into 1,000 bins.
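The discretization described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the helper names `to_bin`, `from_bin`, and `encode_center`, the clamping of out-of-range values, the bin-center decoding, and the bracketed text format are all assumptions. Only the range (-50 m to +50 m) and the bin count (1,000) come from the text, which together imply a resolution of 0.1 m per bin.

```python
import math

# Range and bin count are taken from the text; everything else is hypothetical.
BEV_MIN = -50.0   # metres
BEV_MAX = 50.0    # metres
NUM_BINS = 1000
BIN_SIZE = (BEV_MAX - BEV_MIN) / NUM_BINS  # 0.1 m per bin


def to_bin(value_m: float) -> int:
    """Map a metric coordinate to an integer bin index in [0, NUM_BINS - 1]."""
    idx = math.floor((value_m - BEV_MIN) * NUM_BINS / (BEV_MAX - BEV_MIN))
    # Assumption: coordinates outside the BEV range are clamped to the edge bins.
    return max(0, min(NUM_BINS - 1, idx))


def from_bin(idx: int) -> float:
    """Map a bin index back to the centre of its bin, in metres (assumed decoding)."""
    return BEV_MIN + (idx + 0.5) * BIN_SIZE


def encode_center(x_m: float, y_m: float) -> str:
    """Format a BEV centre point as text; the bracketed format is an assumption."""
    return f"[{to_bin(x_m)}, {to_bin(y_m)}]"
```

Under these assumptions, decoding to the bin center bounds the round-trip error for in-range coordinates by half a bin, i.e. 0.05 m.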

Table 6: Question pool of 3D object detection for VLM-based methods.

A.2 3D Lane Detection

We formulate a 3D lane detection dataset with question-answer pairs based on the OpenLane-V2 Subset-B wang2024openlane , which itself originates from the nuScenes dataset. A representative example is shown in Figure 5. The questions are sampled from Table 7, and the corresponding answer comprises a set of four lane points. Analogous to the 3D object detection dataset, we discretize the BEV space, spanning from -50 meters to +50 meters, into 1,000 bins.

Table 7: Question pool of 3D lane detection for VLM-based methods.

A.3 Driving Captioning

Our driving captioning dataset is created through the annotation of nuScenes, leveraging the capabilities of GPT-4V. The specific prompt utilized in GPT-4V is detailed in Table 8, while an illustrative example is presented in Figure 5. It is worth mentioning that, to harness the full potential of GPT-4V, we request a unique description for each individual view, resulting in a total of approximately 180k question-answer pairs.

Table 8: Prompt used in GPT-4V for caption generation.

A.4 Ego Planning

Similar to 3D object and lane detection, we adapt the nuScenes dataset into a question-answer format. Following the chain-of-thought approach, we prompt our model to generate safe driving plans and to describe various ego states, such as velocity and acceleration. The specific questions used are sampled from Table 9. For the answers, the model first predicts the current velocity and acceleration and then generates the ego-car’s planning waypoints for the next 3 seconds at 0.5-second intervals. As in 3D object detection and 3D lane detection, we discretize the BEV space, which ranges from -50 to +50 meters, into 1,000 bins. Similarly, we discretize both velocity and acceleration across a range from -50 m/s (m/s²) to +50 m/s (m/s²) into 1,000 bins each.
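To make the answer layout concrete, the sketch below assembles a planning answer in the order described above: ego states first, then six waypoints covering 3 seconds at 0.5-second intervals. It is a hypothetical illustration: the function names, the two-component (x, y) representation of velocity and acceleration, the clamping behaviour, and the answer template are assumptions, not the format used by the authors.

```python
HORIZON_S = 3.0   # planning horizon, from the text
STEP_S = 0.5      # waypoint interval, from the text
NUM_WAYPOINTS = int(HORIZON_S / STEP_S)  # 6 waypoints


def quantize(value, lo=-50.0, hi=50.0, bins=1000):
    """Discretize a value in [lo, hi] into one of `bins` integer indices."""
    idx = int((value - lo) * bins / (hi - lo))
    return max(0, min(bins - 1, idx))  # assumption: clamp out-of-range values


def waypoint_times():
    """Timestamps (seconds into the future) of the planned waypoints."""
    return [STEP_S * (i + 1) for i in range(NUM_WAYPOINTS)]


def build_answer(velocity, acceleration, waypoints):
    """Compose a text answer: ego states first, then the waypoints.

    All arguments are (x, y) pairs in metric units; this layout and the
    template below are assumptions made for illustration only.
    """
    if len(waypoints) != NUM_WAYPOINTS:
        raise ValueError(f"expected {NUM_WAYPOINTS} waypoints, got {len(waypoints)}")

    def pair(p):
        return f"[{quantize(p[0])}, {quantize(p[1])}]"

    states = f"velocity: {pair(velocity)}, acceleration: {pair(acceleration)}"
    plan = ", ".join(pair(w) for w in waypoints)
    return f"{states}, waypoints: {plan}"
```

Emitting the ego states before the waypoints follows the chain-of-thought ordering stated in the text: the model commits to its estimate of the current motion before producing the trajectory.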

Table 9: Question pool of ego planning for VLM-based methods.

Appendix B Model Details

B.1 3D Tokenizers Pre-training

We pre-train two distinct 3D tokenizers: StreamPETR wang2023exploring and TopoMLP wu2023topomlp . StreamPETR wang2023exploring is designed for multi-view 3D object detection. We utilize a ViT-L backbone eva02 and process images at a high resolution of 800×1600. Moreover, we follow the official training schedule established for the nuScenes dataset. TopoMLP wu2023topomlp focuses on constructing vectorized maps from multiple views. To maintain methodological consistency with StreamPETR, we employ the same ViT-L backbone and resolution. The training strategy for TopoMLP also follows the official schedule.

B.2 3D-tokenized LLM

Query Representation. To exploit the innate priors of the 3D physical world, we adopt the query-based BEV framework. The DETR-style methods StreamPETR and TopoMLP extract target-aware query embeddings, i.e., query representations (content), together with reference points (localization) to represent objects from multi-view images.

Reference Point Embeddings. As previously mentioned, a target is characterized by both its content and location. We add reference point embeddings, which are generated from the reference points via a single linear layer, to the query embeddings to form the 3D tokens that represent target information. A notable aspect of our setup is that we initialize the weight of the reference point projector to zero.
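The construction of the 3D tokens can be sketched in a few lines. The example below uses plain Python lists as a stand-in for what would normally be a tensor-library linear layer; the function names, the dimensions, and the zero-initialized bias are assumptions (the text only states that the projector weight is initialized to zero).

```python
def linear(x, weight, bias):
    """Compute y = W x + b for one input vector; weight has shape (out_dim, in_dim)."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]


def make_zero_projector(in_dim, out_dim):
    """Reference point projector with zero weight (from the text).

    Assumption: the bias is also zero, so the projector outputs zeros at init.
    """
    weight = [[0.0] * in_dim for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weight, bias


def build_3d_tokens(query_embeds, ref_points, weight, bias):
    """3D token = query embedding (content) + projected reference point (location)."""
    tokens = []
    for query, point in zip(query_embeds, ref_points):
        ref_embed = linear(point, weight, bias)
        tokens.append([q + r for q, r in zip(query, ref_embed)])
    return tokens
```

With this initialization the tokens equal the query embeddings at the start of training, and the location term only contributes once the projector has been updated. One plausible reading is that this lets training begin from the content embeddings alone, though the text does not state the motivation.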

Memory Queue. Taking inspiration from StreamPETR, we store historical queries in memory queues to preserve temporal continuity. Specifically, we concatenate these memory queries with the current queries for temporal modeling. Our method stores the highest-confidence queries (the top-K queries, with K set to 256 in our implementation) from three additional frames. The queues are managed on a first-in, first-out (FIFO) basis.
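A minimal sketch of such a memory queue is given below. The class and method names are hypothetical, and two details are assumptions rather than statements from the text: the top-K selection is applied per frame, and memory queries are placed before the current queries when concatenating.

```python
from collections import deque

TOP_K = 256             # from the text
NUM_HISTORY_FRAMES = 3  # from the text


class MemoryQueue:
    """FIFO store of the most confident queries from recent frames."""

    def __init__(self, num_frames=NUM_HISTORY_FRAMES, top_k=TOP_K):
        self.top_k = top_k
        # A bounded deque drops the oldest frame automatically (first-in, first-out).
        self.frames = deque(maxlen=num_frames)

    def push(self, queries, scores):
        """Keep the top-K queries of one frame, ranked by confidence score."""
        order = sorted(range(len(queries)), key=lambda i: scores[i], reverse=True)
        self.frames.append([queries[i] for i in order[: self.top_k]])

    def with_memory(self, current_queries):
        """Concatenate stored queries with the current frame's queries.

        Assumption: memory queries come first; the text does not fix the order.
        """
        memory = [q for frame in self.frames for q in frame]
        return memory + list(current_queries)
```

With the stated settings, at most 3 × 256 = 768 historical queries would accompany the current queries under the per-frame assumption.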

Our 3D-tokenized LLM, Atlas, integrates the 3D tokenizers described earlier with an LLM, specifically Vicuna-1.5. This LLM has been pre-trained on a diverse open-world data corpus, providing a robust foundation for understanding and processing spatial-temporal information. Atlas follows most of the basic settings in Merlin, with a batch size of 1, a learning rate of 2e-5, and the AdamW optimizer with a weight decay of 1e-4. We apply a linear warm-up over the first 3% of the total training steps, followed by a cosine learning rate schedule. The maximum length of prompt tokens is 4096.
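The learning rate schedule described above can be written as a single function of the training step. This is a sketch under stated assumptions: the peak rate (2e-5) and the warm-up fraction (3%) come from the text, while the function name, the exact warm-up formula, and the decay to a minimum rate of zero are illustrative choices.

```python
import math

BASE_LR = 2e-5        # peak learning rate, from the text
WARMUP_RATIO = 0.03   # first 3% of steps, from the text


def learning_rate(step, total_steps, base_lr=BASE_LR,
                  warmup_ratio=WARMUP_RATIO, min_lr=0.0):
    """Linear warm-up followed by cosine decay.

    Assumptions: the warm-up ramps as (step + 1) / warmup_steps, and the
    cosine phase decays towards `min_lr` (zero here) at `total_steps`.
    """
    warmup_steps = max(1, round(total_steps * warmup_ratio))
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

For example, with 1,000 total steps the first 30 steps ramp linearly up to 2e-5, after which the rate follows a half-cosine down towards the minimum.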

Appendix C 3D Detection Results

Precision-Recall Curve. In the main text, we compare the F1 scores of task-specific models and Atlas in 3D detection, focusing on predictions with a confidence score above $0.3$ , the setting that yielded the highest F1 score. Additionally, we illustrate the performance variations of PETR, StreamPETR, and Atlas through the Precision-Recall curves at different positive thresholds, as shown in Figure 6. It is important to note that Atlas does not generate confidence scores; therefore, we treat all its predictions as positive samples when calculating precision and recall. Although Atlas shows slightly weaker performance in making fine-grained predictions (specifically at a threshold of 0.5 meters), it excels in scenarios with larger thresholds. This observation suggests that LLM-based models like Atlas might struggle with highly precise numerical predictions but perform well when broader tolerances are acceptable.
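The effect of the distance threshold on precision, recall, and F1 can be illustrated with a small sketch. It treats every prediction as a positive, as is done for Atlas, and counts a prediction as a true positive when it lies within the threshold of a still-unmatched ground-truth center. The greedy nearest-center matching, the 2D centers, the absence of per-class handling, and all function names are assumptions for illustration; the evaluation protocol used in the paper may differ.

```python
import math


def count_true_positives(preds, gts, threshold_m):
    """Greedily match each predicted centre to the nearest unmatched ground truth.

    A match counts only if the centre distance is within `threshold_m`.
    """
    unmatched = list(range(len(gts)))
    true_positives = 0
    for pred in preds:
        best, best_dist = None, threshold_m
        for j in unmatched:
            dist = math.dist(pred, gts[j])
            if dist <= best_dist:
                best, best_dist = j, dist
        if best is not None:
            unmatched.remove(best)  # each ground truth can be matched once
            true_positives += 1
    return true_positives


def precision_recall_f1(preds, gts, threshold_m):
    """All predictions are treated as positives (no confidence filtering)."""
    tp = count_true_positives(preds, gts, threshold_m)
    precision = tp / len(preds) if preds else 0.0
    recall = tp / len(gts) if gts else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom > 0 else 0.0
    return precision, recall, f1
```

In a toy case with predictions at (0, 0), (10, 0), and (30, 30) and ground-truth centers at (0.3, 0) and (11.5, 0), only one prediction matches at a 0.5 m threshold, whereas two match at 2 m, so both precision and recall rise as the tolerance is relaxed.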

Appendix D More Qualitative Results

D.1 Qualitative Results of 3D Detection

We visualize the prediction results of the Atlas model in 3D detection tasks, as shown in Figure 7. The results align well with our performance metrics, demonstrating a notably high recall rate. This high recall is particularly important in practical applications of autonomous driving, where accurately detecting every potential obstacle, like pedestrians, is critical. Furthermore, the model maintains its accuracy even in complex scenarios characterized by high pedestrian density or closely packed targets. Moreover, Atlas proves robust under challenging environmental conditions. For instance, even on rainy days, the model continues to perform strongly. This resilience is essential for the reliability needed in real-world applications, ensuring consistent performance regardless of weather conditions.

D.2 Qualitative Results of 3D Lane Detection

We visualize the results of Atlas on 3D lane detection in Figure 8. While its quantitative performance does not surpass task-specific models, Atlas demonstrates noteworthy qualitative performance. As shown, our model performs well in challenging road situations, accurately recognizing road crossings and diverging lanes.

D.3 Qualitative Results of Planning

We also demonstrate the adaptability of Atlas’s driving plans across various weather conditions in Figure 9. Notably, even during rain, Atlas effectively plans its future trajectories with considerable diversity. This capability underscores the model’s robustness in challenging environments. Furthermore, Atlas maintains compliance with traffic signals, such as stopping at red lights, without having undergone specific training for traffic light recognition. This highlights the model’s ability to apply world knowledge inherited from the LLM. Additionally, the model’s diverse planning strategy enables it to balance maintaining its current lane against executing lane changes for overtaking. This flexibility greatly enhances the variety of possible travel routes, adapting dynamically to the flow of traffic and road conditions.

Appendix E Failure Cases

Discussing error examples in our model, Atlas, provides valuable insights that can guide future improvements. In this section, we analyze two primary types of failure observed during our experiments:

Overly Conservative Behavior. Atlas tends to make overly conservative decisions, favoring caution even when the path ahead is clear, as shown in Figure 11. This behavior results in a lower travel efficiency as the model opts to prioritize safety excessively. Our analysis suggests that this conservatism is likely rooted in the sampling bias of the nuScenes dataset. This dataset predominantly includes safer driving examples and favors lower-speed scenarios, which may have influenced Atlas’ decision-making strategy. To address this issue, incorporating a substantial amount of closed-loop data could be beneficial. This would provide Atlas with more dynamic and varied driving scenarios, potentially reducing its overly conservative tendencies.

Violation of Traffic Regulations. Despite Atlas having learned to adhere to several traffic rules, it occasionally fails to comply with traffic light signals, as shown in Figure 11. Specifically, Atlas may proceed through intersections during a red light. This error stems from the model’s lack of explicit traffic light information in its current framework. To mitigate this issue, integrating enhanced traffic-related data queries could be crucial. By providing Atlas with more explicit and detailed traffic signal information, we can improve its compliance with traffic laws and overall decision-making accuracy.

These findings highlight critical areas for further research and development. Enhancing the dataset and incorporating explicit models of traffic elements such as lights and signs are promising avenues for improving Atlas’ performance and reliability.

Appendix F Broader Impact

Our work focuses on designing a VLM-based solution for autonomous driving. Training our model with a large language model is computationally demanding, which may result in significant environmental impacts. Also, researchers with constrained computational resources may find it challenging to follow our work. Beyond these, our work has no other notable broader impacts.