Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                

Sample-efficient Imitative Multi-token Decision Transformer for Generalizable Real World Driving

Hang Zhou1, Dan Xu2,†, Yiding Ji1,†
1The Hong Kong University of Science and Technology (Guangzhou)
2The Hong Kong University of Science and Technology
hzhou269@connect.hkust-gz.edu.cn, danxu@cse.ust.hk, jiyiding@hkust-gz.edu.cn
Abstract

Reinforcement learning via sequence modeling has shown remarkable promise in autonomous systems, harnessing the power of offline datasets to make informed decisions in simulated environments. However, the full potential of such methods in complex dynamic environments remain to be discovered. In autonomous driving domain, learning-based agents face significant challenges when transferring knowledge from simulated to real-world settings and the performance is also significantly impacted by data distribution shift. To address these issue, we propose Sample-efficient Imitative Multi-token Decision Transformer (SimDT). SimDT introduces multi-token prediction, imitative online learning and prioritized experience replay to Decision Transformer. The performance is evaluated through empirical experiments and results exceed popular imitation and reinforcement learning algorithms on Waymax benchmark.

Keywords: Reinforcement Learning, Motion Planning, Autonomous Driving

1 Introduction

The realm of autonomous driving research has witnessed remarkable progress, with simulation technologies [1][2][3][4] reaching unprecedented levels of realism and the burgeoning availability of real-world driving datasets [5][6][7][8]. Despite these advancements, data-driven planning continues to confront a formidable obstacle: the infinite state space and extensive data distribution characteristic of real-world driving.

Imitation learning approaches encounter hurdles [9][10] when presented with scenarios that deviate from the training distribution, exemplified by rare events like emergency braking for unforeseen obstacles. Similarly, these methods grapple with long-tail distribution phenomena, such as navigating through unexpected weather conditions or handling the erratic movements of a jaywalking pedestrian. On the other hand, reinforcement learning (RL) strategies aim to cultivate policies through reward-based learning. RL has difficulty bridging the sim-real gap and sampling efficiency [11]. It often struggle to extrapolate a driving policy that encapsulates the nuanced decision-making process of an experienced human driver, especially when the simulator lacks interactivity or the scenario falls short of realism [12].

Traditional reinforcement learning approaches also struggle with large state space, long-horizon planning and sparse rewards, which are also characteristic of real-world driving scenarios. Decision Transformer [13] leverages a transformer-based architecture to learn policies for decision-making in reinforcement learning tasks via sequence modeling. Despite its potential on scaling with large state space [14], the original architecture and pipeline is deigned for offline learning and is not enough for complex and dynamic autonomous driving task. Classic RL techniques such as prioritized experience replay [15] which is dealing with large scale dataset cannot naturally be applied as Decision Transformer does not compute temporal-difference.

On the other hand, concentrating solely on single-token prediction renders the model excessively susceptible to immediate contextual patterns, thereby neglecting the necessity for more extensive deliberation over protracted sequences. Models trained through next-token prediction methodologies necessitate substantial dataset to achieve a degree of intelligence that humans attain with considerably less token exposure [16]. Receding Horizon Control[17] is a control method that optimizes decision-making over a rolling time horizon, constantly updating its strategy based on newly acquired information. This approach is analogous to multi-token prediction in decision transformers and has potential shifting from myopic to panoramic prediction closed to human cognitive processes.

This paper seeks to address these challenges by proposing an improved Decision Transformer network and a hybrid learning framework that leverages the complementary strengths of imitation and reinforcement learning. Experiment results indicate that our approach yields a substantial enhancement in performance with improvements observed in terms of policy robustness and sample efficiency. The main contributions are as follows:

  • We present a fully online imitative Decision Transformer pipeline designed for wide data distribution across large-scale real-world driving dataset.

  • We propose multi-token Decision Transformer architecture for receding horizon control to enhance long-horizon prediction and broaden attention field.

  • We introduce prioritized experience replay to Decision Transformer and enables sample-efficient training for large-scale sequence modelling based reinforcement learning.

Refer to caption
Figure 1: Comparative Illustration of Learning Approaches. The left figure depicts a data distribution of expert data, highlighting its limitations in managing distributional shifts and challenges arising from suboptimal training data. In contrast, the right figure presents our imitative reinforcement learning pipeline that demonstrates enhanced robustness by adapting policies online, thereby achieving superior performance under variable conditions.

2 Related Work

Reinforcement learning via sequence modeling. Trajectory Transformer [18] and Decision Transformer (DT) [13] are pioneer in this area, leveraging transformer architectures to model sequences of state-action-reward trajectories and predicting future actions in offline manner. Following work [19] [20] [21] [22] extends leverage the power of transformers for efficient and generalized decision-making in RL. Online DT [23] and Hyper DT [24] adapt original concept for online settings and interacts with environments. However, previous work are done on relatively simple environments compared to autonomous driving environment.

Multi-token prediction. Transformers have significantly impacted NLP since their inception [25], outperforming RNNs and LSTMs by processing sequences in parallel and efficiently handling long-range dependencies. Subsequent models like GPT [26] and BERT [27] have refined the architecture, enhancing pre-training, fine-tuning, and scalability. Recent studies explore multi-token prediction on semantic representation [28], streamline computation [29], prediction technique [30] and multilingual [31]. However,Focusing only on single-token prediction makes the model too sensitive to immediate context and overlooks the need for deeper analysis of longer sequences[16]. This paper extends the concept to Decision Transformer and explore the potential benefits of multi-token prediction for motion planning.

Learning with real-world driving data. Many work has been done to accommodate with real-world driving data [5][2][6][7] for generalizable driving policy. Lu et al. [32] explores the cooperation between reinforcement learning and imitation learning for real-world driving data. DriveIRL [33] designs an inverse reinforcement learning architecture to learn score component in complex heavy traffic scenarios. TuPlan [9] combines both learning methods with rule-based method for real-world planning. Trajeglish [10] models real world traffic auto-repressively as language processing problem and it has most similar network structure as MSDT. Our approach differs as we apply imitative reinforcement learning for pretraining and online real-world planning adaption for performance enhancement.

3 Methods

In this section, we introduce MSDT, a multi-token sample-efficient reinforcement learning framework via sequence modelling for dynamic driving scenarios. MSDT consists of three components: multi-token decision transformer, online imitative pipeline and prioritized experience replay.

Refer to caption
Figure 2: The general network architecture of SimDT. Feature encoding is applied to extract complex real-world driving perceptual data to small but meaningful embeddings. Inside causal transformer, attention relationship are calculated for past context length of [(st,at,gt)]TcTsuperscriptsubscriptdelimited-[]subscript𝑠𝑡subscript𝑎𝑡subscript𝑔𝑡𝑇𝑐𝑇[(s_{t},a_{t},g_{t})]_{T-c}^{T}[ ( italic_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_g start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) ] start_POSTSUBSCRIPT italic_T - italic_c end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT. The decoder now predicts multi-tokens for actions and only the first action is applied during the inference stage.

3.1 Network Structure

Since the real-world driving environment is complex and dynamic, specific feature encoding network is designed for the states representation. Real-world driving state contains many perceptual information such as obstacles, road map, traffic and so on. We follow the vectorized representation to organize road map as polylines and then extract with Polyline Encoder [34]. Obstacle with past 10 historical information are recorded in terms of [px,py,vx,vy,l,w]subscript𝑝𝑥subscript𝑝𝑦subscript𝑣𝑥subscript𝑣𝑦𝑙𝑤\bm{[}p_{x},p_{y},v_{x},v_{y},l,w]bold_[ italic_p start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT , italic_p start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT , italic_v start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT , italic_v start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT , italic_l , italic_w ]. Obstacle and traffic embedding are extracted with multi-layer perception network.

The work further extends the method to goal-conditioned reinforcement learning by adding the relative vector distance between ego vehicle and destination. The importance of goal-condition lies in its influence on the decision-making process of the autonomous agent. Even in an identical environment, the actions taken by the vehicle can vary significantly depending on the specified goal.

Multi-token prediction in causal transformer simultaneously generates multiple tokens in a single forward pass, while still respecting the autoregressive property that ensures each prediction only depends on previously generated tokens. This is typically achieved by using a masked self-attention mechanism that allows the model to consider multiple future positions without violating causal dependencies. As shown in Equation (1), next-token prediction has loss function Lasubscript𝐿𝑎L_{a}italic_L start_POSTSUBSCRIPT italic_a end_POSTSUBSCRIPT defined as the negative log-likelihood of the policy.

La=logπθ(atst:tc,at1:tc,gt:tc)subscript𝐿𝑎subscript𝜋𝜃conditionalsubscript𝑎𝑡subscript𝑠:𝑡𝑡𝑐subscript𝑎:𝑡1𝑡𝑐subscript𝑔:𝑡𝑡𝑐L_{a}=-\log\pi_{\theta}(a_{t}\mid s_{t:t-c},a_{t-1:t-c},g_{t:t-c})\ italic_L start_POSTSUBSCRIPT italic_a end_POSTSUBSCRIPT = - roman_log italic_π start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( italic_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ∣ italic_s start_POSTSUBSCRIPT italic_t : italic_t - italic_c end_POSTSUBSCRIPT , italic_a start_POSTSUBSCRIPT italic_t - 1 : italic_t - italic_c end_POSTSUBSCRIPT , italic_g start_POSTSUBSCRIPT italic_t : italic_t - italic_c end_POSTSUBSCRIPT ) (1)

where πθsubscript𝜋𝜃\pi_{\theta}italic_π start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT is the training driving policy. maximize the probability of atsubscript𝑎𝑡a_{t}italic_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT as the next prediction action, given the history of past tokens with context length c𝑐citalic_c of st:tc=st,,stc.at1:tc=at1,,atc.gt:tc=gt,,gtcs_{t:t-c}=s_{t},\,...,\,s_{t-c}.\quad a_{t-1:t-c}=a_{t-1},\,...,\,a_{t-c}.% \quad g_{t:t-c}=g_{t},\,...,\,g_{t-c}italic_s start_POSTSUBSCRIPT italic_t : italic_t - italic_c end_POSTSUBSCRIPT = italic_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , … , italic_s start_POSTSUBSCRIPT italic_t - italic_c end_POSTSUBSCRIPT . italic_a start_POSTSUBSCRIPT italic_t - 1 : italic_t - italic_c end_POSTSUBSCRIPT = italic_a start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT , … , italic_a start_POSTSUBSCRIPT italic_t - italic_c end_POSTSUBSCRIPT . italic_g start_POSTSUBSCRIPT italic_t : italic_t - italic_c end_POSTSUBSCRIPT = italic_g start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , … , italic_g start_POSTSUBSCRIPT italic_t - italic_c end_POSTSUBSCRIPT.

The loss function is modified for multi-token prediction and assume network predict next 3 tokens. Where α𝛼\alphaitalic_α and β𝛽\betaitalic_β are the coefficient designed for network to learn more about current step action predictions.

Lma=subscript𝐿𝑚𝑎absent\displaystyle L_{ma}=italic_L start_POSTSUBSCRIPT italic_m italic_a end_POSTSUBSCRIPT = logπθ(atst:tc,at1:tc,gt:tc)subscript𝜋𝜃conditionalsubscript𝑎𝑡subscript𝑠:𝑡𝑡𝑐subscript𝑎:𝑡1𝑡𝑐subscript𝑔:𝑡𝑡𝑐\displaystyle-\log\pi_{\theta}(a_{t}\mid s_{t:t-c},a_{t-1:t-c},g_{t:t-c})- roman_log italic_π start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( italic_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ∣ italic_s start_POSTSUBSCRIPT italic_t : italic_t - italic_c end_POSTSUBSCRIPT , italic_a start_POSTSUBSCRIPT italic_t - 1 : italic_t - italic_c end_POSTSUBSCRIPT , italic_g start_POSTSUBSCRIPT italic_t : italic_t - italic_c end_POSTSUBSCRIPT ) (2)
αlogπθ(at+1st:tc,at1:tc,gt:tc)βlogπθ(at+2st:tc,at1:tc,gt:tc)𝛼subscript𝜋𝜃conditionalsubscript𝑎𝑡1subscript𝑠:𝑡𝑡𝑐subscript𝑎:𝑡1𝑡𝑐subscript𝑔:𝑡𝑡𝑐𝛽subscript𝜋𝜃conditionalsubscript𝑎𝑡2subscript𝑠:𝑡𝑡𝑐subscript𝑎:𝑡1𝑡𝑐subscript𝑔:𝑡𝑡𝑐\displaystyle-\alpha*\log\pi_{\theta}(a_{t+1}\mid s_{t:t-c},a_{t-1:t-c},g_{t:t% -c})-\beta*\log\pi_{\theta}(a_{t+2}\mid s_{t:t-c},a_{t-1:t-c},g_{t:t-c})- italic_α ∗ roman_log italic_π start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( italic_a start_POSTSUBSCRIPT italic_t + 1 end_POSTSUBSCRIPT ∣ italic_s start_POSTSUBSCRIPT italic_t : italic_t - italic_c end_POSTSUBSCRIPT , italic_a start_POSTSUBSCRIPT italic_t - 1 : italic_t - italic_c end_POSTSUBSCRIPT , italic_g start_POSTSUBSCRIPT italic_t : italic_t - italic_c end_POSTSUBSCRIPT ) - italic_β ∗ roman_log italic_π start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( italic_a start_POSTSUBSCRIPT italic_t + 2 end_POSTSUBSCRIPT ∣ italic_s start_POSTSUBSCRIPT italic_t : italic_t - italic_c end_POSTSUBSCRIPT , italic_a start_POSTSUBSCRIPT italic_t - 1 : italic_t - italic_c end_POSTSUBSCRIPT , italic_g start_POSTSUBSCRIPT italic_t : italic_t - italic_c end_POSTSUBSCRIPT )

3.2 Online Imitative Training Pipeline

Algorithm 1 Online Imitative Training Pipeline
Initialize Transition Replay Buffer Dtranssubscript𝐷𝑡𝑟𝑎𝑛𝑠D_{trans}italic_D start_POSTSUBSCRIPT italic_t italic_r italic_a italic_n italic_s end_POSTSUBSCRIPT for capacity A, Trajectory Replay Buffer Dtrajsubscript𝐷𝑡𝑟𝑎𝑗D_{traj}italic_D start_POSTSUBSCRIPT italic_t italic_r italic_a italic_j end_POSTSUBSCRIPT
while nnum_scenarios𝑛𝑛𝑢𝑚_𝑠𝑐𝑒𝑛𝑎𝑟𝑖𝑜𝑠n\leq num\_scenariositalic_n ≤ italic_n italic_u italic_m _ italic_s italic_c italic_e italic_n italic_a italic_r italic_i italic_o italic_s do
     while Dtranssubscript𝐷𝑡𝑟𝑎𝑛𝑠D_{trans}italic_D start_POSTSUBSCRIPT italic_t italic_r italic_a italic_n italic_s end_POSTSUBSCRIPT is not full do \triangleright Online Data Collection
         if n0.5num_scenarios𝑛0.5𝑛𝑢𝑚_𝑠𝑐𝑒𝑛𝑎𝑟𝑖𝑜𝑠n\leq 0.5*num\_scenariositalic_n ≤ 0.5 ∗ italic_n italic_u italic_m _ italic_s italic_c italic_e italic_n italic_a italic_r italic_i italic_o italic_s then
              reproduce scenarios with Human Expert Driving Data, Dtrans(s,a,r)subscript𝐷𝑡𝑟𝑎𝑛𝑠𝑠𝑎𝑟D_{trans}\leftarrow(s,a,r)italic_D start_POSTSUBSCRIPT italic_t italic_r italic_a italic_n italic_s end_POSTSUBSCRIPT ← ( italic_s , italic_a , italic_r )
         else
              reproduce scenarios with Human Expert Driving Data, Dtrans(s,a,r)subscript𝐷𝑡𝑟𝑎𝑛𝑠𝑠𝑎𝑟D_{trans}\leftarrow(s,a,r)italic_D start_POSTSUBSCRIPT italic_t italic_r italic_a italic_n italic_s end_POSTSUBSCRIPT ← ( italic_s , italic_a , italic_r )
              explore scenarios with Policy agent πθsubscript𝜋𝜃\pi_{\theta}italic_π start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT , Dtrans(s,a,r)subscript𝐷𝑡𝑟𝑎𝑛𝑠𝑠𝑎𝑟D_{trans}\leftarrow(s,a,r)italic_D start_POSTSUBSCRIPT italic_t italic_r italic_a italic_n italic_s end_POSTSUBSCRIPT ← ( italic_s , italic_a , italic_r )
         end if
     end while
     HindsightReturnRelabeling: DtrajDtranssubscript𝐷𝑡𝑟𝑎𝑗subscript𝐷𝑡𝑟𝑎𝑛𝑠D_{traj}\leftarrow D_{trans}italic_D start_POSTSUBSCRIPT italic_t italic_r italic_a italic_j end_POSTSUBSCRIPT ← italic_D start_POSTSUBSCRIPT italic_t italic_r italic_a italic_n italic_s end_POSTSUBSCRIPT, [[(si,j,ai,j,gi,j)]i=0T]j=0A/T[(si,ai,ri)]i=0Asuperscriptsubscriptdelimited-[]superscriptsubscriptdelimited-[]subscript𝑠𝑖𝑗subscript𝑎𝑖𝑗subscript𝑔𝑖𝑗𝑖0𝑇𝑗0𝐴𝑇superscriptsubscriptdelimited-[]subscript𝑠𝑖subscript𝑎𝑖subscript𝑟𝑖𝑖0𝐴[[(s_{i,j},a_{i,j},g_{i,j})]_{i=0}^{T}]_{j=0}^{A/T}\leftarrow\left[(s_{i},a_{i% },r_{i})\right]_{i=0}^{A}[ [ ( italic_s start_POSTSUBSCRIPT italic_i , italic_j end_POSTSUBSCRIPT , italic_a start_POSTSUBSCRIPT italic_i , italic_j end_POSTSUBSCRIPT , italic_g start_POSTSUBSCRIPT italic_i , italic_j end_POSTSUBSCRIPT ) ] start_POSTSUBSCRIPT italic_i = 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT ] start_POSTSUBSCRIPT italic_j = 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_A / italic_T end_POSTSUPERSCRIPT ← [ ( italic_s start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_a start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_r start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) ] start_POSTSUBSCRIPT italic_i = 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_A end_POSTSUPERSCRIPT
     Dtranssubscript𝐷𝑡𝑟𝑎𝑛𝑠D_{trans}\leftarrow\emptysetitalic_D start_POSTSUBSCRIPT italic_t italic_r italic_a italic_n italic_s end_POSTSUBSCRIPT ← ∅
     for k in range(1000) do:
         sample and ShuffleObstacleOrder: [[(si,j,ai,j,gi,j)]tct]j=0BDtrajsuperscriptsubscriptdelimited-[]superscriptsubscriptdelimited-[]subscript𝑠𝑖𝑗subscript𝑎𝑖𝑗subscript𝑔𝑖𝑗𝑡𝑐𝑡𝑗0𝐵subscript𝐷𝑡𝑟𝑎𝑗[[(s_{i,j},a_{i,j},g_{i,j})]_{t-c}^{t}]_{j=0}^{B}\leftarrow D_{traj}[ [ ( italic_s start_POSTSUBSCRIPT italic_i , italic_j end_POSTSUBSCRIPT , italic_a start_POSTSUBSCRIPT italic_i , italic_j end_POSTSUBSCRIPT , italic_g start_POSTSUBSCRIPT italic_i , italic_j end_POSTSUBSCRIPT ) ] start_POSTSUBSCRIPT italic_t - italic_c end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ] start_POSTSUBSCRIPT italic_j = 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_B end_POSTSUPERSCRIPT ← italic_D start_POSTSUBSCRIPT italic_t italic_r italic_a italic_j end_POSTSUBSCRIPT
         train on sampled data
     end for
end while

The general idea of the proposed algorithm is to perform sample-efficient online imitative reinforcement learning with off-policy expert data for pre-training at beginning. Subsequently, the model undergoes a mixed on-policy adaptation phase which is introduced at the mid-point of the training process. The core concept behind is to quickly shift the distribution towards the expert behavior at beginning and reduce environmental distribution shift with on-policy adaption. Note online adaption and imitative reinforcement learning are performed concurrently after mid of training, this helps the network no to fall into online local minimal.

Imitative reinforcement learning is done by applying similar concept as Shaped IL [35] and GRI [36] where reward is shaped for expert demonstration data. Following same implementation in [32], expert data from real world driving trajectory was converted to expert agent actions with inverse kinematics. We also design negative reward for offroad and overlap (collision) behavior. The network will learn good behavior through imitation reward and bad actions through online interaction with offraod and overlap rewards. The overall online imitative pipeline is essential to achieve the greater data-distributed policy described in Figure 1.

reward function:

Rimitaiton={1.0if log_divergence<0.2,0.0if log_divergence>0.2.subscript𝑅𝑖𝑚𝑖𝑡𝑎𝑖𝑡𝑜𝑛cases1.0if log_divergence0.20.0if log_divergence0.2R_{imitaiton}=\begin{cases}1.0&\text{if }\text{{log\_divergence}}<0.2,\\ 0.0&\text{if }\text{{log\_divergence}}>0.2.\end{cases}italic_R start_POSTSUBSCRIPT italic_i italic_m italic_i italic_t italic_a italic_i italic_t italic_o italic_n end_POSTSUBSCRIPT = { start_ROW start_CELL 1.0 end_CELL start_CELL if log_divergence < 0.2 , end_CELL end_ROW start_ROW start_CELL 0.0 end_CELL start_CELL if log_divergence > 0.2 . end_CELL end_ROW (3)
Roffroad=2subscript𝑅𝑜𝑓𝑓𝑟𝑜𝑎𝑑2R_{offroad}=-2italic_R start_POSTSUBSCRIPT italic_o italic_f italic_f italic_r italic_o italic_a italic_d end_POSTSUBSCRIPT = - 2 (4)
Roverlap=10subscript𝑅𝑜𝑣𝑒𝑟𝑙𝑎𝑝10R_{overlap}=-10italic_R start_POSTSUBSCRIPT italic_o italic_v italic_e italic_r italic_l italic_a italic_p end_POSTSUBSCRIPT = - 10 (5)

However, the real-time collected transition level replay buffer does not contain return-to-go as it can only be calculated after episode is finished and all rewards is collected. Similar to Online Decision Transformer, the transition level replay buffer converted to hindsight trajectory replay buffer when fixed amount of trajectories are collected.

3.3 Prioritized Experience Replay for Decision Transformer

Algorithm 2 Prioritized Experience Replay for Decision Transformer
Initialize Prioritized Trajectory Replay Buffers Dsinglepersuperscriptsubscript𝐷𝑠𝑖𝑛𝑔𝑙𝑒𝑝𝑒𝑟D_{single}^{per}italic_D start_POSTSUBSCRIPT italic_s italic_i italic_n italic_g italic_l italic_e end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_p italic_e italic_r end_POSTSUPERSCRIPT, Doverallpersuperscriptsubscript𝐷𝑜𝑣𝑒𝑟𝑎𝑙𝑙𝑝𝑒𝑟D_{overall}^{per}italic_D start_POSTSUBSCRIPT italic_o italic_v italic_e italic_r italic_a italic_l italic_l end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_p italic_e italic_r end_POSTSUPERSCRIPT with capacity B𝐵Bitalic_B
while nnum_scenarios𝑛num_scenariosn\leq\text{num\_scenarios}italic_n ≤ num_scenarios do
     Execute lines from Algorithm 1 \triangleright Online Data Collection
     HindsightReturnRelabeling: DtrajDtranssubscript𝐷𝑡𝑟𝑎𝑗subscript𝐷𝑡𝑟𝑎𝑛𝑠D_{traj}\leftarrow D_{trans}italic_D start_POSTSUBSCRIPT italic_t italic_r italic_a italic_j end_POSTSUBSCRIPT ← italic_D start_POSTSUBSCRIPT italic_t italic_r italic_a italic_n italic_s end_POSTSUBSCRIPT, [[(si,j,ai,j,gi,j)]i=0T]j=0A/T[(si,ai,ri)]i=0Asuperscriptsubscriptdelimited-[]superscriptsubscriptdelimited-[]subscript𝑠𝑖𝑗subscript𝑎𝑖𝑗subscript𝑔𝑖𝑗𝑖0𝑇𝑗0𝐴𝑇superscriptsubscriptdelimited-[]subscript𝑠𝑖subscript𝑎𝑖subscript𝑟𝑖𝑖0𝐴[[(s_{i,j},a_{i,j},g_{i,j})]_{i=0}^{T}]_{j=0}^{A/T}\leftarrow\left[(s_{i},a_{i% },r_{i})\right]_{i=0}^{A}[ [ ( italic_s start_POSTSUBSCRIPT italic_i , italic_j end_POSTSUBSCRIPT , italic_a start_POSTSUBSCRIPT italic_i , italic_j end_POSTSUBSCRIPT , italic_g start_POSTSUBSCRIPT italic_i , italic_j end_POSTSUBSCRIPT ) ] start_POSTSUBSCRIPT italic_i = 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT ] start_POSTSUBSCRIPT italic_j = 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_A / italic_T end_POSTSUPERSCRIPT ← [ ( italic_s start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_a start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_r start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) ] start_POSTSUBSCRIPT italic_i = 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_A end_POSTSUPERSCRIPT
     Dtranssubscript𝐷𝑡𝑟𝑎𝑛𝑠D_{trans}\leftarrow\emptysetitalic_D start_POSTSUBSCRIPT italic_t italic_r italic_a italic_n italic_s end_POSTSUBSCRIPT ← ∅
     for k in range(1000) do:
         Sample and ShuffleObstacleOrder: [[(si,j,ai,j,gi,j)]tct]j=0BDtrajsuperscriptsubscriptdelimited-[]superscriptsubscriptdelimited-[]subscript𝑠𝑖𝑗subscript𝑎𝑖𝑗subscript𝑔𝑖𝑗𝑡𝑐𝑡𝑗0𝐵subscript𝐷𝑡𝑟𝑎𝑗[[(s_{i,j},a_{i,j},g_{i,j})]_{t-c}^{t}]_{j=0}^{B}\leftarrow D_{traj}[ [ ( italic_s start_POSTSUBSCRIPT italic_i , italic_j end_POSTSUBSCRIPT , italic_a start_POSTSUBSCRIPT italic_i , italic_j end_POSTSUBSCRIPT , italic_g start_POSTSUBSCRIPT italic_i , italic_j end_POSTSUBSCRIPT ) ] start_POSTSUBSCRIPT italic_t - italic_c end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ] start_POSTSUBSCRIPT italic_j = 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_B end_POSTSUPERSCRIPT ← italic_D start_POSTSUBSCRIPT italic_t italic_r italic_a italic_j end_POSTSUBSCRIPT
         train on sampled data and obtain Lsinglesubscript𝐿𝑠𝑖𝑛𝑔𝑙𝑒L_{single}italic_L start_POSTSUBSCRIPT italic_s italic_i italic_n italic_g italic_l italic_e end_POSTSUBSCRIPT and Loverallsubscript𝐿𝑜𝑣𝑒𝑟𝑎𝑙𝑙L_{overall}italic_L start_POSTSUBSCRIPT italic_o italic_v italic_e italic_r italic_a italic_l italic_l end_POSTSUBSCRIPT
         Dsingleper{[[(si,j,ai,j,gi,j)]tct]j,Lsingle}superscriptsubscript𝐷𝑠𝑖𝑛𝑔𝑙𝑒𝑝𝑒𝑟subscriptdelimited-[]superscriptsubscriptdelimited-[]subscript𝑠𝑖𝑗subscript𝑎𝑖𝑗subscript𝑔𝑖𝑗𝑡𝑐𝑡𝑗subscript𝐿𝑠𝑖𝑛𝑔𝑙𝑒D_{single}^{per}\leftarrow\{[[(s_{i,j},a_{i,j},g_{i,j})]_{t-c}^{t}]_{j},L_{% single}\}italic_D start_POSTSUBSCRIPT italic_s italic_i italic_n italic_g italic_l italic_e end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_p italic_e italic_r end_POSTSUPERSCRIPT ← { [ [ ( italic_s start_POSTSUBSCRIPT italic_i , italic_j end_POSTSUBSCRIPT , italic_a start_POSTSUBSCRIPT italic_i , italic_j end_POSTSUBSCRIPT , italic_g start_POSTSUBSCRIPT italic_i , italic_j end_POSTSUBSCRIPT ) ] start_POSTSUBSCRIPT italic_t - italic_c end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ] start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT , italic_L start_POSTSUBSCRIPT italic_s italic_i italic_n italic_g italic_l italic_e end_POSTSUBSCRIPT }
         Doverallper{[[(si,j,ai,j,gi,j)]tct]j,Loverall}superscriptsubscript𝐷𝑜𝑣𝑒𝑟𝑎𝑙𝑙𝑝𝑒𝑟subscriptdelimited-[]superscriptsubscriptdelimited-[]subscript𝑠𝑖𝑗subscript𝑎𝑖𝑗subscript𝑔𝑖𝑗𝑡𝑐𝑡𝑗subscript𝐿𝑜𝑣𝑒𝑟𝑎𝑙𝑙D_{overall}^{per}\leftarrow\{[[(s_{i,j},a_{i,j},g_{i,j})]_{t-c}^{t}]_{j},L_{% overall}\}italic_D start_POSTSUBSCRIPT italic_o italic_v italic_e italic_r italic_a italic_l italic_l end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_p italic_e italic_r end_POSTSUPERSCRIPT ← { [ [ ( italic_s start_POSTSUBSCRIPT italic_i , italic_j end_POSTSUBSCRIPT , italic_a start_POSTSUBSCRIPT italic_i , italic_j end_POSTSUBSCRIPT , italic_g start_POSTSUBSCRIPT italic_i , italic_j end_POSTSUBSCRIPT ) ] start_POSTSUBSCRIPT italic_t - italic_c end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ] start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT , italic_L start_POSTSUBSCRIPT italic_o italic_v italic_e italic_r italic_a italic_l italic_l end_POSTSUBSCRIPT }
     end for
     Train on Dsinglepersuperscriptsubscript𝐷𝑠𝑖𝑛𝑔𝑙𝑒𝑝𝑒𝑟D_{single}^{per}italic_D start_POSTSUBSCRIPT italic_s italic_i italic_n italic_g italic_l italic_e end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_p italic_e italic_r end_POSTSUPERSCRIPT and Doverallpersuperscriptsubscript𝐷𝑜𝑣𝑒𝑟𝑎𝑙𝑙𝑝𝑒𝑟D_{overall}^{per}italic_D start_POSTSUBSCRIPT italic_o italic_v italic_e italic_r italic_a italic_l italic_l end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_p italic_e italic_r end_POSTSUPERSCRIPT
end while

Prioritized Experience Replay (PER) selectively samples experiences with high temporal-difference errors from the replay buffer for focusing on more informative experiences. However, the Decision Transformer doesn’t use temporal-difference errors, precluding direct application of PER. Instead, we adapt by using action loss to gauge transition importance within the Decision Transformer, which assesses state-action-return relationships. The design concept is that if the model’s predicted actions diverge from actual ones, it indicates a misinterpretation of the environment.

On top of above architecture, extra replay buffers are designed to store prioritised sampled trajectories based on action loss. The action loss represents the difference between the actions predicted by the policy network and the actual actions taken. A low actor loss means that the policy network’s predictions are close to the actual actions, while a high actor loss means that the predictions are far from the actual actions. Prioritised sampled trajectories are stroed based on following criteria:

Criterion 1: Preservation of transitions with maximal single-step action discrepancy: This methodology concentrates on isolating the instances wherein the model’s prognostications manifest the greatest deviation from expected accuracy. Such a strategy is instrumental in directing the model’s learning efforts towards ameliorating its most significant errors.

Criterion 2: Preservation of transitions with maximal cumulative action discrepancy: This methodology is characterized by its emphasis on identifying and retaining sequences wherein the aggregate error of the model’s predictions reaches its apex. This approach holds particular utility for endeavors aimed at refining the model’s performance across a continuum of actions.

The replay buffers store data based on high value in low value out. The prioritized experience replay buffer is sampled for training every fixed amount of episode and its priorities are updated at meantime. The goal for the proposed prioritized experience replay in this paper is to prioritize the trajectories where model has biggest misunderstanding of the corresponding scenarios, and therefore to prioritize on long-tail scenarios.

4 Experimental Results

4.1 Experimental Setup

Agent Action Space Sim Agent Off-Road Rate (%) Collision Rate (%) Kinematic Infeasibility (%) ADE (m) Route Progress Ratio (%)
Expert Delta - 0.32 0.61 4.33 0.00 100.00
Expert Bicycle - 0.34 0.62 0.00 0.04 100.00
Expert Bicycle(D) - 0.41 0.67 0.00 0.09 100.00
Wayformer[37] Delta - 7.89 10.68 5.40 2.38 123.58
BC[38] Delta - 4.14±plus-or-minus\pm±2.04 5.83±plus-or-minus\pm±1.09 0.18±plus-or-minus\pm±0.16 6.28±plus-or-minus\pm±1.93 79.58±plus-or-minus\pm±24.98
BC Delta (D) - 4.42±plus-or-minus\pm±0.19 5.97±plus-or-minus\pm±0.10 66.25±plus-or-minus\pm±0.22 2.98±plus-or-minus\pm±0.06 98.82±plus-or-minus\pm±3.46
BC Bicycle - 13.59±plus-or-minus\pm±12.71 11.20±plus-or-minus\pm±5.34 0.00±plus-or-minus\pm±0.00 3.60±plus-or-minus\pm±1.11 137.11±plus-or-minus\pm±33.78
BC Bicycle(D) - 1.11±plus-or-minus\pm±0.20 4.59±plus-or-minus\pm±0.06 0.00±plus-or-minus\pm±0.00 2.26±plus-or-minus\pm±0.02 129.84±plus-or-minus\pm±0.98
DQN[39] Bicycle(D) IDM 3.74±plus-or-minus\pm±0.90 6.50±plus-or-minus\pm±0.31 0.00±plus-or-minus\pm±0.00 9.83±plus-or-minus\pm±0.48 177.91±plus-or-minus\pm±5.67
DQN Bicycle(D) Playback 4.31±plus-or-minus\pm±1.09 4.91±plus-or-minus\pm±0.70 0.00±plus-or-minus\pm±0.00 10.74±plus-or-minus\pm±0.53 215.26±plus-or-minus\pm±38.20
SimDT(ours) Bicycle Playback 3.36±plus-or-minus\pm±0.04 2.65±plus-or-minus\pm±0.06 0.00±plus-or-minus\pm±0.00 6.73±plus-or-minus\pm±0.41 -
Table 1: Performance evaluation are done against IDM simulation agents. Agents run without any termination conditions in WOD1.2 evaluation dataset. Action space is continuous unless denoted with D (discrete). Waymax benchmark table is used as we use exact same experiment settings.

Dataset, simulator and metrics. Training and Experiments are done based on Waymo Open Dataset and Waymax simulator. Waymax provides embedded support for reinforcement learning and diverse scenarios drawn from real driving data. Waymax incorporate with Waymo Open Motion Dataset (WOMD) which provides 531,101531101531,101531 , 101 real-world driving scenarios for training and 44,0964409644,09644 , 096 scenarios for validation, each scenario contains 90 frames of data. Specifically, WOMD v1.2 and exact same metrics (off-road rate, collision rate, kinematic infeasibility, average displacement error (ADE)) from Waymax are used to benchmark with the paper.

Implementation Detail. Models of various sizes are developed to quickly conduct ablation studies and assess final performance effectively. Raw observation takes nearest ego vehicle, 15 nearest dynamic obstacles, 250 of closest roadgraph elements, traffic signals and position goal as input. The total size for each step observation is 7050705070507050 and feature extraction is applied to reduce the total size. SimDT(tiny) has 256 tokens for each element of (s,a,g)𝑠𝑎𝑔(s,a,g)( italic_s , italic_a , italic_g ) pair, 6 blocks, 16 attention head and in total 7.7 million parameters. SimDT(small) has 384 tokens for each element of (s,a,g)𝑠𝑎𝑔(s,a,g)( italic_s , italic_a , italic_g ) pair, 10 blocks, 16 attention head and in total 22.2 million parameters. Both models use context length with value 10, which means causal transformer has access to past 10 (s,a,g)𝑠𝑎𝑔(s,a,g)( italic_s , italic_a , italic_g ) pairs.

4.2 Benchmark Comparison

SimDT is evaluated using Intelligent Driving Model (IDM)[40] as the simulated agent. SimDT achieves Off-Road Rate of 3.36%, Collision Rate of 2.65%, Kinematic Infeasibility of 0.00%, and ADE of 6.73m. SimDT significantly outperforms them in collision rate and being second in off-road rate against other learning-based approaches. Compared same reinforcement learning category method, SimDT demonstrates a substantial reduction in Off-Road Rate and Collision Rate than DQN. Suggesting that our method is more effective at keeping the vehicle on the road and avoiding accidents. The Off-Road Rate of SimDT is higher than the best performing BC ’Bicycle (D)’ model by 2%. Similarly, the Collision Rate of SimDT shows a 1.94 percentage point improvement over the same BC model. This improvement in safety-critical metrics highlights the robustness of SimDT in real-world driving scenarios.

When compared to expert demonstrations, SimDT achieves competitive results in terms of safety metrics Collision Rate are within the same magnitude as those reported by the experts. However, the ADE of SimDT is notably higher at 6.73m, which is approximately 6 meters away from the expert models. This suggests that SimDT learns a safe and feasible policy but different from the expert recording. While the ADE for SimDT is higher than that of other imitation learning models, it is important to note that ADE alone may not capture the complete picture of driving performance. The emphasis on safety and kinematic feasibility by SimDT may contribute to a cautious driving style, which can result in a slightly higher ADE but with significantly safer outcomes.

4.3 Ablation Study

Agent Off-Road Rate (%) Collision Rate (%) Kinematic Infeasibility (%) ADE (m)
DT(tiny) 5.86 3.43 0.00±plus-or-minus\pm±0.00 7.64
DT(tiny) + PER 4.28 3.27 0.00±plus-or-minus\pm±0.00 7.26
DT(tiny) + PER + OPA 3.73 2.79 0.00±plus-or-minus\pm±0.00 7.05
DT(tiny) + PER + OPA + 2 token prediction 3.74 2.72 0.00±plus-or-minus\pm±0.00 7.09
DT(tiny) + PER + OPA + 3 token prediction 3.57 2.54 0.00±plus-or-minus\pm±0.00 6.97
DT(small) + PER + OPA + 3 token prediction 3.36 2.65 0.00±plus-or-minus\pm±0.00 6.73
Table 2: Ablation Study. PER is prioritized experience replay, OPA is online-policy adaption.
Refer to caption
(a) Irregular Traffic Junction
Refer to caption
(b) Parking Slot
Refer to caption
(c) T Junction
Figure 3: Illustration of Data Selection for Prioritized Experience Replay: 3(a) is chosen because its uncommon expert behavior that need to slow down while steer to the right to keep lane. 3(b) illustrates a rare parking situation and highlights a case that was picked because it had the most mistakes when looking at the whole series of actions. 3(c) is kept as the suboptimal action taken in that situation was not reproduced given the corresponding low return-to-go.

Prioritized Experience Replay for Decision Transformer. Since our proposed imitative reinforcement learning can obtain almost infinite amount of dataset through online interaction, the ability of prioritized experience replay becomes critical for sample efficiency. Compare to pure Decision Transformer, the model which adapts PER has 1.58% and 0.16% reduction in off-road and collision rate. Decision Transformer with PER is able to reach same performance with 80% of data. There are three types of the data that is preferentially stored for PER (Fig. 3). The initial category encompasses instances wherein a discernible discrepancy arises between the predicted actions of the learning model and those executed by an expert. The second category pertains to scenarios wherein the cumulative action loss associated with a particular trajectory is substantially elevated, a phenomenon that predominantly transpires within the confines of rare encountered environmental conditions. The third category is representative of situations where trajectories indicative of suboptimal online adaptation are documented, highlighting the model’s challenges in identifying and rectifying suboptimal behaviors. The sample-efficient leanrnig curve can be found in Appendix. 6

Multi-token Decision Transformer. Due to physics limitation of real world vehicles such as inertia and momentum, actions taken at current time-step can significantly affect the following time-step actions. Current state has effect to near future actions steps. eg. reckless pedestrian crossing can cause emergency breaking for ego vehicle and it takes at least few steps to finish. It is important for Neural Network to understand return-to-go and sequence action consequence during the training stage. On the other hand, Focusing only on predicting one token at a time makes a model too sensitive to the immediate context, overlooking the need to consider longer sequences of token for better understanding. Multi-token prediction allows for a more nuanced grasp of world interaction with less data. Compared with single-token prediction, 3-token SimDT has 0.16% and 0.25% improvement in off-road rate and collision rate.

Refer to caption
(a) scenario
Refer to caption
(b) 1-token prediction
Refer to caption
(c) 3-token prediction
Figure 4: Attention map comparison for single-token and multi-token prediction. Multi-token prediction network has more diverse attention field.
Refer to caption
(a) demonstration(navy) and exploration trajectory
Refer to caption
(b) learned policy
Figure 5: Illustration of how demonstration and exploration trajectory to learn a generalized policy.

5 Conclusion and Discussion

We introduces SmiDT, an innovative approach to sequence modeling based reinforcement learning, particularly targeted for the complexities of real-world driving scenarios. Our fully online imitative Decision Transformer pipeline is adept at handling diverse data distributions found within extensive driving datasets, ensuring wide applicability and robustness. By implementing a multi-token Decision Transformer that integrates receding horizon control, we improve the model’s ability to predict over longer horizons and extend its attention span across broader contexts. Furthermore, the incorporation of prioritized experience replay within our framework enhances the sample efficiency of training, allowing for more effective learning from large-scale datasets. Our work can also benefit other real-world robotics tasks that demand sample-efficient imitative reinforcement learning.

Limitation. Due to computational constraints, we couldn’t train a larger network with increased embedding sizes, more transformer blocks, additional attention heads, and extended context length. A longer context would enhance the model’s grasp of its environment, potentially improving its capability for both high-level task planning and low-level action planning.

Acknowledgments

If a paper is accepted, the final camera-ready version will (and probably should) include acknowledgments. All acknowledgments go at the end of the paper, including thanks to reviewers who gave useful comments, to colleagues who contributed to the ideas, and to funding agencies and corporate sponsors that provided financial support.

References

  • Dosovitskiy et al. [2017] A. Dosovitskiy, G. Ros, F. Codevilla, A. Lopez, and V. Koltun. CARLA: An open urban driving simulator. In Proceedings of the 1st Annual Conference on Robot Learning, pages 1–16, 2017.
  • H. Caesar [2021] K. T. e. a. H. Caesar, J. Kabzan. Nuplan: A closed-loop ml-based planning benchmark for autonomous vehicles. In CVPR ADP3 workshop, 2021.
  • Xu et al. [2023] D. Xu, Y. Chen, B. Ivanovic, and M. Pavone. Bits: Bi-level imitation for traffic simulation. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pages 2929–2936. IEEE, 2023.
  • Gulino et al. [2023] C. Gulino, J. Fu, W. Luo, G. Tucker, E. Bronstein, Y. Lu, J. Harb, X. Pan, Y. Wang, X. Chen, J. D. Co-Reyes, R. Agarwal, R. Roelofs, Y. Lu, N. Montali, P. Mougin, Z. Z. Yang, B. White, A. Faust, R. T. McAllister, D. Anguelov, and B. Sapp. Waymax: An accelerated, data-driven simulator for large-scale autonomous driving research. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2023. URL https://openreview.net/forum?id=7VSBaP2OXN.
  • Sun et al. [2020] P. Sun, H. Kretzschmar, X. Dotiwalla, A. Chouard, V. Patnaik, P. Tsui, J. Guo, Y. Zhou, Y. Chai, B. Caine, V. Vasudevan, W. Han, J. Ngiam, H. Zhao, A. Timofeev, S. Ettinger, M. Krivokon, A. Gao, A. Joshi, Y. Zhang, J. Shlens, Z. Chen, and D. Anguelov. Scalability in perception for autonomous driving: Waymo open dataset. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
  • Houston et al. [2020] J. Houston, G. Zuidhof, L. Bergamini, Y. Ye, A. Jain, S. Omari, V. Iglovikov, and P. Ondruska. One thousand and one hours: Self-driving motion prediction dataset, 2020.
  • Yu et al. [2020] F. Yu, H. Chen, X. Wang, W. Xian, Y. Chen, F. Liu, V. Madhavan, and T. Darrell. Bdd100k: A diverse driving dataset for heterogeneous multitask learning. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
  • Caesar et al. [2020] H. Caesar, V. Bankiti, A. H. Lang, S. Vora, V. E. Liong, Q. Xu, A. Krishnan, Y. Pan, G. Baldan, and O. Beijbom. nuscenes: A multimodal dataset for autonomous driving. In CVPR, 2020.
  • Dauner et al. [2023] D. Dauner, M. Hallgarten, A. Geiger, and K. Chitta. Parting with misconceptions about learning-based vehicle motion planning. In Conference on Robot Learning (CoRL), 2023.
  • Philion et al. [2024] J. Philion, X. B. Peng, and S. Fidler. Trajeglish: Learning the language of driving scenarios. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=Z59Rb5bPPP.
  • Kiran et al. [2022] B. R. Kiran, I. Sobh, V. Talpaert, P. Mannion, A. A. A. Sallab, S. Yogamani, and P. Pérez. Deep reinforcement learning for autonomous driving: A survey. IEEE Transactions on Intelligent Transportation Systems, 23(6):4909–4926, 2022. doi:10.1109/TITS.2021.3054625.
  • Liu et al. [2022] H. Liu, Z. Huang, X. Mo, and C. Lv. Augmenting reinforcement learning with transformer-based scene representation learning for decision-making of autonomous driving, 2022.
  • Chen et al. [2021] L. Chen, K. Lu, A. Rajeswaran, K. Lee, A. Grover, M. Laskin, P. Abbeel, A. Srinivas, and I. Mordatch. Decision transformer: Reinforcement learning via sequence modeling. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P. Liang, and J. W. Vaughan, editors, Advances in Neural Information Processing Systems, volume 34, pages 15084–15097. Curran Associates, Inc., 2021. URL https://proceedings.neurips.cc/paper_files/paper/2021/file/7f489f642a0ddb10272b5c31057f0663-Paper.pdf.
  • Zhou et al. [2024] Z. Zhou, C. Zhu, R. Zhou, Q. Cui, A. Gupta, and S. S. Du. Free from bellman completeness: Trajectory stitching via model-based return-conditioned supervised learning. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=7zY781bMDO.
  • Schaul et al. [2016] T. Schaul, J. Quan, I. Antonoglou, and D. Silver. Prioritized experience replay. In International Conference on Learning Representations (ICLR), 2016.
  • Gloeckle et al. [2024] F. Gloeckle, B. Y. Idrissi, B. Rozière, D. Lopez-Paz, and G. Synnaeve. Better & faster large language models via multi-token prediction, 2024.
  • Maciejowski [2002] J. M. Maciejowski. Predictive control with constraints. Prentice hall, 2002.
  • Janner et al. [2021] M. Janner, Q. Li, and S. Levine. Offline reinforcement learning as one big sequence modeling problem. In Advances in Neural Information Processing Systems, 2021.
  • Lee et al. [2022] K.-H. Lee, O. Nachum, M. Yang, L. Lee, D. Freeman, W. Xu, S. Guadarrama, I. Fischer, E. Jang, H. Michalewski, and I. Mordatch. Multi-game decision transformers, 2022.
  • Meng et al. [2022] L. Meng, M. Wen, Y. Yang, C. Le, X. Li, W. Zhang, Y. Wen, H. Zhang, J. Wang, and B. Xu. Offline pre-trained multi-agent decision transformer: One big sequence model tackles all smac tasks, 2022.
  • Wu et al. [2023] Y.-H. Wu, X. Wang, and M. Hamaya. Elastic decision transformer. arXiv preprint arXiv:2307.02484, 2023.
  • Badrinath et al. [2023] A. Badrinath, Y. Flet-Berliac, A. Nie, and E. Brunskill. Waypoint transformer: Reinforcement learning via supervised learning with intermediate targets. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors, Advances in Neural Information Processing Systems, volume 36, pages 78006–78027. Curran Associates, Inc., 2023. URL https://proceedings.neurips.cc/paper_files/paper/2023/file/f58c24798220ba724fe05c0fa786227d-Paper-Conference.pdf.
  • Zheng et al. [2022] Q. Zheng, A. Zhang, and A. Grover. Online decision transformer. In international conference on machine learning, pages 27042–27059. PMLR, 2022.
  • Xu et al. [2023] M. Xu, Y. Lu, Y. Shen, S. Zhang, D. Zhao, and C. Gan. Hyper-decision transformer for efficient online policy adaptation. In International Conference on Learning Representations, 2023.
  • Vaswani et al. [2017] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
  • Radford et al. [2018] A. Radford, K. Narasimhan, T. Salimans, I. Sutskever, et al. Improving language understanding by generative pre-training. 2018.
  • Devlin et al. [2019] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In J. Burstein, C. Doran, and T. Solorio, editors, Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. doi:10.18653/v1/N19-1423. URL https://aclanthology.org/N19-1423.
  • Kitaev et al. [2020] N. Kitaev, L. Kaiser, and A. Levskaya. Reformer: The efficient transformer. arXiv preprint arXiv:2001.04451, 2020.
  • Wang and Cho [2019] A. Wang and K. Cho. Structured neural embeddings for named entity recognition. arXiv preprint arXiv:1911.04474, 2019.
  • Qi et al. [2020] W. Qi, Y. Yan, Y. Gong, D. Liu, N. Duan, J. Chen, R. Zhang, and M. Zhou. Prophetnet: Predicting future n-gram for sequence-to-sequence pre-training. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings, pages 2401–2410, 2020.
  • Jiang et al. [2020] Z. Jiang, A. Anastasopoulos, J. Araki, H. Ding, and G. Neubig. X-FACTR: Multilingual factual knowledge retrieval from pretrained language models. In B. Webber, T. Cohn, Y. He, and Y. Liu, editors, Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 5943–5959, Online, Nov. 2020. Association for Computational Linguistics. doi:10.18653/v1/2020.emnlp-main.479. URL https://aclanthology.org/2020.emnlp-main.479.
  • Lu et al. [2023] Y. Lu, J. Fu, G. Tucker, X. Pan, E. Bronstein, R. Roelofs, B. Sapp, B. White, A. Faust, S. Whiteson, D. Anguelov, and S. Levine. Imitation is not enough: Robustifying imitation with reinforcement learning for challenging driving scenarios. In 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 7553–7560, 2023. doi:10.1109/IROS55552.2023.10342038.
  • Phan-Minh et al. [2023] T. Phan-Minh, F. Howington, T.-S. Chu, M. S. Tomov, R. E. Beaudoin, S. U. Lee, N. Li, C. Dicle, S. Findler, F. Suarez-Ruiz, B. Yang, S. Omari, and E. M. Wolff. Driveirl: Drive in real life with inverse reinforcement learning. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pages 1544–1550, 2023. doi:10.1109/ICRA48891.2023.10160449.
  • Shi et al. [2022] S. Shi, L. Jiang, D. Dai, and B. Schiele. Motion transformer with global intention localization and local movement refinement. Advances in Neural Information Processing Systems, 2022.
  • Judah et al. [2014] K. Judah, A. Fern, P. Tadepalli, and R. Goetschalckx. Imitation learning with demonstrations and shaping rewards. Proceedings of the AAAI Conference on Artificial Intelligence, 28, Jun. 2014. doi:10.1609/aaai.v28i1.9024. URL https://ojs.aaai.org/index.php/AAAI/article/view/9024.
  • Chen et al. [2021] D. Chen, V. Koltun, and P. Krähenbühl. Gri: General reinforced imitation and its application to camera-based autonomous driving. arXiv preprint arXiv:2103.09109, 2021.
  • Nayakanti et al. [2022] N. Nayakanti, R. Al-Rfou, A. Zhou, K. Goel, K. S. Refaat, and B. Sapp. Wayformer: Motion forecasting via simple & efficient attention networks, 2022.
  • Argall et al. [2009] B. D. Argall, S. Chernova, M. Veloso, and B. Browning. A survey of robot learning from demonstration. Robotics and autonomous systems, 57(5):469–483, 2009.
  • Mnih et al. [2013] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller. Playing atari with deep reinforcement learning, 2013.
  • Treiber et al. [2000] M. Treiber, A. Hennecke, and D. Helbing. Congested traffic states in empirical observations and microscopic simulations. Physical Review E, 62(2):1805–1824, Aug. 2000. ISSN 1095-3787. doi:10.1103/physreve.62.1805. URL http://dx.doi.org/10.1103/PhysRevE.62.1805.

Appendix

A. Learning cure

where model 1 is Pure Decision transformer, and model 2 has Decision Transformer with Prioritized Experience Replay. Figure shows our Sample-efficient Imitative Pipeline converges faster and has better performance.

Refer to caption
Figure 6: Learning Curve. where Model 1 is Pure Decision transformer, and Model 2 has Decision Transformer with Prioritized Experience Replay. Figure shows our Sample-efficient Imitative Pipeline converges faster and has better performance.

B. Metrics Definition

Collision rate This metric checks for overlap between bounding boxes of objects in a 2D top-down view at the same time step to determine if a collision has occurred.

Off-Road rate indicates the percentage whether the vehicle is driving within the road boundaries, with any deviation to the right of the road’s edge considered off-road.

Kinematic Infeasibility Metric is binary metric assesses whether a vehicle’s transition between two consecutive states is within predefined acceleration and steering curvature limits, based on inverse kinematics.

Average Displacement Error (ADE) calculates the mean L2 distance between the vehicle’s simulated position and its logged position at corresponding time steps across the entire trajectory.

Route Progress Ratio calculates the proportion of the planned route completed by the vehicle, based on the closest point along the path at a given time step. Route Progress Ratio feature is not released yet and benchmark in this paper will skip this metric.