

Humanoid Locomotion as Next Token Prediction

Ilija Radosavovic 1 Bike Zhang 1 Baifeng Shi 1 Jathushan Rajasegaran 1 Sarthak Kamat 1
Trevor Darrell 1 Koushil Sreenath 1 Jitendra Malik 1

1 University of California, Berkeley. Correspondence to: Ilija Radosavovic <ilija@berkeley.edu>.

arXiv:2402.19469v1 [cs.RO] 29 Feb 2024

Abstract

We cast real-world humanoid control as a next token prediction problem, akin to predicting the next word in language. Our model is a causal transformer trained via autoregressive prediction of sensorimotor trajectories. To account for the multi-modal nature of the data, we perform prediction in a modality-aligned way, and for each input token predict the next token from the same modality. This general formulation enables us to leverage data with missing modalities, like video trajectories without actions. We train our model on a collection of simulated trajectories coming from prior neural network policies, model-based controllers, motion capture data, and YouTube videos of humans. We show that our model enables a full-sized humanoid to walk in San Francisco zero-shot. Our model can transfer to the real world even when trained on only 27 hours of walking data, and can generalize to commands not seen during training like walking backward. These findings suggest a promising path toward learning challenging real-world control tasks by generative modeling of sensorimotor trajectories.

1. Introduction

The last decade of artificial intelligence (AI) has shown that large neural networks trained on diverse datasets from the Internet can lead to impressive results across different settings. The core enablers of this wave of AI have been large transformer models (42) trained by generative modeling of massive quantities of language data from the Internet (29, 8, 30, 31, 4). By predicting the next word, these models acquire rich representations of language that can be transferred to downstream tasks (29), perform multi-task learning (30, 31), and learn in a few-shot manner (4).

Are such modeling techniques exclusive to language? Can we learn powerful models of sensory and motor representations in a similar fashion? Indeed, we have seen that we can learn good representations of high-dimensional visual data by autoregressive modeling (6) and related masked modeling approaches (13). While there has been positive signal on learning sensorimotor representations in the context of manipulation (32), this area remains largely unexplored.

In this paper, we cast humanoid control as data modeling of large collections of sensorimotor trajectories. Like in language, we train a general transformer model to autoregressively predict shifted input sequences. In contrast to language, the nature of data in robotics is different: it is high-dimensional and contains multiple input modalities. Different modalities include sensors, like joint encoders or inertial measurement units, as well as motor commands. These give rise to sensorimotor trajectories, which we view as the sentences of the physical world. Adopting this perspective suggests a simple instantiation of the language modeling framework in the robotic context. We tokenize the input trajectories and train a causal transformer model to predict shifted tokens. Importantly, we predict complete input sequences, including both sensory and motor tokens. In other words, we are modeling the joint data distribution as opposed to the conditional action distribution.

This has several benefits. First, we are training the neural network to predict more bits of information and consequently acquire a richer model of the world. Second, we can leverage noisy or imperfect trajectories that may contain suboptimal actions. Third, we can generalize our framework to learning from trajectories with missing information.

Our core observation is that if a trajectory is incomplete, i.e., some of the sensory or motor information is missing, we can still learn from it by predicting whatever information is present and replacing the missing tokens with learnable mask tokens. The intuition is that if the model has learned to make good predictions even in the absence of information, it will have acquired a better model of the physical world. A very important source of such data is human videos from the Internet. Namely, we can observe human movement in videos, but we do not get access to the motor commands or complete sensory inputs. We demonstrate that our method can learn from such data sources effectively.

Figure 1: A humanoid that walks in San Francisco. We deploy our policy to various locations in San Francisco over the course of one week. Please see our project page for videos. We show that our policy can walk over different surfaces including walkways, concrete, asphalt, tiled plazas, and sanded roads. We find that our policy follows omnidirectional velocity commands well and enables deployment in a challenging city environment like San Francisco.

To validate our method, we apply it to the challenging task of real-world humanoid locomotion. We use the full-sized Digit humanoid robot developed by Agility Robotics. We first collect a dataset of sensorimotor trajectories in simulation. These include complete trajectories from a neural network policy trained with reinforcement learning (33), as well as incomplete trajectories from three different sources: (i) the Agility Robotics controller based on model predictive control, (ii) motion capture of humans, and (iii) YouTube videos of humans. We reconstruct human videos by using computer vision techniques and retarget both motion capture and YouTube trajectories via inverse kinematics. We then train a transformer model to autoregressively predict trajectories. At test time, we execute the actions autoregressively and ignore the sensory predictions.

We demonstrate that our policy can be deployed in the real world zero-shot and walk on different surfaces. Specifically, we deploy our model across a range of different locations in San Francisco over the course of one week. Please see Figure 1 for examples and our project page for videos. To quantitatively evaluate different aspects of our approach, we perform an extensive study in simulation. We find that our autoregressive policies trained from offline data alone are comparable to state-of-the-art approaches that use reinforcement learning (33) in the tested settings. We further find that our approach can readily benefit from incomplete trajectories and has favorable scaling properties.

These findings suggest a promising path toward learning challenging real-world robot control tasks by generative modeling of large collections of sensorimotor trajectories.

2. Related Work

Generative modeling. The study of data has been extensive, ranging from Shannon's foundational work (37) to the recent era of large language models. Various such models emerged over the last decade. Notable examples include GANs (12) and diffusion models (39, 16) for generating pixels, and LSTM (17) and GPT (29) for generating language tokens. These models have been adopted for other modalities as well (27, 11, 43). Among these, autoregressive transformer models became the front runner due to their impressive scaling behavior (19) and ability to learn from in-context examples (3). This behavior has even been shown to extend to other modalities such as pixels (6), language-pixels (36), and language-pixels-audio (21). We explore autoregressive generative models in the context of real-world humanoid locomotion.

Transformers in robotics. Following the success of transformer models (42) in natural language processing (29, 8, 30, 3) and computer vision (9, 13), over the last few years there has been increased interest in using transformer models in robotics. We have seen several works showing that transformers can be effective with behavior cloning. For example, (38) learns multi-task transformer policies with language, and (2) trains language-conditioned manipulation policies from large-scale data. (10) trains language models with embodied data. We have also seen that transformer policies can be effective for large-scale reinforcement learning (33). (32) learns sensorimotor representations with masked prediction. (1) trains goal-conditioned policies from demonstrations. Likewise, we share the goal of using transformer models for robotics but focus on autoregressive modeling of diverse trajectories for real-world humanoid locomotion.

Humanoid locomotion. Mastering the ability for robots to walk has been a long-standing challenge in robotics. In the past several decades, roboticists have built a variety of humanoid robots (20, 15, 26, 40, 7) to explore human-like locomotion skills. Stable locomotion behaviors have been achieved through model-based control approaches (34, 18), and optimization-based methods further enable highly dynamic humanoid motions (22). Although significant progress has been made with these strategies and by combining them with learning (5), learning-based approaches are gaining attention for their ability to improve and adapt to a wide range of environments. Recently, we have seen that a purely learning-based approach trained with large-scale reinforcement learning in simulation can enable real-world humanoid locomotion (33). Like in prior work, our model is a causal transformer. Unlike prior work, we perform autoregressive modeling instead of reinforcement learning.

3. Approach

In this section, we assume that a dataset D of sensorimotor trajectories T is given and describe our approach below.

3.1. Objective

Each sensorimotor trajectory is a sequence of sensory observations and actions: T = (o_1, a_1, o_2, a_2, ..., o_T, a_T). We first tokenize the trajectory into K tokens to obtain t = (t_1, t_2, t_3, ..., t_K). Our goal is to train a neural network to model the density function p(t) autoregressively:

    p(t) = \prod_{k=1}^{K} p(t_k | t_{k-1}, ..., t_1)    (1)

We train our model by minimizing the negative log-likelihood over our trajectory dataset:

    L = \sum_{t \in D} -\log p(t)    (2)

We assume a Gaussian distribution with constant variance and train a neural network to minimize the mean squared error between the predicted and the ground truth tokens:

    L = \frac{1}{K} \sum_{k=1}^{K} (\hat{t}_k - t_k)^2    (3)

Instead of regressing the raw token values, we could quantize each dimension into bins or perform vector quantization. However, we found the regression approach to work reasonably well in practice and opt for it for simplicity.
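As a concrete illustration of this objective, the following is a minimal training-step sketch in Python/PyTorch, assuming a generic causal transformer model that maps a token sequence to next-token predictions; the names, shapes, and the loss mask (used for the missing modalities of Section 3.2) are illustrative assumptions rather than the released implementation.

    import torch

    def next_token_loss(model, tokens, loss_mask):
        # tokens:    (B, K, D) float tensor of tokenized sensorimotor trajectories
        # loss_mask: (B, K) bool tensor, False where the target token is a mask token
        inputs, targets = tokens[:, :-1], tokens[:, 1:]
        preds = model(inputs)                        # causal transformer, (B, K-1, D)
        sq_err = (preds - targets).pow(2).mean(-1)   # per-token squared error, Eq. (3)
        mask = loss_mask[:, 1:].float()              # no loss on masked targets
        return (sq_err * mask).sum() / mask.sum().clamp(min=1)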

Figure 2: Humanoid locomotion as next token prediction. We collect a dataset of trajectories from various sources, such as neural network policies, model-based controllers, human motion capture, and YouTube videos of humans. Then we use this dataset to train a transformer policy by autoregressive modeling of observations and actions. Our transformer allows a humanoid to walk zero-shot on various terrains around San Francisco. Please see our project page for video results.

3.2. Missing modalities

In the discussion so far, we have assumed that each trajectory is a sequence of observations and actions. Next, we show how our framework can be generalized to sequences with missing modalities, like trajectories extracted from human videos that do not have actions. Suppose we are given a trajectory of observations without the actions, T = (o_1, o_2, ..., o_T). Our key insight is that we can treat a trajectory without actions like a regular trajectory with actions masked. Namely, we can insert mask tokens [M] to obtain T = (o_1, [M], o_2, [M], ..., o_T, [M]). This trajectory now has the same format as our regular trajectories and thus can be processed in a unified way. We ignore the loss for the predictions that correspond to the masked parts of the input. Note that this principle is not limited to actions and applies to any other modality as well.

3.3. Aligned prediction

Rather than predicting the next token in a modality-agnostic way, we make predictions in a modality-aligned way. Namely, for each input token we predict the next token of the same modality. Please see Figure 3 for diagrams.

Figure 3: A general framework for training with different data sources. Our data modeling allows us to train our transformer with multiple modes of training. When observation-action pairs are available, we train our transformer to predict the next observation-action pair. When no action data is available, as with MoCap and internet data, we only train our transformer to predict the next observations by masking the actions with a mask token. These two modes of training allow our model to utilize both types of data, which enables us to scale our training in terms of data.

3.4. Joint training

We have two options for training on collections that contain diverse trajectories in terms of noise levels or modality subsets. We can either train jointly on all data at once, including complete and incomplete trajectories, or we can first pre-train on noisy and incomplete trajectories. The latter can be viewed as providing a good initialization for then training on complete trajectories. We find that both approaches work comparably in our setting and opt for joint training in the majority of the experiments for simplicity.

3.5. Model architecture

Our model is a vanilla transformer (42). Given trajectories from either complete or incomplete data, we first tokenize the trajectories into tokens. We learn separate linear projection layers for each modality, shared across time. To encode the temporal information, we use positional embeddings. Assume o_i ∈ R^m and a_i ∈ R^n; then:

    t_i = concat(o_i, a_i),    (4)
    h_i^0 = W t_i,             (5)

where W ∈ R^{d×(m+n)} is a linear projection layer that projects the concatenated observation and action modalities to a d-dimensional embedding vector. The superscript 0 indicates the embedding at the 0-th layer, i.e., the input layer. When the action is unavailable, we use a mask token [M] ∈ R^n to replace a_i; [M] is initialized as a random vector and learned end-to-end with the whole model. The model takes the sequence of embedding vectors H^0 = {h_1^0, h_2^0, ..., h_t^0} as input.

The transformer architecture contains L layers, each consisting of a multi-head self-attention (MHSA) module and an MLP module. Assume the output of layer l is H^l; then the layer l+1 output is computed as follows:

    \tilde{H}^l = LayerNorm(H^l)                    (6)
    \tilde{H}^l = \tilde{H}^l + MHSA(\tilde{H}^l)   (7)
    H^{l+1} = \tilde{H}^l + MLP(\tilde{H}^l)        (8)

Here, the multi-head self-attention uses causal masking, where each token only attends to itself and the past tokens. Once the tokens are processed through all the layers, we project the embeddings to predicted states and actions by learning a linear projection layer \hat{W} ∈ R^{(m+n)×d}:

    \hat{t}_{i+1} = \hat{W} h_i^L              (9)
    \hat{o}_{i+1} = (\hat{t}_{i+1})_{0:m}      (10)
    \hat{a}_{i+1} = (\hat{t}_{i+1})_{m:(m+n)}  (11)

We then train the transformer with the objective in (3). In the cases where a token is masked, we do not apply any losses. We train our transformer with both types of data, as shown in Figure 3. This allows us to use various sources of data, thus enabling scaling in terms of data.
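The token construction of Eqs. (4)-(5), including the learned mask token for missing actions, can be sketched as follows; this is a minimal PyTorch sketch under assumed module names and dimensions, not the authors' released code.

    import torch
    import torch.nn as nn

    class TrajectoryEmbedder(nn.Module):
        # Embeds (o_i, a_i) tokens per Eqs. (4)-(5); missing actions use a learned mask token.
        def __init__(self, obs_dim, act_dim, embed_dim, max_len=16):
            super().__init__()
            self.proj = nn.Linear(obs_dim + act_dim, embed_dim)   # W in Eq. (5)
            self.mask_token = nn.Parameter(torch.randn(act_dim))  # [M], learned end-to-end
            self.pos_embed = nn.Parameter(torch.zeros(max_len, embed_dim))

        def forward(self, obs, act, act_available):
            # obs: (B, T, obs_dim), act: (B, T, act_dim), act_available: (B, T) bool
            act = torch.where(act_available.unsqueeze(-1), act, self.mask_token.expand_as(act))
            tokens = torch.cat([obs, act], dim=-1)                # Eq. (4)
            h0 = self.proj(tokens) + self.pos_embed[: obs.shape[1]]
            return h0  # fed to a causal transformer; a head of size obs_dim + act_dim gives Eqs. (9)-(11)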

3.6. Model inference

At inference time, our transformer model always has access to observation-action pairs. In this setting, we apply our transformer model autoregressively over observation-action pair tokens. By conditioning on past observations and actions, we predict the next action (or observation-action pair) and execute the action. Then we take the observations from the robot and discard the predicted observations. We use the observed observation and the predicted action as the next pair of tokens and concatenate them with the past pairs to predict the next observation-action pair.
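A deployment loop consistent with this description could look like the following sketch; model, robot, and their methods are hypothetical placeholders for the policy and the hardware interface, and the 16-step context window follows Section 5.1.

    import numpy as np
    from collections import deque

    def control_loop(model, robot, act_dim, context_steps=16, num_steps=1000):
        # Autoregressive deployment: execute predicted actions, keep observed observations.
        history = deque(maxlen=context_steps)                   # sliding window of (obs, act) tokens
        obs, act = robot.get_observation(), np.zeros(act_dim)   # neutral action to seed the first pair
        for _ in range(num_steps):
            history.append((obs, act))                          # observed obs paired with executed action
            pred_obs, pred_act = model.predict_next(history)    # next observation-action pair
            robot.apply_action(pred_act)                        # execute the predicted action
            obs, act = robot.get_observation(), pred_act        # keep the real obs, discard pred_obs
        return history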

Figure 4: Training dataset. To train our model, we construct a dataset of trajectories coming from four different sources. (i) neural network policy: provides trajectories with complete observations and actions. (ii) model-based controller: produces trajectories without actions. (iii) motion capture of humans: does not contain actions and is approximately retargeted onto the robot. (iv) internet videos of humans: noisy human poses are first reconstructed via 3D reconstruction and then approximately retargeted onto the robot.

4. Dataset

Our approach motivates building a dataset of trajectories for training our model. Our dataset includes trajectories from four different sources: (i) neural network policies, (ii) model-based controllers, (iii) human motion capture, and (iv) human videos from YouTube. An illustration of the different data sources is shown in Figure 4. We describe each in turn next.

4.1. Neural network trajectories

As the first source of training trajectories, we use a neural network policy trained with large-scale reinforcement learning (33). Specifically, this policy was trained with billions of samples from thousands of randomized environments in Isaac Gym (25). We run this policy in the Agility Robotics simulator and collect 10k trajectories of 10s each on flat ground, without domain randomization. Each trajectory is conditioned on a velocity command sampled from a clipped normal distribution as follows: linear velocity forward [0.0, 1.0] m/s, linear velocity sideways [−0.5, 0.5] m/s, and turning angular velocity [−0.5, 0.5] rad/s.

Since we have access to the data generation policies, we are able to record complete observations as well as the exact actions that the model predicted. We use this set as our source of complete sensorimotor trajectories that have complete observations as well as ground truth actions.
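A sketch of what sampling such a clipped-normal command could look like is shown below; the mean and standard deviation are illustrative assumptions, since only the clipping ranges are specified above.

    import numpy as np

    def sample_command(rng):
        # Velocity command from a clipped normal distribution (ranges from Section 4.1).
        ranges = {"vx": (0.0, 1.0), "vy": (-0.5, 0.5), "wz": (-0.5, 0.5)}  # m/s, m/s, rad/s
        cmd = {}
        for key, (lo, hi) in ranges.items():
            mean, std = 0.5 * (lo + hi), 0.5 * (hi - lo)   # assumed parameters, not from the paper
            cmd[key] = float(np.clip(rng.normal(mean, std), lo, hi))
        return cmd

    cmd = sample_command(np.random.default_rng(0))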
4.2. Model-based trajectories

As the second source of trajectories, we use the model-based controller developed by Agility Robotics. It is the controller that is deployed on the Digit humanoid robot and is also available in the Agility Robotics simulator. We collect two sets of 10k trajectories of walking on flat ground, 10s each. In both cases, we sample the velocity commands as follows: linear velocity forward [−1.0, 1.0] m/s, linear velocity sideways [−1.0, 1.0] m/s, and turning angular velocity [−1.0, 1.0] rad/s. We use the default model-based configuration for one set, and randomize the leg length, step clearance, and bounciness of the floor for the other.

Since this controller outputs joint torques, which are not consistent with our joint position action space, we only record the observations without the actions. This data serves as a source of trajectories with reasonably good observations from the same morphology, but without the actions.

4.3. Human motion capture trajectories

As the next source of trajectories, we use the motion capture (MoCap) recordings of humans from the KIT dataset (28) distributed via the AMASS repository (24). This data was recorded using optical marker-based tracking in a laboratory setting. The dataset consists of ~4k trajectories, of which we use a subset of ~1k standing, walking, and running trajectories.

In addition to not containing the ground truth actions, the MoCap trajectories come with an additional challenge: different morphology. Namely, MoCap trajectories capture human keypoint positions in 3D. In order to use these trajectories for training a robot, we solve an inverse kinematics problem to find the corresponding robot poses.

We formulate an inverse kinematics optimization problem:

    \min_{q[t], \dot{q}[t]}  \sum_{t=1}^{N} \phi_{traj}[t] + \phi_{reg}[t]     (12a)
    s.t.  q[t+1] = q[t] + \frac{\dot{q}[t+1] + \dot{q}[t]}{2} dt,             (12b)
          q \in Q,  \dot{q} \in V                                             (12c)

where q is the robot state in generalized coordinates, and N and dt are the optimization horizon and sampling time. The optimization variables are q and \dot{q}. For the constraints, (12b) is the Euler integration of the posture q, and (12c) constrains the range of q and \dot{q} to their admissible sets Q and V. In the cost function, \phi_{traj} tracks keypoint locations from the human trajectories, and \phi_{reg} represents regularization costs, such as joint velocity minimization and smoothness.
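A minimal sketch of this kind of keypoint-tracking retargeting is given below, assuming a forward-kinematics function fk(q) that maps a robot posture to keypoint positions and using a general-purpose optimizer; the simplified cost, finite-difference velocities, and solver choice are illustrative assumptions rather than the exact formulation used to build the dataset.

    import numpy as np
    from scipy.optimize import minimize

    def retarget(fk, human_keypoints, q_init, dt=0.05, w_reg=1e-2):
        # fk:              callable, fk(q) -> (K, 3) keypoint positions for posture q
        # human_keypoints: (N, K, 3) target keypoints over the horizon
        # q_init:          (N, dof) initial guess for the robot postures
        N, dof = q_init.shape

        def cost(q_flat):
            q = q_flat.reshape(N, dof)
            qdot = np.diff(q, axis=0) / dt                  # finite-difference joint velocities
            phi_traj = sum(np.sum((fk(q[t]) - human_keypoints[t]) ** 2) for t in range(N))
            phi_reg = w_reg * np.sum(qdot ** 2)             # velocity minimization / smoothness
            return phi_traj + phi_reg

        res = minimize(cost, q_init.ravel(), method="L-BFGS-B")  # joint limits could be added as bounds
        return res.x.reshape(N, dof)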

4.4. Trajectories from YouTube videos

Internet videos of people doing various activities are potentially a vast source of data for learning human locomotion. However, the raw pixels carry no information about the state and actions of the human. To recover this, we first run the computer vision tracking algorithm PHALP (35) to extract human trajectories in 3D. This provides an estimate of the 3D human body SMPL (23) parameters and a noisy estimate of the human joints in world coordinates. We use the human body joint positions to retarget the motion to the humanoid robot using the inverse kinematics optimization described above. Once we retarget the motion from the Internet videos to humanoid trajectories, we filter the trajectories, keeping those with low optimization cost. Note that the scale of this data comes at the cost of being noisy.

5. Experiments

We evaluate our approach on the challenging task of humanoid locomotion. We perform outdoor experiments on real hardware and systematic evaluations in simulation.

5.1. Experimental setup

Robot platform. Digit is a humanoid robot platform developed by Agility Robotics. It is a full-sized humanoid that is 1.6 m tall and weighs 45 kilograms. It has 30 degrees of freedom, of which 20 are actuated. Due to its high dimensionality and four-bar linkage structure, it is challenging to optimize fast, which makes it particularly interesting for learning approaches that can learn efficiently from trajectory collections like ours.

Model. Our model has a hidden size of 192 dimensions, with 4 layers of self-attention and MLP modules. Each self-attention has 4 heads. We use LayerNorm before each attention layer and a ReLU activation after the MLP layer. We use a BatchNorm layer to process the input before the transformer model. When predicting a token at time k, to keep the context length at a reasonable size, we only keep the past 16 steps in the input. In Section 5.9, we show that the model is able to scale up to more parameters and longer context lengths and achieve higher performance.

5.2. Real-world deployment

We begin by reporting the results of deploying our policy in the real world. Specifically, we deploy our robot at various locations in San Francisco over the course of one week. Please see Figure 1 for examples and the project page for videos. We find that our policy is able to walk over a variety of surfaces including walkways, concrete, asphalt, tiled plazas, and dirt roads. Note that deployment in a large city environment, like San Francisco, is considerably more challenging than in constrained environments. The city environment is much more crowded, less patient, and not forgiving. This makes the error tolerance low and requires a policy that works consistently well.

Figure 5: Comparison to state of the art, trajectory adherence. The robot is commanded to walk starting from the origin with a fixed heading command of 0.5 m/s and varying yaw commands in [−0.4, 0.4] rad/s. We plot the desired (dotted) and actual (solid) trajectories for our policy and a reinforcement-learning trained policy (RL).

Figure 6: Tracking error comparisons. We measure the tracking error of our policy against a state-of-the-art benchmark (left), as well as the improvement produced by complementing action-labeled RL trajectories with action-free trajectories (right).

5.3. Evaluation Metrics

We evaluate locomotion policies with two metrics: tracking error and prediction error. Tracking error measures how accurately the robot follows a specific locomotion command. The prediction error is the next token prediction loss measured on a separate set of validation data. We introduce the two metrics in detail below and show that both consistently predict locomotion performance.

Tracking error. In all experiments, the robot starts from rest in a simulated environment and is issued a constant natural walking command consisting of a desired heading velocity sampled in [0.35, 0.70] m/s, an angular velocity sampled in [−0.4, 0.4] rad/s, and zero lateral velocity. We compute x*(t), the ideal robot base position trajectory that fully satisfies the velocity command v*(t) at all time steps. To measure the accuracy of command tracking, we define the position tracking error as \frac{1}{T} \sum_{t=0}^{T} \lVert x(t) - x^*(t) \rVert. We use the MuJoCo simulator (41) for evaluations, and all trajectories last for a duration of 10 seconds.

Prediction error. Since the model is trained with next token prediction, we test the prediction error on a set of validation data that is separate from the training data and contains state-action trajectories collected from the RL policy. This is similar to the language modeling evaluation for large language models (14). We test both state and action prediction errors and add them together as the final error metric.

Figure 7: Prediction error correlates with performance. We plot the tracking error and prediction error for 14 models. The prediction error linearly correlates with task tracking error with r = 0.87, which means lower prediction loss likely indicates more accurate command following. (Axes: prediction loss (10^-2) vs. position tracking error (m).)

Figure 8: Gait quality. We command the robot with a heading velocity of 0.5 m/s and plot the resulting phase portrait of the left knee joint. Compared to the RL policy, our policy features fewer irregularities and a smoother, cyclic gait.

5.4. Comparison to the state of the art

Trajectory Adherence. We compare our policy to a neural network controller trained with reinforcement learning (RL) (33). Figure 5 presents a visual comparison of the trajectory adherence of our controller against this state-of-the-art baseline. Starting with the robot at the origin, we plot the actual trajectory of the robot with eleven different yaw commands selected from {0.00, ±0.05, ±0.10, ±0.20, ±0.30, ±0.40} rad/s. For each policy, we jointly plot the desired and actual path traced by the robot base. Our model exhibits superior tracking to the RL controller at all turning speeds, and has near-perfect tracking for straight-line walking.

Quantitative Evaluation. In Figure 6, left, we repeat the above comparison to the RL controller (N = 245) with the full range of heading and yaw velocities mentioned in Section 5.3. We plot the mean position tracking error, binned by the commanded angular yaw. While both models have lower tracking errors at lower yaw, ours consistently outperforms the baseline RL policy. This is an interesting result, since our model was trained on next token prediction on trajectories produced by this very policy.

5.5. Prediction error correlates with performance

We collect 14 models trained with different training recipes, model architectures, and data sizes and types, and test the tracking error and prediction error for each one of them. We plot the tracking and prediction errors of all the models in a single scatter plot, as shown in Figure 7. We can see that tracking and prediction error are highly correlated, with a Pearson coefficient of r = 0.87, which means models with lower prediction error on the validation set likely follow different commands with higher accuracy. This suggests that the prediction error is predictive of task performance.
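The correlation reported above can be computed as follows; the input arrays are placeholders for the measured errors of the 14 models.

    import numpy as np

    def pearson_r(prediction_loss, tracking_error):
        # Pearson correlation between validation prediction loss and tracking error across models.
        x, y = np.asarray(prediction_loss), np.asarray(tracking_error)
        return float(np.corrcoef(x, y)[0, 1])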

5.6. Gait quality

In humanoid locomotion, the smoothness of the robot's gait is contingent on the rhythmic functioning of its actuated knee joints. One way to measure this is a phase portrait, which is a parametric plot of a joint's generalized position and velocity over time. Patterns in the plot can reveal information about the type of movement the joint is undergoing. For example, a cyclic pattern may indicate repetitive motion, while irregular patterns might suggest complex or varied movements, such as stumbling. In Figure 8, we command the robot to walk forward at 0.5 m/s and plot the associated phase portrait of its left knee joint. Notice that our policy retains the overall shape of the RL policy while having fewer aberrations. This supports our qualitative assessment of the more regularized behavior seen in our policy.
5.7. Generalization to unseen commands

We find that our policy also extrapolates to new skills, such as walking backward, which was not included in the action-labeled training data. As Figure 9 illustrates, by prompting our controller with negative values for the heading command, we find that the robot naturally performs backward walking at speeds up to 0.5 m/s without falling.

Figure 9: Unseen commands. Our policy is able to follow backward commands at test time, unseen during training.

5.8. Training with action-free data

One of the benefits of our approach is that it can be applied to trajectories from diverse sources, including those with missing information, like actions in the case of human videos from YouTube. In Figure 6, right, we compare the performance of training only with complete trajectories to joint training on both complete and incomplete trajectories. We observe that including incomplete trajectories consistently leads to better performance. This is a promising signal for scaling our approach to a large collection of diverse trajectories.

Figure 10: Scaling studies. We find that our approach scales with the number of trajectories in the training dataset (left), context length (middle), and larger models (right).

5.9. Scaling studies

Training data. In Figure 10, left, we study the scaling of our model's performance by increasing the size of the training dataset. We find that training on more trajectories reduces position tracking error, which is a positive signal for increased performance when training on larger datasets.

Context length. We study the effect of increasing the number of tokens used in the context window of the transformer policy, varying it between 16, 32, and 48 steps, in Figure 10, middle. Larger context windows produce better policies, which suggests that our generative policy performs a form of in-context adaptation that improves with scale.

Model size. We compare models with an increasing number of parameters (1M, 2M, 8M) by varying the embedding dimension (144, 192, 384), the number of attention heads (3, 4, 12), and the number of transformer blocks (4, 4, 6), respectively. Tracking error monotonically decreases with model size.

5.10. Ablation studies

Concatenated vs. separate tokens. For the input to the transformer, we can either concatenate the observation and action at each step into a single token, or embed them into two separate tokens. We compare these two choices in Table 1a. We can see that concatenation has lower prediction error while separating tokens has lower tracking error. Overall, the two perform comparably, while using separate tokens doubles the input length and introduces computation overhead.

Modality-aligned vs. non-aligned prediction. When we use separate tokens for observations and actions as input, we can either predict \hat{o}_{i+1} from o_i and \hat{a}_{i+1} from a_i, which aligns the modality between prediction and input, or we can predict \hat{o}_{i+1} from a_i and \hat{a}_{i+1} from o_{i+1}, which does not have alignment. From Table 1b, we can see that modality alignment has clearly better performance than no alignment. We suspect this is because, at the i-th step during inference, when predicting the action of the (i+1)-th step without alignment, we need to first predict \hat{o}_{i+1} and use this prediction as input to predict \hat{a}_{i+1}. If the predicted \hat{o}_{i+1} is not accurate compared to the real o_{i+1} (which is used to predict \hat{a}_{i+1} during training), there will be a discrepancy between test and training data, which causes error in the action prediction.

Joint training vs. staged training. Given both complete data with actions and incomplete data without actions, we can either train jointly on both as described in Section 3, or first pre-train the model on all the data with state prediction only and then fine-tune it on complete data with action prediction. We compare these two approaches in Table 1c. We observe no significant difference between the two, which indicates that pre-training on state prediction and then fine-tuning on action prediction also gives a reasonable locomotion policy.

State-action prediction vs. action-only prediction. We compare the performance of our policy when trained to predict only actions versus when trained to predict both states and actions. The results in Table 1d show that state-action prediction improves model performance on trajectory tracking. We hypothesize that the additional learning signal enables the model to learn richer representations of the world that are beneficial for the locomotion task.

6. Discussion

We present a self-supervised approach for real-world humanoid locomotion. Our model is trained on a collection of sensorimotor trajectories, which come from prior neural network policies, model-based controllers, human motion capture, and YouTube videos of humans. We show that our model enables a full-sized humanoid to walk in the real world zero-shot. These findings suggest a promising path toward learning challenging real-world robot control tasks by generative modeling of large collections of trajectories.

                   Track Err.   Pred. Err.
  Concat           0.310        0.88
  Separate         0.299        0.98

(a) Concatenated vs. separate tokens for states and actions. The two modeling designs have comparable performance, while concatenating state and action gives a shorter input length and faster inference.

                   Track Err.   Pred. Err.
  Align            0.299        0.98
  Non-align        0.338        1.05

(b) Alignment vs. non-alignment of states or actions for next token prediction. Prediction with aligned modality performs better on both tracking error and next token prediction error.

                   Track Err.   Pred. Err.
  Joint training   0.310        0.88
  Staged training  0.311        -

(c) Joint vs. staged training on data with and without actions. Staged training, which pre-trains on state prediction and fine-tunes on action prediction, performs similarly to joint training.

                   Track Err.   Pred. Err.
  State-action     0.305        0.97
  Action-only      0.335        -

(d) State-action vs. action-only prediction. Predicting both states and actions leads to lower tracking error than predicting only actions, as in vanilla behavior cloning.

Table 1: Ablations on different design choices in modeling and training. For each ablation, we compare the average tracking error on a set of commands, as well as the next token prediction error on the test set. For a fair comparison, we do not report next token prediction error for models that only predict actions.

Acknowledgements

This work was supported in part by the DARPA Machine Common Sense program, the ONR MURI program (N00014-21-1-2801), NVIDIA, the Hong Kong Centre for Logistics Robotics, The AI Institute, and BAIR's industrial alliance programs. We thank Saner Cakir and Vikas Ummadisetty for help with the inverse kinematics simulation experiments.

References

[1] Bousmalis, K., Vezzani, G., Rao, D., Devin, C., Lee, A. X., Bauza, M., Davchev, T., Zhou, Y., Gupta, A., Raju, A., et al. RoboCat: A self-improving foundation agent for robotic manipulation. arXiv:2306.11706, 2023.

[2] Brohan, A., Brown, N., Carbajal, J., Chebotar, Y., Dabis, J., Finn, C., Gopalakrishnan, K., Hausman, K., Herzog, A., Hsu, J., et al. RT-1: Robotics transformer for real-world control at scale. arXiv:2212.06817, 2022.

[3] Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners. In NeurIPS, 2020.

[4] Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners. NeurIPS, 2020.

[5] Castillo, G. A., Weng, B., Zhang, W., and Hereid, A. Robust feedback motion policy design using reinforcement learning on a 3D Digit bipedal robot. In IROS, 2021.

[6] Chen, M., Radford, A., Child, R., Wu, J., Jun, H., Luan, D., and Sutskever, I. Generative pretraining from pixels. In ICML, 2020.

[7] Chignoli, M., Kim, D., Stanger-Jones, E., and Kim, S. The MIT humanoid robot: Design, motion planning, and control for acrobatic behaviors. In Humanoids, 2021.

[8] Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT, 2019.

[9] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2020.

[10] Driess, D., Xia, F., Sajjadi, M. S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al. PaLM-E: An embodied multimodal language model. arXiv:2303.03378, 2023.

[11] Engel, J., Agrawal, K. K., Chen, S., Gulrajani, I., Donahue, C., and Roberts, A. GANSynth: Adversarial neural audio synthesis. arXiv:1902.08710, 2019.

[12] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In NeurIPS, 2014.

[13] He, K., Chen, X., Xie, S., Li, Y., Dollár, P., and Girshick, R. Masked autoencoders are scalable vision learners. arXiv:2111.06377, 2021.

[14] Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., and Steinhardt, J. Measuring massive multitask language understanding. arXiv:2009.03300, 2020.

[15] Hirai, K., Hirose, M., Haikawa, Y., and Takenaka, T. The development of Honda humanoid robot. In ICRA, 1998.

[16] Ho, J., Jain, A., and Abbeel, P. Denoising diffusion probabilistic models. In NeurIPS, 2020.

[17] Hochreiter, S. and Schmidhuber, J. Long short-term memory. Neural Computation, 1997.

[18] Kajita, S., Kanehiro, F., Kaneko, K., Yokoi, K., and Hirukawa, H. The 3D linear inverted pendulum mode: A simple modeling for a biped walking pattern generation. In IROS, 2001.

[19] Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., and Amodei, D. Scaling laws for neural language models. arXiv:2001.08361, 2020.

[20] Kato, I. Development of WABOT 1. Biomechanism, 1973.

[21] Kondratyuk, D., Yu, L., Gu, X., Lezama, J., Huang, J., Hornung, R., Adam, H., Akbari, H., Alon, Y., Birodkar, V., et al. VideoPoet: A large language model for zero-shot video generation. arXiv:2312.14125, 2023.

[22] Kuindersma, S. Recent progress on Atlas, the world's most dynamic humanoid robot, 2020. URL https://youtu.be/EGABAx52GKI.

[23] Loper, M., Mahmood, N., Romero, J., Pons-Moll, G., and Black, M. J. SMPL: A skinned multi-person linear model. In Seminal Graphics Papers: Pushing the Boundaries, Volume 2, 2023.

[24] Mahmood, N., Ghorbani, N., Troje, N. F., Pons-Moll, G., and Black, M. J. AMASS: Archive of motion capture as surface shapes. In ICCV, 2019.

[25] Makoviychuk, V., Wawrzyniak, L., Guo, Y., Lu, M., Storey, K., Macklin, M., Hoeller, D., Rudin, N., Allshire, A., Handa, A., et al. Isaac Gym: High performance GPU-based physics simulation for robot learning. In NeurIPS, 2021.

[26] Nelson, G., Saunders, A., Neville, N., Swilling, B., Bondaryk, J., Billings, D., Lee, C., Playter, R., and Raibert, M. PETMAN: A humanoid robot for testing chemical protective clothing. Journal of the Robotics Society of Japan, 2012.

[27] Oord, A. v. d., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., and Kavukcuoglu, K. WaveNet: A generative model for raw audio. arXiv:1609.03499, 2016.

[28] Plappert, M., Mandery, C., and Asfour, T. The KIT motion-language dataset. Big Data, 2016.

[29] Radford, A., Narasimhan, K., Salimans, T., and Sutskever, I. Improving language understanding by generative pre-training. 2018.

[30] Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al. Language models are unsupervised multitask learners. 2019.

[31] Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al. Learning transferable visual models from natural language supervision. In ICML, 2021.

[32] Radosavovic, I., Shi, B., Fu, L., Goldberg, K., Darrell, T., and Malik, J. Robot learning with sensorimotor pre-training. In CoRL, 2023.

[33] Radosavovic, I., Xiao, T., Zhang, B., Darrell, T., Malik, J., and Sreenath, K. Real-world humanoid locomotion with reinforcement learning. arXiv:2303.03381, 2023.

[34] Raibert, M. H. Legged robots that balance. MIT Press, 1986.

[35] Rajasegaran, J., Pavlakos, G., Kanazawa, A., and Malik, J. Tracking people by predicting 3D appearance, location and pose. In CVPR, 2022.

[36] Ramesh, A., Pavlov, M., Goh, G., Gray, S., Voss, C., Radford, A., Chen, M., and Sutskever, I. Zero-shot text-to-image generation. In ICML, 2021.

[37] Shannon, C. E. Prediction and entropy of printed English. Bell System Technical Journal, 1951.

[38] Shridhar, M., Manuelli, L., and Fox, D. Perceiver-Actor: A multi-task transformer for robotic manipulation. In CoRL, 2022.

[39] Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., and Ganguli, S. Deep unsupervised learning using nonequilibrium thermodynamics. In ICML, 2015.

[40] Stasse, O., Flayols, T., Budhiraja, R., Giraud-Esclasse, K., Carpentier, J., Mirabel, J., Del Prete, A., Souères, P., Mansard, N., Lamiraux, F., et al. TALOS: A new humanoid research platform targeted for industrial applications. In Humanoids, 2017.

[41] Todorov, E., Erez, T., and Tassa, Y. MuJoCo: A physics engine for model-based control. In IROS, 2012.

[42] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. In NeurIPS, 2017.

[43] Wu, J., Zhang, C., Xue, T., Freeman, B., and Tenenbaum, J. Learning a probabilistic latent space of object shapes via 3D generative-adversarial modeling. In NeurIPS, 2016.
