Humanoid Locomotion as Next Token Prediction
Ilija Radosavovic 1 Bike Zhang 1 Baifeng Shi 1 Jathushan Rajasegaran 1 Sarthak Kamat 1
Trevor Darrell 1 Koushil Sreenath 1 Jitendra Malik 1
Figure 1: A humanoid that walks in San Francisco. We deploy our policy to various locations in San Francisco over
the course of one week. Please see our project page for videos. We show that our policy can walk over different surfaces
including walkways, concrete, asphalt, tiled plazas, and sanded roads. We find that our policy follows omnidirectional
velocity commands well and enables deployment in a challenging city environment like San Francisco.
To validate our method, we apply it to the challenging task of real-world humanoid locomotion. We use the full-sized Digit humanoid robot developed by Agility Robotics. We first collect a dataset of sensorimotor trajectories in simulation. These include complete trajectories from a neural network policy trained with reinforcement learning (33), as well as incomplete trajectories from three different sources: (i) an Agility Robotics controller based on model predictive control, (ii) motion capture of humans, and (iii) YouTube videos of humans. We reconstruct human videos using computer vision techniques and retarget both motion capture and YouTube trajectories via inverse kinematics. We then train a transformer model to autoregressively predict trajectories. At test time, we execute the actions autoregressively and ignore the sensory predictions.

We demonstrate that our policy can be deployed in the real world zero-shot and walk on different surfaces. Specifically, we deploy our model across a range of different locations in San Francisco over the course of one week. Please see Figure 1 for examples and our project page for videos. To quantitatively evaluate different aspects of our approach, we perform an extensive study in simulation. We find that our autoregressive policies trained from offline data alone are comparable to state-of-the-art approaches that use reinforcement learning (33) in the tested settings. We further find that our approach readily benefits from incomplete trajectories and has favorable scaling properties.

These findings suggest a promising path toward learning challenging real-world robot control tasks by generative modeling of large collections of sensorimotor trajectories.

We explore autoregressive generative models in the context of real-world humanoid locomotion.

Transformers in robotics. Following the success of transformer models (42) in natural language processing (29, 8, 30, 3) and computer vision (9, 13), over the last few years there has been increased interest in using transformer models in robotics. We have seen several works showing that transformers can be effective with behavior cloning. For example, (38) learns multi-task transformer policies with language, and (2) trains language-conditioned manipulation policies from large-scale data. (10) trains language models with embodied data. We have also seen that transformer policies can be effective for large-scale reinforcement learning (33). (32) learns sensorimotor representations with masked prediction. (1) learns goal-conditioned policies from demonstrations. Likewise, we share the goal of using transformer models for robotics but focus on autoregressive modeling of diverse trajectories for real-world humanoid locomotion.

Humanoid locomotion. Mastering the ability for robots to walk has been a long-standing challenge in robotics. In the past several decades, roboticists have built a variety of humanoid robots (20, 15, 26, 40, 7) to explore human-like locomotion skills. Stable locomotion behaviors have been achieved through model-based control approaches (34, 18), and optimization-based methods further enable highly dynamic humanoid motions (22). Although significant progress has been made with these strategies and by combining them with learning (5), learning-based approaches are gaining attention for their ability to improve and adapt to a wide range of environments. Recently, we have seen that a purely learning-based approach trained with large-scale reinforcement learning in simulation can enable real-world humanoid locomotion (33). As in prior work, our model is a causal transformer. Unlike prior work, we perform autoregressive modeling instead of reinforcement learning.

We train our model by minimizing the negative log-likelihood over our trajectory dataset:

L = − Σ_{t ∈ D} log p(t)    (2)
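As a concrete reference, the sketch below shows one way the objective in Equation (2) could be implemented for a causal transformer over continuous sensorimotor tokens. The layer sizes follow the configuration reported in Section 5.1 (hidden size 192, 4 layers, 4 heads); treating each token's likelihood as a unit-variance Gaussian, so that the negative log-likelihood reduces to a mean-squared error, is our assumption and not a detail taken from the paper.

```python
import torch
import torch.nn as nn

class CausalTrajectoryModel(nn.Module):
    """Causal transformer over continuous sensorimotor tokens (illustrative sketch)."""

    def __init__(self, token_dim, hidden=192, layers=4, heads=4):
        super().__init__()
        self.embed = nn.Linear(token_dim, hidden)
        block = nn.TransformerEncoderLayer(
            d_model=hidden, nhead=heads, dim_feedforward=4 * hidden,
            activation="relu", norm_first=True, batch_first=True)
        self.encoder = nn.TransformerEncoder(block, num_layers=layers)
        self.head = nn.Linear(hidden, token_dim)

    def forward(self, tokens):
        # tokens: (batch, time, token_dim); the causal mask enforces autoregression.
        T = tokens.shape[1]
        causal = torch.triu(torch.full((T, T), float("-inf"), device=tokens.device), diagonal=1)
        return self.head(self.encoder(self.embed(tokens), mask=causal))

def trajectory_nll(model, tokens):
    """Eq. (2) surrogate: score each next token given the past.
    Assumes a unit-variance Gaussian per token, i.e. a mean-squared error."""
    pred = model(tokens[:, :-1])   # predictions for tokens 1..T-1
    target = tokens[:, 1:]         # ground-truth next tokens
    return ((pred - target) ** 2).mean()
```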
Figure 2: Humanoid locomotion as next token prediction. We collect a dataset of trajectories from various sources, such as neural network policies, model-based controllers, human motion capture, and YouTube videos of humans. We then use this dataset to train a transformer policy by autoregressive modeling of observations and actions. Our transformer allows a humanoid to walk zero-shot on various terrains around San Francisco. Please see our project page for video results.
Figure 3: A general framework for training with different data sources. Our data modeling allows us to train our transformer in multiple modes. When observation-action pairs are available, we train the transformer to predict the next observation-action pair. When no action data is available, as with MoCap and internet data, we train the transformer to predict only the next observations, masking the actions with a mask token. These two modes of training allow our model to utilize both types of data, which enables us to scale training in terms of data.
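The mask-token mechanism in Figure 3 could be realized along the following lines; the learned mask embedding and the loss-mask bookkeeping are our assumptions about one reasonable implementation, not the authors' code.

```python
import torch
import torch.nn as nn

class ActionMasking(nn.Module):
    """Substitute a learned [M] token for missing actions and drop their loss terms."""

    def __init__(self, act_dim):
        super().__init__()
        self.mask_token = nn.Parameter(torch.zeros(act_dim))  # learned mask embedding

    def forward(self, actions, has_action):
        # actions: (batch, time, act_dim); has_action: (batch, time) bool.
        # For action-free steps, `actions` can simply hold zeros.
        keep = has_action.unsqueeze(-1)
        filled = torch.where(keep, actions, self.mask_token.expand_as(actions))
        return filled, keep

def masked_prediction_loss(pred_obs, obs, pred_act, act, keep):
    """Observations are always supervised; actions only where real labels exist."""
    obs_term = ((pred_obs - obs) ** 2).mean()
    act_sq = ((pred_act - act) ** 2).sum(dim=-1) * keep.squeeze(-1)  # zero out masked steps
    act_term = act_sq.sum() / keep.sum().clamp(min=1)                # average over supervised steps
    return obs_term + act_term
```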
Figure 4: Training dataset. To train our model, we construct a dataset of trajectories coming from four different sources. (i)
neural network policy: provides trajectories with complete observations and actions. (ii) model-based controller: produces
trajectories without actions. (iii) motion capture of humans: does not contain actions and is approximately retargeted
onto the robot. (iv) internet videos of humans: noisy human poses are first reconstructed via 3D reconstruction and then
approximately retargeted onto the robot.
as follows: linear velocity forward [−1.0, 1.0] m/s, linear velocity sideways [−1.0, 1.0] m/s, and turning angular velocity [−1.0, 1.0] rad/s. We use the default model-based configuration for one set and randomize the leg length, step clearance, and bounciness of the floor for the other.

As this controller outputs joint torques, which are not consistent with our joint position action space, we only record the observations without the actions. This data serves as a source of trajectories with reasonably good observations from the same morphology but without the actions.

4.3. Human motion capture trajectories

As the next source of trajectories, we use the motion capture (MoCap) recordings of humans from the KIT dataset (28) distributed via the AMASS repository (24). This data was recorded using optical marker-based tracking in a laboratory setting. The dataset consists of ∼4k trajectories. We use a subset of ∼1k standing, walking, and running trajectories.

In addition to not containing the ground-truth actions, the MoCap trajectories come with an additional challenge: different morphology. Namely, MoCap trajectories capture human keypoint positions in 3D. In order to use these trajectories for training a robot, we solve an inverse kinematics problem to find the corresponding robot poses.

We formulate an inverse kinematics optimization problem:

min_{q[t], q̇[t]}  Σ_{t=1}^{N} ( φ_traj[t] + φ_reg[t] )    (12a)

s.t.  q[t+1] = q[t] + ((q̇[t+1] + q̇[t]) / 2) · dt,    (12b)

      q ∈ Q,  q̇ ∈ V    (12c)

where q is the robot state in generalized coordinates, and N and dt are the optimization horizon and sampling time. The optimization variables are q and q̇. For the constraints, (12b) is the Euler integration of the posture q, and (12c) constrains the range of q and q̇ to their admissible sets Q and V. In the cost function, φ_traj tracks keypoint locations from human trajectories, and φ_reg represents regularization costs, such as joint velocity minimization and smoothness.

4.4. Trajectories from YouTube videos

Internet videos of people doing various activities are potentially a vast source of data for learning human locomotion. However, the raw pixels contain no information about the state and actions of the human. To recover this, we first run a computer vision tracking algorithm, PHALP (35), to extract human trajectories in 3D. This provides an estimate of the 3D joints of the human body SMPL (23) parameters and a noisy estimate of the human joints in world coordinates. We use the human body joint positions to retarget the motion to the humanoid robot using the inverse kinematics optimization described above. Once we retarget the motion from the internet videos to humanoid trajectories, we keep only the trajectories with low optimization cost. Note that the scale of this data comes at the cost of being noisy.
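To make the retargeting step concrete, here is one way the problem in Equations (12a)–(12c) could be set up with an off-the-shelf solver. The forward-kinematics function, the cost weights, and the choice to fold the Euler-integration constraint (12b) into a finite-difference parameterization are our assumptions; velocity bounds (q̇ ∈ V) are omitted for brevity.

```python
import numpy as np
from scipy.optimize import minimize

def retarget_trajectory(keypoints, fk, q_init, q_min, q_max, dt=0.02, w_reg=1e-2):
    """Simplified version of Eq. (12): track human keypoints with a robot joint
    trajectory, subject to joint limits.

    keypoints: (N, K, 3) target keypoint positions per time step.
    fk:        forward kinematics, mapping a joint vector q -> (K, 3) keypoints.
    q_init:    (N, D) initial guess for the joint trajectory.
    q_min/max: (D,) joint limits defining the admissible set Q.
    """
    N, D = q_init.shape

    def cost(q_flat):
        q = q_flat.reshape(N, D)
        qdot = np.diff(q, axis=0) / dt  # finite-difference velocities (stand-in for 12b)
        phi_traj = sum(np.sum((fk(q[t]) - keypoints[t]) ** 2) for t in range(N))
        phi_reg = w_reg * (np.sum(qdot ** 2) + np.sum(np.diff(qdot, axis=0) ** 2))
        return phi_traj + phi_reg

    bounds = [(lo, hi) for lo, hi in zip(q_min, q_max)] * N  # q in Q, Eq. (12c)
    sol = minimize(cost, q_init.ravel(), method="L-BFGS-B", bounds=bounds)
    return sol.x.reshape(N, D)
```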
Figure 5: Comparison to the state of the art, trajectory adherence. The robot is commanded to walk starting from the origin with a fixed heading command of 0.5 m/s and varying yaw commands in [−0.4, 0.4] rad/s. We plot the desired (dotted) and actual (solid) trajectories for our policy and a reinforcement-learning trained policy (RL).

Figure 6: Tracking error comparisons. We measure the tracking error of our policy against a state-of-the-art benchmark (left), as well as the improvement produced by complementing action-labeled RL trajectories with action-free trajectories (right).
5. Experiments

We evaluate our approach on the challenging task of humanoid locomotion. We perform outdoor experiments on real hardware and systematic evaluations in simulation.

5.1. Experimental setup

Robot platform. Digit is a humanoid robot platform developed by Agility Robotics. It is a full-sized humanoid that is 1.6 m tall and weighs 45 kg. It has 30 degrees of freedom, of which 20 are actuated. Due to its high dimensionality and four-bar linkage structure, it is challenging to optimize quickly, which makes it particularly interesting for learning approaches that can learn efficiently from trajectory collections like ours.

Model. Our model has a hidden size of 192 dimensions and 4 transformer layers, each consisting of self-attention and an MLP. Each self-attention layer has 4 heads. We use LayerNorm before each attention layer and a ReLU activation after the MLP layer. We use a BatchNorm layer to process the input before the transformer model. When predicting a token at time k, to keep the context length at a reasonable size, we only keep the past 16 steps in the input. In Section 5.9, we show that the model is able to scale to more parameters and a longer context length and achieve higher performance.

5.2. Real-world deployment

We begin by reporting the results of deploying our policy in the real world. Specifically, we deploy our robot at various locations in San Francisco over the course of one week. Please see Figure 1 for examples and the project page for videos. We find that our policy is able to walk over a variety of surfaces including walkways, concrete, asphalt, tiled plazas, and dirt roads. Note that deployment in a large city environment, like San Francisco, is considerably more challenging than in constrained environments. The city environment is much more crowded, less patient, and not forgiving. This makes the error tolerance low and requires a policy that works consistently well.

5.3. Evaluation metrics

We evaluate locomotion policies with two metrics: tracking error and prediction error. Tracking error measures how accurately the robot follows a specific locomotion command. The prediction error is the next token prediction loss measured on a separate set of validation data. We introduce the two metrics in detail below and show that both consistently predict locomotion performance.

Tracking error. In all experiments, the robot starts from rest in a simulated environment and is issued a constant natural walking command consisting of a desired heading velocity sampled in [0.35, 0.70] m/s, an angular velocity sampled in [−0.4, 0.4] rad/s, and zero lateral velocity. We compute x*(t), the ideal robot base position trajectory that fully satisfies the velocity command v*(t) at all time steps. To measure the accuracy of command tracking, we define the position tracking error as (1/T) Σ_{t=0}^{T} ∥x(t) − x*(t)∥. We use the MuJoCo simulator (41) for evaluations, and all trajectories last for a duration of 10 seconds.

Prediction error. Since the model is trained with next token prediction, we test the prediction error on a set of validation data that is separate from the training data and contains state-action trajectories collected from the RL policy. This is similar to the language modeling evaluation for large language models (14). We test both state and action prediction errors and add them together as the final error metric.

5.4. Comparison to the state of the art

Trajectory adherence. We compare our policy to a neural network controller trained with reinforcement learning (RL) (33). Figure 5 presents a visual comparison of the trajectory adherence of our controller against this state-of-the-art baseline. Starting with the robot at the origin, we plot the actual trajectory of the robot for eleven different yaw commands selected from {0.00, ±0.05, ±0.10, ±0.20, ±0.30, ±0.40} rad/s. For each policy, we jointly plot the desired and actual path traced by the robot base. Our model exhibits superior tracking to the RL controller at all turning speeds and has near-perfect tracking for straight-line walking.
[Figure 7 plot: position tracking error (m) versus prediction loss (10^-2) for 14 models; r = 0.87.]
Figure 7: Prediction error correlates with performance. We plot the tracking error and prediction error for 14 models. The prediction error linearly correlates with task tracking error with r = 0.87, which means lower prediction loss likely indicates more accurate command following.

Figure 8: Gait quality. We command the robot with a heading velocity of 0.5 m/s and plot the resulting phase portrait of the left knee joint. Compared to the RL policy, our policy features fewer irregularities and a smoother, cyclic gait.
We collect 14 models trained with different training recipes, model architectures, and data sizes and types, and test the tracking error and prediction error for each of them. We plot the tracking and prediction errors of all the models in a single scatter plot, as shown in Figure 7. We can see that tracking and prediction error are highly correlated, with a Pearson coefficient of r = 0.87, which means that models with lower prediction error on the validation set likely follow different commands with higher accuracy. This suggests that the prediction error is predictive of task performance.

5.6. Gait quality

In humanoid locomotion, the smoothness of the robot's gait is contingent on the rhythmic functioning of its actuated knee joints. One way to measure this is a phase portrait, which is a parametric plot of a joint's generalized position and velocity over time. Patterns in the plot can reveal information about the type of movement the joint is undergoing. For example, a cyclic pattern may indicate repetitive motion, while irregular patterns might suggest complex or varied movements, such as stumbling. In Figure 8, we command the robot to walk forward at 0.5 m/s and plot the associated phase portrait of its left knee joint. Notice that, compared to the RL policy, our policy's portrait features fewer irregularities and a smoother, cyclic gait.

5.8. Training with action-free data

One of the benefits of our approach is that it can be applied to trajectories from diverse sources, including those with missing information, such as the actions in the case of human videos from YouTube. In Figure 6, right, we compare the performance of training only with complete trajectories to joint training on both complete and incomplete trajectories. We observe that including incomplete trajectories consistently leads to better performance. This is a promising signal for scaling our approach to a large collection of diverse trajectories.

Figure 9: Unseen commands. Our policy is able to follow backward commands at test time, unseen during training.
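As a concrete reference for the correlation analysis in Figure 7, the Pearson coefficient between validation prediction loss and tracking error can be computed directly from the logged per-model measurements. The numbers below are synthetic stand-ins; the paper's 14 models are not reproduced here.

```python
import numpy as np
from scipy.stats import pearsonr

# Synthetic stand-ins for the per-model measurements; in practice these would be
# the logged (prediction loss, tracking error) pairs for the 14 evaluated models.
rng = np.random.default_rng(0)
prediction_loss = rng.uniform(0.009, 0.013, size=14)                             # validation next-token loss
tracking_error = 0.2 + 12.0 * prediction_loss + rng.normal(0, 0.005, size=14)    # metres

r, p = pearsonr(prediction_loss, tracking_error)
print(f"Pearson r = {r:.2f} (p = {p:.3g})")
```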
Figure 10: Scaling studies. We find that our approach scales with the number of trajectories in the training dataset (left), the context length (middle), and the model size (right).
5.9. Scaling studies

Training data. In Figure 10, left, we study the scaling of our model's performance as we increase the size of the training dataset. We find that training on more trajectories reduces the position tracking error, which is a positive signal for increased performance when training on larger datasets.

Context length. We study the effect of increasing the number of tokens used in the context window of the transformer policy, varying it between 16, 32, and 48 steps in Figure 10, middle. Larger context windows produce better policies, which suggests that our generative policy performs a form of in-context adaptation that improves with scale.

Model size. We compare models with an increasing number of parameters (1M, 2M, 8M) by varying the embedding dimension (144, 192, 384), number of attention heads (3, 4, 12), and number of transformer blocks (4, 4, 6), respectively. Tracking error monotonically decreases with model size.

5.10. Ablation studies

Concatenated vs. separate tokens. For the input to the transformer, we can either concatenate the observation and action at each step into a single token, or embed them as two separate tokens. We compare these two choices in Table 1a. We can see that concatenation has lower prediction error while separating tokens has lower tracking error. Overall, the two perform comparably, while using separate tokens doubles the input length and introduces computation overhead.

Modality-aligned vs. non-aligned prediction. When we use separate tokens for observations and actions as input, we can either predict ô_{t+1} from o_t and â_{t+1} from a_t, which aligns the modality between prediction and input, or we can predict ô_{t+1} from a_t and â_{t+1} from o_{t+1}, which does not have alignment. From Table 1b, we can see that modality alignment has clearly better performance than no alignment. We suspect this is because, at the t-th step during inference, when predicting the action of the (t+1)-th step without alignment, we need to first predict ô_{t+1} and use this prediction as input to predict â_{t+1}. If the predicted ô_{t+1} is not accurate compared to the real o_{t+1} (which is used to predict â_{t+1} during training), there will be a discrepancy between test and training data, which will cause errors in action prediction.

Joint training vs. staged training. Given both complete data with actions and incomplete data without actions, we can either jointly train on both, as described in Section 3, or we can first pre-train the model on all of the data with state prediction only and then fine-tune the model on the complete data with action prediction. We compare these two approaches in Table 1c. We observe no significant difference between the two, which indicates that pre-training on state prediction and then fine-tuning on action prediction also gives a reasonable locomotion policy.

State-action prediction vs. action-only prediction. We compare the performance of our policy when trained to predict only actions versus when trained to predict both states and actions. The results in Table 1d show that state-action prediction improves model performance on trajectory tracking. We hypothesize that the additional learning signal enables the model to learn richer representations of the world that are beneficial for the locomotion task.

6. Discussion

We present a self-supervised approach for real-world humanoid locomotion. Our model is trained on a collection of sensorimotor trajectories, which come from prior neural network policies, model-based controllers, human motion capture, and YouTube videos of humans. We show that our model enables a full-sized humanoid to walk in the real world zero-shot. These findings suggest a promising path toward learning challenging real-world robot control tasks by generative modeling of large collections of trajectories.
(a) Concatenated vs. separate tokens for states and actions. The two modeling designs have comparable performance, while concatenating state and action gives a shorter input length and faster inference.

(b) Alignment vs. non-alignment of states or actions for next token prediction. Prediction with aligned modality performs better on both tracking error and next token prediction error.

(c) Joint vs. staged training on data with and without actions. Staged training, which pre-trains on state prediction and fine-tunes on action prediction, has similar performance to joint training.

(d) State-action vs. action-only prediction. Predicting both states and actions leads to lower tracking error than only predicting actions, as in vanilla behavior cloning.
Table 1: Ablations on different design choices in modeling and training. For each ablation we compare the average
tracking error on a set of commands, as well as the next token prediction error on the test set. For a fair comparison, we do
not report next token prediction error for models that only predict actions.
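To spell out the alignment ablation in Table 1b (Section 5.10), the sketch below builds the two kinds of prediction targets for an interleaved sequence [o_1, a_1, o_2, a_2, ...]; padding observations and actions to a common dimension is an assumption made only to keep the example short.

```python
import torch

def build_prediction_targets(obs, act, aligned=True):
    """Targets for an interleaved [o_1, a_1, o_2, a_2, ...] token sequence.

    obs, act: (batch, T, dim) tensors, assumed padded to the same dim.
    aligned=True:  o_t predicts o_{t+1} and a_t predicts a_{t+1} (two positions ahead).
    aligned=False: plain next-token prediction, i.e. o_t -> a_t and a_t -> o_{t+1}.
    """
    B, T, D = obs.shape
    tokens = torch.stack([obs, act], dim=2).reshape(B, 2 * T, D)  # interleave per step
    if aligned:
        return tokens[:, :-2], tokens[:, 2:]   # same-modality targets
    return tokens[:, :-1], tokens[:, 1:]       # mixed-modality (next-token) targets
```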
Acknowledgements

This work was supported in part by the DARPA Machine Common Sense program, the ONR MURI program (N00014-21-1-2801), NVIDIA, the Hong Kong Centre for Logistics Robotics, The AI Institute, and BAIR's industrial alliance programs. We thank Saner Cakir and Vikas Ummadisetty for help with the inverse kinematics simulation experiments.

References

[1] Bousmalis, K., Vezzani, G., Rao, D., Devin, C., Lee, A. X., Bauza, M., Davchev, T., Zhou, Y., Gupta, A., Raju, A., et al. Robocat: A self-improving foundation agent for robotic manipulation. arXiv:2306.11706, 2023.

[2] Brohan, A., Brown, N., Carbajal, J., Chebotar, Y., Dabis, J., Finn, C., Gopalakrishnan, K., Hausman, K., Herzog, A., Hsu, J., et al. Rt-1: Robotics transformer for real-world control at scale. arXiv:2212.06817, 2022.

[3] Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners. In NeurIPS, 2020.

[4] Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners. NeurIPS, 2020.

[5] Castillo, G. A., Weng, B., Zhang, W., and Hereid, A. Robust feedback motion policy design using reinforcement learning on a 3d digit bipedal robot. In IROS, 2021.

[6] Chen, M., Radford, A., Child, R., Wu, J., Jun, H., Luan, D., and Sutskever, I. Generative pretraining from pixels. In ICML, 2020.

[7] Chignoli, M., Kim, D., Stanger-Jones, E., and Kim, S. The mit humanoid robot: Design, motion planning, and control for acrobatic behaviors. In Humanoids, 2021.

[8] Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT, 2019.

[9] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2020.

[10] Driess, D., Xia, F., Sajjadi, M. S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al. Palm-e: An embodied multimodal language model. arXiv:2303.03378, 2023.

[11] Engel, J., Agrawal, K. K., Chen, S., Gulrajani, I., Donahue, C., and Roberts, A. Gansynth: Adversarial neural audio synthesis. arXiv:1902.08710, 2019.
[12] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In NeurIPS, 2014.

[13] He, K., Chen, X., Xie, S., Li, Y., Dollár, P., and Girshick, R. Masked autoencoders are scalable vision learners. arXiv:2111.06377, 2021.

[14] Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., and Steinhardt, J. Measuring massive multitask language understanding. arXiv:2009.03300, 2020.

[15] Hirai, K., Hirose, M., Haikawa, Y., and Takenaka, T. The development of honda humanoid robot. In ICRA, 1998.

[16] Ho, J., Jain, A., and Abbeel, P. Denoising diffusion probabilistic models. In NeurIPS, 2020.

[17] Hochreiter, S. and Schmidhuber, J. Long short-term memory. Neural Computation, 1997.

[18] Kajita, S., Kanehiro, F., Kaneko, K., Yokoi, K., and Hirukawa, H. The 3d linear inverted pendulum mode: A simple modeling for a biped walking pattern generation. In IROS, 2001.

[19] Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., and Amodei, D. Scaling laws for neural language models. arXiv:2001.08361, 2020.

[20] Kato, I. Development of wabot 1. Biomechanism, 1973.

[21] Kondratyuk, D., Yu, L., Gu, X., Lezama, J., Huang, J., Hornung, R., Adam, H., Akbari, H., Alon, Y., Birodkar, V., et al. Videopoet: A large language model for zero-shot video generation. arXiv:2312.14125, 2023.

[22] Kuindersma, S. Recent progress on atlas, the world's most dynamic humanoid robot, 2020. URL https://youtu.be/EGABAx52GKI.

[23] Loper, M., Mahmood, N., Romero, J., Pons-Moll, G., and Black, M. J. Smpl: A skinned multi-person linear model. In Seminal Graphics Papers: Pushing the Boundaries, Volume 2, 2023.

[24] Mahmood, N., Ghorbani, N., Troje, N. F., Pons-Moll, G., and Black, M. J. AMASS: Archive of motion capture as surface shapes. In ICCV, 2019.

[25] Makoviychuk, V., Wawrzyniak, L., Guo, Y., Lu, M., Storey, K., Macklin, M., Hoeller, D., Rudin, N., Allshire, A., Handa, A., et al. Isaac gym: High performance gpu-based physics simulation for robot learning. In NeurIPS, 2021.

[26] Nelson, G., Saunders, A., Neville, N., Swilling, B., Bondaryk, J., Billings, D., Lee, C., Playter, R., and Raibert, M. Petman: A humanoid robot for testing chemical protective clothing. Journal of the Robotics Society of Japan, 2012.

[27] Oord, A. v. d., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., and Kavukcuoglu, K. Wavenet: A generative model for raw audio. arXiv:1609.03499, 2016.

[28] Plappert, M., Mandery, C., and Asfour, T. The KIT motion-language dataset. Big Data, 2016.

[29] Radford, A., Narasimhan, K., Salimans, T., and Sutskever, I. Improving language understanding by generative pre-training. 2018.

[30] Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al. Language models are unsupervised multitask learners. 2019.

[31] Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al. Learning transferable visual models from natural language supervision. In ICML, 2021.

[32] Radosavovic, I., Shi, B., Fu, L., Goldberg, K., Darrell, T., and Malik, J. Robot learning with sensorimotor pre-training. In CoRL, 2023.

[33] Radosavovic, I., Xiao, T., Zhang, B., Darrell, T., Malik, J., and Sreenath, K. Real-world humanoid locomotion with reinforcement learning. arXiv:2303.03381, 2023.

[34] Raibert, M. H. Legged robots that balance. MIT Press, 1986.

[35] Rajasegaran, J., Pavlakos, G., Kanazawa, A., and Malik, J. Tracking people by predicting 3d appearance, location and pose. In CVPR, 2022.

[36] Ramesh, A., Pavlov, M., Goh, G., Gray, S., Voss, C., Radford, A., Chen, M., and Sutskever, I. Zero-shot text-to-image generation. In ICML, 2021.

[37] Shannon, C. E. Prediction and entropy of printed english. Bell System Technical Journal, 1951.

[38] Shridhar, M., Manuelli, L., and Fox, D. Perceiver-actor: A multi-task transformer for robotic manipulation. In CoRL, 2022.

[39] Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., and Ganguli, S. Deep unsupervised learning using nonequilibrium thermodynamics. In ICML, 2015.