SMPLOlympics: Sports Environments for Physically Simulated Humanoids

Zhengyi Luo¹ Jiashun Wang¹† Kangni Liu¹† Haotian Zhang² Chen Tessler² Jingbo Wang Ye Yuan² Jinkun Cao¹ Zihui Lin¹ Fengyi Wang¹ Jessica Hodgins¹ Kris Kitani¹
¹Carnegie Mellon University; ²Nvidia
https://smplolympics.github.io/SMPLOlympics
†Equal Contribution

Abstract

We present SMPLOlympics, a collection of physically simulated environments that allow humanoids to compete in a variety of Olympic sports. Sports simulation offers a rich and standardized testing ground for evaluating and improving the capabilities of learning algorithms due to the diversity and physically demanding nature of athletic activities. As humans have been competing in these sports for many years, there is also a plethora of existing knowledge on the preferred strategy to achieve better performance. To leverage these existing human demonstrations from videos and motion capture, we design our humanoid to be compatible with the widely-used SMPL and SMPL-X human models from the vision and graphics community. We provide a suite of individual sports environments, including golf, javelin throw, high jump, long jump, and hurdling, as well as competitive sports, including both 1v1 and 2v2 games such as table tennis, tennis, fencing, boxing, soccer, and basketball. Our analysis shows that combining strong motion priors with simple rewards can result in human-like behavior in various sports. By providing a unified sports benchmark and baseline implementation of state and reward designs, we hope that SMPLOlympics can help the control and animation communities achieve human-like and performant behaviors.

1 Introduction

Competitive sports, much like their role in human society, offer a standardized way of measuring the performance of learning algorithms and creating emergent human behavior. While there exist isolated efforts to bring individual sports into physics simulation [8, 36, 7, 35, 29, 26], each work uses a different humanoid, simulator, and learning algorithm, which prevents unified evaluation. Their specially built humanoids also make it difficult to acquire compatible motion data, as retargeting might be required to translate motion to each humanoid. Building a collection of simulated sports environments that uses a shared humanoid embodiment and training pipeline is challenging, as it requires expert knowledge in humanoid design, reinforcement learning (RL), and physics simulation.

These challenges have led to previous benchmarks and simulated environments [3, 25] focusing mainly on locomotion tasks for humanoids. While these tasks (e.g., moving forward, getting up from the ground, traversing terrains) are useful as benchmarks, they lack the depth and diversity needed to induce a wide range of behaviors and strategies. As a result, these environments do not fully exploit the potential of humanoids to discover actions and skills found in real-world human activities.

Another important aspect of working with simulated humanoids is the ease of obtaining human demonstrations. The resemblance to the human body makes humanoids capable of performing a diverse set of skills; a human can also easily judge the strategies used by humanoids. Curated human motion can be used either as motion prior [17, 18, 24] or in evaluation protocols. Thus, having an easy way to obtain new human motion data compatible with the humanoid, either from motion capture (MoCap) or videos, is critical for simulated humanoid environments.

In this work, we propose SMPLOlympics, a collection of physically simulated environments for a variety of Olympic sports. SMPLOlympics offers a wide range of sports scenarios that require not only locomotion skills, but also manipulation, coordination, and planning. Unified under one humanoid embodiment, our environments provide a rich set of challenges for developing and testing embodied agents. We use humanoids compatible with the SMPL family of models, which enables the direct conversion of human motion in the SMPL format to our humanoid. For tasks that require articulated fingers, we use an SMPL-X [16] based humanoid, which has many more degrees of freedom (DoF); for tasks that do not need hands, we use SMPL [2]. The SMPL family of models is widely adopted in the vision and graphics community, which gives us access to human pose estimation methods [34] capable of extracting coherent motion from videos. The existing large-scale human motion dataset [13] in the SMPL format also helps build general-purpose motion representations for humanoids [10].

Our sports environments support both individual and competitive sports, providing a comprehensive platform for testing and benchmarking. For individual sports, we include activities such as golf, javelin throw, high jump, long jump, and hurdling. Competitive sports in our suite include 1v1 games such as ping-pong, tennis, fencing, and boxing, as well as team sports such as soccer and basketball. To facilitate benchmarking, we also include tasks such as penalty kicks (for soccer) and ball-target hitting (for ping-pong and tennis), whose performance is easy to measure. To demonstrate the importance of human demonstrations, we extract motion from videos using off-the-shelf pose estimation methods, and show that using human motion data as a motion prior [18] can significantly improve human likeness in the resulting motion. We also test recent motion representations in simulated humanoid control using hierarchical RL [10], and show that a learned motion representation combined with simple rewards can lead to many versatile human-like behaviors that achieve impressive sports results (e.g., discovering the Fosbury flop for the high jump).

In conclusion, our contributions are: (1) we propose SMPLOlympics, a collection of simulated environments that allow humanoids to compete in 10 Olympic sports; (2) we provide a pipeline to extract human demonstration data from videos and show their effectiveness in helping build human-like strategies in simulated sports; (3) we provide the starting state and reward designs for each sport, benchmark state-of-the-art algorithms, and show that simple rewards combined with a strong motion prior can lead to impressive sports feats.

2 Related Works

Simulated Humanoid Sports

Simulated humanoid sports can help generate animations and explore optimal sports strategies. Research has focused on various individual sports within simulated environments, including tennis [36], table tennis [26], boxing [29, 38], fencing [29], basketball dribbling [7, 27], and soccer [31, 8]. These studies leverage human motion to achieve human-like behaviors, using it to acquire motor skills [8, 29] or establish a motion prior [36]. However, the diversity in humanoid definitions across studies makes it difficult to aggregate additional human demonstration data due to the need for retargeting. Furthermore, the task-specific training pipelines in these studies are hard to generalize to new sports. In contrast, SMPLOlympics provides a unified benchmark employing a consistent humanoid model and training pipeline across all sports. This standardization not only facilitates extension to more sports, but also simplifies benchmarking learning algorithms.

Simulated RL Benchmarks

Simulated full-body humanoids provide a valuable platform for studying embodied intelligence due to their close resemblance to real-world human behavior and physical interactions. Current RL benchmarks [3, 25, 14] often focus on locomotion tasks such as moving forward and traversing terrain. dm_control [25] and OpenAI Gym [3] only include locomotion tasks. ASE [19] includes results for five tasks based on MoCap data, which involve mainly simple locomotion and sword-swinging actions. These tasks lack the complexity required to fully exploit the capabilities of simulated humanoids. Sports scenarios, in contrast, require agile motion and strategic teamwork. They are also easily interpretable and provide measurable outcomes for success. A concurrent work, HumanoidBench [23], employs a commercially available humanoid robot in simulation to address 27 locomotion and manipulation tasks. Unlike HumanoidBench, our benchmark targets competitive sports and uses available human demonstration data to enhance the learning of human-like behaviors. This emphasis is essential because, without human demonstrations, behaviors developed in benchmarks can often appear erratic, nonhuman-like, and inefficient.

Humanoid Motion Representation

Adversarial learning has proven to be a powerful method for using human reference motions to enhance the naturalness of humanoid animations [18, 32, 1]. Due to the high DoF in humanoids and the inherent sample inefficiency of RL training, efforts have focused on developing motion primitives [6, 15, 5, 20] and motion latent spaces [4, 19, 24]. These techniques aim to accelerate training and provide human-like motion priors. Notably, approaches such as ASE [19], CASE [4], and CALM [24] utilize adversarial learning objectives to encourage mapping between random noise and realistic motor behavior. Furthermore, methods such as ControlVAE [33], NPMP [15], PhysicsVAE [30], NCP [38], and PULSE [10] leverage the motion imitation task to acquire and reuse motor skills for the learning of downstream tasks. In this work, we study AMP [18] and PULSE [10] as exemplary methods to provide motion priors. Our findings demonstrate that a robust motion prior, combined with straightforward reward designs, can effectively induce human-like behaviors in solving complex sports tasks.

3 Problem Formulation

We define the full-body human pose as ${\boldsymbol{q}_{t}}\triangleq({\boldsymbol{\theta}_{t}},{\boldsymbol{p}_{t}})$, consisting of 3D joint rotations ${\boldsymbol{\theta}_{t}}\in\mathbb{R}^{J\times 6}$ and positions ${\boldsymbol{p}_{t}}\in\mathbb{R}^{J\times 3}$ of all $J$ joints on the humanoid, using the 6D rotation representation [37]. To define velocities ${\boldsymbol{\dot{q}}_{1:T}}$, we have ${\boldsymbol{\dot{q}}_{t}}\triangleq({\boldsymbol{\omega}_{t}},{\boldsymbol{v}_{t}})$ as angular ${\boldsymbol{\omega}_{t}}\in\mathbb{R}^{J\times 3}$ and linear velocities ${\boldsymbol{v}_{t}}\in\mathbb{R}^{J\times 3}$. If an object is involved (e.g., javelin, football, ping-pong ball), we define its 3D trajectory $\boldsymbol{q}^{\text{obj}}_{t}$ using the object position $\boldsymbol{p}^{\text{obj}}_{t}$, orientation $\boldsymbol{\theta}^{\text{obj}}_{t}$, linear velocity $\boldsymbol{v}^{\text{obj}}_{t}$, and angular velocity $\boldsymbol{\omega}^{\text{obj}}_{t}$. As a notation convention, we use $\widehat{\cdot}$ to denote the ground truth kinematic quantities from Motion Capture (MoCap) and normal symbols without accents for values from the physics simulation.
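To make the $J\times 6$ rotation format concrete, the following is a minimal sketch of how a single 6D rotation can be mapped back to a $3\times 3$ rotation matrix with Gram-Schmidt orthogonalization. The function name and the convention that the six numbers store the first two matrix columns are assumptions made for illustration; implementations differ on whether rows or columns are stored.

```python
import math


def rotation_6d_to_matrix(r6):
    """Convert a 6D rotation (two 3D vectors a1, a2) to a 3x3 rotation matrix.

    Assumption: r6 = (a1, a2) holds the (possibly unnormalized) first two
    columns of the rotation matrix. The third column is their cross product.
    """
    a1, a2 = list(r6[:3]), list(r6[3:])

    def dot(u, v):
        return sum(x * y for x, y in zip(u, v))

    def normalize(u):
        norm = math.sqrt(dot(u, u))
        return [x / norm for x in u]

    # Gram-Schmidt: b1 is a1 normalized, b2 is a2 with its b1 component removed.
    b1 = normalize(a1)
    proj = dot(b1, a2)
    b2 = normalize([y - proj * x for x, y in zip(b1, a2)])
    # b3 completes the right-handed orthonormal basis.
    b3 = [
        b1[1] * b2[2] - b1[2] * b2[1],
        b1[2] * b2[0] - b1[0] * b2[2],
        b1[0] * b2[1] - b1[1] * b2[0],
    ]
    # Assemble the matrix row by row, with b1, b2, b3 as columns.
    return [[b1[i], b2[i], b3[i]] for i in range(3)]
```

Applying this to each of the $J$ rows of ${\boldsymbol{\theta}_{t}}$ would recover one rotation matrix per joint.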

Goal-conditioned Reinforcement Learning for Humanoid Control

We define each sport using the general framework of goal-conditioned RL. Namely, a goal-conditioned policy ${\pi_{\text{task}}}$ is trained to control a simulated humanoid competing in a sports environment. The learning task is formulated as a Markov Decision Process (MDP) defined by the tuple ${\mathcal{M}}=\langle\boldsymbol{\mathcal{S}},\boldsymbol{\mathcal{A}},\boldsymbol{\mathcal{T}},\boldsymbol{\mathcal{R}},\gamma\rangle$ of states, actions, transition dynamics, reward function, and discount factor. The simulation determines the state ${\boldsymbol{s}_{t}}\in\boldsymbol{\mathcal{S}}$ and transition dynamics $\boldsymbol{\mathcal{T}}$, while the policy computes the action ${\boldsymbol{a}_{t}}$. The state ${\boldsymbol{s}_{t}}$ contains the proprioception ${\boldsymbol{s}^{\text{p}}_{t}}$ and the goal state ${\boldsymbol{s}^{\text{g}}_{t}}$. Proprioception is defined as ${\boldsymbol{s}^{\text{p}}_{t}}\triangleq({\boldsymbol{q}_{t}},{\boldsymbol{\dot{q}}_{t}})$, which contains the 3D body pose ${\boldsymbol{q}_{t}}$ and velocity ${\boldsymbol{\dot{q}}_{t}}$. We use $\boldsymbol{b}$ to indicate the boundary of the arena to which a sport is limited. All values are normalized with respect to the humanoid heading (yaw).
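As an illustration of heading normalization, the sketch below expresses a world-frame point in a frame centered at the humanoid root and rotated so that the heading direction becomes the local x-axis. The function name, the z-up convention, and the subtraction of the root position are assumptions for this example; the text above only states that values are normalized with respect to yaw.

```python
import math


def to_heading_frame(point, root_pos, yaw):
    """Express a world-frame 3D point relative to the root, rotated by -yaw.

    Assumptions: z is up, yaw is the heading angle about z in radians, and
    the local x-axis points along the humanoid's heading.
    """
    dx = point[0] - root_pos[0]
    dy = point[1] - root_pos[1]
    dz = point[2] - root_pos[2]
    # Rotating by -yaw undoes the heading so the policy sees a canonical frame.
    c, s = math.cos(-yaw), math.sin(-yaw)
    return (c * dx - s * dy, s * dx + c * dy, dz)
```

For example, a humanoid at (1, 1, 0) facing the +y direction (yaw of π/2) sees a ball at (1, 3, 1) as lying 2 m straight ahead and 1 m up, regardless of where it stands in the arena.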

4 SMPLOlympics: Sports Environments for Simulated Humanoids

In this section, we describe the formulation of each of our sports environments, from single-person sports (Sec. 4.1) to multi-person sports (Sec. 4.2). Then, we describe our pipeline for acquiring human demonstration data from videos (Sec. 4.3). An overview can be found in Fig. 1. For each sport, we provide a preliminary reward design that serves as a baseline for future research. Due to space constraints, omitted details can be found in the supplement.

Figure 1: An overview of SMPLOlympics: we design a collection of simulated sports environments and leverage RL and human demonstrations (from videos or MoCap) as prior to tackle them.

4.1 Single-person Sports

High Jump

In the high jump environment, the humanoid jumps over a horizontal bar placed at a certain height without touching it and aims to reach a goal point that is 2 meters behind the bar. The bar is positioned following the setup of the official Olympic game. The high jump goal state ${\boldsymbol{s}^{\text{g-high\_jump}}_{t}}\triangleq(\boldsymbol{p}^{b}_{t},\boldsymbol{p}^{g}_{t})$ contains the positions of the bar $\boldsymbol{p}^{b}_{t}\in\mathbb{R}^{3}$ and the goal point $\boldsymbol{p}^{g}_{t}\in\mathbb{R}^{3}$. The reward is defined as $\boldsymbol{\mathcal{R}}^{\text{high jump}}({\boldsymbol{s}^{\text{p}}_{t}},{\boldsymbol{s}^{\text{g-high\_jump}}_{t}})\triangleq w^{\text{p}}r^{\text{p}}_{t}+w^{\text{h}}r^{\text{h}}_{t}$. The position reward $r^{\text{p}}_{t}$ encourages the humanoid to move closer to the goal point. The height reward $r^{\text{h}}_{t}$ encourages the humanoid to jump higher. An episode terminates when the humanoid touches the bar, fails to pass the bar, or falls to the ground before jumping. We also set up four bar heights for curriculum learning: 0.5m, 1m, 1.5m, and 2m.
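The weighted-sum structure of the high jump reward, which recurs in the other sports, can be sketched as follows. The exponential shaping of the position term, the use of root height as the height term, and all weights and scales here are hypothetical choices for illustration, not the values used in the benchmark.

```python
import math


def high_jump_reward(root_pos, goal_pos, w_p=0.5, w_h=0.5, k=1.0):
    """Weighted sum of a position term and a height term.

    Assumptions (illustrative only): r_p decays exponentially with the
    root-to-goal distance, r_h is the root height above the ground (z-up),
    and w_p, w_h, k are placeholder values.
    """
    dist_to_goal = math.dist(root_pos, goal_pos)
    r_p = math.exp(-k * dist_to_goal)  # in (0, 1], largest at the goal
    r_h = max(root_pos[2], 0.0)        # grows as the humanoid gets higher
    return w_p * r_p + w_h * r_h
```

With this shaping, moving toward the goal or rising higher both increase the reward, and the weights control the trade-off between the two terms.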

Long Jump

The long jump is also set up similarly to the real-world setting, with a 20m runway followed by a jump area. Before the humanoid jumps, its feet should be behind the jump line. The goal state ${\boldsymbol{s}^{\text{g-long\_jump}}_{t}}\triangleq(\boldsymbol{p}^{s}_{t},\boldsymbol{p}^{l}_{t},\boldsymbol{p}^{g}_{t})$ includes the position of the starting point $\boldsymbol{p}^{s}_{t}\in\mathbb{R}^{3}$, the jump line $\boldsymbol{p}^{l}_{t}\in\mathbb{R}^{3}$, and the goal $\boldsymbol{p}^{g}_{t}\in\mathbb{R}^{3}$. The training reward is defined as $\boldsymbol{\mathcal{R}}^{\text{long jump}}({\boldsymbol{s}^{\text{p}}_{t}},{\boldsymbol{s}^{\text{g-long\_jump}}_{t}})\triangleq w^{\text{p}}r^{\text{p}}_{t}+w^{\text{v}}r^{\text{v}}_{t}+w^{\text{h}}r^{\text{h}}_{t}+w^{\text{l}}r^{\text{l}}_{t}$. The position reward $r^{\text{p}}_{t}$ encourages the humanoid to get closer to the goal, the velocity reward $r^{\text{v}}_{t}$ encourages a higher running speed, and the height reward $r^{\text{h}}_{t}$ encourages a higher jump. Finally, $r^{\text{l}}_{t}$ encourages jumping far.

Hurdling

In hurdling, the humanoid tries to reach a finish line 110 meters ahead and needs to jump over 10 hurdles (each 1.067m high, with the first placed 13.72m from the start and subsequent hurdles spaced every 9.14m). The goal state is defined as ${\boldsymbol{s}^{\text{g-hurdling}}_{t}}\triangleq(\boldsymbol{p}^{h}_{t},\boldsymbol{p}^{f}_{t})$, where $\boldsymbol{p}^{h}_{t}\in\mathbb{R}^{10\times 3}$ and $\boldsymbol{p}^{f}_{t}\in\mathbb{R}^{3}$ are the positions of the hurdles and the finish line, respectively. We define a simple reward function as $\boldsymbol{\mathcal{R}}^{\text{hurdling}}({\boldsymbol{s}^{\text{p}}_{t}},{\boldsymbol{s}^{\text{g-hurdling}}_{t}})\triangleq r^{\text{distance}}_{t}$. $\boldsymbol{\mathcal{R}}^{\text{hurdling}}$ encourages the agent to run towards the finish line and clear each hurdle. Additionally, we employ a curriculum for hurdling, where the height of each hurdle is randomly sampled between 0 and 1.167 meters for each episode.
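The track layout described above can be reproduced with a few lines of arithmetic. This sketch computes the hurdle positions along the track from the stated distances and samples curriculum heights; the constant and function names are hypothetical, and sampling each hurdle's height independently from a uniform distribution is an assumption, since the text only gives the range.

```python
import random

NUM_HURDLES = 10
FIRST_HURDLE_M = 13.72           # distance from the start to the first hurdle
SPACING_M = 9.14                 # distance between consecutive hurdles
FINISH_M = 110.0                 # distance from the start to the finish line
MAX_CURRICULUM_HEIGHT_M = 1.167  # upper bound of the sampled hurdle height


def hurdle_positions():
    """Distances (in meters) of each hurdle from the start line."""
    return [FIRST_HURDLE_M + i * SPACING_M for i in range(NUM_HURDLES)]


def sample_hurdle_heights(rng):
    """Sample one height per hurdle for an episode.

    Assumption: independent uniform samples in [0, MAX_CURRICULUM_HEIGHT_M].
    """
    return [rng.uniform(0.0, MAX_CURRICULUM_HEIGHT_M) for _ in range(NUM_HURDLES)]


if __name__ == "__main__":
    positions = hurdle_positions()
    print("last hurdle at", round(positions[-1], 2), "m")
    print("run-in to finish", round(FINISH_M - positions[-1], 2), "m")
    print("episode heights", sample_hurdle_heights(random.Random(0)))
```

Under these numbers, the last hurdle sits at 13.72 + 9 × 9.14 = 95.98 m, which leaves a 14.02 m run-in to the finish line.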

Golf

For golf, the humanoid’s right hand is replaced with a golf club measuring 1.14 meters. The driver of the golf club is simulated as a small box ($0.05\text{m}\times 0.025\text{m}\times 0.02\text{m}$). We incorporate a wave-like terrain with an amplitude of 0.5 meters in the golf environment, designed to mimic real-world grasslands. The golf goal is positioned to the left of the humanoid, at a distance ranging from 0 meters to 20 meters away. The goal state ${\boldsymbol{s}^{\text{g-golf}}_{t}}\triangleq(\boldsymbol{p}^{b}_{t},\boldsymbol{p}^{c}_{t},\boldsymbol{p}^{g}_{t},\boldsymbol{o}_{t})$ includes the ball position $\boldsymbol{p}^{b}_{t}\in\mathbb{R}^{3}$, club position $\boldsymbol{p}^{c}_{t}\in\mathbb{R}^{3}$, goal position $\boldsymbol{p}^{g}_{t}\in\mathbb{R}^{3}$, and terrain height map $\boldsymbol{o}_{t}\in\mathbb{R}^{32\times 32}$. The reward is defined as $\boldsymbol{\mathcal{R}}^{\text{golf}}({\boldsymbol{s}^{\text{p}}_{t}},{\boldsymbol{s}^{\text{g-golf}}_{t}})\triangleq w^{\text{p}}r^{\text{p}}_{t}+w^{\text{c}}r^{\text{c}}_{t}+w^{\text{g}}r^{\text{g}}_{t}+w^{\text{pred}}r^{\text{pred}}_{t}$, where $r^{\text{p}}_{t}$ encourages the ball to move forward, $r^{\text{c}}_{t}$ encourages swinging the golf club to hit the ball, and $r^{\text{g}}_{t}$ encourages the ball to reach the goal. In addition, we predict the ball’s trajectory and provide a dense reward $r^{\text{pred}}_{t}$ based on the distance between the predicted landing point and the goal.
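A dense reward based on a predicted landing point can be sketched with simple projectile motion. The version below is a simplification under stated assumptions: it ignores drag and spin, treats the ground as flat at a fixed height (the actual environment uses wave-like terrain), and uses a hypothetical exponential shaping with a placeholder scale `k`. The function names are illustrative.

```python
import math

GRAVITY = 9.81  # m/s^2, assumed gravitational acceleration


def predict_landing_point(pos, vel, ground_z=0.0, g=GRAVITY):
    """Predict where a ball lands under drag-free projectile motion.

    Solves h + vz*t - 0.5*g*t^2 = 0 for the positive root t, where h is the
    current height above the (assumed flat) ground. Returns (x, y) or None
    if the ball can never reach the ground plane.
    """
    x, y, z = pos
    vx, vy, vz = vel
    h = z - ground_z
    disc = vz * vz + 2.0 * g * h
    if disc < 0.0:
        return None
    t = (vz + math.sqrt(disc)) / g
    return (x + vx * t, y + vy * t)


def predicted_landing_reward(pos, vel, goal_xy, k=0.1):
    """Hypothetical shaping: reward decays with landing-to-goal distance."""
    landing = predict_landing_point(pos, vel)
    if landing is None:
        return 0.0
    return math.exp(-k * math.dist(landing, goal_xy))
```

Because the landing point is available as soon as the ball leaves the club, a reward of this kind gives feedback at every step of the flight instead of only when the ball comes to rest.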

Javelin

For javelin throw, we use the SMPL-X humanoid with articulated fingers. The goal state is defined as ${\boldsymbol{s}^{\text{g-javelin}}_{t}}\triangleq(\boldsymbol{q}^{\text{obj}}_{t},\boldsymbol{p}^{r}_{t},\boldsymbol{p}^{h}_{t})$, where $\boldsymbol{q}^{\text{obj}}_{t}\in\mathbb{R}^{13}$ includes the position, orientation, linear velocity, and angular velocity of the javelin. $\boldsymbol{p}^{r}_{t}$ and $\boldsymbol{p}^{h}_{t}$ are the positions of the root and right hand. The reward is defined as $\boldsymbol{\mathcal{R}}^{\text{javelin}}({\boldsymbol{s}^{\text{p}}_{t}},{\boldsymbol{s}^{\text{g-javelin}}_{t}})\triangleq w^{\text{grab}}r^{\text{grab}}_{t}+w^{\text{js}}r^{\text{js}}_{t}+w^{\text{goal}}r^{\text{goal}}_{t}+w^{\text{s}}r^{\text{s}}_{t}$. The grab reward $r^{\text{grab}}_{t}$ encourages the right hand to grab the javelin. The javelin stability reward $r^{\text{js}}_{t}$ minimizes the javelin’s self-rotation. The goal reward $r^{\text{goal}}_{t}$ encourages the humanoid to throw the javelin further. The stability reward $r^{\text{s}}_{t}$ discourages large movements of the body.

4.2 Multi-person Sports

Tennis

For tennis, each humanoid’s right hand is replaced with an oval racket. We use the same measurements as a real tennis court and ball. We design two tasks: a single-player task where the humanoid trains to hit balls launched randomly, and a 1v1 mode where the humanoid plays against another humanoid. For both tasks, the goal state is defined as ${\boldsymbol{s}^{\text{g-tennis}}_{t}}\triangleq({\boldsymbol{p}^{\text{ball}}_{t}},{\boldsymbol{v}^{\text{ball}}_{t}},{\boldsymbol{p}^{\text{racket}}_{t}},{\boldsymbol{p}^{\text{tar}}_{t}})$, where ${\boldsymbol{p}^{\text{ball}}_{t}}\in\mathbb{R}^{3}$, ${\boldsymbol{v}^{\text{ball}}_{t}}\in\mathbb{R}^{3}$, ${\boldsymbol{p}^{\text{racket}}_{t}}\in\mathbb{R}^{3}$, and ${\boldsymbol{p}^{\text{tar}}_{t}}\in\mathbb{R}^{3}$ are the position and velocity of the ball, the position of the racket, and the position of the target. The reward function for tennis is defined as $\boldsymbol{\mathcal{R}}^{\text{tennis}}({\boldsymbol{s}^{\text{p}}_{t}},{\boldsymbol{s}^{\text{g-tennis}}_{t}})\triangleq w_{\text{p}}r^{\text{racket}}_{t}+w_{\text{b}}r^{\text{ball}}_{t}$. The racket reward $r^{\text{racket}}_{t}$ encourages the racket to reach the ball, and the ball reward $r^{\text{ball}}_{t}$ encourages hitting the ball into the opponent’s court, as close to the target as possible. For the single-player task, we shoot a ball from the opposite side from a random position and with a random trajectory, simulating a ball hit by the opponent. The target ${\boldsymbol{p}^{\text{tar}}_{t}}$ is also randomly sampled. For the 1v1 scenario, we can either train models from scratch or initialize two identical single-player models as opponents, which can play back and forth.

Table Tennis

For table tennis, each humanoid is equipped with a circular paddle (replacing the right hand) and plays on a standard table. As in tennis, we provide a single-player task and a 1v1 task, and the goal state is defined analogously as ${\boldsymbol{s}^{\text{g-table\_tennis}}_{t}}\triangleq({\boldsymbol{p}^{\text{ball}}_{t}},{\boldsymbol{v}^{\text{ball}}_{t}},{\boldsymbol{p}^{\text{racket}}_{t}},{\boldsymbol{p}^{\text{tar}}_{t}})$. The reward function for table tennis is defined as $\boldsymbol{\mathcal{R}}^{\text{table tennis}}({\boldsymbol{s}^{\text{p}}_{t}},{\boldsymbol{s}^{\text{g-table\_tennis}}_{t}})\triangleq w_{\text{p}}r^{\text{racket}}_{t}+w_{\text{b}}r^{\text{ball}}_{t}$. The paddle reward $r^{\text{racket}}_{t}$ is the same as in tennis, while we modify $r^{\text{ball}}_{t}$ slightly to encourage more hits for table tennis.

Fencing

For 1v1 fencing, each humanoid is equipped with a sword (replacing the right hand) and plays on a standard fencing field. The goal state is defined as ${\boldsymbol{s}^{\text{g-fencing}}_{t}}\triangleq({\boldsymbol{p}^{\text{opp}}_{t}},{\boldsymbol{v}^{\text{opp}}_{t}},{\boldsymbol{p}^{\text{sword}}_{t}}-{\boldsymbol{p}^{\text{opp-target}}_{t}},\|{\boldsymbol{c}_{t}}\|_{2}^{2},\|{\boldsymbol{c}^{\text{opp}}_{t}}\|_{2}^{2},\boldsymbol{b})$, which contains the opponent’s body position ${\boldsymbol{p}^{\text{opp}}_{t}}\in\mathbb{R}^{24\times 3}$, linear velocity ${\boldsymbol{v}^{\text{opp}}_{t}}\in\mathbb{R}^{24\times 3}$, the difference between the target body positions ${\boldsymbol{p}^{\text{opp-target}}_{t}}\in\mathbb{R}^{5\times 3}$ on the opponent and the agent’s sword tip position ${\boldsymbol{p}^{\text{sword}}_{t}}$, normalized contact forces on the agent itself $\|{\boldsymbol{c}_{t}}\|_{2}^{2}\in\mathbb{R}^{24\times 3}$ and its opponent $\|{\boldsymbol{c}^{\text{opp}}_{t}}\|_{2}^{2}\in\mathbb{R}^{24\times 3}$, as well as the bounding box $\boldsymbol{b}\in\mathbb{R}^{4}$. To train the fencing agent, we define the fencing reward function as $\boldsymbol{\mathcal{R}}^{\text{fencing}}({\boldsymbol{s}^{\text{p}}_{t}},{\boldsymbol{s}^{\text{g-fencing}}_{t}})\triangleq w_{\text{f}}r^{\text{facing}}_{t}+w_{\text{v}}r^{\text{vel}}_{t}+w_{\text{s}}r^{\text{strike}}_{t}+w_{\text{p}}r^{\text{point}}_{t}$. The facing reward $r^{\text{facing}}_{t}$ and velocity reward $r^{\text{vel}}_{t}$ encourage the agent to face and move toward the opponent. The strike reward $r^{\text{strike}}_{t}$ encourages the agent’s sword tip to get close to the target, while $r^{\text{point}}_{t}$ is the reward for making contact with the target. We use the pelvis, head, spine, chest, and torso as the target bodies. The episode terminates if either of the humanoids falls or steps out of bounds.

Boxing

For boxing, we simulate two humanoids with sphere hands in a bounded arena. The goal state is similar to fencing, but without the bounding box information: ${\boldsymbol{s}^{\text{g-boxing}}_{t}}\triangleq({\boldsymbol{p}^{\text{opp}}_{t}},{\boldsymbol{v}^{\text{opp}}_{t}},{\boldsymbol{p}^{\text{hand}}_{t}}-{\boldsymbol{p}^{\text{opp-target}}_{t}},\|{\boldsymbol{c}_{t}}\|_{2}^{2},\|{\boldsymbol{c}^{\text{opp}}_{t}}\|_{2}^{2})$. The reward function and target body parts are also the same as in fencing, with the sword tip replaced by the hands.

Soccer

The soccer environment includes one or more humanoids, a ball, two goal posts, and the field boundaries. The field measures 32m $\times$ 20m. We support three tasks: penalty kicks, 1v1, and 2v2.

For penalty kicks, the humanoid is positioned 13 meters from the goal line, with the ball placed at a fixed spot 12 meters directly in front of the goal center. The objective is to kick the ball toward a randomly sampled target within the goal post. To achieve this, the controller is provided with ${\boldsymbol{s}^{\text{g-kick}}_{t}}\triangleq({\boldsymbol{p}^{\text{ball}}_{t}},{\boldsymbol{\dot{q}}^{\text{ball}}_{t}},{\boldsymbol{p}^{\text{goal-post}}_{t}},{\boldsymbol{p}^{\text{goal-target}}_{t}})$, where ${\boldsymbol{p}^{\text{ball}}_{t}}\in\mathbb{R}^{3}$ is the ball position, ${\boldsymbol{\dot{q}}^{\text{ball}}_{t}}\in\mathbb{R}^{3}$ is the velocity and angular velocity, ${\boldsymbol{p}^{\text{goal-post}}_{t}}\in\mathbb{R}^{4}$ is the bounding box of the goal, and ${\boldsymbol{p}^{\text{goal-target}}_{t}}\in\mathbb{R}^{3}$ is the target location within the goal post. The reward is $\boldsymbol{\mathcal{R}}^{\text{soccer-kick}}({\boldsymbol{s}^{\text{p}}_{t}},{\boldsymbol{s}^{\text{g-kick}}_{t}})\triangleq w^{\text{p2b}}r^{\text{p2b}}+w^{\text{b2g}}r^{\text{b2g}}+w^{\text{bv2g}}r^{\text{bv2g}}+w^{\text{b2t}}r^{\text{b2t}}-c^{\text{no-dribble}}_{t}$. These reward terms are designed to guide the character towards a run-and-kick motion. The player-to-ball reward ($r^{\text{p2b}}$) motivates the character to move towards the ball. The ball-to-goal reward ($r^{\text{b2g}}$) encourages reducing the distance between the ball and the target. The ball-velocity-to-goal reward ($r^{\text{bv2g}}$) encourages a higher velocity of the ball toward the target. The ball-to-target reward ($r^{\text{b2t}}$) encourages a smaller distance between the target and the predicted landing spot of the ball based on its current position and velocity. Finally, a negative reward ($c^{\text{no-dribble}}_{t}$) is applied if the character passes the spawn position of the ball, which discourages dribbling and encourages kicking.

Beyond penalty kicks, we explore team-play dynamics, including 1v1 and 2v2. The controller is provided with a state defined as ${\boldsymbol{s}^{\text{g-soccer}}_{t}}\triangleq({\boldsymbol{p}^{\text{ball}}_{t}},{\boldsymbol{\dot{q}}^{\text{ball}}_{t}},{\boldsymbol{p}^{\text{goal-post}}_{t}},{\boldsymbol{p}^{\text{ally-root}}_{t}},{\boldsymbol{p}^{\text{opp-root}}_{t}})$, where ${\boldsymbol{p}^{\text{ally-root}}_{t}}\in\mathbb{R}^{3}$ and ${\boldsymbol{p}^{\text{opp-root}}_{t}}\in\mathbb{R}^{3}$ are the root positions of the ally and opponents (1 or 2). The controller is then trained using the following reward: $\boldsymbol{\mathcal{R}}^{\text{soccer-match}}({\boldsymbol{s}^{\text{p}}_{t}},{\boldsymbol{s}^{\text{g-soccer}}_{t}})\triangleq w^{\text{p2b}}r^{\text{p2b}}+w^{\text{b2g}}r^{\text{b2g}}+w^{\text{bv2g}}r^{\text{bv2g}}+w^{\text{point}}r^{\text{point}}$, where $r^{\text{p2b}}$, $r^{\text{b2g}}$, and $r^{\text{bv2g}}$ are the same as in the penalty kick. $r^{\text{b2g}}$ and $r^{\text{bv2g}}$ are zeroed out when the distance to the ball is greater than 0.5m. $r^{\text{point}}$, the reward for scoring a goal, provides a one-time bonus or penalty for goals. Note that this is a rudimentary reward design compared to prior art [8] and serves as a starting point for further development.

Basketball

Our basketball environment is set up similarly to the soccer environment except for using the SMPL-X humanoid. The court measures 29m $\times$ 15m, with a 3m high hoop. We also introduce the task of free-throw, where the humanoid begins at a distance of 4.5 meters from the hoop with the ball initially positioned close to its hands. The objective is to successfully throw the basketball into the hoop. The goal state for this task is defined similarly to that of the soccer penalty kicks, with the distinction being the prohibition of foot-to-ball contact to maintain basketball rules.

Competitive Self-play

In competitive sports environments, we implement a basic adversarial self-play mechanism where two policies, initialized randomly, compete against each other to optimize their rewards. We adopt an alternating optimization strategy from [29], where one policy is frozen while the other is trained. This encourages each policy to develop offensive and defensive strategies, contributing to more competitive behavior, as observed in boxing and fencing (supplement site).
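The alternating optimization described above can be sketched as a simple schedule in which the learner and the frozen opponent swap roles every round. The function name, the `train_one_round` callback, and switching after a fixed round are hypothetical; the text does not specify how long each policy is trained before the roles swap.

```python
def alternating_self_play(policy_a, policy_b, train_one_round, num_rounds):
    """Alternate which policy learns while the other is held fixed.

    `train_one_round(learner=..., frozen_opponent=...)` is a hypothetical
    callback that updates only `learner` against `frozen_opponent`.
    Returns the sequence of learners, one per round, for inspection.
    """
    schedule = []
    for round_idx in range(num_rounds):
        if round_idx % 2 == 0:
            learner, frozen = policy_a, policy_b
        else:
            learner, frozen = policy_b, policy_a
        # Only the learner's parameters should change inside this call.
        train_one_round(learner=learner, frozen_opponent=frozen)
        schedule.append(learner)
    return schedule
```

Freezing one side keeps the learning problem stationary within a round, since the learner always faces a fixed opponent.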

4.3 Acquiring Human Demonstration From Videos

We utilize TRAM [28] for 3D motion reconstruction from Internet videos, providing robust global trajectory and pose estimation under dynamic camera movements, commonly found in sports broadcasting. Specifically, TRAM estimates SMPL parameters [9] which include global root translation, orientation, body poses, and shape parameters. We further apply PHC [11], a physics-based motion tracker, to imitate these estimated motions, ensuring physical plausibility. We find these corrected motions are significantly more effective as positive samples for adversarial learning compared to raw estimated results. More details and ablation are provided in the supplementary materials.

5 Experiments

Implementation Details

Simulation is conducted in Isaac Gym [14], where the policy runs at 30 Hz and the simulation at 60 Hz. All task policies utilize three-layer MLPs with units [2048, 1024, 512]. The SMPL humanoid models adhere to the SMPL kinematic structure, featuring 24 joints, 23 of which are actuated, yielding an action space of $\mathbb{R}^{69}$. The SMPL-X humanoid has 52 joints, 51 of which are actuated, including 21 body joints and the hand joints, resulting in an action space of $\mathbb{R}^{153}$. Body parts on our humanoid consist of primitives such as capsules and blocks. All models can be trained on a single Nvidia RTX 3090 GPU in 1-3 days. We limit all joint actuation forces to 500 Nm. For more implementation details, please refer to the supplement.
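The action-space sizes and the relation between control and simulation rates follow from simple arithmetic, shown in the sketch below. The constant and function names are hypothetical, and three degrees of freedom per actuated joint is inferred from the stated dimensions (69 / 23 = 153 / 51 = 3).

```python
SIM_HZ = 60        # physics simulation rate stated in the text
POLICY_HZ = 30     # policy (control) rate stated in the text
DOF_PER_JOINT = 3  # inferred: each actuated joint contributes 3 action values


def action_dim(num_actuated_joints):
    """Size of the action vector for a humanoid with the given joint count."""
    return num_actuated_joints * DOF_PER_JOINT


def sim_steps_per_action():
    """Number of physics steps executed for each policy action."""
    return SIM_HZ // POLICY_HZ


if __name__ == "__main__":
    print("SMPL action dim:", action_dim(23))
    print("SMPL-X action dim:", action_dim(51))
    print("sim steps per policy step:", sim_steps_per_action())
```

In other words, each policy action is held for two simulation steps, and the SMPL-X humanoid's 51 actuated joints correspond to the 21 body joints plus 30 hand joints.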

Baselines

We benchmark our simulated sports using some of the state-of-the-art simulated humanoid control methods. While not a comprehensive list, it provides a baseline for the challenging environments. Each task is trained using PPO [22], AMP [18], PULSE [10], and a combination of PULSE and AMP. AMP trains a discriminator alongside the policy to provide an adversarial reward, using human demonstration data to deliver a “style” reward that reflects the human-likeness of the humanoid motion. Both task and discriminator rewards are equally weighted at 0.5. PULSE extracts a 32-dimensional universal motion representation from AMASS data, surpassing previous methods [24, 19] in coverage of motor skills and applicability to downstream tasks. Compared to AMP, PULSE uses hierarchical RL and offers a learned action space that accelerates training and provides a human-like motion prior (instead of a discriminative reward). PULSE and AMP can be combined effectively, where PULSE provides the action space and AMP provides a task-specific style reward.
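The equal weighting of task and discriminator rewards can be sketched as below. Only the 0.5/0.5 weights come from the text; the style-reward formula is an assumption based on the least-squares form described in the AMP literature, and actual implementations may use a different mapping from discriminator output to reward. The function names are illustrative.

```python
def amp_style_reward(disc_score):
    """Map a discriminator score to a style reward in [0, 1].

    Assumption: least-squares discriminator form, where a score of 1 means
    "looks like the reference motion" and -1 means "looks simulated".
    """
    return max(0.0, 1.0 - 0.25 * (disc_score - 1.0) ** 2)


def combined_reward(task_reward, disc_score, w_task=0.5, w_style=0.5):
    """Equal-weight combination of task and style rewards."""
    return w_task * task_reward + w_style * amp_style_reward(disc_score)
```

With this weighting, a policy that produces human-like motion without making task progress can still collect up to half of the achievable reward.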

Metrics

We provide quantitative evaluations for tasks with easily measurable metrics such as high jump, long jump, hurdling, javelin, golf, single-player tennis, table tennis, penalty kicks, and free throws. These metrics are detailed in the supplementary materials, where we also present qualitative assessments for tasks that are more challenging to quantify, such as boxing, fencing, and team soccer. Specifically, success rate (Suc Rate) determines whether an agent completes a sport according to set rules. Average distance (Avg Dis) indicates the extent an agent or object travels. For sports involving ball hits, such as tennis and table tennis, we record the average number of successful ball strikes (Avg Hits). Error distance (Error Dis) measures the distance between the intended target and the actual landing spot, applicable in sports like golf, tennis, and penalty kicks. Additionally, the hit rate in golf quantifies the success of striking the ball with the club. Evaluations are performed on 1000 trials.
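The metrics above can be aggregated from per-trial records as in the following sketch. The `Trial` record layout and the function name are hypothetical; the precise definition of success for each sport is given in the supplementary materials and is not reproduced here.

```python
import math
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass
class Trial:
    """Hypothetical per-trial record for one evaluation episode."""
    success: bool
    distance: float                                  # meters traveled by agent or object
    hits: int = 0                                    # successful ball strikes
    target_xy: Optional[Tuple[float, float]] = None  # intended target
    landing_xy: Optional[Tuple[float, float]] = None # actual landing spot


def summarize(trials: Sequence[Trial]) -> dict:
    """Compute Suc Rate, Avg Dis, Avg Hits, and Error Dis over trials."""
    if not trials:
        raise ValueError("at least one trial is required")
    n = len(trials)
    # Error distance is only defined for trials with both target and landing.
    errors = [
        math.dist(t.target_xy, t.landing_xy)
        for t in trials
        if t.target_xy is not None and t.landing_xy is not None
    ]
    return {
        "suc_rate": sum(t.success for t in trials) / n,
        "avg_dis": sum(t.distance for t in trials) / n,
        "avg_hits": sum(t.hits for t in trials) / n,
        "error_dis": sum(errors) / len(errors) if errors else None,
    }
```

In the benchmark setting, `trials` would hold the 1000 evaluation episodes for a given sport and method.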

5.1 Benchmarking Popular Simulated Humanoid Algorithms

In this section, we evaluate the performance of various control methods across our sports environments. We provide qualitative results in Fig. 2 and Fig. 3, and training curves in Fig. 4. For extensive qualitative results, including human-like soccer kicks, boxing, high jump, etc., please see the supplement.

Track & Field Sports (Without Video Data)

We first evaluate track and field sports, including long jump, high jump, hurdling, and javelin throwing. For these sports, SOTA pose estimation methods fail to estimate coherent motion and global root trajectory as players and cameras are both fast-moving. Thus, we utilize a subset of the AMASS dataset containing locomotion data [21] as reference motions. Since PULSE is pretrained on AMASS, we exclude PULSE + AMP from these tests. Table 1 summarizes the quantitative results of different methods. In long jump, AMP fails entirely, often walking slowly to the jump line without a forward leap. This failure occurs because the policy prioritizes discriminator rewards over task completion. If the task is too hard, the policy will use simple motion (such as standing still) to maximize the discriminator reward instead of trying to complete the task. In contrast, PPO, while capable of jumping great distances, exhibits unnatural motions. PULSE successfully executes jumps with human-like motion, but lacks the specialized skills for top-tier records due to the absence of corresponding motion data in AMASS. The high jump displays similar patterns: PPO achieves impressive heights but with unnatural movements while AMP struggles to reconcile adversarial and task rewards. Surprisingly, as shown in Figure 2, PULSE successfully adopts a Fosbury flop approach without specific rewards to encourage this technique, likely leveraging breakdance skills. For hurdling, AMP completely fails, stopping before the first hurdle. PPO bounces energetically over each obstacle as shown in Figure 2, but sometimes falls and fails to complete the race, with an average success rate of just over 50% and an average distance of less than 110m. PULSE facilitates natural clearance of hurdles, and completes races in 17.76 seconds, a competitive time compared to human standards. 
Javelin throwing poses similar challenges: PPO uses inhuman strategies, AMP struggles with balancing rewards, and PULSE adopts human-like strategies but lacks specific skills for record-setting performance.

Table 1: Evaluation on Long Jump, High Jump, Hurdling and Javelin. World records are in parentheses.

	Long Jump (8.95m)		High Jump (2.45m)				Hurdling (12.8s)			Javelin (104.8m)
Method	Suc Rate $\uparrow$	Avg Dis $\uparrow$	Suc Rate (1m) $\uparrow$	Height (1m) $\uparrow$	Suc Rate (1.5m) $\uparrow$	Height (1.5m) $\uparrow$	Suc Rate $\uparrow$	Avg Dis $\uparrow$	Time $\downarrow$	Suc Rate $\uparrow$	Avg Dis $\uparrow$
PPO [22]	53.6%	19.42	100%	4.08	100%	4.11	57.6%	108.9	11.22	100%	44.5
AMP [18]	0%	-	0%	-	0%	-	0%	13.24	-	0.31%	2.03
PULSE [10]	100%	5.105	100%	2.01	100%	1.98	100%	122.1	17.76	100%	9.63

Table 2: Evaluation on Golf, Tennis, Table Tennis, Penalty Kick and Free Throw

	Tennis		Table Tennis		Golf		Penalty Kick		Free Throw
Method	Avg Hits $\uparrow$	Error Dis $\downarrow$	Avg Hits $\uparrow$	Error Dis $\downarrow$	Hit Rate $\uparrow$	Error Dis $\downarrow$	Suc Rate $\uparrow$	Error Dis $\downarrow$	Suc Rate $\uparrow$
PPO [22]	2.76	1.92	1.01	0.06	0%	-	0.0%	-	91.4%
AMP [18]	3.95	5.30	1.10	0.13	100%	1.43	0.0%	-	0.0%
PULSE [10]	2.48	3.50	0.74	0.19	99.9%	1.29	76.6%	0.25	85.6%
PULSE [10] + AMP [18]	2.62	3.64	1.83	0.23	99.9%	2.18	27.5%	0.27	89.8%

Sports With Video Data

For sports including golf, tennis, table tennis, and soccer penalty kick, we utilize processed motion from videos as demonstrations for AMP and PULSE+AMP. The results are reported in Table 2 and Fig. 3. In tennis, AMP demonstrates superior performance in terms of average hits; however, returned balls often land far from the intended targets. This is because prolonged rallies increase discriminator rewards, leading AMP to ignore task rewards. Notably, AMP exhibits inhuman motions at the moment of ball contact and reverts to natural movements when preparing for the next hit, as shown in Fig. 3. This behavior underscores the conflict between task and discriminator rewards. PPO plays tennis in an unnatural way, while PULSE and PULSE+AMP show similar performance. In table tennis, PPO achieves impressive error distances, but struggles with consistency and often fails to return second shots. We observe that video data proves particularly beneficial for table tennis: PULSE+AMP records significantly higher average hits with reasonable error distances. Table tennis requires quick reactions within a short time, which the pre-trained PULSE model supports by providing the necessary motor skills, enhanced by video data that guides the learning of proper stroke techniques. Golf, penalty kicks, and free throws are challenging because the humanoid must initiate contact with an object. Here, only PULSE and PULSE+AMP manage to solve all three tasks effectively and consistently, leveraging PULSE’s latent space for effective exploration. The design of these tasks often results in a sparse exploration phase in which the policy mostly triggers penalty terms, such as $c^{\text{no-dribble}}_{t}$ for moving past the ball’s initial position. The AMP reward also negatively affects penalty kick training, as the human demonstrations contain other soccer motions such as running and dribbling, which the policy finds easier to learn and exploit.

Curriculum learning

We find curriculum learning is an essential component in achieving better results for some tasks. In Table 3, we study variants of the high jump and hurdling tasks trained with and without the curriculum using PULSE. Without the curriculum, the policy fails at the 1.5m high jump and at hurdling. When faced with challenging bar and hurdle heights from the start, the policy cannot obtain any reward and gets stuck in a local minimum.

Table 3: Evaluation on curriculum learning.

	High Jump		Hurdling
Method	Suc Rate (1m)	Suc Rate (1.5m)	Suc Rate	Avg Dis	Time
w/o curriculum	100%	0%	0%	13.65	-
w/ curriculum	100%	100%	100%	122.1	17.76
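One simple way to realize a height curriculum of the kind studied in this subsection is sketched below. The schedule itself (starting height, step size, and success threshold) is entirely hypothetical and only illustrates the idea of raising the difficulty once the policy succeeds reliably; it is not the schedule used in our experiments.

```python
def next_bar_height(current, recent_success_rate, target=1.5, step=0.1, threshold=0.8):
    """Raise the bar by `step` once the policy clears it reliably, capped at `target`."""
    if recent_success_rate >= threshold:
        return min(target, current + step)
    return current


height = 0.5  # hypothetical starting height in meters
for rate in [0.9, 0.4, 0.85]:  # hypothetical success rates over recent episodes
    height = next_bar_height(height, rate)
print(round(height, 2))
```

Because the bar only rises after the policy has already collected reward at the current height, the policy is less likely to face a setting where no reward is reachable.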

6 Limitations, Conclusion and Future Work

Limitations

While SMPLOlympics provides a large collection of simulated sports environments, it is far from comprehensive. Certain sports are omitted due to simulation constraints (e.g., swimming, shooting, ice hockey, cycling) or their inherent complexity (e.g., 11-a-side soccer, equestrian events). Nevertheless, our framework is highly adaptable, allowing easy incorporation of additional sports such as climbing, rugby, and wrestling. Our initial design of rewards, though able to achieve sensible results, is also far from optimal. For competitive sports such as 2v2 soccer and basketball, our results also fall short of SOTA [8], which employs much more complex systems.

Conclusion and Future Work

We introduce SMPLOlympics, a collection of sports environments for simulated humanoids. We provide carefully designed state and reward, and benchmark humanoid control algorithms and motion priors. We find that by combining simple reward design and powerful human motion prior, one can achieve human-like behavior for solving various challenging sports. Our humanoid’s compatibility with the SMPL family of models also provides an easy way to obtain additional data from video for training, which we demonstrate to be helpful in training some sports. These well-defined simulation environments could also serve as valuable platforms for frontier models [12] to gain physical understanding. We believe that SMPLOlympics provides a valuable starting point for the community to further explore physically simulated humanoids.

References

Bae et al. [2023] Jinseok Bae, Jungdam Won, Donggeun Lim, Cheol-Hui Min, and Young Min Kim. Pmp: Learning to physically interact with environments using part-wise motion priors. In ACM SIGGRAPH 2023 Conference Proceedings, pages 1–10, 2023.
Bogo et al. [2016] Federica Bogo, Angjoo Kanazawa, Christoph Lassner, Peter Gehler, Javier Romero, and Michael J Black. Keep it smpl: Automatic estimation of 3d human pose and shape from a single image. Lect. Notes Comput. Sci., 9909 LNCS:561–578, 2016. ISSN 0302-9743,1611-3349.
Brockman et al. [2016] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. Openai gym, 2016.
Dou et al. [2023] Zhiyang Dou, Xuelin Chen, Qingnan Fan, Taku Komura, and Wenping Wang. C·ASE: Learning conditional adversarial skill embeddings for physics-based characters. In SIGGRAPH Asia 2023 Conference Papers, pages 1–11, 2023.
Haarnoja et al. [2018] Tuomas Haarnoja, Kristian Hartikainen, Pieter Abbeel, and Sergey Levine. Latent space policies for hierarchical reinforcement learning. arXiv preprint arXiv:1804.02808, 2018.
Hasenclever et al. [2020] Leonard Hasenclever, Fabio Pardo, Raia Hadsell, Nicolas Heess, and Josh Merel. CoMic: Complementary task learning & mimicry for reusable skills. In Hal Daumé III and Aarti Singh, editors, Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 4105–4115. PMLR, 2020.
Liu and Hodgins [2018] Libin Liu and Jessica Hodgins. Learning basketball dribbling skills using trajectory optimization and deep reinforcement learning. ACM Transactions on Graphics (TOG), 37(4):1–14, 2018.
Liu et al. [2021] Siqi Liu, Guy Lever, Zhe Wang, Josh Merel, S M Ali Eslami, Daniel Hennes, Wojciech M Czarnecki, Yuval Tassa, Shayegan Omidshafiei, Abbas Abdolmaleki, Noah Y Siegel, Leonard Hasenclever, Luke Marris, Saran Tunyasuvunakool, H Francis Song, Markus Wulfmeier, Paul Muller, Tuomas Haarnoja, Brendan D Tracey, Karl Tuyls, Thore Graepel, and Nicolas Heess. From motor control to team play in simulated humanoid football. arXiv preprint arXiv:2105.12196, 2021.
Loper et al. [2015] Matthew Loper, Naureen Mahmood, Javier Romero, Gerard Pons-Moll, and Michael J Black. Smpl: A skinned multi-person linear model. ACM Trans. Graph., 34, 2015. ISSN 0730-0301,1557-7368.
Luo et al. [2023a] Zhengyi Luo, Jinkun Cao, Josh Merel, Alexander Winkler, Jing Huang, Kris Kitani, and Weipeng Xu. Universal humanoid motion representations for physics-based control. arXiv preprint arXiv:2310.04582, 2023a.
Luo et al. [2023b] Zhengyi Luo, Jinkun Cao, Alexander W. Winkler, Kris Kitani, and Weipeng Xu. Perpetual humanoid control for real-time simulated avatars. In International Conference on Computer Vision (ICCV), 2023b.
Ma et al. [2023] Yecheng Jason Ma, William Liang, Guanzhi Wang, De-An Huang, Osbert Bastani, Dinesh Jayaraman, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv:2310.12931, 2023.
Mahmood et al. [2019] Naureen Mahmood, Nima Ghorbani, Nikolaus F Troje, Gerard Pons-Moll, and Michael J Black. Amass: Archive of motion capture as surface shapes. Proceedings of the IEEE International Conference on Computer Vision, 2019-Octob:5441–5450, 2019. ISSN 1550-5499.
Makoviychuk et al. [2021] Viktor Makoviychuk, Lukasz Wawrzyniak, Yunrong Guo, Michelle Lu, Kier Storey, Miles Macklin, David Hoeller, Nikita Rudin, Arthur Allshire, Ankur Handa, and Gavriel State. Isaac gym: High performance gpu-based physics simulation for robot learning. arXiv preprint arXiv:2108.10470, 2021.
Merel et al. [2018] Josh Merel, Leonard Hasenclever, Alexandre Galashov, Arun Ahuja, Vu Pham, Greg Wayne, Yee Whye Teh, and Nicolas Heess. Neural probabilistic motor primitives for humanoid control, 2018. ISSN 2331-8422.
Pavlakos et al. [2019] Georgios Pavlakos, Vasileios Choutas, Nima Ghorbani, Timo Bolkart, Ahmed A. A. Osman, Dimitrios Tzionas, and Michael J. Black. Expressive body capture: 3d hands, face, and body from a single image. In Proceedings IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2019.
Peng et al. [2018] Xue Bin Peng, Pieter Abbeel, Sergey Levine, and Michiel van de Panne. Deepmimic. ACM Trans. Graph., 37:1–14, 2018. ISSN 0730-0301.
Peng et al. [2021] Xue Bin Peng, Ze Ma, Pieter Abbeel, Sergey Levine, and Angjoo Kanazawa. Amp: Adversarial motion priors for stylized physics-based character control. ACM Trans. Graph., pages 1–20, 2021.
Peng et al. [2022] Xue Bin Peng, Yunrong Guo, Lina Halper, Sergey Levine, and Sanja Fidler. Ase: Large-scale reusable adversarial skill embeddings for physically simulated characters. arXiv preprint arXiv:2205.01906, 2022.
Rao et al. [2021] Dushyant Rao, Fereshteh Sadeghi, Leonard Hasenclever, Markus Wulfmeier, Martina Zambelli, Giulia Vezzani, Dhruva Tirumala, Yusuf Aytar, Josh Merel, Nicolas Heess, and Raia Hadsell. Learning transferable motor skills with hierarchical latent mixture policies. arXiv preprint arXiv:2112.05062, 2021.
Rempe et al. [2023] Davis Rempe, Zhengyi Luo, Xue Bin Peng, Ye Yuan, Kris Kitani, Karsten Kreis, Sanja Fidler, and Or Litany. Trace and pace: Controllable pedestrian animation via guided trajectory diffusion. arXiv preprint arXiv:2304.01893, 2023.
Schulman et al. [2017] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms, 2017. URL https://api.semanticscholar.org/CorpusID:28695052.
Sferrazza et al. [2024] Carmelo Sferrazza, Dun-Ming Huang, Xingyu Lin, Youngwoon Lee, and Pieter Abbeel. Humanoidbench: Simulated humanoid benchmark for whole-body locomotion and manipulation. arXiv preprint arXiv:2403.10506, 2024.
Tessler et al. [2023] Chen Tessler, Yoni Kasten, Yunrong Guo, Shie Mannor, Gal Chechik, and Xue Bin Peng. Calm: Conditional adversarial latent models for directable virtual characters. In ACM SIGGRAPH 2023 Conference Proceedings, 2023.
Tunyasuvunakool et al. [2020] Saran Tunyasuvunakool, Alistair Muldal, Yotam Doron, Siqi Liu, Steven Bohez, Josh Merel, Tom Erez, Timothy Lillicrap, Nicolas Heess, and Yuval Tassa. dm_control: Software and tasks for continuous control. Software Impacts, 6:100022, 2020.
Wang et al. [2024a] Jiashun Wang, Jessica Hodgins, and Jungdam Won. Strategy and skill learning for physics-based table tennis animation. In ACM SIGGRAPH 2024 Conference Proceedings, SIGGRAPH 2024, 2024a.
Wang et al. [2023] Yinhuai Wang, Jing Lin, Ailing Zeng, Zhengyi Luo, Jian Zhang, and Lei Zhang. Physhoi: Physics-based imitation of dynamic human-object interaction. arXiv preprint arXiv:2312.04393, 2023.
Wang et al. [2024b] Yufu Wang, Ziyun Wang, Lingjie Liu, and Kostas Daniilidis. Tram: Global trajectory and motion of 3d humans from in-the-wild videos. arXiv preprint arXiv:2403.17346, 2024b.
Won et al. [2021] Jungdam Won, Deepak Gopinath, and Jessica Hodgins. Control strategies for physically simulated characters performing two-player competitive sports. ACM Trans. Graph., 40:1–11, 2021. ISSN 0730-0301.
Won et al. [2022] Jungdam Won, Deepak Gopinath, and Jessica Hodgins. Physics-based character controllers using conditional vaes. ACM Trans. Graph., 41:1–12, 2022. ISSN 0730-0301.
Xie et al. [2022] Zhaoming Xie, Sebastian Starke, Hung Yu Ling, and Michiel van de Panne. Learning soccer juggling skills with layer-wise mixture-of-experts. In ACM SIGGRAPH 2022 Conference Proceedings, pages 1–9, 2022.
Xu et al. [2023] Pei Xu, Xiumin Shang, Victor Zordan, and Ioannis Karamouzas. Composite motion learning with task control. ACM Transactions on Graphics (TOG), 42(4):1–16, 2023.
Yao et al. [2022] Heyuan Yao, Zhenhua Song, Baoquan Chen, and Libin Liu. Controlvae: Model-based learning of generative controllers for physics-based characters. arXiv preprint arXiv:2210.06063, 2022.
Ye et al. [2023] Vickie Ye, Georgios Pavlakos, Jitendra Malik, and Angjoo Kanazawa. Decoupling human and camera motion from videos in the wild. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 21222–21232, 2023.
Yin et al. [2021] Zhiqi Yin, Zeshi Yang, Michiel Van De Panne, and KangKang Yin. Discovering diverse athletic jumping strategies. ACM Transactions on Graphics (TOG), 40(4):1–17, 2021.
Zhang et al. [2023] Haotian Zhang, Ye Yuan, Viktor Makoviychuk, Yunrong Guo, Sanja Fidler, Xue Bin Peng, and Kayvon Fatahalian. Learning physically simulated tennis skills from broadcast videos. ACM Trans. Graph., 42:1–14, 2023. ISSN 0730-0301,1557-7368.
Zhou et al. [2019] Yi Zhou, Connelly Barnes, Jingwan Lu, Jimei Yang, and Hao Li. On the continuity of rotation representations in neural networks. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2019-June:5738–5746, 2019. ISSN 1063-6919.
Zhu et al. [2023] Qingxu Zhu, He Zhang, Mengting Lan, and Lei Han. Neural categorical priors for physics-based character control. arXiv preprint arXiv:2308.07200, 2023.

Appendix

Appendix A Introduction

In the appendix, we provide comprehensive implementation details for SMPLOlympics, including the reward designs for each sport environment, training procedures, and hyperparameters. Extensive qualitative results can be accessed on our supplement site, where we provide visualizations of all sports environments and training results based on our preliminary reward designs. Baseline results (PPO, AMP, PULSE, PULSE+AMP) are presented to support the quantitative findings discussed in the main paper. Furthermore, we offer visualizations of the reference motion extracted from in-the-wild videos. For our pipeline to acquire the human demonstration in the SMPL format, we conduct an ablation study evaluating the impact of employing a motion imitator (PHC Luo et al. [2023b]) as a refinement step. Code, videos, and asset attributions can also be found in our supplementary materials.

Appendix B Implementation Details

B.1 Rewards and Termination Conditions

High Jump

For high jump, the humanoid’s task is to leap over a horizontal bar positioned 20m ahead and 6m to the left of its starting point. The humanoid aims to reach the goal point $\boldsymbol{p}^{\text{g-high jump}}=(22,6,1)$ located 2 m behind the bar. The reward function is defined as follows:

$$\boldsymbol{\mathcal{R}}^{\text{high jump}}(\boldsymbol{s}^{\text{p}}_{t},\boldsymbol{s}^{\text{g-high jump}}_{t})\triangleq\begin{cases}1\times r^{\text{p}}_{t}&\text{if }\boldsymbol{p}^{\text{p}}_{t,x}\leq 19.5\text{m},\\ 1\times r^{\text{p}}_{t}+1\times r^{\text{h}}_{t}&\text{if }19.5\text{m}<\boldsymbol{p}^{\text{p}}_{t,x}<20.5\text{m},\\ 1\times r^{\text{p}}_{t}&\text{if }20.5\text{m}\leq\boldsymbol{p}^{\text{p}}_{t,x}.\end{cases}\tag{1}$$

where $\boldsymbol{p}^{\text{p}}_{t,x}$ denotes the x-axis position. The height reward, $r^{\text{h}}_{t}=\boldsymbol{p}^{\text{p}}_{t,z}$, with $\boldsymbol{p}^{\text{p}}_{t,z}$ representing the z-axis position, incentivizes the humanoid to jump higher. The position reward, $r^{\text{p}}_{t}=\|\boldsymbol{p}^{\text{p}}_{t-1}-\boldsymbol{p}^{\text{g-high jump}}\|_{2}-\|\boldsymbol{p}^{\text{p}}_{t}-\boldsymbol{p}^{\text{g-high jump}}\|_{2}$ (clamped to $[0,1]$), motivates the humanoid to reach the goal. An episode is terminated if the humanoid falls down, fails to leap over the bar, or moves beyond the designated run-up area.
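The piecewise high jump reward can be sketched in a few lines. This is a minimal illustration, assuming the humanoid position is a single 3D point `(x, y, z)` in meters; the function names and example positions are hypothetical.

```python
import math

GOAL = (22.0, 6.0, 1.0)  # goal point behind the bar, from the text


def clamp(x, lo=0.0, hi=1.0):
    return max(lo, min(hi, x))


def position_reward(prev_pos, pos, goal=GOAL):
    """Progress toward the goal between two consecutive steps, clamped to [0, 1]."""
    return clamp(math.dist(prev_pos, goal) - math.dist(pos, goal))


def high_jump_reward(prev_pos, pos):
    """Eq. (1): the height term is only active within 0.5 m of the bar at x = 20 m."""
    r_p = position_reward(prev_pos, pos)
    if 19.5 < pos[0] < 20.5:
        return r_p + pos[2]  # r_h is the z coordinate of the humanoid position
    return r_p


print(high_jump_reward((10.0, 6.0, 1.0), (10.5, 6.0, 1.0)))  # run-up: position term only
print(high_jump_reward((19.9, 6.0, 1.2), (20.0, 6.0, 1.3)))  # near the bar: adds height
```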

Long Jump

In the long jump environment, the humanoid has a 20-meter runway before the jump line, which its feet should not exceed. The humanoid’s goal is to reach the goal position, $\boldsymbol{p}^{\text{g-long jump}}=(30,0,1)$ . The training reward is defined as follows:

$$\boldsymbol{\mathcal{R}}^{\text{long jump}}(\boldsymbol{s}^{\text{p}}_{t},\boldsymbol{s}^{\text{g-long jump}}_{t})\triangleq\begin{cases}1\times r^{\text{p}}_{t}+0.01\times r^{\text{v}}_{t}&\text{if }\boldsymbol{p}^{\text{p}}_{t,x}\leq 20\text{m},\\ 1\times r^{\text{p}}_{t}+0.01\times r^{\text{v}}_{t}+0.1\times r^{\text{h}}_{t}+30\times r^{\text{l}}_{t}&\text{if }20\text{m}<\boldsymbol{p}^{\text{p}}_{t,x}.\end{cases}\tag{2}$$

The position reward, $r^{\text{p}}_{t}=\|\boldsymbol{p}^{\text{p}}_{t-1}-\boldsymbol{p}^{\text{g-long jump}}\|_{2}-\|\boldsymbol{p}^{\text{p}}_{t}-\boldsymbol{p}^{\text{g-long jump}}\|_{2}$ (clamped to $[0,1]$), encourages the humanoid to reach the goal point. The velocity reward, $r^{\text{v}}_{t}=\boldsymbol{v}^{\text{p}}_{t,x}$, prompts the humanoid to reach a higher speed along the x-axis. The jump height reward, $r^{\text{h}}_{t}=\boldsymbol{p}^{\text{p}}_{t,z}$, encourages the humanoid to jump higher after reaching the jump line. The jump length reward, $r^{\text{l}}_{t}=\boldsymbol{p}^{\text{p}}_{t,x}-20$, promotes a longer final jump length. Each episode terminates if the humanoid falls or runs off the track.
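The two stages of the long jump reward can be illustrated as follows. This is a sketch under the assumption that the humanoid position is a single 3D point in meters and that `vel_x` is its forward velocity in m/s; the function name and the example inputs are hypothetical.

```python
import math

GOAL = (30.0, 0.0, 1.0)  # goal position from the text
JUMP_LINE_X = 20.0       # end of the 20 m runway


def long_jump_reward(prev_pos, pos, vel_x):
    """Eq. (2): height and length terms are added only past the jump line."""
    progress = math.dist(prev_pos, GOAL) - math.dist(pos, GOAL)
    r_p = max(0.0, min(1.0, progress))        # position reward, clamped to [0, 1]
    reward = 1.0 * r_p + 0.01 * vel_x         # velocity reward uses forward speed
    if pos[0] > JUMP_LINE_X:
        r_h = pos[2]                          # jump height reward
        r_l = pos[0] - JUMP_LINE_X            # jump length reward
        reward += 0.1 * r_h + 30.0 * r_l
    return reward


print(long_jump_reward((9.5, 0.0, 1.0), (10.0, 0.0, 1.0), vel_x=5.0))   # on the runway
print(long_jump_reward((20.5, 0.0, 1.0), (21.0, 0.0, 1.0), vel_x=5.0))  # past the line
```

The large weight on the length term means that, once past the jump line, distance travelled dominates the reward.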

Hurdling

In the hurdling task, the humanoid aims to reach a finish line 110m ahead while jumping over 10 hurdles, each 1.067m high. The first hurdle is placed 13.72m from the start, with subsequent hurdles spaced every 9.14m. The reward function is defined as $\boldsymbol{\mathcal{R}}^{\text{hurdling}}(\boldsymbol{s}^{\text{p}}_{t},\boldsymbol{s}^{\text{g-hurdling}}_{t})\triangleq r^{\text{distance}}_{t}$, which encourages the agent to run towards the finish line and clear each hurdle.

$$\boldsymbol{\mathcal{R}}^{\text{hurdling}}(\boldsymbol{s}^{\text{p}}_{t},\boldsymbol{s}^{\text{g-hurdling}}_{t})\triangleq 1\times r^{\text{distance}}_{t}\tag{3}$$

The distance reward, $r^{\text{distance}}_{t}=\|\boldsymbol{p}^{\text{p}}_{t-1}-\boldsymbol{p}^{\text{g-hurdling}}\|_{2}-\|\boldsymbol{p}^{\text{p}}_{t}-\boldsymbol{p}^{\text{g-hurdling}}\|_{2}$, is clamped to $[0,1]$ and encourages the humanoid to get closer to the goal point. We terminate each episode if the character falls or runs off the track.
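The hurdle layout described in this subsection follows directly from the stated spacing. The short sketch below computes the hurdle positions along the track; the helper `hurdles_passed` is a hypothetical convenience that only counts hurdles behind the humanoid and does not check whether they were cleared cleanly.

```python
FIRST_HURDLE_M = 13.72
SPACING_M = 9.14
NUM_HURDLES = 10
FINISH_M = 110.0

hurdle_positions = [FIRST_HURDLE_M + i * SPACING_M for i in range(NUM_HURDLES)]


def hurdles_passed(x):
    """Number of hurdles located behind a humanoid at distance x along the track."""
    return sum(1 for h in hurdle_positions if h < x)


print(round(hurdle_positions[-1], 2))             # position of the last hurdle
print(round(FINISH_M - hurdle_positions[-1], 2))  # run-in to the finish line
print(hurdles_passed(50.0))
```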

Golf

In the golf task, the humanoid is equipped with a golf club of dimensions of $0.05\text{m}\times 0.025\text{m}\times 0.02\text{m}$ . The target location for the golf ball is positioned to the left of the humanoid, in the direction of the x-axis, at a distance ranging from 0m to 20m. The reward function is defined as follows:

$$\boldsymbol{\mathcal{R}}^{\text{golf}}(\boldsymbol{s}^{\text{p}}_{t},\boldsymbol{s}^{\text{g-golf}}_{t})\triangleq 1\times r^{\text{p}}_{t}+1\times r^{\text{c}}_{t}+1\times r^{\text{g}}_{t}+1\times r^{\text{pred}}_{t}\tag{4}$$

The position reward, $r^{\text{p}}_{t}\triangleq\|\boldsymbol{p}^{\text{ball}}_{t-1}-\boldsymbol{p}^{\text{tar}}_{t}\|_{2}-\|\boldsymbol{p}^{\text{ball}}_{t}-\boldsymbol{p}^{\text{tar}}_{t}\|_{2}$, clamped such that $0<r^{\text{p}}_{t}<1$, encourages the ball to get closer to the target. The contact reward $r^{\text{c}}_{t}$ encourages swinging the golf club to hit the ball, defined as:

$$r^{\text{c}}_{t}=\begin{cases}1\times\exp(-100\times\|\boldsymbol{p}^{\text{ball}}_{t}-\boldsymbol{p}^{\text{club}}_{t}\|^{2})&\text{if }C_{\text{cb}}=0,\\ 1&\text{if }C_{\text{cb}}=1.\end{cases}\tag{5}$$

Here, $C_{\text{cb}}=0$ indicates that the club has not made contact with the ball and $C_{\text{cb}}=1$ indicates the club has made contact. The goal reward, $r^{\text{g}}_{t}=\exp(-0.1\times\|\boldsymbol{p}^{\text{ball}}_{t,xy}-\boldsymbol{p}^{\text{tar}}_{t,xy}\|^{2})$, encourages the ball to reach the target position in the x-y plane. In addition, we predict the ball’s trajectory and provide a dense reward $r^{\text{pred}}_{t}=\exp(-0.1\times\|\boldsymbol{p}^{\text{land}}-\boldsymbol{p}^{\text{tar}}_{t,xy}\|^{2})$ based on the distance between the predicted landing point and the goal on the x-y plane Zhang et al. [2023]. The landing position, $\boldsymbol{p}^{\text{land}}=(x_{\text{land}},y_{\text{land}})$, can be calculated using the ball's initial position $(x_{0},y_{0},z_{0})$ and velocity $(v_{0,x},v_{0,y},v_{0,z})$ as follows ($g$ is the gravitational acceleration):

$$x_{\text{land}}=x_{0}+v_{0,x}\left(\frac{v_{0,z}+\sqrt{v_{0,z}^{2}+2gz_{0}}}{g}\right),\quad y_{\text{land}}=y_{0}+v_{0,y}\left(\frac{v_{0,z}+\sqrt{v_{0,z}^{2}+2gz_{0}}}{g}\right)\tag{6}$$

Early termination is triggered if the ball moves backward, does not contact the golf club within 2 seconds, is too close to the humanoid’s body, or the humanoid falls.
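The landing prediction in Eq. (6) is the closed-form solution of drag-free projectile motion down to the ground plane. A minimal sketch follows, assuming $g = 9.81\,\text{m/s}^2$ and a reward that compares the predicted landing point with the target, as described in the text; the function names and the example launch state are hypothetical.

```python
import math

GRAVITY = 9.81  # m/s^2, assumed value


def predict_landing_xy(pos, vel, g=GRAVITY):
    """Eq. (6): ballistic landing point on the ground plane z = 0 (no drag, no spin)."""
    x0, y0, z0 = pos
    vx, vy, vz = vel
    flight_time = (vz + math.sqrt(vz * vz + 2.0 * g * z0)) / g
    return (x0 + vx * flight_time, y0 + vy * flight_time)


def pred_reward(landing_xy, target_xy, scale=0.1):
    """Dense reward that decays with the squared distance from the target."""
    return math.exp(-scale * math.dist(landing_xy, target_xy) ** 2)


landing = predict_landing_xy(pos=(0.0, 0.0, 0.0), vel=(10.0, 0.0, 9.81))
print(landing, pred_reward(landing, target_xy=(20.0, 0.0)))
```

Because the prediction is available as soon as the ball leaves the club, the policy receives feedback on the shot long before the ball actually lands.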

Javelin

For javelin throw, the humanoid is equipped with a javelin of length 2.7m. Due to the complexity introduced by articulated fingers, the reward function $\boldsymbol{\mathcal{R}}^{\text{javelin}}$ is applied in three stages: first, the humanoid learns to hold the javelin stably; then, it learns to throw it; finally, the javelin flies as far as possible. A timer is used to differentiate the three stages. Specifically, $\boldsymbol{\mathcal{R}}^{\text{javelin}}$ is defined as follows:

$$\boldsymbol{\mathcal{R}}^{\text{javelin}}(\boldsymbol{s}^{\text{p}}_{t},\boldsymbol{s}^{\text{g-javelin}}_{t})\triangleq\begin{cases}0.9\times r^{\text{grab}}_{t}+0.1\times r^{\text{js}}_{t}&\text{if }t<0.6\text{s},\\ 0.9\times r^{\text{goal}}_{t}+0.05\times r^{\text{s}}_{t}-0.05\times r^{\text{grab}}_{t}&\text{if }0.6\text{s}\leq t<1.2\text{s},\\ 0.9\times r^{\text{goal}}_{t}+0.1\times r^{\text{js}}_{t}&\text{if }1.2\text{s}\leq t.\end{cases}\tag{7}$$

The reward for grasping, $r^{\text{grab}}_{t}=\exp(-1\times\|\boldsymbol{p}^{\text{right-hand}}_{t}-\boldsymbol{p}^{\text{javelin}}_{t}\|^{2})$, encourages the hand to stay close to the javelin. The javelin stability reward, $r^{\text{js}}_{t}=\exp(-1\times\|\boldsymbol{q}^{\text{javelin}}_{t}-\boldsymbol{q}^{\text{javelin-default}}_{t}\|^{2})$, encourages the 6 DoF pose of the javelin to remain close to the default pose, which faces forward and tilts 30 degrees upward, mimicking a flying pose. The humanoid stability reward, $r^{\text{s}}_{t}=\exp(-1\times\|\boldsymbol{p}^{\text{root}}_{t}\|^{2})$, encourages the humanoid to keep its root position fixed. The termination conditions vary according to the stage: during the grasping and throwing stages, the episode terminates if the javelin is too far from the right hand or deviates significantly from the default pose $\boldsymbol{q}^{\text{javelin-default}}_{t}$. During the flying stage, termination occurs if the javelin is too close to the right hand.
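The timer-based staging of the javelin reward can be sketched as below. The individual reward terms are passed in as precomputed scalars, since $r^{\text{goal}}_{t}$ in particular is not spelled out here; the function names and example values are hypothetical.

```python
import math


def grab_reward(hand_pos, javelin_pos):
    """Grasp term: exp(-||p_hand - p_javelin||^2)."""
    return math.exp(-math.dist(hand_pos, javelin_pos) ** 2)


def javelin_reward(t, r_grab, r_js, r_goal, r_s):
    """Eq. (7): three stages selected by elapsed time t in seconds."""
    if t < 0.6:    # stage 1: hold the javelin stably
        return 0.9 * r_grab + 0.1 * r_js
    if t < 1.2:    # stage 2: throw, with a small penalty for still gripping
        return 0.9 * r_goal + 0.05 * r_s - 0.05 * r_grab
    return 0.9 * r_goal + 0.1 * r_js  # stage 3: javelin in flight


r_grab = grab_reward((0.0, 0.0, 1.5), (0.0, 0.0, 1.5))
print(javelin_reward(0.3, r_grab, r_js=1.0, r_goal=0.0, r_s=0.0))
print(javelin_reward(0.9, r_grab, r_js=0.0, r_goal=1.0, r_s=1.0))
```

Note how the sign of the grasp term flips in the second stage: holding on is rewarded early and penalized once the throw should be released.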

B.2 Multi-person Sports

Tennis

For tennis, each humanoid is equipped with a circular racket with a 15cm radius, positioned 35cm away from the wrist, replacing the right hand. The court measures 23.77m in length and 8.23m in width, mirroring the dimensions and layout of a real tennis court. The net height is 1m, and the simulated ball has a radius of 3.2cm. We design two tasks: a single-player ball return task, where the humanoid trains to hit balls launched randomly, and a 1v1 mode, where the humanoid competes against another humanoid. In the ball return task, the humanoid is positioned at the center of the baseline, with balls launched from the opposite side. The landing location is uniformly sampled on the opposite side and the ball launch velocity is randomly sampled. The reward function is defined as follows:

$$\boldsymbol{\mathcal{R}}^{\text{tennis}}(\boldsymbol{s}^{\text{p}}_{t},\boldsymbol{s}^{\text{g-tennis}}_{t})\triangleq\begin{cases}1\times r^{\text{racket}}_{t}+0\times r^{\text{ball}}_{t},&\text{if }C_{\text{rb}}=0,\\ 0\times r^{\text{racket}}_{t}+1\times r^{\text{ball}}_{t},&\text{if }C_{\text{rb}}=1.\end{cases}\tag{8}$$

Here, $C_{\text{rb}}=0$ indicates that the racket has not made contact with the ball, and $C_{\text{rb}}=1$ indicates the racket has made contact. $r^{\text{racket}}_{t}=\exp(-1\times\|\boldsymbol{p}^{\text{racket}}_{t}-\boldsymbol{p}^{\text{ball}}_{t}\|^{2})$ rewards the racket for getting closer to the ball. $r^{\text{ball}}_{t}=1+\exp(-1\times\|\boldsymbol{p}^{\text{land}}-\boldsymbol{p}^{\text{tar}}_{t}\|^{2})$ encourages the predicted landing location of the ball to be close to the target. Similar to the golf task, the landing location of the ball is calculated based on $\boldsymbol{p}^{\text{ball}}_{t}$ and $\boldsymbol{v}^{\text{ball}}_{t}$, providing a dense reward function to facilitate training Zhang et al. [2023]. Early termination occurs if the humanoid loses the point, either by failing to catch the ball or by hitting the ball out of bounds. In the 1v1 mode, two humanoids are placed on opposite sides of the court and the first ball is launched from the middle of the court, randomly directed at each player. The same reward function as the ball return task is used. To facilitate 1v1 training, the pre-trained model from the ball return task is used as a warm start. Similarly, the episode terminates if one player fails to catch the ball or returns the ball out of bounds.
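The contact-gated structure of the tennis reward can be written compactly. The sketch below assumes 3D racket and ball positions and a 2D target on the court; the function name and the example inputs are hypothetical.

```python
import math


def tennis_reward(racket_pos, ball_pos, landing_xy, target_xy, contact):
    """Eq. (8): approach the ball before contact, aim the return after contact."""
    if not contact:
        # r_racket: bring the racket close to the ball.
        return math.exp(-math.dist(racket_pos, ball_pos) ** 2)
    # r_ball: the constant 1 keeps the post-contact reward at least as large as
    # the pre-contact one; the second term rewards landing near the target.
    return 1.0 + math.exp(-math.dist(landing_xy, target_xy) ** 2)


before = tennis_reward((0.0, 0.0, 1.0), (1.0, 0.0, 1.0), None, None, contact=False)
after = tennis_reward(None, None, (5.0, 2.0), (5.0, 3.0), contact=True)
print(before, after)
```

Since the pre-contact term never exceeds 1 and the post-contact term is always above 1, making contact is never worse for the policy than hovering near the ball.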

Table Tennis

For table tennis, each humanoid is equipped with a circular paddle with an 8 cm radius, positioned 12 cm from the wrist, replacing the right hand. The table adheres to standard dimensions, featuring a playing surface 2.74 m in length and 1.525 m in width, standing 0.76 m high. The net is 15.25 cm high, and the table tennis ball has a radius of 2 cm. The setup includes a single-player ball return task and a 1v1 task. The reward function is designed similarly to tennis, except we define the ball reward as $r^{\text{ball}}_{t}=1+\exp(-1\times\|\boldsymbol{p}^{\text{land}}-\boldsymbol{p}^{\text{tar}}_{t}\|^{2})+N_{\text{hit}}$, where $N_{\text{hit}}$ counts the number of successful hits in one episode. This formulation is intended to encourage the humanoid to continuously hit the ball effectively. Unlike in golf and tennis, we calculate $\boldsymbol{p}^{\text{land}}$ when it lands on the table at a height of 0.76 m. For early termination and the warm start in 1v1, we maintain the same setting as in the tennis task.
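Predicting the landing point on the table rather than on the ground only changes the height at which the ballistic trajectory is intersected. The sketch below generalizes the ground-plane formula to a plane at height $h$ by solving $z_{0}+v_{0,z}t-\tfrac{1}{2}gt^{2}=h$ for the later root. This derivation, the value $g = 9.81\,\text{m/s}^2$, and the function names are our assumptions for illustration.

```python
import math

TABLE_HEIGHT = 0.76  # m, from the text
GRAVITY = 9.81       # m/s^2, assumed value


def predict_landing_on_table(pos, vel, height=TABLE_HEIGHT, g=GRAVITY):
    """Drag-free landing point on the plane z = height, or None if never reached."""
    x0, y0, z0 = pos
    vx, vy, vz = vel
    disc = vz * vz + 2.0 * g * (z0 - height)
    if disc < 0.0:
        return None  # the ball starts below the table and cannot rise to it
    flight_time = (vz + math.sqrt(disc)) / g
    return (x0 + vx * flight_time, y0 + vy * flight_time)


def ball_reward(landing_xy, target_xy, num_hits):
    """Table tennis ball reward: tennis term plus the running hit count N_hit."""
    return 1.0 + math.exp(-math.dist(landing_xy, target_xy) ** 2) + num_hits


landing = predict_landing_on_table((0.0, 0.0, 0.76), (2.0, 0.0, 4.905))
print(landing, ball_reward(landing, (2.0, 0.0), num_hits=3))
```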

Fencing

For 1v1 fencing, similar to real-world fencing, the two players are confined to a 14m by 2m playing area, and stepping out of bounds resets the game. The fencing reward is structured similarly to the boxing setup in NCP Zhu et al. [2023]:

$$\boldsymbol{\mathcal{R}}^{\text{fencing}}(\boldsymbol{s}^{\text{p}}_{t},\boldsymbol{s}^{\text{g-fencing}}_{t})\triangleq 0.1\times r^{\text{facing}}_{t}+0.1\times r^{\text{vel}}_{t}+0.6\times r^{\text{strike}}_{t}+1\times r^{\text{point}}_{t}.\tag{9}$$

The facing reward $r^{\text{facing}}_{t}$ penalizes deviation from facing the opponent’s root position $\boldsymbol{p}^{\text{opp-root}}_{t}$. The velocity reward, $r^{\text{vel}}_{t}$, encourages the x-y plane linear velocity to be directed towards the opponent’s root position $\boldsymbol{p}^{\text{opp-root}}_{t}$. The strike reward, $r^{\text{strike}}_{t}=\exp(-10\times\operatorname*{argmin}\|\boldsymbol{p}^{\text{sword}}_{t}-\boldsymbol{p}^{\text{opp-target}}_{t}\|^{2})$, encourages the sword tip to get closer to the target body parts $\boldsymbol{p}^{\text{opp-target}}_{t}$, which include the pelvis, head, spine, chest, and torso. If there is contact with the target body part with sufficient force, a positive reward is provided:

$$r^{\text{point}}_{t}=\begin{cases}1&\text{if }\operatorname*{argmin}\|\boldsymbol{p}^{\text{sword}}_{t}-\boldsymbol{p}^{\text{opp-target}}_{t}\|^{2}\leq 0.1\enskip\text{and}\enskip\text{contact force}\geq 50\text{N},\\ 0&\text{otherwise}.\end{cases}\tag{10}$$

Our fencing agents are trained using competitive self-play, as introduced in the main paper.
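The strike and point terms of the fencing reward can be sketched as follows. We read the minimization in the reward definition as taking the smallest squared distance over the target body parts, which is our interpretation; the facing and velocity terms are passed in as precomputed scalars, and all function names and example values are hypothetical.

```python
import math


def min_sq_distance(sword_tip, targets):
    """Smallest squared distance from the sword tip to any target body part."""
    return min(math.dist(sword_tip, p) ** 2 for p in targets)


def strike_reward(sword_tip, targets):
    return math.exp(-10.0 * min_sq_distance(sword_tip, targets))


def point_reward(sword_tip, targets, contact_force):
    """1 if the tip is within the distance threshold and the contact is forceful enough."""
    hit = min_sq_distance(sword_tip, targets) <= 0.1 and contact_force >= 50.0
    return 1.0 if hit else 0.0


def fencing_reward(r_facing, r_vel, sword_tip, targets, contact_force):
    """Weighted sum of facing, velocity, strike, and point terms (Eq. 9)."""
    return (0.1 * r_facing + 0.1 * r_vel
            + 0.6 * strike_reward(sword_tip, targets)
            + 1.0 * point_reward(sword_tip, targets, contact_force))


targets = [(1.0, 0.0, 1.0), (1.0, 0.0, 1.6)]  # e.g. pelvis and head positions
print(fencing_reward(1.0, 1.0, (1.0, 0.0, 1.0), targets, contact_force=60.0))
print(fencing_reward(1.0, 1.0, (0.0, 0.0, 1.0), targets, contact_force=0.0))
```

For boxing, the same sketch applies with the hand position in place of the sword tip.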

Boxing

For boxing, the humanoid competes in a boxing ring measuring 5m by 5m. The humanoid’s right hand is replaced with a sphere of 8cm radius. The boxing reward function has the same composition as fencing, except that the sword tip position ${\boldsymbol{p}^{\text{sword}}_{t}}$ is replaced by the hand position ${\boldsymbol{p}^{\text{hand}}_{t}}$ . Our boxing agents are also trained using competitive self-play.

Soccer

The soccer field measures 32m in length and 20m in width. Each goal is 4m wide and 2m tall. The ball has a diameter of 11.5 cm and weighs 450 grams. For the penalty kick task, the reward function $\boldsymbol{\mathcal{R}}^{\text{soccer-kick}}(\boldsymbol{s}^{\text{p}}_{t},\boldsymbol{s}^{\text{g-kick}}_{t})\triangleq w^{\text{p2b}}r^{\text{p2b}}+w^{\text{b2g}}r^{\text{b2g}}+w^{\text{bv2g}}r^{\text{bv2g}}+w^{\text{b2t}}r^{\text{b2t}}-c^{\text{no-dribble}}_{t}$ is divided into stages based on whether the ball is moving toward the goal. Specifically, we define a "closer to goal" variable as $g^{\text{ball-to-goal}}_{t}=\|\boldsymbol{p}^{\text{goal-target}}_{t}-\boldsymbol{p}^{\text{ball}}_{t-1}\|_{2}-\|\boldsymbol{p}^{\text{goal-target}}_{t}-\boldsymbol{p}^{\text{ball}}_{t}\|_{2}$, which indicates whether the ball is getting closer to the goal. The full reward function is defined as follows:

$$\boldsymbol{\mathcal{R}}^{\text{soccer-kick}}({\boldsymbol{s}^{\text{p}}_{t}},{\boldsymbol{s}^{\text{g-kick}}_{t}})\triangleq\begin{cases}0.4\times r^{\text{p2b}}-c^{\text{no-dribble}}_{t}&\text{if }{g^{\text{ball-to-goal}}_{t}}\leq 0,\\ 0.1\times r^{\text{b2g}}+0.1\times r^{\text{bv2g}}+0.8\times r^{\text{b2t}}-c^{\text{no-dribble}}_{t}&\text{otherwise}.\end{cases}\tag{11}$$

Essentially, if the ball is not moving toward the goal, the humanoid is encouraged to move toward the ball; once the ball is moving toward the goal, the agent is rewarded for shooting the ball toward the target in the goal. The player-to-ball reward, $r^{\text{p2b}}=\|{\boldsymbol{p}^{\text{root}}_{t-1}}-{\boldsymbol{p}^{\text{ball}}_{t-1}}\|_{2}-\|{\boldsymbol{p}^{\text{root}}_{t}}-{\boldsymbol{p}^{\text{ball}}_{t}}\|_{2}$, is a point-goal reward Won et al. [2022]. The ball-to-goal reward $r^{\text{b2g}}=\|{\boldsymbol{p}^{\text{goal-target}}_{t}}-{\boldsymbol{p}^{\text{ball}}_{t-1}}\|_{2}-\|{\boldsymbol{p}^{\text{goal-target}}_{t}}-{\boldsymbol{p}^{\text{ball}}_{t}}\|_{2}$ encourages the ball to move closer to the goal position. The ball-velocity-to-goal reward $r^{\text{bv2g}}$ encourages the ball’s velocity to point toward the goal position. The ball-to-target reward $r^{\text{b2t}}$ predicts the landing position of the ball in the net based on its current velocity and position, and provides a reward if the predicted landing position is close to the target. Finally, $c^{\text{no-dribble}}_{t}$ penalizes the humanoid if its root position moves beyond the ball’s spawning point.
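The staged structure of the penalty kick reward can be illustrated with a short Python sketch. It is an illustration under stated assumptions rather than the released code: the function name and argument layout are hypothetical, and $r^{\text{bv2g}}$, $r^{\text{b2t}}$, and $c^{\text{no-dribble}}_{t}$ are passed in as precomputed numbers because their exact formulas are not given here. Only the stage condition and the weights 0.4, 0.1, 0.1, and 0.8 come from the text.

```python
import math

def soccer_kick_reward(root_prev, root_now, ball_prev, ball_now,
                       goal_target, r_bv2g, r_b2t, c_no_dribble):
    """Two-stage penalty kick reward.

    Positions are coordinate tuples (hypothetical layout). r_bv2g, r_b2t and
    c_no_dribble are assumed to be computed elsewhere.
    """
    # "Closer to goal" variable: positive when the ball approached the target
    # between the previous and the current step.
    g_ball_to_goal = (math.dist(goal_target, ball_prev)
                      - math.dist(goal_target, ball_now))

    if g_ball_to_goal <= 0:
        # Stage 1: ball is not approaching the goal, so reward the player
        # for reducing its distance to the ball.
        r_p2b = (math.dist(root_prev, ball_prev)
                 - math.dist(root_now, ball_now))
        return 0.4 * r_p2b - c_no_dribble

    # Stage 2: ball is approaching the goal, so reward the shot itself.
    r_b2g = g_ball_to_goal  # same expression as the ball-to-goal reward
    return 0.1 * r_b2g + 0.1 * r_bv2g + 0.8 * r_b2t - c_no_dribble
```

Note that $r^{\text{b2g}}$ and the stage variable share the same expression, so the second branch reuses the value instead of recomputing it.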

The team-play (1v1 and 2v2) soccer tasks use rewards similar to those of the penalty kick task. The reward function for team play is $\boldsymbol{\mathcal{R}}^{\text{soccer-match}}({\boldsymbol{s}^{\text{p}}_{t}},{\boldsymbol{s}^{\text{g-soccer}}_{t}})\triangleq w^{\text{p2b}}r^{\text{p2b}}+w^{\text{b2g}}r^{\text{b2g}}+w^{\text{bv2g}}r^{\text{bv2g}}+w^{\text{point}}r^{\text{point}}$, where $r^{\text{p2b}}$ and $r^{\text{b2g}}$ are the same as in the penalty kick, and $r^{\text{point}}$ provides a one-time bonus for scoring.

Basketball

The basketball environment is similar to soccer except that it utilizes the SMPL-X humanoid with articulated fingers. In the free-throw task, the ball is initialized between the humanoid’s hands. The free-throw reward is defined as $\boldsymbol{\mathcal{R}}^{\text{free-throw}}({\boldsymbol{s}^{\text{p}}_{t}},{\boldsymbol{s}^{\text{g-soccer}}_{t}})\triangleq 0.5\times r^{\text{ballvel}}+0.5\times r^{\text{bv2g}}+r^{\text{basket}}$. The basketball velocity reward $r^{\text{ballvel}}=\exp(-0.1\times\|{\boldsymbol{{v}}^{\text{ball}}_{t}}-{\boldsymbol{{v}}^{\text{ball-desired}}_{t}}\|_{2}^{2})$ encourages the ball’s velocity to be close to the desired velocity needed to reach the goal. The desired velocity, ${\boldsymbol{{v}}^{\text{ball-desired}}_{t}}$, is computed from the goal position ${\boldsymbol{p}^{\text{goal-target}}_{t}}$ and the ball position ${\boldsymbol{p}^{\text{ball}}_{t}}$ with the following physics equations:

$$\begin{aligned}T^{\text{reach}}_{t}&=\sqrt{\frac{2\times\|({\boldsymbol{p}^{\text{ball}}_{t}}-{\boldsymbol{p}^{\text{goal-target}}_{t}})_{z}\|_{2}}{g}},\quad{\boldsymbol{{v}}^{\text{ball-desired}}_{t,xy}}=\frac{\|({\boldsymbol{p}^{\text{ball}}_{t}}-{\boldsymbol{p}^{\text{goal-target}}_{t}})_{xy}\|_{2}}{T^{\text{reach}}_{t}},\\ {\boldsymbol{{v}}^{\text{ball-desired}}_{t,z}}&=\frac{({\boldsymbol{p}^{\text{ball}}_{t}}-{\boldsymbol{p}^{\text{goal-target}}_{t}})_{z}+0.5\times g\times(T^{\text{reach}}_{t})^{2}}{T^{\text{reach}}_{t}}.\end{aligned}\tag{12}$$

The ball-velocity-to-goal reward $r^{\text{bv2g}}$ encourages the velocity to be directed towards the goal position. The basket reward, $r^{\text{basket}}$ , provides a one-time reward if the ball passes through the basket.
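As a minimal sketch of how the free-throw terms combine, the snippet below computes the ball velocity reward and the weighted sum. The function names and the tuple-based velocity format are assumptions for illustration; the desired velocity, $r^{\text{bv2g}}$, and $r^{\text{basket}}$ are taken as precomputed inputs rather than derived here. Only the scale 0.1 and the weights 0.5, 0.5, and 1 come from the text.

```python
import math

def ball_velocity_reward(v_ball, v_desired, scale=0.1):
    """exp(-0.1 * squared L2 error between actual and desired ball velocity)."""
    sq_err = sum((a - b) ** 2 for a, b in zip(v_ball, v_desired))
    return math.exp(-scale * sq_err)

def free_throw_reward(v_ball, v_desired, r_bv2g, r_basket):
    """Weighted sum of the three free-throw terms.

    r_bv2g (velocity directed at the goal) and r_basket (one-time bonus when
    the ball passes through the basket) are assumed to be computed elsewhere.
    """
    r_ballvel = ball_velocity_reward(v_ball, v_desired)
    return 0.5 * r_ballvel + 0.5 * r_bv2g + r_basket
```

Because $r^{\text{ballvel}}$ is bounded in (0, 1], the two dense terms contribute at most 1 in total when $r^{\text{bv2g}}$ is also at most 1, so the basket bonus remains a significant part of the signal.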

Team-play basketball uses a reward design similar to that of soccer. The task is highly challenging because picking the ball up is more complex than kicking it. Thus, while we support 1v1 and 2v2 team-play basketball, our preliminary reward design does not yet yield interesting behavior, unlike in soccer.

B.3 Hyperparameters

Training hyperparameters are provided in Table 4. We use the same set of hyperparameters to train all of our sports environments, highlighting the advantage of employing a unified humanoid embodiment for simulated sports.

Table 4: Hyperparameters for training each baseline used in SMPLOlympics. We use the same set of hyperparameters for each sport. Note that AMP and PULSE use PPO as the optimization method but add their respective motion priors (as a reward or as a motion representation). $\sigma$: fixed variance for policy. $\gamma$: discount factor. $\epsilon$: clip range for PPO. $w_{\text{disc}}$ and $w_{\text{task}}$: weights for discriminator and task rewards.

Method	Batch Size	Learning Rate	$\sigma$	$\gamma$	$\epsilon$	MLP-size	$w_{\text{disc}}$	$w_{\text{task}}$	# of samples
PPO Schulman et al. [2017]	1024	$5\times 10^{-4}$	0.05	0.99	0.2	[2048, 1024, 512]	0	1	$\sim 10^{9}$
AMP Peng et al. [2021]	1024	$5\times 10^{-4}$	0.05	0.99	0.2	[2048, 1024, 512]	0.5	0.5	$\sim 10^{9}$
PULSE Luo et al. [2023a]	1024	$5\times 10^{-4}$	0.3	0.99	0.2	[2048, 1024, 512]	0	1	$\sim 10^{9}$
PULSE Luo et al. [2023a] + AMP Peng et al. [2021]	1024	$5\times 10^{-4}$	0.3	0.99	0.2	[2048, 1024, 512]	0.5	0.5	$\sim 10^{9}$
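For readers who want to reproduce the settings, the rows of Table 4 can be written as a small configuration sketch. The class name, field names, and the `total_reward` helper are hypothetical and not taken from the released code; in particular, combining the task and discriminator rewards as a weighted sum is an assumption based on how $w_{\text{task}}$ and $w_{\text{disc}}$ are described. The numeric values are copied from the table, and each baseline is trained for roughly $10^{9}$ samples.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TrainConfig:
    """Defaults correspond to the PPO row of Table 4."""
    batch_size: int = 1024
    learning_rate: float = 5e-4
    sigma: float = 0.05          # fixed variance for the policy
    gamma: float = 0.99          # discount factor
    clip_eps: float = 0.2        # PPO clip range
    mlp_size: tuple = (2048, 1024, 512)
    w_disc: float = 0.0          # weight of the discriminator (style) reward
    w_task: float = 1.0          # weight of the task reward

# Only the fields that differ from the PPO row are overridden.
BASELINES = {
    "PPO": TrainConfig(),
    "AMP": TrainConfig(w_disc=0.5, w_task=0.5),
    "PULSE": TrainConfig(sigma=0.3),
    "PULSE+AMP": TrainConfig(sigma=0.3, w_disc=0.5, w_task=0.5),
}

def total_reward(cfg, r_task, r_disc):
    """Assumed combination rule: weighted sum of task and discriminator rewards."""
    return cfg.w_task * r_task + cfg.w_disc * r_disc
```

Writing the baselines as overrides of one default makes it easy to see that they differ only in $\sigma$ and the reward weights.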

B.4 Details about Baselines

For our baseline methods PULSE Luo et al. [2023a] and AMP Peng et al. [2021], we use the official implementations. For PULSE Luo et al. [2023a], we employ the publicly released model without modification, which is pre-trained on the AMASS dataset. We follow a similar setup for downstream tasks in PULSE, using the frozen prior $\boldsymbol{\mathcal{P}}_{\text{PULSE}}$, decoder $\boldsymbol{\mathcal{D}}_{\text{PULSE}}$, and residual action representation. Since PULSE only provides trained models for the SMPL humanoid, we train the SMPL-X humanoid models ourselves following the official code. Specifically, we train a humanoid motion imitator following PHC Luo et al. [2023b], and distill motor skills into a 48-dimensional latent space (instead of 32-D, to accommodate articulated fingers). PULSE provides an action space for hierarchical RL and can be integrated with AMP. For PULSE+AMP, the AMP reward offers additional style guidance for the humanoid, which is particularly beneficial for tasks such as table tennis. However, we find that the demonstration sequences used for AMP need to be task-specific (e.g., containing only a swinging motion); otherwise, the discriminator reward can overpower the task reward and lead to undesired behavior (as seen in the free kick results).

Appendix C Additional Ablations

We conducted an ablation study to evaluate the role of physics-based tracking (w/ PHC) in acquiring human reference motion. Specifically, we used the pose estimation results directly from TRAM Wang et al. [2024b] as positive samples for the discriminator during policy training (w/o PHC). Our experiments were performed in the context of table tennis. As shown in Table 5, we found that providing video data without PHC leads to significantly lower performance compared to using PHC, similar to the results obtained using only PULSE. We observe that when the quality of the provided reference motion is poor (e.g., with significant noise in position and drastic velocity changes), the model struggles to effectively utilize the reference motion as style guidance to achieve natural movements. In contrast, employing physics-based tracking to refine pose estimates from in-the-wild videos results in physically plausible motion, which significantly aids in policy learning.

Table 5: Ablation study on PHC.

	Table Tennis
Method	Avg Hits $\uparrow$	Error Dis $\downarrow$
PULSE	0.74	0.19
PULSE+AMP, w/o PHC	0.91	0.18
PULSE+AMP, w/ PHC	1.83	0.23

Appendix D Broader Social Impact

We propose SMPLOlympics, a collection of sports environments for simulated humanoids. These environments can be used to benchmark learning algorithms, discover new humanoid behaviors, create animations, and more. The potential negative social impact includes the risk of generating animations that could be used to create DeepFakes. Positive social impact includes the development of intelligent and collaborative agents, advancements in robot learning, discovery of new sports techniques, and the generation of immersive and physically realistic animations.