SMPLOlympics: Sports Environments for Physically Simulated Humanoids

Zhengyi Luo1,*  Jiashun Wang1,*  Kangni Liu1,*  Haotian Zhang2  Chen Tessler2  Jingbo Wang  Ye Yuan2  Jinkun Cao1  Zihui Lin1  Fengyi Wang1  Jessica Hodgins1  Kris Kitani1
1Carnegie Mellon University; 2Nvidia
https://smplolympics.github.io/SMPLOlympics
*Equal Contribution
Abstract

We present SMPLOlympics, a collection of physically simulated environments that allow humanoids to compete in a variety of Olympic sports. Sports simulation offers a rich and standardized testing ground for evaluating and improving the capabilities of learning algorithms due to the diversity and physically demanding nature of athletic activities. As humans have been competing in these sports for many years, there is also a plethora of existing knowledge on the preferred strategy to achieve better performance. To leverage these existing human demonstrations from videos and motion capture, we design our humanoid to be compatible with the widely-used SMPL and SMPL-X human models from the vision and graphics community. We provide a suite of individual sports environments, including golf, javelin throw, high jump, long jump, and hurdling, as well as competitive sports, including both 1v1 and 2v2 games such as table tennis, tennis, fencing, boxing, soccer, and basketball. Our analysis shows that combining strong motion priors with simple rewards can result in human-like behavior in various sports. By providing a unified sports benchmark and baseline implementation of state and reward designs, we hope that SMPLOlympics can help the control and animation communities achieve human-like and performant behaviors.

1 Introduction

Competitive sports, much like their role in human society, offer a standardized way of measuring the performance of learning algorithms and creating emergent human behavior. While there exist isolated efforts to bring individual sports into physics simulation [8, 36, 7, 35, 29, 26], each work uses a different humanoid, simulator, and learning algorithm, which prevents unified evaluation. Their specially built humanoids also make it difficult to acquire compatible motion data, as retargeting might be required to translate motion to each humanoid. Building a collection of simulated sports environments that uses a shared humanoid embodiment and training pipeline is challenging, as it requires expert knowledge in humanoid design, reinforcement learning (RL), and physics simulation.

These challenges have led previous benchmarks and simulated environments [3, 25] to focus mainly on locomotion tasks for humanoids. While these tasks (e.g., moving forward, getting up from the ground, traversing terrains) are useful as benchmarks, they lack the depth and diversity needed to induce a wide range of behaviors and strategies. As a result, these environments do not fully exploit the potential of humanoids to discover actions and skills found in real-world human activities.

Another important aspect of working with simulated humanoids is the ease of obtaining human demonstrations. The resemblance to the human body makes humanoids capable of performing a diverse set of skills; a human can also easily judge the strategies used by humanoids. Curated human motion can be used either as a motion prior [17, 18, 24] or in evaluation protocols. Thus, having an easy way to obtain new human motion data compatible with the humanoid, either from motion capture (MoCap) or videos, is critical for simulated humanoid environments.

In this work, we propose SMPLOlympics, a collection of physically simulated environments for a variety of Olympic sports. SMPLOlympics offers a wide range of sports scenarios that require not only locomotion skills, but also manipulation, coordination, and planning. Unified under one humanoid embodiment, our environments provide a rich set of challenges for developing and testing embodied agents. We use humanoids compatible with the SMPL family of models, which enables the direct conversion of human motion in the SMPL format to our humanoid. For tasks that require articulated fingers, we use an SMPL-X [16] based humanoid, which has many more degrees of freedom (DoF); for tasks that do not need hands, we use SMPL [2]. The SMPL family of models is widely adopted in the vision and graphics community, which gives us access to human pose estimation methods [34] capable of extracting coherent motion from videos. The existing large-scale human motion dataset [13] in the SMPL format also helps build general-purpose motion representations for humanoids [10].

Our sports environments support both individual and competitive sports, providing a comprehensive platform for testing and benchmarking. For individual sports, we include activities such as golf, javelin throw, high jump, long jump, and hurdling. Competitive sports in our suite include 1v1 games such as ping pong, tennis, fencing, and boxing, as well as team sports such as soccer and basketball. To facilitate benchmarking, we also include tasks such as penalty kicks (for soccer) and ball-target hitting (for ping pong and tennis), for which performance is easy to measure. To demonstrate the importance of human demonstrations, we extract motion from videos using off-the-shelf pose estimation methods, and show that using human motion data as a motion prior [18] significantly improves human likeness in the resulting motion. We also test recent motion representations in simulated humanoid control using hierarchical RL [10], and show that a learned motion representation combined with simple rewards can lead to versatile, human-like behaviors that achieve impressive sports results (e.g., discovering the Fosbury flop for high jump).

In conclusion, our contributions are: (1) we propose SMPLOlympics, a collection of simulated environments that allow humanoids to compete in 10 Olympic sports; (2) we provide a pipeline to extract human demonstration data from videos and show their effectiveness in helping build human-like strategies in simulated sports; (3) we provide the starting state and reward designs for each sport, benchmark state-of-the-art algorithms, and show that simple rewards combined with a strong motion prior can lead to impressive sports feats.

2 Related Works

Simulated Humanoid Sports

Simulated humanoid sports can help generate animations and explore optimal sports strategies. Research has focused on various individual sports within simulated environments, including tennis [36], table tennis [26], boxing [29, 38], fencing [29], basketball dribbling [7, 27], and soccer [31, 8]. These studies leverage human motion to achieve human-like behaviors, using it to acquire motor skills [8, 29] or establish a motion prior [36]. However, the diversity in humanoid definitions across studies makes it difficult to aggregate additional human demonstration data due to the need for retargeting. Furthermore, the task-specific training pipelines in these studies are hard to generalize to new sports. In contrast, SMPLOlympics provides a unified benchmark employing a consistent humanoid model and training pipeline across all sports. This standardization not only facilitates extension to more sports, but also simplifies benchmarking learning algorithms.

Simulated RL Benchmarks

Simulated full-body humanoids provide a valuable platform for studying embodied intelligence due to their close resemblance to real-world human behavior and physical interactions. Current RL benchmarks [3, 25, 14] often focus on locomotion tasks such as moving forward and traversing terrain. dm_control [25] and OpenAI Gym [3] only include locomotion tasks. ASE [19] includes results for five tasks based on MoCap data, which involve mainly simple locomotion and sword-swinging actions. These tasks lack the complexity required to fully exploit the capabilities of simulated humanoids. Sports scenarios, in contrast, require agile motion and strategic teamwork. They are also easily interpretable and provide measurable outcomes for success. A concurrent work, HumanoidBench [23], employs a commercially available humanoid robot in simulation to address 27 locomotion and manipulation tasks. Unlike HumanoidBench, our benchmark targets competitive sports and uses available human demonstration data to enhance the learning of human-like behaviors. This emphasis is essential because, without human demonstrations, behaviors developed in benchmarks can often appear erratic, nonhuman-like, and inefficient.

Humanoid Motion Representation

Adversarial learning has proven to be a powerful method for using human reference motions to enhance the naturalness of humanoid animations [18, 32, 1]. Due to the high DoF in humanoids and the inherent sample inefficiency of RL training, efforts have focused on developing motion primitives [6, 15, 5, 20] and motion latent spaces [4, 19, 24]. These techniques aim to accelerate training and provide human-like motion priors. Notably, approaches such as ASE [19], CASE [4], and CALM [24] utilize adversarial learning objectives to encourage mapping between random noise and realistic motor behavior. Furthermore, methods such as ControlVAE [33], NPMP [15], PhysicsVAE [30], NCP [38], and PULSE [10] leverage the motion imitation task to acquire and reuse motor skills for the learning of downstream tasks. In this work, we study AMP [18] and PULSE [10] as exemplary methods to provide motion priors. Our findings demonstrate that a robust motion prior, combined with straightforward reward designs, can effectively induce human-like behaviors in solving complex sports tasks.

3 Problem Formulation

We define the full-body human pose as $\boldsymbol{q}_t \triangleq (\boldsymbol{\theta}_t, \boldsymbol{p}_t)$, consisting of 3D joint rotations $\boldsymbol{\theta}_t \in \mathbb{R}^{J \times 6}$ and positions $\boldsymbol{p}_t \in \mathbb{R}^{J \times 3}$ of all $J$ joints on the humanoid, using the 6 DoF rotation representation [37]. To define velocities $\dot{\boldsymbol{q}}_{1:T}$, we have $\dot{\boldsymbol{q}}_t \triangleq (\boldsymbol{\omega}_t, \boldsymbol{v}_t)$ as angular $\boldsymbol{\omega}_t \in \mathbb{R}^{J \times 3}$ and linear velocities $\boldsymbol{v}_t \in \mathbb{R}^{J \times 3}$. If an object is involved (e.g., javelin, football, ping-pong ball), we define their 3D trajectories $\boldsymbol{q}^{\text{obj}}_t$ using object position $\boldsymbol{p}^{\text{obj}}_t$, orientation $\boldsymbol{\theta}^{\text{obj}}_t$, linear velocity $\boldsymbol{v}^{\text{obj}}_t$, and angular velocity $\boldsymbol{\omega}^{\text{obj}}_t$. As a notation convention, we use $\widehat{\cdot}$ to denote the ground truth kinematic quantities from Motion Capture (MoCap) and normal symbols without accents for values from the physics simulation.
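
As a concrete illustration of the $J \times 6$ rotation block, the sketch below assumes that [37] denotes the continuous 6D representation that stores the first two columns of each joint's rotation matrix and recovers the third column by Gram-Schmidt orthogonalization. The function names are hypothetical and are not taken from the SMPLOlympics code.

```python
import math


def rotmat_to_6d(R):
    """Flatten the first two columns of a 3x3 rotation matrix (nested lists)."""
    return [R[0][0], R[1][0], R[2][0], R[0][1], R[1][1], R[2][1]]


def _normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def _cross(a, b):
    return [
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    ]


def sixd_to_rotmat(d):
    """Rebuild a rotation matrix from a 6D vector via Gram-Schmidt."""
    a1, a2 = d[:3], d[3:]
    b1 = _normalize(a1)
    dot = sum(x * y for x, y in zip(b1, a2))
    # Remove the component of a2 along b1, then normalize.
    b2 = _normalize([y - dot * x for x, y in zip(b1, a2)])
    b3 = _cross(b1, b2)
    # b1, b2, b3 are the columns of the recovered matrix.
    return [[b1[i], b2[i], b3[i]] for i in range(3)]
```

Under this assumption, a humanoid with $J$ joints contributes $6J$ rotation values per frame, and any 6D vector with non-parallel halves maps back to a valid rotation.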

Goal-conditioned Reinforcement Learning for Humanoid Control

We define each sport using the general framework of goal-conditioned RL. Namely, a goal-conditioned policy $\pi_{\text{task}}$ is trained to control a simulated humanoid competing in a sports environment. The learning task is formulated as a Markov Decision Process (MDP) defined by the tuple $\mathcal{M} = \langle \boldsymbol{\mathcal{S}}, \boldsymbol{\mathcal{A}}, \boldsymbol{\mathcal{T}}, \boldsymbol{\mathcal{R}}, \gamma \rangle$ of states, actions, transition dynamics, reward function, and discount factor. The simulation determines the state $\boldsymbol{s}_t \in \boldsymbol{\mathcal{S}}$ and transition dynamics $\boldsymbol{\mathcal{T}}$, where a policy computes the action $\boldsymbol{a}_t$. The state $\boldsymbol{s}_t$ contains the proprioception $\boldsymbol{s}^{\text{p}}_t$ and the goal state $\boldsymbol{s}^{\text{g}}_t$. Proprioception is defined as $\boldsymbol{s}^{\text{p}}_t \triangleq (\boldsymbol{q}_t, \dot{\boldsymbol{q}}_t)$, which contains the 3D body pose $\boldsymbol{q}_t$ and velocity $\dot{\boldsymbol{q}}_t$. We use $\boldsymbol{b}$ to indicate the boundary of the arena to which a sport is limited. All values are normalized with respect to the humanoid heading (yaw).
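
One way to read the heading normalization is sketched below: a world-space point is expressed relative to the humanoid root and rotated by the negative yaw, so that the humanoid always faces the local $+x$ axis. This assumes a z-up world and a known forward direction for the root; the helper names are hypothetical and this is not the authors' implementation.

```python
import math


def heading_yaw(forward_xy):
    """Yaw angle (radians) of the root's forward direction in the ground plane."""
    return math.atan2(forward_xy[1], forward_xy[0])


def to_heading_frame(point, root_pos, yaw):
    """Express a world-space point in the root's heading-aligned frame (z-up)."""
    dx = point[0] - root_pos[0]
    dy = point[1] - root_pos[1]
    dz = point[2] - root_pos[2]
    # Rotate the offset by -yaw about the vertical axis.
    c, s = math.cos(-yaw), math.sin(-yaw)
    return (c * dx - s * dy, s * dx + c * dy, dz)
```

For example, with the root facing $+y$, a goal point 2 m further along $+y$ maps to roughly $(2, 0)$ in the ground plane of the local frame, so the policy sees the same observation regardless of which way the humanoid is facing in the world.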

4 SMPLOlympics: Sports Environments for Simulated Humanoids

In this section, we describe the formulation of each of our sports environments, from single-person sports (Sec. 4.1) to multi-person sports (Sec. 4.2). Then, we describe our pipeline for acquiring human demonstration data from videos (Sec. 4.3). An overview can be found in Fig. 1. For each sport, we provide a preliminary reward design that serves as a baseline for future research. Due to space constraints, omitted details can be found in the supplement.

Figure 1: An overview of SMPLOlympics: we design a collection of simulated sports environments and leverage RL and human demonstrations (from videos or MoCap) as prior to tackle them.

4.1 Single-person Sports

High Jump

In the high jump environment, the humanoid jumps over a horizontal bar placed at a certain height without touching it and aims to reach a goal point that is 2 meters behind the bar. The bar is positioned following the setup of the official Olympic game. The high jump goal state $\boldsymbol{s}^{\text{g-high\_jump}}_t \triangleq (\boldsymbol{p}^{b}_t, \boldsymbol{p}^{g}_t)$ contains the positions of the bar $\boldsymbol{p}^{b}_t \in \mathbb{R}^{3}$ and the goal point $\boldsymbol{p}^{g}_t \in \mathbb{R}^{3}$. The reward is defined as $\boldsymbol{\mathcal{R}}^{\text{high jump}}(\boldsymbol{s}^{\text{p}}_t, \boldsymbol{s}^{\text{g-high\_jump}}_t) \triangleq w^{\text{p}} r^{\text{p}}_t + w^{\text{h}} r^{\text{h}}_t$. The position reward $r^{\text{p}}_t$ encourages the humanoid to go closer to the goal point. The height reward $r^{\text{h}}_t$ encourages the humanoid to jump higher. Training terminates when the humanoid is in contact with the bar, does not pass the bar, or falls to the ground before jumping. We also set up four bar heights for curriculum learning: 0.5m, 1m, 1.5m, and 2m.
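
The structure of this reward, the termination conditions, and the bar-height curriculum can be sketched as follows. The text does not give the exact forms of $r^{\text{p}}_t$ and $r^{\text{h}}_t$, so the exponential distance term, the root-height term, the default weights, and the rule for advancing the curriculum are all illustrative assumptions; only the weighted-sum structure, the three termination conditions, and the four bar heights come from the description above.

```python
import math

# Curriculum bar heights in meters, as listed in the text.
BAR_HEIGHTS_M = (0.5, 1.0, 1.5, 2.0)


def high_jump_reward(root_pos, goal_pos, w_p=0.5, w_h=0.5):
    """Weighted sum w_p * r_p + w_h * r_h with assumed reward terms."""
    # Assumed position term: grows as the root approaches the goal point.
    r_p = math.exp(-math.dist(root_pos, goal_pos))
    # Assumed height term: root height above the ground (z-up).
    r_h = max(root_pos[2], 0.0)
    return w_p * r_p + w_h * r_h


def should_terminate(touched_bar, failed_to_pass_bar, fell_before_jump):
    """Stop when any of the three failure conditions from the text holds."""
    return touched_bar or failed_to_pass_bar or fell_before_jump


def next_bar_height(current_height):
    """Hypothetical curriculum step: advance to the next listed height."""
    i = BAR_HEIGHTS_M.index(current_height)
    return BAR_HEIGHTS_M[min(i + 1, len(BAR_HEIGHTS_M) - 1)]
```

In practice the flags passed to `should_terminate` would be derived from simulator contact and position queries, which are omitted here.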

Long Jump

The long jump environment is also set up to resemble the real-world setting, with a 20m runway followed by a jump area. Before the humanoid jumps, its feet should be behind the jump line. The goal state $\boldsymbol{s}^{\text{g-long\_jump}}_t \triangleq (\boldsymbol{p}^{s}_t, \boldsymbol{p}^{l}_t, \boldsymbol{p}^{g}_t)$ includes the positions of the starting point $\boldsymbol{p}^{s}_t \in \mathbb{R}^{3}$, the jump line $\boldsymbol{p}^{l}_t \in \mathbb{R}^{3}$, and the goal $\boldsymbol{p}^{g}_t \in \mathbb{R}^{3}$. The training reward is defined as $\boldsymbol{\mathcal{R}}^{\text{long jump}}(\boldsymbol{s}^{\text{p}}_t, \boldsymbol{s}^{\text{g-long\_jump}}_t) \triangleq w^{\text{p}} r^{\text{p}}_t + w^{\text{v}} r^{\text{v}}_t + w^{\text{h}} r^{\text{h}}_t + w^{\text{l}} r^{\text{l}}_t$. The position reward $r^{\text{p}}_t$ encourages the humanoid to get closer to the goal, the velocity reward $r^{\text{v}}_t$ encourages a higher running speed, and the height reward $r^{\text{h}}_t$ encourages a higher jump. Finally, $r^{\text{l}}_t$ encourages jumping far.

Hurdling

In hurdling, the humanoid tries to reach a finish line 110 meters ahead and needs to jump over 10 hurdles (each 1.067m high, with the first placed 13.72m from the start and subsequent hurdles spaced every 9.14m). The goal state is defined as $\boldsymbol{s}^{\text{g-hurdling}}_t \triangleq (\boldsymbol{p}^{h}_t, \boldsymbol{p}^{f}_t)$, where $\boldsymbol{p}^{h}_t \in \mathbb{R}^{10 \times 3}$ and $\boldsymbol{p}^{f}_t \in \mathbb{R}^{3}$ are the positions of the hurdles and the finish line, respectively. We define a simple reward function as $\boldsymbol{\mathcal{R}}^{\text{hurdling}}(\boldsymbol{s}^{\text{p}}_t, \boldsymbol{s}^{\text{g-hurdling}}_t) \triangleq r^{\text{distance}}_t$. $\boldsymbol{\mathcal{R}}^{\text{hurdling}}$ encourages the agent to run towards the finish line and clear each hurdle. Additionally, we employ a curriculum for hurdling, where the height of each hurdle is randomly sampled between 0 and 1.167 meters for each episode.
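
The hurdle layout follows directly from the numbers above and can be computed as in this short sketch. The constant and function names are hypothetical; the uniform sampling is one plausible reading of "randomly sampled", not a confirmed detail of the implementation.

```python
import random

NUM_HURDLES = 10
FIRST_HURDLE_M = 13.72           # distance from the start to the first hurdle
HURDLE_SPACING_M = 9.14          # distance between consecutive hurdles
FINISH_LINE_M = 110.0
MAX_CURRICULUM_HEIGHT_M = 1.167  # upper bound of the sampled height, from the text


def hurdle_positions():
    """Distance of each hurdle from the start line, in meters."""
    return [round(FIRST_HURDLE_M + i * HURDLE_SPACING_M, 2) for i in range(NUM_HURDLES)]


def sample_hurdle_height(rng):
    """Assumed curriculum: draw a hurdle height uniformly from the stated range."""
    return rng.uniform(0.0, MAX_CURRICULUM_HEIGHT_M)


positions = hurdle_positions()
# Distance left to run after clearing the last hurdle.
run_in = round(FINISH_LINE_M - positions[-1], 2)
```

With these values the last hurdle sits 95.98 m from the start, which leaves a 14.02 m run to the finish line.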

Golf

For golf, the humanoid's right hand is replaced with a golf club measuring 1.14 meters. The driver of the golf club is simulated as a small box ($0.05\text{m} \times 0.025\text{m} \times 0.02\text{m}$). We incorporate a wave-like terrain with an amplitude of 0.5 meters in the golf environment, designed to mimic real-world grasslands. The golf goal is positioned to the left of the humanoid, at a distance ranging from 0 meters to 20 meters away. The goal state $\boldsymbol{s}^{\text{g-golf}}_t \triangleq (\boldsymbol{p}^{b}_t, \boldsymbol{p}^{c}_t, \boldsymbol{p}^{g}_t, \boldsymbol{o}_t)$ includes the ball position $\boldsymbol{p}^{b}_t \in \mathbb{R}^{3}$, club position $\boldsymbol{p}^{c}_t \in \mathbb{R}^{3}$, goal position $\boldsymbol{p}^{g}_t \in \mathbb{R}^{3}$, and terrain height map $\boldsymbol{o}_t \in \mathbb{R}^{32 \times 32}$. The reward is defined as $\boldsymbol{\mathcal{R}}^{\text{golf}}(\boldsymbol{s}^{\text{p}}_t, \boldsymbol{s}^{\text{g-golf}}_t) \triangleq w^{\text{p}} r^{\text{p}}_t + w^{\text{c}} r^{\text{c}}_t + w^{\text{g}} r^{\text{g}}_t + w^{\text{pred}} r^{\text{pred}}_t$, where $r^{\text{p}}_t$ encourages the ball to move forward, $r^{\text{c}}_t$ encourages swinging the golf club to hit the ball, and $r^{\text{g}}_t$ encourages the ball to reach the goal. In addition, we predict the ball's trajectory and provide a dense reward $r^{\text{pred}}_t$ based on the distance between the predicted landing point and the goal.
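
The text does not say how the ball's trajectory is predicted. A minimal sketch of one possibility is given below, assuming drag-free projectile motion and a flat landing plane (which ignores the wave-like terrain); the exponential shaping, the `scale` parameter, and the function names are likewise assumptions rather than the authors' method.

```python
import math


def predict_landing_xy(pos, vel, ground_z=0.0, g=9.81):
    """Landing point of a drag-free projectile on a flat plane at ground_z."""
    x, y, z = pos
    vx, vy, vz = vel
    # Solve z + vz * t - 0.5 * g * t**2 = ground_z for the later root t.
    disc = vz * vz + 2.0 * g * (z - ground_z)
    if disc < 0.0:
        return None  # the ball never reaches the plane
    t = (vz + math.sqrt(disc)) / g
    return (x + vx * t, y + vy * t)


def landing_reward(pos, vel, goal_xy, scale=0.1):
    """Assumed dense reward: decays with predicted landing distance to the goal."""
    landing = predict_landing_xy(pos, vel)
    if landing is None:
        return 0.0
    return math.exp(-scale * math.dist(landing, goal_xy))
```

Because the prediction is available as soon as the ball leaves the club, a term of this kind gives feedback at every step of the flight instead of only when the ball comes to rest.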

Javelin

For javelin throw, we use the SMPL-X humanoid with articulated fingers. The goal state is defined as $\boldsymbol{s}^{\text{g-javelin}}_t \triangleq (\boldsymbol{q}^{\text{obj}}_t, \boldsymbol{p}^{r}_t, \boldsymbol{p}^{h}_t)$, where $\boldsymbol{q}^{\text{obj}}_t \in \mathbb{R}^{13}$ includes the position, orientation, linear velocity, and angular velocity of the javelin. $\boldsymbol{p}^{r}_t$ and $\boldsymbol{p}^{h}_t$ are the positions of the root and right hand. The reward is defined as $\boldsymbol{\mathcal{R}}^{\text{javelin}}(\boldsymbol{s}^{\text{p}}_t, \boldsymbol{s}^{\text{g-javelin}}_t) \triangleq w^{\text{grab}} r^{\text{grab}}_t + w^{\text{js}} r^{\text{js}}_t + w^{\text{goal}} r^{\text{goal}}_t + w^{\text{s}} r^{\text{s}}_t$. The grab reward $r^{\text{grab}}_t$ encourages the right hand to grab the javelin. The javelin stability reward $r^{\text{js}}_t$ minimizes the javelin's self-rotation. The goal reward $r^{\text{goal}}_t$ encourages the humanoid to throw the javelin further. The stability reward $r^{\text{s}}_t$ discourages large movements of the body.

4.2 Multi-person Sports

Tennis

For tennis, each humanoid's right hand is replaced with an oval racket. We use the same measurements as a real tennis court and ball. We design two tasks: a single-player task where the humanoid trains to hit balls launched randomly, and a 1v1 mode where the humanoid plays against another humanoid. For both tasks, the goal state is defined as $\boldsymbol{s}^{\text{g-tennis}}_{t} \triangleq (\boldsymbol{p}^{\text{ball}}_{t}, \boldsymbol{v}^{\text{ball}}_{t}, \boldsymbol{p}^{\text{racket}}_{t}, \boldsymbol{p}^{\text{tar}}_{t})$, where $\boldsymbol{p}^{\text{ball}}_{t} \in \mathbb{R}^{3}$, $\boldsymbol{v}^{\text{ball}}_{t} \in \mathbb{R}^{3}$, $\boldsymbol{p}^{\text{racket}}_{t} \in \mathbb{R}^{3}$, and $\boldsymbol{p}^{\text{tar}}_{t} \in \mathbb{R}^{3}$ are the position and velocity of the ball, the position of the racket, and the position of the target.
The reward function for tennis is defined as $\boldsymbol{\mathcal{R}}^{\text{tennis}}(\boldsymbol{s}^{\text{p}}_{t}, \boldsymbol{s}^{\text{g-tennis}}_{t}) \triangleq w_{\text{p}} r^{\text{racket}}_{t} + w_{\text{b}} r^{\text{ball}}_{t}$. The racket reward $r^{\text{racket}}_{t}$ encourages the racket to reach the ball, and the ball reward $r^{\text{ball}}_{t}$ encourages hitting the ball into the opponent's court, as close to the target as possible.
For the single-player task, we launch a ball from the opposite side with a random position and trajectory, simulating a ball hit by the opponent. The target $\boldsymbol{p}^{\text{tar}}_{t}$ is also randomly sampled. For the 1v1 scenario, we can either train models from scratch or initialize two identical single-player models as opponents, which can play back and forth.

Table Tennis

For table tennis, each humanoid is equipped with a circular paddle (replacing the right hand) and plays on a standard table. Similar to tennis, we have the single-player task and the 1v1 task. Similarly, the goal state is defined as $\boldsymbol{s}^{\text{g-tennis}}_{t} \triangleq (\boldsymbol{p}^{\text{ball}}_{t}, \boldsymbol{v}^{\text{ball}}_{t}, \boldsymbol{p}^{\text{racket}}_{t}, \boldsymbol{p}^{\text{tar}}_{t})$.
The reward function for table tennis is defined as $\boldsymbol{\mathcal{R}}^{\text{table tennis}}(\boldsymbol{s}^{\text{p}}_{t}, \boldsymbol{s}^{\text{g-table\_tennis}}_{t}) \triangleq w_{\text{p}} r^{\text{racket}}_{t} + w_{\text{b}} r^{\text{ball}}_{t}$. The paddle reward $r^{\text{racket}}_{t}$ is the same as in tennis, while we modify $r^{\text{ball}}_{t}$ slightly to encourage more hits for table tennis.

Fencing

For 1v1 fencing, each humanoid is equipped with a sword (replacing the right hand) and plays on a standard fencing field. The goal state is defined as $\boldsymbol{s}^{\text{g-fencing}}_{t} \triangleq (\boldsymbol{p}^{\text{opp}}_{t}, \boldsymbol{v}^{\text{opp}}_{t}, \boldsymbol{p}^{\text{sword}}_{t} - \boldsymbol{p}^{\text{opp-target}}_{t}, \|\boldsymbol{c}_{t}\|_{2}^{2}, \|\boldsymbol{c}^{\text{opp}}_{t}\|_{2}^{2}, \boldsymbol{b})$, which contains the opponent's body position $\boldsymbol{p}^{\text{opp}}_{t} \in \mathbb{R}^{24 \times 3}$, linear velocity $\boldsymbol{v}^{\text{opp}}_{t} \in \mathbb{R}^{24 \times 3}$, the difference between the target body position $\boldsymbol{p}^{\text{opp-target}}_{t} \in \mathbb{R}^{5 \times 3}$ on the opponent and the agent's sword tip position $\boldsymbol{p}^{\text{sword}}_{t}$, the normalized contact forces on the agent itself $\|\boldsymbol{c}_{t}\|_{2}^{2} \in \mathbb{R}^{24 \times 3}$ and on its opponent $\|\boldsymbol{c}^{\text{opp}}_{t}\|_{2}^{2} \in \mathbb{R}^{24 \times 3}$, as well as the bounding box $\boldsymbol{b} \in \mathbb{R}^{4}$.
To train the fencing agent, we define the fencing reward function as $\boldsymbol{\mathcal{R}}^{\text{fencing}}(\boldsymbol{s}^{\text{p}}_{t}, \boldsymbol{s}^{\text{g-fencing}}_{t}) \triangleq w_{\text{f}} r^{\text{facing}}_{t} + w_{\text{v}} r^{\text{vel}}_{t} + w_{\text{s}} r^{\text{strike}}_{t} + w_{\text{p}} r^{\text{point}}_{t}$. The facing reward $r^{\text{facing}}_{t}$ and velocity reward $r^{\text{vel}}_{t}$ encourage the agent to face and move toward the opponent.
The strike reward $r^{\text{strike}}_{t}$ encourages the agent's sword tip to get close to the target, while $r^{\text{point}}_{t}$ is the reward for making contact with the target. We use the pelvis, head, spine, chest, and torso as the target bodies. The episode terminates if either of the humanoids falls or steps out of bounds.
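To make the size of the fencing goal state concrete, the following sketch assembles it from the components listed above. The function name, argument names, concatenation order, and the zero-filled placeholder inputs are assumptions for illustration; only the per-component shapes come from the text.

```python
from itertools import chain


def flatten(rows):
    """Flatten a list of 3-vectors into a flat list of floats."""
    return list(chain.from_iterable(rows))


def fencing_goal_state(p_opp, v_opp, p_sword, p_opp_target, c_self, c_opp, bbox):
    """Concatenate the fencing goal-state components into one flat vector.

    Shapes follow the text: p_opp, v_opp, c_self, c_opp are 24 x 3,
    p_opp_target is 5 x 3, p_sword has 3 entries, bbox has 4 entries.
    The ordering of the components is an assumption.
    """
    # Sword tip minus each of the 5 target bodies: p_sword - p_opp_target.
    tip_offset = [[s - t for s, t in zip(p_sword, target)] for target in p_opp_target]
    return (flatten(p_opp) + flatten(v_opp) + flatten(tip_offset)
            + flatten(c_self) + flatten(c_opp) + list(bbox))


def zeros(n):
    """Placeholder n x 3 array of zeros."""
    return [[0.0, 0.0, 0.0] for _ in range(n)]


state = fencing_goal_state(zeros(24), zeros(24), [1.0, 1.0, 1.0], zeros(5),
                           zeros(24), zeros(24), [0.0] * 4)
print(len(state))  # 4 * 72 + 15 + 4 = 307
```

Under these shapes the goal state has 307 entries; dropping the 4-entry bounding box gives the size of a boxing-style state with the same components.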

Boxing

For boxing, we simulate two humanoids with sphere hands in a bounded arena. The goal state is similar to fencing: $\boldsymbol{s}^{\text{g-boxing}}_{t} \triangleq (\boldsymbol{p}^{\text{opp}}_{t}, \boldsymbol{v}^{\text{opp}}_{t}, \boldsymbol{p}^{\text{hand}}_{t} - \boldsymbol{p}^{\text{opp-target}}_{t}, \|\boldsymbol{c}_{t}\|_{2}^{2}, \|\boldsymbol{c}^{\text{opp}}_{t}\|_{2}^{2})$, but without the bounding box information. The reward function and target body parts are also the same as in fencing, with the sword tip replaced by the hands.

Soccer

The soccer environment includes one or more humanoids, a ball, two goal posts, and the field boundaries. The field measures 32 m $\times$ 20 m. We support three tasks: penalty kicks, 1v1, and 2v2.

For penalty kicks, the humanoid is positioned 13 meters from the goal line, with the ball placed at a fixed spot 12 meters directly in front of the goal center. The objective is to kick the ball toward a randomly sampled target within the goal post. To achieve this, the controller is provided with $\boldsymbol{s}^{\text{g-kick}}_{t} \triangleq (\boldsymbol{p}^{\text{ball}}_{t}, \boldsymbol{\dot{q}}^{\text{ball}}_{t}, \boldsymbol{p}^{\text{goal-post}}_{t}, \boldsymbol{p}^{\text{goal-target}}_{t})$, where $\boldsymbol{p}^{\text{ball}}_{t} \in \mathbb{R}^{3}$ is the ball position, $\boldsymbol{\dot{q}}^{\text{ball}}_{t} \in \mathbb{R}^{3}$ is the velocity and angular velocity, $\boldsymbol{p}^{\text{goal-post}}_{t} \in \mathbb{R}^{4}$ is the bounding box of the goal, and $\boldsymbol{p}^{\text{goal-target}}_{t} \in \mathbb{R}^{3}$ is the target location within the goal post.
The reward is $\boldsymbol{\mathcal{R}}^{\text{soccer-kick}}(\boldsymbol{s}^{\text{p}}_{t}, \boldsymbol{s}^{\text{g-kick}}_{t}) \triangleq w^{\text{p2b}} r^{\text{p2b}} + w^{\text{b2g}} r^{\text{b2g}} + w^{\text{bv2g}} r^{\text{bv2g}} + w^{\text{b2t}} r^{\text{b2t}} - c^{\text{no-dribble}}_{t}$. Various rewards are designed to guide the character towards a run-and-kick motion. The player-to-ball reward ($r^{\text{p2b}}$) motivates the character to move towards the ball. The ball-to-goal reward ($r^{\text{b2g}}$) encourages a smaller distance between the ball and the target. The ball-velocity-to-goal reward ($r^{\text{bv2g}}$) encourages a higher velocity of the ball toward the target. The ball-to-target reward ($r^{\text{b2t}}$) encourages a smaller distance between the target and the predicted landing spot of the ball based on its current position and velocity.
Finally, a negative reward ($c^{\text{no-dribble}}_{t}$) is applied if the character passes the spawn position of the ball, which discourages dribbling and encourages kicking.
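The ball-to-target term relies on a predicted landing spot. A minimal sketch of one way to compute it is given below, assuming drag-free ballistic flight, a ground plane at height zero, and an exponential shaping of the distance. The function names, the gravity constant, the `scale` parameter, and the shaping itself are hypothetical and are not taken from the paper.

```python
import math

GRAVITY = 9.81  # m/s^2, assumed value


def predict_landing(pos, vel, ground_z=0.0):
    """Predict where a ball at `pos` with velocity `vel` reaches `ground_z`.

    Solves z + vz * t - 0.5 * g * t^2 = ground_z for the positive root,
    ignoring drag, spin, and bounces (all simplifying assumptions).
    """
    x, y, z = pos
    vx, vy, vz = vel
    height = max(z - ground_z, 0.0)
    t = (vz + math.sqrt(vz * vz + 2.0 * GRAVITY * height)) / GRAVITY
    return (x + vx * t, y + vy * t, ground_z)


def ball_to_target_reward(pos, vel, target, scale=1.0):
    """Reward in (0, 1] that grows as the predicted landing spot nears the target."""
    landing = predict_landing(pos, vel)
    return math.exp(-scale * math.dist(landing, target))


# A ball kicked from the origin with vz = 9.81 m/s stays airborne for about 2 s.
print(predict_landing((0.0, 0.0, 0.0), (10.0, 0.0, 9.81)))
```

For a target inside the goal mouth, intersecting the trajectory with the goal plane instead of the ground would be an equally plausible reading of "landing spot".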

Beyond penalty kicks, we explore team-play dynamics, including 1v1 and 2v2. The controller is provided with a state defined as $\boldsymbol{s}^{\text{g-soccer}}_{t} \triangleq (\boldsymbol{p}^{\text{ball}}_{t}, \boldsymbol{\dot{q}}^{\text{ball}}_{t}, \boldsymbol{p}^{\text{goal-post}}_{t}, \boldsymbol{p}^{\text{ally-root}}_{t}, \boldsymbol{p}^{\text{opp-root}}_{t})$, where $\boldsymbol{p}^{\text{ally-root}}_{t} \in \mathbb{R}^{3}$ and $\boldsymbol{p}^{\text{opp-root}}_{t} \in \mathbb{R}^{3}$ are the root positions of the ally and opponents (1 or 2).
The controller is then trained using the following reward: $\boldsymbol{\mathcal{R}}^{\text{soccer-match}}(\boldsymbol{s}^{\text{p}}_{t}, \boldsymbol{s}^{\text{g-soccer}}_{t}) \triangleq w^{\text{p2b}} r^{\text{p2b}} + w^{\text{b2g}} r^{\text{b2g}} + w^{\text{bv2g}} r^{\text{bv2g}} + w^{\text{point}} r^{\text{point}}$, where $r^{\text{p2b}}$, $r^{\text{b2g}}$, and $r^{\text{bv2g}}$ are the same as in the penalty kick. $r^{\text{b2g}}$ and $r^{\text{bv2g}}$ are zeroed out when the distance to the ball is greater than 0.5 m.
$r^{\text{point}}$, the scoring reward, provides a one-time bonus or penalty for goals. Notice that this is a rudimentary reward design compared to prior art [8] and serves as a starting point for further development.
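The gating of the ball-related terms can be sketched as follows. The function name, the unit default weights, and the reading of "distance to the ball" as the player's distance to the ball are assumptions for illustration; only the 0.5 m threshold and the four reward terms come from the text.

```python
def soccer_match_reward(r_p2b, r_b2g, r_bv2g, r_point, dist_to_ball,
                        weights=(1.0, 1.0, 1.0, 1.0), gate_radius=0.5):
    """Weighted sum of the match reward terms with distance gating.

    The ball-to-goal and ball-velocity-to-goal terms only count when the
    player is within `gate_radius` meters of the ball. The weights are
    placeholders, not the values used in the paper.
    """
    w_p2b, w_b2g, w_bv2g, w_point = weights
    if dist_to_ball > gate_radius:
        # Too far from the ball: the player gets no credit for ball motion.
        r_b2g = 0.0
        r_bv2g = 0.0
    return w_p2b * r_p2b + w_b2g * r_b2g + w_bv2g * r_bv2g + w_point * r_point


print(soccer_match_reward(0.5, 0.25, 0.25, 0.0, dist_to_ball=1.0))  # 0.5
print(soccer_match_reward(0.5, 0.25, 0.25, 0.0, dist_to_ball=0.3))  # 1.0
```

Gating of this kind keeps a player from being rewarded for ball movement caused by a teammate or opponent.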

Basketball

Our basketball environment is set up similarly to the soccer environment, except that it uses the SMPL-X humanoid. The court measures 29 m $\times$ 15 m, with a 3 m high hoop. We also introduce a free-throw task, where the humanoid begins at a distance of 4.5 meters from the hoop with the ball initially positioned close to its hands. The objective is to successfully throw the basketball into the hoop. The goal state for this task is defined similarly to that of the soccer penalty kicks, with the distinction that foot-to-ball contact is prohibited to follow basketball rules.

Competitive Self-play

In competitive sports environments, we implement a basic adversarial self-play mechanism where two policies, initialized randomly, compete against each other to optimize their rewards. We adopt an alternating optimization strategy from [29], where one policy is frozen while the other is trained. This encourages each policy to develop offensive and defensive strategies, contributing to more competitive behavior, as observed in boxing and fencing (supplement site).
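The alternating scheme can be sketched as a simple loop in which the two policies take turns being the learner. The `train_one` callback, the one-round switching period, and the stand-in integer "policies" in the usage example are hypothetical; the exact schedule used in [29] is not reproduced here.

```python
def alternating_self_play(policy_a, policy_b, train_one, rounds):
    """Alternate which policy is trained while the other stays frozen.

    `train_one(learner, frozen_opponent)` is a user-supplied function that
    returns the updated learner. Switching every round is an assumption.
    """
    history = []
    for k in range(rounds):
        if k % 2 == 0:
            policy_a = train_one(policy_a, policy_b)  # B is frozen this round
            history.append("a")
        else:
            policy_b = train_one(policy_b, policy_a)  # A is frozen this round
            history.append("b")
    return policy_a, policy_b, history


def dummy_train(learner, frozen_opponent):
    """Stand-in for an RL update: counts how often a policy was trained."""
    return learner + 1


print(alternating_self_play(0, 0, dummy_train, rounds=5))
```

In practice `train_one` would run a fixed budget of policy-gradient updates against the frozen opponent before the roles are swapped.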

4.3 Acquiring Human Demonstration From Videos

We utilize TRAM [28] for 3D motion reconstruction from Internet videos, providing robust global trajectory and pose estimation under dynamic camera movements, commonly found in sports broadcasting. Specifically, TRAM estimates SMPL parameters [9] which include global root translation, orientation, body poses, and shape parameters. We further apply PHC [11], a physics-based motion tracker, to imitate these estimated motions, ensuring physical plausibility. We find these corrected motions are significantly more effective as positive samples for adversarial learning compared to raw estimated results. More details and ablation are provided in the supplementary materials.

Figure 2: Qualitative results for high jump, javelin, golf, and hurdling. PPO and AMP try to solve the task using inhuman behavior, while PULSE can discover human-like behavior.

5 Experiments

Implementation Details

Simulation is conducted in Isaac Gym [14], where the policy runs at 30 Hz and the simulation at 60 Hz. All task policies utilize three-layer MLPs with units [2048, 1024, 512]. The SMPL humanoid models adhere to the SMPL kinematic structure, featuring 24 joints, 23 of which are actuated, yielding an action space of $\mathbb{R}^{69}$. The SMPL-X humanoid has 52 joints, 51 actuated, including 21 body joints and hands, resulting in an action space of $\mathbb{R}^{153}$. Body parts on our humanoid consist of primitives such as capsules and blocks. All models can be trained on a single Nvidia RTX 3090 GPU in 1-3 days. We limit all joint actuation forces to 500 Nm. For more implementation details, please refer to the supplement.
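The numbers above fit together as simple arithmetic, sketched below. The helper names and the observation size of 100 are hypothetical; three degrees of freedom per actuated joint is the reading implied by 23 joints giving 69 action dimensions and 51 joints giving 153.

```python
def action_dim(num_actuated_joints, dof_per_joint=3):
    """Action dimensionality for a humanoid with 3-DoF actuated joints."""
    return num_actuated_joints * dof_per_joint


def sim_substeps(sim_hz, policy_hz):
    """Number of physics steps executed per policy decision."""
    if sim_hz % policy_hz != 0:
        raise ValueError("simulation rate must be a multiple of the policy rate")
    return sim_hz // policy_hz


def mlp_layer_shapes(obs_dim, hidden_units, out_dim):
    """(input, output) sizes of each linear layer in the MLP."""
    sizes = [obs_dim] + list(hidden_units) + [out_dim]
    return list(zip(sizes[:-1], sizes[1:]))


print(action_dim(23))        # 69 for SMPL
print(action_dim(51))        # 153 for SMPL-X
print(sim_substeps(60, 30))  # 2 physics steps per policy step
print(mlp_layer_shapes(100, [2048, 1024, 512], action_dim(23)))
```

The output size in the last line is only illustrative: with a learned latent action space such as the 32-dimensional one in PULSE, the task policy would output a latent code instead of joint targets.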

Baselines

We benchmark our simulated sports using some of the state-of-the-art simulated humanoid control methods. While not a comprehensive list, it provides a baseline for the challenging environments. Each task is trained using PPO [22], AMP [18], PULSE [10], and a combination of PULSE and AMP. AMP uses a discriminator alongside the policy to provide an adversarial reward, using human demonstration data to deliver a “style” reward that reflects the human-likeness of humanoid motion. Both task and discriminator rewards are equally weighted at 0.5. PULSE extracts a 32-dimensional universal motion representation from AMASS data, surpassing previous methods [24, 19] in coverage of motor skills and applicability to downstream tasks. Compared to AMP, PULSE uses hierarchical RL and offers a learned action space that accelerates training and provides a human-like motion prior (instead of a discriminative reward). PULSE and AMP can be combined effectively, where PULSE provides the action space and AMP provides a task-specific style reward.
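The equal weighting of task and discriminator rewards can be written as a one-line combination. The function and argument names are hypothetical, and the style reward is taken as an input here because the way it is derived from the discriminator output is not specified in this section.

```python
def amp_total_reward(task_reward, style_reward, w_task=0.5, w_style=0.5):
    """Combine the task reward and the discriminator ("style") reward.

    The 0.5 / 0.5 weights follow the text; `style_reward` is assumed to be
    a precomputed scalar derived from the discriminator.
    """
    return w_task * task_reward + w_style * style_reward


# Solving the task with unnatural motion vs. moving naturally without progress.
print(amp_total_reward(task_reward=1.0, style_reward=0.0))  # 0.5
print(amp_total_reward(task_reward=0.0, style_reward=1.0))  # 0.5
```

With equal weights, a natural-looking motion that makes no task progress can score the same as an unnatural motion that solves the task, which is one way to see how the two objectives can compete.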

Metrics

We provide quantitative evaluations for tasks with easily measurable metrics such as high jump, long jump, hurdling, javelin, golf, single-player tennis, table tennis, penalty kicks, and free throws. These metrics are detailed in the supplementary materials, where we also present qualitative assessments for tasks that are more challenging to quantify, such as boxing, fencing, and team soccer. Specifically, success rate (Suc Rate) determines whether an agent completes a sport according to set rules. Average distance (Avg Dis) indicates the extent an agent or object travels. For sports involving ball hits, such as tennis and table tennis, we record the average number of successful ball strikes (Avg Hits). Error distance (Error Dis) measures the distance between the intended target and the actual landing spot, applicable in sports like golf, tennis, and penalty kicks. Additionally, the hit rate in golf quantifies the success of striking the ball with the club. Evaluations are performed on 1000 trials.
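As a rough illustration of how two of these metrics could be aggregated over trials, consider the sketch below. The function names and the toy inputs are hypothetical, and whether the error distance is averaged over all trials or only successful ones is not stated here, so the sketch simply averages over whatever pairs it is given.

```python
import math


def success_rate(outcomes):
    """Fraction of trials marked successful (`outcomes` is a list of booleans)."""
    return sum(outcomes) / len(outcomes)


def mean_error_distance(targets, landings):
    """Mean Euclidean distance between intended targets and actual landing spots."""
    distances = [math.dist(t, l) for t, l in zip(targets, landings)]
    return sum(distances) / len(distances)


print(success_rate([True, False, True, True]))                                   # 0.75
print(mean_error_distance([(0.0, 0.0), (0.0, 0.0)], [(3.0, 4.0), (0.0, 0.0)]))  # 2.5
```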

5.1 Benchmarking Popular Simulated Humanoid Algorithms

In this section, we evaluate the performance of various control methods across our sports environments. We provide qualitative results in Fig. 2 and Fig. 3, and training curves in Fig. 4. For extensive qualitative results, including human-like soccer kicks, boxing, and high jump, please see the supplement.

Track & Field Sports (Without Video Data)

We first evaluate track and field sports, including long jump, high jump, hurdling, and javelin throwing. For these sports, SOTA pose estimation methods fail to estimate coherent motion and global root trajectory as players and cameras are both fast-moving. Thus, we utilize a subset of the AMASS dataset containing locomotion data [21] as reference motions. Since PULSE is pretrained on AMASS, we exclude PULSE + AMP from these tests. Table 1 summarizes the quantitative results of different methods. In long jump, AMP fails entirely, often walking slowly to the jump line without a forward leap. This failure occurs because the policy prioritizes discriminator rewards over task completion. If the task is too hard, the policy will use simple motion (such as standing still) to maximize the discriminator reward instead of trying to complete the task. In contrast, PPO, while capable of jumping great distances, exhibits unnatural motions. PULSE successfully executes jumps with human-like motion, but lacks the specialized skills for top-tier records due to the absence of corresponding motion data in AMASS. The high jump displays similar patterns: PPO achieves impressive heights but with unnatural movements while AMP struggles to reconcile adversarial and task rewards. Surprisingly, as shown in Figure 2, PULSE successfully adopts a Fosbury flop approach without specific rewards to encourage this technique, likely leveraging breakdance skills. For hurdling, AMP completely fails, stopping before the first hurdle. PPO bounces energetically over each obstacle as shown in Figure 2, but sometimes falls and fails to complete the race, with an average success rate of just over 50% and an average distance of less than 110m. PULSE facilitates natural clearance of hurdles, and completes races in 17.76 seconds, a competitive time compared to human standards. 
Javelin throwing poses similar challenges: PPO uses inhuman strategies, AMP struggles with balancing rewards, and PULSE adopts human-like strategies but lacks specific skills for record-setting performance.

Table 1: Evaluation on Long Jump, High Jump, Hurdling and Javelin. World records are in parentheses.

| Method | Long Jump (8.95m): Suc Rate ↑ | Long Jump: Avg Dis ↑ | High Jump (2.45m): Suc Rate (1m) ↑ | High Jump: Height (1m) ↑ | High Jump: Suc Rate (1.5m) ↑ | High Jump: Height (1.5m) ↑ | Hurdling (12.8s): Suc Rate ↑ | Hurdling: Avg Dis ↑ | Hurdling: Time ↓ | Javelin (104.8m): Suc Rate ↑ | Javelin: Avg Dis ↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| PPO [22] | 53.6% | 19.42 | 100% | 4.08 | 100% | 4.11 | 57.6% | 108.9 | 11.22 | 100% | 44.5 |
| AMP [18] | 0% | - | 0% | - | 0% | - | 0% | 13.24 | - | 0.31% | 2.03 |
| PULSE [10] | 100% | 5.105 | 100% | 2.01 | 100% | 1.98 | 100% | 122.1 | 17.76 | 100% | 9.63 |

Table 2: Evaluation on Golf, Tennis, Table Tennis, Penalty Kick and Free Throw.

| Method | Tennis: Avg Hits ↑ | Tennis: Error Dis ↓ | Table Tennis: Avg Hits ↑ | Table Tennis: Error Dis ↓ | Golf: Hit Rate ↑ | Golf: Error Dis ↓ | Penalty Kick: Suc Rate ↑ | Penalty Kick: Error Dis ↓ | Free Throw: Suc Rate ↑ |
|---|---|---|---|---|---|---|---|---|---|
| PPO [22] | 2.76 | 1.92 | 1.01 | 0.06 | 0% | - | 0.0% | - | 91.4% |
| AMP [18] | 3.95 | 5.30 | 1.10 | 0.13 | 100% | 1.43 | 0.0% | - | 0.0% |
| PULSE [10] | 2.48 | 3.50 | 0.74 | 0.19 | 99.9% | 1.29 | 76.6% | 0.25 | 85.6% |
| PULSE [10] + AMP [18] | 2.62 | 3.64 | 1.83 | 0.23 | 99.9% | 2.18 | 27.5% | 0.27 | 89.8% |

Sports With Video Data

For sports including golf, tennis, table tennis, and soccer penalty kick, we utilize processed motion from videos as demonstrations for AMP and PULSE + AMP. The results are reported in Table 2 and Fig. 3. In tennis, AMP demonstrates superior performance in terms of average hits; however, returned balls often land far from the intended targets. This is because prolonged rallies increase discriminator rewards, leading AMP to ignore task rewards. Notably, AMP exhibits inhuman motions at the moment of ball contact and reverts to natural movements when preparing for the next hit, as shown in Fig. 3. This behavior underscores the conflict between task and discriminator rewards. PPO plays tennis in an unnatural way, while PULSE and PULSE + AMP show similar performance. In table tennis, PPO achieves impressive error distances, but struggles with consistency and often fails to return second shots. We observe that video data proves particularly beneficial for table tennis: PULSE + AMP records significantly higher average hits with reasonable error distances. Table tennis requires quick reactions within a short time, which the pre-trained PULSE model supports by providing the necessary motor skills, enhanced by video data that guides the learning of proper stroke techniques. For golf, penalty kicks, and free throws, the “initiating contact with an object” part makes them challenging. Here, only PULSE and PULSE + AMP manage to solve the three tasks effectively and consistently, leveraging PULSE’s latent space for effective exploration. The design of these tasks often results in a sparse exploration phase in which the policy triggers penalty rewards, such as $c^{\text{no-dribble}}_{t}$ for moving past the ball’s initial position. The AMP reward also negatively affects training on the penalty kick, as the human demonstrations contain other soccer motions such as running and dribbling, and the policy finds them easier to learn and exploit.

Figure 3: Qualitative results for table tennis and tennis. PPO and AMP result in inhuman behavior; PULSE produces human-like movement, while PULSE + AMP results in behavior specific to the sport.
Figure 4: Learning curves on various tasks.

Curriculum learning

We find that curriculum learning is an essential component in achieving better results for some tasks. In Table 3, we study variants of the high jump and hurdling tasks with and without the curriculum using PULSE. Without the curriculum, the policy fails to clear the 1.5m bar in high jump and fails to complete the hurdling race. This is because the policy cannot obtain any reward when facing challenging bar and hurdle heights, and therefore gets stuck in a local minimum.

Table 3: Evaluation on curriculum learning.

| Method | High Jump: Suc Rate (1m) | High Jump: Suc Rate (1.5m) | Hurdling: Suc Rate | Hurdling: Avg Dis | Hurdling: Time |
|---|---|---|---|---|---|
| w/o curriculum | 100% | 0% | 0% | 13.65 | - |
| w/ curriculum | 100% | 100% | 100% | 122.1 | 17.76 |
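This section does not spell out the curriculum schedule, so the following is only a hypothetical illustration of the general idea: start with an easy bar height and raise it once the policy clears it reliably. The function name, the step size, the success threshold, and the starting height are all assumptions made for this sketch; only the 1.5m target height comes from Table 3.

```python
def next_bar_height(current_height, recent_success_rate,
                    target_height=1.5, step=0.1, threshold=0.8):
    """Return the bar height to use for the next training stage.

    All default values except target_height are illustrative assumptions,
    not values reported in the paper.
    """
    if recent_success_rate >= threshold:
        # The policy clears the current bar reliably: make the task harder,
        # but never exceed the final target height.
        return min(current_height + step, target_height)
    # Otherwise keep the difficulty unchanged so rewards stay reachable.
    return current_height


if __name__ == "__main__":
    height = 1.0  # assumed easy starting height
    for success in (0.9, 0.5, 0.95):
        height = next_bar_height(height, success)
        print(f"success={success:.2f} -> next bar height {height:.2f} m")
```

The point of such a schedule is that the policy always trains on a task where some reward is reachable, which is the failure mode the no-curriculum rows in Table 3 illustrate.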

6 Limitations, Conclusion and Future Work

Limitations

While SMPLOlympics provides a large collection of simulated sports environments, it is far from comprehensive. Certain sports are omitted due to simulation constraints (e.g., swimming, shooting, ice hockey, cycling) or their inherent complexity (e.g., 11-a-side soccer, equestrian events). Nevertheless, our framework is highly adaptable, allowing easy incorporation of additional sports such as climbing, rugby, and wrestling. Our initial reward designs, though able to achieve sensible results, are also far from optimal. For competitive sports such as 2v2 soccer and basketball, our results also fall short of SOTA [8], which employs much more complex systems.

Conclusion and Future Work

We introduce SMPLOlympics, a collection of sports environments for simulated humanoids. We provide carefully designed states and rewards, and benchmark humanoid control algorithms and motion priors. We find that by combining simple reward designs with a powerful human motion prior, one can achieve human-like behavior in solving various challenging sports. Our humanoid’s compatibility with the SMPL family of models also provides an easy way to obtain additional training data from video, which we demonstrate to be helpful for some sports. These well-defined simulation environments could also serve as valuable platforms for frontier models [12] to gain physical understanding. We believe that SMPLOlympics provides a valuable starting point for the community to further explore physically simulated humanoids.

References

  • Bae et al. [2023] Jinseok Bae, Jungdam Won, Donggeun Lim, Cheol-Hui Min, and Young Min Kim. Pmp: Learning to physically interact with environments using part-wise motion priors. In ACM SIGGRAPH 2023 Conference Proceedings, pages 1–10, 2023.
  • Bogo et al. [2016] Federica Bogo, Angjoo Kanazawa, Christoph Lassner, Peter Gehler, Javier Romero, and Michael J Black. Keep it smpl: Automatic estimation of 3d human pose and shape from a single image. Lect. Notes Comput. Sci., 9909 LNCS:561–578, 2016. ISSN 0302-9743,1611-3349.
  • Brockman et al. [2016] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. Openai gym, 2016.
  • Dou et al. [2023] Zhiyang Dou, Xuelin Chen, Qingnan Fan, Taku Komura, and Wenping Wang. C·ASE: Learning conditional adversarial skill embeddings for physics-based characters. In SIGGRAPH Asia 2023 Conference Papers, pages 1–11, 2023.
  • Haarnoja et al. [2018] Tuomas Haarnoja, Kristian Hartikainen, Pieter Abbeel, and Sergey Levine. Latent space policies for hierarchical reinforcement learning. arXiv preprint arXiv:1804.02808, 2018.
  • Hasenclever et al. [2020] Leonard Hasenclever, Fabio Pardo, Raia Hadsell, Nicolas Heess, and Josh Merel. CoMic: Complementary task learning & mimicry for reusable skills. In Hal Daumé III and Aarti Singh, editors, Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 4105–4115. PMLR, 2020.
  • Liu and Hodgins [2018] Libin Liu and Jessica Hodgins. Learning basketball dribbling skills using trajectory optimization and deep reinforcement learning. ACM Transactions on Graphics (TOG), 37(4):1–14, 2018.
  • Liu et al. [2021] Siqi Liu, Guy Lever, Zhe Wang, Josh Merel, S M Ali Eslami, Daniel Hennes, Wojciech M Czarnecki, Yuval Tassa, Shayegan Omidshafiei, Abbas Abdolmaleki, Noah Y Siegel, Leonard Hasenclever, Luke Marris, Saran Tunyasuvunakool, H Francis Song, Markus Wulfmeier, Paul Muller, Tuomas Haarnoja, Brendan D Tracey, Karl Tuyls, Thore Graepel, and Nicolas Heess. From motor control to team play in simulated humanoid football. arXiv preprint arXiv:2105.12196, 2021.
  • Loper et al. [2015] Matthew Loper, Naureen Mahmood, Javier Romero, Gerard Pons-Moll, and Michael J Black. Smpl: A skinned multi-person linear model. ACM Trans. Graph., 34, 2015. ISSN 0730-0301,1557-7368.
  • Luo et al. [2023a] Zhengyi Luo, Jinkun Cao, Josh Merel, Alexander Winkler, Jing Huang, Kris Kitani, and Weipeng Xu. Universal humanoid motion representations for physics-based control. arXiv preprint arXiv:2310.04582, 2023a.
  • Luo et al. [2023b] Zhengyi Luo, Jinkun Cao, Alexander W. Winkler, Kris Kitani, and Weipeng Xu. Perpetual humanoid control for real-time simulated avatars. In International Conference on Computer Vision (ICCV), 2023b.
  • Ma et al. [2023] Yecheng Jason Ma, William Liang, Guanzhi Wang, De-An Huang, Osbert Bastani, Dinesh Jayaraman, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv:2310.12931, 2023.
  • Mahmood et al. [2019] Naureen Mahmood, Nima Ghorbani, Nikolaus F Troje, Gerard Pons-Moll, and Michael J Black. Amass: Archive of motion capture as surface shapes. Proceedings of the IEEE International Conference on Computer Vision, 2019-Octob:5441–5450, 2019. ISSN 1550-5499.
  • Makoviychuk et al. [2021] Viktor Makoviychuk, Lukasz Wawrzyniak, Yunrong Guo, Michelle Lu, Kier Storey, Miles Macklin, David Hoeller, Nikita Rudin, Arthur Allshire, Ankur Handa, and Gavriel State. Isaac gym: High performance gpu-based physics simulation for robot learning. arXiv preprint arXiv:2108.10470, 2021.
  • Merel et al. [2018] Josh Merel, Leonard Hasenclever, Alexandre Galashov, Arun Ahuja, Vu Pham, Greg Wayne, Yee Whye Teh, and Nicolas Heess. Neural probabilistic motor primitives for humanoid control, 2018. ISSN 2331-8422.
  • Pavlakos et al. [2019] Georgios Pavlakos, Vasileios Choutas, Nima Ghorbani, Timo Bolkart, Ahmed A. A. Osman, Dimitrios Tzionas, and Michael J. Black. Expressive body capture: 3d hands, face, and body from a single image. In Proceedings IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2019.
  • Peng et al. [2018] Xue Bin Peng, Pieter Abbeel, Sergey Levine, and Michiel van de Panne. Deepmimic. ACM Trans. Graph., 37:1–14, 2018. ISSN 0730-0301.
  • Peng et al. [2021] Xue Bin Peng, Ze Ma, Pieter Abbeel, Sergey Levine, and Angjoo Kanazawa. Amp: Adversarial motion priors for stylized physics-based character control. ACM Trans. Graph., pages 1–20, 2021.
  • Peng et al. [2022] Xue Bin Peng, Yunrong Guo, Lina Halper, Sergey Levine, and Sanja Fidler. Ase: Large-scale reusable adversarial skill embeddings for physically simulated characters. arXiv preprint arXiv:2205.01906, 2022.
  • Rao et al. [2021] Dushyant Rao, Fereshteh Sadeghi, Leonard Hasenclever, Markus Wulfmeier, Martina Zambelli, Giulia Vezzani, Dhruva Tirumala, Yusuf Aytar, Josh Merel, Nicolas Heess, and Raia Hadsell. Learning transferable motor skills with hierarchical latent mixture policies. arXiv preprint arXiv:2112.05062, 2021.
  • Rempe et al. [2023] Davis Rempe, Zhengyi Luo, Xue Bin Peng, Ye Yuan, Kris Kitani, Karsten Kreis, Sanja Fidler, and Or Litany. Trace and pace: Controllable pedestrian animation via guided trajectory diffusion. arXiv preprint arXiv:2304.01893, 2023.
  • Schulman et al. [2017] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms, 2017. URL https://api.semanticscholar.org/CorpusID:28695052.
  • Sferrazza et al. [2024] Carmelo Sferrazza, Dun-Ming Huang, Xingyu Lin, Youngwoon Lee, and Pieter Abbeel. Humanoidbench: Simulated humanoid benchmark for whole-body locomotion and manipulation. arXiv preprint arXiv:2403.10506, 2024.
  • Tessler et al. [2023] Chen Tessler, Yoni Kasten, Yunrong Guo, Shie Mannor, Gal Chechik, and Xue Bin Peng. Calm: Conditional adversarial latent models for directable virtual characters. In ACM SIGGRAPH 2023 Conference Proceedings, 2023.
  • Tunyasuvunakool et al. [2020] Saran Tunyasuvunakool, Alistair Muldal, Yotam Doron, Siqi Liu, Steven Bohez, Josh Merel, Tom Erez, Timothy Lillicrap, Nicolas Heess, and Yuval Tassa. dm_control: Software and tasks for continuous control. Software Impacts, 6:100022, 2020.
  • Wang et al. [2024a] Jiashun Wang, Jessica Hodgins, and Jungdam Won. Strategy and skill learning for physics-based table tennis animation. In ACM SIGGRAPH 2024 Conference Proceedings, SIGGRAPH 2024, 2024a.
  • Wang et al. [2023] Yinhuai Wang, Jing Lin, Ailing Zeng, Zhengyi Luo, Jian Zhang, and Lei Zhang. Physhoi: Physics-based imitation of dynamic human-object interaction. arXiv preprint arXiv:2312.04393, 2023.
  • Wang et al. [2024b] Yufu Wang, Ziyun Wang, Lingjie Liu, and Kostas Daniilidis. Tram: Global trajectory and motion of 3d humans from in-the-wild videos. arXiv preprint arXiv:2403.17346, 2024b.
  • Won et al. [2021] Jungdam Won, Deepak Gopinath, and Jessica Hodgins. Control strategies for physically simulated characters performing two-player competitive sports. ACM Trans. Graph., 40:1–11, 2021. ISSN 0730-0301.
  • Won et al. [2022] Jungdam Won, Deepak Gopinath, and Jessica Hodgins. Physics-based character controllers using conditional vaes. ACM Trans. Graph., 41:1–12, 2022. ISSN 0730-0301.
  • Xie et al. [2022] Zhaoming Xie, Sebastian Starke, Hung Yu Ling, and Michiel van de Panne. Learning soccer juggling skills with layer-wise mixture-of-experts. In ACM SIGGRAPH 2022 Conference Proceedings, pages 1–9, 2022.
  • Xu et al. [2023] Pei Xu, Xiumin Shang, Victor Zordan, and Ioannis Karamouzas. Composite motion learning with task control. ACM Transactions on Graphics (TOG), 42(4):1–16, 2023.
  • Yao et al. [2022] Heyuan Yao, Zhenhua Song, Baoquan Chen, and Libin Liu. Controlvae: Model-based learning of generative controllers for physics-based characters. arXiv preprint arXiv:2210.06063, 2022.
  • Ye et al. [2023] Vickie Ye, Georgios Pavlakos, Jitendra Malik, and Angjoo Kanazawa. Decoupling human and camera motion from videos in the wild. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 21222–21232, 2023.
  • Yin et al. [2021] Zhiqi Yin, Zeshi Yang, Michiel Van De Panne, and KangKang Yin. Discovering diverse athletic jumping strategies. ACM Transactions on Graphics (TOG), 40(4):1–17, 2021.
  • Zhang et al. [2023] Haotian Zhang, Ye Yuan, Viktor Makoviychuk, Yunrong Guo, Sanja Fidler, Xue Bin Peng, and Kayvon Fatahalian. Learning physically simulated tennis skills from broadcast videos. ACM Trans. Graph., 42:1–14, 2023. ISSN 0730-0301,1557-7368.
  • Zhou et al. [2019] Yi Zhou, Connelly Barnes, Jingwan Lu, Jimei Yang, and Hao Li. On the continuity of rotation representations in neural networks. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2019-June:5738–5746, 2019. ISSN 1063-6919.
  • Zhu et al. [2023] Qingxu Zhu, He Zhang, Mengting Lan, and Lei Han. Neural categorical priors for physics-based character control. arXiv preprint arXiv:2308.07200, 2023.

Appendix

Appendix A Introduction

In the appendix, we provide comprehensive implementation details for SMPLOlympics, including the reward designs for each sport environment, training procedures, and hyperparameters. Extensive qualitative results can be accessed on our supplement site, where we provide visualizations of all sports environments and training results based on our preliminary reward designs. Baseline results (PPO, AMP, PULSE, PULSE+AMP) are presented to support the quantitative findings discussed in the main paper. Furthermore, we offer visualizations of the reference motion extracted from in-the-wild videos. For our pipeline to acquire the human demonstration in the SMPL format, we conduct an ablation study evaluating the impact of employing a motion imitator (PHC Luo et al. [2023b]) as a refinement step. Code, videos, and asset attributions can also be found in our supplementary materials.

Appendix B Implementation Details

B.1 Rewards and Termination Conditions

High Jump

For high jump, the humanoid’s task is to leap over a horizontal bar positioned 20m ahead and 6m to the left of its starting point. The humanoid aims to reach the goal point $\boldsymbol{p}^{\text{g-high jump}} = (22, 6, 1)$ located 2m behind the bar. The reward function is defined as follows:

$$
\boldsymbol{\mathcal{R}}^{\text{high jump}}(\boldsymbol{s}^{\text{p}}_{t}, \boldsymbol{s}^{\text{g-high\_jump}}_{t}) \triangleq
\begin{cases}
1 \times r^{\text{p}}_{t} & \text{if } \boldsymbol{p}^{\text{p}}_{t,x} \leq 19.5\text{m}, \\
1 \times r^{\text{p}}_{t} + 1 \times r^{\text{h}}_{t} & \text{if } 19.5\text{m} < \boldsymbol{p}^{\text{p}}_{t,x} < 20.5\text{m}, \\
1 \times r^{\text{p}}_{t} & \text{if } 20.5\text{m} \leq \boldsymbol{p}^{\text{p}}_{t,x},
\end{cases}
\tag{1}
$$

where $\boldsymbol{p}^{\text{p}}_{t,x}$ denotes the x-axis position. The height reward, $r^{\text{h}}_{t} = \boldsymbol{p}^{\text{p}}_{t,z}$, with $\boldsymbol{p}^{\text{p}}_{t,z}$ representing the z-axis position, incentivizes the humanoid to jump higher. The position reward, $r^{\text{p}}_{t} = \|\boldsymbol{p}^{\text{p}}_{t-1} - \boldsymbol{p}^{\text{g-high jump}}\|_{2} - \|\boldsymbol{p}^{\text{p}}_{t} - \boldsymbol{p}^{\text{g-high jump}}\|_{2}$ (clamped to $[0, 1]$), motivates the humanoid to reach the goal. An episode is terminated if the humanoid falls down, fails to leap over the bar, or moves beyond the designated run-up area.
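As a rough illustration of Eq. (1), the piecewise reward can be sketched in a few lines of Python. This is a minimal sketch rather than the released implementation: it assumes the humanoid position is available as an `(x, y, z)` tuple in meters, and the function names `position_reward` and `high_jump_reward` are hypothetical. Termination conditions are not modeled.

```python
import math

# Goal point from the text, 2m behind the bar at x = 20m.
GOAL_HIGH_JUMP = (22.0, 6.0, 1.0)


def position_reward(prev_pos, cur_pos, goal):
    """Progress toward the goal over one step, clamped to [0, 1]."""
    progress = math.dist(prev_pos, goal) - math.dist(cur_pos, goal)
    return min(max(progress, 0.0), 1.0)


def high_jump_reward(prev_pos, cur_pos, goal=GOAL_HIGH_JUMP):
    """Piecewise reward of Eq. (1); positions are (x, y, z) tuples (assumed)."""
    r_p = position_reward(prev_pos, cur_pos, goal)
    x, _, z = cur_pos
    if 19.5 < x < 20.5:
        # Near the bar, the height term r_h = z is added with weight 1.
        return 1.0 * r_p + 1.0 * z
    # Before and after the bar region, only the position term is used.
    return 1.0 * r_p


if __name__ == "__main__":
    print(high_jump_reward((19.0, 6.0, 1.0), (19.2, 6.0, 1.0)))  # run-up
    print(high_jump_reward((19.9, 6.0, 1.5), (20.0, 6.0, 1.8)))  # over the bar
```

Because the height term is only active within half a meter of the bar, the humanoid is rewarded for being high exactly where clearance matters, while the progress term drives the run-up and landing.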

Long Jump

In the long jump environment, the humanoid has a 20-meter runway before the jump line, which its feet should not exceed. The humanoid’s goal is to reach the goal position, $\boldsymbol{p}^{\text{g-long jump}} = (30, 0, 1)$. The training reward is defined as follows:

$$
\boldsymbol{\mathcal{R}}^{\text{long jump}}(\boldsymbol{s}^{\text{p}}_{t}, \boldsymbol{s}^{\text{g-long\_jump}}_{t}) \triangleq
\begin{cases}
1 \times r^{\text{p}}_{t} + 0.01 \times r^{\text{v}}_{t} & \text{if } \boldsymbol{p}^{\text{p}}_{t,x} \leq 20\text{m}, \\
1 \times r^{\text{p}}_{t} + 0.01 \times r^{\text{v}}_{t} + 0.1 \times r^{\text{h}}_{t} + 30 \times r^{\text{l}}_{t} & \text{if } 20\text{m} < \boldsymbol{p}^{\text{p}}_{t,x}.
\end{cases}
\tag{2}
$$

The position reward, $r^{\text{p}}_{t} = \|\boldsymbol{p}^{\text{p}}_{t-1} - \boldsymbol{p}^{\text{g-long jump}}\|_{2} - \|\boldsymbol{p}^{\text{p}}_{t} - \boldsymbol{p}^{\text{g-long jump}}\|_{2}$ (clamped to $[0, 1]$), encourages the humanoid to reach the goal point. The velocity reward, $r^{\text{v}}_{t} = \boldsymbol{v}^{\text{p}}_{t,x}$, prompts the humanoid to reach a higher speed along the x-axis. The jump height reward, $r^{\text{h}}_{t} = \boldsymbol{p}^{\text{p}}_{t,z}$, encourages the humanoid to jump higher after reaching the jump line. The jump length reward, $r^{\text{l}}_{t} = \boldsymbol{p}^{\text{p}}_{t,x} - 20$, promotes a longer final jump length. Each episode terminates if the humanoid falls or runs off the track.
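To make the two branches of Eq. (2) concrete, here is a minimal Python sketch. It assumes positions are `(x, y, z)` tuples in meters and that the x-axis velocity is passed in separately; the function name `long_jump_reward` and its argument layout are hypothetical, and episode termination is not modeled.

```python
import math

GOAL_LONG_JUMP = (30.0, 0.0, 1.0)  # goal position from the text
JUMP_LINE_X = 20.0                 # end of the 20-meter runway


def long_jump_reward(prev_pos, cur_pos, vel_x, goal=GOAL_LONG_JUMP):
    """Piecewise reward of Eq. (2); inputs are assumed to be in meters and m/s."""
    # Position term: progress toward the goal, clamped to [0, 1].
    progress = math.dist(prev_pos, goal) - math.dist(cur_pos, goal)
    r_p = min(max(progress, 0.0), 1.0)
    # Velocity term: forward speed along the x-axis.
    r_v = vel_x
    reward = 1.0 * r_p + 0.01 * r_v

    x, _, z = cur_pos
    if x > JUMP_LINE_X:
        r_h = z                # jump height term
        r_l = x - JUMP_LINE_X  # jump length term
        reward += 0.1 * r_h + 30.0 * r_l
    return reward


if __name__ == "__main__":
    print(long_jump_reward((10.0, 0.0, 1.0), (10.1, 0.0, 1.0), vel_x=5.0))
    print(long_jump_reward((20.9, 0.0, 1.2), (21.0, 0.0, 1.2), vel_x=6.0))
```

With a weight of 30, the jump length term dominates once the humanoid is past the line, so distance beyond 20m is by far the largest contribution to the reward.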

Hurdling

In the hurdling task, the humanoid aims to reach a finish line 110m ahead while jumping over 10 hurdles, each 1.067m high. The first hurdle is placed 13.72m from the start, with subsequent hurdles spaced every 9.14m. The reward function, which encourages the agent to run towards the finish line and clear each hurdle, is defined as:

$$
\boldsymbol{\mathcal{R}}^{\text{hurdling}}(\boldsymbol{s}^{\text{p}}_{t}, \boldsymbol{s}^{\text{g-hurdling}}_{t}) \triangleq 1 \times r^{\text{distance}}_{t}
\tag{3}
$$

The distance reward, $r^{\text{distance}}_{t} = \|\boldsymbol{p}^{\text{p}}_{t-1} - \boldsymbol{p}^{\text{g-hurdling}}\|_{2} - \|\boldsymbol{p}^{\text{p}}_{t} - \boldsymbol{p}^{\text{g-hurdling}}\|_{2}$, is clamped to $[0, 1]$ and encourages the humanoid to get closer to the goal point. We terminate each episode if the character falls or runs off the track.
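The track layout and the distance reward of Eq. (3) can be sketched as follows. This is an illustrative sketch under stated assumptions: the goal point `p^{g-hurdling}` is not given numerically in the text, so the value used in the example call is hypothetical, as are the function names.

```python
import math

NUM_HURDLES = 10
FIRST_HURDLE_X = 13.72  # meters from the start
HURDLE_SPACING = 9.14   # meters between consecutive hurdles
HURDLE_HEIGHT = 1.067   # meters
FINISH_X = 110.0        # meters


def hurdle_positions():
    """x-coordinates of the 10 hurdles along the track."""
    return [FIRST_HURDLE_X + i * HURDLE_SPACING for i in range(NUM_HURDLES)]


def distance_reward(prev_pos, cur_pos, goal):
    """Per-step progress toward the goal, clamped to [0, 1] (Eq. 3)."""
    progress = math.dist(prev_pos, goal) - math.dist(cur_pos, goal)
    return min(max(progress, 0.0), 1.0)


if __name__ == "__main__":
    xs = hurdle_positions()
    print(f"last hurdle at {xs[-1]:.2f} m, {FINISH_X - xs[-1]:.2f} m before the finish")
    # Hypothetical goal point on the finish line; not specified in the text.
    assumed_goal = (FINISH_X, 0.0, 1.0)
    print(distance_reward((0.0, 0.0, 1.0), (0.3, 0.0, 1.0), assumed_goal))
```

Note that the reward contains no explicit hurdle term: clearing hurdles is encouraged only indirectly, because a fall terminates the episode and ends further progress.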

Golf

In the golf task, the humanoid is equipped with a golf club of dimensions of 0.05m×0.025m×0.02m0.05m0.025m0.02m0.05\text{m}\times 0.025\text{m}\times 0.02\text{m}0.05 m × 0.025 m × 0.02 m. The target location for the golf ball is positioned to the left of the humanoid, in the direction of the x-axis, at a distance ranging from 0m to 20m. The reward function is defined as follows:

𝓡golf(𝒔tp,𝒔tg-golf)1×rtp+1×rtc+1×rtg+1×rtpredsuperscript𝓡golfsubscriptsuperscript𝒔p𝑡subscriptsuperscript𝒔g-golf𝑡1subscriptsuperscript𝑟p𝑡1subscriptsuperscript𝑟c𝑡1subscriptsuperscript𝑟g𝑡1subscriptsuperscript𝑟pred𝑡\boldsymbol{\mathcal{R}}^{\text{golf}}({\boldsymbol{s}^{\text{p}}_{t}},{% \boldsymbol{s}^{\text{g-golf}}_{t}})\triangleq 1\times r^{\text{p}}_{t}+1% \times r^{\text{c}}_{t}+1\times r^{\text{g}}_{t}+1\times r^{\text{pred}}_{t}\\ bold_caligraphic_R start_POSTSUPERSCRIPT golf end_POSTSUPERSCRIPT ( bold_italic_s start_POSTSUPERSCRIPT p end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , bold_italic_s start_POSTSUPERSCRIPT g-golf end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) ≜ 1 × italic_r start_POSTSUPERSCRIPT p end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT + 1 × italic_r start_POSTSUPERSCRIPT c end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT + 1 × italic_r start_POSTSUPERSCRIPT g end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT + 1 × italic_r start_POSTSUPERSCRIPT pred end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT (4)

The position reward, rtp𝒑t1ball𝒑ttar2𝒑tball𝒑ttar2subscriptsuperscript𝑟p𝑡subscriptnormsubscriptsuperscript𝒑ball𝑡1subscriptsuperscript𝒑tar𝑡2subscriptnormsubscriptsuperscript𝒑ball𝑡subscriptsuperscript𝒑tar𝑡2r^{\text{p}}_{t}\triangleq\|{\boldsymbol{p}^{\text{ball}}_{t-1}}-{\boldsymbol{% p}^{\text{tar}}_{t}}\|_{2}-\|{\boldsymbol{p}^{\text{ball}}_{t}}-{\boldsymbol{p% }^{\text{tar}}_{t}}\|_{2}italic_r start_POSTSUPERSCRIPT p end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ≜ ∥ bold_italic_p start_POSTSUPERSCRIPT ball end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT - bold_italic_p start_POSTSUPERSCRIPT tar end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT - ∥ bold_italic_p start_POSTSUPERSCRIPT ball end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT - bold_italic_p start_POSTSUPERSCRIPT tar end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT, clamped such that 0<rtp<10subscriptsuperscript𝑟p𝑡10<r^{\text{p}}_{t}<10 < italic_r start_POSTSUPERSCRIPT p end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT < 1, encourages the ball to get closer to the target. The contact reward rtcsubscriptsuperscript𝑟c𝑡r^{\text{c}}_{t}italic_r start_POSTSUPERSCRIPT c end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT encourages swinging the golf club to hit the ball, defined as:

rtc={1×exp(100×𝒑tball𝒑tclub2)if Ccb=0,1if Ccb=1.subscriptsuperscript𝑟c𝑡cases1100superscriptnormsubscriptsuperscript𝒑ball𝑡subscriptsuperscript𝒑club𝑡2if subscript𝐶cb01if subscript𝐶cb1r^{\text{c}}_{t}=\begin{cases}1\times\exp(-100\times\|{\boldsymbol{p}^{\text{% ball}}_{t}}-{\boldsymbol{p}^{\text{club}}_{t}}\|^{2})&\text{if \ }C_{\text{cb}% }=0,\\ 1&\text{if \ }C_{\text{cb}}=1.\end{cases}italic_r start_POSTSUPERSCRIPT c end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT = { start_ROW start_CELL 1 × roman_exp ( - 100 × ∥ bold_italic_p start_POSTSUPERSCRIPT ball end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT - bold_italic_p start_POSTSUPERSCRIPT club end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ) end_CELL start_CELL if italic_C start_POSTSUBSCRIPT cb end_POSTSUBSCRIPT = 0 , end_CELL end_ROW start_ROW start_CELL 1 end_CELL start_CELL if italic_C start_POSTSUBSCRIPT cb end_POSTSUBSCRIPT = 1 . end_CELL end_ROW (5)

Here, $C_{\text{cb}} = 0$ indicates that the club has not made contact with the ball and $C_{\text{cb}} = 1$ indicates the club has made contact. The goal reward, $r^{\text{g}}_{t} = \exp(-0.1 \times \|\boldsymbol{p}^{\text{ball}}_{t,xy} - \boldsymbol{p}^{\text{tar}}_{t,xy}\|^{2})$, encourages the ball to reach the target position in the x-y plane. In addition, we predict the ball’s trajectory and provide a dense reward $r^{\text{pred}}_{t} = \exp(-0.1 \times \|\boldsymbol{p}^{\text{land}} - \boldsymbol{p}^{\text{ball}}_{t,xy}\|^{2})$ based on the distance between the predicted landing point and the goal on the x-y plane Zhang et al. [2023].
The landing position, $\boldsymbol{p}^{\text{land}} = \left(x^{\text{land}}, y^{\text{land}}\right)$, can be calculated using the initial position and velocity as follows ($g$ is gravity):

$$x^{\text{land}} = x_{0} + v_{0,x}\left(\frac{v_{0,z} + \sqrt{v_{0,z}^{2} + 2 g z_{0}}}{g}\right), \quad y^{\text{land}} = y_{0} + v_{0,y}\left(\frac{v_{0,z} + \sqrt{v_{0,z}^{2} + 2 g z_{0}}}{g}\right) \tag{6}$$
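As a minimal sketch of Eq. (6), assuming flat ground at $z = 0$, no drag or spin, and $g$ given as a positive magnitude, the landing point could be computed as below. The function name `predict_landing_xy` and the default `g=9.81` are illustrative choices, not taken from the paper.

```python
import math


def predict_landing_xy(p0, v0, g=9.81):
    """Predict where a ball in free flight reaches ground height z = 0.

    p0, v0: (x, y, z) initial position and velocity.
    g: magnitude of gravitational acceleration (assumed positive).
    """
    x0, y0, z0 = p0
    vx, vy, vz = v0
    # Positive root of z0 + vz * t - 0.5 * g * t^2 = 0, as in Eq. (6).
    t_land = (vz + math.sqrt(vz * vz + 2.0 * g * z0)) / g
    return (x0 + vx * t_land, y0 + vy * t_land)
```

The bracketed term in Eq. (6) is the flight time, so the horizontal coordinates simply advance at constant velocity for that duration.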

Early termination is triggered if the ball moves backward, does not contact the golf club within 2 seconds, is too close to the humanoid’s body, or the humanoid falls.
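The position, contact, and goal terms of the golf reward can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the function names are hypothetical, positions are plain 3-tuples, and the clamp uses the closed interval [0, 1] where the text states strict bounds. The sketch does not combine the terms into the full golf reward.

```python
import math


def _dist(a, b):
    """Euclidean distance between two equally sized coordinate tuples."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def position_reward(ball_prev, ball_now, target):
    """r^p: per-step decrease in ball-to-target distance, clamped to [0, 1]."""
    progress = _dist(ball_prev, target) - _dist(ball_now, target)
    return min(max(progress, 0.0), 1.0)


def contact_reward(ball, club, has_contacted):
    """r^c (Eq. 5): dense shaping before contact, constant 1 afterwards."""
    if has_contacted:
        return 1.0
    return math.exp(-100.0 * _dist(ball, club) ** 2)


def goal_reward(ball, target):
    """r^g: closeness of ball and target in the x-y plane only."""
    return math.exp(-0.1 * _dist(ball[:2], target[:2]) ** 2)
```

With a coefficient of 100, the pre-contact term decays quickly: a club 10 cm from the ball already yields only about exp(-1).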

Javelin

For javelin throw, the humanoid is equipped with a javelin of length 2.7m. Due to the complexity introduced by articulated fingers, the reward function $\boldsymbol{\mathcal{R}}^{\text{javelin}}$ is applied in three stages: first, the humanoid learns to hold the javelin stably; then, it learns to throw it; finally, the javelin flies as far as possible. A timer is used to differentiate the three stages. Specifically, $\boldsymbol{\mathcal{R}}^{\text{javelin}}$ is defined as follows:

$$\boldsymbol{\mathcal{R}}^{\text{javelin}}(\boldsymbol{s}^{\text{p}}_{t}, \boldsymbol{s}^{\text{g-javelin}}_{t}) \triangleq \begin{cases} 0.9 \times r^{\text{grab}}_{t} + 0.1 \times r^{\text{js}}_{t} & \text{if } t < 0.6s, \\ 0.9 \times r^{\text{goal}}_{t} + 0.05 \times r^{\text{s}}_{t} - 0.05 \times r^{\text{grab}}_{t} & \text{if } 0.6s \leq t < 1.2s, \\ 0.9 \times r^{\text{goal}}_{t} + 0.1 \times r^{\text{js}}_{t} & \text{if } 1.2s \leq t. \end{cases} \tag{7}$$

The reward for grasping, $r^{\text{grab}}_{t} = \exp(-1 \times \|\boldsymbol{p}^{\text{right-hand}}_{t} - \boldsymbol{p}^{\text{javelin}}_{t}\|^{2})$, encourages the hand to stay close to the javelin. The javelin stability reward, $r^{\text{js}}_{t} = \exp(-1 \times \|\boldsymbol{q}^{\text{javelin}}_{t} - \boldsymbol{q}^{\text{javelin-default}}_{t}\|^{2})$, encourages the 6 DoF pose of the javelin to remain close to the default pose, which faces forward and tilts 30 degrees upward, mimicking a flying pose.
The humanoid stability reward, $r^{\text{s}}_{t} = \exp(-1 \times \|\boldsymbol{p}^{\text{root}}_{t}\|^{2})$, encourages the humanoid to keep its root position fixed. The termination conditions vary according to the stage: during the grasping and throwing stages, the episode terminates if the javelin is too far from the right hand or deviates significantly from the default pose $\boldsymbol{q}^{\text{javelin-default}}_{t}$. During the flying stage, termination occurs if the javelin is too close to the right hand.
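The timer-based staging of Eq. (7) can be sketched as below. This is a rough illustration, not the released code: the function names are hypothetical, the javelin pose is treated as a flat vector (how the 6 DoF pose difference is measured is an assumption here), and $r^{\text{goal}}_{t}$ is passed in as a precomputed value.

```python
import math


def _sq_dist(a, b):
    """Squared Euclidean distance between two equally sized vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def grab_reward(hand_pos, javelin_pos):
    """r^grab: keeps the right hand close to the javelin."""
    return math.exp(-1.0 * _sq_dist(hand_pos, javelin_pos))


def javelin_stability_reward(javelin_pose, default_pose):
    """r^js: keeps the javelin pose (flat vector, an assumption) near the default."""
    return math.exp(-1.0 * _sq_dist(javelin_pose, default_pose))


def humanoid_stability_reward(root_pos):
    """r^s: keeps the humanoid root near the origin."""
    return math.exp(-1.0 * sum(c * c for c in root_pos))


def javelin_reward(t, r_grab, r_js, r_s, r_goal):
    """Eq. (7): select the reward mix from the elapsed time t in seconds."""
    if t < 0.6:  # holding stage
        return 0.9 * r_grab + 0.1 * r_js
    if t < 1.2:  # throwing stage; the grab term is subtracted here
        return 0.9 * r_goal + 0.05 * r_s - 0.05 * r_grab
    return 0.9 * r_goal + 0.1 * r_js  # flying stage
```

Note the sign change on the grab term in the second stage: staying close to the javelin is rewarded while holding and penalized while throwing.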

B.2 Multi-person Sports

Tennis

For tennis, each humanoid is equipped with a circular racket with a 15cm radius, positioned 35cm away from the wrist, replacing the right hand. The court measures 23.77m in length and 8.23m in width, mirroring the dimensions and layout of a real tennis court. The net height is 1m, and the simulated ball has a radius of 3.2cm. We design two tasks: a single-player ball return task, where the humanoid trains to hit balls launched randomly, and a 1v1 mode, where the humanoid competes against another humanoid. In the ball return task, the humanoid is positioned at the center of the baseline, with balls launched from the opposite side. The landing location is uniformly sampled on the opposite side and the ball launch velocity is randomly sampled. The reward function is defined as follows:

$$\boldsymbol{\mathcal{R}}^{\text{tennis}}(\boldsymbol{s}^{\text{p}}_{t}, \boldsymbol{s}^{\text{g-tennis}}_{t}) \triangleq \begin{cases} 1 \times r^{\text{racket}}_{t} + 0 \times r^{\text{ball}}_{t}, & \text{if } C_{\text{rb}} = 0, \\ 0 \times r^{\text{racket}}_{t} + 1 \times r^{\text{ball}}_{t}, & \text{if } C_{\text{rb}} = 1. \end{cases} \tag{8}$$

Here, $C_{\text{rb}} = 0$ indicates that the racket has not made contact with the ball, and $C_{\text{rb}} = 1$ indicates the racket has made contact. The racket reward, $r^{\text{racket}}_{t} = \exp(-1 \times \|\boldsymbol{p}^{\text{racket}}_{t} - \boldsymbol{p}^{\text{ball}}_{t}\|^{2})$, rewards the racket for getting closer to the ball. The ball reward, $r^{\text{ball}}_{t} = 1 + \exp(-1 \times \|\boldsymbol{p}^{\text{land}} - \boldsymbol{p}^{\text{tar}}_{t}\|^{2})$, encourages the predicted landing location of the ball to be close to the target.
Similar to the golf task, the landing location of the ball is calculated based on $\boldsymbol{p}^{\text{ball}}_{t}$ and $\boldsymbol{v}^{\text{ball}}_{t}$, providing a dense reward function to facilitate training Zhang et al. [2023]. Early termination occurs if the humanoid loses the point, either by failing to catch the ball or by hitting the ball out of bounds. In the 1v1 mode, two humanoids are placed on opposite sides of the court and the first ball is launched from the middle of the court, randomly directed at each player. The same reward function as the ball return task is used. To facilitate 1v1 training, the pre-trained model from the ball return task is used as a warm start. Similarly, the episode terminates if one player fails to catch the ball or returns the ball out of bounds.
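The contact-switched tennis reward of Eq. (8) can be sketched as follows. The function names and the tuple-based inputs are illustrative assumptions, and the predicted landing point is taken as a given input rather than computed here.

```python
import math


def _sq_dist(a, b):
    """Squared Euclidean distance between two equally sized vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def racket_reward(racket_pos, ball_pos):
    """r^racket: pulls the racket toward the ball before contact."""
    return math.exp(-1.0 * _sq_dist(racket_pos, ball_pos))


def ball_reward(landing_xy, target_xy):
    """r^ball: 1 plus a closeness term on the predicted landing point."""
    return 1.0 + math.exp(-1.0 * _sq_dist(landing_xy, target_xy))


def tennis_reward(has_contacted, racket_pos, ball_pos, landing_xy, target_xy):
    """Eq. (8): racket term before contact (C_rb = 0), ball term after (C_rb = 1)."""
    if not has_contacted:
        return racket_reward(racket_pos, ball_pos)
    return ball_reward(landing_xy, target_xy)
```

Because the ball term carries a constant offset of 1, any post-contact step scores above 1 while the pre-contact term is at most 1, so making contact is never worse than merely approaching the ball.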

Table Tennis

For table tennis, each humanoid is equipped with a circular paddle with an 8 cm radius, positioned 12 cm from the wrist, replacing the right hand. The table adheres to standard dimensions, featuring a playing surface 2.74 m in length and 1.525 m in width, standing 0.76 m high. The net is 15.25 cm high, and the table tennis ball has a radius of 2 cm. The setup includes a single-player ball return task and a 1v1 task. The reward function is designed similarly to tennis, except we define the ball reward as $r^{\text{ball}}_{t} = 1 + \exp(-1 \times \|\boldsymbol{p}^{\text{land}} - \boldsymbol{p}^{\text{tar}}_{t}\|^{2}) + N_{\text{hit}}$, where $N_{\text{hit}}$ counts the number of successful hits in one episode. This formulation is intended to encourage the humanoid to continuously hit the ball effectively. Unlike in golf and tennis, we calculate $\boldsymbol{p}^{\text{land}}$ when it lands on the table at a height of 0.76 m. For early termination and the warm start in 1v1, we maintain the same setting as in the tennis task.
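One way to predict the landing point at table height is to solve the same ballistic equation for $z = 0.76$ m instead of $z = 0$. The sketch below is an assumption about how this could be done, not the paper's code: the function names, the default `g=9.81`, and the `None` return for an unreachable height are all illustrative.

```python
import math

TABLE_HEIGHT = 0.76  # metres, table surface height stated in the text


def predict_table_landing_xy(p0, v0, table_height=TABLE_HEIGHT, g=9.81):
    """Predict the (x, y) point where a ball in free flight descends to table height."""
    x0, y0, z0 = p0
    vx, vy, vz = v0
    # Solve z0 + vz * t - 0.5 * g * t^2 = table_height for the later root.
    disc = vz * vz + 2.0 * g * (z0 - table_height)
    if disc < 0.0:
        return None  # the ball never reaches table height
    t_land = (vz + math.sqrt(disc)) / g
    return (x0 + vx * t_land, y0 + vy * t_land)


def table_tennis_ball_reward(landing_xy, target_xy, n_hit):
    """r^ball for table tennis: the tennis ball term plus the hit count N_hit."""
    sq = sum((a - b) ** 2 for a, b in zip(landing_xy, target_xy))
    return 1.0 + math.exp(-1.0 * sq) + n_hit
```

Since $N_{\text{hit}}$ enters the reward additively, each further successful hit raises the per-step reward for the rest of the episode.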

Fencing

For 1v1 fencing, similar to real-world fencing, the two players are confined to a 14m by 2m playing area, where stepping out of bounds resets the game. The fencing reward is structured similarly to the boxing setup in NCP Zhu et al. [2023]:

$$\boldsymbol{\mathcal{R}}^{\text{fencing}}(\boldsymbol{s}^{\text{p}}_{t}, \boldsymbol{s}^{\text{g-fencing}}_{t}) \triangleq 0.1 \times r^{\text{facing}}_{t} + 0.1 \times r^{\text{vel}}_{t} + 0.6 \times r^{\text{strike}}_{t} + 1 \times r^{\text{point}}_{t}. \tag{9}$$

The facing reward $r^{\text{facing}}_{t}$ penalizes deviation from facing the opponent’s root position $\boldsymbol{p}^{\text{opp-root}}_{t}$. The velocity reward, $r^{\text{vel}}_{t}$, encourages the x-y plane linear velocity to be directed towards the opponent’s root position $\boldsymbol{p}^{\text{opp-root}}_{t}$. The strike reward, $r^{\text{strike}}_{t} = \exp(-10 \times \operatorname*{argmin} \|\boldsymbol{p}^{\text{sword}}_{t} - \boldsymbol{p}^{\text{opp-target}}_{t}\|^{2})$, encourages the sword tip to get closer to the target body parts $\boldsymbol{p}^{\text{opp-target}}_{t}$, which include the pelvis, head, spine, chest, and torso.
If there is contact with the target body part with sufficient force, a positive reward is provided:

$$r^{\text{point}}_{t} = \begin{cases} 1 & \text{if } \operatorname*{argmin} \|\boldsymbol{p}^{\text{sword}}_{t} - \boldsymbol{p}^{\text{opp-target}}_{t}\|^{2} \leq 0.1 \enskip \text{and} \enskip \text{contact force} \geq 50\text{Nm}, \\ 0 & \text{otherwise}. \end{cases} \tag{10}$$

Our fencing agents are trained using competitive self-play, as introduced in the main paper.
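The strike and point terms of Eqs. (9) and (10) can be sketched as below. The sketch reads the `argmin` in the equations as the minimum squared distance from the sword tip to any target body part, which is an interpretation rather than something the text states. Function names and argument layouts are hypothetical, and the facing and velocity terms are passed in as precomputed values.

```python
import math


def min_sq_dist(tip_pos, target_positions):
    """Smallest squared distance from the sword tip to any target body part."""
    return min(
        sum((a - b) ** 2 for a, b in zip(tip_pos, p)) for p in target_positions
    )


def strike_reward(tip_pos, target_positions):
    """r^strike: dense shaping toward the closest target body part."""
    return math.exp(-10.0 * min_sq_dist(tip_pos, target_positions))


def point_reward(tip_pos, target_positions, contact_force,
                 dist_threshold=0.1, force_threshold=50.0):
    """r^point (Eq. 10): 1 only for a close hit with sufficient contact force."""
    close = min_sq_dist(tip_pos, target_positions) <= dist_threshold
    return 1.0 if close and contact_force >= force_threshold else 0.0


def fencing_reward(r_facing, r_vel, r_strike, r_point):
    """Eq. (9): weighted sum of the four fencing terms."""
    return 0.1 * r_facing + 0.1 * r_vel + 0.6 * r_strike + 1.0 * r_point
```

The same composition would apply to boxing by passing the hand position in place of the sword tip.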

Boxing

For boxing, the humanoid competes in a boxing ring measuring 5m by 5m. The humanoid’s right hand is replaced with a sphere of 8cm radius. The boxing reward function has the same composition as fencing, except that the sword tip position $\boldsymbol{p}^{\text{sword}}_{t}$ is replaced by the hand position $\boldsymbol{p}^{\text{hand}}_{t}$. Our boxing agents are also trained using competitive self-play.

Soccer

The soccer field measures 32m in length and 20m in width. Each goal is 4m wide and 2m tall. The ball has a diameter of 11.5 cm and weighs 450 grams. For the penalty kick task, the reward function $\boldsymbol{\mathcal{R}}^{\text{soccer-kick}}(\boldsymbol{s}^{\text{p}}_{t}, \boldsymbol{s}^{\text{g-kick}}_{t}) \triangleq w^{\text{p2b}} r^{\text{p2b}} + w^{\text{b2g}} r^{\text{b2g}} + w^{\text{bv2g}} r^{\text{bv2g}} + w^{\text{b2t}} r^{\text{b2t}} - c^{\text{no-dribble}}_{t}$ is divided into stages based on whether the ball is moving toward the goal.
Specifically, we define a "closer to goal" variable as $g^{\text{ball-to-goal}}_{t} = \|\boldsymbol{p}^{\text{goal-target}}_{t} - \boldsymbol{p}^{\text{ball}}_{t-1}\|_{2} - \|\boldsymbol{p}^{\text{goal-target}}_{t} - \boldsymbol{p}^{\text{ball}}_{t}\|_{2}$, which indicates whether the ball is getting closer to the goal. The full reward function is defined as follows:

$$\boldsymbol{\mathcal{R}}^{\text{soccer-kick}}(\boldsymbol{s}^{\text{p}}_{t}, \boldsymbol{s}^{\text{g-kick}}_{t}) \triangleq \begin{cases} 0.4 \times r^{\text{p2b}} - c^{\text{no-dribble}}_{t} & \text{if } g^{\text{ball-to-goal}}_{t} \leq 0, \\ 0.1 \times r^{\text{b2g}} + 0.1 \times r^{\text{bv2g}} + 0.8 \times r^{\text{b2t}} - c^{\text{no-dribble}}_{t} & \text{otherwise}. \end{cases} \tag{11}$$

Essentially, if the ball is not moving toward the goal, the humanoid is encouraged to move toward the ball; if the ball is moving toward the goal, the agent is rewarded for shooting the ball toward the target in the goal post. The player-to-ball reward, $r^{\text{p2b}} = \|\boldsymbol{p}^{\text{root}}_{t-1} - \boldsymbol{p}^{\text{ball}}_{t-1}\|_{2} - \|\boldsymbol{p}^{\text{root}}_{t} - \boldsymbol{p}^{\text{ball}}_{t}\|_{2}$, is a point-goal reward Won et al. [2022].
The ball-to-goal reward $r^{\text{b2g}} = \|\boldsymbol{p}^{\text{goal-target}}_{t} - \boldsymbol{p}^{\text{ball}}_{t-1}\|_{2} - \|\boldsymbol{p}^{\text{goal-target}}_{t} - \boldsymbol{p}^{\text{ball}}_{t}\|_{2}$ encourages the ball to move closer to the goal position. The ball-velocity-to-goal reward $r^{\text{bv2g}}$ incentivizes the ball velocity toward the goal position. The ball-to-target reward $r^{\text{b2t}}$ predicts the landing position of the ball in the net based on its current velocity and position, providing a reward if the ball is close to the target. Finally, $c^{\text{no-dribble}}_{t}$ penalizes the humanoid if its root position is over the ball’s spawning point.
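The stage switch of Eq. (11) can be sketched as follows. This is only an illustration: the function names are hypothetical, and $r^{\text{bv2g}}$, $r^{\text{b2t}}$, and $c^{\text{no-dribble}}_{t}$ are passed in as precomputed values because their exact formulas are not given in this section.

```python
import math


def _dist(a, b):
    """Euclidean distance between two equally sized coordinate tuples."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def ball_to_goal_progress(goal_target, ball_prev, ball_now):
    """g^ball-to-goal (also r^b2g): per-step decrease in ball-to-goal distance."""
    return _dist(goal_target, ball_prev) - _dist(goal_target, ball_now)


def player_to_ball_reward(root_prev, ball_prev, root_now, ball_now):
    """r^p2b: per-step decrease in root-to-ball distance."""
    return _dist(root_prev, ball_prev) - _dist(root_now, ball_now)


def soccer_kick_reward(g_ball_to_goal, r_p2b, r_b2g, r_bv2g, r_b2t, c_no_dribble):
    """Eq. (11): approach the ball until it moves goalward, then reward the shot."""
    if g_ball_to_goal <= 0.0:
        return 0.4 * r_p2b - c_no_dribble
    return 0.1 * r_b2g + 0.1 * r_bv2g + 0.8 * r_b2t - c_no_dribble
```

The dribble penalty is subtracted in both branches, so crossing the ball's spawning point is discouraged regardless of stage.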

The team play (1v1 and 2v2) soccer tasks use rewards similar to those of the penalty kick task. The reward function for team play is $\boldsymbol{\mathcal{R}}^{\text{soccer-match}}(\boldsymbol{s}^{\text{p}}_{t}, \boldsymbol{s}^{\text{g-soccer}}_{t}) \triangleq w^{\text{p2b}} r^{\text{p2b}} + w^{\text{b2g}} r^{\text{b2g}} + w^{\text{bv2g}} r^{\text{bv2g}} + w^{\text{point}} r^{\text{point}}$, where $r^{\text{p2b}}$ and $r^{\text{b2g}}$ are the same as in the penalty kick, and $r^{\text{point}}$ provides a one-time bonus for scoring.

Basketball

The basketball environment is similar to soccer except that it utilizes the SMPL-X humanoid with articulated fingers. In the free-throwing task, the ball is initialized between the humanoid’s hands. The free throw reward is defined as: $\boldsymbol{\mathcal{R}}^{\text{free-throw}}(\boldsymbol{s}^{\text{p}}_{t}, \boldsymbol{s}^{\text{g-soccer}}_{t}) \triangleq 0.5 \times r^{\text{ballvel}} + 0.5 \times r^{\text{bv2g}} + r^{\text{basket}}$. The basketball velocity reward $r^{\text{ballvel}} = \exp(-0.1 \times \|\boldsymbol{v}^{\text{ball}}_{t} - \boldsymbol{v}^{\text{ball-desired}}_{t}\|_{2}^{2})$ encourages the ball’s velocity to be close to the desired velocity to reach the goal.
The desired velocity, 𝒗tball-desiredsubscriptsuperscript𝒗ball-desired𝑡{\boldsymbol{{v}}^{\text{ball-desired}}_{t}}bold_italic_v start_POSTSUPERSCRIPT ball-desired end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT, is computed using the goal position 𝒑tgoal-targetsubscriptsuperscript𝒑goal-target𝑡{\boldsymbol{p}^{\text{goal-target}}_{t}}bold_italic_p start_POSTSUPERSCRIPT goal-target end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT, and the ball position 𝒑tballsubscriptsuperscript𝒑ball𝑡{\boldsymbol{p}^{\text{ball}}_{t}}bold_italic_p start_POSTSUPERSCRIPT ball end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT, with the following physics equations:

$$
\begin{aligned}
T^{\text{reach}}_{t} &= \sqrt{\frac{2 \times \|(\boldsymbol{p}^{\text{ball}}_{t} - \boldsymbol{p}^{\text{goal-target}}_{t})_{z}\|_{2}}{g}}, \qquad \boldsymbol{v}^{\text{ball-desired}}_{t,xy} = \frac{\|(\boldsymbol{p}^{\text{ball}}_{t} - \boldsymbol{p}^{\text{goal-target}}_{t})_{xy}\|_{2}}{T^{\text{reach}}_{t}}, \\
\boldsymbol{v}^{\text{ball-desired}}_{t,z} &= \frac{(\boldsymbol{p}^{\text{ball}}_{t} - \boldsymbol{p}^{\text{goal-target}}_{t})_{z} + 0.5 \times g \times (T^{\text{reach}}_{t})^{2}}{T^{\text{reach}}_{t}}.
\end{aligned}
\tag{12}
$$

The ball-velocity-to-goal reward $r^{\text{bv2g}}$ encourages the velocity to be directed towards the goal position. The basket reward, $r^{\text{basket}}$, provides a one-time reward if the ball passes through the basket.
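The desired-velocity computation of Eq. 12 and the resulting free-throw reward can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the released implementation: the function names are hypothetical, $g$ is assumed to be 9.81 m/s², positions are z-up `(x, y, z)` tuples, and the horizontal speed of Eq. 12 is assumed to point from the ball towards the target. The vertical offset is taken as target height minus ball height, so that the ball reaches the target at the apex of its arc after $T^{\text{reach}}_{t}$; with the ball-minus-goal sign read literally, the two numerator terms of $\boldsymbol{v}^{\text{ball-desired}}_{t,z}$ would cancel for a target above the ball. The terms $r^{\text{bv2g}}$ and $r^{\text{basket}}$ are passed in as precomputed values.

```python
import math

G = 9.81  # assumed gravity magnitude in m/s^2; the text only calls it g


def desired_ball_velocity(p_ball, p_goal, g=G):
    """Return (T_reach, (vx, vy), vz) following Eq. 12 under the stated assumptions."""
    dx = p_goal[0] - p_ball[0]
    dy = p_goal[1] - p_ball[1]
    height = p_goal[2] - p_ball[2]  # assumption: target height minus ball height
    if height == 0.0:
        raise ValueError("T_reach is zero when ball and target are level")
    t_reach = math.sqrt(2.0 * abs(height) / g)
    # Eq. 12 gives the horizontal speed; its direction is an assumption here.
    v_xy = (dx / t_reach, dy / t_reach)
    v_z = (height + 0.5 * g * t_reach ** 2) / t_reach
    return t_reach, v_xy, v_z


def ball_velocity_reward(v_ball, v_desired):
    """r_ballvel = exp(-0.1 * ||v_ball - v_desired||_2^2)."""
    sq_err = sum((a - b) ** 2 for a, b in zip(v_ball, v_desired))
    return math.exp(-0.1 * sq_err)


def free_throw_reward(r_ballvel, r_bv2g, r_basket):
    """0.5 * r_ballvel + 0.5 * r_bv2g + r_basket, as in the free-throw definition."""
    return 0.5 * r_ballvel + 0.5 * r_bv2g + r_basket
```

For a target above the ball, the vertical component reduces to $\sqrt{2 g h}$, where $h$ is the height difference, which is the launch speed whose apex lies exactly at the target height.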

Team-play basketball has a reward design similar to that of soccer. The team-play basketball task is highly challenging because the humanoid must pick the ball up, which is more complex than kicking a ball. Thus, while we support 1v1 and 2v2 team-play basketball, our preliminary reward design does not yield interesting behavior, unlike in soccer.

B.3 Hyperparameters

Training hyperparameters are provided in Table 4. We use the same set of hyperparameters to train all of our sports environments, highlighting the advantage of employing a unified humanoid embodiment for simulated sports.

Table 4: Hyperparameters for training each baseline used in SMPLOlympics. We use the same set of hyperparameters for each sport. Note that AMP and PULSE use PPO as the optimization method but add their respective motion priors (as a reward or a motion representation). $\sigma$: fixed variance for policy. $\gamma$: discount factor. $\epsilon$: clip range for PPO. $w_{\text{disc}}$ and $w_{\text{task}}$: weights for discriminator and task rewards.

| Method | Batch Size | Learning Rate | $\sigma$ | $\gamma$ | $\epsilon$ | MLP-size | $w_{\text{disc}}$ | $w_{\text{task}}$ | # of samples |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| PPO Schulman et al. [2017] | 1024 | $5 \times 10^{-4}$ | 0.05 | 0.99 | 0.2 | [2048, 1024, 512] | 0 | 1 | $\sim 10^{9}$ |
| AMP Peng et al. [2021] | 1024 | $5 \times 10^{-4}$ | 0.05 | 0.99 | 0.2 | [2048, 1024, 512] | 0.5 | 0.5 | $\sim 10^{9}$ |
| PULSE Luo et al. [2023a] | 1024 | $5 \times 10^{-4}$ | 0.3 | 0.99 | 0.2 | [2048, 1024, 512] | 0 | 1 | $\sim 10^{9}$ |
| PULSE Luo et al. [2023a] + AMP Peng et al. [2021] | 1024 | $5 \times 10^{-4}$ | 0.3 | 0.99 | 0.2 | [2048, 1024, 512] | 0.5 | 0.5 | $\sim 10^{9}$ |
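For readers who prefer a machine-readable form, the settings of Table 4 can be captured in a small Python configuration. This is a sketch only: the names `TrainConfig`, `BASELINES`, and `total_reward` are hypothetical and do not come from the released code, and combining the task and discriminator rewards as a weighted sum is an assumption based on the description of $w_{\text{disc}}$ and $w_{\text{task}}$ as reward weights.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class TrainConfig:
    sigma: float            # fixed policy variance
    w_disc: float           # weight of the discriminator (style) reward
    w_task: float           # weight of the task reward
    batch_size: int = 1024
    learning_rate: float = 5e-4
    gamma: float = 0.99     # discount factor
    clip_eps: float = 0.2   # PPO clip range
    mlp_sizes: Tuple[int, ...] = (2048, 1024, 512)


# Values transcribed from Table 4; only sigma and the reward weights differ.
BASELINES = {
    "PPO": TrainConfig(sigma=0.05, w_disc=0.0, w_task=1.0),
    "AMP": TrainConfig(sigma=0.05, w_disc=0.5, w_task=0.5),
    "PULSE": TrainConfig(sigma=0.3, w_disc=0.0, w_task=1.0),
    "PULSE+AMP": TrainConfig(sigma=0.3, w_disc=0.5, w_task=0.5),
}


def total_reward(cfg: TrainConfig, r_task: float, r_disc: float) -> float:
    # Assumed combination: a weighted sum of task and discriminator rewards.
    return cfg.w_task * r_task + cfg.w_disc * r_disc
```

Keeping the shared values as defaults makes it explicit that the same optimizer settings are reused across all sports and baselines.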

B.4 Details about Baselines

For our baseline methods PULSE Luo et al. [2023a] and AMP Peng et al. [2021], we use the official implementations. For PULSE Luo et al. [2023a], we employ the publicly released model without modification, which is pre-trained on the AMASS dataset. We follow a similar setup for downstream tasks in PULSE, using the frozen prior $\boldsymbol{\mathcal{P}}_{\text{PULSE}}$, decoder $\boldsymbol{\mathcal{D}}_{\text{PULSE}}$, and residual action representation. Since PULSE only includes trained models for the SMPL-based humanoid, we train SMPL-X humanoid-based models following the official code. Specifically, we train a humanoid motion imitator following PHC Luo et al. [2023b], and distill motor skills into a 48-dimensional latent space (instead of 32-D, to accommodate articulated fingers). PULSE provides an action space for hierarchical RL and can be integrated with AMP. For PULSE+AMP, the AMP reward offers additional style guidance for the humanoid, which is particularly beneficial for tasks such as table tennis. However, we find that the demonstration sequences used for AMP need to be task-specific (e.g., containing only a swinging motion); otherwise, the discriminator reward can overpower the task reward and lead to undesired behavior (as seen in the free kick results).

Appendix C Additional Ablations

We conducted an ablation study to evaluate the role of physics-based tracking (w/ PHC) in acquiring human reference motion. Specifically, we used the pose estimation results directly from TRAM Wang et al. [2024b], without physics-based tracking, as positive samples for the discriminator during policy training (w/o PHC). Our experiments were performed on the table tennis task. As shown in Table 5, providing video data without PHC leads to significantly lower performance than using PHC, and is similar to the results obtained using only PULSE. We observe that when the quality of the provided reference motion is poor (e.g., with significant noise in position and drastic velocity changes), the model struggles to use the reference motion as style guidance to achieve natural movements. In contrast, employing physics-based tracking to refine pose estimates from in-the-wild videos results in physically plausible motion, which significantly aids policy learning.

Table 5: Ablation study on PHC (table tennis).

| Method | Avg Hits ↑ | Error Dis ↓ |
| --- | --- | --- |
| PULSE | 0.74 | 0.19 |
| PULSE+AMP, w/o PHC | 0.91 | 0.18 |
| PULSE+AMP, w/ PHC | 1.83 | 0.23 |

Appendix D Broader Social Impact

We propose SMPLOlympics, a collection of sports environments for simulated humanoids. These environments can be used to benchmark learning algorithms, discover new humanoid behaviors, create animations, and more. The potential negative social impact includes the risk of generating animations that could be used to create DeepFakes. Positive social impact includes the development of intelligent and collaborative agents, advancements in robot learning, discovery of new sports techniques, and the generation of immersive and physically realistic animations.