In this section, we compare our method with previous work, perform an ablation study to evaluate its main components, and assess how the system performs for users of different dimensions, i.e., height and body proportions. We performed all evaluations in real time, exactly mimicking real-world use.
7.1 Comparison
To the best of our knowledge, there are no data-driven methods for reconstructing full-body poses from a sparse set of sensors providing both positional and rotational information. Nonetheless, some methods are able to synthesize plausible poses from three 6-DoF sensors (HMD and two hand-held controllers). The state-of-the-art method is AvatarPoser (AP) [Jiang et al. 2022a], which employs a Transformer model to generate full-body poses and uses an optimization-based IK method to refine the arms. To enable a fair comparison with our approach, we extended the implementation of AP to work with six 6-DoF sensors; we refer to this extended implementation as Extended AvatarPoser (EAP). Specifically, we modified the input layer of the Transformer model while maintaining the training procedure and the remainder of the code.
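To give a sense of the scope of this change, the following is a minimal sketch, assuming AP embeds a flattened per-frame vector of tracker signals through a linear layer before the Transformer; the module name and the per-tracker feature size are hypothetical.

```python
import torch
import torch.nn as nn

# Hypothetical per-tracker feature size (e.g., 6D rotation + 3D position
# and their velocities); the real feature layout may differ.
FEATS_PER_TRACKER = 18

class TrackerEmbedding(nn.Module):
    """Projects flattened tracker signals into the Transformer width."""
    def __init__(self, num_trackers: int, d_model: int = 256):
        super().__init__()
        self.proj = nn.Linear(num_trackers * FEATS_PER_TRACKER, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, window, num_trackers * FEATS_PER_TRACKER)
        return self.proj(x)

# Original AP: three trackers (HMD and two controllers).
ap_embed = TrackerEmbedding(num_trackers=3)
# EAP: six 6-DoF trackers; only the input width changes, while the
# Transformer body and the training procedure remain untouched.
eap_embed = TrackerEmbedding(num_trackers=6)
```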
We also evaluate our method against Final IK (FIK) [RootMotion 2017], a state-of-the-art IK method for animating full-body VR avatars from a sparse set of 6-DoF trackers. Finally, we compare with state-of-the-art data-driven methods that reconstruct full-body poses from IMU sensors, namely TransPose (TP) [Yi et al. 2021] and Physical Inertial Poser (PIP) [Yi et al. 2022]. Although comparisons with AP and FIK enable us to evaluate the quality of our method with 6-DoF sensors, it is essential to compare with IMU-based methods to gain a comprehensive understanding of our approach: using 6-DoF sensors does not necessarily ensure superiority over IMU-based methods, and comparing with the wider body of literature on full-body reconstruction provides a broader context for assessing our overall performance gains.
As the generator is convolution-based, we use a window of 64 frames for real-time predictions. When predicting a new pose, we fill this window with past frames of sparse data, the current frame, and, optionally, future frames. When latency is not an issue, e.g., when generating poses offline from an already captured sequence, we can allow the system access to some future information to improve quality. Our system, labeled Ours-7 in Table 1, uses a window of 64 frames comprising 56 past frames, the current frame, and 7 future frames. Similarly, Ours-0 uses 63 past frames and the current frame, but no future frames, resulting in no added latency. In comparison, TransPose uses 5 future frames, while AP, Final IK, and PIP do not use future information.
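As an illustration, the sketch below shows how such a window could be assembled for both variants; the function, the edge-padding strategy, and the feature size are our own illustrative choices, not necessarily the exact implementation.

```python
import numpy as np

WINDOW = 64

def build_window(frames: np.ndarray, t: int, future: int) -> np.ndarray:
    """Assemble the 64-frame input window for the pose at frame t.

    frames: (num_frames, feat) array of sparse tracker data.
    future: 7 for Ours-7 (adds latency), 0 for Ours-0 (no latency).
    """
    past = WINDOW - 1 - future  # 56 past frames for Ours-7, 63 for Ours-0
    idx = np.arange(t - past, t + future + 1)
    # Clamp at the sequence boundaries by repeating edge frames.
    idx = np.clip(idx, 0, len(frames) - 1)
    return frames[idx]  # (64, feat)

# Example with a dummy sequence of 1000 frames and 54 features.
seq = np.random.randn(1000, 54)
win7 = build_window(seq, t=100, future=7)  # 56 past + current + 7 future
win0 = build_window(seq, t=100, future=0)  # 63 past + current, no latency
```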
We conduct a qualitative and quantitative evaluation of our method against EAP, AP, Final IK, TransPose, and PIP. Please refer to the supplementary video for an animated version of our results.
Qualitative. In order to provide a visual comparison of our method with related work, selected frames from the video are shown in Figure
8. In this experiment, we simultaneously collected positional, rotational, and raw IMU data (accelerations and orientations) using the HTC VIVE system and six IMUs from the Xsens Awinda motion capture system. To make it easier to visually compare the poses, the root is fixed in the generated poses.
Both TransPose and PIP generate natural, human-like poses in most cases; however, they face challenges with poses that involve a certain level of ambiguity given the sparse input, for example, when the user crosses two end-effectors, such as the hands or feet, or when the user is crouching or lying on the ground. Overall, the movement reconstructed by these methods is often overly smoothed and fails to precisely position the end-effectors. In contrast, Final IK is able to precisely match the end-effectors but fails to reconstruct the true joint orientations. For instance, as seen in the fourth row of Figure 8, the position of the right foot is correct, but the lower leg appears parallel to the ground, differing from the ground truth. In addition, its poses often appear stiff and robotic. The performance of Extended AvatarPoser lies in an intermediate range: it generates natural-looking poses in most scenarios, but its limitations become apparent when it fails to accurately position the end-effectors, resulting in an overly smoothed pose. This is particularly evident when the pose is ambiguous, as demonstrated in the third row of Figure 8. Our method, in contrast, positions the end-effectors accurately, similar to Final IK, while also maintaining the natural appearance of the poses and correctly matching the joint rotations of the ground truth. We attribute this accuracy to our two-stage approach, which combines the strengths of a convolution-based pose generator and a learned IK for accurate positioning.
Quantitative. We test our method using two datasets from AMASS [Mahmood et al. 2019] that were not used for training any of the learning-based methods: HUMAN4D [Chatzitofis et al. 2020] and SOMA [Ghorbani and Black 2021], which contain a variety of human activities captured by commercial marker-based motion capture systems. We chose AMASS as it is a well-known human motion database and is compatible with SMPL [Loper et al. 2015], which is required by the code provided by the authors of AvatarPoser, TransPose, and PIP. In line with previous works that have trained their networks using multiple datasets from AMASS, our system is trained on DanceDB [Aristidou et al. 2019], which is also part of AMASS; we also retrained AvatarPoser on DanceDB. Because our approach relies on joint information as input, there is no need to synthesize VR trackers: we directly use the orientations from the databases and apply Forward Kinematics (FK) to obtain the positions of the end-effectors.
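For reference, a minimal FK sketch of this step, assuming local joint rotations given as matrices and fixed per-joint offsets; the signature and skeleton layout are illustrative.

```python
import numpy as np

def forward_kinematics(parents, offsets, local_rot, root_pos):
    """Global joint positions from local rotations.

    parents:   per-joint parent indices, -1 for the root.
    offsets:   (J, 3) bone offsets in the parent's frame.
    local_rot: (J, 3, 3) local rotation matrices.
    root_pos:  (3,) global root position.
    Joints are assumed to be ordered so parents precede children.
    """
    J = len(parents)
    glob_rot = np.zeros((J, 3, 3))
    glob_pos = np.zeros((J, 3))
    for j in range(J):
        p = parents[j]
        if p < 0:
            glob_rot[j] = local_rot[j]
            glob_pos[j] = root_pos
        else:
            glob_rot[j] = glob_rot[p] @ local_rot[j]
            glob_pos[j] = glob_pos[p] + glob_rot[p] @ offsets[j]
    return glob_pos

# The synthetic "tracker" positions are then read off glob_pos at the
# end-effector indices (head, hands, feet, etc.).
```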
Similar to previous work [Jiang et al. 2022a; Yi et al. 2021, 2022; Jiang et al. 2022b], we evaluate the performance of our method using the following metrics (a sketch of how they can be computed follows the list):
– Positional Error (Pos) measures the mean Euclidean distance error of all joints in centimeters, with the root position aligned to the ground truth.
– Rotational Error (Rot) measures the mean global rotation error of all joints in degrees. We compute the distance between two rotations represented by rotation matrices \(R_0\) and \(R_1\) as the angle of the difference rotation \(D = R_0 R_1^T\).
– End-Effector Positional Error (EE Pos) measures the mean Euclidean distance error of the end-effectors (excluding the root) in centimeters, with the root position aligned to the ground truth.
– Root Error (Root) measures the mean Euclidean distance error of the root joint in centimeters.
– Jitter measures the mean jerk of all joints in units of \(10^2\,\text{m/s}^3\). Jerk is the third derivative of position with respect to time, i.e., the rate of change of acceleration [Flash and Hogan 1985]; we use it as a measure of the smoothness of the motion.
– Velocity Error (Vel) measures the mean velocity error of all joints in centimeters per second; velocities are computed by forward finite differences.
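Purely as a reference, the sketch below implements these definitions; the array layouts, the frame interval dt, and the convention that joint 0 is the root are our assumptions.

```python
import numpy as np

def rotational_error_deg(R0: np.ndarray, R1: np.ndarray) -> np.ndarray:
    """Angle of the difference rotation D = R0 R1^T, in degrees.
    R0, R1: (..., 3, 3) global rotation matrices."""
    D = R0 @ np.swapaxes(R1, -1, -2)
    tr = np.trace(D, axis1=-2, axis2=-1)
    return np.degrees(np.arccos(np.clip((tr - 1.0) / 2.0, -1.0, 1.0)))

def positional_error_cm(pos: np.ndarray, gt: np.ndarray) -> float:
    """Mean per-joint Euclidean error (cm) with roots aligned.
    pos, gt: (T, J, 3) joint positions in meters; joint 0 is the root."""
    aligned = pos - pos[:, :1] + gt[:, :1]
    return 100.0 * np.linalg.norm(aligned - gt, axis=-1).mean()

def jitter(pos: np.ndarray, dt: float) -> float:
    """Mean jerk magnitude, reported in units of 10^2 m/s^3."""
    jerk = np.diff(pos, n=3, axis=0) / dt**3  # third derivative of position
    return np.linalg.norm(jerk, axis=-1).mean() / 100.0

def velocity_error_cm_s(pos: np.ndarray, gt: np.ndarray, dt: float) -> float:
    """Mean velocity error (cm/s) via forward finite differences."""
    vel, gt_vel = np.diff(pos, axis=0) / dt, np.diff(gt, axis=0) / dt
    return 100.0 * np.linalg.norm(vel - gt_vel, axis=-1).mean()
```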
We group these metrics into three main categories: pose quality, end-effector accuracy, and smoothness. To evaluate the overall pose quality of the generated data, we use the Positional Error and Rotational Error, which measure the accuracy of joint positions and rotations, respectively, when the root is aligned with the ground truth data. To evaluate end-effector accuracy, we distinguish between the character's placement in the world (Root Error) and the positions of the remaining end-effectors (such as the head, hands, and toes) when the root position is aligned. Lastly, motion smoothness is assessed using Jitter and Velocity Error.
Table 1 presents the comparison results. The goal of our proposed method is to achieve optimal pose quality while also maximizing end-effector accuracy. Reconstructing full-body poses from sparse data is an under-constrained problem; therefore, a balance must be struck between these two objectives. Our method balances the competing demands of high-quality poses and accurate end-effector positioning without compromising the overall human-like appearance of the pose.
It can be observed that Final IK, being an inverse kinematics method, effectively tracks the end-effectors but struggles to synthesize natural poses and often introduces jittering artifacts with abrupt changes in direction. Conversely, methods such as TransPose and PIP, which use IMU sensors, achieve high overall pose quality but exhibit high positional error and low end-effector accuracy. Our model achieves the highest scores for pose quality, regardless of whether future frames are used. Additionally, our method greatly improves end-effector accuracy compared to other data-driven methods, achieving results similar to Final IK, which is specifically designed to minimize the distance between the end-effectors and their targets. Furthermore, our model outperforms other methods in Root Error, as we do not predict the root position but instead constrain it to the root sensor and let the networks adjust the pose. This aspect is crucial for self-avatar animation, as it keeps the user correctly aligned with the virtual avatar. In terms of smoothness, PIP has the best results in Jitter but the worst in End-Effector Positional Error, which suggests that it misses the high-frequency details of the movement. In contrast, our method provides a good balance: it obtains the second-best scores in Jitter and Velocity Error while maintaining high end-effector accuracy with a smaller variance. This indicates fewer large changes in pose between frames and fewer jittering artifacts, resulting in less noticeable popping in the animation.
Finally, our method outperforms Extended AvatarPoser across all metrics (except Root Error, since neither method introduces root error). We consider AvatarPoser our baseline since it also uses 6-DoF trackers but employs the well-established Transformer architecture; hence, the performance of our approach is not solely attributable to the use of 6-DoF trackers. Since we extended the input of AvatarPoser's Transformer model to six 6-DoF trackers instead of the original three, to further validate our findings we also present in Table 2 a comparison of the same metrics restricted to the upper-body joints synthesized with the original AvatarPoser implementation. Remarkably, even when focusing solely on the upper-body joints, our approach still clearly outperforms AvatarPoser.
We attribute the superior performance of our approach compared to Extended AvatarPoser to the specialized architectural composition of our networks. As opposed to Transformers, originally crafted for natural language processing, our method deploys skeleton-aware operations intrinsically designed to accommodate the hierarchical structure of the human skeleton. In addition, our dual-stage strategy employs a time-aware convolutional network, enabling it to learn a comprehensive representation of human motion at the expense of losing some high-frequency motion details; our method recovers these details through the learned IK. Crucially, we posit that our learned IK, trained end-to-end with the generator, learns an optimization policy that more accurately replicates natural human motion, surpassing the traditional optimization-based IK employed in AvatarPoser.
7.2 Ablation Study
As outlined in the previous section, our goal is to achieve both optimal pose quality and maximum end-effector accuracy. In this section, we describe an ablation study to examine the impact of each of the components of our network on the balance between pose quality and end-effector accuracy. We trained and evaluated our system on the same datasets as in Section
7.1. For a fair comparison, all experiments in this section had access to the 7 future frames, matching the conditions of the
Ours-7 version, which all ablation tests are compared against. All results are listed in Table
3; please refer to the supplementary materials for an animated version of these results.
In the initial experiment, we assess the effect of using the generator alone, without the learned IK. We compare two versions: in the first (No Learned IK in Table 3), the learned IK is not used and the rest of the pipeline remains intact; in the second (Generator \(\mathcal{L}_{S}\) in Table 3), a Forward Kinematics loss similar to \(\mathcal{L}_{S}\) is added to compare the pose generated by the generator against the ground truth, \(\mathrm{MSE}(FK(\mathbf{Q}), FK(\hat{\mathbf{Q}}^{G}))\).
In the first case, the only metric that improved was Jitter; however, the reconstructed motion failed to maintain high-frequency details, resulting in lower performance on the other metrics. In the second case, when the FK-based loss is added to the output of the generator, we observed a slight decrease in rotational error but a notable increase in both end-effector positional error and overall positional error compared to using the learned IK component. These findings suggest that the learned IK component significantly improves end-effector accuracy while preserving the high-quality poses synthesized by the generator. It is worth noting that, by improving the end-effector positions and maintaining a low rotational error, the overall positional error decreases because the limbs are correctly positioned.
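As a sketch, the FK-based loss of the second variant can be written as follows, assuming a differentiable, batched FK function in the spirit of the one sketched in Section 7.1; the signature is hypothetical.

```python
import torch

def fk_loss(q_gen: torch.Tensor, q_gt: torch.Tensor, fk) -> torch.Tensor:
    """FK-based generator loss, MSE(FK(Q), FK(Q_hat^G)).

    q_gen, q_gt: (batch, frames, joints, 3, 3) local rotations
                 (generator output and ground truth, respectively).
    fk: differentiable, batched forward kinematics returning
        (batch, frames, joints, 3) global joint positions.
    """
    return torch.mean((fk(q_gt) - fk(q_gen)) ** 2)
```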
Since the learned IK operates on each limb independently, it cannot take the overall body pose into account. Therefore, when omitting the \(\mathcal{L}_{Reg}\) loss term (No \(\mathcal{L}_{Reg}\) in Table 3), there is a slight improvement in end-effector accuracy but a significant decline in pose quality. Looking at the generated poses, it can be seen how the limbs attempt to reach the end-effectors at the cost of synthesizing non-human-like motion. As such, the inclusion of the \(\mathcal{L}_{Reg}\) loss term combines the strengths of the generator with the learned IK, resulting in improved pose quality and end-effector accuracy.
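The exact definition of \(\mathcal{L}_{Reg}\) is given in the method section; purely as an illustration of the role it plays here, a regularizer of this kind can be sketched as keeping the IK-refined rotations close to the generator's full-body pose. This is an assumed form, not necessarily the one we use.

```python
import torch

def reg_loss(q_ik: torch.Tensor, q_gen: torch.Tensor) -> torch.Tensor:
    """Illustrative regularizer: penalize deviations of the IK-refined
    rotations from the generator's pose, so limbs do not over-reach for
    end-effector targets at the cost of non-human-like motion."""
    return torch.mean((q_ik - q_gen) ** 2)
```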
Additionally, to evaluate the impact of the skeleton-aware operations, we define a baseline method (No Skeletal Op. in Table 3). Specifically, we replaced the skeletal convolutions with conventional one-dimensional convolutions and modified the skeletal unpooling so that unpooled joints receive information from all joints instead of only neighboring ones. Not accounting for joint adjacency resulted in a significant decline in performance across all metrics. Inspecting the visual results, we believe that allowing convolutions access to all joints produces an averaging effect that results in overly smooth motion.
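To make the distinction concrete, the sketch below contrasts a masked, skeleton-aware temporal convolution with the unmasked baseline; this is one common way to realize skeleton-aware convolutions and not necessarily our exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedSkeletalConv(nn.Module):
    """Temporal convolution in which each joint's filters only see
    features of skeletally neighboring joints. The 'No Skeletal Op.'
    baseline corresponds to dropping the mask, so every joint receives
    information from all joints."""
    def __init__(self, joints: int, channels: int, kernel: int,
                 neighbor_mask: torch.Tensor):
        super().__init__()
        self.conv = nn.Conv1d(joints * channels, joints * channels,
                              kernel, padding=kernel // 2)
        # neighbor_mask: (joints, joints) binary adjacency, incl. self.
        mask = (neighbor_mask.float()
                .repeat_interleave(channels, 0)
                .repeat_interleave(channels, 1))
        self.register_buffer("mask", mask.unsqueeze(-1))  # (out, in, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, joints * channels, time)
        w = self.conv.weight * self.mask  # zero out non-neighbor weights
        return F.conv1d(x, w, self.conv.bias, padding=self.conv.padding[0])
```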