1 Introduction
Human pose transfer has recently attracted increasing attention since it can be applied to real-world applications such as movies' special effects [56], entertainment systems [62], reenactment [31], and so forth [12, 40]. At the same time, it is closely related to many computer vision tasks such as human-object interaction recognition [42, 71, 72], person re-identification [33, 55], human pose segmentation [59, 70], and human parsing [27, 60], and all these tasks can benefit from each other. Given some images of a target person and a source person image with a desired pose (e.g., judo, dance), the goal of human pose transfer is to synthesize a realistic image of the target person performing the desired pose of the source person.
With the power of deep learning, especially generative adversarial networks (GANs) [15], pioneering works have proposed impressive solutions to human image generation [34, 37, 41] by efficiently leveraging image-to-image translation schemes, and have achieved significant progress. Early approaches coarsely conducted human pose transfer through general image-to-image translation methods such as Pix2Pix [23] and CycleGAN [73], which attempt to translate the extracted skeleton image of the source person into an image of the target person with the desired pose.
Subsequent approaches [28, 37, 38] adopt modules specifically designed for human pose transfer. For instance, the U-Net architecture with skip connections in [14] is employed to preserve low-level features. To mitigate the pose misalignment between the source and target persons, [50] uses part-wise affine transformations with a modified feature fusion mechanism to warp the appearance features onto the target pose. Later, extensive works were presented to strengthen the modeling of body deformation and feature transfer with different methods, including 3D surface models [16, 28, 41], local attention [43, 74], and optical flow [57]. [27] and [60] propose a self-learning rectification strategy and a hierarchical information framework, respectively, for human parsing, which benefits the downstream pose transfer task. However, warping methods commonly struggle with pose ambiguity when the viewpoint changes, occlusions occur, or a complicated pose has to be transferred. To address the pose ambiguity, a series of works [34, 57] use predictive branches to hallucinate new contents for invisible regions. When the hallucinated contents have a different context style than the locally warped ones, the generated images suffer from low visual fidelity due to appearance inconsistency. One main reason for both pose ambiguity and appearance inconsistency is that the commonly used reconstruction and adversarial losses only constrain the synthesized image at the image level.
To alleviate the above limitations, it is important to disentangle the pose and appearance information and to exploit the disentangled pose and appearance feature consistencies between the synthesized and real images, i.e., the synthesized target image should have a high-level appearance feature similar to that of the real target person, as well as a high-level pose feature similar to that of the real source person. These disentangled feature consistencies constrain the training at the feature level and lead to more consistent and realistic synthesized results. In CDMS [70], a multi-mutual consistency learning strategy is proposed for the human pose segmentation task, showing the importance of feature consistency for distinguishing the human pose.
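For concreteness, the two constraints can be written schematically as follows; the notation here is illustrative rather than the exact formulation used later in the paper:
\[
\mathcal{L}_{\mathrm{pose}} = \big\| E_{p}(\hat{y}) - E_{p}(x_{\mathrm{src}}) \big\|_{1}, \qquad
\mathcal{L}_{\mathrm{app}} = \big\| E_{a}(\hat{y}) - E_{a}(x_{\mathrm{tgt}}) \big\|_{1},
\]
where \(E_{p}\) and \(E_{a}\) denote the pose and appearance (static) feature encoders, \(\hat{y}\) is the synthesized image, and \(x_{\mathrm{src}}\) and \(x_{\mathrm{tgt}}\) are the source and target person images.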
In this article, we propose a pose transfer network with augmented Disentangled Feature Consistency (DFC-Net) to facilitate human pose transfer. DFC-Net contains a pose feature encoder and a static feature encoder to extract pose and appearance features from the source and target persons, respectively. In the pose feature encoder, we integrate a pre-trained pose estimator such as OpenPose [8] to extract keypoint heatmaps. Notice that the pose estimator is pre-trained on the COCO keypoint challenge dataset [30], which is disjoint from the datasets deployed in our experiments. As shown in Figure 2, although the pre-trained pose estimator can predict pose heatmaps for unseen subjects in our dataset, it does not generalize well and the heatmaps contain considerable noise, which hinders subsequent pose transfer. To remedy this distortion of the extracted keypoints caused by the distribution shift of the pose estimator, we introduce a keypoint amplifier that suppresses the noise in the keypoint heatmaps. An image generator then synthesizes a realistic image of the target person conditioned on the disentangled pose and appearance features. Together, the feature encoders and the image generator enable us to present novel feature-level pose and appearance consistency losses [73]. These losses reinforce the consistency of pose and appearance information in the feature space while maintaining visual fidelity. Additionally, to further improve the robustness and generality of DFC-Net, we present a novel data augmentation scheme that builds an extra unpaired support dataset as the source of pose information; by disentangling the pose information from these different persons, it provides poses unseen in the training set and augmented consistency constraints.
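To make the pipeline concrete, the following PyTorch-style sketch shows how the disentangled features, the keypoint amplifier, and the two consistency losses could fit together in one training step. All module names and the softmax-based amplifier are simplifying assumptions for illustration, not the exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def amplify_keypoints(heatmaps: torch.Tensor, temperature: float = 0.05) -> torch.Tensor:
    # Hypothetical keypoint amplifier: a low-temperature spatial softmax
    # sharpens each noisy heatmap so spurious low responses are suppressed.
    b, k, h, w = heatmaps.shape
    flat = heatmaps.reshape(b, k, h * w) / temperature
    return F.softmax(flat, dim=-1).reshape(b, k, h, w)

def consistency_losses(pose_estimator: nn.Module, pose_enc: nn.Module,
                       static_enc: nn.Module, generator: nn.Module,
                       x_src: torch.Tensor, x_tgt: torch.Tensor) -> torch.Tensor:
    # Disentangle: pose features come from the source image,
    # appearance features from the target image.
    p_src = pose_enc(amplify_keypoints(pose_estimator(x_src)))
    a_tgt = static_enc(x_tgt)

    # Synthesize the target person performing the source pose.
    y_hat = generator(p_src, a_tgt)

    # Feature-level consistencies: the synthesized image should reproduce
    # the source pose feature and the target appearance feature.
    # Real-image features are detached so they act as fixed targets.
    p_hat = pose_enc(amplify_keypoints(pose_estimator(y_hat)))
    loss_pose = F.l1_loss(p_hat, p_src.detach())
    loss_app = F.l1_loss(static_enc(y_hat), a_tgt.detach())

    # These terms are added to the usual reconstruction/adversarial objective.
    return loss_pose + loss_app
```

Under the proposed augmentation scheme, x_src can equally be drawn from the unpaired support dataset, since only its pose feature is consumed.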
We also notice that the commonly used real-person datasets and benchmarks [35, 69] usually do not contain the ground-truth image of the target person performing the desired pose of another source person. It is therefore common practice to take a target person image directly from the testing dataset to provide the pose information during evaluation. This practice raises the risk of information leakage and is also inconsistent with real-world usage (i.e., the pose information comes from another source person). To be consistent with real-world applications and to better evaluate the proposed method, inspired by [1], we collect an animation character image dataset named Mixamo-Pose from Adobe Mixamo [2], a 3D animation library, which allows us to accurately generate different characters performing identical poses, as a benchmark for assessing human pose transfer between different people. Mixamo-Pose contains four different animation characters performing 15 kinds of poses. To further evaluate DFC-Net, we also build a real-person dataset called EDN-10k upon [10], which contains 10k high-resolution images of four real subjects performing different poses. The experimental results on these two datasets demonstrate that our model can effectively synthesize realistic images and conduct pose transfer for both animation characters and real persons.
In summary, our contributions are as follows:
– We propose a novel method, DFC-Net, for human pose transfer with two disentangled feature consistency losses that enforce consistent pose and appearance information between real and synthesized images.
– We propose a novel data augmentation scheme that enforces augmented consistency constraints with an unpaired support dataset to further improve the generality of our model.
– We collect an animation character dataset, Mixamo-Pose, as a new benchmark that enables accurate evaluation of pose transfer between different people in the animation domain.
– We conduct extensive experiments on Mixamo-Pose and EDN-10k; the empirical results demonstrate the effectiveness of our method.
2 Related Work
Generative adversarial networks [15] and diffusion models [20] have achieved tremendous success in image generation tasks, whose goal is to generate high-fidelity images conditioned on images or text prompts from a different domain. Pix2Pix [23] proposes a framework based on cGANs [39] with an encoder-decoder architecture [19]; CycleGAN [73] addresses the unpaired setting by using cycle-consistent GANs; DualGAN [64] and [21] are likewise unsupervised image-to-image translation methods trained on unpaired datasets. Similarly, [6, 22, 32] are image-to-image translation techniques, but they aim to generate a labeled dataset in the target domain for domain adaptation tasks. These works can serve as general approaches to human pose transfer, provided that there is a specific image domain that can be converted into the synthesized image domain, e.g., by using a pose estimator [9] to generate a paired skeleton image dataset. Based on diffusion models, DiffuStereo [45] proposes a diffusion kernel and stereo constraints for 3D human reconstruction from sparse cameras, and MotionDiffuse [66] leverages the diffusion model for the text-driven motion generation task. In this work, we focus on the 2D pose-guided motion transfer task, which differs from the above 3D reconstruction and text-driven tasks. Unlike these image-to-image translation methods, DFC-Net improves the quality of the synthesized image by adding consistency constraints in the feature space.
Recently, there have been a growing number of human pose transfer methods with specifically designed modules. One branch comprises spatial transformation methods [13, 28, 50], which aim to build the deformation mapping of the keypoint correspondences in the human body. Leveraging the spatial transformation capability of CNNs, [24] presented the spatial transformer networks (STN) that approximate a global affine transformation to warp the features; a minimal sketch of this warping operation is shown below.
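The following generic PyTorch snippet illustrates affine feature warping in the STN style; it is a didactic example, not the implementation of any of the cited works:

```python
import torch
import torch.nn.functional as F

feats = torch.randn(1, 64, 32, 32)         # feature map to be warped
theta = torch.tensor([[[1.0, 0.0, 0.1],    # one 2x3 affine matrix per sample:
                       [0.0, 1.0, 0.0]]])  # identity plus a small x-shift

# affine_grid builds a sampling grid from theta; grid_sample bilinearly
# resamples the features at those grid locations, realizing the warp.
grid = F.affine_grid(theta, list(feats.shape), align_corners=False)
warped = F.grid_sample(feats, grid, align_corners=False)
```

In practice, theta is predicted by a small localization network rather than fixed, so the warp can be learned end-to-end.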
Following STN, several variant works [25, 29, 65] have been proposed to synthesize images with better performance. [59] introduced an external eye-tracking dataset and two cascaded attention modules for comprehensive pose segmentation. [60] incorporated three different inference processes to detect each part of the human body. [4] used image segmentation to decompose the problem into modular subtasks for each body part and then integrated all parts into the final result. [50] built deformable skip connections to move information and transfer textures for pose transfer. Monkey-Net [48] encoded pose information via dense flow fields generated from keypoints learned in a self-supervised fashion. The First-Order Motion Model [49] decoupled appearance and pose and proposed to use learned keypoints and local affine transformations to generate image animation. [34] integrated human pose transfer, appearance transfer, and novel view synthesis into one unified framework by using SMPL [36] to generate a human body mesh. The spatial transformation methods usually implicitly assume that the warping operation can cover the whole body. However, when the viewpoint changes or occlusions occur, this assumption no longer holds, leading to pose ambiguity and performance degradation.
Another branch of methods is pose-guided, aiming to predict new appearance contents in uncovered regions to handle the pose ambiguity problem. One of the earliest works, PG\(^{2}\) [37], presented a two-stage method using U-Net to synthesize the target person with arbitrary poses. [38] further decomposed the image into foreground, background, and pose features to achieve more precise control over different information. [47] introduced a multi-stage GAN loss and synthesized each body part separately. [41] leveraged DensePose [3] rather than the commonly used 2D keypoints to perform accurate pose transfer. [10] learned a direct mapping from skeleton images to synthesized images with corresponding poses based on the architecture of Pix2PixHD [58]. PATN [74] introduced cascaded pose-attentional transfer blocks (PATBs) to refine pose and appearance features simultaneously. Inspired by PATN, PMAN [11] proposed a progressive multi-attention framework with memory networks to improve image quality. However, some of these methods [41, 57, 68] focus on synthesizing results at the image level (i.e., with adversarial and reconstruction losses), thus leading to appearance inconsistency when the predicted local contents do not match the surrounding contexts. Some works [46, 67] designed lightweight networks to accelerate the training and inference processes; our method can also benefit from such lightweight networks to achieve highly efficient human pose transfer.
In contrast, our method learns to disentangle and reassemble the pose and appearance in the feature space. The work closest to ours is C\(^{2}\)GAN [54], which consists of three generation cycles (i.e., one for image generation and two for keypoint generation). C\(^{2}\)GAN explores the cross-modal information at the image level at the cost of model complexity and training instability, whereas DFC-Net only introduces two feature consistency losses into the full objective, keeping the model simple and effective. By disentangling the pose and appearance features, we can enforce the feature consistencies between the synthesized and real images and leverage the pose features from an unpaired dataset to improve performance.