1 Introduction
Human pose transfer has recently attracted increasing attention since it can be applied to real-world applications such as movies' special effects [56], entertainment systems [62], reenactment [31], and so forth [12, 40]. At the same time, it is closely related to many computer vision tasks such as human-object interaction recognition [42, 71, 72], person re-identification [33, 55], human pose segmentation [59, 70], and human parsing [27, 60], and all these tasks can benefit from each other. Given some images of a target person and a source person image with a desired pose (e.g., judo, dance), the goal of human pose transfer is to synthesize a realistic image of the target person performing the desired pose of the source person.
With the power of deep learning, especially generative adversarial networks (GANs) [15], pioneering works have proposed impressive solutions to human image generation [34, 37, 41] by efficiently leveraging image-to-image translation schemes, and have achieved significant progress. Early approaches coarsely conducted human pose transfer through general image-to-image translation methods such as Pix2Pix [23] and CycleGAN [73], which attempt to translate the extracted skeleton image of the source person into an image of the target person with the desired pose.
Subsequent approaches [28, 37, 38] adopt modules specifically designed for human pose transfer. For instance, the U-Net architecture with skip connections in [14] is employed to preserve low-level features. To mitigate the pose misalignment between the source and target persons, [50] uses part-wise affine transformations with a modified feature fusion mechanism to warp the appearance features onto the target pose. Later, extensive works were presented to strengthen the modeling of body deformation and feature transfer with different methods, including 3D surface models [16, 28, 41], local attention [43, 74], and optical flow [57]. [27] and [60] propose a self-learning rectification strategy and a hierarchical information framework, respectively, for human parsing, which benefits the downstream pose transfer task. However, warping methods commonly struggle with pose ambiguity when the viewpoint changes, occlusions occur, or a complicated pose has to be transferred. To address the pose ambiguity, a series of works [34, 57] use predictive branches to hallucinate new contents for invisible regions. When the hallucinated contents have a different context style than the locally warped ones, the generated images suffer from low visual fidelity due to appearance inconsistency. One main reason for both pose ambiguity and appearance inconsistency is that the commonly used reconstruction and adversarial losses only constrain the synthesized image at the image level.
To alleviate the above limitations, it is important to disentangle the pose and appearance information and to exploit the disentangled pose and appearance feature consistencies between the synthesized and real images, i.e., the synthesized target image should have a high-level appearance feature similar to that of the real target person, as well as a high-level pose feature similar to that of the real source person. These disentangled feature consistencies constrain the training at the feature level and lead to more consistent and realistic synthesized results. In CDMS [70], a multi-mutual consistency learning strategy is proposed for the human pose segmentation task, showing the importance of feature consistency for distinguishing the human pose.
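For concreteness, the two constraints can be written schematically as follows; the notation here is illustrative rather than the exact formulation used later in the paper:
\[
\mathcal{L}_{\mathrm{pose}} = \big\| E_{p}(\hat{y}) - E_{p}(x_{\mathrm{src}}) \big\|_{1}, \qquad
\mathcal{L}_{\mathrm{app}} = \big\| E_{a}(\hat{y}) - E_{a}(x_{\mathrm{tgt}}) \big\|_{1},
\]
where \(E_{p}\) and \(E_{a}\) denote the pose and appearance (static) feature encoders, \(\hat{y}\) is the synthesized image, and \(x_{\mathrm{src}}\) and \(x_{\mathrm{tgt}}\) are the source and target person images.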
In this article, we propose a pose transfer network with augmented Disentangled Feature Consistency (DFC-Net) to facilitate human pose transfer. DFC-Net contains a pose feature encoder and a static feature encoder to extract pose and appearance features from the source and target persons, respectively. In the pose feature encoder, we integrate a pre-trained pose estimator such as OpenPose [8] to extract keypoint heatmaps. Notice that the pose estimator is pre-trained on the COCO keypoint challenge dataset [30], which is disjoint from the datasets deployed in our experiments. As shown in Figure 2, although the pre-trained pose estimator can predict pose heatmaps for unseen subjects in our dataset, it does not generalize well and the heatmaps contain considerable noise, which hinders subsequent pose transfer. To remedy this distortion of the extracted keypoints caused by the distribution shift of the pose estimator, we introduce a keypoint amplifier that suppresses the noise in the keypoint heatmaps. An image generator then synthesizes a realistic image of the target person conditioned on the disentangled pose and appearance features. Together, the feature encoders and the image generator enable us to present novel feature-level pose and appearance consistency losses [73]. These losses reinforce the consistency of pose and appearance information in the feature space while maintaining visual fidelity. Additionally, to further improve the robustness and generality of DFC-Net, we present a novel data augmentation scheme that builds an extra unpaired support dataset as the source of pose information; by disentangling the pose information from these different persons, it provides poses unseen in the training set and augmented consistency constraints.
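To make the pipeline concrete, the following PyTorch-style sketch shows how the disentangled features, the keypoint amplifier, and the two consistency losses could fit together in one training step. All module names and the softmax-based amplifier are simplifying assumptions for illustration, not the exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def amplify_keypoints(heatmaps: torch.Tensor, temperature: float = 0.05) -> torch.Tensor:
    # Hypothetical keypoint amplifier: a low-temperature spatial softmax
    # sharpens each noisy heatmap so spurious low responses are suppressed.
    b, k, h, w = heatmaps.shape
    flat = heatmaps.reshape(b, k, h * w) / temperature
    return F.softmax(flat, dim=-1).reshape(b, k, h, w)

def consistency_losses(pose_estimator: nn.Module, pose_enc: nn.Module,
                       static_enc: nn.Module, generator: nn.Module,
                       x_src: torch.Tensor, x_tgt: torch.Tensor) -> torch.Tensor:
    # Disentangle: pose features come from the source image,
    # appearance features from the target image.
    p_src = pose_enc(amplify_keypoints(pose_estimator(x_src)))
    a_tgt = static_enc(x_tgt)

    # Synthesize the target person performing the source pose.
    y_hat = generator(p_src, a_tgt)

    # Feature-level consistencies: the synthesized image should reproduce
    # the source pose feature and the target appearance feature.
    # Real-image features are detached so they act as fixed targets.
    p_hat = pose_enc(amplify_keypoints(pose_estimator(y_hat)))
    loss_pose = F.l1_loss(p_hat, p_src.detach())
    loss_app = F.l1_loss(static_enc(y_hat), a_tgt.detach())

    # These terms are added to the usual reconstruction/adversarial objective.
    return loss_pose + loss_app
```

Under the proposed augmentation scheme, x_src can equally be drawn from the unpaired support dataset, since only its pose feature is consumed.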
We also notice that the commonly used real-person datasets and benchmarks [35, 69] usually do not contain the ground-truth image of the target person performing the desired pose of another source person. It is therefore common practice to take a target person image directly from the testing dataset to provide the pose information during evaluation. This practice raises the risk of information leakage and is also inconsistent with real-world usage (i.e., the pose information comes from another source person). To be consistent with real-world applications and to better evaluate the proposed method, inspired by [1], we collect an animation character image dataset named Mixamo-Pose from Adobe Mixamo [2], a 3D animation library, which allows us to accurately generate different characters performing identical poses, as a benchmark for assessing human pose transfer between different people. Mixamo-Pose contains four different animation characters performing 15 kinds of poses. To further evaluate DFC-Net, we also build a real-person dataset called EDN-10k upon [10], which contains 10k high-resolution images of four real subjects performing different poses. The experimental results on these two datasets demonstrate that our model can effectively synthesize realistic images and conduct pose transfer for both animation characters and real persons.
In summary, our contributions are as follows:
– We propose a novel method, DFC-Net, for human pose transfer with two disentangled feature consistency losses that enforce consistent pose and appearance information between real and synthesized images.
– We propose a novel data augmentation scheme that enforces augmented consistency constraints with an unpaired support dataset to further improve the generality of our model.
– We collect an animation character dataset, Mixamo-Pose, as a new benchmark that enables accurate evaluation of pose transfer between different people in the animation domain.
– We conduct extensive experiments on Mixamo-Pose and EDN-10k; the empirical results demonstrate the effectiveness of our method.
2 Related Work
Generative adversarial networks [15] and diffusion models [20] have achieved tremendous success in image generation tasks, whose goal is to generate high-fidelity images conditioned on images or text prompts from a different domain. Pix2Pix [23] proposes a framework based on cGANs [39] with an encoder-decoder architecture [19]; CycleGAN [73] addresses the unpaired setting by using cycle-consistent GANs; DualGAN [64] and [21] are likewise unsupervised image-to-image translation methods trained on unpaired datasets. Similarly, [6, 22, 32] are image-to-image translation techniques, but they aim to generate a labeled dataset in the target domain for domain adaptation tasks. These works can serve as general approaches to human pose transfer, provided that there is a specific image domain that can be converted into the synthesized image domain, e.g., by using a pose estimator [9] to generate a paired skeleton image dataset. Based on diffusion models, DiffuStereo [45] proposes a diffusion kernel and stereo constraints for 3D human reconstruction from sparse cameras, and MotionDiffuse [66] leverages the diffusion model for the text-driven motion generation task. In this work, we focus on the 2D pose-guided motion transfer task, which differs from the above 3D reconstruction and text-driven tasks. Unlike these image-to-image translation methods, DFC-Net improves the quality of the synthesized image by adding consistency constraints in the feature space.
Recently, there have been a growing number of human pose transfer methods with specifically designed modules. One branch comprises spatial transformation methods [13, 28, 50], which aim to build the deformation mapping of the keypoint correspondences in the human body. Leveraging the spatial transformation capability of CNNs, [24] presented the spatial transformer networks (STN) that approximate a global affine transformation to warp the features; a minimal sketch of this warping operation is shown below.
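The following generic PyTorch snippet illustrates affine feature warping in the STN style; it is a didactic example, not the implementation of any of the cited works:

```python
import torch
import torch.nn.functional as F

feats = torch.randn(1, 64, 32, 32)         # feature map to be warped
theta = torch.tensor([[[1.0, 0.0, 0.1],    # one 2x3 affine matrix per sample:
                       [0.0, 1.0, 0.0]]])  # identity plus a small x-shift

# affine_grid builds a sampling grid from theta; grid_sample bilinearly
# resamples the features at those grid locations, realizing the warp.
grid = F.affine_grid(theta, list(feats.shape), align_corners=False)
warped = F.grid_sample(feats, grid, align_corners=False)
```

In practice, theta is predicted by a small localization network rather than fixed, so the warp can be learned end-to-end.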
Following STN, several variant works [25, 29, 65] have been proposed to synthesize images with better performance. [59] introduced an external eye-tracking dataset and two cascaded attention modules for comprehensive pose segmentation. [60] incorporated three different inference processes to detect each part of the human body. [4] used image segmentation to decompose the problem into modular subtasks for each body part and then integrated all parts into the final result. [50] built deformable skip connections to move information and transfer textures for pose transfer. Monkey-Net [48] encoded pose information via dense flow fields generated from keypoints learned in a self-supervised fashion. The First-Order Motion Model [49] decoupled appearance and pose and proposed to use learned keypoints and local affine transformations to generate image animation. [34] integrated human pose transfer, appearance transfer, and novel view synthesis into one unified framework by using SMPL [36] to generate a human body mesh. The spatial transformation methods usually implicitly assume that the warping operation can cover the whole body. However, when the viewpoint changes or occlusions occur, this assumption no longer holds, leading to pose ambiguity and performance degradation.
Another branch of methods is pose-guided, aiming to predict new appearance contents in uncovered regions to handle the pose ambiguity problem. One of the earliest works, PG\(^{2}\) [37], presented a two-stage method using U-Net to synthesize the target person with arbitrary poses. [38] further decomposed the image into foreground, background, and pose features to achieve more precise control over different information. [47] introduced a multi-stage GAN loss and synthesized each body part separately. [41] leveraged DensePose [3] rather than the commonly used 2D keypoints to perform accurate pose transfer. [10] learned a direct mapping from skeleton images to synthesized images with corresponding poses based on the architecture of Pix2PixHD [58]. PATN [74] introduced cascaded pose-attentional transfer blocks (PATBs) to refine pose and appearance features simultaneously. Inspired by PATN, PMAN [11] proposed a progressive multi-attention framework with memory networks to improve image quality. However, some of these methods [41, 57, 68] focus on synthesizing results at the image level (i.e., with adversarial and reconstruction losses), thus leading to appearance inconsistency when the predicted local contents do not match the surrounding contexts. Some works [46, 67] designed lightweight networks to accelerate the training and inference processes; our method can also benefit from such lightweight networks to achieve highly efficient human pose transfer.
In contrast, our method learns to disentangle and reassemble the pose and appearance in the feature space. The work closest to ours is C\(^{2}\)GAN [54], which consists of three generation cycles (i.e., one for image generation and two for keypoint generation). C\(^{2}\)GAN explores the cross-modal information at the image level at the cost of model complexity and training instability, whereas DFC-Net only introduces two feature consistency losses into the full objective, keeping the model simple and effective. By disentangling the pose and appearance features, we can enforce the feature consistencies between the synthesized and real images and leverage the pose features from an unpaired dataset to improve performance.