
Human Pose Transfer with Augmented Disentangled Feature Consistency

Published: 19 December 2023

Abstract

Deep generative models have made great progress in synthesizing images with arbitrary human poses and in transferring the pose of one person to another. Although many different methods have been proposed to generate images with high visual fidelity, the main challenge remains and stems from two fundamental issues: pose ambiguity and appearance inconsistency. To alleviate the current limitations and improve the quality of the synthesized images, we propose a pose transfer network with augmented Disentangled Feature Consistency (DFC-Net) to facilitate human pose transfer. Given a pair of images containing the source and target person, DFC-Net extracts pose and static information from the source and target, respectively, then synthesizes an image of the target person with the desired pose from the source. Moreover, DFC-Net leverages disentangled feature consistency losses in the adversarial training to strengthen the transfer coherence and integrates a keypoint amplifier to enhance the pose feature extraction. With the help of the disentangled feature consistency losses, we further propose a novel data augmentation scheme that introduces unpaired support data with augmented consistency constraints to improve the generality and robustness of DFC-Net. Extensive experimental results on Mixamo-Pose and EDN-10k demonstrate that DFC-Net achieves state-of-the-art performance on pose transfer.

1 Introduction

Human pose transfer has become increasingly compelling recently since it can be applied to real-world applications such as movie special effects [56], entertainment systems [62], reenactment [31], and so forth [12, 40]. At the same time, it is closely related to many computer vision tasks such as human-object interaction recognition [42, 71, 72], person re-identification [33, 55], human pose segmentation [59, 70], and human parsing [27, 60], and these tasks can benefit from each other. Given some images of a target person and a source person image with the desired pose (e.g., judo, dance), the goal of the human pose transfer task is to synthesize a realistic image of the target person performing the desired pose of the source person.
With the power of deep learning, especially generative adversarial networks (GANs) [15], pioneering works have proposed impressive solutions to human image generation [34, 37, 41] by efficiently leveraging image-to-image translation schemes and have achieved significant progress. Early approaches coarsely perform human pose transfer through general image-to-image translation methods such as Pix2Pix [23] and CycleGAN [73], which attempt to translate the extracted skeleton image of the source person into an image of the target person with the desired pose.
Subsequent approaches [28, 37, 38] adopt specifically designed modules for human pose transfer. Specifically, the U-net architecture with skip connections in [14] is employed to keep the low-level features. To mitigate the pose misalignment between the source and target persons, [50] uses part-wise affine transformations with a modified feature fusion mechanism to warp the appearance features onto the target pose. Later, extensive works have been presented to strengthen the modeling of body deformation and feature transfer with different methods, including 3D surface models [16, 28, 41], local attention [43, 74], and optical flow [57]. [27] and [60] propose a self-learning rectification strategy and a hierarchical information framework, respectively, for human parsing, which benefits the downstream pose transfer task. However, warping methods commonly struggle with pose ambiguity when the viewpoint changes, occlusions occur, or the pose to be transferred is complicated. To address the pose ambiguity, a series of works [34, 57] use predictive branches to hallucinate and fill in new content for invisible regions. When the hallucinated content has a different context style than the locally warped content, the generated images have low visual fidelity due to appearance inconsistency. One of the main reasons for pose ambiguity and appearance inconsistency is that the commonly used reconstruction loss and the adversarial generative loss only constrain the synthesized image at the image level.
To alleviate the above limitations, it is important to disentangle the pose and appearance information and exploit the disentangled pose and appearance feature consistencies between the synthesized and real images, i.e., the synthesized target image should have a high-level appearance feature similar to that of the real target person as well as a high-level pose feature similar to that of the real source person. These disentangled consistencies constrain the training at the feature level and lead to more consistent and realistic synthesized results. In CDMS [70], a multi-mutual consistency learning strategy is proposed for the human pose segmentation task, showing the importance of feature consistency for distinguishing the human pose.
In this article, we propose a pose transfer network with augmented Disentangled Feature Consistency (DFC-Net) to facilitate human pose transfer. DFC-Net contains a pose feature encoder and a static feature encoder to extract pose and appearance features from the source and target person, respectively. In the pose feature encoder, we integrate a pre-trained pose estimator such as OpenPose [8] to extract keypoint heatmaps. Note that the pose estimator is pre-trained on the COCO keypoint challenge dataset [30], which is disjoint from the datasets used in our experiments. As shown in Figure 2, although the pre-trained pose estimator can predict pose heatmaps for unseen subjects in our datasets, it does not generalize well and the heatmaps contain considerable noise, which hinders subsequent pose transfer. To remedy the distortion of the extracted keypoints caused by this distribution shift, we introduce a keypoint amplifier that suppresses the noise in the keypoint heatmaps. An image generator then synthesizes a realistic image of the target person conditioned on the disentangled pose and appearance features. This modular design allows us to introduce novel feature-level pose and appearance consistency losses [73], which reinforce the consistency of pose and appearance information in the feature space while maintaining visual fidelity. Additionally, to further improve the robustness and generality of DFC-Net, by disentangling the pose information from different source persons, we present a novel data augmentation scheme that builds an extra unpaired support dataset of source images, which provides different persons with poses unseen in the training set together with augmented consistency constraints.
Fig. 1.
Fig. 1. Upper left: DFC-Net synthesizes an image of the target person performing the pose of the source person. Upper right: the Pose Feature Encoder \(M(\cdot)\) includes three components: a pre-trained Pose Estimator, a Keypoint Amplifier, and a Pose Refiner. Bottom: overview of the training process of DFC-Net. Note that the images \(x_s\) surrounded by orange dotted boxes are from the support set, and the augmented consistency loss \(\mathcal {L}_\mathrm{sup}\) is the sum of the \(\mathcal {L}_\mathrm{adv}^-, \mathcal {L}_\mathrm{mc}\) , and \(\mathcal {L}_\mathrm{sc}\) terms surrounded by orange dotted boxes.
Fig. 2.
Fig. 2. Comparison of the amplified heatmaps of Subject1 in EDN-10k, generated by the Keypoint Amplifier with different temperatures T. (a) the input image of Subject1. (b) the generated heatmaps with temperature \(T=1\) . (c) the generated heatmaps with temperature \(T=0.1\) . (d) the generated heatmaps with temperature \(T=0.01\) . The heatmap with temperature \(T=0.01\) is the most discriminative, which benefits the later pose synthesis.
We also notice that the commonly used real-person datasets and benchmarks [35, 69] usually do not provide the ground truth, i.e., an image of the target person performing the desired pose of another source person. It is common practice to use a target person image directly from the testing dataset to provide the pose information during evaluation. This practice raises the risk of information leakage and is also inconsistent with real-world usage (i.e., the pose information comes from another source person). To be consistent with the real-world application and to better evaluate the proposed method, inspired by [1], we collect an animation character image dataset named Mixamo-Pose from Adobe Mixamo [2], a 3D animation library, which allows us to accurately render different characters performing identical poses, as a benchmark for assessing human pose transfer between different people. Mixamo-Pose contains four different animation characters performing 15 kinds of poses. To further evaluate DFC-Net, we also build a real-person dataset called EDN-10k upon [10], which contains 10k high-resolution images of four real subjects performing different poses. The experimental results on these two datasets demonstrate that our model can effectively synthesize realistic images and conduct pose transfer for both animation characters and real persons.
In summary, our contributions are as follows:
We propose a novel method, DFC-Net, for human pose transfer with two disentangled feature consistency losses that enforce consistency between real and synthesized images.
We propose a novel data augmentation scheme that enforces augmented consistency constraints with an unpaired support dataset to further improve the generality of our model.
We collect an animation character dataset Mixamo-Pose as a new benchmark to enable the accurate evaluation of pose transfer between different people in the animation domain.
We conduct extensive experiments on datasets Mixamo-Pose and EDN-10k, on which the empirical results demonstrate the effectiveness of our method.

2 Related Work

Generative adversarial networks [15] and diffusion models [20] have achieved tremendous success in image generation tasks, whose goal is to generate high-fidelity images based on images or text prompts from a different domain. Pix2Pix [23] proposes a framework based on cGANs [39] with an encoder-decoder architecture [19]; CycleGAN [73] addresses this problem by using cycle-consistent GANs; DualGAN [64] and [21] are also unsupervised image-to-image translation methods trained on unpaired datasets. Similarly, [6, 22, 32] are image-to-image translation techniques, but they aim to generate a labeled dataset of the target domain for domain adaptation tasks. The above works can be exploited as a general approach to the human pose transfer task, provided that there is a specific image domain that can be converted to the synthesized image domain, e.g., using a pose estimator [9] to generate a paired skeleton image dataset. Based on the diffusion model, Diffustereo [45] proposes a diffusion kernel and stereo constraints for 3D human reconstruction from sparse cameras. MotionDiffuse [66] leverages the diffusion model for the text-driven motion generation task. In this work, we focus on the 2D pose-guided motion transfer task, which differs from the above 3D reconstruction and text-driven tasks. Different from the image-to-image translation methods, DFC-Net improves the quality of the synthesized image by adding consistency constraints in the feature space.
Recently, there have been a growing number of human pose transfer methods with specifically designed modules. One branch is the spatial transformation methods [13, 28, 50], which aim at building the deformation mapping of the keypoint correspondences in the human body. Leveraging the spatial transformation capability of CNNs, [24] presented spatial transformer networks (STN) that approximate a global affine transformation to warp the features. Following STN, several variant works [25, 29, 65] have been proposed to synthesize images with better performance. [59] introduced an external eye-tracking dataset and two cascaded attention modules for comprehensive pose segmentation. [60] incorporated three different inference processes to detect each part of the human body. [4] used image segmentation to decompose the problem into modular subtasks for each body part and then integrated all parts into the final result. [50] built deformable skip connections to move information and transfer textures for pose transfer. Monkey-Net [48] encoded pose information via dense flow fields generated from keypoints learned in a self-supervised fashion. First-Order Motion Model [49] decoupled appearance and pose and proposed to use learned keypoints and local affine transformations to generate image animation. [34] integrated human pose transfer, appearance transfer, and novel view synthesis into one unified framework by using SMPL [36] to generate a human body mesh. The spatial transformation methods usually implicitly assume that the warping operation can cover the whole body. However, when the viewpoint changes or occlusions occur, this assumption does not hold, leading to pose ambiguity and performance degradation.
Another branch of methods is pose-guided and aims at predicting new appearance content in uncovered regions to handle the pose ambiguity problem. One of the earliest works, PG \(^{2}\) [37], presented a two-stage method using U-Net to synthesize the target person with arbitrary poses. [38] further decomposed the image into foreground, background, and pose features to achieve more precise control over the different types of information. [47] introduced a multi-stage GAN loss and synthesized each body part separately. [41] leveraged DensePose [3] rather than the commonly used 2D keypoints to perform accurate pose transfer. [10] learned a direct mapping from skeleton images to synthesized images with corresponding poses based on the architecture of Pix2PixHD [58]. PATH [74] introduced cascaded attention transfer blocks (PATBs) to refine pose and appearance features simultaneously. Inspired by PATH, PMAN [11] proposed a progressive multi-attention framework with memory networks to improve image quality. However, some of these methods [41, 57, 68] only constrain the synthesized results at the image level (i.e., with adversarial and reconstruction losses), thus leading to appearance inconsistency when the predicted local content is not consistent with the surrounding context. Some works [46, 67] designed lightweight networks to accelerate the training and inference process. Our method can also benefit from these lightweight networks to achieve high-efficiency human pose transfer.
In contrast, our method learns to disentangle and reassemble the pose and appearance in the feature space. A work close to ours is C \(^{2}\) GAN [54], which consists of three generation cycles (i.e., one for image generation and two for keypoint generation). C \(^{2}\) GAN explores cross-modal information at the image level at the cost of model complexity and training instability, while DFC-Net only introduces two feature consistency losses into the full objective, keeping the model simple and effective. By disentangling the pose and appearance features, we can enforce the feature consistencies between the synthesized and real images and leverage the pose features from an unpaired dataset to improve performance.

3 Methodology

3.1 Overview

The training and inference process of the proposed model is shown in Figure 1. Given one image \(\boldsymbol {x}_{\mathrm{s}}\) of a source person and another image \(\boldsymbol {x}_{\mathrm{t}}\) of a target person, DFC-Net synthesizes an image \(\boldsymbol {x}_\mathrm{syn}\) that preserves (a) the pose information, e.g., pose and location, of the source person in \(\boldsymbol {x}_{\mathrm{s}}\) , and (b) the static information, e.g., person appearance and environment background, from the target image \(\boldsymbol {x}_{\mathrm{t}}\) . For each image, DFC-Net attempts to disentangle the pose and static information into orthogonal features. Specifically, DFC-Net consists of the following core components: (1) a Pose Feature Encoder \(M(\cdot)\) , which extracts pose features \(M(\boldsymbol {x})\) from an image \(\boldsymbol {x}\) ; (2) a Static Feature Encoder \(S(\cdot)\) , which extracts static features \(S(\boldsymbol {x}^{\prime })\) from an image \(\boldsymbol {x}^{\prime }\) ; and (3) an Image Generator \(G(\cdot)\) , which synthesizes an image \(G(M(\boldsymbol {x}), S(\boldsymbol {x}^{\prime }))\) from the encoded pose and static features \(M(\boldsymbol {x})\) and \(S(\boldsymbol {x}^{\prime })\) of images \(\boldsymbol {x}\) and \(\boldsymbol {x}^{\prime }\) , respectively. In the remainder of this section, we describe the model architecture and introduce the training procedure, followed by the model instantiations.
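To make the overall data flow concrete, the following minimal PyTorch sketch wires the three components together; the class and argument names are ours for illustration and do not come from the paper's released code.

```python
import torch
import torch.nn as nn

class DFCNet(nn.Module):
    """Composes the three modules so that x_syn = G(M(x_s), S(x_t))."""
    def __init__(self, pose_encoder: nn.Module, static_encoder: nn.Module,
                 generator: nn.Module):
        super().__init__()
        self.M = pose_encoder    # Pose Feature Encoder M(.)
        self.S = static_encoder  # Static Feature Encoder S(.)
        self.G = generator       # Image Generator G(.)

    def forward(self, x_source: torch.Tensor, x_target: torch.Tensor) -> torch.Tensor:
        pose_feat = self.M(x_source)    # pose of the source person
        static_feat = self.S(x_target)  # appearance and background of the target person
        return self.G(pose_feat, static_feat)
```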

3.2 Pose Transfer Network Architecture

3.2.1 Pose Feature Encoder.

Our Pose Feature Encoder consists of a Pose Estimator network, a Keypoint Amplifier block, and a Pose Refiner network. Given an RGB image \(\boldsymbol {x}\in \mathbb {R}^{3\times H \times W}\) of height H and width W, the pre-trained Pose Estimator extracts pose information \(P(\boldsymbol {x})\) from the image \(\boldsymbol {x}\) . Similar to [9], the extracted pose information contains the downsampled keypoint heatmaps \(\boldsymbol {h}\in \mathbb {R}^{18 \times \frac{H}{8} \times \frac{W}{8}}\) and the part affinity fields \(\boldsymbol {p}\in \mathbb {R}^{38 \times \frac{H}{8} \times \frac{W}{8}}\) . The keypoint heatmaps \(\boldsymbol {h}\) store the heatmaps of 18 body parts, and the part affinity fields \(\boldsymbol {p}\) store the location and orientation of the connections between body parts and the background, occupying 38 ( \(=(18 + 1) \times 2\) ) channels.
As the pre-trained pose estimator (OpenPose [8] in our implementation) is trained on the COCO keypoint challenge dataset [30], the keypoint heatmaps \(\boldsymbol {h}\) become noisier when it is applied to the Mixamo-Pose and EDN-10k datasets, whose distributions are different. To reduce this interference, we apply a softmax function with a relatively small temperature T (e.g., 0.01) as the Keypoint Amplifier: it denoises the extracted keypoint heatmaps by increasing the gap between large and small values, yielding the amplified heatmaps \(\boldsymbol {h}^{\prime }\) as
\begin{align} \boldsymbol {h}^{\prime } = \mathrm{softmax}\left(\frac{1}{T}\cdot \boldsymbol {h}\right). \end{align}
(1)
As shown in Figure 2, by applying the Keypoint Amplifier to the input heatmaps, small probabilities (e.g., 0.2) are squeezed to almost 0.0, whereas large probabilities (e.g., 0.8) are pushed to almost 1.0. Without the Keypoint Amplifier, the generator may synthesize blurry limbs in the low-probability areas and twist the generated person.
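A minimal sketch of the Keypoint Amplifier in Equation (1) is shown below; taking the softmax over the spatial locations of each keypoint channel is our assumption, as the paper does not spell out the normalization axis.

```python
import torch
import torch.nn.functional as F

def keypoint_amplifier(heatmaps: torch.Tensor, temperature: float = 0.01) -> torch.Tensor:
    """heatmaps: (B, 18, H/8, W/8) keypoint heatmaps from the pose estimator."""
    b, c, h, w = heatmaps.shape
    flat = heatmaps.reshape(b, c, h * w) / temperature  # divide by T to widen the value gap
    amplified = F.softmax(flat, dim=-1)                 # suppress low-probability noise
    return amplified.reshape(b, c, h, w)
```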
Finally, the Pose Refiner takes both the part affinity fields \(\boldsymbol {p}\) and the amplified keypoint heatmaps \(\boldsymbol {h}^{\prime }\) and produces the encoded pose feature vector \(M(\boldsymbol {x})\) . In this way, the pose information extracted from the Pose Estimator can be refined, and the influence caused by different limb ratios and/or camera angles and distances can be reduced.

3.2.2 Static Feature Encoder.

While the Pose Feature Encoder is not designed to capture static information, such as the background and personal appearance, this information from another image \(\boldsymbol {x}^{\prime }\) is captured automatically by another module, guided by the full objective function. Named the Static Feature Encoder, this module extracts only the static features \(S(\boldsymbol {x}^{\prime })\) from \(\boldsymbol {x}^{\prime }\) .

3.2.3 Image Generator.

Given pose features \(M(\boldsymbol {x}_\mathrm{s})\) extracted from a source image \(\boldsymbol {x}_\mathrm{s}\) and static features \(S(\boldsymbol {x}_\mathrm{t})\) extracted from a target image \(\boldsymbol {x}_\mathrm{t}\) , the Image Generator outputs the synthesized image \(\boldsymbol {x}_{\mathrm{syn}}\) by
\begin{align} \boldsymbol {x}_{\mathrm{syn}} = G\left(M(\boldsymbol {x}_\mathrm{s}), S(\boldsymbol {x}_\mathrm{t})\right). \end{align}
(2)
Note that many existing methods (e.g., [10]) attempt to learn the pose-to-image or pose-to-appearance mapping solely via the generator. In that case, the generator has to learn three different functionalities: (1) memorizing the static information of the target person, (2) extracting representative pose features, and (3) combining the static and pose information to synthesize the target person image with the desired pose. Even if the generator memorizes the static information of the target person \(\boldsymbol {x}_\mathrm{t}\) perfectly, once the desired pose \(\boldsymbol {x}_\mathrm{s}\) is very different from the poses in the training dataset (e.g., in the distance from the camera, the skeleton scale of different persons, or occlusions), it is too difficult for the generator to achieve the second and third functionalities at the same time. The results of Pix2Pix [23] and Everybody Dance Now (EDN) [10] in Section 4 also reflect these disadvantages. DFC-Net, instead, decomposes the above three functionalities into three network modules, namely the pose feature encoder, the static feature encoder, and the image generator, and thus improves the reconstruction quality.

3.3 Training DFC-Net

We train the pose transfer network in an adversarial manner with the disentangled feature consistency losses as well as other objectives. The model is trained with a set of images of the same person, possibly from one or several video clips. To further improve the generalization ability of DFC-Net, we propose to train DFC-Net with a support set and the augmented consistency losses. We show the ablation study results in Section 4.3.

3.3.1 Adversarial Training.

We employ an Image Discriminator (D) in an adversarial manner to ensure that the synthesized image \(\boldsymbol {x}_\mathrm{syn}\) borrows the pose and static information from the source and target images ( \(\boldsymbol {x}_\mathrm{s}\) and \(\boldsymbol {x}_\mathrm{t}\) ), respectively. As the source and target images during training contain the same person with the same appearance and background, they share almost the same static features, i.e., \(S(\boldsymbol {x}_\mathrm{t}) \simeq S(\boldsymbol {x}_\mathrm{s})\) . Therefore, the output of the model \(\boldsymbol {x}_{\mathrm{syn}}\) can also be treated as a reconstruction of the source image \(\boldsymbol {x}_\mathrm{s}\) , as the synthesized image contains the same pose features \(M(\boldsymbol {x}_\mathrm{s})\) as the source image. This inspires us to resort to the conditional generative adversarial network (cGAN) [23], where the Image Discriminator attempts to discern between the real sample \(\boldsymbol {x}_\mathrm{s}\) and the generated image \(\boldsymbol {x}_{\mathrm{syn}}\) , conditioned on the pose features \(M(\boldsymbol {x}_\mathrm{s})\) extracted from the source image. That is, the Image Discriminator attempts to fit \(D (\boldsymbol {x}_\mathrm{s}, M(\boldsymbol {x}_\mathrm{s}))=1\) and \(D (\boldsymbol {x}_\mathrm{syn}, M(\boldsymbol {x}_\mathrm{s})) = 0\) . The adversarial loss is defined as follows:
\begin{align} \mathcal {L}_{\mathrm{adv}} = - (\mathcal {L}_{\mathrm{adv}}^+ + \mathcal {L}_{\mathrm{adv}}^-), \end{align}
(3)
where
\begin{align} \mathcal {L}_{\mathrm{adv}}^+ &= \log D (\boldsymbol {x}_\mathrm{s}, M(\boldsymbol {x}_\mathrm{s})), \end{align}
(4)
\begin{align} \mathcal {L}_{\mathrm{adv}}^- &= \log \left(1 - D \left(\boldsymbol {x}_\mathrm{syn}, M(\boldsymbol {x}_\mathrm{s})\right)\right). \end{align}
(5)
We enhance the Image Discriminator with a multi-scale discriminator \(D = (D_1, D_2)\) [58] and include the discriminator feature matching loss \(\mathcal {L}_\mathrm{fm}\) in our objective. The feature matching loss is a weighted sum of feature losses from five different layers of the Image Discriminator, calculated as the \(L_1\) distance between the corresponding features of \(\boldsymbol {x}_\mathrm{s}\) and \(\boldsymbol {x}_\mathrm{syn}\) .
In order to increase the training stability and improve the synthesized image quality, we also add the perceptual loss \(\mathcal {L}_\mathrm{per}\) [26] based on a pre-trained VGG network [51].
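For concreteness, the sketch below shows one way to implement the adversarial terms of Equations (3)–(5) and the feature matching loss, assuming the discriminator returns logits and intermediate features; the non-saturating surrogate for the generator and the detaching of real features are our choices and are not stated in the paper.

```python
import torch
import torch.nn.functional as F

def discriminator_loss(d_real_logits: torch.Tensor, d_fake_logits: torch.Tensor) -> torch.Tensor:
    # Equation (3): L_adv = -(log D(x_s, M(x_s)) + log(1 - D(x_syn, M(x_s)))),
    # minimized by the discriminator.
    real = F.binary_cross_entropy_with_logits(d_real_logits, torch.ones_like(d_real_logits))
    fake = F.binary_cross_entropy_with_logits(d_fake_logits, torch.zeros_like(d_fake_logits))
    return real + fake

def generator_adv_loss(d_fake_logits: torch.Tensor) -> torch.Tensor:
    # Non-saturating surrogate for the generator: maximize log D(x_syn, M(x_s))
    # instead of directly minimizing log(1 - D(x_syn, M(x_s))).
    return F.binary_cross_entropy_with_logits(d_fake_logits, torch.ones_like(d_fake_logits))

def feature_matching_loss(real_feats, fake_feats):
    # Sum of L1 distances between intermediate discriminator features of x_s
    # and x_syn (uniform weights here for simplicity).
    return sum(F.l1_loss(f_fake, f_real.detach())
               for f_real, f_fake in zip(real_feats, fake_feats))
```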

3.3.2 Disentangled Feature Consistency Losses.

The above adversarial training losses penalize the discrepancy between the synthesized and source images directly in the raw image space. To improve the accuracy and robustness of the pose transfer results, we also introduce two disentangled feature consistency losses, in terms of pose and static features, to ensure that the synthesized person looks like the target person and performs the pose of the source person. The pose consistency loss \(\mathcal {L}_\mathrm{mc}\) measures the difference between the synthesized and source images in the pose feature space, and the static consistency loss \(\mathcal {L}_\mathrm{sc}\) measures the difference between the synthesized and target images in the static feature space. Both are \(L_1\) distances between the outputs of the corresponding encoders, formally defined as
\begin{align} \mathcal {L}_\mathrm{mc}& = {\left\Vert M(\boldsymbol {x}_\mathrm{syn}) - M(\boldsymbol {x}_\mathrm{s}) \right\Vert }_1, \end{align}
(6)
\begin{align} \mathcal {L}_\mathrm{sc}& = {\left\Vert S(\boldsymbol {x}_\mathrm{syn}) - S(\boldsymbol {x}_\mathrm{t}) \right\Vert }_1. \end{align}
(7)
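Both losses translate directly into code, as in the sketch below; detaching the reference features so that gradients only flow through the synthesized image is our choice and is not specified in the paper.

```python
import torch.nn.functional as F

def pose_consistency_loss(M, x_syn, x_s):
    # L_mc = || M(x_syn) - M(x_s) ||_1, Equation (6)
    return F.l1_loss(M(x_syn), M(x_s).detach())

def static_consistency_loss(S, x_syn, x_t):
    # L_sc = || S(x_syn) - S(x_t) ||_1, Equation (7)
    return F.l1_loss(S(x_syn), S(x_t).detach())
```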

3.3.3 Augmented Consistency Loss.

By disentangling the pose feature from the source images, we find that images of different persons can also be fed into the training process as the source images \(\boldsymbol {x}_\mathrm{s}\) to improve the generalization ability of our model. Hence, we introduce a novel data augmentation method that extends the training dataset with images of different persons, referred to as the support set, which provides many kinds of unseen poses. Note that the subjects in the support set can be arbitrary and differ from those in the primary training dataset, so ground-truth images of the target person performing the pose of the source person are not available at all. As a result, the corresponding losses \(\mathcal {L}_\mathrm{adv}^+, \mathcal {L}_\mathrm{per}\) , and \(\mathcal {L}_\mathrm{fm}\) are not applicable to the support set, and we only optimize the remaining objective terms, defined as
\begin{align} \mathcal {L}_\mathrm{sup}= \lambda _\mathrm{adv}\mathcal {L}_\mathrm{adv}^- + \lambda _\mathrm{mc}\mathcal {L}_\mathrm{mc}+ \lambda _\mathrm{sc}\mathcal {L}_\mathrm{sc}, \end{align}
(8)
where \(\lambda _\mathrm{adv}, \lambda _\mathrm{mc}\) , and \(\lambda _\mathrm{sc}\) are the weights for the corresponding losses.
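A sketch of the support-set objective in Equation (8) follows; the discriminator interface D(image, pose_features) returning a logit map and the non-saturating surrogate for \(\mathcal {L}_\mathrm{adv}^-\) are our assumptions.

```python
import torch
import torch.nn.functional as F

def support_loss(D, M, S, G, x_s_support, x_t,
                 lambda_adv=1.0, lambda_mc=0.1, lambda_sc=0.01):
    # x_s_support is an unpaired source image from the support set; there is no
    # ground truth of the target person in this pose, so only the fake-side
    # adversarial term and the two consistency losses contribute.
    pose_feat = M(x_s_support)
    x_syn = G(pose_feat, S(x_t))
    d_fake = D(x_syn, pose_feat)                       # conditioned on pose features
    adv = F.binary_cross_entropy_with_logits(d_fake, torch.ones_like(d_fake))
    mc = F.l1_loss(M(x_syn), pose_feat.detach())       # L_mc
    sc = F.l1_loss(S(x_syn), S(x_t).detach())          # L_sc
    return lambda_adv * adv + lambda_mc * mc + lambda_sc * sc
```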

3.3.4 Full Objective.

By bringing all the objective terms together, we train all components jointly except for the Pose Estimator to minimize the full objective \(\mathcal {L}_\mathrm{full}\) below.
\begin{align} \mathcal {L}_\mathrm{full}= \lambda _\mathrm{adv}\mathcal {L}_\mathrm{adv}+ \lambda _\mathrm{fm}\mathcal {L}_\mathrm{fm}+ \lambda _\mathrm{per}\mathcal {L}_\mathrm{per}+ \lambda _\mathrm{mc}\mathcal {L}_\mathrm{mc}+ \lambda _\mathrm{sc}\mathcal {L}_\mathrm{sc}+ \mathcal {L}_\mathrm{sup}, \end{align}
(9)
where \(\lambda _\mathrm{adv}, \lambda _\mathrm{fm}, \lambda _\mathrm{per}\) are set to 1, 10, and 10 following Pix2Pix [23] and EDN [10], while \(\lambda _\mathrm{mc}, \lambda _\mathrm{sc}\) are set to 0.1 and 0.01 by grid search. We set \(\lambda _\mathrm{sc}\) to 0.01, smaller than \(\lambda _\mathrm{mc}\) , to balance \(\mathcal {L}_\mathrm{mc}\) and \(\mathcal {L}_\mathrm{sc}\) .
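The weights above can be collected into a simple helper, as in the sketch below; the dictionary and function names are ours for illustration.

```python
# lambda_adv, lambda_fm, lambda_per follow Pix2Pix/EDN; lambda_mc and lambda_sc
# come from the grid search described above.
LOSS_WEIGHTS = {"adv": 1.0, "fm": 10.0, "per": 10.0, "mc": 0.1, "sc": 0.01}

def full_objective(losses: dict, support_loss_value: float) -> float:
    # `losses` maps each term name ("adv", "fm", "per", "mc", "sc") to its value
    # on the paired batch; the support-set term is added unweighted, as in Eq. (9).
    return sum(LOSS_WEIGHTS[k] * losses[k] for k in LOSS_WEIGHTS) + support_loss_value
```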

3.4 Training and Inference Process

For each subject in the training dataset (i.e., Mixamo-Pose and EDN-10k in our experiments), we train one separate model following the same scheme as [10] (e.g., we trained four models for the four subjects in the EDN-10k dataset). For fair comparison, we also train all the baseline methods following the same scheme.
During the training stage, given a training dataset consisting of N images of one subject and a support set, for each training iteration we randomly choose a pair of images \(x_s\) and \(x_t\) from the training dataset and an additional source image \(x_s\) from the support set, pass them into DFC-Net, and train it using the full objective in Equation (9).
During the inference stage, given a desired pose image \(x_s\) , we randomly choose an image \(x_t\) from the training dataset and synthesize the result. For the EDN-10k dataset, the pose image \(x_s\) is chosen from the testing dataset with a pose unseen during training. Even though the pose image \(x_s\) and the target person image \(x_t\) contain the same person (the ground truth of the pose image \(x_s\) with another person is unavailable for real-world data), the static information in the pose image \(x_s\) is discarded when passing through the pose feature encoder, and only the keypoint information is preserved. For the Mixamo-Pose dataset, the pose image \(x_s\) is chosen from the testing dataset and may depict a person different from the target person (e.g., the target person image \(x_t\) is from Liam and the source person image \(x_s\) is from Remy). For both benchmarks, DFC-Net has to extract the static features from the target person image \(x_t\) and combine them with the pose features of the pose image \(x_s\) to synthesize the final images, whose poses are unseen during training. Thus there is no information leakage.
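Schematically, one training iteration then looks as follows; the loader and train_step interfaces are placeholders of our own, not part of the paper.

```python
import random

def training_iteration(train_images, support_images, train_step):
    # One iteration of Section 3.4: an (x_s, x_t) pair of the same subject from
    # the training set, plus one unpaired source image from the support set.
    x_s, x_t = random.sample(train_images, 2)
    x_s_sup = random.choice(support_images)
    return train_step(x_s, x_t, x_s_sup)   # optimizes the full objective in Eq. (9)
```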

3.5 Implementation Details

We employed the pre-trained VGG-19 [52] network from [9] for the Pose Estimator and adopted an approach similar to [57] to build our network. The detailed designs are as follows:
Pose Refiner: It is composed of a convolutional block, a channel-wise upsampling module, and five residual blocks [17]. Firstly, the convolutional block consists of a reflection padding layer, a \(7 \times 7\) convolutional layer, a batch normalization layer, and ReLU. The channel-wise upsampling module, which increases the number of channels from 64 to 512, contains three convolutional blocks. Each block contains a \(3 \times 3\) convolutional layer, a batch normalization layer, and ReLU. Each of the five residual blocks consists of two small convolutional blocks, and each block has a reflection padding layer, a \(3 \times 3\) convolutional layer, and a batch normalization layer. The first small convolutional block also has a ReLU at the end.
Static Feature Encoder: It first has the same convolutional block as in the Pose Refiner. Then it contains three convolutional downsampling blocks, each consisting of a \(3 \times 3\) convolutional layer, a batch normalization layer, and ReLU. Five residual blocks, the same as in the Pose Refiner, follow the downsampling blocks.
Image Generator: It is composed of four residual blocks, an upsampling module, and a convolutional block. Each residual block is the same as in the Pose Refiner and Static Feature Encoder. The upsampling module consists of three transposed convolutional blocks, each composed of a \(3 \times 3\) transposed convolutional layer, a batch normalization layer, and ReLU. The last convolutional block contains two reflection padding layers, two \(7 \times 7\) convolutional layers, and a hyperbolic tangent (tanh) function.
Image Discriminator: It contains two discriminators at different scales, similar to [58]. Each discriminator is composed of five convolutional blocks. The first block has a \(4 \times 4\) convolutional layer and LeakyReLU. Each of the next three blocks has a \(4 \times 4\) convolutional layer, a batch normalization layer, and LeakyReLU. The last block only has a \(4 \times 4\) convolutional layer. A sketch of these building blocks is given after this list.
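The following PyTorch sketch illustrates how the residual block, the Pose Refiner, and the multi-scale Image Discriminator described above might be assembled; the input channel counts, hidden widths, strides, paddings, and the downsampling between the two discriminator scales are our assumptions and are not taken from the paper or its code.

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Two small conv blocks (reflection pad, 3x3 conv, BN); ReLU after the first."""
    def __init__(self, channels: int):
        super().__init__()
        self.block = nn.Sequential(
            nn.ReflectionPad2d(1), nn.Conv2d(channels, channels, 3),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.ReflectionPad2d(1), nn.Conv2d(channels, channels, 3),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        return x + self.block(x)

class PoseRefiner(nn.Module):
    """Conv block, channel-wise upsampling 64 -> 512, then five residual blocks."""
    def __init__(self, in_channels: int = 18 + 38):  # amplified heatmaps + PAFs (assumed)
        super().__init__()
        self.head = nn.Sequential(
            nn.ReflectionPad2d(3), nn.Conv2d(in_channels, 64, 7),
            nn.BatchNorm2d(64), nn.ReLU(inplace=True),
        )
        widths = [64, 128, 256, 512]
        self.channel_up = nn.Sequential(*[
            nn.Sequential(nn.Conv2d(widths[i], widths[i + 1], 3, padding=1),
                          nn.BatchNorm2d(widths[i + 1]), nn.ReLU(inplace=True))
            for i in range(3)
        ])
        self.res_blocks = nn.Sequential(*[ResidualBlock(512) for _ in range(5)])

    def forward(self, pose_inputs):
        return self.res_blocks(self.channel_up(self.head(pose_inputs)))

def single_scale_discriminator(in_channels: int) -> nn.Sequential:
    """Five 4x4 conv blocks: conv+LeakyReLU, three conv+BN+LeakyReLU, final score conv."""
    layers = [nn.Conv2d(in_channels, 64, 4, stride=2, padding=1),
              nn.LeakyReLU(0.2, inplace=True)]
    widths = [64, 128, 256, 512]
    for i in range(3):
        layers += [nn.Conv2d(widths[i], widths[i + 1], 4, stride=2, padding=1),
                   nn.BatchNorm2d(widths[i + 1]),
                   nn.LeakyReLU(0.2, inplace=True)]
    layers += [nn.Conv2d(512, 1, 4, padding=1)]
    return nn.Sequential(*layers)

class MultiScaleDiscriminator(nn.Module):
    """Two identical discriminators; the second operates on a 2x downsampled input."""
    def __init__(self, in_channels: int):
        super().__init__()
        self.d1 = single_scale_discriminator(in_channels)
        self.d2 = single_scale_discriminator(in_channels)
        self.down = nn.AvgPool2d(3, stride=2, padding=1)

    def forward(self, x):
        return self.d1(x), self.d2(self.down(x))
```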

4 Experiments

4.1 Experimental Setup

4.1.1 Datasets.

We built Mixamo-Sup as a support set for data augmentation to boost the generality of DFC-Net and organized two datasets, Mixamo-Pose and EDN-10k, to verify the effectiveness of the proposed DFC-Net for human pose transfer.
EDN-10k: We processed and tailored the dataset released by [10] to build the EDN-10k dataset. The original dataset consists of five long target videos, each lasting 8 to 17 minutes and split into training and test sets. In each video, a different subject performs a series of motions, and the camera is fixed so that the background remains unchanged. We chose the first four subjects, since subject five only performed less complex dance poses, and uniformly sampled 10k frames as the training set and 1k frames as the test set for each subject. Since the original images have a large resolution of \(1024 \times 512\) and most areas are the fixed background, we cropped all frames to the middle \(512 \times 512\) square area and resized them to \(256 \times 256\) (a preprocessing sketch follows the dataset descriptions).
Mixamo-Pose: We randomly chose four characters, Andromeda, Liam, Remy, and Stefani, with 30 different pose sequences from Mixamo. To render the 3D animations into 2D images, we loaded each character performing each pose sequence on a white background into Blender [5], placed two cameras in front of and behind the character, and took the images. We centered the characters in the images according to their keypoints and resized them to 256 \(\times\) 256. Mixamo-Pose was split into training and test sets. For each character, the training set contains 1,488 images with 15 poses, and the test set contains 1,185 images with 15 other poses.
Mixamo-Sup: For data augmentation, we built a support set by rendering 15,684 images of six new characters from Mixamo [2] with another 15 poses unseen in Mixamo-Pose, following the same rendering procedure. Since the source person need not be the same as the target person, DFC-Net uses the support set images as source person images \(x_s\) . When training on both EDN-10k and Mixamo-Pose, we use Mixamo-Sup as the support set. Note that Mixamo-Sup has a totally different distribution from EDN-10k but still yields a substantial improvement, as shown in Section 4.3.
Note that the experiments on EDN-10k only include pose transfer on the same person because the ground truth of different people performing the same pose is unavailable. For Mixamo-Pose, since we can manipulate different characters to perform the same action, the experiments include pose transfer both between different people, e.g., transferring an unseen pose of Liam to Andromeda, and on the same person, e.g., transferring an unseen pose of Andromeda to herself.
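As referenced in the EDN-10k description, a minimal frame preprocessing sketch using Pillow is given below; the file paths and the choice of resampling filter are illustrative.

```python
from PIL import Image

def preprocess_frame(path_in: str, path_out: str) -> None:
    frame = Image.open(path_in)                        # expected size 1024 x 512 (W x H)
    w, h = frame.size
    left = (w - 512) // 2
    square = frame.crop((left, 0, left + 512, 512))    # middle 512 x 512 square
    square.resize((256, 256), Image.BICUBIC).save(path_out)
```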

4.1.2 Baseline Methods.

We compared our DFC-Net with the following competitive baselines:
Nearest Neighbors (NN): For each source person image \(\boldsymbol {x}_\mathrm{s}\) , we chose the image \(\boldsymbol {x}^{\prime }\) in the training set \(\mathcal {D}_{\mathrm{tr}}\) with the lowest mean square error (MSE) between the pose information \(P(\boldsymbol {x}_\mathrm{s})\) and \(P(\boldsymbol {x}^{\prime })\) as \(\boldsymbol {x}_\mathrm{syn}\) .
\begin{align} \boldsymbol {x}_\mathrm{syn}& = {\arg \min }_{\boldsymbol {x}^{\prime } \in \mathcal {D}_{\mathrm{tr}}}{\left\Vert P(\boldsymbol {x}_\mathrm{s}) - P(\boldsymbol {x}^{\prime }) \right\Vert }^2_2. \end{align}
(10)
The pose information was extracted by the same Pose Estimator as in our method; a sketch of this baseline follows the list of baselines.
Pose-guided Methods: We chose CycleGAN [73], Pix2Pix [23], and EDN [10] as baselines. They all take skeleton images as input instead of the original images; we employed a pre-trained pose estimator [9] to extract keypoints and used OpenCV [7] to connect pairs of keypoints with differently colored lines to generate the skeleton images. To ensure fair comparisons, the face GAN and face keypoint estimator in EDN were not adopted in our implementation, as they are independent components that can be seamlessly adopted by other learning-based baselines.
Spatial Transformation Methods: We selected Liquid Warping GAN (LWG) [34], Monkey-Net (MKN) [48], and First Order Motion Model (FOMM) [49]. LWG calculates the flow fields with additional 3D human models and integrates the human pose transfer, appearance transfer, and novel view synthesis into one unified framework. MKN and FOMM are both object-agnostic frameworks using learned keypoints to generate image animation in a self-supervised fashion.
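A sketch of the nearest-neighbor baseline in Equation (10) is shown below; pose_fn stands for the pose estimator \(P(\cdot)\) and is an assumed callable returning a tensor.

```python
import torch

def nearest_neighbor_baseline(pose_fn, x_source, training_images):
    # Returns the training image whose pose representation P(x) is closest
    # (in squared L2 distance) to that of the source image.
    p_src = pose_fn(x_source).flatten()
    best_img, best_dist = None, float("inf")
    for x in training_images:
        dist = torch.sum((pose_fn(x).flatten() - p_src) ** 2).item()
        if dist < best_dist:
            best_img, best_dist = x, dist
    return best_img
```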

4.1.3 Evaluation Metrics.

We evaluated the quality of the synthesized images with three commonly used metrics:
MSE: The mean squared error between the values of pixels of synthesized images and ground-truth images. Lower MSE values are better.
PSNR: The peak signal-to-noise ratio, which provides an empirical measure of the quality of synthesized images regarding ground-truth images. Higher PSNR values are better.
SSIM: Structural similarity [61], another perceptual metric that quantifies the quality of synthesized images given ground-truth images and focuses more on structural information (e.g., luminance). Higher SSIM values are better.
IS: Inception Score [44] is a metric for estimating the quality of the synthetic images based on the Inception-V3 model [53]. Higher IS values are better.
FID: Frechet Inception Distance [18], also an Inception-V3-based metric, evaluates the synthetic images by comparing feature statistics of the synthetic and real images. Lower FID values are better.
We calculated the average scores over all pairs of synthesized and ground-truth images on the test set. On Mixamo-Pose, for each character as the target person, we reported the average metrics over four different characters as the source person. On EDN-10k, we reported metrics of the same-person transfer task for every subject.
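The pixel-level metrics can be computed with scikit-image as in the sketch below; IS and FID additionally require an Inception-V3 feature extractor and are omitted here.

```python
import numpy as np
from skimage.metrics import (mean_squared_error, peak_signal_noise_ratio,
                             structural_similarity)

def pixel_metrics(synthesized: np.ndarray, ground_truth: np.ndarray):
    """Both inputs are uint8 RGB arrays of shape (H, W, 3)."""
    mse = mean_squared_error(ground_truth, synthesized)
    psnr = peak_signal_noise_ratio(ground_truth, synthesized, data_range=255)
    ssim = structural_similarity(ground_truth, synthesized,
                                 channel_axis=-1, data_range=255)
    return mse, psnr, ssim
```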

4.2 Quantitative Evaluations

4.2.1 Results on EDN-10k.

Tables 1–5 report results in terms of MSE, SSIM, PSNR, IS, and FID on the EDN-10k dataset. Table 6 provides comparisons on EDN-10k in terms of the average of the above five metrics over the four subjects. The experimental results validate the advantages of DFC-Net on real images:
Table 1.
Method          Subject1   Subject2   Subject3   Subject4
NN              54.9448    36.2267    55.2041    26.2531
CycleGAN [73]   64.1959    77.4336    70.3681    52.5171
Pix2Pix [23]    58.3633    43.0771    62.0203    24.1688
EDN [10]        56.3549    36.3887    55.9625    21.5724
LWG [34]        51.6246    43.2884    53.4031    21.4314
MKN [48]        48.1603    30.9902    47.6255    21.6634
FOMM [49]       46.2852    30.7603    51.1431    21.2709
Ours            45.2043    30.4782    48.5436    20.7248
Table 1. Comparisons on EDN-10k in Terms of MSE with the Best Results (Lowest Values) in Bold
Table 2.
Method          Subject1   Subject2   Subject3   Subject4
NN              0.6138     0.8253     0.7616     0.8437
CycleGAN [73]   0.5256     0.4911     0.5869     0.7821
Pix2Pix [23]    0.6238     0.8040     0.7585     0.8767
EDN [10]        0.6205     0.8445     0.8233     0.8939
LWG [34]        0.6394     0.8375     0.7434     0.8634
MKN [48]        0.7007     0.8503     0.8030     0.8904
FOMM [49]       0.6645     0.8445     0.7896     0.8649
Ours            0.7083     0.8670     0.8241     0.9083
Table 2. Comparisons on EDN-10k in Terms of SSIM with the Best Results (Highest Values) in Bold
Table 3.
Method          Subject1   Subject2   Subject3   Subject4
NN              30.7957    32.6439    31.3307    34.2481
CycleGAN [73]   30.0577    29.2421    29.7007    31.3398
Pix2Pix [23]    30.6020    32.0304    30.5843    34.6860
EDN [10]        30.7109    32.8639    30.9957    35.0369
LWG [34]        31.0085    31.7812    31.0503    35.0118
MKN [48]        31.3094    33.2530    30.6493    34.9108
FOMM [49]       31.5272    33.3145    31.3099    34.9612
Ours            31.5978    33.3509    31.4159    35.0718
Table 3. Comparisons on EDN-10k in Terms of PSNR with the Best Results (Highest Values) in Bold
Table 4.
Method          Subject1   Subject2   Subject3   Subject4
NN              3.1284     3.3052     3.1525     3.3271
CycleGAN [73]   2.9294     2.8902     2.9776     3.0343
Pix2Pix [23]    3.1903     3.2966     3.1864     3.5012
EDN [10]        3.1802     3.4328     3.4083     3.5316
LWG [34]        3.1774     3.4035     3.1365     3.4122
MKN [48]        3.3481     3.4680     3.3793     3.5238
FOMM [49]       3.2794     3.4075     3.3281     3.4019
Ours            3.3502     3.5227     3.4240     3.5682
Table 4. Comparisons on EDN-10k in Terms of IS with the Best Results (Highest Values) in Bold
Table 5.
Method          Subject1   Subject2   Subject3   Subject4
NN              24.8053    19.2783    22.4787    23.3219
CycleGAN [73]   37.8460    38.8926    35.3055    32.9549
Pix2Pix [23]    23.5316    21.3092    24.7829    17.5752
EDN [10]        23.8172    18.3454    19.0175    14.4926
LWG [34]        22.3348    18.1062    25.7232    19.7245
MKN [48]        19.4209    16.5735    19.5568    14.4617
FOMM [49]       20.3571    17.6391    21.3258    18.7283
Ours            18.3029    15.2877    17.8782    13.7508
Table 5. Comparisons on EDN-10k in Terms of FID with the Best Results (Lowest Values) in Bold
Table 6.
Method          MSE(\(\downarrow\))   SSIM(\(\uparrow\))   PSNR(\(\uparrow\))   IS(\(\uparrow\))   FID(\(\downarrow\))
NN              43.1572    0.7611    32.2546    3.2283    22.4711
CycleGAN [73]   66.1287    0.5964    30.0851    2.9579    36.2498
Pix2Pix [23]    46.9074    0.7658    31.9757    3.2936    21.7997
EDN [10]        42.5696    0.7956    32.4019    3.3882    18.9182
LWG [34]        42.4368    0.7709    32.2129    3.2824    21.4496
MKN [48]        37.1098    0.8111    32.5306    3.4298    17.5032
FOMM [49]       37.3648    0.7908    32.7782    3.3542    19.5126
Ours            36.2377    0.8269    32.8591    3.4662    16.3049
Table 6. Comparisons on EDN-10k in Terms of the Average Results of the 5 Metrics Over 4 Subjects
Our method consistently outperformed all the baseline methods on all subjects. When synthesizing real-person images, the most significant result is on Subject1, where our method achieved an MSE of 45.2043 while NN, CycleGAN, Pix2Pix, EDN, LWG, and MKN obtained MSE values of 54.9448, 64.1959, 58.3633, 56.3549, 51.6246, and 48.1603 in Table 1, indicating that the images generated by our method have clearer details. From Tables 4 and 5, DFC-Net also outperformed the other baselines for all four subjects according to the Inception Score (IS) and Frechet Inception Distance (FID). As shown in Table 6, our method also achieved the highest average SSIM of 0.8269 over all subjects, while no other method except MKN reached an SSIM score greater than 0.8, which shows that our synthesized images are more realistic and better aligned with human visual perception.
Secondly, we noticed that CycleGAN has the worst results, with MSE scores of 64.1959, 77.4336, and 70.3681 and SSIM scores of 0.5256, 0.4911, and 0.5869 on Subject1, Subject2, and Subject3 in Tables 1 and 2, respectively. We argue that, because CycleGAN aims at learning a mapping from unpaired images directly, it is better at transferring the color or style between two image domains than at changing the geometry of the images, such as recovering the human appearance from a human skeleton. This property of CycleGAN is also supported by its inferior PSNR scores compared with the other methods.
We also observe that NN achieves low MSE scores in Table 1. Since there are many training images, it is easy for NN to find an image whose motion is very close to the desired motion. Moreover, the fixed background also gives NN higher SSIM and PSNR scores in Tables 2 and 3, while the other methods have to learn to generate an accurate background. However, the images retrieved by NN usually do not perform the desired motions and lack any temporal coherence when the input is a motion sequence, since the results depend only on the training set.
Moreover, we observed that EDN, LWG, MKN, and FOMM also provided good results compared with the other baselines, especially on the MSE metric, e.g., average values of 42.5696, 42.4368, 37.1098, and 37.3648 over all subjects computed from Table 1. Taking Subject2 as an example, EDN, LWG, MKN, and FOMM provided SSIM values of 0.8445, 0.8375, 0.8503, and 0.8445 in Table 2, which are higher than the results of NN, CycleGAN, and Pix2Pix. The higher SSIM values show that these methods can synthesize images closer to the ground truths.

4.2.2 Results on Mixamo-Pose.

Tables 7–11 respectively show the quantitative results of pose transfer in terms of MSE, SSIM, PSNR, IS, and FID on the Mixamo-Pose dataset. Table 12 provides comparisons on Mixamo-Pose in terms of the average of the above five metrics over the four characters. The empirical results clearly demonstrate the effectiveness of DFC-Net on animation images:
Table 7.
Method          Andromeda   Liam      Remy      Stefani
NN              24.9047     28.4846   27.1801   24.8609
CycleGAN [73]   27.5520     29.9436   28.2237   22.7682
Pix2Pix [23]    23.9370     23.5610   24.0687   21.5841
EDN [10]        24.3244     23.1203   23.9229   35.0930
LWG [34]        24.2905     22.8587   22.9707   22.0910
MKN [48]        30.5934     39.3444   29.4817   24.1297
FOMM [49]       27.7809     29.0469   27.4474   25.3934
Ours            23.8539     21.7328   22.0763   21.2587
Table 7. Comparisons on Mixamo-Pose in Terms of MSE with the Best Results (Lowest Values) in Bold
Table 8.
Method          Andromeda   Liam      Remy      Stefani
NN              0.7357      0.7265    0.7487    0.7411
CycleGAN [73]   0.7205      0.7154    0.7377    0.7753
Pix2Pix [23]    0.7784      0.7955    0.7932    0.8069
EDN [10]        0.7817      0.7931    0.7926    0.8058
LWG [34]        0.7613      0.7912    0.7887    0.7858
MKN [48]        0.7076      0.6874    0.7531    0.7753
FOMM [49]       0.7165      0.7404    0.7538    0.74642
Ours            0.7726      0.8040    0.8057    0.8071
Table 8. Comparisons on Mixamo-Pose in Terms of SSIM with Best Results (Highest Values) in Bold
Table 9.
Method          Andromeda   Liam      Remy      Stefani
NN              34.4420     34.0575   34.3865   34.3977
CycleGAN [73]   33.7903     33.4230   33.6786   34.6130
Pix2Pix [23]    34.4049     34.5167   34.3875   34.8497
EDN [10]        34.3524     34.5882   34.4214   34.7247
LWG [34]        34.3426     34.6762   34.6183   34.7519
MKN [48]        33.6626     32.8841   33.7487   34.4779
FOMM [49]       33.7777     33.5912   33.8337   34.1919
Ours            34.4336     34.8798   34.7823   34.9303
Table 9. Comparisons on Mixamo-Pose in Terms of PSNR with the Best Results (Highest Values) in Bold
Table 10.
Method          Andromeda   Liam      Remy      Stefani
NN              3.3671      3.4907    3.3170    3.3815
CycleGAN [73]   2.9237      2.9105    3.0148    3.0681
Pix2Pix [23]    3.3892      3.5026    3.3356    3.5073
EDN [10]        3.4308      3.5418    3.4509    3.5624
LWG [34]        3.3704      3.4052    3.3177    3.4339
MKN [48]        3.3856      3.5697    3.4212    3.5105
FOMM [49]       3.3921      3.4893    3.4082    3.4721
Ours            3.4285      3.6297    3.5618    3.6075
Table 10. Comparisons on Mixamo-Pose in Terms of IS with the Best Results (Highest Values) in Bold
Table 11.
Method          Andromeda   Liam      Remy      Stefani
NN              21.9411     23.4728   22.5086   24.7382
CycleGAN [73]   23.1479     25.1871   21.0418   21.2344
Pix2Pix [23]    14.5051     12.7375   12.8751   11.1583
EDN [10]        14.6281     11.1481   11.6892   11.2089
LWG [34]        16.7303     11.4930   13.5728   14.3207
MKN [48]        21.4839     21.4015   17.3208   16.7219
FOMM [49]       19.6782     19.3755   21.2926   20.0382
Ours            14.3172     10.1062   11.2756   10.7603
Table 11. Comparisons on Mixamo-Pose in Terms of FID with the Best Results (Lowest Values) in Bold
Table 12.
Method          MSE(\(\downarrow\))   SSIM(\(\uparrow\))   PSNR(\(\uparrow\))   IS(\(\uparrow\))   FID(\(\downarrow\))
NN              26.3576    0.7380    34.3209    3.3891    23.1652
CycleGAN [73]   27.1219    0.7372    33.8762    2.9793    22.6528
Pix2Pix [23]    23.2877    0.7935    34.5397    3.4337    12.8190
EDN [10]        26.6151    0.7933    34.5217    3.4965    12.1686
LWG [34]        23.0527    0.7817    34.5972    3.3818    14.0292
MKN [48]        30.8873    0.7308    33.6933    3.4718    19.2320
FOMM [49]       28.6544    0.7392    33.8486    3.4404    20.0961
Ours            22.2304    0.7973    34.7565    3.5569    11.6148
Table 12. Comparisons on Mixamo-Pose in Terms of the Average Results of the 5 Metrics Over 4 Characters
As shown in Tables 7–11, our DFC-Net again outperformed all competing baselines regarding all five metrics on average, echoing the results on EDN-10k. Notably, in terms of average MSE, DFC-Net outperformed the second-best LWG by 0.8223 (Table 12). More concretely, the lowest MSEs indicate that DFC-Net provided the most accurate motion transfer images, with the shortest \(L_2\) distance to the ground-truth images. For instance, our method achieves an MSE of 22.0763 on Remy compared to 27.1801, 28.2237, 24.0687, and 23.9229 from NN, CycleGAN, Pix2Pix, and EDN in Table 7. Similar results for IS and FID can be observed in Tables 10 and 11. Furthermore, together with the highest PSNR and SSIM scores, our DFC-Net generates synthesized images with the most accurate motion transfer and the best image quality simultaneously.
Secondly, NN provided relatively good results on Andromeda but higher MSE on Liam and Remy in Table 7, since its performance depends entirely on the training dataset and it cannot provide stable synthesized results. Moreover, in Tables 7 and 8, CycleGAN also does not perform well, with MSE scores of 27.5520, 29.9436, and 28.2237 and SSIM scores of 0.7205, 0.7154, and 0.7377 on Andromeda, Liam, and Remy, respectively. These results once again validate that it is difficult to directly learn a mapping from skeleton images to human images using unpaired data.
Thirdly, from Tables 7–11, Pix2Pix, EDN, and LWG delivered superior performance compared with NN and CycleGAN because they aim at learning a paired mapping from skeleton images to human images directly. Besides, the skeleton images they take as input provide accurate motion information without any noise, which makes the learning process easier. In contrast, even without converting source person images into skeleton images, our method DFC-Net fills the gap between the original source person images and the skeleton images to some extent by introducing the keypoint amplifier, the two consistency losses, and a support dataset for training. Even without any pre-processing steps, so that the source person images contain much noise and redundant information, our method still achieves improvements over Pix2Pix and EDN according to all five metrics.
Compared to the results on EDN-10k, MKN and FOMM suffered performance drops when transferring poses between different people on Mixamo-Pose. It is difficult for them to extract keypoint features without a pre-trained pose estimator when the source person is not in the training dataset. For example, in Table 8, the SSIM score of MKN on Andromeda is 0.7076, compared with 0.7357, 0.7784, 0.7817, and 0.7726 from NN, Pix2Pix, EDN, and our method.

4.3 Ablation Study

To better understand the merits of designs of DFC-Net, we conducted detailed ablation studies on Subject1 from EDN-10k and Liam from Mixamo-Pose. The evaluation results are shown in Tables 13 and 14.
Table 13.
     KA   \(\mathcal {L}_\mathrm{sc}\)   \(\mathcal {L}_\mathrm{mc}\)   \(\mathcal {L}_\mathrm{sup}\)   MSE(\(\downarrow\))   PSNR(\(\uparrow\))   SSIM(\(\uparrow\))
1    –    –    –    –    52.5749   30.9361   0.6595
2    ✓    –    –    –    50.7810   31.0860   0.6676
3    ✓    ✓    –    –    49.3831   31.2043   0.6742
4    ✓    –    ✓    –    49.8352   31.1662   0.6710
5    ✓    ✓    ✓    –    48.6431   31.2714   0.6796
6    –    ✓    ✓    ✓    46.3699   31.4801   0.6937
7    ✓    ✓    ✓    ✓    45.2043   31.5978   0.7083
Table 13. Ablation Studies on Subject1 from EDN-10k
KA denotes the Keypoint Amplifier.
Table 14.
     KA   \(\mathcal {L}_\mathrm{sc}\)   \(\mathcal {L}_\mathrm{mc}\)   \(\mathcal {L}_\mathrm{sup}\)   MSE(\(\downarrow\))   PSNR(\(\uparrow\))   SSIM(\(\uparrow\))
1    –    –    –    –    26.4595   33.9856   0.7605
2    ✓    –    –    –    24.7312   34.2838   0.7726
3    ✓    ✓    –    –    23.7580   34.4666   0.7830
4    ✓    –    ✓    –    22.6160   34.6913   0.8011
5    ✓    ✓    ✓    –    21.9509   34.8292   0.8032
6    –    ✓    ✓    ✓    22.0240   34.8047   0.8031
7    ✓    ✓    ✓    ✓    21.7328   34.8798   0.8040
Table 14. Ablation Studies on Liam from Mixamo-Pose
KA denotes the Keypoint Amplifier.

4.3.1 Keypoint Amplifier.

In Tables 13 and 14, the baseline (the first row) shows the results of our model without the keypoint amplifier, the consistency losses, and the support dataset. This baseline provided the worst results on the three metrics, e.g., an MSE of 26.4595 in Table 14. Comparing the second row with the first row shows that the keypoint amplifier strengthened the pose transfer performance, for example raising the SSIM of the baseline from 0.6595 to 0.6676 in Table 13. Even with the consistency losses and the support set (rows 6 and 7), it continued to reduce the interference of the noise on the keypoint heatmaps. These results validate that the keypoint amplifier can highlight the true keypoint locations and suppress the noise in the keypoint heatmaps.

4.3.2 Consistency Losses.

We explored the effects of the consistency losses \(\mathcal {L}_\mathrm{sc}\) and \(\mathcal {L}_\mathrm{mc}\) on the pose transfer task. Comparing the second row with the third row shows that the static feature consistency loss \(\mathcal {L}_\mathrm{sc}\) boosts the pose transfer task, verifying its validity. Moreover, the performance variations between the second row and the fourth row clearly evidence the advantages of the pose feature consistency loss \(\mathcal {L}_\mathrm{mc}\) . With both feature consistency losses, the model achieved better results with an MSE of 21.9509, a PSNR of 34.8292, and an SSIM of 0.8032, as shown in the fifth row of Table 14. The comparison among these cases shows that each consistency loss, whether the static one or the motion one, can enforce the consistency between the real and synthesized images, thus improving the quality of the synthesized images.
It is noted that the static feature consistency was more useful than the motion one on EDN-10k, while we observed the opposite on Mixamo-Pose. For example, in the third and fourth rows of Table 14, the static feature consistency loss achieved a 0.0104 improvement in SSIM, while the motion feature consistency loss provided a 0.0285 improvement. We attribute this to the fact that the backgrounds of images in Mixamo-Pose are simply white (i.e., easy to learn), so the static feature consistency could only boost the performance for personal appearance on Mixamo-Pose. On the contrary, the backgrounds are more complex in EDN-10k, and thus the static feature consistency was more effective on EDN-10k than on Mixamo-Pose.

4.3.3 Support Set with Augmented Consistency Loss.

We added the support set and utilized the augmented consistency loss \(\mathcal {L}_\mathrm{sup}\) during training. The comparisons between the final row and the other rows clearly indicate that the support set improves the generalization ability and robustness of our model, especially when the poses of the source person are close to the poses in the support set. The support set provides more negative examples beyond the original training set and helps the discriminator form an accurate boundary, which further strengthens the performance of the generator. We also noticed that the support set was more useful on EDN-10k, because EDN-10k was sampled from video clips of the subjects and does not contain as much motion variance as Mixamo-Pose. Moreover, the limb ratios and the distances between the person and the camera in the support set were also distinct from those in EDN-10k, which provides more valuable negative samples for EDN-10k.

4.3.4 Different Pose Estimator Backbone.

To further verify the contribution of different pre-trained pose estimator backbones, we also conducted extensive experiments on Subject1 from EDN-10k and Liam from Mixamo-Pose with ResNet architectures, including ResNet-18, ResNet-50, ResNet-101, and ResNet-152. For ResNet-18, we did not find any pre-trained model online, so we re-implemented [9] by replacing the VGG-19 with ResNet-18 and following the same training process as [9]. For ResNet-50, ResNet-101, and ResNet-152, we directly used the pre-trained models from [63]. The results are shown in Tables 15 and 16. We can observe that a larger pose estimator backbone usually leads to better performance on all five metrics. For instance, on Subject1 from EDN-10k, ResNet-152 achieved the best SSIM result of 0.7261 compared to the other backbones. This is because a better pre-trained pose estimator backbone can provide more representative pose information and thus generate higher-fidelity images. It suggests that our method still has room for improvement by incorporating more advanced pose estimation technologies. In this work, we mainly focus on adding consistency in the feature space instead of directly using a better pose estimator backbone.
Table 15.
Network            MSE(\(\downarrow\))   PSNR(\(\uparrow\))   SSIM(\(\uparrow\))   IS(\(\uparrow\))   FID(\(\downarrow\))
ResNet-18          49.3291    30.5388    0.6802    3.2811    20.7048
ResNet-50 [63]     44.3875    31.7714    0.7039    3.3255    18.1374
ResNet-101 [63]    42.8027    31.9208    0.7255    3.4039    16.8275
ResNet-152 [63]    41.5562    32.3248    0.7261    3.4415    16.4553
VGG-19 [9]         45.2043    31.5978    0.7083    3.3502    18.3029
Table 15. Ablation Studies of Different Pose Estimator Backbones on Subject1 from EDN-10k
Table 16.
Network            MSE(\(\downarrow\))   PSNR(\(\uparrow\))   SSIM(\(\uparrow\))   IS(\(\uparrow\))   FID(\(\downarrow\))
ResNet-18          23.3403    34.3123    0.8057    3.5564    10.2478
ResNet-50 [63]     21.3634    35.2125    0.8073    3.6187    9.9064
ResNet-101 [63]    20.8277    35.8716    0.8342    3.6571    9.5731
ResNet-152 [63]    20.5169    35.8920    0.8188    3.6933    9.4565
VGG-19 [9]         21.7328    34.8798    0.8040    3.6297    10.1062
Table 16. Ablation Studies of Different Pose Estimator Backbones on Liam from Mixamo-Pose

4.4 Qualitative Results

We further highlighted the superiority of our proposed approach by showing and contrasting the visualizations of synthesized results of all models on EDN-10k and Mixamo-Pose, respectively. In addition, we presented the visualizations of the ablation comparisons for Subject1 on EDN-10k to indicate the effectiveness of components of DFC-Net.

4.4.1 Visual Comparisons on EDN-10k.

We also provide visualization results on EDN-10k in Figure 3 and cropped close-ups with more details in Figure 4. DFC-Net synthesized real-person images with better quality, which clearly indicates the effectiveness of our method as well. For instance:
Fig. 3. Visualizations of pose transfer with different characters and various poses on EDN-10k. The columns from left to right show target persons, source persons, results of different methods, and ground truth, respectively.
Fig. 4. Visualizations of Subject1 from EDN-10k with more details. The image in the leftmost column is the source image and the ground truth. We do not show the target image (i.e., the same person with a different pose) due to space limitations, and we crop the results to the areas surrounded by the red dashed lines.
As shown in the second row of Figure 3, the difference between the source pose and the target pose is very large, and this posture involves changes in almost all parts of the human body. In this case, DFC-Net generates the correct human pose for Subject2, whereas methods such as LWG, MKN, and FOMM can only synthesize distorted poses. Another example is Figure 4, where the synthesized images of our method contain more details, including a clear face, wrinkles on the clothes, and so on. The pose transfer image of Subject1 generated by DFC-Net has a clearer facial structure, and the face orientation is consistent with the target image, whereas MKN and FOMM fail to synthesize the correct face orientation and hand poses.
Sometimes NN generates images that are closer to the ground truths, e.g., the image of Subject3 in the third row of Figure 3, since the number of training images increases from 1,488 in Mixamo-Pose to 10,000 in EDN-10k. However, one can still easily see that the poses of the generated images differ from the ground truths.
The results produced by Pix2Pix and EDN were blurrier, and there was an aliasing effect on the edges of the subject in the Pix2Pix result, although EDN produced better images than Pix2Pix. In the third row of Figure 3, the image of Subject3 synthesized by Pix2Pix contains considerable noise in the subject's hair. In contrast, the result of EDN is smoother and more realistic, but it still lacks many details compared with DFC-Net; in particular, the face remains distorted.

4.4.2 Visual Comparisons on Mixamo-Pose.

As depicted in Figure 5, we employed multiple target persons and source persons with various poses that differ from those in the training set of Mixamo-Pose. The first column shows the target person image, and the second shows the source person image. Note that only our method, DFC-Net, takes both the target person and source person images as input; CycleGAN, Pix2Pix, and EDN only take the skeleton image extracted from the source person image as input. The last column shows real images of the target person performing the desired motion, which serve as ground truth. We illustrate the details as follows:
Fig. 5. Visualizations of motion transfer with different characters and various motions on Mixamo-Pose. The columns from left to right show target persons, source persons, results of different methods, and ground truth, respectively.
Firstly, compared to the aforementioned baselines, DFC-Net offers better transfer results and alleviates their drawbacks in synthesizing body parts, clothes, and faces, to name a few. For instance, only DFC-Net generates the left hand of Liam in the first and second rows, while the other baselines produce only the arm. When the target person and source person are different, e.g., in rows 2–8, DFC-Net can still successfully disentangle the pose information of the source person from the static information and synthesize high-quality results. In contrast, MKN and FOMM are unable to fully strip out the pose information, so their generated results are implausible; e.g., in the fourth row, the person in the FOMM result faces away from the camera.
Meanwhile, although Pix2Pix and EDN can synthesize the target persons with the desired poses, some body parts (e.g., arms, hands) generated by Pix2Pix are severely corrupted. Images produced by EDN have dull colors and many noisy pixels on the clothes and faces, and EDN fails to capture the details of the shoes in the third row, since the white shoes are easily confused with the background.
Moreover, even though the NN results are sharp and not blurry, NN always produces incorrect pose transfers since it can only retrieve poses that exist in the training set. Besides, NN cannot synthesize images with temporal coherence when the input is a sequence of desired motions rather than a single image.
CycleGAN could not produce poses similar to the ground truths: almost all poses were distorted, and the quality of the corresponding images was consistently degraded.

4.4.3 Visual Ablation Studies on Subject1.

We present the visual ablation results on Subject1 in Figure 6. Each component contributes a different degree of improvement to the results, e.g., a clearer head and more accurate hand gestures. Specifically:
Fig. 6. Visualizations of Subject1 from EDN-10k. The columns from left to right show target persons, source persons, results of the six different settings in the ablation studies, and ground truth, respectively. The column indices are the same as in Table 14.
The results produced by setting 1, which does not include the Keypoint Amplifier, contain more noise than those of the other settings, especially near the head area. Compared to the baseline setting 1, setting 2 with the Keypoint Amplifier, which denoises the keypoint heatmaps (an illustrative sketch of such denoising is given after this list), significantly improves the quality of the generated images. For instance, the results of setting 1 in the first and second rows fail to synthesize a complete head and hands, while the corresponding results of setting 2 fix these errors.
As shown in the column of setting 7, DFC-Net with all components synthesizes images with more accurate details and better quality. In the second row of the results produced by setting 5, the generated person lacks an arm, while setting 7, with the augmented consistency loss from the support set, completes the arm.
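For reference, the snippet below gives an illustrative sketch of the kind of heatmap denoising performed by the Keypoint Amplifier: weak responses are suppressed and each keypoint channel is re-normalized. The threshold value and the normalization scheme are assumptions for illustration, not the exact operation used in DFC-Net.

```python
# Illustrative heatmap denoising (assumed threshold and per-channel re-normalization).
import torch

def amplify_keypoints(heatmaps: torch.Tensor, threshold: float = 0.1) -> torch.Tensor:
    """heatmaps: (B, K, H, W) keypoint confidence maps in [0, 1]."""
    # Suppress weak, noisy responses far from the keypoint peaks.
    cleaned = torch.where(heatmaps > threshold, heatmaps, torch.zeros_like(heatmaps))
    # Re-scale each channel so its peak is restored to 1, amplifying the remaining signal.
    peaks = cleaned.amax(dim=(-2, -1), keepdim=True).clamp(min=1e-6)
    return cleaned / peaks
```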

5 Conclusion

In this article, we proposed DFC-Net, a novel network with disentangled feature consistency for human pose transfer. We introduced two disentangled feature consistency losses to enforce the pose and static information to be consistent between the synthesized and real images. In addition, we leveraged a keypoint amplifier to denoise the keypoint heatmaps, which makes pose feature extraction easier. Moreover, we showed that the support set Mixamo-Sup, which contains different subjects with unseen poses, can boost pose transfer performance and enhance the robustness of the model. To enable accurate evaluation of pose transfer between different people, we collected an animation character dataset, Mixamo-Pose. Results on both the animation and real-image datasets, Mixamo-Pose and EDN-10k, consistently demonstrated the effectiveness of the proposed model.

References

[1]
Kfir Aberman, Rundi Wu, Dani Lischinski, Baoquan Chen, and Daniel Cohen-Or. 2019. Learning character-agnostic motion for motion retargeting in 2D. ACM Transactions on Graphics (2019).
[2]
Adobe Systems Inc. 2018. Retrieved December 27, 2018 from https://www.mixamo.com
[3]
Rıza Alp Güler, Natalia Neverova, and Iasonas Kokkinos. 2018. Densepose: Dense human pose estimation in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
[4]
Guha Balakrishnan, Amy Zhao, Adrian V Dalca, Fredo Durand, and John Guttag. 2018. Synthesizing images of humans in unseen poses. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
[5]
Blender Online Community. 2018. Blender - a 3D Modelling and Rendering Package. Blender Foundation, Stichting Blender Foundation, Amsterdam. Retrieved from http://www.blender.org
[6]
Konstantinos Bousmalis, Nathan Silberman, David Dohan, Dumitru Erhan, and Dilip Krishnan. 2017. Unsupervised pixel-level domain adaptation with generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
[7]
G. Bradski. 2000. The OpenCV library. Dr. Dobb’s Journal of Software Tools (2000).
[8]
Z. Cao, G. Hidalgo Martinez, T. Simon, S. Wei, and Y. A. Sheikh. 2019. OpenPose: Realtime multi-person 2D pose estimation using part affinity fields. IEEE Transactions on Pattern Analysis and Machine Intelligence (2019).
[9]
Zhe Cao, Tomas Simon, Shih-En Wei, and Yaser Sheikh. 2017. Realtime multi-person 2d pose estimation using part affinity fields. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
[10]
Caroline Chan, Shiry Ginosar, Tinghui Zhou, and Alexei A. Efros. 2019. Everybody dance now. In Proceedings of the IEEE/CVF International Conference on Computer Vision.
[11]
Baoyu Chen, Yi Zhang, Hongchen Tan, Baocai Yin, and Xiuping Liu. 2021. PMAN: Progressive multi-attention network for human pose transfer. IEEE Transactions on Circuits and Systems for Video Technology (2021).
[12]
Changxing Ding and Dacheng Tao. 2016. A comprehensive survey on pose-invariant face recognition. ACM Transactions on Intelligent Systems and Technology (2016).
[13]
Haoye Dong, Xiaodan Liang, Ke Gong, Hanjiang Lai, Jia Zhu, and Jian Yin. 2018. Soft-gated warping-gan for pose-guided person image synthesis. In Proceedings of the Advances in Neural Information Processing Systems.
[14]
Patrick Esser, Ekaterina Sutter, and Björn Ommer. 2018. A variational u-net for conditional appearance and shape generation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
[15]
Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial nets. In Proceedings of the Advances in Neural Information Processing Systems.
[16]
Artur Grigorev, Artem Sevastopolsky, Alexander Vakhitov, and Victor Lempitsky. 2019. Coordinate-based texture inpainting for pose-guided human image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[17]
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
[18]
Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. 2017. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in Neural Information Processing Systems (2017).
[19]
Geoffrey E. Hinton and Ruslan R. Salakhutdinov. 2006. Reducing the dimensionality of data with neural networks. Science (2006).
[20]
Jonathan Ho, Ajay Jain, and Pieter Abbeel. 2020. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems 33 (2020), 6840–6851.
[21]
Yedid Hoshen and Lior Wolf. 2018. Identifying analogies across domains. In Proceedings of the International Conference on Learning Representations.
[22]
Xun Huang, Ming-Yu Liu, Serge Belongie, and Jan Kautz. 2018. Multimodal unsupervised image-to-image translation. In Proceedings of the European Conference on Computer Vision.
[23]
Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. 2017. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
[24]
Max Jaderberg, Karen Simonyan, Andrew Zisserman, et al. 2015. Spatial transformer networks. In Proceedings of the Advances in Neural Information Processing Systems.
[25]
Wei Jiang, Weiwei Sun, Andrea Tagliasacchi, Eduard Trulls, and Kwang Moo Yi. 2019. Linearized multi-sampling for differentiable image transformation. In Proceedings of the IEEE/CVF International Conference on Computer Vision.
[26]
Justin Johnson, Alexandre Alahi, and Li Fei-Fei. 2016. Perceptual losses for real-time style transfer and super-resolution. In Computer Vision – ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part II.
[27]
Tao Li, Zhiyuan Liang, Sanyuan Zhao, Jiahao Gong, and Jianbing Shen. 2020. Self-learning with rectification strategy for human parsing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[28]
Yining Li, Chen Huang, and Chen Change Loy. 2019. Dense intrinsic appearance flow for human pose transfer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[29]
Chen-Hsuan Lin and Simon Lucey. 2017. Inverse compositional spatial transformer networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
[30]
Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. 2014. Microsoft coco: Common objects in context. In Computer Vision – ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6–12, 2014, Proceedings, Part V.
[31]
Lingjie Liu, Weipeng Xu, Michael Zollhoefer, Hyeongwoo Kim, Florian Bernard, Marc Habermann, Wenping Wang, and Christian Theobalt. 2019. Neural rendering and reenactment of human actor videos. ACM Transactions on Graphics (2019).
[32]
Ming-Yu Liu, Thomas Breuel, and Jan Kautz. 2017. Unsupervised image-to-image translation networks. In Proceedings of the Advances in Neural Information Processing Systems.
[33]
Wenhe Liu, Xiaojun Chang, Ling Chen, Dinh Phung, Xiaoqin Zhang, Yi Yang, and Alexander G. Hauptmann. 2020. Pair-based uncertainty and diversity promoting early active learning for person re-identification. ACM Transactions on Intelligent Systems and Technology (2020).
[34]
Wen Liu, Zhixin Piao, Jie Min, Wenhan Luo, Lin Ma, and Shenghua Gao. 2019. Liquid warping GAN: A unified framework for human motion imitation, appearance transfer and novel view synthesis. In Proceedings of the IEEE/CVF International Conference on Computer Vision.
[35]
Ziwei Liu, Ping Luo, Shi Qiu, Xiaogang Wang, and Xiaoou Tang. 2016. Deepfashion: Powering robust clothes recognition and retrieval with rich annotations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
[36]
Matthew Loper, Naureen Mahmood, Javier Romero, Gerard Pons-Moll, and Michael J. Black. 2015. SMPL: A skinned multi-person linear model. ACM Transactions on Graphics (2015).
[37]
Liqian Ma, Xu Jia, Qianru Sun, Bernt Schiele, Tinne Tuytelaars, and Luc Van Gool. 2017. Pose guided person image generation. In Proceedings of the Advances in Neural Information Processing Systems.
[38]
Liqian Ma, Qianru Sun, Stamatios Georgoulis, Luc Van Gool, Bernt Schiele, and Mario Fritz. 2018. Disentangled person image generation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
[39]
Mehdi Mirza and Simon Osindero. 2014. Conditional generative adversarial nets. arXiv:1411.1784. Retrieved from https://arxiv.org/abs/1411.1784
[40]
Thomas B. Moeslund, Adrian Hilton, and Volker Krüger. 2006. A survey of advances in vision-based human motion capture and analysis. Computer Vision and Image Understanding (2006).
[41]
Natalia Neverova, Riza Alp Guler, and Iasonas Kokkinos. 2018. Dense pose transfer. In Proceedings of the European Conference on Computer Vision.
[42]
Siyuan Qi, Wenguan Wang, Baoxiong Jia, Jianbing Shen, and Song-Chun Zhu. 2018. Learning human-object interactions by graph parsing neural networks. In Proceedings of the European Conference on Computer Vision.
[43]
Yurui Ren, Xiaoming Yu, Junming Chen, Thomas H. Li, and Ge Li. 2020. Deep image spatial transformation for person image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[44]
Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. 2016. Improved techniques for training gans. Advances in Neural Information Processing Systems (2016).
[45]
Ruizhi Shao, Zerong Zheng, Hongwen Zhang, Jingxiang Sun, and Yebin Liu. 2022. Diffustereo: High quality human reconstruction via diffusion-based stereo using sparse cameras. In Proceedings of the European Conference on Computer Vision. Springer.
[46]
Jianbing Shen, Yuanpei Liu, Xingping Dong, Xiankai Lu, Fahad Shahbaz Khan, and Steven Hoi. 2021. Distilled siamese networks for visual tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence (2021).
[47]
Chenyang Si, Wei Wang, Liang Wang, and Tieniu Tan. 2018. Multistage adversarial losses for pose-based human image synthesis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
[48]
Aliaksandr Siarohin, Stéphane Lathuilière, Sergey Tulyakov, Elisa Ricci, and Nicu Sebe. 2019. Animating arbitrary objects via deep motion transfer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[49]
Aliaksandr Siarohin, Stéphane Lathuilière, Sergey Tulyakov, Elisa Ricci, and Nicu Sebe. 2019. First order motion model for image animation. In Proceedings of the Advances in Neural Information Processing Systems.
[50]
Aliaksandr Siarohin, Enver Sangineto, Stéphane Lathuilière, and Nicu Sebe. 2018. Deformable gans for pose-based human image generation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
[51]
Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556. Retrieved from https://arxiv.org/abs/1409.1556
[52]
Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556. Retrieved from https://arxiv.org/abs/1409.1556
[53]
Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. 2016. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
[54]
Hao Tang, Dan Xu, Gaowen Liu, Wei Wang, Nicu Sebe, and Yan Yan. 2019. Cycle in cycle generative adversarial networks for keypoint-guided image generation. In Proceedings of the 27th ACM International Conference on Multimedia.
[55]
Jiajie Tian, Qihao Tang, Rui Li, Zhu Teng, Baopeng Zhang, and Jianping Fan. 2021. A camera identity-guided distribution consistency method for unsupervised multi-target domain person re-identification. ACM Transactions on Intelligent Systems and Technology 12, 4 (2021), 1–18.
[56]
Hsiao-Yu Tung, Hsiao-Wei Tung, Ersin Yumer, and Katerina Fragkiadaki. 2017. Self-supervised learning of motion capture. In Proceedings of the Advances in Neural Information Processing Systems.
[57]
Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Guilin Liu, Andrew Tao, Jan Kautz, and Bryan Catanzaro. 2018. Video-to-video synthesis. In Proceedings of the Advances in Neural Information Processing Systems.
[58]
Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Andrew Tao, Jan Kautz, and Bryan Catanzaro. 2018. High-resolution image synthesis and semantic manipulation with conditional gans. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
[59]
Wenguan Wang, Jianbing Shen, Xiankai Lu, Steven C. H. Hoi, and Haibin Ling. 2020. Paying attention to video object pattern understanding. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020).
[60]
Wenguan Wang, Tianfei Zhou, Siyuan Qi, Jianbing Shen, and Song-Chun Zhu. 2021. Hierarchical human semantic parsing with comprehensive part-relation modeling. IEEE Transactions on Pattern Analysis and Machine Intelligence (2021).
[61]
Zhou Wang, Alan C. Bovik, Hamid R. Sheikh, Eero P. Simoncelli, et al. 2004. Image quality assessment: From error visibility to structural similarity. IEEE Transactions on Image Processing (2004).
[62]
Shihong Xia, Lin Gao, Yu-Kun Lai, Ming-Ze Yuan, and Jinxiang Chai. 2017. A survey on human performance capture and animation. Journal of Computer Science and Technology (2017).
[63]
Bin Xiao, Haiping Wu, and Yichen Wei. 2018. Simple baselines for human pose estimation and tracking. In Proceedings of the European Conference on Computer Vision.
[64]
Zili Yi, Hao Zhang, Ping Tan, and Minglun Gong. 2017. Dualgan: Unsupervised dual learning for image-to-image translation. In Proceedings of the IEEE International Conference on Computer Vision.
[65]
Haoyang Zhang and Xuming He. 2017. Deep free-form deformation network for object-mask registration. In Proceedings of the IEEE International Conference on Computer Vision.
[66]
Mingyuan Zhang, Zhongang Cai, Liang Pan, Fangzhou Hong, Xinying Guo, Lei Yang, and Ziwei Liu. 2022. Motiondiffuse: Text-driven human motion generation with diffusion model. arXiv:2208.15001. Retrieved from https://arxiv.org/abs/2208.15001
[67]
Zongji Zhao, Sanyuan Zhao, and Jianbing Shen. 2021. Real-time and light-weighted unsupervised video object segmentation network. Pattern Recognition (2021).
[68]
Haitian Zheng, Lele Chen, Chenliang Xu, and Jiebo Luo. 2019. Unsupervised pose flow learning for pose guided synthesis. arXiv:1909.13819. Retrieved from https://arxiv.org/abs/1909.13819
[69]
Liang Zheng, Liyue Shen, Lu Tian, Shengjin Wang, Jingdong Wang, and Qi Tian. 2015. Scalable person re-identification: A benchmark. In Proceedings of the IEEE International Conference on Computer Vision.
[70]
Tao Zhou, Huazhu Fu, Chen Gong, Ling Shao, Fatih Porikli, Haibin Ling, and Jianbing Shen. 2022. Consistency and diversity induced human motion segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2022).
[71]
Tianfei Zhou, Siyuan Qi, Wenguan Wang, Jianbing Shen, and Song-Chun Zhu. 2021. Cascaded parsing of human-object interaction recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence (2021).
[72]
Tianfei Zhou, Wenguan Wang, Siyuan Qi, Haibin Ling, and Jianbing Shen. 2020. Cascaded human-object interaction recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[73]
Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A. Efros. 2017. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision.
[74]
Zhen Zhu, Tengteng Huang, Baoguang Shi, Miao Yu, Bofei Wang, and Xiang Bai. 2019. Progressive pose attention transfer for person image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
