3.1 Overview and Notations
For the 2D image-based VTON task, the target in-shop clothing item is different from the source clothing item worn by the given person. Specifically, given a person image \(I \ {\in } \ {\mathbb {R}^{3 \times h \times w}}\) and an in-shop clothing image \(c \ {\in } \ {\mathbb {R}^{3 \times h \times w}}\), our goal is to generate an image \(I_{o} \ {\in } \ {\mathbb {R}^{3 \times h \times w}}\) in which the person in \(I\) wears the clothing item \(c\). Hence, we first need to suppress the influence of the source clothing, such as its color, texture, and shape. Meanwhile, it is also necessary to preserve as much information about the given person as possible, including the person's face, hair, body shape, and pose. To this end, we adopt the same pipeline as [22] to obtain the person representation \(p\) from \(I\). It contains three components: an 18-channel feature map for the human pose, a 1-channel feature map for the body shape, and a 3-channel RGB image. Note that the RGB image contains only the reserved regions of the person (i.e., face, hair, and lower body) so as to maintain the person's identity.
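To make the composition of \(p\) concrete, the following PyTorch-style sketch assembles the 22-channel person representation from the three components above; the tensor contents and the \(256 \times 192\) resolution are illustrative assumptions rather than the authors' released code.

```python
import torch

# Illustrative resolution; h and w follow the paper's notation.
h, w = 256, 192

pose_map   = torch.zeros(18, h, w)  # 18-channel pose heatmaps, one per keypoint
body_shape = torch.zeros(1, h, w)   # 1-channel body-shape mask
reserved   = torch.zeros(3, h, w)   # RGB of reserved regions (face, hair, lower body)

# Cloth-agnostic person representation p: 18 + 1 + 3 = 22 channels.
p = torch.cat([pose_map, body_shape, reserved], dim=0)
assert p.shape == (22, h, w)
```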
The proposed CIT follows a two-stage pipeline (i.e., the geometric matching stage and the try-on stage in Figure 1), which is also adopted by CP-VTON [55] and CP-VTON+ [37]. In particular, the former takes as input the cloth-agnostic person representation
p and an in-shop clothing item
c to produce a warped cloth
\(\hat{c}\) and a warped mask
\(\hat{c_{m}}\) based on the given person’s pose and shape. The latter uses the warped cloth
\(\hat{c}\) , the corresponding warped mask
\(\hat{c_{m}}\) together with the person representation
p to generate the final person image with the worn in-shop cloth. In the first geometric matching stage, we propose a CIT matching block (Block-I, see the upper part in Figure
2 for the details), which takes the person feature
\(X_{p}\) and the in-shop cloth feature
\(X_{c}\) as inputs, where \(X_{p}\) and \(X_{c}\) are produced from \(p\) and \(c\) by two similar feature extractors, respectively (see the first geometric matching stage in Figure 1). The block then produces a correlation feature \(X_{out-I}\), which is followed by a down-sampling layer that regresses the parameters \(\theta\). Note that
\(\theta\) is used for warping the original in-shop clothing
c to the target on-body style
\(\hat{c}\) via an interpolation method, namely the thin-plate spline (TPS) warping module [43]. Specifically, given two images with corresponding control points at different positions, TPS interpolation aligns the control points of one image (i.e., in the in-shop clothing item) with those of the other (i.e., in the corresponding human body region) in a geometry estimation manner (i.e., local descriptor extraction, descriptor matching, and transformation parameter estimation) [22].
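As a rough, self-contained illustration of TPS interpolation itself (not the learnable module discussed next), the following PyTorch sketch solves the classical TPS linear system for a set of control-point correspondences and resamples the clothing image accordingly; the function names and the use of normalized \([-1, 1]\) coordinates are our own assumptions.

```python
import torch
import torch.nn.functional as F

def tps_coeffs(ctrl_dst, ctrl_src):
    """Solve for TPS coefficients mapping ctrl_dst -> ctrl_src.
    ctrl_*: (K, 2) points (x, y) in [-1, 1] normalized coordinates."""
    K = ctrl_dst.shape[0]
    d2 = torch.cdist(ctrl_dst, ctrl_dst).pow(2)          # squared pairwise distances
    U = d2 * torch.log(d2 + 1e-9)                        # radial basis U(r) = r^2 log r^2
    P = torch.cat([torch.ones(K, 1), ctrl_dst], dim=1)   # (K, 3) affine part [1, x, y]
    # Assemble the standard TPS linear system.
    A = torch.zeros(K + 3, K + 3)
    A[:K, :K] = U
    A[:K, K:] = P
    A[K:, :K] = P.t()
    b = torch.zeros(K + 3, 2)
    b[:K] = ctrl_src
    return torch.linalg.solve(A, b)                      # (K + 3, 2) coefficients

def tps_warp(img, ctrl_dst, ctrl_src):
    """Warp img (B, C, H, W) so that ctrl_src locations land on ctrl_dst."""
    B, C, H, W = img.shape
    theta = tps_coeffs(ctrl_dst, ctrl_src)
    ys, xs = torch.meshgrid(
        torch.linspace(-1, 1, H), torch.linspace(-1, 1, W), indexing="ij")
    grid = torch.stack([xs, ys], dim=-1).reshape(-1, 2)  # (H*W, 2) target coords
    d2 = torch.cdist(grid, ctrl_dst).pow(2)
    U = d2 * torch.log(d2 + 1e-9)                        # (H*W, K) basis values
    P = torch.cat([torch.ones(grid.shape[0], 1), grid], dim=1)
    src = U @ theta[:-3] + P @ theta[-3:]                # sampling locations (H*W, 2)
    src = src.reshape(1, H, W, 2).expand(B, -1, -1, -1)
    return F.grid_sample(img, src, align_corners=True)
```

In the learnable pipeline described next, these correspondences are not given explicitly; instead, the geometric matching stage regresses the transformation parameters \(\theta\).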
In addition, the TPS operation adopted in this article is the same as the one used in CP-VTON [55], which originates from [43]. It utilizes differentiable modules to mimic the geometry estimation procedure in a learnable manner, transforming \(c\) into \(\hat{c}\). Meanwhile, the corresponding mask \(\hat{c_{m}}\) of \(\hat{c}\) is also produced from \(\theta\) via the TPS warping operation. In the second stage, we utilize the warped cloth
\(\hat{c}\) and the warped mask
\(\hat{c_{m}}\) together with the person representation
p as inputs of the CIT reasoning block (Block-II, see the bottom part in Figure
2 for the details). The output \(X_{out-II}\) of the CIT reasoning block is then used to guide the final mask composition, generating more realistic try-on results.
3.2 Interactive Transformer
By leveraging the self-attention mechanism, the Transformer is capable of modeling long-range dependencies. Given this inherent ability, we propose the Interactive Transformer to explore the correlation between the person and the clothing item in the VTON task. There are two types of Interactive Transformers in the proposed CIT. The first version, i.e., Interactive Transformer I, is employed in the first geometric matching stage. The second version, i.e., Interactive Transformer II, is utilized in the second try-on stage. Both are built from basic Transformer encoders and cross-modal Transformer encoders, and their detailed structure is depicted in Figure
2.
Regarding a standard Transformer encoder, a positional embedding is initially added to the input feature as elucidated in [
54]. This helps preserve the initial spatial relations of the input. After the positional embedding, the input feature is projected into queries
\(Q_{m}\) , keys
\(K_{m}\) , and values
\(V_{m}\) by a linear layer. Subsequently, the output of the attention layer
\(A_{m}\) is derived as
\[ A_{m} = \mathrm{softmax}\left(\frac{Q_{m}K_{m}^{\top }}{\sqrt {d}}\right)V_{m}, \]
where \(d\) is the dimension of the queries and keys.
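The computation above can be sketched in a few lines of PyTorch, assuming a single attention head and an input to which the positional embedding has already been added; the class name is illustrative.

```python
import math
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    """Single-head scaled dot-product self-attention."""
    def __init__(self, dim):
        super().__init__()
        self.to_qkv = nn.Linear(dim, 3 * dim)  # one linear layer yields Q_m, K_m, V_m
        self.d = dim

    def forward(self, x):  # x: (B, S, d), positional embedding already added
        q, k, v = self.to_qkv(x).chunk(3, dim=-1)
        attn = torch.softmax(q @ k.transpose(-2, -1) / math.sqrt(self.d), dim=-1)
        return attn @ v    # A_m: (B, S, d)
```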
The aforementioned self-attention mechanism is usually employed on a single type of input data. However, in the two-stage VTON task, accurately matching the person information with the cloth information requires several pairwise correlations that cannot be overlooked. Notably, in the geometric matching stage, we need to consider the correlation between the cloth-agnostic person representation \(p\) and the in-shop clothing item \(c\), since such a correlation is indispensable for producing a reasonable warped cloth \(\hat{c}\). In the second try-on stage, there are three types of inputs, i.e., \(p\), \(\hat{c}\), and \(\hat{c_{m}}\). Proficiently modeling the long-range correlation between each pair of them (i.e., \(p\) and \(\hat{c}\), \(p\) and \(\hat{c_{m}}\), as well as \(\hat{c}\) and \(\hat{c_{m}}\)) is also a crucial issue, since a well-captured correlation usually yields a good match between the person's body and the in-shop cloth.
Based on this observation, instead of using only the self-attention layer of a Transformer encoder to process a single-modal input, we propose a cross-modal Transformer encoder based on a cross-attention mechanism. Note that we treat each kind of input as a single-modal input, since each of them carries a specific type of information. For example,
p is for person identity,
c and
\(\hat{c}\) correspond to the texture, and
\(\hat{c_{m}}\) is related to the shape information. The cross-attention is computed as follows:
\[ A_{m1} = \mathrm{softmax}\left(\frac{Q_{m1}K_{m2}^{\top }}{\sqrt {d}}\right)V_{m2}, \]
where we adopt the first input (i.e., person representation
p) as query
\(Q_{m1}\) , and the second input (i.e., the in-shop clothing item
c) as the keys
\(K_{m2}\) and values
\(V_{m2}\). In this cross-interactive manner, each kind of input keeps updating its sequence with external information through the multi-head cross-attention module. As a result, one modality is transformed into a set of key/value pairs that interact with the other modality.
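The sketch below mirrors this computation in PyTorch; here a single head stands in for the paper's multi-head cross-attention module, and all names are illustrative.

```python
import math
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    """Queries come from modality 1; keys and values come from modality 2."""
    def __init__(self, dim):
        super().__init__()
        self.to_q = nn.Linear(dim, dim)       # Q_m1 from the first input
        self.to_kv = nn.Linear(dim, 2 * dim)  # K_m2, V_m2 from the second input
        self.d = dim

    def forward(self, x1, x2):  # x1: (B, S1, d) e.g. person, x2: (B, S2, d) e.g. cloth
        q = self.to_q(x1)
        k, v = self.to_kv(x2).chunk(2, dim=-1)
        attn = torch.softmax(q @ k.transpose(-2, -1) / math.sqrt(self.d), dim=-1)
        return attn @ v         # (B, S1, d): x1 updated with information from x2
```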
Interactive Transformer I is shown in the red dashed box in the upper area of Figure
2. It consists of two regular Transformer encoders (depicted in gray) and two cross-modal Transformer encoders (depicted in light blue) that are directly applied to feature maps. We use
\(selfTrans(\cdot)\) and
\(crossTrans(\cdot)\) to indicate the operators of these two kinds of Transformer encoders. The block takes two input features \(F_{p}\) and \(F_{c}\) of dimension \((C, B, S)\), which are reshaped from the input features \(X_{p}\) and \(X_{c}\) of original dimension \((B, C, H, W)\). Here
B,
C,
H,
W denote the batch size, the number of channels, the height, and the width of the input features
\(X_{p}\) and
\(X_{c}\), and \(S = H \times W\) denotes the spatial dimension. Each feature first goes through its corresponding \(N\)-layer regular Transformer encoder, after which we obtain the processed features
\(F_{p}^{^{\prime }}\) and
\(F_{c}^{^{\prime }}\):
\[ F_{p}^{^{\prime }} = selfTrans(F_{p}), \qquad F_{c}^{^{\prime }} = selfTrans(F_{c}). \]
Then the cross-modal Transformer encoder is used for modeling the cross-modal long-range correlations between
\(F_{p}^{^{\prime }}\) and
\(F_{c}^{^{\prime }}\):
\[ X_{cross}^{1} = \mathrm{concat}\big (crossTrans(F_{p}^{^{\prime }}, F_{c}^{^{\prime }}),\ crossTrans(F_{c}^{^{\prime }}, F_{p}^{^{\prime }})\big), \]
where \(crossTrans(F_{p}^{^{\prime }}, F_{c}^{^{\prime }})\) indicates that we utilize \(F_{c}^{^{\prime }}\) as the keys and values and \(F_{p}^{^{\prime }}\) as the queries, whereas \(crossTrans(F_{c}^{^{\prime }}, F_{p}^{^{\prime }})\) indicates that the keys and values come from \(F_{p}^{^{\prime }}\) and the queries come from \(F_{c}^{^{\prime }}\). Concatenating the outputs of the two cross-modal Transformer encoders yields the output \(X_{cross}^{1}\) of Interactive Transformer I, which strengthens the correlation matching ability.
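Assuming \(selfTrans(\cdot)\) and \(crossTrans(\cdot)\) are realized with standard PyTorch modules, Interactive Transformer I might be sketched as below; the hyper-parameters and the concatenation axis are illustrative choices, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class InteractiveTransformerI(nn.Module):
    """Two regular (self) encoders followed by two cross-modal encoders,
    whose outputs are concatenated into X_cross^1."""
    def __init__(self, dim=192, heads=4, N=2):
        super().__init__()
        enc = lambda: nn.TransformerEncoder(nn.TransformerEncoderLayer(dim, heads), N)
        self.self_p, self.self_c = enc(), enc()            # selfTrans for F_p and F_c
        self.cross_pc = nn.MultiheadAttention(dim, heads)  # crossTrans(F_p', F_c')
        self.cross_cp = nn.MultiheadAttention(dim, heads)  # crossTrans(F_c', F_p')

    def forward(self, f_p, f_c):  # sequence-first layout, embedding size = dim
        fp, fc = self.self_p(f_p), self.self_c(f_c)        # F_p', F_c'
        x_pc, _ = self.cross_pc(fp, fc, fc)  # queries F_p'; keys/values F_c'
        x_cp, _ = self.cross_cp(fc, fp, fp)  # queries F_c'; keys/values F_p'
        return torch.cat([x_pc, x_cp], dim=-1)  # X_cross^1 (concat axis assumed)
```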
Interactive Transformer II is shown in the red dashed box in the bottom area of Figure 2. Similar to Interactive Transformer I, it is also constructed by combining regular Transformer encoders and cross-modal Transformer encoders.
Interactive Transformer II is designed mainly to explore the correlations between every pair of its three inputs (i.e., \(p\), \(\hat{c}\), and \(\hat{c_m}\)). In particular, we adopt 3 regular Transformer encoders and 6 cross-modal Transformer encoders to construct it. Note that, for better illustration, we depict \(X_{p}\), \(X_{\hat{c}}\), and \(X_{\hat{c_m}}\) and their corresponding information flows in yellow, green, and blue, respectively.
Within Interactive Transformer II, there are three input features, i.e.,
\(X_p\) ,
\(X_{\hat{c}}\) , and
\(X_{\hat{c_{m}}}\) . Each of them works as the Query element within its own branch while working as the Key and Value elements in the other two branches. Specifically, we take the feature
\(X_p\) (depicted in yellow in Figure
2), which comes from the person representation, as a detailed example. Once \(X_p\) is obtained from the 1D convolutional layer outside the red dashed box, there are two pathways for it to pass through. The first one directly feeds it into two cross-modal Transformer encoders (i.e., the green-border cross-modal Transformer encoder between
\(X_{\hat{c}}^{^{\prime }}\) and
\(X_p\) , as well as the blue-border cross-modal Transformer encoder between
\(X_{\hat{c_{m}}}^{^{\prime }}\) and
\(X_p\)). The other pathway lets
\(X_p\) pass through a regular Transformer encoder for producing the updated feature
\(X_{p}^{^{\prime }}\) . Note that here
\((X_{\hat{c}}^{^{\prime }} {\rightarrow } X_{p})\) within the green-border cross-modal Transformer encoder means we utilize
\(X_{p}\) as Query and
\(X_{\hat{c}}^{^{\prime }}\) as Key and Value, while
\((X_{\hat{c_m}}^{^{\prime }} {\rightarrow } X_{p})\) within the blue-border cross-modal Transformer encoder indicates we use
\(X_{p}\) as Query and
\(X_{\hat{c_m}}^{^{\prime }}\) as Key and Value.
\(X_{\hat{c}}^{^{\prime }}\) and
\(X_{\hat{c_m}}^{^{\prime }}\) are the updated features from
\(X_{\hat{c}}\) and
\(X_{\hat{c_m}}\) after their corresponding regular Transformer encoders. We formulate the procedure of the first yellow branch as follows:
\[ X_{p}^{^{\prime }} = selfTrans(X_{p}), \qquad X_{p}^{cross} = \mathrm{concat}\big (crossTrans(X_{\hat{c}}^{^{\prime }}, X_{p}),\ crossTrans(X_{\hat{c_m}}^{^{\prime }}, X_{p})\big). \]
Similarly, we also get the output of the middle green branch
\(X_{\hat{c}}^{cross}\) and the output of the bottom blue branch
\(X_{\hat{c_m}}^{cross}\). Finally, the overall output of Interactive Transformer II is
\[ X_{out-II} = \mathrm{concat}\big (X_{p}^{cross},\ X_{\hat{c}}^{cross},\ X_{\hat{c_m}}^{cross}\big). \]
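Under the same assumptions as before, the three-branch structure of Interactive Transformer II can be sketched as follows; sizes and concatenation axes are again illustrative, and \((src \rightarrow q)\) in the comments follows the arrow notation used above.

```python
import torch
import torch.nn as nn

class InteractiveTransformerII(nn.Module):
    """Three regular encoders plus six cross-modal encoders (two per branch)."""
    def __init__(self, dim=192, heads=4, N=2):
        super().__init__()
        enc = lambda: nn.TransformerEncoder(nn.TransformerEncoderLayer(dim, heads), N)
        self.self_p, self.self_c, self.self_m = enc(), enc(), enc()
        cross = lambda: nn.MultiheadAttention(dim, heads)
        self.c_to_p, self.m_to_p = cross(), cross()  # (X_c' -> X_p), (X_cm' -> X_p)
        self.p_to_c, self.m_to_c = cross(), cross()  # (X_p' -> X_c), (X_cm' -> X_c)
        self.p_to_m, self.c_to_m = cross(), cross()  # (X_p' -> X_cm), (X_c' -> X_cm)

    def branch(self, q, src1, src2, cross1, cross2):
        a, _ = cross1(q, src1, src1)  # q is the Query; src1 gives Key/Value
        b, _ = cross2(q, src2, src2)  # q is the Query; src2 gives Key/Value
        return torch.cat([a, b], dim=-1)

    def forward(self, x_p, x_c, x_m):  # sequence-first layout for each input
        xp, xc, xm = self.self_p(x_p), self.self_c(x_c), self.self_m(x_m)
        out_p = self.branch(x_p, xc, xm, self.c_to_p, self.m_to_p)  # X_p^cross
        out_c = self.branch(x_c, xp, xm, self.p_to_c, self.m_to_c)  # X_c^cross
        out_m = self.branch(x_m, xp, xc, self.p_to_m, self.c_to_m)  # X_cm^cross
        return torch.cat([out_p, out_c, out_m], dim=-1)             # X_out-II
```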
3.3 CIT Matching Block
Based on our Interactive Transformer I, we propose the CIT Matching block (Block-I) to boost the performance of the TPS transformation by strengthening the long-range correlation between
\(X_{p} \ {\in } \ {\mathbb {R}^{B{\times }C{\times }H{\times }W}}\) and \(X_{c} \ {\in } \ {\mathbb {R}^{B{\times }C{\times }H{\times }W}}\). Here
B,
C,
H, and
W indicate batch size, channel number, and the height and width of a given feature. To utilize the Transformer encoder for modeling long-range dependencies, we first adjust the dimensions of
\(X_p\) and
\(X_c\) from
\((B, C, H, W)\) to
\((B, C, S)\), forming
\(X_p^{^{\prime }}\) and
\(X_c^{^{\prime }}\) . Note that
\(S=H \times W\). In addition, a 1D convolutional layer is adopted so that each element of each input sequence gains sufficient awareness of its neighboring elements. After obtaining \(F_p\) and \(F_c\) from the convolutional layers, the proposed Interactive Transformer I is applied to
\(F_p\) and
\(F_c\) to capture the long-range correlation between the person-related and in-shop-cloth-related features. As a result, we obtain the intermediate result \(X_{cross}^{1}\) of the proposed CIT matching block. These procedures are depicted in Figure 2 with detailed annotations.
Instead of directly adding this long-range relation to features
\(X_{p}\) or
\(X_{c}\) , we strengthen each of them by a global strengthened attention
\(X_{att}\) operation as follows:
\[ X_{(.)}^{global} = X_{(.)} \times X_{att}. \]
Here
\(\times\) means an element-wise multiplication,
\((.)\) indicates that both features
\(X_{p}\) and
\(X_{c}\) follow the same form. Note that
\(X_{att}\) is produced from
\(X_{cross}^{1}\) by a linear projection and a sigmoid activation. Through this operation, the positional relations of the elements within each input are activated by the sigmoid function. In particular, when \(X_{att}\) is applied to an input feature as attention, both the position information of each element within that input and the correlation between the two inputs are kept in a balanced manner. Then a matrix multiplication between
\(X_{p}^{global}\) and
\(X_{c}^{global}\) is conducted. The output
\(X_{out-I}\) of the proposed CIT matching block, which represents the improved correlation between the person and clothing features, is finally obtained after a reshape operation. These procedures can be defined as follows:
\[ X_{out-I} = \mathrm{reshape}\big ((X_{p}^{global})^{\top } \, X_{c}^{global}\big). \]
Here
\(X_{p}^{global}\) and
\(X_{c}^{global}\) have the same dimension
\((B, C, S)\) , and the output
\(X_{out-I}\) is in dimension
\((B, S, H, W)\) .
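A sketch of the tail of Block-I is given below, under the assumptions that \(X_{cross}^{1}\) concatenates the two cross outputs along the last axis (hence the \(2S \rightarrow S\) projection) and that a single \(X_{att}\) gates both features; these are our reading of the description, not verified implementation details.

```python
import torch
import torch.nn as nn

class MatchingBlockTail(nn.Module):
    """X_att gates X_p and X_c; a batch matrix multiplication plus a reshape
    then yields X_out-I."""
    def __init__(self, S):
        super().__init__()
        self.att = nn.Sequential(nn.Linear(2 * S, S), nn.Sigmoid())

    def forward(self, x_p, x_c, x_cross, H, W):  # x_p, x_c: (B, C, S); S = H * W
        x_att = self.att(x_cross)                # (B, C, S), values in (0, 1)
        xp_g = x_p * x_att                       # X_p^global: element-wise gating
        xc_g = x_c * x_att                       # X_c^global
        out = torch.bmm(xp_g.transpose(1, 2), xc_g)   # (B, S, S) correlation
        return out.reshape(out.size(0), -1, H, W)     # X_out-I: (B, S, H, W)
```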
3.4 CIT Reasoning Block
Previous works (i.e., CP-VTON [
55] and CP-VTON+ [
37]) first concatenate the person information
p, the warped cloth information
\(\hat{c}\) , and the warped clothing mask
\(\hat{c_m}\). The concatenated result is then directly sent to a single UNet model to generate a composition mask
\(M_{o}\) as well as a rendered person image
\(I_{R}\). However, such a rough concatenation may lead to coarse information matching; consequently, it is difficult to achieve a well-matched final try-on result.
To this end, we propose the CIT Reasoning block (Block-II) depicted in Figure
2, aiming to model the more complicated correlations among
p,
\(\hat{c}\) , and
\(\hat{c_{m}}\). First, we apply the patch embedding operation [13] to all three inputs. Then each of them goes through a 1D convolutional layer to ensure that each element is modeled together with its neighboring elements. After that, we get
\(X_p\) ,
\(X_{\hat{c}}\) , and
\(X_{\hat{c_{m}}}\) . To well capture the complicated long-range correlations among these features, we apply the proposed Interactive Transformer II to
\(X_p\) ,
\(X_{\hat{c}}\) , as well as
\(X_{\hat{c_{m}}}\) . Then the output
\(X_{out-II}\) of Interactive Transformer II is utilized to guide the final mask composition for a better generation as follows:
\[ M_{o} = \mathrm{sigmoid}(X_{out-II}), \qquad I_{o} = M_{o} \odot \hat{c} + (1 - M_{o}) \odot I_{R}, \]
where \(\mathrm{sigmoid}\) indicates the Sigmoid activation function and \(\odot\) denotes element-wise multiplication.
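The final composition can be sketched as follows, assuming \(X_{out-II}\) has already been projected to one mask logit per pixel; the function name is hypothetical.

```python
import torch

def compose_tryon(mask_logits, warped_cloth, rendered_person):
    """Blend the warped cloth and the rendered person with the composition mask."""
    m_o = torch.sigmoid(mask_logits)  # composition mask M_o, values in (0, 1)
    return m_o * warped_cloth + (1.0 - m_o) * rendered_person  # I_o
```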
3.5 Optimization Objectives
The first stage of CIT is trained with sample triplets
\((p, c, c_m)\) , while the second stage is trained with
\((p, \hat{c}, \hat{c_m})\) . In addition, in the first matching stage, we adopt the same optimization objectives as CP-VTON+ [
37]:
\[ \mathcal {L}_{stage1} = \lambda _{1}\, \mathcal {L}_{1}(\hat{c}, c_{t}) + \lambda _{reg}\, \mathcal {L}_{reg}, \]
where
\(\mathcal {L}_{1}\) indicates the pixel-wise L1 loss between the warped result
\(\hat{c}\) and the ground truth
\(c_{t}\) .
\(\mathcal {L}_{reg}\) indicates the grid regularization loss, which can be formalized as follows:
\[ \mathcal {L}_{reg} = \sum _{x} \sum _{y} \Big | |G_{x}(x+1, y) - G_{x}(x, y)| - |G_{x}(x, y) - G_{x}(x-1, y)| \Big | + \Big | |G_{y}(x, y+1) - G_{y}(x, y)| - |G_{y}(x, y) - G_{y}(x, y-1)| \Big |, \]
where
\(G_x\) and
\(G_y\) indicate the grid coordinates of the generated images along the
x and the
y directions.
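A short PyTorch sketch of a second-order grid regularization of this form is given below; it assumes a \((B, H, W, 2)\) sampling grid and is an illustration rather than the exact CP-VTON+ [37] implementation.

```python
import torch

def grid_regularization_loss(grid):
    """Penalize abrupt changes between consecutive grid intervals.
    grid: (B, H, W, 2) TPS sampling grid with channels (G_x, G_y)."""
    gx, gy = grid[..., 0], grid[..., 1]
    dx = (gx[:, :, 1:] - gx[:, :, :-1]).abs()  # interval lengths along x
    dy = (gy[:, 1:, :] - gy[:, :-1, :]).abs()  # interval lengths along y
    loss_x = (dx[:, :, 1:] - dx[:, :, :-1]).abs().sum()
    loss_y = (dy[:, 1:, :] - dy[:, :-1, :]).abs().sum()
    return loss_x + loss_y
```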
In the second stage, the optimization objective is as follows:
\[ \mathcal {L}_{stage2} = \lambda _{L1}\, \Vert I_{o} - I_{GT}\Vert _{1} + \lambda _{vgg}\, \mathcal {L}_{vgg}(I_{o}, I_{GT}) + \lambda _{mask}\, \Vert M_{o} - c_{tm}\Vert _{1}. \]
The first item aims at minimizing the discrepancy between the output
\(I_{o}\) and the ground truth
\(I_{GT}\) . The second item, the VGG perceptual loss [
27], is widely used in image generation tasks. It is an alternative to pixel-wise losses and aims to be closer to the perceptual similarity judgments of human beings. The VGG loss is based on the ReLU activation layers of the pre-trained 19-layer VGG network. It can be expressed as follows:
\[ \mathcal {L}_{vgg} = \frac{1}{W_{i,j} H_{i,j}} \sum _{x=1}^{W_{i,j}} \sum _{y=1}^{H_{i,j}} \big (\phi _{i,j}(I_{GT})_{x,y} - \phi _{i,j}(I_{o})_{x,y}\big)^{2}, \]
where
\(W_{i,j}\) and
\(H_{i,j}\) describe the dimensions of the respective feature maps within the VGG network.
\(\phi _{i,j}\) indicates the feature map obtained by the
jth convolution before the
ith max pooling layer within the VGG19 network. The third item is used to encourage the composition mask
\(M_{o}\) to select the most suitable warped clothing mask
\(c_{tm}\) as much as possible.
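For reference, a compact PyTorch sketch of such a VGG19 perceptual loss is shown below; the chosen ReLU layer indices and the absence of per-layer weights are illustrative simplifications rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn
from torchvision import models

class VGGPerceptualLoss(nn.Module):
    """Compare VGG19 feature maps phi_{i,j} of the output and the ground truth."""
    def __init__(self, layer_ids=(1, 6, 11, 20, 29)):  # relu1_1 ... relu5_1
        super().__init__()
        self.vgg = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1).features.eval()
        self.layer_ids = set(layer_ids)
        for param in self.vgg.parameters():
            param.requires_grad = False  # the loss network stays frozen

    def forward(self, output, target):
        loss, x, y = 0.0, output, target
        for i, layer in enumerate(self.vgg):
            x, y = layer(x), layer(y)
            if i in self.layer_ids:
                loss = loss + nn.functional.mse_loss(x, y)
        return loss
```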