License: arXiv.org perpetual non-exclusive license
arXiv:2404.04474v1 [cs.CV] 06 Apr 2024

RoNet: Rotation-oriented Continuous Image Translation

Yi Li*, Xin Xie, Lina Lei, Haiyan Fu, Yanqing Guo
Corresponding author: liyi@dlut.edu.cn
Other authors: shelsin@mail.dlut.edu.cn, leilina@mail.dlut.edu.cn, fuhy@dlut.edu.cn, guoyq@dlut.edu.cn
Abstract

The generation of smooth and continuous images between domains has recently drawn much attention in image-to-image (I2I) translation. Most existing approaches rest on a linear assumption, which is applied to different elements including features, models or labels. However, the linear assumption becomes harder to satisfy as the element dimension increases, and it requires both ends of the line to be available. In this paper, we propose a novel rotation-oriented solution and model the continuous generation with an in-plane rotation over the style representation of an image, yielding a network named RoNet. A rotation module is embedded in the generation network to automatically learn the proper plane while disentangling the content and the style of an image. To encourage realistic texture, we also design a patch-based semantic style loss that learns the different styles of similar objects in different domains. We conduct experiments on forest scenes (where the complex texture makes the generation very challenging), faces, streetscapes and the iphone2dslr task. The results validate the superiority of our method in terms of visual quality and continuity.

Index Terms:
Image-to-image translation (I2I), continuous generation, style representation.
Figure 1: The turning wheel of four seasons generated by RoNet with the single input (on the right labeled with the red dot).
Figure 2: High-definition results of RoNet. Images in one row are generated from a single source image by setting different rotation angles $\theta$. More results are presented in Sec. IV.

I Introduction

Image-to-image (I2I) translation [1], also known as image translation, learns to map an image from the source domain to an image in the target domain. In the process, an image is roughly decomposed (explicitly or implicitly) into two components: the content (which is domain-invariant) is expected to remain the same after translation, while the style (which is domain-variant) refers to what changes during translation. Depending on the concrete task, I2I translation can benefit a wide range of applications including portrait animation [2], photo enhancement [3, 4, 5], painting style transfer [6, 7, 8], domain adaptation [9, 10, 11] and face synthesis [12, 13]. Despite the impressive progress in recent years, it remains challenging to obtain smooth and continuous translation results.

Recently, some approaches have explored linear interpolation to realize continuous translation. The linear assumption may be imposed on the learned features [14, 15], the trained models [16] or the domain labels [17, 18]. The linear manifold assumption is intuitive but has limitations in at least three aspects. (1) Elements (feature/model/label) of both the source domain and the target domain are necessary to realize the linear interpolation. Taking the model interpolation in [16] as an example, one has to train multiple models to handle multi-domain translation. (2) The elements (feature/model/label) tend to have high dimensions. As the dimension increases, the relationship between two domains becomes harder to fit with the linear assumption, especially when there is a wide gap between the two domains. Even for labels, there are complex scenarios such as cyclic translations, e.g., the turning wheel of seasons presented in Fig. 1, generated by RoNet. (3) Suppose there are a source representation $S_{src}$ and target representations $S_{tgt1}$ and $S_{tgt2}$ from two different domains. The fused representation $S^{inp}_{src \rightarrow tgt2}$ is usually defined as $\alpha S_{src} + (1-\alpha) S_{tgt2}$.
As illustrated in Fig.3 (a), the typical intermediate representation interpolation will sacrifice the expression ability of the fused element in the multi-domain image translation. For example, we cannot obtain the image with autumn style by the representation interpolation from the spring domain to winter domain.
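
The contrast between interpolation and rotation can be illustrated with a small numerical sketch. The layout below, with the four seasons placed as 2-D unit vectors 90 degrees apart, is an assumption made purely for illustration and is not the learned representation of RoNet.

```python
import numpy as np

def interpolate(s_src, s_tgt, alpha):
    """Linear fusion alpha * S_src + (1 - alpha) * S_tgt from the text."""
    return alpha * s_src + (1.0 - alpha) * s_tgt

def rotate_2d(vec, theta):
    """Rotate a 2-D vector counter-clockwise by theta radians."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]]) @ vec

# Assumed toy layout: seasons as unit vectors 90 degrees apart.
spring = np.array([1.0, 0.0])
autumn = np.array([-1.0, 0.0])
winter = np.array([0.0, -1.0])

alphas = np.linspace(0.0, 1.0, 11)
fused = np.stack([interpolate(spring, winter, a) for a in alphas])
fused_norms = np.linalg.norm(fused, axis=1)  # dips below 1 between the two ends
rotated = rotate_2d(spring, np.pi)           # half turn, magnitude kept
```

In this toy setting every interpolated vector stays on the segment between spring and winter, so it never points toward autumn and its magnitude shrinks in the middle, whereas the rotated vector reaches autumn with unit magnitude.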

Based on this analysis, this paper proposes to realize continuous translation with an in-plane rotation of the style representation. Different from the intuitive linear interpolation, we assume that the domains are distributed on a circle in a super-plane and learn the super-plane automatically. We accordingly propose RoNet, which is built on a generation network that explicitly disentangles the content and the style of an image. The method has the following advantages. It is capable of multi-domain generation with a single input image because the domain relationship is embedded in the rotation angle. Although a cyclic manifold is assumed, the rotation plane is learned automatically in an end-to-end manner, making the network suitable not only for periodic translation such as season shifting but also for general translation tasks such as $real\;face \to comic\;portrait$ and $iphone \to dslr$. In the rotation model, the domain style is captured by the vector direction. When translating from one domain to another, we merely change the vector direction to that of the target domain while keeping the magnitude unchanged, as presented in Fig. 3 (b), which preserves the expression ability after translation. More specifically, we can generate continuous images with the styles of the four seasons by rotating the representation.

We conduct experiments on various translation tasks including season shifting in forests, $real\;face \to comic\;portrait$, solar day shifting of streetscapes and $iphone \to dslr$. Among them, the translation of the forest scene is quite challenging due to the extremely complex texture of trees. The results of existing methods deteriorate considerably on this task, as shown in Sec. IV. To address this, we design a patch-based semantic style loss that focuses on the style nuances in the matched patches across domains. Compared with other approaches, RoNet produces the most realistic results on various tasks and achieves superior quantitative performance.

Our main contributions are summarized as follows.

  • To achieve continuous I2I translation, we propose a novel rotation-oriented mechanism which embeds the style representation into a plane and utilizes the rotated representation to guide the generation. RoNet is accordingly implemented to learn the rotation plane automatically while disentangling the content and the style of an image simultaneously.

  • To produce realistic visual effects on challenging textures like trees in forests, we design a patch-based semantic style loss. It first matches the patches from different domains and then learns the style difference with high pertinency.

  • Experiments on various translation scenarios are conducted, including season shifting in forests, $real\;face \to comic\;portrait$, solar day shifting of streetscapes and $iphone \to dslr$. With the guidance of the rotation, RoNet successfully generates realistic as well as continuous translation results with a single input image.

II Related Work

The seminal work on image-to-image translation [1] demonstrated excellent performance as soon as it was proposed. Building on it, several methods [19, 20, 21, 22, 23, 24, 25, 26, 27] have been designed to improve the results further, in three main ways as follows.

II-A Disentangled Representations

Disentanglement is defined as the act of releasing from a snarled or tangled condition, and it is a common tool for extracting high-dimensional features in latent space. Many unsupervised methods [6, 4, 28, 29, 30] utilize disentanglement to capture the content and style features via encoders, achieving excellent image-to-image translation. In addition, multi-domain image translation [19, 20] is accomplished by building relative transformations between style representations in different domains, or by feeding the style encoder extra information about the domain label in a semi-supervised training scheme. With the guidance of style features, image synthesis [21, 22, 23] becomes controllable. With the development of neural networks, generated images have become vivid enough to fool the discriminator in GANs [31] or even some industry experts. However, it is hard for most existing methods to achieve ideal disentanglement, which may cause semantic flipping [27]. Recent works [32, 33] exploit disentanglement for few-shot generalization. Besides, disentanglement is frequently used in image editing. There are often latent directions that control the change of one attribute in the image and thus carry a semantic interpretation. Prior works [34, 35, 36] explore the disentanglement of latent dimensions for image editing by training a linear classifier, discovering meaningful directions for semantic editing and achieving remarkable results. Voynov and Babenko [37] propose to learn a candidate matrix and a classifier such that the semantic directions in the matrix can be properly recognized by the classifier. The unsupervised methods GANSpace [38] and SeFa [39] perform PCA on the sampled data and on the weights of the generative model, respectively, to separate primary semantic directions in the latent space, generating realistic images with target attribute directions. Similarly, ArtIns [40] takes advantage of the FastICA [41, 42] algorithm to obtain independent style component vectors in the style feature space, enabling artwork editing.

Figure 3: Visualized difference between vector rotation and interpolation.

II-B Style Transfer

The ultimate goal of style transfer is to generate plausible artworks that preserve the content of the photograph and take on the style of the painting simultaneously. The seminal work of Gatys et al. [43] first achieved stylization. During the iterative optimization process, the content and style features are fused by calculating the Gram matrix for loss constraints. Similarly, some works [44, 45, 46] flexibly combine the content and style of arbitrary images through iterative optimization, which is time-consuming. For resource-saving and faster stylization, later works [47, 48, 49, 50, 51, 52] turn to convolutional neural networks (CNNs) and utilize a feed-forward pass to improve the efficiency of stylization. Moreover, transferring multiple styles to one content image [53] is achieved by conditional instance normalization, generating excellent stylized results and breaking the limitation of learning one specific style. Recently, arbitrary style transfer methods have received more attention for facilitating efficient applications. WCT [54] achieves universal style transfer with two transformation steps, whitening and coloring. Huang et al. [55] propose AdaIN, normalizing the mean and variance of each feature map separately, to adaptively combine the content and style. Owing to its convenience, a large number of image generation tasks [56, 19, 57] adopt AdaIN as the first choice to fuse the content and style representations. Based on CNNs, Jing et al. [58] extend the arbitrary style transfer task by introducing dynamic instance normalization. Avatar-Net [59] utilizes a U-Net [60] to semantically align the content and the style features. Linear [61] learns a linear transformation according to the content and style features. With the development of attention mechanisms [62, 63], existing methods [24, 25, 26] utilize the encoder-transfer-decoder architecture to generate higher-quality artworks. Photorealistic image stylization can be considered a special kind of image-to-image translation. Recent works [64, 7] design deep-learning approaches to faithfully transfer the reference style of natural scenes.

Figure 4: Schematic of rotating a style vector from $\vec{S_1}$ to $\vec{S_2}$. First, $\vec{S_1}$ is mapped onto the rotation plane to obtain $\vec{P_1}$, with $\vec{P_1} + \vec{R} = \vec{S_1}$. Second, $\vec{P_1}$ is rotated to $\vec{P_2}$ in the rotation plane. Finally, $\vec{S_2} = \vec{P_2} + \vec{R}$.

II-C Continuous Image Translation

For continuous image variation, feature interpolation [14, 15, 65] is a common practice. DRIT [6] and MUNIT [4] perform continuous interpolation between two style features, but the generated images belong to the same domain. StarGAN v2 [19] and SMIT [66] mix disentangled style representations, resulting in impressive continuous I2I translation. Besides, continuity can be achieved by model parameter interpolation between two domains [16]. Several methods [17, 18, 67, 68] generate a continuous sequence of images between two domains by utilizing intermediate domain labels. GANimation [69] adopts a conditional GAN framework [1], enabling the continuous generation of examples by feeding continuous rather than discrete labels at inference time. RelGAN [70] introduces an interpolation loss for intermediate states. In addition, rich latent information is contained in the underlying dimensions. Chen et al. [5] propose a framework for unpaired I2I translation, generating natural and gradually changing intermediate results by latent space interpolation. CoMoGAN [71] relies on naive physics-inspired models to guide the training, learning continuous translations in latent space. However, it is complicated to obtain the related physics function for model guidance [71] in different domains. Moreover, linear interpolation [72, 73] is not always valid (e.g., spring to winter includes summer and autumn, and day to night includes dusk).

Figure 5: Overview of RoNet. The source image $I_{src}$ is disentangled into the content representation and the style representation by $E_c$ and $E_s$. Under alternate training of the style vector, the style representation of $I_{src}$ is rotated from the source domain to the target domain under the guidance of $I_{tgt}$. To further encourage realistic texture, we design a patch-based semantic style loss.

III Method

Let $\{\mathcal{I}^{i}\}_{i=1}^{N} \in \mathcal{I}$ be the image sets of $N$ different domains, and $\{y_i\}_{i=1}^{N} \in \mathcal{Y}$ be their corresponding domain labels. Given an image $\mathbf{I} \in \mathcal{I}$, the goal of RoNet is to generate continuous results across domains that accord with the cyclic manifold, under the guidance of style vector rotation. Since the style indicates the changes during translation, we use the style representation as the quantity to be rotated. Note that the plane is not manually specified but learned along with the whole network; the details are given below.

III-A How to Rotate?

It is challenging to imagine the rotation of high-dimensional vectors because of its spatial complexity. From Euler's rotation theorem we know that any rotation can be expressed as a single rotation with respect to some axis [74]. To discuss the rotation process further, we define the $n$-dimensional source vector as $S_1 \in R^n$ and let $\theta$ be the angle to rotate. We first project $S_1$ onto the 2-dimensional rotation plane $W$, which is the span of two orthogonal unit vectors $m, n \in R^n$:

$$P_1 = (S_1 \cdot m)\,m + (S_1 \cdot n)\,n \tag{1}$$

where $P_1$ is the projection of $S_1$ in $W$. Then we can obtain the rest of $S_1$:

$$R = S_1 - P_1 \tag{2}$$

where $R$ is the component of $S_1$ that is orthogonal to the plane $W$. Hence, it is unchanged by the rotation in $W$. Next, we achieve the high-dimensional vector rotation by rotating $P_1$ in $W$ as in Equation 3, and then mapping the result $P_2$ back to the space in which $S_1$ lies, as described in Equation 4.

$$
\begin{aligned}
P_2 &= Rot_{W,\theta}(P_1) \\
&= [(S_1 \cdot m)\cos\theta - (S_1 \cdot n)\sin\theta]\, m + [(S_1 \cdot n)\cos\theta + (S_1 \cdot m)\sin\theta]\, n
\end{aligned} \tag{3}
$$

$$
\begin{aligned}
S_2 &= P_2 + R \\
&= S_1 + \begin{bmatrix} m & n \end{bmatrix} \begin{bmatrix} (\cos\theta - 1) & -\sin\theta \\ \sin\theta & (\cos\theta - 1) \end{bmatrix} \begin{bmatrix} S_1 \cdot m \\ S_1 \cdot n \end{bmatrix}
\end{aligned} \tag{4}
$$
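
As a short check that this rotation keeps the magnitude of the style vector, the following derivation uses only Equations 1 to 4 and the orthonormality of $m$ and $n$. The shorthand $a = S_1 \cdot m$ and $b = S_1 \cdot n$ is introduced here for brevity and is not part of the original notation.

```latex
\begin{aligned}
\|P_2\|^2 &= (a\cos\theta - b\sin\theta)^2 + (b\cos\theta + a\sin\theta)^2
           = a^2 + b^2 = \|P_1\|^2, \\
\|S_2\|^2 &= \|P_2\|^2 + \|R\|^2 \qquad (\text{since } R \perp W \text{ and } P_2 \in W) \\
          &= \|P_1\|^2 + \|R\|^2 = \|S_1\|^2 .
\end{aligned}
```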

The whole rotation process is shown in Figure 4. In other words, the vector $S_2$ obtained by rotating $S_1$ by $\theta$ equals the projection $P_1$ of $S_1$ rotated by $\theta$ in $W$, plus $(S_1 - P_1)$. In this work, our target is to discover the rotation plane such that the rotated style vectors are cyclic. We therefore set two learnable vectors $(\mu, \nu) \in R^n$ to model the rotation plane. To enforce the orthogonal unit-vector assumption, we adopt Gram-Schmidt orthogonalization to obtain the rotation plane:

$$W = (m, n) = GramSchmidt(\mu, \nu). \tag{5}$$
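
The rotation in Equations 1 to 5 can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: in RoNet the vectors $\mu$ and $\nu$ are learned, whereas here they are random, and the names `gram_schmidt`, `rotate_in_plane` and `dim` are hypothetical (`dim` replaces the dimension $n$ to avoid clashing with the basis vector $n$).

```python
import numpy as np

def gram_schmidt(mu, nu):
    """Eq. (5): orthonormalize (mu, nu) into the plane basis (m, n)."""
    m = mu / np.linalg.norm(mu)
    nu_perp = nu - (nu @ m) * m          # remove the part of nu along m
    n = nu_perp / np.linalg.norm(nu_perp)
    return m, n

def rotate_in_plane(s1, m, n, theta):
    """Rotate s1 by theta inside span{m, n}; the orthogonal remainder is untouched."""
    a, b = s1 @ m, s1 @ n
    p1 = a * m + b * n                                  # Eq. (1)
    r = s1 - p1                                         # Eq. (2)
    c, s = np.cos(theta), np.sin(theta)
    p2 = (a * c - b * s) * m + (b * c + a * s) * n      # Eq. (3)
    return p2 + r                                       # Eq. (4)

rng = np.random.default_rng(0)
dim = 8                                                 # hypothetical style dimension
mu, nu = rng.normal(size=dim), rng.normal(size=dim)     # random stand-ins for learned vectors
m, n = gram_schmidt(mu, nu)
s1 = rng.normal(size=dim)
s_quarter = rotate_in_plane(s1, m, n, np.pi / 2)
s_full = rotate_in_plane(s1, m, n, 2 * np.pi)
```

In this sketch a full turn returns the original vector and every rotated vector has the same norm as `s1`, which is the cyclic, magnitude-preserving behaviour the text relies on.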

III-B RoNet

An overview of the proposed RoNet is presented in Figure 5. RoNet has four essential subnets: the content encoder $E_c$, the style encoder $E_s$, the rotation module and the generator $G$. Considering that it is time- and labour-consuming to obtain continuous training data, we employ images from distinct key domains for learning, making the method weakly supervised. In each round of training, we feed the network with a pair of images $(I_{src}, I_{tgt})$ from the source domain and the target domain, respectively. For instance, the source image $I_{src}$ in Figure 5 is from the Summer domain, while the target image $I_{tgt}$ is from the Autumn domain.

Given an image, we use the content encoder $E_c$ to extract the domain-invariant representation $C$, and the style encoder $E_s$ to extract the domain-variant representation $S$. In a typical exemplar-based I2I translation approach, $I_{src}$ provides the content, which should be kept the same in the result, and $I_{tgt}$ specifies the target style of the generated image, i.e., $I_{mix} = G(C_{src}, S_{tgt}) = G(E_c(I_{src}), E_s(I_{tgt}))$. However, as introduced above, RoNet is in charge of both information disentanglement and learning the proper plane for continuous rotation. Hence we embed a rotation module in the network to find the plane in Equation 5. Concretely, we first rotate the style vector of the source image $S_{src}$ to the target domain, and then use the rotated vector $S_{rot}$ to generate the image in the target domain, $I_{mix} = G(C_{src}, S_{rot})$. By alternate training and imposing constraints between $S_{rot}$ and $S_{tgt}$, we learn the rotation plane along with the disentanglement in an end-to-end manner.

To extract the content representation and prevent semantic flipping, we adopt normalized content features and segmentation pseudo-labels [27] of $I_{src}$ and $I_{mix}$ for better visual quality during generation. Different from existing point-to-point translation methods, which usually use face or animal images in experiments, we apply RoNet to multi-domain scene generation. Generation targets such as faces have clear structures, but a scene image tends to consist of stuff without fixed shapes, e.g., sky and grass. Moreover, after several rounds of downsampling, the content representation can no longer provide the sketch information of such stuff. Thus we complement the generator with content features, which can be easily obtained by a VGG encoder [75]. On the other hand, the source and target domains often have a large semantic mismatch, which can corrupt the source content. Therefore, hypervectors are obtained by Vector Symbolic Architectures (VSA) [27] to constrain the semantic information of the principal components between $I_{src}$ and $I_{mix}$.

III-C Loss Optimization

This section introduces how to build loss functions to optimize the model.

III-C1 Adversarial Objective

During training, the generator takes the content features $C$ and the style features $S$ as inputs and learns to generate realistic images via an adversarial loss:

$$\mathcal{L}_{adv} = \mathbb{E}\left[\log D_{y_{src}}(I_{src})\right] + \mathbb{E}\left[\log\left(1 - D_{y_{tgt}}(G(C_{src}, S_{mix}))\right)\right] \tag{6}$$

where $D_y(\cdot)$ denotes the output of the discriminator $D$ corresponding to domain $y$. Notably, the style feature $S_{mix}$ is either $S_{rot}$ or $S_{tgt}$. Concretely, we feed $S_{rot}$ and $S_{tgt}$ alternately to the generator to better harmonize the style encoder $E_s$ and the rotation plane $W$ during training. To some extent, this training strategy helps discover a better cyclic manifold and saves memory.
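
A minimal sketch of Equation 6 is given below, assuming the discriminator outputs are already available as probabilities in (0, 1). The function names, the clipping constant `eps` and the even/odd alternation schedule are assumptions made for illustration; the text only states that $S_{rot}$ and $S_{tgt}$ are fed alternately.

```python
import numpy as np

def adversarial_loss(d_real, d_fake, eps=1e-8):
    """Eq. (6): E[log D(I_src)] + E[log(1 - D(G(C_src, S_mix)))]."""
    d_real = np.clip(d_real, eps, 1.0 - eps)   # keep the logarithms finite
    d_fake = np.clip(d_fake, eps, 1.0 - eps)
    return np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake))

def pick_style(step, s_rot, s_tgt):
    """Choose S_mix for this step; the even/odd schedule is an assumption."""
    return s_rot if step % 2 == 0 else s_tgt

# Hypothetical discriminator outputs for a batch of four images.
d_real = np.array([0.9, 0.8, 0.95, 0.85])
d_fake = np.array([0.2, 0.1, 0.3, 0.15])
loss_value = adversarial_loss(d_real, d_fake)
undecided = adversarial_loss(np.array([0.5]), np.array([0.5]))
```

As in a standard GAN objective, the discriminator would try to increase this value while the generator tries to decrease it; a discriminator that outputs 0.5 everywhere gives $2\log 0.5$.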

III-C2 Content Preservation

For better content representation, we use a pre-trained VGG network [75] to extract content feature maps for computing the content perceptual loss as follows:

$$\mathcal{L}_{con} = \mathbb{E}\left[\sum_{i}^{cl} \left\| \varphi_i(I_{mix}) - \varphi_i(I_{src}) \right\|_2\right] \tag{7}$$

where $\varphi_{i}(\cdot)$ denotes the features extracted from the $i$-th layer of a pre-trained VGG network [75] and $\left\|\cdot\right\|_{2}$ denotes the Mean Square Error (MSE). Besides, $cl$ represents the layers {$conv4\_1$, $conv5\_1$}.
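As a rough illustration of Eq. (7), the sketch below sums a per-layer mean squared error over two sets of feature maps. It is not the authors' implementation: the feature maps are plain lists standing in for VGG activations at conv4_1 and conv5_1, and `mse` and `content_loss` are hypothetical helper names.

```python
from typing import Dict, List, Sequence

Feature = List[float]  # flattened feature map (stand-in for a VGG activation)


def mse(a: Feature, b: Feature) -> float:
    """Mean squared error between two flattened feature maps."""
    if len(a) != len(b):
        raise ValueError("feature maps must have the same size")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def content_loss(feats_mix: Dict[str, Feature],
                 feats_src: Dict[str, Feature],
                 layers: Sequence[str] = ("conv4_1", "conv5_1")) -> float:
    """Sum of per-layer MSE over the chosen layers, following Eq. (7)."""
    return sum(mse(feats_mix[name], feats_src[name]) for name in layers)


# Toy activations (assumed values); in practice they would come from a
# pre-trained VGG applied to I_src and I_mix.
feats_src = {"conv4_1": [0.0, 1.0, 2.0, 3.0], "conv5_1": [1.0, 1.0]}
feats_mix = {"conv4_1": [0.0, 1.0, 2.0, 5.0], "conv5_1": [1.0, 3.0]}
loss_con = content_loss(feats_mix, feats_src)
```

The expectation over the dataset in Eq. (7) would correspond to averaging this value over a mini-batch.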

Figure 6: Patch-based semantic style loss. For more realistic texture learning, a cosine similarity matrix is calculated for better patch matching according to the style features, which are encoded from uniformly sampled patches of the target image $I_{tgt}$ and randomly sampled patches of the generated image $I_{mix}$.
Figure 7: Continuous translation from Spring to Winter generated by RoNet.

III-C3 VSA-Based Semantic Consistency

Although the adversarial and content losses impose strong constraints on the generator, unaligned semantic information between domains results in semantic flipping. We therefore adapt a VSA-based loss [27] to preserve the principal objects of the source domain:

$$\mathcal{L}_{VSA}=\mathbb{E}\left[1-dist(V_{src},V_{mix})\right] \tag{8}$$

where $dist(\cdot,\cdot)$ is the cosine distance. $V_{src}$ and $V_{mix}$ are hypervectors that represent the semantic features of $I_{src}$ and $I_{mix}$ respectively, obtained by projecting image features with locality sensitive hashing (LSH) [76, 27].
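The following sketch illustrates the shape of Eq. (8) under two stated assumptions: $dist(\cdot,\cdot)$ is taken to be the cosine similarity, so that the loss vanishes when the two hypervectors are aligned, and the LSH projection is approximated by the sign of random Gaussian projections. Both choices, as well as the names `lsh_hypervector` and `vsa_loss`, are illustrative and may differ from the encoding used in [27].

```python
import math
import random
from typing import List

Vector = List[float]


def lsh_hypervector(features: Vector, planes: List[Vector]) -> Vector:
    """Random-projection LSH (assumed variant): keep only the sign (+1/-1)."""
    return [1.0 if sum(f * p for f, p in zip(features, plane)) >= 0 else -1.0
            for plane in planes]


def cosine_similarity(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def vsa_loss(v_src: Vector, v_mix: Vector) -> float:
    """1 - dist(V_src, V_mix), with dist assumed to be cosine similarity."""
    return 1.0 - cosine_similarity(v_src, v_mix)


rng = random.Random(0)
dim_in, dim_hyper = 8, 256  # toy sizes, not taken from the paper
planes = [[rng.gauss(0.0, 1.0) for _ in range(dim_in)] for _ in range(dim_hyper)]

feat_src = [rng.gauss(0.0, 1.0) for _ in range(dim_in)]
feat_same = [2.0 * f for f in feat_src]  # same direction, different scale
feat_flip = [-f for f in feat_src]       # opposite direction

v_src = lsh_hypervector(feat_src, planes)
loss_same = vsa_loss(v_src, lsh_hypervector(feat_same, planes))
loss_flip = vsa_loss(v_src, lsh_hypervector(feat_flip, planes))
```

In this toy setting, features that keep their direction give a loss near zero, while sign-flipped features give a loss near two, which is the behaviour a semantic-consistency term is meant to penalize.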

III-C4 Patch-based Semantic Matching

To help the style encoder $E_{s}$ capture realistic texture, we propose a patch-based semantic style loss. In detail, the target image $I_{tgt}$ is uniformly divided into 16 patches, while $N$ patches $B^{N}_{mix}$ are randomly sampled from the generated image $I_{mix}$. A cosine similarity matrix is then obtained by computing the cosine similarity between the patch-based latent representations of the two domains, which are extracted by a pre-trained VGG encoder [75]. According to this semantic similarity, we match $N$ patches $B^{N}_{tgt}$ from $I_{tgt}$ that possess semantics analogous to those of $B^{N}_{mix}$. Thus the same objects from different domains are matched for more accurate style learning. For example, the style of trees in $I_{mix}$ is learned from that of trees in $I_{tgt}$.
In this work, we employ a DINO-ViT model [77] (a Vision Transformer pre-trained in a self-supervised manner) to obtain the $[CLS]$ token, which contains the semantic style information of the image. We therefore encourage style capture by minimizing the $[CLS]$ token distances, as shown in Figure 6:

$$\mathcal{L}_{sty}=\mathbb{E}\left[\sum_{i}^{N}\left\|e^{L}_{[CLS]}(B^{i}_{mix})-e^{L}_{[CLS]}(B^{i}_{tgt})\right\|_{2}\right] \tag{9}$$

where $e^{L}_{[CLS]}$ denotes the last-layer $[CLS]$ token of DINO-ViT [77].
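The matching and the loss of Eq. (9) can be sketched as follows. This is a simplified illustration, not the authors' code: patch features and $[CLS]$ tokens are given as small hand-written vectors instead of VGG and DINO-ViT outputs, only three target patches are used instead of 16, and `match_patches` and `style_loss` are hypothetical names.

```python
import math
from typing import List

Vector = List[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def match_patches(mix_feats: List[Vector], tgt_feats: List[Vector]) -> List[int]:
    """For each generated patch, pick the most similar target patch
    (row-wise argmax of the cosine similarity matrix)."""
    matches = []
    for m in mix_feats:
        sims = [cosine_similarity(m, t) for t in tgt_feats]
        matches.append(max(range(len(sims)), key=sims.__getitem__))
    return matches


def style_loss(mix_tokens: List[Vector], tgt_tokens: List[Vector],
               matches: List[int]) -> float:
    """Sum over matched pairs of the MSE between [CLS] tokens, as in Eq. (9)."""
    total = 0.0
    for i, j in enumerate(matches):
        a, b = mix_tokens[i], tgt_tokens[j]
        total += sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)
    return total


# Toy patch features (stand-ins for VGG encodings) used only for matching.
tgt_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
mix_feats = [[0.1, 0.9], [2.0, 0.1]]
# Toy [CLS] tokens (stand-ins for DINO-ViT outputs) used only for the loss.
tgt_tokens = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
mix_tokens = [[1.0, 3.0], [0.0, 0.0]]

matches = match_patches(mix_feats, tgt_feats)
loss_sty = style_loss(mix_tokens, tgt_tokens, matches)
```

Separating the matching features from the tokens used in the loss mirrors the description above, where a VGG encoder drives the matching and DINO-ViT drives the style distance.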

III-C5 Style Alignment

To make the style features flow into the cyclic manifold as much as possible, we introduce a loss that constrains $S_{rot}$ and $S_{tgt}$, ensuring that the style features can be rotated from the source domain to the target domain:

$$\mathcal{L}_{mse}=\mathbb{E}\left[\left\|S_{rot}-S_{tgt}\right\|_{2}\right] \tag{10}$$

Notably, this loss term is only active when the input of the generator is $S_{rot}$, which helps the encoder and the rotation plane adapt to each other.
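A minimal sketch of this gating is given below. The even/odd schedule is an assumption, since the text only states that $S_{rot}$ and $S_{tgt}$ are fed alternately; `pick_style` and `style_alignment_loss` are hypothetical names, and the norm is read as MSE, consistent with the convention stated for Eq. (7).

```python
from typing import List, Tuple

Vector = List[float]


def style_alignment_loss(s_rot: Vector, s_tgt: Vector) -> float:
    """MSE between the rotated style and the target style, as in Eq. (10)."""
    return sum((r - t) ** 2 for r, t in zip(s_rot, s_tgt)) / len(s_rot)


def pick_style(step: int, s_rot: Vector, s_tgt: Vector) -> Tuple[Vector, float]:
    """Return (S_mix, L_mse term) for this step.

    Assumed schedule: even steps feed S_rot and enable the alignment loss,
    odd steps feed S_tgt and contribute no alignment loss.
    """
    if step % 2 == 0:
        return s_rot, style_alignment_loss(s_rot, s_tgt)
    return s_tgt, 0.0


s_rot = [1.0, 2.0]  # toy rotated style code
s_tgt = [1.0, 4.0]  # toy target style code
schedule = [pick_style(step, s_rot, s_tgt) for step in range(4)]
mse_terms = [term for _, term in schedule]
```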

III-C6 Image Reconstruction

To further guarantee that the generator $G$ learns to preserve the original domain-invariant characteristics, we build a reconstruction loss:

$$\mathcal{L}_{rec}=\mathbb{E}\left[\left\|G(C_{src},S_{src})-I_{src}\right\|_{1}\right] \tag{11}$$

Therefore, the network is trained by minimizing the loss function defined as:

$$\begin{aligned}\mathcal{L}=\;&\lambda_{adv}\mathcal{L}_{adv}+\lambda_{con}\mathcal{L}_{con}+\lambda_{VSA}\mathcal{L}_{VSA}\\&+\lambda_{rec}\mathcal{L}_{rec}+\lambda_{mse}\mathcal{L}_{mse}+\lambda_{sty}\mathcal{L}_{sty}\end{aligned} \tag{12}$$

where $\lambda_{adv}$, $\lambda_{con}$, $\lambda_{VSA}$, $\lambda_{rec}$, $\lambda_{mse}$ and $\lambda_{sty}$ are the hyper-parameters that balance the individual terms.

Figure 8: Comparison of continuous translation from Spring to Winter. The autumn column presents the reconstruction results of the input Spring image generated by different approaches. RoNet yields the best results with continuity across domains.

IV Experiments

IV-A Experimental Setting

IV-A1 Implementation Details

During training, we use the Adam solver [78] with $\beta_{1}=0.0$ and $\beta_{2}=0.99$, and a learning rate of 0.0001 for optimization. We empirically set the hyperparameters $\lambda_{adv}$, $\lambda_{con}$, $\lambda_{VSA}$, $\lambda_{rec}$, $\lambda_{mse}$ and $\lambda_{sty}$ to 1, 1, 3, 50, 150 and 1 respectively, to balance the losses. All experiments are conducted with Python 3.7.3 and PyTorch 1.7.1 on an Ubuntu 18.04 system with a single 32 GB Tesla V100 GPU. In the training stage, all images are randomly cropped to $512\times 512$, while in the inference stage, any image size is supported.
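For reference, the settings above can be collected into a small configuration and combined as in Eq. (12). The sketch is framework-agnostic; the dictionary layout, the `total_loss` helper and the individual loss values are illustrative assumptions, and only the weights and optimizer settings come from the text.

```python
from typing import Dict

# Optimizer settings as reported above (plain config, no framework call).
OPTIMIZER = {"name": "Adam", "lr": 1e-4, "betas": (0.0, 0.99)}

# Loss weights as reported above, keyed by the subscripts of Eq. (12).
LAMBDAS: Dict[str, float] = {
    "adv": 1.0, "con": 1.0, "VSA": 3.0,
    "rec": 50.0, "mse": 150.0, "sty": 1.0,
}


def total_loss(terms: Dict[str, float]) -> float:
    """Weighted sum of the six loss terms, following Eq. (12)."""
    missing = set(LAMBDAS) - set(terms)
    if missing:
        raise KeyError(f"missing loss terms: {sorted(missing)}")
    return sum(LAMBDAS[name] * terms[name] for name in LAMBDAS)


# Made-up per-term values for one training step, for illustration only.
example_terms = {"adv": 0.5, "con": 0.2, "VSA": 0.1,
                 "rec": 0.01, "mse": 0.002, "sty": 0.3}
loss_total = total_loss(example_terms)
```

With these made-up values each weighted term lands between 0.2 and 0.5, which illustrates how the large weights on $\mathcal{L}_{rec}$ and $\mathcal{L}_{mse}$ can compensate for terms that are numerically small.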

IV-A2 Datasets

Our proposed framework can be applied to different scenes, so the experiments are conducted on multiple datasets as follows.

  1. Season Album: For seasonal image translation, we collect 4000 training images and 1000 testing images for each season from flickr.com. The resolution of all seasonal images is greater than 1024, which keeps the details of seasonal characteristics such as the color of leaves and the texture of snow.

  2. Comic Face: We obtain a real and comic face dataset from kaggle.com, which is used for real face stylization. There are 1500 real identities and 1500 anime faces for training, and 500 images for testing.

  3. Waymo Open Dataset [79]: This dataset contains high-resolution sensor data collected by autonomous vehicles operated by the Waymo Driver in a wide variety of conditions. The scenes of the dataset are selected from both suburban and urban areas at different moments of the day. The dataset is released to support advances in machine perception and self-driving technology. For the timeshift task, we split clear images into four domains {day, dusk, night, dawn} according to the Waymo image labels, obtaining train / test sets of 27272 / 7682 images.

  4. Iphone2dslr Flowers [3]: The dataset is used to map iPhone images with a large depth of field to DSLR images with a shallow depth of field. There are 1812 / 569 iPhone images and 3325 / 480 DSLR images for training / testing.

IV-A3 Baselines

We compare our method with several state-of-the-art methods, listed below, which accomplish continuous image-to-image translation through different interpolation schemes.

  1. StarGAN v2 [19] is a state-of-the-art multi-domain translation network. It can map an input image to multiple defined domains with a single model. We train the model with the public code released by the authors and use its disentangled style code to obtain continuous effects with linear interpolation.

  2. DLOW [17] realizes continuous translation by generating a continuous sequence of intermediate labels between two domains. In other words, it is a method based on interpolated labels.

  3. DRIT [6] is able to generate diverse results within a certain domain, but is not directly suitable for multi-domain translation. Thus we train the model between every two domains and apply interpolation to the models to obtain the continuous results.

  4. Fast Photo Style [7] is a style transfer method based on disentanglement learning and can generate continuous transfer results via linear interpolation.

  5. VSAIT [27] is a paradigm for image-to-image translation that sets VSA-based constraints on adversarial learning; continuous image generation is achieved by applying interpolation to the models.

  6. CoMoGAN [71] achieves cyclic continuous translation with the guidance of physics-inspired models, but is limited in scenarios without decent physical models, such as season transition.

  7. DiffuseIT [80] is a score-based model that accomplishes image translation by introducing a loss function to control the diffusion process, but there are instances of content and style leakage in the generated results.
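Model interpolation, as applied to DRIT and VSAIT above, blends the parameters of two trained networks in the spirit of deep network interpolation [16]. The sketch below shows the idea on toy parameter dictionaries; the parameter names and the `interpolate_parameters` helper are hypothetical, and real models would hold tensors rather than lists.

```python
from typing import Dict, List

Params = Dict[str, List[float]]


def interpolate_parameters(params_a: Params, params_b: Params, alpha: float) -> Params:
    """Blend two parameter sets: theta = (1 - alpha) * theta_A + alpha * theta_B."""
    if params_a.keys() != params_b.keys():
        raise ValueError("both models must share the same parameter names")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return {
        name: [(1.0 - alpha) * a + alpha * b
               for a, b in zip(params_a[name], params_b[name])]
        for name in params_a
    }


# Toy parameters of two models trained for different target domains.
model_a = {"conv.weight": [0.0, 2.0], "conv.bias": [1.0]}
model_b = {"conv.weight": [4.0, 2.0], "conv.bias": [-1.0]}

# A sequence of intermediate models between the two endpoints.
blended = [interpolate_parameters(model_a, model_b, k / 4) for k in range(5)]
```

Note that this scheme needs both endpoint models, which is the limitation of linear interpolation that the rotation-based formulation is meant to avoid.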

IV-A4 Evaluation Metrics

We choose the following quantitative evaluation metrics to demonstrate the effectiveness of our proposed framework.

  1. Learned Perceptual Image Patch Similarity (LPIPS) is based on the VGG [75] and AlexNet [81] network architectures and evaluates the distance between image patches. A higher value means the patches are more different, while a lower value means they are more similar.

  2. Fréchet Inception Distance (FID) [82] compares the distribution of generated images with the distribution of a set of real images by calculating the distance between feature vectors of real and generated images. FID is the current standard metric for assessing the quality of generative models.

  3. Kernel Inception Distance (KID) [83] calculates the squared Maximum Mean Discrepancy (MMD) between the Inception representations of the real and generated images via a polynomial kernel. Similar to FID [82], lower values indicate closer distances between the distributions of generated and real data.
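To make the KID definition concrete, the sketch below computes the unbiased squared MMD with the cubic polynomial kernel $k(x,y)=(x^{\top}y/d+1)^{3}$ that is commonly used for KID. It is an illustration only: the two-dimensional vectors are stand-ins for Inception features, and the subset averaging performed by practical KID implementations is omitted.

```python
from typing import List

Vector = List[float]


def poly_kernel(x: Vector, y: Vector) -> float:
    """Cubic polynomial kernel k(x, y) = (x.y / d + 1)^3, d = feature size."""
    d = len(x)
    return (sum(a * b for a, b in zip(x, y)) / d + 1.0) ** 3


def mmd2_unbiased(xs: List[Vector], ys: List[Vector]) -> float:
    """Unbiased estimate of the squared MMD between two feature sets."""
    m, n = len(xs), len(ys)
    if m < 2 or n < 2:
        raise ValueError("each set needs at least two samples")
    k_xx = sum(poly_kernel(xs[i], xs[j])
               for i in range(m) for j in range(m) if i != j)
    k_yy = sum(poly_kernel(ys[i], ys[j])
               for i in range(n) for j in range(n) if i != j)
    k_xy = sum(poly_kernel(x, y) for x in xs for y in ys)
    return k_xx / (m * (m - 1)) + k_yy / (n * (n - 1)) - 2.0 * k_xy / (m * n)


# Toy "Inception features" of real and generated images (assumed values).
real_feats = [[0.0, 0.0], [0.0, 0.0]]
fake_feats = [[2.0, 0.0], [2.0, 0.0]]
kid_toy = mmd2_unbiased(real_feats, fake_feats)
```

Because the estimator is unbiased, it can take small negative values when the two sets are drawn from the same distribution.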

Figure 9: Visual results of ablation studies. The A$\sim$F models are capable of completing the seasonal cycle without the need for reference style images, whereas the G model requires reference style images due to the absence of the rotation module.
Figure 10: Interpolation Effectiveness. Given the spring image as the source and the winter as the target style, we should obtain the summer and autumn results when doing the interpolation from spring to winter.
Figure 11: High-resolution images generated by RoNet.
Figure 12: Comparison of continuous translation from Dawn to Night. The dusk column presents the reconstruction results of the input image generated by different approaches.
Figure 13: Continuous translation from Dawn to Night.

IV-B Season Shifting

IV-B1 Continuous Translation

Fig.7 presents the continuous translation results in high definition. Please zoom in for more realistic details. In nature, most plants turn dark yellow when autumn comes. RoNet captures this smooth variation successfully with the help of the in-plane rotation. For a typical forest scene such as the first row, the complex texture is quite challenging for the network to learn and easily causes artifacts or blur. It is observed that RoNet preserves the texture well thanks to the patch-based semantic style loss. In the penultimate row, the plants change with the seasons while the river keeps a similar appearance, which agrees with the natural order. Furthermore, RoNet is capable of cyclic continuous translation with a single input. We present raw results in Fig.1, which are also the materials for the corresponding time-lapse demo in the supplementary material.

IV-B2 Comparison

We compare RoNet with leading approaches in closely related fields, including multi-domain translation, continuous translation based on linear interpolation, and translation based on disentangled representations (StarGAN v2 [19], DLOW [17], DRIT [6], Fast Photo Style [7], VSAIT [27], DiffuseIT [80]). Note that all the approaches are trained and tested under the same protocol. The comparison is shown in Fig.8. For StarGAN v2 [19], we provide the style representations of different target domains to implement the multi-domain translation. Thus the 2nd, 5th, 8th and last columns of StarGAN v2 [19] are guided by specific domain styles, while the rest are the corresponding interpolation results over the style vectors. For the remaining rows, the models are trained between Spring and Winter, as they are essentially point-to-point approaches, but are applied with different interpolation schemes: DLOW [17] with label interpolation, Fast Photo Style [7] and DiffuseIT [80] with representation interpolation, and DNI-DRIT [6] and DNI-VSAIT [27] with model interpolation. It is observed that RoNet achieves the most appealing results in terms of both visual quality and continuity.

TABLE I: Metrics comparison of RoNet and existing approaches.

| Method | FID ↓ Spr | FID ↓ Sum | FID ↓ Aut | FID ↓ Win | FID ↓ mean | LPIPS ↓ Spr | LPIPS ↓ Sum | LPIPS ↓ Aut | LPIPS ↓ Win | LPIPS ↓ mean | KID ($\times 10^{-3}$) ↓ Spr | KID ($\times 10^{-3}$) ↓ Sum | KID ($\times 10^{-3}$) ↓ Aut | KID ($\times 10^{-3}$) ↓ Win | KID ($\times 10^{-3}$) ↓ mean |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| StarGAN v2 [19] | 45.8 | 74.2 | 81.8 | 66.3 | 67.0 | 0.675 | 0.685 | 0.748 | 0.738 | 0.712 | 34.1 | 39.2 | 43.9 | 35.2 | 38.1 |
| DLOW [17] | 75.3 | 60.0 | 88.2 | 78.7 | 75.5 | 0.487 | 0.489 | 0.524 | 0.530 | 0.508 | 43.7 | 23.5 | 46.6 | 38.4 | 38.1 |
| DRIT [6] | 58.3 | 47.2 | 52.8 | 57.9 | 54.1 | 0.306 | 0.306 | 0.363 | 0.457 | 0.357 | 35.8 | 23.4 | 27.1 | 30.2 | 29.1 |
| Fast Photo Style [7] | 102.4 | 80.1 | 82.1 | 76.2 | 85.2 | 0.421 | 0.421 | 0.510 | 0.440 | 0.448 | 73.4 | 54.9 | 51.1 | 50.8 | 57.6 |
| VSAIT [27] | 59.7 | 52.8 | 60.4 | 69.1 | 60.5 | 0.261 | 0.240 | 0.308 | 0.235 | 0.261 | 37.4 | 30.0 | 34.4 | 41.2 | 35.7 |
| DiffuseIT [80] | 60.8 | 49.1 | 55.6 | 49.3 | 53.7 | 0.512 | 0.517 | 0.577 | 0.538 | 0.536 | 41.1 | 24.6 | 30.0 | 22.5 | 29.5 |
| RoNet | 55.7 | 43.3 | 57.1 | 55.7 | 52.9 | 0.265 | 0.238 | 0.256 | 0.176 | 0.234 | 36.8 | 20.5 | 27.8 | 26.2 | 27.8 |
TABLE II: Ablation studies.

| Method | FID ↓ Spr | FID ↓ Sum | FID ↓ Aut | FID ↓ Win | FID ↓ mean | LPIPS ↓ Spr | LPIPS ↓ Sum | LPIPS ↓ Aut | LPIPS ↓ Win | LPIPS ↓ mean | KID ($\times 10^{-3}$) ↓ Spr | KID ($\times 10^{-3}$) ↓ Sum | KID ($\times 10^{-3}$) ↓ Aut | KID ($\times 10^{-3}$) ↓ Win | KID ($\times 10^{-3}$) ↓ mean |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| A: Baseline $\mathcal{L}_{adv}$ | 102.1 | 66.0 | 76.6 | 72.2 | 79.2 | 0.394 | 0.378 | 0.464 | 0.378 | 0.404 | 77.2 | 38.0 | 40.2 | 39.5 | 48.7 |
| B: + $\mathcal{L}_{sty}$ | 65.3 | 66.3 | 72.1 | 74.1 | 69.5 | 0.461 | 0.413 | 0.523 | 0.553 | 0.488 | 47.9 | 52.3 | 39.7 | 39.4 | 44.8 |
| C: + $\mathcal{L}_{con}$ | 58.7 | 53.6 | 59.1 | 63.9 | 58.8 | 0.254 | 0.248 | 0.309 | 0.366 | 0.294 | 34.9 | 29.1 | 28.4 | 37.3 | 32.5 |
| D: + $\mathcal{L}_{rec}$ | 57.1 | 52.9 | 57.9 | 71.0 | 59.7 | 0.248 | 0.218 | 0.233 | 0.350 | 0.262 | 35.7 | 29.2 | 29.7 | 40.5 | 33.8 |
| E: + $\mathcal{L}_{mse}$ | 56.3 | 46.3 | 57.7 | 58.6 | 54.7 | 0.290 | 0.238 | 0.299 | 0.340 | 0.292 | 35.7 | 23.4 | 31.5 | 28.3 | 29.7 |
| F: + $\mathcal{L}_{VSA}$ | 55.7 | 43.3 | 57.1 | 55.7 | 52.9 | 0.265 | 0.238 | 0.256 | 0.176 | 0.234 | 36.8 | 20.5 | 27.8 | 26.2 | 27.8 |
| G: $w/o$ Rotation | 59.5 | 54.4 | 65.0 | 67.3 | 61.5 | 0.228 | 0.228 | 0.261 | 0.267 | 0.246 | 41.2 | 31.1 | 39.7 | 35.6 | 36.9 |
TABLE III: Metrics comparison of RoNet and existing approaches on the timeshift task.

| Method | FID ↓ Dawn | FID ↓ Day | FID ↓ Dusk | FID ↓ Night | FID ↓ mean | LPIPS ↓ Dawn | LPIPS ↓ Day | LPIPS ↓ Dusk | LPIPS ↓ Night | LPIPS ↓ mean | KID ($\times 10^{-3}$) ↓ Dawn | KID ($\times 10^{-3}$) ↓ Day | KID ($\times 10^{-3}$) ↓ Dusk | KID ($\times 10^{-3}$) ↓ Night | KID ($\times 10^{-3}$) ↓ mean |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| StarGAN v2 [19] | 215.9 | 120.5 | 163.8 | 161.7 | 165.5 | 0.573 | 0.546 | 0.515 | 0.660 | 0.573 | 156.3 | 142.0 | 139.9 | 184.2 | 155.6 |
| DLOW [17] | 184.9 | 141.2 | 153.1 | 134.4 | 153.4 | 0.352 | 0.354 | 0.340 | 0.458 | 0.376 | 151.3 | 144.0 | 143.4 | 109.8 | 137.1 |
| DRIT [6] | 196.6 | 156.2 | 208.2 | 159.9 | 180.2 | 0.407 | 0.393 | 0.371 | 0.627 | 0.450 | 153.3 | 144.1 | 222.1 | 118.5 | 159.5 |
| Fast Photo Style [7] | 159.1 | 162.0 | 190.0 | 112.6 | 155.9 | 0.317 | 0.323 | 0.300 | 0.523 | 0.366 | 114.8 | 143.2 | 189.9 | 94.9 | 135.7 |
| VSAIT [27] | 114.2 | 93.2 | 104.6 | 100.6 | 103.4 | 0.153 | 0.174 | 0.174 | 0.438 | 0.235 | 64.0 | 57.2 | 77.4 | 54.7 | 63.3 |
| CoMoGAN [71] | 141.2 | 103.9 | 141.0 | 117.9 | 126.0 | 0.284 | 0.283 | 0.308 | 0.338 | 0.303 | 84.0 | 67.3 | 128.1 | 100.7 | 95.0 |
| DiffuseIT [80] | 108.3 | 76.7 | 103.1 | 109.5 | 99.4 | 0.422 | 0.424 | 0.391 | 0.619 | 0.464 | 55.6 | 55.1 | 80.6 | 77.2 | 67.1 |
| RoNet | 107.6 | 75.1 | 87.4 | 93.4 | 90.9 | 0.178 | 0.209 | 0.195 | 0.212 | 0.199 | 53.9 | 31.7 | 49.6 | 46.3 | 45.4 |

IV-B3 Quantitative Analysis

To evaluate the performance of our model objectively, we report the quantitative metrics of the different approaches in Table I. Three metrics are employed to evaluate these approaches from different perspectives. LPIPS [84] evaluates the structural difference between source images and generated images, while FID [82] and KID [83] measure the distance between the distribution of the target images and that of the generated images. For all three metrics, smaller is better. Note that all comparative experiments are conducted on the test sets of the datasets. It is observed that RoNet achieves the lowest mean value on all three metrics, verifying its superiority over the others.

IV-B4 Ablation Study

To analyse the contribution of each loss function in detail, we conduct ablation studies and show the results in Tab.II. The baseline is trained merely with the adversarial loss, and we then add each loss function step by step. When the patch-based semantic style loss is added first (model B), the mean LPIPS degrades from 0.404 to 0.488, although the mean FID and KID improve. The reason is that these two loss functions focus more on visual realism but are insufficient for model convergence. With the contribution of the content loss, the reconstruction loss and the MSE loss, the performance improves significantly. Finally, the full model F achieves the best mean results. Tab.II reports the quantitative importance of each loss function, and the corresponding visual effects are shown in Fig.9. It is observed that the sky artifacts decrease and the semantic style becomes more realistic. In addition, we remove the rotation module to test its impact on the model’s performance; the results are shown in Tab.II (model G) and Fig.9. The findings demonstrate that the rotation module significantly enhances the model’s ability to learn vibrant styles, even without reference style images.

IV-B5 Interpolation Effectiveness

We apply different interpolation schemes to the state-of-the-art methods: DLOW [17] with label interpolation, StarGAN v2 [19] and Fast Photo Style [7] with representation interpolation, and DNI-DRIT [6] and DNI-VSAIT [27] with model interpolation. As shown in Fig.10, we set the spring image as the input source and the winter image as the target style. It is observed that StarGAN v2 [19], DiffuseIT [80] and Fast Photo Style [7] suffer from content leak, while DNI-DRIT [6] and DNI-VSAIT [27] exhibit serious artifacts. Linear interpolation is thus not always valid, since these methods should generate intermediate results of the summer and autumn seasons. DLOW [17] and RoNet achieve the full seasonal circulation without the target image, but DLOW [17] suffers from more style leak.
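One geometric difference between the two families of schemes can be checked numerically: a linear blend of two style codes shrinks their magnitude at intermediate points, whereas an in-plane rotation preserves it. The sketch below is a toy illustration with a hand-picked plane spanned by the orthonormal vectors `u` and `v`; in RoNet the plane $W$ is learned, and the helper names are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def norm(a: Vector) -> float:
    return math.sqrt(dot(a, a))


def lerp(a: Vector, b: Vector, t: float) -> Vector:
    """Linear interpolation between two style codes."""
    return [(1.0 - t) * x + t * y for x, y in zip(a, b)]


def rotate_in_plane(s: Vector, u: Vector, v: Vector, theta: float) -> Vector:
    """Rotate s by theta inside the plane spanned by orthonormal u and v.
    The component of s orthogonal to the plane is left untouched."""
    a, b = dot(s, u), dot(s, v)
    a_rot = a * math.cos(theta) - b * math.sin(theta)
    b_rot = a * math.sin(theta) + b * math.cos(theta)
    return [x + (a_rot - a) * ui + (b_rot - b) * vi
            for x, ui, vi in zip(s, u, v)]


u = [1.0, 0.0, 0.0]      # assumed orthonormal basis of the rotation plane
v = [0.0, 1.0, 0.0]
s_src = [1.0, 0.0, 0.5]  # toy source style code
s_tgt = [0.0, 1.0, 0.5]  # toy target style code, 90 degrees away in the plane

lerp_half = lerp(s_src, s_tgt, 0.5)
rotated_half = rotate_in_plane(s_src, u, v, math.pi / 4)
rotated_full = rotate_in_plane(s_src, u, v, math.pi / 2)
```

Here the rotation reaches the target code from the source alone, while the linear blend both requires the target and yields a midpoint of smaller norm.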

IV-B6 High Resolution

Thanks to the properties of convolutional neural networks, RoNet can produce images at varying scales. As illustrated in Fig.11, the images generated by RoNet have a resolution of $1024\times 1600$ pixels, which clearly reveals the intricate texture details of leaves.

IV-C Time Shifting

IV-C1 Difference

Recently, an unsupervised generation network named CoMoGAN [71] has been proposed to learn non-linear continuous translations. Despite its unsupervised training manner, CoMoGAN relies on physics-inspired models to guide the learning process. Taking the cyclic translation task of “day to any time” as an example, CoMoGAN first renders a daytime image with a color-based model to obtain images at any time of day, and then uses the series of rendered data as supervision. This restricts CoMoGAN to tasks for which a proper physical model can be found. When confronting more challenging tasks such as season transition, CoMoGAN fails since it is hard to find a capable physical model for data rendering. In contrast, our proposed framework is not constrained by physical guidance and applies to arbitrary scene domains, such as the seasonal variation shown in Fig.7 and the time shifting shown in Fig.13.

IV-C2 Comparison

We compare RoNet with several SOTA approaches (StarGAN v2 [19], DLOW [17], Fast Photo Style [7], DRIT [6], CoMoGAN [71], VSAIT [27], DiffuseIT [80]). From Fig.12 and Tab.III, it is observed that StarGAN v2 [19] and DiffuseIT [80] suffer from serious semantic distortion, while Fast Photo Style [7], DLOW [17], DNI-VSAIT [27] and DNI-DRIT [6] produce unrealistic results. Owing to the guidance of its physics-inspired function, CoMoGAN [71] learns mostly color- and pixel-wise information. In contrast, our model captures the style from the data distribution, such as the street lights at night and the sunsets at dusk.

IV-D Other Tasks

Experiments are also conducted on the tasks of $real\;face\to comic\;portrait$ and $iphone\to dslr$, and the visual results are presented in Fig.14 and Fig.16, respectively. For real face to comic, we compare with four other approaches (StyTr [26], SANet [24], Linear [61], and CCPL [52]) that realize continuous translation with interpolation. From Fig.15, it is observed that RoNet yields the most appealing results with rotation, and the changes of pupil color and hairstyle are continuous. The results from iPhone to DSLR also verify the generalization ability of RoNet.

Figure 14: Continuous translation from real face to comic.
Figure 15: Detail comparison of image style transfer results from real face to comic.
Figure 16: Continuous translation on the iPhone to DSLR task, using iphone2dslr dataset [3].
Figure 17: The uneven circulation of four seasons generated by RoNet with the single input (red dot), where the spring and summer are longer than autumn and winter.
Figure 18: Continuous translation of timeshift where the day and night account for larger part than dusk and dawn.

IV-E Discussion

IV-E1 Generalization

Our proposed method is a general framework that can be applied to most natural scenes with continuous variance, including seasonal circulation and time shifting. However, there are certain special circumstances in the world that need to be considered. For example, in normal situations, the length of day is longer than that of night, dawn and dusk, as shown in Fig.18. Additionally, in the southern hemisphere, the length of spring and summer is longer than that of autumn and winter, as shown in Fig.17. It is important to account for these special circumstances to ensure the effectiveness of our method in such situations.
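One possible way to obtain such an uneven circulation is to map a normalized time value to the rotation angle with a piecewise-linear schedule, so that longer domains occupy more of the timeline. The sketch below is purely hypothetical: it assumes the domains are evenly spaced on the circle, and neither the `time_to_angle` helper nor the 3:3:1:1 durations are taken from the paper.

```python
import math
from typing import Sequence


def time_to_angle(t: float, durations: Sequence[float]) -> float:
    """Map normalized time t (one cycle = [0, 1)) to a rotation angle.

    Each of the len(durations) domains spans an equal arc of the circle
    (assumption) but lasts durations[k] / sum(durations) of the cycle.
    """
    total = float(sum(durations))
    arc = 2.0 * math.pi / len(durations)
    t = t % 1.0
    start = 0.0
    for k, d in enumerate(durations):
        frac = d / total
        if t < start + frac:
            # Position inside domain k, stretched to fill its arc.
            return (k + (t - start) / frac) * arc
        start += frac
    return 2.0 * math.pi  # only reached through floating-point rounding


# Hypothetical relative lengths: spring, summer, autumn, winter.
season_durations = [3.0, 3.0, 1.0, 1.0]
angles = [time_to_angle(k / 8, season_durations) for k in range(8)]
```

With these durations, three quarters of the timeline is spent on the first half of the circle, so uniformly spaced time steps yield more spring and summer frames than autumn and winter frames.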

IV-E2 Automation

learning imbalanced datasets with label distribution aware margin loss [85]

V Conclusion

Aiming at continuous I2I translation, this paper proposes to embed the disentangled style representation under an annular manifold constraint, so that continuous generation can be achieved by rotating the style representation arbitrarily in the proper plane. To this end, RoNet is designed by implanting a rotation module in the generation network and adding a new patch-based semantic style loss. Unlike typical linear interpolation, the rotation can move the style representation from one domain to another with a single input while keeping the magnitude of the representation. Experiments on various translation scenarios (involving seasons, faces, solar days and camera effects) are conducted. The qualitative and quantitative results demonstrate that our method not only generates the most promising results with plausible transitions compared with the others, but also achieves better performance in metrics. In the future, we plan to study more complex manifolds in continuous translation for better generality.

References

  • [1] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, “Image-to-image translation with conditional adversarial networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1125–1134.
  • [2] A. Pumarola, A. Agudo, A. M. Martinez, A. Sanfeliu, and F. Moreno-Noguer, “Ganimation: One-shot anatomically consistent facial animation,” International Journal of Computer Vision, vol. 128, no. 3, pp. 698–713, 2020.
  • [3] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired image-to-image translation using cycle-consistent adversarial networks,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2223–2232.
  • [4] X. Huang, M.-Y. Liu, S. Belongie, and J. Kautz, “Multimodal unsupervised image-to-image translation,” in Proceedings of the European Conference on Computer Vision, 2018, pp. 172–189.
  • [5] Y.-C. Chen, X. Xu, Z. Tian, and J. Jia, “Homomorphic latent space interpolation for unpaired image-to-image translation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 2408–2416.
  • [6] H.-Y. Lee, H.-Y. Tseng, J.-B. Huang, M. Singh, and M.-H. Yang, “Diverse image-to-image translation via disentangled representations,” in Proceedings of the European Conference on Computer Vision, 2018, pp. 35–51.
  • [7] Y. Li, M.-Y. Liu, X. Li, M.-H. Yang, and J. Kautz, “A closed-form solution to photorealistic image stylization,” in Proceedings of the European Conference on Computer Vision, 2018, pp. 453–468.
  • [8] H. Chen, L. Zhao, H. Zhang, Z. Wang, Z. Zuo, A. Li, W. Xing, and D. Lu, “Diverse image style transfer via invertible cross-space mapping,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 14 880–14 889.
  • [9] J. Hoffman, E. Tzeng, T. Park, J.-Y. Zhu, P. Isola, K. Saenko, A. Efros, and T. Darrell, “Cycada: Cycle-consistent adversarial domain adaptation,” in International Conference on Machine Learning.   PMLR, 2018, pp. 1989–1998.
  • [10] Z. Murez, S. Kolouri, D. Kriegman, R. Ramamoorthi, and K. Kim, “Image to image translation for domain adaptation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 4500–4509.
  • [11] Y. Li, L. Yuan, and N. Vasconcelos, “Bidirectional learning for domain adaptation of semantic segmentation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 6936–6945.
  • [12] Y. Li, H. Huang, J. Cao, R. He, and T. Tan, “Disentangled representation learning of makeup portraits in the wild,” International Journal of Computer Vision, vol. 128, no. 8, pp. 2166–2184, 2020.
  • [13] Y. Han, J. Yang, and Y. Fu, “Disentangled face attribute editing via instance-aware latent space search,” in Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, 2021, pp. 715–721.
  • [14] T. Xiao, J. Hong, and J. Ma, “Dna-gan: Learning disentangled representations from multi-attribute images,” arXiv preprint arXiv:1711.05415, 2017.
  • [15] J. Zhang, Y. Huang, Y. Li, W. Zhao, and L. Zhang, “Multi-attribute transfer via disentangled representation,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, no. 01, 2019, pp. 9195–9202.
  • [16] X. Wang, K. Yu, C. Dong, X. Tang, and C. C. Loy, “Deep network interpolation for continuous imagery effect transition,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 1692–1701.
  • [17] R. Gong, W. Li, Y. Chen, and L. V. Gool, “Dlow: Domain flow for adaptation and generalization,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 2477–2486.
  • [18] R. Gong, D. Dai, Y. Chen, W. Li, D. P. Paudel, and L. Van Gool, “Analogical image translation for fog generation,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, no. 2, 2021, pp. 1433–1441.
  • [19] Y. Choi, Y. Uh, J. Yoo, and J.-W. Ha, “Stargan v2: Diverse image synthesis for multiple domains,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 8188–8197.
  • [20] Y. Choi, M. Choi, M. Kim, J.-W. Ha, S. Kim, and J. Choo, “Stargan: Unified generative adversarial networks for multi-domain image-to-image translation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 8789–8797.
  • [21] T. Karras, S. Laine, and T. Aila, “A style-based generator architecture for generative adversarial networks,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 4401–4410.
  • [22] T. Karras, S. Laine, M. Aittala, J. Hellsten, J. Lehtinen, and T. Aila, “Analyzing and improving the image quality of stylegan,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 8110–8119.
  • [23] T. Karras, M. Aittala, S. Laine, E. Härkönen, J. Hellsten, J. Lehtinen, and T. Aila, “Alias-free generative adversarial networks,” Advances in Neural Information Processing Systems, vol. 34, pp. 852–863, 2021.
  • [24] D. Y. Park and K. H. Lee, “Arbitrary style transfer with style-attentional networks,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 5880–5888.
  • [25] S. Liu, T. Lin, D. He, F. Li, M. Wang, X. Li, Z. Sun, Q. Li, and E. Ding, “Adaattn: Revisit attention mechanism in arbitrary neural style transfer,” in IEEE International Conference on Computer Vision (ICCV), 2021, pp. 6649–6658.
  • [26] Y. Deng, F. Tang, W. Dong, C. Ma, X. Pan, L. Wang, and C. Xu, “Stytr:image style transfer with transformers,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
  • [27] J. Theiss, J. Leverett, D. Kim, and A. Prakash, “Unpaired image translation via vector symbolic architectures,” in European Conference on Computer Vision.   Springer, 2022, pp. 17–32.
  • [28] L. Jiang, C. Zhang, M. Huang, C. Liu, J. Shi, and C. C. Loy, “Tsit: A simple and versatile framework for image-to-image translation,” in European Conference on Computer Vision.   Springer, 2020, pp. 206–222.
  • [29] T. Park, A. A. Efros, R. Zhang, and J.-Y. Zhu, “Contrastive learning for unpaired image-to-image translation,” in European conference on computer vision.   Springer, 2020, pp. 319–345.
  • [30] T. Park, J.-Y. Zhu, O. Wang, J. Lu, E. Shechtman, A. Efros, and R. Zhang, “Swapping autoencoder for deep image manipulation,” Advances in Neural Information Processing Systems, vol. 33, pp. 7198–7211, 2020.
  • [31] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial networks,” Advances in Neural Information Processing Systems, vol. 3, pp. 2672–2680, 2014.
  • [32] M.-Y. Liu, X. Huang, A. Mallya, T. Karras, T. Aila, J. Lehtinen, and J. Kautz, “Few-shot unsupervised image-to-image translation,” in Proceedings of the IEEE/CVF international conference on computer vision, 2019, pp. 10 551–10 560.
  • [33] K. Saito, K. Saenko, and M.-Y. Liu, “Coco-funit: Few-shot unsupervised image translation with a content conditioned style encoder,” in Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part III 16.   Springer, 2020, pp. 382–398.
  • [34] Y. Shen, J. Gu, X. Tang, and B. Zhou, “Interpreting the latent space of gans for semantic face editing,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 9243–9252.
  • [35] G. Yang, N. Fei, M. Ding, G. Liu, Z. Lu, and T. Xiang, “L2m-gan: Learning to manipulate latent space semantics for facial attribute editing,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 2951–2960.
  • [36] J. Zhu, Y. Shen, D. Zhao, and B. Zhou, “In-domain gan inversion for real image editing,” in European conference on computer vision.   Springer, 2020, pp. 592–608.
  • [37] A. Voynov and A. Babenko, “Unsupervised discovery of interpretable directions in the gan latent space,” in International conference on machine learning.   PMLR, 2020, pp. 9786–9796.
  • [38] E. Härkönen, A. Hertzmann, J. Lehtinen, and S. Paris, “Ganspace: Discovering interpretable gan controls,” Advances in Neural Information Processing Systems, vol. 33, pp. 9841–9850, 2020.
  • [39] Y. Shen and B. Zhou, “Closed-form factorization of latent semantics in gans,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 1532–1540.
  • [40] X. Xie, Y. Li, H. Huang, H. Fu, W. Wang, and Y. Guo, “Artistic style discovery with independent components,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 19 870–19 879.
  • [41] A. Hyvarinen, “Fast and robust fixed-point algorithms for independent component analysis,” IEEE transactions on Neural Networks, vol. 10, no. 3, pp. 626–634, 1999.
  • [42] Z. Koldovsky, P. Tichavsky, and E. Oja, “Efficient variant of algorithm fastica for independent component analysis attaining the cramér-rao lower bound,” IEEE Transactions on neural networks, vol. 17, no. 5, pp. 1265–1277, 2006.
  • [43] L. A. Gatys, A. S. Ecker, and M. Bethge, “Image style transfer using convolutional neural networks,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 2414–2423.
  • [44] L. A. Gatys, M. Bethge, A. Hertzmann, and E. Shechtman, “Preserving color in neural artistic style transfer,” arXiv preprint arXiv:1606.05897, 2016.
  • [45] Y. Li, N. Wang, J. Liu, and X. Hou, “Demystifying neural style transfer,” arXiv preprint arXiv:1701.01036, 2017.
  • [46] E. Risser, P. Wilmot, and C. Barnes, “Stable and controllable neural texture synthesis and style transfer using histogram losses,” arXiv preprint arXiv:1701.08893, 2017.
  • [47] J. Johnson, A. Alahi, and L. Fei-Fei, “Perceptual losses for real-time style transfer and super-resolution,” pp. 694–711, 2016.
  • [48] C. Li and M. Wand, “Precomputed real-time texture synthesis with markovian generative adversarial networks,” in European conference on computer vision.   Springer, 2016, pp. 702–716.
  • [49] D. Chen, L. Yuan, J. Liao, N. Yu, and G. Hua, “Stylebank: An explicit representation for neural image style transfer,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 1897–1906.
  • [50] D. Ulyanov, A. Vedaldi, and V. Lempitsky, “Improved texture networks: Maximizing quality and diversity in feed-forward stylization and texture synthesis,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 6924–6932.
  • [51] Y. Jing, Y. Liu, Y. Yang, Z. Feng, Y. Yu, D. Tao, and M. Song, “Stroke controllable fast style transfer with adaptive receptive fields,” in European conference on computer vision (ECCV), 2018, pp. 238–254.
  • [52] Z. Wu, Z. Zhu, J. Du, and X. Bai, “Ccpl: Contrastive coherence preserving loss for versatile style transfer,” in European Conference on Computer Vision.   Springer, 2022, pp. 189–206.
  • [53] V. Dumoulin, J. Shlens, and M. Kudlur, “A learned representation for artistic style,” arXiv preprint arXiv:1610.07629, 2016.
  • [54] Y. Li, C. Fang, J. Yang, Z. Wang, X. Lu, and M.-H. Yang, “Universal style transfer via feature transforms,” Advances Neural Information Processing Systems (NeurIPS), vol. 30, pp. 386–396, 2017.
  • [55] X. Huang and S. Belongie, “Arbitrary style transfer in real-time with adaptive instance normalization,” pp. 1501–1510, 2017.
  • [56] T. Karras, S. Laine, and T. Aila, “A style-based generator architecture for generative adversarial networks,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 4401–4410.
  • [57] T. Lin, Z. Ma, F. Li, D. He, X. Li, E. Ding, N. Wang, J. Li, and X. Gao, “Drafting and revision: Laplacian pyramid network for fast high-quality artistic style transfer,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 5141–5150.
  • [58] Y. Jing, X. Liu, Y. Ding, X. Wang, E. Ding, M. Song, and S. Wen, “Dynamic instance normalization for arbitrary style transfer,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 04, 2020, pp. 4369–4376.
  • [59] L. Sheng, Z. Lin, J. Shao, and X. Wang, “Avatar-net: Multi-scale zero-shot style transfer by feature decoration,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 8242–8250.
  • [60] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in International Conference on Medical image computing and computer-assisted intervention.   Springer, 2015, pp. 234–241.
  • [61] X. Li, S. Liu, J. Kautz, and M.-H. Yang, “Learning linear transformations for fast arbitrary style transfer,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
  • [62] V. Mnih, N. Heess, A. Graves et al., “Recurrent models of visual attention,” Advances Neural Information Processing Systems (NeurIPS), vol. 27, 2014.
  • [63] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances Neural Information Processing Systems (NeurIPS), vol. 30, 2017.
  • [64] F. Luan, S. Paris, E. Shechtman, and K. Bala, “Deep photo style transfer,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 4990–4998.
  • [65] Q. Mao, H.-Y. Tseng, H.-Y. Lee, J.-B. Huang, S. Ma, and M.-H. Yang, “Continuous and diverse image-to-image translation via signed attribute vectors,” International Journal of Computer Vision, vol. 130, no. 2, pp. 517–549, 2022.
  • [66] A. Romero, P. Arbeláez, L. Van Gool, and R. Timofte, “Smit: Stochastic multi-label image-to-image translation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, 2019, pp. 0–0.
  • [67] K. Muandet, D. Balduzzi, and B. Schölkopf, “Domain generalization via invariant feature representation,” in International conference on machine learning.   PMLR, 2013, pp. 10–18.
  • [68] Y. Li, X. Tian, M. Gong, Y. Liu, T. Liu, K. Zhang, and D. Tao, “Deep domain generalization via conditional invariant adversarial networks,” in Proceedings of the European conference on computer vision (ECCV), 2018, pp. 624–639.
  • [69] A. Pumarola, A. Agudo, A. M. Martinez, A. Sanfeliu, and F. Moreno-Noguer, “Ganimation: Anatomically-aware facial animation from a single image,” in Proceedings of the European conference on computer vision (ECCV), 2018, pp. 818–833.
  • [70] P.-W. Wu, Y.-J. Lin, C.-H. Chang, E. Y. Chang, and S.-W. Liao, “Relgan: Multi-domain image-to-image translation via relative attributes,” in Proceedings of the IEEE/CVF international conference on computer vision, 2019, pp. 5914–5922.
  • [71] F. Pizzati, P. Cerri, and R. de Charette, “Comogan: continuous model-guided image-to-image translation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 14 288–14 298.
  • [72] Y. Bengio, G. Mesnil, Y. Dauphin, and S. Rifai, “Better mixing via deep representations,” in International conference on machine learning.   PMLR, 2013, pp. 552–560.
  • [73] D. Berthelot, C. Raffel, A. Roy, and I. Goodfellow, “Understanding and improving interpolation in autoencoders via an adversarial regularizer,” arXiv preprint arXiv:1807.07543, 2018.
  • [74] H. Teoh, “Formula for vector rotation in arbitrary planes in n,” 2005.
  • [75] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
  • [76] P. Neubert, S. Schubert, and P. Protzel, “An introduction to hyperdimensional computing for robotics,” KI-Künstliche Intelligenz, vol. 33, no. 4, pp. 319–330, 2019.
  • [77] M. Caron, H. Touvron, I. Misra, H. Jégou, J. Mairal, P. Bojanowski, and A. Joulin, “Emerging properties in self-supervised vision transformers,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 9650–9660.
  • [78] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
  • [79] P. Sun, H. Kretzschmar, X. Dotiwalla, A. Chouard, V. Patnaik, P. Tsui, J. Guo, Y. Zhou, Y. Chai, B. Caine et al., “Scalability in perception for autonomous driving: Waymo open dataset,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 2446–2454.
  • [80] G. Kwon and J. C. Ye, “Diffusion-based image translation using disentangled style and content representation,” in The Eleventh International Conference on Learning Representations, 2023.
  • [81] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” Communications of the ACM, vol. 60, no. 6, pp. 84–90, 2017.
  • [82] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter, “Gans trained by a two time-scale update rule converge to a local nash equilibrium,” Advances in neural information processing systems, vol. 30, 2017.
  • [83] A. Borji, “Pros and cons of gan evaluation measures,” Computer Vision and Image Understanding, vol. 179, pp. 41–65, 2019.
  • [84] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang, “The unreasonable effectiveness of deep features as a perceptual metric,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 586–595.
  • [85] K. Cao, C. Wei, A. Gaidon, N. Arechiga, and T. Ma, “Learning imbalanced datasets with label-distribution-aware margin loss,” Advances in neural information processing systems, vol. 32, 2019.