RoNet: Rotation-oriented Continuous Image Translation

Yi Li*, Xin Xie, Lina Lei, Haiyan Fu, Yanqing Guo

*Corresponding author: liyi@dlut.edu.cn. Other authors: shelsin@mail.dlut.edu.cn, leilina@mail.dlut.edu.cn, fuhy@dlut.edu.cn, guoyq@dlut.edu.cn

Abstract

The generation of smooth and continuous images between domains has recently drawn much attention in image-to-image (I2I) translation. Most existing approaches rest on a linear assumption, applied variously to features, models or labels. However, the linear assumption becomes harder to satisfy as the dimension of these elements increases, and it requires both ends of the line to be available. In this paper, we propose a novel rotation-oriented solution that models continuous generation as an in-plane rotation of the style representation of an image, yielding a network named RoNet. A rotation module is implanted in the generation network to automatically learn the proper plane while disentangling the content and the style of an image. To encourage realistic texture, we also design a patch-based semantic style loss that learns the different styles of similar objects in different domains. We conduct experiments on forest scenes (where the complex texture makes generation very challenging), faces, streetscapes and the iphone2dslr task. The results validate the superiority of our method in terms of visual quality and continuity.

Index Terms:

Image-to-image translation (I2I), continuous generation, style representation.

Figure 1: The turning wheel of four seasons generated by RoNet from a single input (on the right, labeled with the red dot).

I Introduction

Image-to-image (I2I) translation [1], also known as image translation, learns to map an image from the source domain to an image in the target domain. In the process, an image is roughly decomposed (explicitly or implicitly) into two components: the content (which is domain-invariant) is expected to remain the same after translation, while the style (which is domain-variant) refers to the changes during translation. Depending on the concrete tasks, I2I translation can benefit a wide range of applications involving portrait animation [2], photo enhancement [3, 4, 5], painting style transfer [6, 7, 8], domain adaptation [9, 10, 11] and face synthesis [12, 13]. Despite the impressive progress in the past years, it remains challenging to obtain smooth and continuous translation results.

Recently, some approaches have explored linear interpolation to realize continuous translation. The linear assumption may be imposed on the learned features [14, 15], the trained models [16] or the domain labels [17, 18]. The linear manifold assumption is intuitive but has limitations in at least three aspects. (1) Elements (feature/model/label) of both the source domain and the target domain are necessary to realize the linear interpolation. Taking the model interpolation in [16] as an example, one has to train multiple models to handle multi-domain translation. (2) The elements (feature/model/label) tend to have high dimensions. As the dimension increases, the relationship between two domains becomes harder to fit with the linear assumption, especially when there is a wide gap between the two domains. Even for labels, there are complex scenarios like cyclic translations, e.g., the turning wheel of seasons presented in Fig.1, generated by RoNet. (3) Suppose there are a source representation $S_{src}$ and target representations $S_{tgt1}$ and $S_{tgt2}$ from two different domains; the fused representation $S^{inp}_{src\rightarrow tgt2}$ is usually defined as $\alpha S_{src}+(1-\alpha)S_{tgt2}$ . As illustrated in Fig.3 (a), this typical intermediate representation interpolation sacrifices the expressive ability of the fused element in multi-domain image translation. For example, we cannot obtain an image with autumn style by interpolating the representation from the spring domain to the winter domain.

Based on this analysis, this paper proposes to realize continuous translation with an in-plane rotation of the style representation. Different from the intuitive linear interpolation, we assume that the domains are distributed on a circle in a super-plane and learn the super-plane automatically. We accordingly propose RoNet, which is built on a generation network that explicitly disentangles the content and the style of an image. The method possesses the following advantages. It is capable of multi-domain generation with a single input image because the domain relationship is embedded in the rotation angle. Although a cyclic manifold is assumed, the rotation plane is learned automatically in an end-to-end manner, making the network appropriate not only for periodic translation like season shifting but also for general translation tasks like $real\;face\to comic\;portrait$ and $iphone\to dslr$ . In the rotation model, the domain style is captured by the vector direction. When translating from one domain to another, we merely modify the vector direction to the target domain while keeping the magnitude unchanged, as presented in Fig.3 (b), preserving the expressive ability after translation. More specifically, we can generate continuous images with the styles of the four seasons by rotating the representation.
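As a small worked illustration of the magnitude argument, consider the simplified case (an assumption made here only for illustration, not a setting stated in the paper) in which the source and target style vectors have the same norm $r$ and are separated by an angle $\phi$. The norm of the linearly interpolated vector is then

$$\left\|\alpha S_{src}+(1-\alpha)S_{tgt}\right\|^{2}=r^{2}\left[\alpha^{2}+(1-\alpha)^{2}+2\alpha(1-\alpha)\cos\phi\right]=r^{2}\left[1-2\alpha(1-\alpha)(1-\cos\phi)\right]\leq r^{2},$$

so at the midpoint $\alpha=1/2$ the norm shrinks to $r\cos(\phi/2)$ for $\phi\in[0,\pi]$, and vanishes when the two styles point in opposite directions. Under the same assumption, a rotated vector keeps the norm $r$ for every angle, which is one way to read the contrast between Fig.3 (a) and Fig.3 (b).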

We conduct experiments on various translation tasks including season shifting in forests, $real\;face\to comic\;portrait$ , solar day shifting of streetscapes and $iphone\to dslr$ . Among them, the translation of the forest scene is quite challenging due to the extremely complex texture of trees. The results of existing methods deteriorate considerably on this task, as shown in Sec.IV. To address this, we design a patch-based semantic style loss that focuses on the style nuances in matched patches across domains. Compared with other approaches, RoNet produces the most realistic results on various tasks and achieves superior quantitative performance.

Our main contributions are summarized as follows.

• To achieve continuous I2I translation, we propose a novel rotation-oriented mechanism which embeds the style representation into a plane and utilizes the rotated representation to guide the generation. RoNet is accordingly implemented to learn the rotation plane automatically while disentangling the content and the style of an image simultaneously.

• To produce realistic visual effects on challenging textures like trees in forests, we design a patch-based semantic style loss. It first matches the patches from different domains and then learns the style difference with high pertinency.

• Experiments on various translation scenarios are conducted, including season shifting in forests, $real\;face\to comic\;portrait$ , solar day shifting of streetscapes and $iphone\to dslr$ . With the guidance of the rotation, RoNet successfully generates realistic as well as continuous translation results with a single input image.

II Related Work

The seminal work on image-to-image translation [1] demonstrated excellent performance. Building on it, several methods [19, 20, 21, 22, 23, 24, 25, 26, 27] have been designed to achieve further improvements along the following three main lines.

II-A Disentangled Representations

Disentanglement is defined as the act of releasing from a snarled or tangled condition, and is a common tool to extract high-dimensional features in latent space. Many unsupervised methods [6, 4, 28, 29, 30] utilize disentanglement to capture the content and style features via encoders, achieving excellent image-to-image translation. In addition, multi-domain image translation [19, 20] is accomplished by building relative transformations between style representations in different domains or by feeding the style encoder extra information about the domain label in a semi-supervised manner. With the guidance of style features, image synthesis [21, 22, 23] becomes controllable. With the development of neural networks, generated images have become vivid enough to fool the discriminator in GANs [31] or even some industry experts. However, it is hard for most existing methods to achieve ideal disentanglement, which may cause semantic flipping [27]. Recent works [32, 33] exploit disentanglement for few-shot generalization. Besides, disentanglement is frequently used in image editing. There are often latent directions that control the change of one attribute in the image and thus carry a latent semantic interpretation. Prior works [34, 35, 36] explore the disentanglement of latent dimensions for image editing by training a linear classifier, discovering meaningful directions for semantic editing and achieving remarkable results. Voynov and Babenko [37] propose to learn a candidate matrix and a classifier such that the semantic directions in the matrix can be properly recognized by the classifier. The unsupervised methods GANSpace [38] and SeFa [39] perform PCA on the sampled data and on the weights of the generative model, respectively, to separate primary semantic directions in the latent space, generating realistic images along target attribute directions. Similarly, ArtIns [40] takes advantage of the FastICA algorithm [41, 42] to obtain independent style component vectors in the style feature space, enabling artwork editing.

II-B Style Transfer

The ultimate goal of style transfer is to generate plausible artworks that preserve the content of the photograph while carrying the style of the painting. The seminal work of Gatys et al. [43] was the first to achieve stylization. During its iterative optimization process, the content and style features are fused by calculating the Gram matrix for loss constraints. Similarly, some works [44, 45, 46] flexibly combine the content and style of arbitrary images through iterative optimization, which is time-consuming. For resource-saving and faster stylization, later works [47, 48, 49, 50, 51, 52] turn to convolutional neural networks (CNNs) and utilize a feed-forward pass to improve the efficiency of stylization. Moreover, migrating multiple styles into one content image [53] is achieved by conditional instance normalization, generating excellent stylized results and breaking the limitation of learning one specific style. Recently, arbitrary style transfer methods have received more attention for efficient applications. WCT [54] achieves universal style transfer with two transformation steps, whitening and coloring. Huang et al. [55] propose AdaIN, which normalizes the mean and variance of each feature map separately to adaptively combine the content and style. Due to its convenience, a large number of image generation tasks [56, 19, 57] adopt AdaIN as the first choice to fuse the content and style representations. Based on CNNs, Jing et al. [58] extend the arbitrary style transfer task by introducing dynamic instance normalization. Avatar-Net [59] utilizes a U-net [60] to semantically align the content and the style features. Linear [61] learns a linear transformation according to the content and style features. With the development of attention mechanisms [62, 63], existing methods [24, 25, 26] utilize the encoder-transfer-decoder architecture to generate higher-quality artworks.
Photorealistic image stylization can be considered a special kind of image-to-image translation. Recent works [64, 7] design deep-learning approaches to faithfully transfer the reference style of natural scenes.

II-C Continuous Image Translation

For continuous image variation, feature interpolation [14, 15, 65] is a common practice. DRIT [6] and MUNIT [4] perform continuous interpolation between two style features, but the generated images belong to the same domain. StarGAN v2 [19] and SMIT [66] mix disentangled style representations, resulting in impressive continuous I2I translation. Besides, continuity can be achieved by model parameter interpolation between two domains [16]. Several methods [17, 18, 67, 68] generate a continuous sequence of images between two domains by utilizing intermediate domain labels. GANimation [69] adopts a conditional GAN framework [1], enabling continuous generation by feeding continuous rather than discrete labels at inference time. RelGAN [70] introduces the interpolation loss for intermediate states. In addition, there is rich latent information contained in the underlying dimensions. Chen et al. [5] have proposed a framework for unpaired I2I translation, generating natural and gradually changing intermediate results by latent space interpolation. CoMoGAN [71] relies on naive physics-inspired models to guide the training, learning continuous translations in latent space. However, it is complicated to obtain the related physics functions for model guidance [71] in different domains. Moreover, linear interpolation [72, 73] is not always valid (e.g., the transition from spring to winter passes through summer and autumn, and that from day to night passes through dusk).

III Method

Let $\{\mathcal{I}^{i}\}_{i=1}^{N}\in\mathcal{I}$ be the image sets of $N$ different domains, and $\{y_{i}\}_{i=1}^{N}\in\mathcal{Y}$ be their corresponding domain labels. Given an image $\mathbf{I}\in\mathcal{I}$ , the goal of RoNet is to generate continuous results across domains that accord with the cyclic manifold, under the guidance of style vector rotation. Since the style indicates the changes during translation, we employ the style representation to play the rotation role. Note that the plane is not manually appointed but learned along with the whole network, and the details are illustrated in the following.

III-A How to Rotate?

It is challenging to imagine the rotation of high-dimensional vectors because of their spatial complexity. From Euler’s rotation theorem we know that any rotation can be expressed as a single rotation with respect to some axes [74]. To discuss the rotation process further, we define the $n$ -dimensional source vector as $S_{1}\in R^{n}$ and let $\theta$ be the angle to rotate. We first project $S_{1}$ onto the $2$ -dimensional rotation plane $W$ , which is the span of two orthogonal unit vectors $m,n\in R^{n}$ :

$$P_{1}=(S_{1}\cdot m)m+(S_{1}\cdot n)n \tag{1}$$

where $P_{1}$ is the projection of $S_{1}$ in $W$ . Then we can obtain the rest of $S_{1}$ :

$$R=S_{1}-P_{1} \tag{2}$$

where $R$ is the component of $S_{1}$ that is orthogonal to plane $W$ . Hence, it is unchanged by the rotation in $W$ . Next, we achieve the high-dimensional vector rotation by rotating $P_{1}$ in $W$ as in Equation 3, and then mapping the result $P_{2}$ back to the original space by adding $R$ , as described in Equation 4.

$$\begin{aligned}P_{2}&=Rot_{W,\theta}(P_{1})\\&=[(S_{1}\cdot m)\cos\theta-(S_{1}\cdot n)\sin\theta]\,m+[(S_{1}\cdot n)\cos\theta+(S_{1}\cdot m)\sin\theta]\,n\end{aligned} \tag{3}$$

$$\begin{aligned}S_{2}&=P_{2}+R\\&=S_{1}+\begin{bmatrix}m&n\end{bmatrix}\begin{bmatrix}\cos\theta-1&-\sin\theta\\ \sin\theta&\cos\theta-1\end{bmatrix}\begin{bmatrix}S_{1}\cdot m\\ S_{1}\cdot n\end{bmatrix}\end{aligned} \tag{4}$$

The whole rotation process is shown in Figure 4. In other words, the vector $S_{2}$ obtained by rotating $S_{1}$ by $\theta$ equals the projection $P_{1}$ rotated by $\theta$ in $W$ plus the residual $(S_{1}-P_{1})$ . In this work, our goal is to discover the rotation plane such that the rotated style vectors are cyclic. We therefore set two learnable vectors $\mu,\nu\in R^{n}$ to model the rotation plane. To enforce the orthogonal unit assumptions, we adopt Gram-Schmidt orthogonalization to obtain the rotation plane:

$$W=(m,n)=GramSchmidt(\mu,\nu). \tag{5}$$
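The rotation in Equations 1 to 5 can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the authors' implementation: the function names `gram_schmidt` and `rotate_in_plane`, the toy 4-dimensional style vector and the fixed values of `mu` and `nu` are assumptions made here, whereas in RoNet $\mu$ and $\nu$ are learned and the style vector comes from the style encoder.

```python
import math

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def gram_schmidt(mu, nu):
    """Orthonormal basis (m_vec, n_vec) of the plane spanned by mu and nu (Eq. 5)."""
    norm_mu = math.sqrt(dot(mu, mu))
    m_vec = [x / norm_mu for x in mu]
    # Remove the component of nu along m_vec, then normalise the remainder.
    coeff = dot(nu, m_vec)
    resid = [y - coeff * x for x, y in zip(m_vec, nu)]
    norm_resid = math.sqrt(dot(resid, resid))
    n_vec = [x / norm_resid for x in resid]
    return m_vec, n_vec

def rotate_in_plane(s1, mu, nu, theta):
    """Rotate s1 by theta inside the plane spanned by mu and nu (Eqs. 1-4)."""
    m_vec, n_vec = gram_schmidt(mu, nu)
    a, b = dot(s1, m_vec), dot(s1, n_vec)
    # Eq. 1: projection P1 of s1 onto the plane; Eq. 2: residual R.
    p1 = [a * x + b * y for x, y in zip(m_vec, n_vec)]
    r = [s - p for s, p in zip(s1, p1)]
    # Eq. 3: rotate the in-plane coordinates (a, b) by theta.
    a_rot = a * math.cos(theta) - b * math.sin(theta)
    b_rot = b * math.cos(theta) + a * math.sin(theta)
    p2 = [a_rot * x + b_rot * y for x, y in zip(m_vec, n_vec)]
    # Eq. 4: add the unchanged residual back.
    return [p + q for p, q in zip(p2, r)]

# Toy usage (hypothetical values): eight evenly spaced angles around the circle.
style = [1.0, 2.0, 3.0, 4.0]
mu = [1.0, 0.0, 1.0, 0.0]
nu = [0.0, 1.0, 1.0, 0.0]
wheel = [rotate_in_plane(style, mu, nu, k * math.pi / 4) for k in range(8)]
```

In this sketch every rotated vector in `wheel` has the same norm as `style`, and a rotation by $2\pi$ returns the input, which is the cyclic behaviour the method relies on.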

III-B RoNet

The overview of the proposed RoNet is presented in Figure 5. There are four essential subnets in RoNet which are the content encoder $E_{c}$ , the style encoder $E_{s}$ , the rotation module and the Generator $G$ . Considering it is time- and labour-consuming to obtain the continuous training data, we employ the images in distinct key domains for learning, making the method weakly supervised. In each round of training, we feed the network with a pair of images $(I_{src},I_{tgt})$ that are from the source domain and the target domain respectively. For instance, the source image $I_{src}$ in Figure 5 is from the Summer domain, while the target image $I_{tgt}$ is from the Autumn domain.

Given an image, we use the content encoder $E_{c}$ to extract the domain-invariant representation $C$ , and the style encoder $E_{s}$ to extract the domain-variant representation $S$ . In a typical exemplar-based I2I translation approach, $I_{src}$ provides the content of the result, which should be kept the same, and $I_{tgt}$ specifies the target style of the generated image, i.e., $I_{mix}=G(C_{src},S_{tgt})=G(E_{c}(I_{src}),E_{s}(I_{tgt}))$ . However, as introduced above, RoNet is in charge of both information disentanglement and learning the proper plane for continuous rotation. Hence we implant a rotation module in the network to find the plane in Equation 5. Concretely, we first rotate the style vector of the source image $S_{src}$ to the target domain, and then use the rotated vector $S_{rot}$ to generate the image in the target domain, $I_{mix}=G(C_{src},S_{rot})$ . By training alternately and imposing constraints between $S_{rot}$ and $S_{tgt}$ , we learn the rotation plane along with the disentanglement in an end-to-end manner.

To extract the content representation and prevent semantic flipping, we adopt normalized content features and segmentation pseudo-labels [27] of $I_{src}$ and $I_{mix}$ for better visual quality during generation. Different from existing point-to-point translation that usually uses face or animal images in experiments, we apply RoNet to multi-domain scene generation. Generation targets such as faces have clear structures, but a scene image tends to consist of stuff without fixed shapes, e.g., sky and grass. However, after several rounds of downsampling, the content representation cannot provide the sketch information for such stuff. Thus we complement the generator with content features, which can be easily obtained by a VGG encoder [75]. On the other hand, the source and target domains often have a large semantic mismatch, which leads to corruption of the source content. Therefore hypervectors are obtained by Vector Symbolic Architectures (VSA) [27] to constrain the semantic information of the principal components between $I_{src}$ and $I_{mix}$ .

III-C Loss Optimization

This section introduces how to build loss functions to optimize the model.

III-C1 Adversarial Objective

During the training, the generator takes the content features $C$ and style features $S$ as inputs, learning to generate realistic images via an adversarial loss:

$$\begin{aligned}\mathcal{L}_{adv}&=\mathbb{E}\left[\log D_{y_{src}}(I_{src})\right]\\&+\mathbb{E}\left[\log(1-D_{y_{tgt}}(G(C_{src},S_{mix})))\right]\end{aligned} \tag{6}$$

where $D_{y}(\cdot)$ denotes the output of the discriminator $D$ corresponding to the domain $y$ . Notably, the style features $S_{mix}$ are either $S_{rot}$ or $S_{tgt}$ . Concretely, we feed $S_{rot}$ and $S_{tgt}$ alternately to the generator to better harmonize the style encoder $E_{s}$ and the rotation plane $W$ during training. This training strategy helps discover a better cyclic manifold and saves memory.

III-C2 Content Preservation

For better content representation, we use a pre-trained VGG network [75] to extract content feature maps for computing the content perceptual loss as follows:

$$\mathcal{L}_{con}=\mathbb{E}\left[\sum_{i}^{cl}\left\|\varphi_{i}(I_{mix})-\varphi_{i}(I_{src})\right\|_{2}\right] \tag{7}$$

where $\varphi_{i}(\cdot)$ denotes the features extracted from the $i$ -th layer in a pre-trained VGG network [75] and $\left\|\cdot\right\|_{2}$ denotes Mean Square Error (MSE). Besides, $cl$ represents layers { $conv4\_1$ , $conv5\_1$ }.

III-C3 VSA-Based Semantic Consistency

Although the adversarial and content losses provide the generator with powerful constraints, unaligned semantic information between domains results in semantic flipping. Therefore we adopt a VSA-based loss [27] to preserve the principal objects of the source domain.

$$\mathcal{L}_{VSA}=\mathbb{E}\left[1-dist(V_{src},V_{mix})\right] \tag{8}$$

where $dist(\cdot,\cdot)$ is the cosine distance. $V_{src}$ and $V_{mix}$ are hypervectors representing the semantic features of $I_{src}$ and $I_{mix}$ respectively, obtained by projecting image features with locality sensitive hashing (LSH) [76, 27].

III-C4 Patch-based Semantic Matching

To help the style encoder $E_{s}$ capture realistic texture, we propose a patch-based semantic style loss. In detail, the target image $I_{tgt}$ is uniformly divided into 16 patches, while $N$ patches $B^{N}_{mix}$ are randomly sampled from the generated image $I_{mix}$ . A cosine similarity matrix is then obtained by calculating the cosine similarity between the patch-based latent representations from the two domains, which are extracted by a pre-trained VGG encoder [75]. According to the semantic similarity, we match $N$ patches $B^{N}_{tgt}$ from $I_{tgt}$ that possess semantics analogous to $B^{N}_{mix}$ . Thus the same objects from different domains are matched for more accurate style learning. For example, the style of trees in $I_{mix}$ is learned from that of trees in $I_{tgt}$ . In this work, we adopt a DINO-ViT model [77] (a Vision Transformer model that has been pre-trained in a self-supervised manner) to obtain the $[CLS]$ token, which contains the semantic style information of the image. We therefore encourage style capture by minimizing the $[CLS]$ token distances, as shown in Figure 6.

$$\mathcal{L}_{sty}=\mathbb{E}\left[\sum_{i}^{N}\left\|e^{L}_{[CLS]}(B^{i}_{mix})-e^{L}_{[CLS]}(B^{i}_{tgt})\right\|_{2}\right] \tag{9}$$

where $e^{L}_{[CLS]}$ denotes the last layer $[CLS]$ token of DINO-ViT [77].
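The matching step and the loss in Equation 9 can be sketched as follows. This is a minimal illustration under stated assumptions: the short lists below are toy stand-ins for the VGG patch features and the DINO-ViT $[CLS]$ tokens (no network is run), the helper names `match_patches` and `style_loss` are hypothetical, each sampled patch is matched to its most similar target patch by an argmax over cosine similarity (the exact matching rule is not specified in the text), and $\left\|\cdot\right\|_{2}$ is read as a mean squared error as in Equation 7.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two feature vectors."""
    dot_uv = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot_uv / (norm_u * norm_v)

def match_patches(mix_feats, tgt_feats):
    """For each sampled patch of I_mix, index of the most similar patch of I_tgt."""
    matches = []
    for feat in mix_feats:
        sims = [cosine_similarity(feat, other) for other in tgt_feats]
        matches.append(max(range(len(sims)), key=sims.__getitem__))
    return matches

def style_loss(mix_tokens, tgt_tokens, matches):
    """Sum over matched pairs of the mean squared [CLS]-token difference (Eq. 9)."""
    total = 0.0
    for i, j in enumerate(matches):
        diffs = [(a - b) ** 2 for a, b in zip(mix_tokens[i], tgt_tokens[j])]
        total += sum(diffs) / len(diffs)
    return total

# Toy usage: 3 target patches, 2 sampled generated patches (hypothetical features).
tgt_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
mix_feats = [[0.1, 0.9], [2.0, 0.1]]
matches = match_patches(mix_feats, tgt_feats)

tgt_tokens = [[0.5, 0.5], [0.2, 0.8], [0.9, 0.1]]
mix_tokens = [[0.2, 0.6], [0.5, 0.5]]
loss = style_loss(mix_tokens, tgt_tokens, matches)
```

The point of the matching is that the style distance is only computed between patches that are semantically alike, so a tree patch in the generated image is compared with a tree patch in the target image rather than with sky or ground.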

III-C5 Style Alignment

To make the style features lie on the cyclic manifold as much as possible, we impose a loss between $S_{rot}$ and $S_{tgt}$ , ensuring that the style features can be rotated from the source domain to the target domain.

$$\mathcal{L}_{mse}=\mathbb{E}\left[\left\|S_{rot}-S_{tgt}\right\|_{2}\right] \tag{10}$$

Notably, this loss term is only active when the input of the generator is $S_{rot}$ , which helps the encoder and the rotation plane adapt to each other.

III-C6 Image Reconstruction

To further guarantee that the generator $G$ learns to preserve the original domain-invariant characteristics, we build a reconstruction loss:

$$\mathcal{L}_{rec}=\mathbb{E}\left[||G(C_{src},S_{src})-I_{src}||_{1}\right]. \tag{11}$$

Therefore, the network is trained by minimizing the loss function defined as:

$$\begin{aligned}\mathcal{L}&=\lambda_{adv}\mathcal{L}_{adv}+\lambda_{con}\mathcal{L}_{con}+\lambda_{VSA}\mathcal{L}_{VSA}\\&+\lambda_{rec}\mathcal{L}_{rec}+\lambda_{mse}\mathcal{L}_{mse}+\lambda_{sty}\mathcal{L}_{sty}\end{aligned} \tag{12}$$

where $\lambda_{adv}$ , $\lambda_{con}$ , $\lambda_{VSA}$ , $\lambda_{rec}$ , $\lambda_{mse}$ , $\lambda_{sty}$ are the hyper-parameters to balance each item.
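As a rough sketch of how Equation 12 combines the terms, the snippet below forms the weighted sum with the weights reported in the implementation details (Sec. IV-A1) and drops the style alignment term when the generator is fed $S_{tgt}$ rather than $S_{rot}$ , as stated in Sec. III-C5. The function name `total_loss`, the flag `use_rotated_style` and the numeric loss values are hypothetical and serve only as an illustration.

```python
# Loss weights as reported in the implementation details (Sec. IV-A1).
WEIGHTS = {"adv": 1.0, "con": 1.0, "VSA": 3.0, "rec": 50.0, "mse": 150.0, "sty": 1.0}

def total_loss(terms, use_rotated_style):
    """Weighted sum of the loss terms in Eq. 12.

    terms: dict mapping each loss name to its current scalar value.
    use_rotated_style: True when the generator input is S_rot, False for S_tgt;
    the style alignment term L_mse is only counted in the former case.
    """
    total = 0.0
    for name, weight in WEIGHTS.items():
        if name == "mse" and not use_rotated_style:
            continue
        total += weight * terms[name]
    return total

# Toy usage with made-up loss values.
terms = {"adv": 0.7, "con": 0.3, "VSA": 0.1, "rec": 0.02, "mse": 0.01, "sty": 0.4}
loss_rot = total_loss(terms, use_rotated_style=True)
loss_tgt = total_loss(terms, use_rotated_style=False)
```

How often the two cases alternate during training is not specified in the text, so the flag is left to the caller.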

IV Experiments

IV-A Experimental Setting

IV-A1 Implementation Details

During training, we use the Adam optimizer [78] with $\beta_{1}=0.0$ and $\beta_{2}=0.99$ , and a learning rate of 0.0001. Besides, we empirically set the hyperparameters $\lambda_{adv}$ , $\lambda_{con}$ , $\lambda_{VSA}$ , $\lambda_{rec}$ , $\lambda_{mse}$ and $\lambda_{sty}$ to 1, 1, 3, 50, 150 and 1 respectively, to balance the losses. All the experiments are conducted with Python 3.7.3 and PyTorch 1.7.1 on an Ubuntu 18.04 system with a single 32G Tesla V100 GPU. In the training stage, all images are randomly cropped to 512 $\times$ 512, while in the inference stage, any image size is supported.

IV-A2 Datasets

Our proposed framework can be applied to different scenes. The experiments are conducted on the following datasets.

1. Season Album: For seasonal image translation, we collect 4000 training images and 1000 testing images for each season from flickr.com. The resolution of all seasonal images is greater than 1024, to keep the details of seasonal characteristics such as the color of leaves and the texture of snow.

2. Comic Face: We obtain a real and comic face dataset from kaggle.com, which is used for real face stylization. There are 1500 real identities and 1500 anime faces for training, and 500 images for testing.

3. Waymo Open Dataset [79]: It contains high-resolution sensor data collected by autonomous vehicles operated by the Waymo Driver in a wide variety of conditions. The scenes of the dataset are selected from both suburban and urban areas at different moments of the day. The dataset is released to support advancements in machine perception and self-driving technology. For the timeshift task, we split clear images into four domains {day, dusk, night, dawn} according to Waymo image labels, obtaining train / test sets of 27272 / 7682 images.

4. Iphone2dslr Flowers [3]: The dataset is used to map iPhone images with large depth of field to DSLR images with shallow depth of field. There are 1812 / 569 iPhone images and 3325 / 480 DSLR images for training / testing.

IV-A3 Baselines

We compare our method with the following state-of-the-art methods, which accomplish continuous image-to-image translation through different interpolation schemes.

1. StarGAN v2 [19] is a state-of-the-art multi-domain translation network. It can map an input image to multiple defined domains with a single model. We train the model with the public code released by the authors and use its disentangled style code to enhance the continuous effects with linear interpolation.

2. DLOW [17] realizes continuous translation by generating a continuous sequence of intermediate labels between two domains. In other words, it is a method based on interpolated labels.

3. DRIT [6] is able to generate diverse results within a certain domain, but is not directly suitable for multi-domain translation. Thus we train the model between every two domains and apply interpolation to the models to obtain the continuous results.

4. Fast Photo Style [7] is a style transfer method based on disentanglement learning and can generate continuous transfer results via linear interpolation.

5. VSAIT [27] is a paradigm for image-to-image translation that sets VSA-based constraints on adversarial learning, achieving continuous image generation by applying the interpolation to the models.

6. CoMoGAN [71] achieves cyclic continuous translation with the guidance of physics-inspired models, but is limited in scenarios without decent physical models, such as season transition.

7. DiffuseIT [80] is a score-based model that accomplishes image translation by introducing a loss function to control the diffusion process, but there are instances of content and style leakage in the generated results.

IV-A4 Evaluation Metrics

We choose the following quantitative evaluation metrics to demonstrate the effectiveness of our proposed framework.

1. Learned Perceptual Image Patch Similarity (LPIPS) [84] is based on the VGG [75] and AlexNet [81] network architectures and evaluates the distance between image patches. A higher value means the patches are more different, while a lower value means they are more similar.

2. Fréchet Inception Distance (FID) [82] is a metric that compares the distribution of generated images with the distribution of a set of real images, by calculating the distance between feature vectors of real and generated images. The FID is the current standard metric for assessing the quality of generative models.

3. Kernel Inception Distance (KID) [83] calculates the squared Maximum Mean Discrepancy (MMD) between the Inception representations of the real and generated images, via a polynomial kernel. Similar to the FID [82], lower values indicate closer distances between the distributions of generated and real data.
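To make the KID definition concrete, the sketch below computes the unbiased squared-MMD estimate with the cubic polynomial kernel $k(x,y)=(x\cdot y/d+1)^{3}$ that is commonly used for KID. The kernel settings used in our experiments are not restated here, so this choice is an assumption, and the short vectors are toy stand-ins for Inception features; the function names are hypothetical.

```python
def poly_kernel(x, y):
    """Cubic polynomial kernel (x.y / d + 1)^3, with d the feature dimension."""
    d = len(x)
    return (sum(a * b for a, b in zip(x, y)) / d + 1.0) ** 3

def mmd2_unbiased(real, fake):
    """Unbiased estimate of the squared MMD between two sets of feature vectors."""
    m, n = len(real), len(fake)
    # Within-set terms exclude the diagonal (i == j) to keep the estimate unbiased.
    k_rr = sum(poly_kernel(real[i], real[j])
               for i in range(m) for j in range(m) if i != j)
    k_ff = sum(poly_kernel(fake[i], fake[j])
               for i in range(n) for j in range(n) if i != j)
    # The cross term uses every (real, fake) pair.
    k_rf = sum(poly_kernel(r, f) for r in real for f in fake)
    return k_rr / (m * (m - 1)) + k_ff / (n * (n - 1)) - 2.0 * k_rf / (m * n)

# Toy usage: a nearby and a distant set of "generated" features.
real_feats = [[1.0, 0.0], [0.0, 1.0]]
near_feats = [[1.0, 0.0], [0.0, 1.0]]
far_feats = [[5.0, 5.0], [6.0, 6.0]]
kid_near = mmd2_unbiased(real_feats, near_feats)
kid_far = mmd2_unbiased(real_feats, far_feats)
```

Because the estimator is unbiased rather than non-negative, it can return small negative values when the two sets are very similar and the sample is small; in practice KID is usually averaged over several random subsets of the features.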

IV-B Season Shifting

IV-B1 Continuous Translation

Fig.7 presents the continuous translation results in high definition. Please zoom in for more realistic details. In nature, most plants turn dark yellow when autumn comes. RoNet captures this smooth variation successfully with the help of the in-plane rotation. For a typical forest scene such as the first row, the complex texture is quite challenging for the network to learn, easily causing artifacts or blur. RoNet preserves the texture well thanks to the patch-based semantic style loss. In the penultimate row, the plants change with the seasons while the river keeps a similar appearance, which agrees with the natural order. Furthermore, RoNet is capable of cyclic continuous translation with a single input. We present raw results in Fig.1, which are also the materials for the corresponding time-lapse demo in the supplementary material.

IV-B2 Comparison

We compare RoNet with leading approaches in closely related fields, including multi-domain translation, continuous translation based on linear interpolation, and translation based on disentangled representations (StarGAN v2 [19], DLOW [17], DRIT [6], Fast Photo Style [7], VSAIT [27], DiffuseIT [80]). Note that all the approaches are trained and tested under the same protocol. The comparison is exhibited in Fig.8. For StarGAN v2 [19], we provide the style representations of different target domains to implement the multi-domain translation. Thus the 2nd, 5th, 8th and last columns of StarGAN v2 [19] are guided by certain domain styles, while the rest are the corresponding interpolation results over the style vectors. For the remaining rows, models are trained between Spring and Winter as they are essentially point-to-point approaches, but applied with different interpolation schemes: DLOW [17] with label interpolation, Fast Photo Style [7] and DiffuseIT [80] with representation interpolation, DNI-DRIT [6] and DNI-VSAIT [27] with model interpolation. RoNet achieves the most appealing results in terms of both visual quality and continuity.

TABLE I: Metrics comparison of RoNet and existing approaches.

Method	FID $\downarrow$					LPIPS $\downarrow$					KID $\times 10^{-3}\downarrow$
Method	Spr	Sum	Aut	Win	mean	Spr	Sum	Aut	Win	mean	Spr	Sum	Aut	Win	mean
StarGAN v2 [19]	45.8	74.2	81.8	66.3	67.0	0.675	0.685	0.748	0.738	0.712	34.1	39.2	43.9	35.2	38.1
DLOW [17]	75.3	60.0	88.2	78.7	75.5	0.487	0.489	0.524	0.530	0.508	43.7	23.5	46.6	38.4	38.1
DRIT [6]	58.3	47.2	52.8	57.9	54.1	0.306	0.306	0.363	0.457	0.357	35.8	23.4	27.1	30.2	29.1
Fast Photo Style [7]	102.4	80.1	82.1	76.2	85.2	0.421	0.421	0.510	0.440	0.448	73.4	54.9	51.1	50.8	57.6
VSAIT [27]	59.7	52.8	60.4	69.1	60.5	0.261	0.240	0.308	0.235	0.261	37.4	30.0	34.4	41.2	35.7
DiffuseIT [80]	60.8	49.1	55.6	49.3	53.7	0.512	0.517	0.577	0.538	0.536	41.1	24.6	30.0	22.5	29.5
RoNet	55.7	43.3	57.1	55.7	52.9	0.265	0.238	0.256	0.176	0.234	36.8	20.5	27.8	26.2	27.8

TABLE II: Ablation studies.

Method	FID $\downarrow$					LPIPS $\downarrow$					KID $\times 10^{-3}\downarrow$
Method	Spr	Sum	Aut	Win	mean	Spr	Sum	Aut	Win	mean	Spr	Sum	Aut	Win	mean
A: Baseline $\mathcal{L}_{adv}$	102.1	66.0	76.6	72.2	79.2	0.394	0.378	0.464	0.378	0.404	77.2	38.0	40.2	39.5	48.7
B: + $\mathcal{L}_{sty}$	65.3	66.3	72.1	74.1	69.5	0.461	0.413	0.523	0.553	0.488	47.9	52.3	39.7	39.4	44.8
C: + $\mathcal{L}_{con}$	58.7	53.6	59.1	63.9	58.8	0.254	0.248	0.309	0.366	0.294	34.9	29.1	28.4	37.3	32.5
D: + $\mathcal{L}_{rec}$	57.1	52.9	57.9	71.0	59.7	0.248	0.218	0.233	0.350	0.262	35.7	29.2	29.7	40.5	33.8
E: + $\mathcal{L}_{mse}$	56.3	46.3	57.7	58.6	54.7	0.290	0.238	0.299	0.340	0.292	35.7	23.4	31.5	28.3	29.7
F: + $\mathcal{L}_{VSA}$	55.7	43.3	57.1	55.7	52.9	0.265	0.238	0.256	0.176	0.234	36.8	20.5	27.8	26.2	27.8
G: $w/o$ Rotation	59.5	54.4	65.0	67.3	61.5	0.228	0.228	0.261	0.267	0.246	41.2	31.1	39.7	35.6	36.9

TABLE III: Metrics comparison of RoNet and existing approaches on the timeshift task.

Method	FID $\downarrow$					LPIPS $\downarrow$					KID $\times 10^{-3}\downarrow$
Method	Dawn	Day	Dusk	Night	mean	Dawn	Day	Dusk	Night	mean	Dawn	Day	Dusk	Night	mean
StarGAN v2 [19]	215.9	120.5	163.8	161.7	165.5	0.573	0.546	0.515	0.660	0.573	156.3	142.0	139.9	184.2	155.6
DLOW [17]	184.9	141.2	153.1	134.4	153.4	0.352	0.354	0.340	0.458	0.376	151.3	144.0	143.4	109.8	137.1
DRIT [6]	196.6	156.2	208.2	159.9	180.2	0.407	0.393	0.371	0.627	0.450	153.3	144.1	222.1	118.5	159.5
Fast Photo Style [7]	159.1	162.0	190.0	112.6	155.9	0.317	0.323	0.300	0.523	0.366	114.8	143.2	189.9	94.9	135.7
VSAIT [27]	114.2	93.2	104.6	100.6	103.4	0.153	0.174	0.174	0.438	0.235	64.0	57.2	77.4	54.7	63.3
CoMoGAN [71]	141.2	103.9	141.0	117.9	126.0	0.284	0.283	0.308	0.338	0.303	84.0	67.3	128.1	100.7	95.0
DiffuseIT [80]	108.3	76.7	103.1	109.5	99.4	0.422	0.424	0.391	0.619	0.464	55.6	55.1	80.6	77.2	67.1
RoNet	107.6	75.1	87.4	93.4	90.9	0.178	0.209	0.195	0.212	0.199	53.9	31.7	49.6	46.3	45.4

IV-B3 Quantitative Analysis

To evaluate the performance of our model objectively, we report the quantitative metrics of the different approaches in Table I. Three metrics are employed to evaluate these approaches from different perspectives. LPIPS [84] evaluates the structural difference between the source images and the generated images, while FID [82] and KID [83] measure the distance between the distribution of the target images and that of the generated images. For all three metrics, smaller is better. Note that all comparative experiments are conducted on the test sets of the datasets. It is observed that RoNet achieves the lowest (i.e., best) mean value on all three metrics, verifying its superiority over the others.
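As a quick sanity check of the ranking, the per-season values of Table I can be averaged and compared directly. The following is a minimal sketch; the dictionary layout and the helper names `mean` and `best_method` are illustrative assumptions, and the numbers are copied from Table I.

```python
# Per-season scores copied from Table I, ordered (Spr, Sum, Aut, Win).
# Lower is better for all three metrics; KID is reported in units of 1e-3.
TABLE_I = {
    "StarGAN v2": {"FID": [45.8, 74.2, 81.8, 66.3],
                   "LPIPS": [0.675, 0.685, 0.748, 0.738],
                   "KID": [34.1, 39.2, 43.9, 35.2]},
    "DLOW": {"FID": [75.3, 60.0, 88.2, 78.7],
             "LPIPS": [0.487, 0.489, 0.524, 0.530],
             "KID": [43.7, 23.5, 46.6, 38.4]},
    "DRIT": {"FID": [58.3, 47.2, 52.8, 57.9],
             "LPIPS": [0.306, 0.306, 0.363, 0.457],
             "KID": [35.8, 23.4, 27.1, 30.2]},
    "Fast Photo Style": {"FID": [102.4, 80.1, 82.1, 76.2],
                         "LPIPS": [0.421, 0.421, 0.510, 0.440],
                         "KID": [73.4, 54.9, 51.1, 50.8]},
    "VSAIT": {"FID": [59.7, 52.8, 60.4, 69.1],
              "LPIPS": [0.261, 0.240, 0.308, 0.235],
              "KID": [37.4, 30.0, 34.4, 41.2]},
    "DiffuseIT": {"FID": [60.8, 49.1, 55.6, 49.3],
                  "LPIPS": [0.512, 0.517, 0.577, 0.538],
                  "KID": [41.1, 24.6, 30.0, 22.5]},
    "RoNet": {"FID": [55.7, 43.3, 57.1, 55.7],
              "LPIPS": [0.265, 0.238, 0.256, 0.176],
              "KID": [36.8, 20.5, 27.8, 26.2]},
}


def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


def best_method(table, metric):
    """Return the method whose mean score on `metric` is lowest."""
    return min(table, key=lambda name: mean(table[name][metric]))


if __name__ == "__main__":
    for metric in ("FID", "LPIPS", "KID"):
        name = best_method(TABLE_I, metric)
        print(f"{metric}: best mean is {name} "
              f"({mean(TABLE_I[name][metric]):.3f})")
```

Note that the comparison is made on the mean over the four seasons; on individual seasons another method can be lower (for example, StarGAN v2 on Spring FID).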

IV-B4 Ablation Study

To analyse the contribution of each loss function, we conduct ablation studies and report the results in Tab.II. The baseline is trained with the adversarial loss only, and the remaining loss functions are then added one at a time. When the patch-based semantic style loss is added first, LPIPS degrades markedly (mean 0.404 to 0.488), even though FID and KID improve. The reason is that these two loss functions focus more on visual realism but are insufficient for model convergence. With the contribution of the content loss, the reconstruction loss and the MSE loss, the performance improves significantly. Finally, the full model F achieves the best results. The visual effect of each loss function is shown in Fig.9, where the sky artifacts decrease and the semantic style becomes more realistic. In addition, we remove the rotation module (variant G) to test its impact on the model's performance. The results are shown in Tab.II and Fig.9. The findings demonstrate that the rotation module significantly enhances the model's ability to learn vibrant styles, even without reference style images.
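The step-by-step effect of each added loss can be read off Tab.II as a relative change of the mean scores. The sketch below is only an illustration of that arithmetic; the names `ABLATION_MEANS`, `relative_change` and `step_changes` are assumptions introduced here, and the values are the mean columns of Tab.II.

```python
# Mean scores copied from Tab.II as (FID, LPIPS, KID x 1e-3); lower is better.
ABLATION_MEANS = {
    "A": (79.2, 0.404, 48.7),  # baseline, adversarial loss only
    "B": (69.5, 0.488, 44.8),  # + style loss
    "C": (58.8, 0.294, 32.5),  # + content loss
    "D": (59.7, 0.262, 33.8),  # + reconstruction loss
    "E": (54.7, 0.292, 29.7),  # + MSE loss
    "F": (52.9, 0.234, 27.8),  # + VSA loss (full model)
    "G": (61.5, 0.246, 36.9),  # without the rotation module
}
METRICS = ("FID", "LPIPS", "KID")


def relative_change(before, after):
    """Per-metric relative change (after - before) / before."""
    return tuple((a - b) / b for b, a in zip(before, after))


def step_changes(means, order):
    """Relative change between each consecutive pair of variants in `order`."""
    return {
        f"{prev}->{cur}": relative_change(means[prev], means[cur])
        for prev, cur in zip(order, order[1:])
    }


if __name__ == "__main__":
    for step, deltas in step_changes(ABLATION_MEANS, list("ABCDEF")).items():
        row = ", ".join(f"{m} {d:+.1%}" for m, d in zip(METRICS, deltas))
        print(f"{step}: {row}")
    # Effect of removing the rotation module from the full model.
    drop = relative_change(ABLATION_MEANS["F"], ABLATION_MEANS["G"])
    print("F->G:", ", ".join(f"{m} {d:+.1%}" for m, d in zip(METRICS, drop)))
```

A positive change means the metric got worse. Going from F to G, all three mean scores increase, which is consistent with the observation that the rotation module contributes to the final performance.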

IV-B5 Interpolation Effectiveness

We apply different interpolation schemes to the state-of-the-art methods: DLOW [17] with label interpolation, StarGAN v2 [19], DiffuseIT [80] and Fast Photo Style [7] with representation interpolation, and DNI-DRIT [6] and DNI-VSAIT [27] with model interpolation. As shown in Fig.10, we use a spring image as the input source and a winter image as the target style. It is observed that StarGAN v2 [19], DiffuseIT [80] and Fast Photo Style [7] suffer from content leak, while DNI-DRIT [6] and DNI-VSAIT [27] exhibit serious artifacts. Moreover, linear interpolation is not always valid: between spring and winter, these methods should generate intermediate results corresponding to the summer and autumn seasons. DLOW [17] and RoNet achieve the full seasonal circulation without the target image, but DLOW [17] exhibits more style leak.
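The difference between linear interpolation and in-plane rotation of a style vector can be illustrated numerically. The following is a minimal sketch under stated assumptions: the plane is fixed by hand through two spanning vectors (in RoNet the plane is learned by the rotation module), the style vector is a toy 4-dimensional example, and all function names are hypothetical. It only demonstrates the geometric point that rotation needs a single input vector and preserves its magnitude, whereas linear interpolation needs both ends and shrinks the vector in between.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    return math.sqrt(dot(a, a))


def orthonormal_pair(p, q):
    """Gram-Schmidt: orthonormal basis (u, v) of the plane spanned by p and q."""
    n_p = norm(p)
    u = [x / n_p for x in p]
    proj = dot(q, u)
    w = [x - proj * y for x, y in zip(q, u)]
    n_w = norm(w)
    v = [x / n_w for x in w]
    return u, v


def rotate_in_plane(s, u, v, theta):
    """Rotate s by theta inside the plane spanned by orthonormal u and v.

    The component of s orthogonal to the plane is left unchanged.
    """
    a, b = dot(s, u), dot(s, v)          # in-plane coordinates of s
    c, sn = math.cos(theta), math.sin(theta)
    du = (c - 1.0) * a - sn * b          # change along u
    dv = sn * a + (c - 1.0) * b          # change along v
    return [x + du * ui + dv * vi for x, ui, vi in zip(s, u, v)]


def lerp(s0, s1, t):
    """Linear interpolation between two given end vectors."""
    return [(1.0 - t) * x + t * y for x, y in zip(s0, s1)]


if __name__ == "__main__":
    u, v = orthonormal_pair([1.0, 1.0, 0.0, 0.0], [0.0, 1.0, 1.0, 0.0])
    style = [1.0, 2.0, 0.5, -1.0]                  # toy style vector
    end = rotate_in_plane(style, u, v, math.pi)    # far end, needed by lerp only
    for k in range(5):
        t = k / 4
        rotated = rotate_in_plane(style, u, v, t * math.pi)
        blended = lerp(style, end, t)
        print(f"t={t:.2f}  |rotated|={norm(rotated):.4f}  "
              f"|lerp|={norm(blended):.4f}")
```

In this toy setting the rotated vector keeps a constant norm for every angle, while the norm of the linearly interpolated vector dips at intermediate values of `t`.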

IV-B6 High Resolution

Since RoNet is built on convolutional neural networks, it is capable of producing images at varying scales. As illustrated in Fig.11, images generated by RoNet at a resolution of $1024\times 1600$ pixels clearly reveal the intricate texture details of leaves.

IV-C Time Shifting

IV-C1 Difference

Recently, an unsupervised generation network named CoMoGAN [71] has been proposed to learn non-linear continuous translations. Despite its unsupervised training manner, CoMoGAN relies on physics-inspired models to guide the learning process. Taking the cyclic translation task of “day to any time” as an example, CoMoGAN first renders a daytime image with a color-based model to obtain images at any time of day, and then uses the series of rendered data as supervision. This leaves CoMoGAN limited by the need to find a proper physical model. When confronted with more challenging tasks such as season transition, CoMoGAN is not applicable, since it is hard to find a capable physical model for data rendering. In contrast, our proposed framework is not constrained by physical guidance and applies to arbitrary scene domains, such as the seasonal variation shown in Fig.7 and the time shifting shown in Fig.13.

IV-C2 Comparison

We compare RoNet with several SOTA approaches (StarGAN v2 [19], DLOW [17], Fast Photo Style [7], DRIT [6], CoMoGAN [71], VSAIT [27], DiffuseIT [80]). From Fig.12 and Tab.III, it is observed that StarGAN v2 [19] and DiffuseIT [80] suffer from serious semantic distortion, while Fast Photo Style [7], DLOW [17], DNI-VSAIT [27] and DNI-DRIT [6] produce unrealistic results. Owing to the guidance of its physics-inspired function, CoMoGAN [71] learns more color- and pixel-wise information. In contrast, our model captures the style from the data distribution, such as the street lights at night and the sunsets at dusk.

IV-D Other Tasks

Experiments are also conducted on the tasks of $real\;face\to comic\;portrait$ and $iphone\to dslr$, and the visual results are presented in Fig.14 and Fig.16, respectively. For real face to comic, we compare with four other approaches (StyTr [26], SANet [24], Linear [61], and CCPL [52]) that realize continuous translation with interpolation. From Fig.15, it is observed that RoNet yields the most appealing results with rotation, and the changes of pupil color and hairstyle are continuous. The results from iPhone to DSLR also verify the generalization ability of RoNet.

IV-E Discussion

IV-E1 Generalization

Our proposed method is a general framework that can be applied to most natural scenes with continuous variation, including seasonal circulation and time shifting. However, certain special circumstances need to be considered. For example, in normal situations, daytime lasts longer than night, dawn or dusk, as shown in Fig.18. Additionally, in the southern hemisphere, spring and summer last longer than autumn and winter, as shown in Fig.17. It is important to account for these special circumstances to ensure the effectiveness of our method in such situations.

IV-E2 Automation

learning imbalanced datasets with label distribution aware margin loss [85]

V Conclusion

Aiming at continuous I2I translation, this paper proposes to embed the disentangled style representation under an annular manifold constraint, so that continuous generation can be achieved by rotating the style representation arbitrarily in the proper plane. To this end, RoNet is designed by implanting a rotation module in the generation network and adding a new patch-based semantic style loss. Different from the typical linear interpolation, the rotation is capable of moving the style representation from one domain to another with a single input while keeping the magnitude of the representation. Experiments on various translation scenarios (involving seasons, faces, solar days and camera effects) are conducted. The qualitative and quantitative results demonstrate that our method not only generates the most promising results with plausible transitions compared with the others, but also achieves better performance on the quantitative metrics. In the future, we plan to study more complex manifolds in continuous translation for better generality.

References

[1] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, “Image-to-image translation with conditional adversarial networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1125–1134.
[2] A. Pumarola, A. Agudo, A. M. Martinez, A. Sanfeliu, and F. Moreno-Noguer, “Ganimation: One-shot anatomically consistent facial animation,” International Journal of Computer Vision, vol. 128, no. 3, pp. 698–713, 2020.
[3] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired image-to-image translation using cycle-consistent adversarial networks,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2223–2232.
[4] X. Huang, M.-Y. Liu, S. Belongie, and J. Kautz, “Multimodal unsupervised image-to-image translation,” in Proceedings of the European Conference on Computer Vision, 2018, pp. 172–189.
[5] Y.-C. Chen, X. Xu, Z. Tian, and J. Jia, “Homomorphic latent space interpolation for unpaired image-to-image translation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 2408–2416.
[6] H.-Y. Lee, H.-Y. Tseng, J.-B. Huang, M. Singh, and M.-H. Yang, “Diverse image-to-image translation via disentangled representations,” in Proceedings of the European Conference on Computer Vision, 2018, pp. 35–51.
[7] Y. Li, M.-Y. Liu, X. Li, M.-H. Yang, and J. Kautz, “A closed-form solution to photorealistic image stylization,” in Proceedings of the European Conference on Computer Vision, 2018, pp. 453–468.
[8] H. Chen, L. Zhao, H. Zhang, Z. Wang, Z. Zuo, A. Li, W. Xing, and D. Lu, “Diverse image style transfer via invertible cross-space mapping,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 14 880–14 889.
[9] J. Hoffman, E. Tzeng, T. Park, J.-Y. Zhu, P. Isola, K. Saenko, A. Efros, and T. Darrell, “Cycada: Cycle-consistent adversarial domain adaptation,” in International Conference on Machine Learning. PMLR, 2018, pp. 1989–1998.
[10] Z. Murez, S. Kolouri, D. Kriegman, R. Ramamoorthi, and K. Kim, “Image to image translation for domain adaptation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 4500–4509.
[11] Y. Li, L. Yuan, and N. Vasconcelos, “Bidirectional learning for domain adaptation of semantic segmentation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 6936–6945.
[12] Y. Li, H. Huang, J. Cao, R. He, and T. Tan, “Disentangled representation learning of makeup portraits in the wild,” International Journal of Computer Vision, vol. 128, no. 8, pp. 2166–2184, 2020.
[13] Y. Han, J. Yang, and Y. Fu, “Disentangled face attribute editing via instance-aware latent space search,” in Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, 2021, pp. 715–721.
[14] T. Xiao, J. Hong, and J. Ma, “Dna-gan: Learning disentangled representations from multi-attribute images,” arXiv preprint arXiv:1711.05415, 2017.
[15] J. Zhang, Y. Huang, Y. Li, W. Zhao, and L. Zhang, “Multi-attribute transfer via disentangled representation,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, no. 01, 2019, pp. 9195–9202.
[16] X. Wang, K. Yu, C. Dong, X. Tang, and C. C. Loy, “Deep network interpolation for continuous imagery effect transition,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 1692–1701.
[17] R. Gong, W. Li, Y. Chen, and L. V. Gool, “Dlow: Domain flow for adaptation and generalization,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 2477–2486.
[18] R. Gong, D. Dai, Y. Chen, W. Li, D. P. Paudel, and L. Van Gool, “Analogical image translation for fog generation,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, no. 2, 2021, pp. 1433–1441.
[19] Y. Choi, Y. Uh, J. Yoo, and J.-W. Ha, “Stargan v2: Diverse image synthesis for multiple domains,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 8188–8197.
[20] Y. Choi, M. Choi, M. Kim, J.-W. Ha, S. Kim, and J. Choo, “Stargan: Unified generative adversarial networks for multi-domain image-to-image translation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 8789–8797.
[21] T. Karras, S. Laine, and T. Aila, “A style-based generator architecture for generative adversarial networks,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 4401–4410.
[22] T. Karras, S. Laine, M. Aittala, J. Hellsten, J. Lehtinen, and T. Aila, “Analyzing and improving the image quality of stylegan,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 8110–8119.
[23] T. Karras, M. Aittala, S. Laine, E. Härkönen, J. Hellsten, J. Lehtinen, and T. Aila, “Alias-free generative adversarial networks,” Advances in Neural Information Processing Systems, vol. 34, pp. 852–863, 2021.
[24] D. Y. Park and K. H. Lee, “Arbitrary style transfer with style-attentional networks,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 5880–5888.
[25] S. Liu, T. Lin, D. He, F. Li, M. Wang, X. Li, Z. Sun, Q. Li, and E. Ding, “Adaattn: Revisit attention mechanism in arbitrary neural style transfer,” in IEEE International Conference on Computer Vision (ICCV), 2021, pp. 6649–6658.
[26] Y. Deng, F. Tang, W. Dong, C. Ma, X. Pan, L. Wang, and C. Xu, “Stytr$^2$: Image style transfer with transformers,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
[27] J. Theiss, J. Leverett, D. Kim, and A. Prakash, “Unpaired image translation via vector symbolic architectures,” in European Conference on Computer Vision. Springer, 2022, pp. 17–32.
[28] L. Jiang, C. Zhang, M. Huang, C. Liu, J. Shi, and C. C. Loy, “Tsit: A simple and versatile framework for image-to-image translation,” in European Conference on Computer Vision. Springer, 2020, pp. 206–222.
[29] T. Park, A. A. Efros, R. Zhang, and J.-Y. Zhu, “Contrastive learning for unpaired image-to-image translation,” in European conference on computer vision. Springer, 2020, pp. 319–345.
[30] T. Park, J.-Y. Zhu, O. Wang, J. Lu, E. Shechtman, A. Efros, and R. Zhang, “Swapping autoencoder for deep image manipulation,” Advances in Neural Information Processing Systems, vol. 33, pp. 7198–7211, 2020.
[31] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial networks,” Advances in Neural Information Processing Systems, vol. 3, pp. 2672–2680, 2014.
[32] M.-Y. Liu, X. Huang, A. Mallya, T. Karras, T. Aila, J. Lehtinen, and J. Kautz, “Few-shot unsupervised image-to-image translation,” in Proceedings of the IEEE/CVF international conference on computer vision, 2019, pp. 10 551–10 560.
[33] K. Saito, K. Saenko, and M.-Y. Liu, “Coco-funit: Few-shot unsupervised image translation with a content conditioned style encoder,” in Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part III 16. Springer, 2020, pp. 382–398.
[34] Y. Shen, J. Gu, X. Tang, and B. Zhou, “Interpreting the latent space of gans for semantic face editing,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 9243–9252.
[35] G. Yang, N. Fei, M. Ding, G. Liu, Z. Lu, and T. Xiang, “L2m-gan: Learning to manipulate latent space semantics for facial attribute editing,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 2951–2960.
[36] J. Zhu, Y. Shen, D. Zhao, and B. Zhou, “In-domain gan inversion for real image editing,” in European conference on computer vision. Springer, 2020, pp. 592–608.
[37] A. Voynov and A. Babenko, “Unsupervised discovery of interpretable directions in the gan latent space,” in International conference on machine learning. PMLR, 2020, pp. 9786–9796.
[38] E. Härkönen, A. Hertzmann, J. Lehtinen, and S. Paris, “Ganspace: Discovering interpretable gan controls,” Advances in Neural Information Processing Systems, vol. 33, pp. 9841–9850, 2020.
[39] Y. Shen and B. Zhou, “Closed-form factorization of latent semantics in gans,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 1532–1540.
[40] X. Xie, Y. Li, H. Huang, H. Fu, W. Wang, and Y. Guo, “Artistic style discovery with independent components,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 19 870–19 879.
[41] A. Hyvarinen, “Fast and robust fixed-point algorithms for independent component analysis,” IEEE transactions on Neural Networks, vol. 10, no. 3, pp. 626–634, 1999.
[42] Z. Koldovsky, P. Tichavsky, and E. Oja, “Efficient variant of algorithm fastica for independent component analysis attaining the cramér-rao lower bound,” IEEE Transactions on neural networks, vol. 17, no. 5, pp. 1265–1277, 2006.
[43] L. A. Gatys, A. S. Ecker, and M. Bethge, “Image style transfer using convolutional neural networks,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 2414–2423.
[44] L. A. Gatys, M. Bethge, A. Hertzmann, and E. Shechtman, “Preserving color in neural artistic style transfer,” arXiv preprint arXiv:1606.05897, 2016.
[45] Y. Li, N. Wang, J. Liu, and X. Hou, “Demystifying neural style transfer,” arXiv preprint arXiv:1701.01036, 2017.
[46] E. Risser, P. Wilmot, and C. Barnes, “Stable and controllable neural texture synthesis and style transfer using histogram losses,” arXiv preprint arXiv:1701.08893, 2017.
[47] J. Johnson, A. Alahi, and L. Fei-Fei, “Perceptual losses for real-time style transfer and super-resolution,” in European Conference on Computer Vision. Springer, 2016, pp. 694–711.
[48] C. Li and M. Wand, “Precomputed real-time texture synthesis with markovian generative adversarial networks,” in European conference on computer vision. Springer, 2016, pp. 702–716.
[49] D. Chen, L. Yuan, J. Liao, N. Yu, and G. Hua, “Stylebank: An explicit representation for neural image style transfer,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 1897–1906.
[50] D. Ulyanov, A. Vedaldi, and V. Lempitsky, “Improved texture networks: Maximizing quality and diversity in feed-forward stylization and texture synthesis,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 6924–6932.
[51] Y. Jing, Y. Liu, Y. Yang, Z. Feng, Y. Yu, D. Tao, and M. Song, “Stroke controllable fast style transfer with adaptive receptive fields,” in European conference on computer vision (ECCV), 2018, pp. 238–254.
[52] Z. Wu, Z. Zhu, J. Du, and X. Bai, “Ccpl: Contrastive coherence preserving loss for versatile style transfer,” in European Conference on Computer Vision. Springer, 2022, pp. 189–206.
[53] V. Dumoulin, J. Shlens, and M. Kudlur, “A learned representation for artistic style,” arXiv preprint arXiv:1610.07629, 2016.
[54] Y. Li, C. Fang, J. Yang, Z. Wang, X. Lu, and M.-H. Yang, “Universal style transfer via feature transforms,” Advances Neural Information Processing Systems (NeurIPS), vol. 30, pp. 386–396, 2017.
[55] X. Huang and S. Belongie, “Arbitrary style transfer in real-time with adaptive instance normalization,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 1501–1510.
[56] T. Karras, S. Laine, and T. Aila, “A style-based generator architecture for generative adversarial networks,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 4401–4410.
[57] T. Lin, Z. Ma, F. Li, D. He, X. Li, E. Ding, N. Wang, J. Li, and X. Gao, “Drafting and revision: Laplacian pyramid network for fast high-quality artistic style transfer,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 5141–5150.
[58] Y. Jing, X. Liu, Y. Ding, X. Wang, E. Ding, M. Song, and S. Wen, “Dynamic instance normalization for arbitrary style transfer,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 04, 2020, pp. 4369–4376.
[59] L. Sheng, Z. Lin, J. Shao, and X. Wang, “Avatar-net: Multi-scale zero-shot style transfer by feature decoration,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 8242–8250.
[60] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in International Conference on Medical image computing and computer-assisted intervention. Springer, 2015, pp. 234–241.
[61] X. Li, S. Liu, J. Kautz, and M.-H. Yang, “Learning linear transformations for fast arbitrary style transfer,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[62] V. Mnih, N. Heess, A. Graves et al., “Recurrent models of visual attention,” Advances Neural Information Processing Systems (NeurIPS), vol. 27, 2014.
[63] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances Neural Information Processing Systems (NeurIPS), vol. 30, 2017.
[64] F. Luan, S. Paris, E. Shechtman, and K. Bala, “Deep photo style transfer,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 4990–4998.
[65] Q. Mao, H.-Y. Tseng, H.-Y. Lee, J.-B. Huang, S. Ma, and M.-H. Yang, “Continuous and diverse image-to-image translation via signed attribute vectors,” International Journal of Computer Vision, vol. 130, no. 2, pp. 517–549, 2022.
[66] A. Romero, P. Arbeláez, L. Van Gool, and R. Timofte, “Smit: Stochastic multi-label image-to-image translation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, 2019, pp. 0–0.
[67] K. Muandet, D. Balduzzi, and B. Schölkopf, “Domain generalization via invariant feature representation,” in International conference on machine learning. PMLR, 2013, pp. 10–18.
[68] Y. Li, X. Tian, M. Gong, Y. Liu, T. Liu, K. Zhang, and D. Tao, “Deep domain generalization via conditional invariant adversarial networks,” in Proceedings of the European conference on computer vision (ECCV), 2018, pp. 624–639.
[69] A. Pumarola, A. Agudo, A. M. Martinez, A. Sanfeliu, and F. Moreno-Noguer, “Ganimation: Anatomically-aware facial animation from a single image,” in Proceedings of the European conference on computer vision (ECCV), 2018, pp. 818–833.
[70] P.-W. Wu, Y.-J. Lin, C.-H. Chang, E. Y. Chang, and S.-W. Liao, “Relgan: Multi-domain image-to-image translation via relative attributes,” in Proceedings of the IEEE/CVF international conference on computer vision, 2019, pp. 5914–5922.
[71] F. Pizzati, P. Cerri, and R. de Charette, “Comogan: continuous model-guided image-to-image translation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 14 288–14 298.
[72] Y. Bengio, G. Mesnil, Y. Dauphin, and S. Rifai, “Better mixing via deep representations,” in International conference on machine learning. PMLR, 2013, pp. 552–560.
[73] D. Berthelot, C. Raffel, A. Roy, and I. Goodfellow, “Understanding and improving interpolation in autoencoders via an adversarial regularizer,” arXiv preprint arXiv:1807.07543, 2018.
[74] H. Teoh, “Formula for vector rotation in arbitrary planes in $\mathbb{R}^{n}$,” 2005.
[75] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
[76] P. Neubert, S. Schubert, and P. Protzel, “An introduction to hyperdimensional computing for robotics,” KI-Künstliche Intelligenz, vol. 33, no. 4, pp. 319–330, 2019.
[77] M. Caron, H. Touvron, I. Misra, H. Jégou, J. Mairal, P. Bojanowski, and A. Joulin, “Emerging properties in self-supervised vision transformers,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 9650–9660.
[78] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
[79] P. Sun, H. Kretzschmar, X. Dotiwalla, A. Chouard, V. Patnaik, P. Tsui, J. Guo, Y. Zhou, Y. Chai, B. Caine et al., “Scalability in perception for autonomous driving: Waymo open dataset,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 2446–2454.
[80] G. Kwon and J. C. Ye, “Diffusion-based image translation using disentangled style and content representation,” in The Eleventh International Conference on Learning Representations, 2023.
[81] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” Communications of the ACM, vol. 60, no. 6, pp. 84–90, 2017.
[82] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter, “Gans trained by a two time-scale update rule converge to a local nash equilibrium,” Advances in neural information processing systems, vol. 30, 2017.
[83] A. Borji, “Pros and cons of gan evaluation measures,” Computer Vision and Image Understanding, vol. 179, pp. 41–65, 2019.
[84] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang, “The unreasonable effectiveness of deep features as a perceptual metric,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 586–595.
[85] K. Cao, C. Wei, A. Gaidon, N. Arechiga, and T. Ma, “Learning imbalanced datasets with label-distribution-aware margin loss,” Advances in neural information processing systems, vol. 32, 2019.