
License: arXiv.org perpetual non-exclusive license
arXiv:2312.16251v1 [cs.CV] 25 Dec 2023

MetaScript: Few-Shot Handwritten Chinese Content Generation via Generative Adversarial Networks
Project of AI3604 Computer Vision, 2023 Fall, SJTU

Jiazi Bu¹ 521030910395
Qirui Li¹ 521030910397
Kailing Wang¹ 521030910356
Xiangyuan Xue¹ 521030910387
Zhiyuan Zhang¹ 521030910377

¹The authors contributed equally to this project. The authors are arranged in alphabetical order by last name.
Abstract

In this work, we propose MetaScript, a novel Chinese content generation system designed to address the diminishing presence of personal handwriting styles in the digital representation of Chinese characters. Our approach harnesses the power of few-shot learning to generate Chinese characters that not only retain the individual’s unique handwriting style but also maintain the efficiency of digital typing. Trained on a diverse dataset of handwritten styles, MetaScript is adept at producing high-quality stylistic imitations from minimal style references and standard fonts. Our work demonstrates a practical solution to the challenges of digital typography in preserving the personal touch in written communication, particularly in the context of Chinese script. Notably, our system has demonstrated superior performance in various evaluations, including recognition accuracy, inception score, and Fréchet inception distance. At the same time, our model's training conditions are easy to meet, facilitating generalization to real applications. Our code is available at https://github.com/xxyQwQ/metascript.

Refer to caption
Figure 1: The overall pipeline of this project. We design a system to generate handwritten Chinese content in a few-shot setting. The system is composed of a generator and a composer. The generator is trained to generate handwritten Chinese characters given a structure template and some style references. The composer stitches the generated characters into handwritten-style content.

1 Introduction

Chinese characters, utilized continuously for over six millennia by more than a quarter of the world’s population, have been integral to education, employment, communication, and daily life in East Asia. The art of handwriting Chinese characters transcends mere linguistic articulation, representing both the pinnacle of visual art and a medium for personal expression and cultivation [18]. There is an ancient Chinese saying, “seeing the character is like seeing the face.” Throughout the process of personal growth, individuals develop distinct and characteristic handwriting styles, which can serve as symbols of one’s identity. Since humanity’s entry into the digital era, efficient yet characterless fixed fonts have supplanted handwritten text, eliminating the possibility of perceiving each other through the written word. This shift has engendered a sense of detachment from handwriting, significantly diminishing the personalization and warmth of textual communication.

To address this issue, we aim to devise a method that retains the individual’s handwriting style while also harnessing the efficiency afforded by typing. However, Chinese characters, numbering over 100,000 distinct ideograms with diverse glyph structures, lack standardized stroke units. Generating handwritten Chinese characters in a specific style is therefore challenging for naive structure-based approaches. Although there have been attempts to utilize the stroke decomposition of Chinese characters [15], combined with vision transformer techniques for morphological imitation, such methods require a substantial reference character set or a highly complex character decomposition tree to generate new styles. They demand extensive linguistic processing, significant storage consumption, and complex search processes, rendering them impractical for everyday use.

Consequently, we introduce MetaScript, a novel approach designed for generating a large number of Chinese characters in a consistent style using few-shot learning. Our method is trained on a handwritten dataset encompassing a variety of styles and a substantial quantity of Chinese characters. It is capable of producing high-quality stylistic imitations of a vast array of text, utilizing only a few style references and standard fonts as structural information. This approach effectively bridges the gap between the personalized nuances of handwriting and the efficiency of digital text generation.

Our work has a multitude of applications. For instance, it can serve as a straightforward alternative to traditional font generation methods, facilitating the effortless creation of personalized fonts. Our approach is capable of producing unique glyph designs, which can be instrumental in artistic and visual media design. With its low computational demands and real-time inference capabilities, our work can be integrated with large language models to generate responses in personalized fonts, thereby enriching the user experience.

We summarize our contributions in three key aspects:

  • Innovative Few-Shot Learning Model: MetaScript employs a few-shot learning framework that enables the model to learn and replicate a specific handwriting style from a minimal set of examples. This significantly reduces the need for extensive training data, making the system more efficient and adaptable to individual styles.

  • Integration of Structural and Stylistic Elements: Our approach uniquely combines the structural integrity of standard Chinese fonts with the stylistic elements of individual handwriting using structure and style encoders. This integration ensures that the generated characters are not only stylistically consistent but also maintain the legibility and structural accuracy essential for Chinese script.

  • Scalability and Efficiency: The MetaScript system is designed to be scalable, capable of handling the generation of a vast number of characters without a proportional increase in computational resources or storage. This scalability is crucial given the extensive number of ideograms in the Chinese language and is a significant advancement over previous methods that required substantial storage and processing power.

2 Related Work

2.1 Generative Adversarial Networks

Generative Adversarial Networks (GANs) have been a hot research topic over the past decade [14]. For example, more than 22,900 GAN-related papers were published in 2023, i.e., more than 2.5 papers per hour. GANs involve two parts: a generator that creates data and a discriminator that distinguishes between generated and real data. They are trained through a minimax optimization to reach a Nash equilibrium [35], where the generator effectively replicates the real data distribution. Thanks to the numerous works enhancing the objective function [32, 20], structure [22, 7, 11, 30], and transfer learning ability [32, 22, 45] of GANs, they now have a multitude of applications across various domains, such as super-resolution [25, 41, 44, 6], image synthesis and manipulation [8, 39], detection, video processing, and NLP tasks [34, 43].

2.2 English Handwriting Generation

The generation of handwritten text is a historically longstanding and classic task [33]. As early as 2007, Gangadhar et al. attempted to use oscillatory neural networks to generate handwriting [10]. Some works used deep recurrent neural networks [24, 12, 2] to enhance consistency in the generated outcomes. Kanda et al. [21] propose using reinforcement learning to evolve a rigorous evaluation of future outcomes during training. Other works employed GANs to perform this task [1, 9, 31].

2.3 CJK Character Generation

Unlike some alphabetic languages, in which extensive text can be generated from a limited number of templates, CJK (Chinese, Japanese, and Korean) languages are characterized by their abundant and structurally complex characters, precluding generation from a minimal set of templates. The task of CJK (especially Chinese) character generation can be traced back to a period when computers had not yet become widespread in China [5]. Early methods decomposed characters into components or strokes [42, 46]. Later studies applied deep learning: some used GANs or similar structures to transfer a given style onto standard glyphs [27, 3, 4, 19], and more recent advances used diffusion models to generate font sets [15, 13, 28].

3 Method

3.1 Overall Pipeline

Suppose the dataset $\mathcal{D}$ contains $n$ types of Chinese characters and $m$ writers in total. Let $x_i^j$ denote the character of the $i$-th type written by the $j$-th writer, i.e., the script, where $i \in \{1, 2, \dots, n\}$ and $j \in \{1, 2, \dots, m\}$. Specifically, let $x_i^0$ denote the character of the $i$-th type rendered from a standard font, i.e., the prototype. The proposed character generator $G$ is trained to follow the style of the references while keeping the structure of the templates. To be exact, given references $\{x_k^j\}_{k=1}^{c}$ and a template $x_i^0$, the generated result $G(\{x_k^j\}_{k=1}^{c}, x_i^0)$ should be similar to $x_i^j$, which inherits the structure of the $i$-th type and the style of the $j$-th writer. We expect the character generator $G$ to generalize well to unseen references.

We utilize the adversarial learning paradigm to train the character generator $G$. Specifically, we introduce a multi-scale discriminator $D$ to distinguish the generated character $G(\{x_k^j\}_{k=1}^{c}, x_i^0)$ from the real character $x_i^j$. The discriminator $D$ is also trained to predict the type and writer of a given character $x$. We expect the discriminator $D$ to encourage the generator $G$ to learn how to generate plausible characters.

Based on the character generator $G$, we can build a Chinese content generation system $S$. Given style references $\{x_k\}_{k=1}^{c}$ and a content text $\{w_i\}_{i=1}^{l}$, the system $S$ retrieves the corresponding templates $\{x_{w_i}^0\}_{i=1}^{l}$, generates the target characters $\{G(\{x_k\}_{k=1}^{c}, x_{w_i}^0)\}_{i=1}^{l}$, and composes them into the complete result. This pipeline is defined as few-shot handwritten Chinese character generation.
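The composition logic of the system $S$ can be sketched as follows. Here `render_prototype` and `generate_character` are hypothetical stand-ins for the standard-font renderer and the trained generator $G$; they return small string placeholders so the retrieve-generate-compose flow is runnable.

```python
def render_prototype(char_type):
    """Stand-in for rendering the prototype x_{w_i}^0 from a standard font."""
    return f"proto({char_type})"

def generate_character(references, prototype):
    """Stand-in for G({x_k}_{k=1}^c, x_{w_i}^0): style refs + template -> glyph."""
    return f"gen[{prototype}|style:{len(references)} refs]"

def generate_content(references, text):
    """System S: retrieve templates, generate each character, then compose."""
    templates = [render_prototype(w) for w in text]           # {x_{w_i}^0}
    generated = [generate_character(references, t) for t in templates]
    return generated  # the composer then stitches these into a page

result = generate_content(["ref1", "ref2", "ref3"], "你好")
```

Note that the style references are extracted once and shared across all characters of the text, which is what makes the generation few-shot with respect to the writer.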

Refer to caption
Figure 2: The overview of the proposed character generator $G$. The generator $G$ mainly contains three modules: a structure encoder $E_\alpha$, a style encoder $E_\beta$, and a denormalization decoder $D_\gamma$. The structure encoder $E_\alpha$ applies the U-Net architecture. The style encoder $E_\beta$ applies the ResNet-18 architecture. The denormalization decoder $D_\gamma$ is composed of 7 cascaded denormalization blocks.
Refer to caption
Figure 3: The detailed structure of the denormalization block and the denormalization layer. The denormalization layer is composed of a normalization step, a denormalization step and an attention mechanism. The denormalization block is composed of a skip connection and two identical layers, each of which contains a denormalization layer, an activation layer and a convolution layer.

3.2 Character Generator

Inspired by previous generative works [26, 22, 23], our proposed character generator $G$ mainly contains three modules: a structure encoder $E_\alpha$, a style encoder $E_\beta$, and a denormalization decoder $D_\gamma$. The structure encoder $E_\alpha$ extracts the structure information $\alpha_1, \alpha_2, \dots, \alpha_7$ from the template $x_i^0$. The style encoder $E_\beta$ extracts the style information $\beta$ from the references $\{x_k^j\}_{k=1}^{c}$. Then the denormalization decoder $D_\gamma$ combines the structure and style information to generate the target character $x_i^j$. The overview of the proposed character generator is shown in Figure 2. The loss function will be described in Equations 11 and 12.

Structure Encoder. The structure encoder $E_\alpha$ applies the U-Net [36] architecture, which includes 6 down-sampling blocks and 6 up-sampling blocks, extracting 7 feature maps $\alpha_1, \alpha_2, \dots, \alpha_7$ with different scales from the template $x_i^0$, as shown in the blue part of Figure 2.

$E_\alpha(x_i^0) = \{\alpha_1, \alpha_2, \dots, \alpha_7\}.$  (1)

Each block is composed of a $4 \times 4$ convolution layer with stride 2, a normalization layer, and an activation layer. There are skip connections between feature maps of the same scale. The structure encoder $E_\alpha$ is trained in a self-supervised manner, which expects the structure of the generated character $G(\{x_k^j\}_{k=1}^{c}, x_i^0)$ to be the same as that of the template $x_i^0$. The loss function will be described in Equation 17.
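As a quick sanity check on the multi-scale design, the following sketch computes the spatial resolutions of the 7 feature maps $\alpha_1, \dots, \alpha_7$: each stride-2 block halves the resolution, and the up-sampling path mirrors it with skip connections at matching scales. The $128 \times 128$ input size is an assumption for illustration, not a value stated in the text.

```python
def encoder_scales(input_size, num_down=6):
    """Spatial sizes produced by a stack of stride-2 down-sampling blocks."""
    sizes = [input_size]
    for _ in range(num_down):
        sizes.append(sizes[-1] // 2)  # each 4x4 stride-2 convolution halves H and W
    return sizes  # 7 scales in total, one per feature map alpha_i

scales = encoder_scales(128)
```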

Style Encoder. The style information should be condensed into a dense feature vector. Therefore, we apply the ResNet-18 [16] architecture in the style encoder $E_\beta$, with the input channels modified to $c$ and a linear layer added at the end. The style encoder $E_\beta$ extracts a 512-dimensional feature vector $\beta$ from the references $\{x_k^j\}_{k=1}^{c}$ as the style information, as shown in the orange part of Figure 2.

$E_\beta(\{x_k^j\}_{k=1}^{c}) = \beta.$  (2)

Similar to the structure encoder $E_\alpha$, the style encoder $E_\beta$ is trained in a self-supervised manner, which expects the style of the generated character $G(\{x_k^j\}_{k=1}^{c}, x_i^0)$ to be the same as that of the references $\{x_k^j\}_{k=1}^{c}$. The loss function will be described in Equation 18.

Denormalization Decoder. Both the structure information $\alpha_1, \alpha_2, \dots, \alpha_7$ and the style information $\beta$ will be fed into the denormalization decoder $D_\gamma$, which is composed of 7 cascaded denormalization blocks $D_\gamma^1, D_\gamma^2, \dots, D_\gamma^7$, as shown in the green part of Figure 2.

$D_\gamma(\{\alpha_i\}_{i=1}^{7}, \beta) = G(\{x_k^j\}_{k=1}^{c}, x_i^0).$  (3)

The output of the denormalization decoder $D_\gamma$ is the generated character $G(\{x_k^j\}_{k=1}^{c}, x_i^0)$, which will be directly supervised by the ground truth $x_i^j$.

The detailed structure of the denormalization block is shown in Figure 3. The denormalization block is composed of two identical layers, each of which contains a denormalization layer, an activation layer, and a $3 \times 3$ convolution layer. There is also a skip connection between the input and output of the denormalization block, which follows the classical residual learning strategy [16].

$D_\gamma^i(\alpha_i, \beta, \gamma_{i-1}) = \gamma_i,$  (4)

where $i \in \{1, 2, \dots, 7\}$. Specifically, we state that $\gamma_0 = \beta$ and $\gamma_7 = G(\{x_k^j\}_{k=1}^{c}, x_i^0)$ for simplicity.
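The cascaded recursion of Equation 4 can be illustrated with a toy stand-in for the denormalization block; `denorm_block` here is hypothetical and merely records the nesting of calls, showing how $\gamma_0 = \beta$ is threaded through all 7 blocks to produce $\gamma_7$.

```python
def denorm_block(i, alpha_i, beta, gamma_prev):
    """Toy stand-in for D_gamma^i: records which inputs each block consumed."""
    return f"D{i}({alpha_i},{gamma_prev})"

def decode(alphas, beta):
    """Cascaded decoding: gamma_i = D_gamma^i(alpha_i, beta, gamma_{i-1})."""
    gamma = beta                                   # gamma_0 = beta
    for i, alpha_i in enumerate(alphas, start=1):
        gamma = denorm_block(i, alpha_i, beta, gamma)
    return gamma                                   # gamma_7 = generated character

out = decode([f"a{i}" for i in range(1, 8)], "beta")
```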

The detailed structure of the denormalization layer is also shown in Figure 3. The denormalization layer follows the design of Adaptive Instance Normalization (AdaIN) [22] and is composed of a normalization step and a denormalization step. Inspired by [26], we also introduce an attention mechanism to softly fuse different feature maps. To be exact, the input feature map $\gamma$ is first normalized in the channel dimension.

$\bar{\gamma} = \dfrac{\gamma - \mu_\gamma}{\sigma_\gamma},$  (5)

where $\mu_\gamma$ and $\sigma_\gamma$ are the mean and standard deviation of the input feature map $\gamma$, respectively.

Then the structure feature map $\alpha$ will be fed into a $1 \times 1$ convolution layer to predict the mean $\mu_\alpha$ and standard deviation $\sigma_\alpha$ used to warp the normalized feature map $\bar{\gamma}$.

$\hat{\alpha} = \sigma_\alpha \times \bar{\gamma} + \mu_\alpha,$  (6)

where $\hat{\alpha}$ is the predicted structure feature map.

The style feature vector $\beta$ will be fed into a linear layer to predict the mean $\mu_\beta$ and standard deviation $\sigma_\beta$ used to warp the normalized feature map $\bar{\gamma}$.

$\hat{\beta} = \sigma_\beta \times \bar{\gamma} + \mu_\beta,$  (7)

where $\hat{\beta}$ is the predicted style feature map.

In addition, the normalized feature map $\bar{\gamma}$ will be fed into a $1 \times 1$ convolution layer and a sigmoid activation layer to form the attention map $\eta$, which serves as a weighted mask to fuse the feature maps $\hat{\alpha}$ and $\hat{\beta}$.

$\hat{\gamma} = (1 - \eta) \times \hat{\alpha} + \eta \times \hat{\beta},$  (8)

where $\hat{\gamma}$ is the output feature map. This completes the denormalization layer.
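Equations 5 through 8 can be sketched in numpy as follows. The $1 \times 1$ convolutions and linear layers that predict the affine parameters are replaced by fixed toy scalars (an assumption for illustration only); the normalization, the two warps, and the attention-weighted fusion follow the text.

```python
import numpy as np

def denorm_layer(gamma, mu_a, sigma_a, mu_b, sigma_b, attn_logits):
    # Eq. (5): normalize the input feature map per channel
    mu = gamma.mean(axis=(1, 2), keepdims=True)
    sigma = gamma.std(axis=(1, 2), keepdims=True) + 1e-5
    gamma_bar = (gamma - mu) / sigma
    # Eq. (6): warp with structure statistics (predicted from alpha in the paper)
    alpha_hat = sigma_a * gamma_bar + mu_a
    # Eq. (7): warp with style statistics (predicted from beta in the paper)
    beta_hat = sigma_b * gamma_bar + mu_b
    # Eq. (8): sigmoid attention map softly fuses the two warped feature maps
    eta = 1.0 / (1.0 + np.exp(-attn_logits))
    return (1 - eta) * alpha_hat + eta * beta_hat

gamma = np.random.randn(4, 8, 8)  # toy (channels, H, W) feature map
out = denorm_layer(gamma, mu_a=0.5, sigma_a=1.2, mu_b=-0.3, sigma_b=0.8,
                   attn_logits=np.zeros((4, 8, 8)))
```

With zero attention logits, $\eta = 0.5$ everywhere, so the output is simply the average of the structure-warped and style-warped maps; non-uniform logits shift each spatial location toward one source or the other, which is the "effective region" adjustment described below.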

The key idea of the denormalization layer is to adaptively adjust the effective regions of the structure feature map and the style feature map, so that the generated character can inherit the structure of the template and the style of the references. Compared with AdaIN [22], the denormalization layer can fuse feature maps of arbitrary styles instead of exchanging them pairwise, which offers better flexibility and diversity in character generation.

3.3 Multi-scale Discriminator

The discriminator block $D^i$ follows the traditional convolutional neural network paradigm and is composed of 5 down-sampling blocks and 3 classification heads, as shown in Figure 4. Each down-sampling block is composed of a $4 \times 4$ convolution layer with stride 2, a normalization layer, and an activation layer, exactly the same as in the structure encoder $E_\alpha$. Each classification head contains a single linear layer to predict a probability distribution from the extracted feature map.

$D^i(x) = (D_\phi^i(x), D_\alpha^i(x), D_\beta^i(x)) = (\hat{y}_\phi^i, \hat{y}_\alpha^i, \hat{y}_\beta^i),$  (9)

where $D^i$ denotes the $i$-th discriminator block, and $\hat{y}_\phi^i$, $\hat{y}_\alpha^i$ and $\hat{y}_\beta^i$ denote the predicted authenticity, type and writer respectively. We expect the discriminator block to extract features useful for distinguishing the authenticity, type and writer all together. The fundamental idea is to force the generator to learn correct structure and style features, instead of simply cheating the discriminator. The loss functions will be described in Equations 13, 14, 15 and 16.
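As a quick sanity check on the block's geometry (a sketch under stated assumptions, not code from the released repository; in particular, padding 1 is assumed, since the paper only specifies $4 \times 4$ kernels with stride 2): five stride-2 down-sampling blocks halve the spatial resolution each time, so a $128 \times 128$ input character reaches the classification heads as a $4 \times 4$ feature map.

```python
def conv_out(size, kernel=4, stride=2, padding=1):
    # standard output-size formula for a 2D convolution
    return (size + 2 * padding - kernel) // stride + 1

size = 128  # input character resolution
for _ in range(5):  # 5 down-sampling blocks
    size = conv_out(size)  # 128 -> 64 -> 32 -> 16 -> 8 -> 4
print(size)  # 4
```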

Refer to caption
Figure 4: The structure of the discriminator block, a typical convolutional neural network composed of 5 down-sampling blocks and 3 classification heads.
Refer to caption
Figure 5: Overview of the multi-scale discriminator $D$, which is composed of 2 average pooling layers and 3 discriminator blocks $D^1$, $D^2$ and $D^3$. The input character $x$ is down-sampled by the average pooling layers to form 3 different scales and fed into the corresponding discriminator blocks.

Inspired by [40], we apply a multi-scale discriminator $D$ to enhance discriminative performance, as shown in Figure 5. The multi-scale discriminator $D$ is composed of 2 average pooling layers and 3 discriminator blocks $D^1$, $D^2$ and $D^3$. The input character $x$ is down-sampled by the average pooling layers to form 3 different scales, so the corresponding discriminator blocks can evaluate it from different perspectives and provide better supervision. Previous works [40, 26] have shown that a multi-scale discriminator effectively improves the quality of generated results, especially in high-resolution tasks.

$$D(x) = \{D^1(x),\, D^2(x'),\, D^3(x'')\}, \qquad (10)$$

where $x'$ and $x''$ denote the input character $x$ down-sampled once and twice respectively. The loss functions will be described in Equations 11 and 12.
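The input pyramid of Equation 10 can be sketched in plain NumPy; `avg_pool2` here is a hypothetical helper standing in for the discriminator's average pooling layers.

```python
import numpy as np

def avg_pool2(x):
    # 2x2 average pooling with stride 2 on a single-channel image
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

x = np.random.rand(128, 128)   # input character x, fed to D^1
x1 = avg_pool2(x)              # x',  64x64, fed to D^2
x2 = avg_pool2(x1)             # x'', 32x32, fed to D^3
```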

3.4 Training Objective

We utilize adversarial learning to train the character generator $G$ and the multi-scale discriminator $D$, introducing 5 kinds of loss functions: adversarial loss $\mathcal{L}_{adv}$, classification loss $\mathcal{L}_{cls}$, structure loss $\mathcal{L}_{str}$, style loss $\mathcal{L}_{sty}$ and reconstruction loss $\mathcal{L}_{rec}$. The overall loss function is a weighted sum over them. Formally, for the generator, the overall loss function is defined as

$$\mathcal{L}_{all}^G = \lambda_{adv}^G \mathcal{L}_{adv}^G + \lambda_{cls}^G \mathcal{L}_{cls}^G + \lambda_{str}^G \mathcal{L}_{str}^G + \lambda_{sty}^G \mathcal{L}_{sty}^G + \lambda_{rec}^G \mathcal{L}_{rec}^G, \qquad (11)$$

and for the discriminator it is defined as

$$\mathcal{L}_{all}^D = \lambda_{adv}^D \mathcal{L}_{adv}^D + \lambda_{cls}^D \mathcal{L}_{cls}^D, \qquad (12)$$

where $\lambda_{adv}^G$, $\lambda_{cls}^G$, $\lambda_{str}^G$, $\lambda_{sty}^G$, $\lambda_{rec}^G$, $\lambda_{adv}^D$ and $\lambda_{cls}^D$ are hyperparameters that balance the loss terms.
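The weighted sums in Equations 11 and 12 amount to a dictionary of loss terms combined with scalar weights; a minimal sketch, with the generator weights taken from the settings reported later in Section 4.1:

```python
def weighted_loss(losses, weights):
    # overall loss as a weighted sum over the individual terms
    return sum(weights[name] * value for name, value in losses.items())

gen_weights = {"adv": 1.0, "cls": 1.0, "str": 0.5, "sty": 0.1, "rec": 20.0}
disc_weights = {"adv": 1.0, "cls": 1.0}

# toy loss values for illustration only
example = weighted_loss(
    {"adv": 0.7, "cls": 1.2, "str": 0.3, "sty": 0.5, "rec": 0.02},
    gen_weights,
)
```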

Adversarial Loss. The adversarial loss trains the discriminator $D$ to distinguish the generated character $G(\{x_k^j\}_{k=1}^c, x_i^0)$ from the real character $x_i^j$, which indirectly encourages the generator $G$ to produce more plausible characters. Binary cross entropy is applied as the adversarial loss. Formally, for the generator, the adversarial loss is defined as

$$\mathcal{L}_{adv}^G = -\sum_{s=1}^{3} \log D_\phi^s\big(G(\{x_k^j\}_{k=1}^c, x_i^0)\big), \qquad (13)$$

and for the discriminator it is defined as

$$\mathcal{L}_{adv}^D = -\sum_{s=1}^{3} \log D_\phi^s(x_i^j) - \sum_{s=1}^{3} \log\big[1 - D_\phi^s\big(G(\{x_k^j\}_{k=1}^c, x_i^0)\big)\big]. \qquad (14)$$
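Equations 13 and 14 reduce to binary cross entropy summed over the three scales. A minimal NumPy sketch operating directly on the authenticity scores $\hat{y}_\phi^s$ (the multi-head discriminator itself is omitted):

```python
import numpy as np

def adv_loss_g(fake_scores):
    # Eq. (13): the generator pushes D_phi towards 1 on generated characters
    return -np.sum(np.log(fake_scores))

def adv_loss_d(real_scores, fake_scores):
    # Eq. (14): the discriminator pushes towards 1 on real, 0 on fake
    return -np.sum(np.log(real_scores)) - np.sum(np.log(1.0 - fake_scores))

fake = np.array([0.5, 0.5, 0.5])  # toy D_phi^s scores at the 3 scales
real = np.array([0.9, 0.8, 0.9])
```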

Classification Loss. The classification loss trains the discriminator $D$ to precisely predict the type and writer of a given character $x$. Different from the adversarial loss, the generator $G$ is also trained to minimize the classification loss, which indirectly encourages it to generate characters with accurate structure and style. Cross entropy is applied as the classification loss. Formally, for the generator, the classification loss is defined as

$$\mathcal{L}_{cls}^G = -\sum_{s=1}^{3} \log D_\alpha^s\big(G(\{x_k^j\}_{k=1}^c, x_i^0)\big) - \sum_{s=1}^{3} \log D_\beta^s\big(G(\{x_k^j\}_{k=1}^c, x_i^0)\big), \qquad (15)$$

and for the discriminator it is defined as

$$\mathcal{L}_{cls}^D = -\sum_{s=1}^{3} \log D_\alpha^s(x_i^j) - \sum_{s=1}^{3} \log D_\beta^s(x_i^j) - \sum_{s=1}^{3} \log D_\alpha^s\big(G(\{x_k^j\}_{k=1}^c, x_i^0)\big) - \sum_{s=1}^{3} \log D_\beta^s\big(G(\{x_k^j\}_{k=1}^c, x_i^0)\big). \qquad (16)$$

We should note that the classification loss is indispensable: it introduces effective supervision that prevents the generator from simply producing meaningless characters to cheat the discriminator. We will show how the classification loss solves the problem of mode collapse in Section 4.
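Equations 15 and 16 are plain cross entropy on the type and writer heads. A sketch under the assumption that each head outputs a probability vector (softmax already applied), with toy numbers of classes:

```python
import numpy as np

def cls_loss_g(type_probs, writer_probs, type_idx, writer_idx):
    # Eq. (15): cross entropy on the generated character, summed over scales;
    # type_probs / writer_probs hold one probability vector per scale
    return (-sum(np.log(p[type_idx]) for p in type_probs)
            - sum(np.log(p[writer_idx]) for p in writer_probs))

# a single toy scale with 4 character types and 2 writers
type_probs = [np.array([0.7, 0.1, 0.1, 0.1])]
writer_probs = [np.array([0.9, 0.1])]
loss = cls_loss_g(type_probs, writer_probs, type_idx=0, writer_idx=0)
```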

Structure Loss. Intuitively, we expect the generated character $G(\{x_k^j\}_{k=1}^c, x_i^0)$ to inherit the structure of the template $x_i^0$, so the feature maps $\alpha_1, \alpha_2, \dots, \alpha_7$ extracted by the structure encoder $E_\alpha$ should be invariant. The structure loss encourages not only the generator $G$ to generate the correct structure, but also the structure encoder $E_\alpha$ to extract valid structure features. The structure loss for the generator is formally defined as

$$\mathcal{L}_{str}^G = \frac{1}{2}\left\| E_\alpha\big(G(\{x_k^j\}_{k=1}^c, x_i^0)\big) - E_\alpha(x_i^0) \right\|_2^2. \qquad (17)$$

Style Loss. We also expect the generated character $G(\{x_k^j\}_{k=1}^c, x_i^0)$ to follow the style of the references $\{x_k^j\}_{k=1}^c$, so the feature vector $\beta$ extracted by the style encoder $E_\beta$ should be invariant. The style loss encourages not only the generator $G$ to generate the correct style, but also the style encoder $E_\beta$ to extract valid style features. The style loss for the generator is formally defined as

$$\mathcal{L}_{sty}^G = \frac{1}{2}\left\| E_\beta\big(G(\{x_k^j\}_{k=1}^c, x_i^0)\big) - E_\beta(\{x_k^j\}_{k=1}^c) \right\|_2^2. \qquad (18)$$
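Equations 17 and 18 are both half squared $L_2$ distances between encoder features of the generated character and of its reference, so a single helper covers both. The feature vectors below are toy stand-ins for encoder outputs:

```python
import numpy as np

def feature_match_loss(feat_gen, feat_ref):
    # Eqs. (17)/(18): 0.5 * ||f_gen - f_ref||_2^2
    diff = feat_gen - feat_ref
    return 0.5 * np.sum(diff * diff)

f_gen = np.array([1.0, 2.0, 3.0])  # e.g. E_alpha(G(...)) flattened
f_ref = np.array([1.0, 0.0, 3.0])  # e.g. E_alpha(x_i^0) flattened
loss = feature_match_loss(f_gen, f_ref)  # 0.5 * (0 + 4 + 0) = 2.0
```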

Reconstruction Loss. The reconstruction loss measures the pixel-wise difference between the generated character $G(\{x_k^j\}_{k=1}^c, x_i^0)$ and the ground truth $x_i^j$, providing direct supervision to the generator $G$. The $L_1$ norm is applied as the reconstruction loss, which for the generator is formally defined as

$$\mathcal{L}_{rec}^G = \left\| G(\{x_k^j\}_{k=1}^c, x_i^0) - x_i^j \right\|_1. \qquad (19)$$

We should note that the reconstruction loss is not intended to force the generated character to be exactly the same as the ground truth, because such a constraint would limit the diversity of the generated characters. Instead, the reconstruction loss should be kept within a reasonable range.
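Equation 19 in the same style: a pixel-wise $L_1$ difference between the generated image and the ground truth, here on a toy 2×2 "image":

```python
import numpy as np

def rec_loss(generated, ground_truth):
    # Eq. (19): L1 norm of the pixel-wise difference
    return np.sum(np.abs(generated - ground_truth))

gen = np.array([[0.0, 0.5], [1.0, 0.25]])
gt = np.array([[0.0, 0.25], [0.5, 0.25]])
loss = rec_loss(gen, gt)  # |0| + |0.25| + |0.5| + |0| = 0.75
```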

3.5 Content Composition

We develop a typesetting tool, Typewriter, for reorganizing output characters into a complete result. Given a content text and some style references, we generate an individual glyph for each character in the text using the methods described in the preceding subsections. To better and more intuitively display the effectiveness of our method in generating Chinese character content, we first apply a random transformation to the generated character images, mimicking the alignment variations of human handwriting, and then arrange them into a complete image. The details are shown in Algorithm 1.

Input: style references $\{x_k\}_{k=1}^c$, content text $\{w_i\}_{i=1}^l$, expected character size $size_c$, expected line width $width_l$
Output: arranged handwritten content image $I_a$

1:  Initialize $cursor \leftarrow$ position 0, $I_a \leftarrow$ empty image
2:  for $w_i$ in $\{w_i\}_{i=1}^l$ do
3:      if $w_i$ is a line break then
4:          move $cursor$ to the next line; continue
5:      else if $w_i$ is a space then
6:          $g_i \leftarrow$ empty space of width $size_c / 2$
7:      else if $w_i$ is a character then
8:          generate $g_i \leftarrow G(\{x_k\}_{k=1}^c, x_{w_i}^0)$ with size $size_c$
9:      else if $w_i$ is punctuation then
10:         $g_i \leftarrow$ punctuation template
11:     end if
12:     apply a random transformation to $g_i$
13:     plot $g_i$ at $cursor$ in $I_a$ and update $cursor$
14:     if $cursor > width_l$ then move $cursor$ to the next line
15: end for
16: return handwritten content image $I_a$

Algorithm 1: Typewriter Procedure
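The cursor logic of Algorithm 1 can be sketched in Python. This `typeset` helper is illustrative, not the released tool: it returns glyph placements rather than a rendered image, and the generator call and random transformation are stubbed out.

```python
def typeset(text, char_size=128, line_width=1024):
    # simplified Typewriter: compute (char, x, y) placements for each glyph
    placements, x, y = [], 0, 0
    for w in text:
        if w == "\n":                 # line break: move cursor to next line
            x, y = 0, y + char_size
            continue
        # spaces take half a character width; glyph generation is stubbed out
        advance = char_size // 2 if w == " " else char_size
        placements.append((w, x, y))  # plot the glyph at the cursor position
        x += advance                  # update the cursor
        if x > line_width:            # wrap when the line width is exceeded
            x, y = 0, y + char_size
    return placements

layout = typeset("ab\ncd", char_size=128, line_width=1024)
# [('a', 0, 0), ('b', 128, 0), ('c', 0, 128), ('d', 128, 128)]
```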

4 Experiments

Refer to caption
Figure 6: Characters synthesized by the generator with 4 references, trained for 100k iterations. The first 4 rows are the style references, the 5th row is the structure template, the 6th row is the ground truth, and the last row is the generated character.
Refer to caption
Figure 7: Content generated by the system, which is built on the generator cascaded with the Typewriter. The left-hand side shows the style references; the right-hand side shows the generated results.

4.1 Experiment Setup

We finetune the hyperparameters, train the model, and build the system. Some results are shown in Figures 6 and 7. The training details are as follows.

Dataset. The training set is built based on the CASIA-HWDB-1.1 dataset [29]. We select approximately 1M script images from 300 writers, covering 3755 common Chinese characters. During preprocessing, the script images are resized to $128 \times 128$ resolution, named by type and grouped by writer; these serve as style references. We also render prototype images of the characters at the same resolution from the Source Han Sans CN font, which serve as structure templates.

Implementation details. The model is trained for 100k iterations with a batch size of 32. We use the Adam optimizer with a learning rate of 0.0001 for both the generator and the discriminator. The loss weights are $\lambda_{adv}^G = 1$, $\lambda_{cls}^G = 1$, $\lambda_{str}^G = 0.5$, $\lambda_{sty}^G = 0.1$, $\lambda_{rec}^G = 20$, $\lambda_{adv}^D = 1$ and $\lambda_{cls}^D = 1$.
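For a sense of scale (a rough consistency check, not a number reported by the paper): 100k iterations at batch size 32 correspond to roughly 3.2 passes over the ~1M training images.

```python
iterations, batch_size, dataset_size = 100_000, 32, 1_000_000
samples_seen = iterations * batch_size  # total samples drawn during training
epochs = samples_seen / dataset_size    # approximate passes over the dataset
print(epochs)  # 3.2
```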

Evaluation metrics. To evaluate the quality and diversity of the generated results, we adopt recognition accuracy (RA), inception score (IS) [37], and Frechet inception distance (FID) [17] as the evaluation metrics. We use an EfficientNetV2 model [38] pretrained on the CASIA-HWDB-1.1 dataset as the classifier, both to extract high-level features and to produce class predictions for these computations. The test set is manually selected from the CASIA-HWDB-1.0 dataset and contains 1000 characters written by 10 writers.
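For reference, the inception score is $\exp(\mathbb{E}_x[\mathrm{KL}(p(y|x)\,\|\,p(y))])$ computed over the classifier's softmax outputs. A minimal NumPy sketch, assuming the (N, C) probability matrix from the EfficientNetV2 classifier is already available:

```python
import numpy as np

def inception_score(probs: np.ndarray, eps: float = 1e-12) -> float:
    """IS from an (N, C) matrix of softmax outputs on generated samples.

    Computes exp of the mean KL divergence between each conditional
    distribution p(y|x) and the marginal p(y).
    """
    p_y = probs.mean(axis=0, keepdims=True)  # marginal class distribution
    kl = probs * (np.log(probs + eps) - np.log(p_y + eps))
    return float(np.exp(kl.sum(axis=1).mean()))
```

Uniform predictions give the minimum score of 1, while confident and diverse predictions approach the number of classes, so higher is better for both authenticity and diversity.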

4.2 Ablation Studies

We conduct two separate ablation studies to investigate the best number of references and verify the effectiveness of each loss term.

Reference number. To investigate how the number of references affects model performance, we train the model with 1, 2, 4, and 8 references for 100k iterations each. The results are shown in Table 1.

Table 1: Model performance with different reference numbers, measured by recognition accuracy (RA), inception score (IS), and Frechet inception distance (FID).
Reference  RA↑     IS↑      FID↓
1          79.8%   55.543   197.950
2          77.5%   58.524   190.193
4          81.9%   58.172   187.365
8          76.2%   58.598   188.411

We can see that, in general, as the number of references increases, the inception score increases and the Frechet inception distance decreases, indicating better authenticity and diversity of the generated results. This is reasonable because more references provide richer style information for the generator. However, the recognition accuracy fluctuates strongly and peaks at 4 references. We therefore use 4 references in practice.

Loss function. Another question is whether each loss term is necessary for training the model. To answer it, we train the model with 4 references for 100k iterations, removing each loss term in turn. The results are shown in Table 2.

Table 2: Changes in model performance with individual loss terms removed, including adversarial loss ($\mathcal{L}_{adv}$), classification loss ($\mathcal{L}_{cls}$), structure loss ($\mathcal{L}_{str}$), style loss ($\mathcal{L}_{sty}$), and reconstruction loss ($\mathcal{L}_{rec}$), compared to the baseline model ($\mathcal{L}_{all}$).
Loss                     RA↑     IS↑      FID↓
w/ $\mathcal{L}_{all}$   81.9%   58.172   187.365
w/o $\mathcal{L}_{adv}$  0.0%    1.098    287.104
w/o $\mathcal{L}_{cls}$  0.1%    1.601    318.286
w/o $\mathcal{L}_{str}$  84.4%   57.578   181.738
w/o $\mathcal{L}_{sty}$  67.4%   55.108   204.016
w/o $\mathcal{L}_{rec}$  72.3%   54.316   202.782
Figure 8: Examples of the generated results with $\mathcal{L}_{adv}$ removed.
Figure 9: Examples of the generated results with $\mathcal{L}_{cls}$ removed.

The models with one loss term removed show performance degradation to varying degrees, except for $\mathcal{L}_{str}$, whose effect is implicitly covered by the strong representation ability of the structure encoder. Removing $\mathcal{L}_{sty}$ or $\mathcal{L}_{rec}$ leads to a modest decrease in performance, while removing $\mathcal{L}_{adv}$ or $\mathcal{L}_{cls}$ results in complete failure. Figures 8 and 9 show examples of the generated results with $\mathcal{L}_{adv}$ and $\mathcal{L}_{cls}$ removed, respectively. Without adversarial learning, the generator cannot learn the handwritten style; without the classification loss, it fails to generate the correct character pattern, which can even lead to severe mode collapse. Therefore, we conclude that all the loss terms are necessary.

5 Conclusion

In this paper, we introduced MetaScript, a novel system for generating handwritten Chinese content using few-shot learning and Generative Adversarial Networks. Our approach effectively bridges the gap between the personalized nuances of handwriting and the efficiency of digital text generation. The key contributions of our work include the development of an innovative few-shot learning model, the integration of structural and stylistic elements in character generation, and the scalability and efficiency of the MetaScript system.

Our experiments demonstrate that MetaScript can successfully replicate a variety of handwriting styles with high fidelity using only a few style references. The system shows promising results in terms of recognition accuracy, inception score, and Frechet inception distance, indicating its effectiveness in generating authentic and diverse handwritten Chinese characters.

However, there are still challenges and limitations to be addressed. The quality of generated characters can vary depending on the number and quality of style references provided. Additionally, while our model performs well with common Chinese characters, its effectiveness with less common or more complex characters requires further exploration.

Future work will proceed along three directions: 1) We aim to enhance the robustness and versatility of the system through more sophisticated few-shot learning techniques, which we expect to significantly improve MetaScript's ability to learn from limited data. 2) Another pivotal area of interest is the extension of our approach to other scripts, including Arabic and Devanagari. These scripts, with their rich handwriting traditions, present unique challenges and opportunities for our handwriting generation model. 3) Finally, we plan to integrate MetaScript into real-world applications, embedding our system into digital education tools and personalized digital communication platforms, thereby infusing the warmth and personality of traditional handwriting into the digital realm.

References

  • Alonso et al. [2019] Eloi Alonso, Bastien Moysset, and Ronaldo Messina. Adversarial generation of handwritten text images conditioned on sequences. In 2019 international conference on document analysis and recognition (ICDAR), pages 481–486. IEEE, 2019.
  • Bodapati et al. [2020] Suraj Bodapati, Sneha Reddy, and Sugamya Katta. Realistic handwriting generation using recurrent neural networks and long short-term networks. In Proceedings of the Third International Conference on Computational Intelligence and Informatics: ICCII 2018, pages 651–661. Springer, 2020.
  • Chang et al. [2018a] Bo Chang, Qiong Zhang, Shenyi Pan, and Lili Meng. Generating handwritten chinese characters using cyclegan. In 2018 IEEE winter conference on applications of computer vision (WACV), pages 199–207. IEEE, 2018a.
  • Chang et al. [2018b] Jie Chang, Yujun Gu, Ya Zhang, Yan-Feng Wang, and CM Innovation. Chinese handwriting imitation with hierarchical generative adversarial network. In BMVC, page 290, 2018b.
  • Chang [1973] Shi-Kuo Chang. An interactive system for chinese character generation and retrieval. IEEE Transactions on Systems, Man, and Cybernetics, SMC-3(3):257–265, 1973.
  • Ding et al. [2019] Zihan Ding, Xiao-Yang Liu, Miao Yin, and Linghe Kong. Tgan: Deep tensor generative adversarial nets for large image generation. arXiv preprint arXiv:1901.09953, 2019.
  • Durugkar et al. [2016] Ishan Durugkar, Ian Gemp, and Sridhar Mahadevan. Generative multi-adversarial networks. arXiv preprint arXiv:1611.01673, 2016.
  • Ehsani et al. [2018] Kiana Ehsani, Roozbeh Mottaghi, and Ali Farhadi. Segan: Segmenting and generating the invisible. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6144–6153, 2018.
  • Fogel et al. [2020] Sharon Fogel, Hadar Averbuch-Elor, Sarel Cohen, Shai Mazor, and Roee Litman. Scrabblegan: Semi-supervised varying length handwritten text generation, 2020.
  • Gangadhar et al. [2007] Garipelli Gangadhar, Denny Joseph, and V Srinivasa Chakravarthy. An oscillatory neuromotor model of handwriting generation. International journal of document analysis and recognition (ijdar), 10:69–84, 2007.
  • Ghosh et al. [2018] Arnab Ghosh, Viveka Kulharia, Vinay P Namboodiri, Philip HS Torr, and Puneet K Dokania. Multi-agent diverse generative adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 8513–8521, 2018.
  • Graves [2013] Alex Graves. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850, 2013.
  • Gui et al. [2023] Dongnan Gui, Kai Chen, Haisong Ding, and Qiang Huo. Zero-shot generation of training data with denoising diffusion probabilistic model for handwritten chinese character recognition. arXiv preprint arXiv:2305.15660, 2023.
  • Gui et al. [2020] Jie Gui, Zhenan Sun, Yonggang Wen, Dacheng Tao, and Jieping Ye. A review on generative adversarial networks: Algorithms, theory, and applications, 2020.
  • He et al. [2022] Haibin He, Xinyuan Chen, Chaoyue Wang, Juhua Liu, Bo Du, Dacheng Tao, and Yu Qiao. Diff-font: Diffusion model for robust one-shot font generation. arXiv preprint arXiv:2212.05895, 2022.
  • He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • Heusel et al. [2017] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems, 30, 2017.
  • Ji et al. [2022] Yu Ji, Wen Wu, Yi Hu, Xiaofeng He, Changzhi Chen, and Liang He. Automatic personality prediction based on users’ chinese handwriting change. In CCF Conference on Computer Supported Cooperative Work and Social Computing, pages 435–449. Springer, 2022.
  • Jiang et al. [2018] Haochuan Jiang, Guanyu Yang, Kaizhu Huang, and Rui Zhang. W-net: One-shot arbitrary-style chinese character generation with deep neural networks. In Neural Information Processing, pages 483–493, Cham, 2018. Springer International Publishing.
  • Jolicoeur-Martineau [2018] Alexia Jolicoeur-Martineau. The relativistic discriminator: a key element missing from standard gan. arXiv preprint arXiv:1807.00734, 2018.
  • Kanda et al. [2020] Keisuke Kanda, Brian Kenji Iwana, and Seiichi Uchida. What is the reward for handwriting? — a handwriting generation model based on imitation learning. In 2020 17th International Conference on Frontiers in Handwriting Recognition (ICFHR), pages 109–114, 2020.
  • Karras et al. [2019] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4401–4410, 2019.
  • Karras et al. [2020] Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Analyzing and improving the image quality of stylegan. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8110–8119, 2020.
  • Kumar et al. [2018] K. Manoj Kumar, Harish Kandala, and N. Sudhakar Reddy. Synthesizing and imitating handwriting using deep recurrent neural networks and mixture density networks. In 2018 9th International Conference on Computing, Communication and Networking Technologies (ICCCNT), pages 1–6, 2018.
  • Ledig et al. [2017] Christian Ledig, Lucas Theis, Ferenc Huszár, Jose Caballero, Andrew Cunningham, Alejandro Acosta, Andrew Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, et al. Photo-realistic single image super-resolution using a generative adversarial network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4681–4690, 2017.
  • Li et al. [2019a] Lingzhi Li, Jianmin Bao, Hao Yang, Dong Chen, and Fang Wen. Faceshifter: Towards high fidelity and occlusion aware face swapping. arXiv preprint arXiv:1912.13457, 2019a.
  • Li et al. [2019b] Meng Li, Jian Wang, Yi Yang, Weixing Huang, and Wenjuan Du. Improving gan-based calligraphy character generation using graph matching. In 2019 IEEE 19th International Conference on Software Quality, Reliability and Security Companion (QRS-C), pages 291–295. IEEE, 2019b.
  • Liao et al. [2023] Qisheng Liao, Gus Xia, and Zhinuo Wang. Calliffusion: Chinese calligraphy generation and style transfer with diffusion modeling. arXiv preprint arXiv:2305.19124, 2023.
  • Liu et al. [2011] Cheng-Lin Liu, Fei Yin, Da-Han Wang, and Qiu-Feng Wang. Casia online and offline chinese handwriting databases. In 2011 international conference on document analysis and recognition, pages 37–41. IEEE, 2011.
  • Liu and Tuzel [2016] Ming-Yu Liu and Oncel Tuzel. Coupled generative adversarial networks. Advances in neural information processing systems, 29, 2016.
  • Liu et al. [2021] Xiyan Liu, Gaofeng Meng, Shiming Xiang, and Chunhong Pan. Handwritten text generation via disentangled representations. IEEE Signal Processing Letters, 28:1838–1842, 2021.
  • Lu et al. [2018] Yongyi Lu, Yu-Wing Tai, and Chi-Keung Tang. Attribute-guided face generation using conditional cyclegan. In Proceedings of the European conference on computer vision (ECCV), pages 282–297, 2018.
  • Madaan et al. [2022] Mehul Madaan, Aniket Kumar, Shubham Kumar, Aniket Saha, and Kirti Gupta. Handwriting generation and synthesis: A review. In 2022 Second International Conference on Power, Control and Computing Technologies (ICPC2T), pages 1–6, 2022.
  • Mogren [2016] Olof Mogren. C-rnn-gan: Continuous recurrent neural networks with adversarial training. arXiv preprint arXiv:1611.09904, 2016.
  • Ratliff et al. [2013] Lillian J Ratliff, Samuel A Burden, and S Shankar Sastry. Characterization and computation of local nash equilibria in continuous games. In 2013 51st Annual Allerton Conference on Communication, Control, and Computing (Allerton), pages 917–924. IEEE, 2013.
  • Ronneberger et al. [2015] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18, pages 234–241. Springer, 2015.
  • Salimans et al. [2016] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training gans. Advances in neural information processing systems, 29, 2016.
  • Tan and Le [2021] Mingxing Tan and Quoc Le. Efficientnetv2: Smaller models and faster training. In International conference on machine learning, pages 10096–10106. PMLR, 2021.
  • Vondrick et al. [2016] Carl Vondrick, Hamed Pirsiavash, and Antonio Torralba. Generating videos with scene dynamics. Advances in neural information processing systems, 29, 2016.
  • Wang et al. [2018a] Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Andrew Tao, Jan Kautz, and Bryan Catanzaro. High-resolution image synthesis and semantic manipulation with conditional gans. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 8798–8807, 2018a.
  • Wang et al. [2018b] Xintao Wang, Ke Yu, Shixiang Wu, Jinjin Gu, Yihao Liu, Chao Dong, Yu Qiao, and Chen Change Loy. Esrgan: Enhanced super-resolution generative adversarial networks. In Proceedings of the European conference on computer vision (ECCV) workshops, pages 0–0, 2018b.
  • Xu et al. [2009] Songhua Xu, Tao Jin, Hao Jiang, and Francis CM Lau. Automatic generation of personal chinese handwriting by capturing the characteristics of personal handwriting. In Twenty-First IAAI Conference, 2009.
  • Yu et al. [2017] Lantao Yu, Weinan Zhang, Jun Wang, and Yong Yu. Seqgan: Sequence generative adversarial nets with policy gradient. In Proceedings of the AAAI conference on artificial intelligence, 2017.
  • Yuan et al. [2018] Yuan Yuan, Siyuan Liu, Jiawei Zhang, Yongbing Zhang, Chao Dong, and Liang Lin. Unsupervised image super-resolution using cycle-in-cycle generative adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pages 701–710, 2018.
  • Zhang et al. [2018] Yiheng Zhang, Zhaofan Qiu, Ting Yao, Dong Liu, and Tao Mei. Fully convolutional adaptation networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6810–6818, 2018.
  • Zhou et al. [2011] Baoyao Zhou, Weihong Wang, and Zhanghui Chen. Easy generation of personal chinese handwritten fonts. In 2011 IEEE international conference on multimedia and expo, pages 1–6. IEEE, 2011.