
1 Introduction

Speech-driven facial animation has applications in video games, virtual assistants, animated films, and more, and has therefore garnered broad interest. The problem is multifaceted: it requires accurate lip-sync, natural expressions such as eye blinks and head orientation, and the capture of subject-specific traits such as identity and lip deformation. Moreover, the generated animation should not be overly dependent on the training set; the method should adapt quickly to unknown faces and speech. Existing end-to-end learning methods [32, 36] adapt poorly to unknown speech or faces, resulting in implausible animation. To overcome the problems of generating images directly from speech, Chen et al. [4] learn an intermediate high-level representation of motion from audio, followed by texturing. Although this method preserves identity, it fails to produce accurate and realistic lip synchronization, as shown in Fig. 1 (last row). On the other hand, [32] produces plausible lip motion but renders incorrect identity, as shown in Fig. 1 (third row). The key challenges in the talking-face problem are therefore i) accurate lip synchronization along with identity preservation, ii) the presence of natural expressions such as eye blinks, and iii) fast adaptation to unknown subjects and speech for all practical purposes. Figure 1 shows that none of the most recent state-of-the-art methods produces animations that solve all of the above challenges.

Fig. 1.
figure 1

Recent state-of-the-art methods [4, 32] (evaluated using their publicly available pre-trained models, trained on the LRW and TCD-TIMIT datasets respectively) for speech-driven facial animation fail to accurately capture the mouth shapes and detailed facial texture of an unknown test subject whose facial characteristics differ from the training data. In these methods, the generated face can appear very different from the given target identity [32], or there can be significant blur in the mouth region [4], leading to unrealistic face animation. In contrast, our generated facial texture and mouth shapes closely resemble the ground-truth animation sequence.

In this paper, we propose a novel strategy to address these challenges. In essence, our method partitions the problem into four stages. First, we design a GAN that learns motion on canonical (person-independent) landmarks from DeepSpeech features extracted from audio. The GAN is effective at learning the subtle speech-induced deformations of the lips, and learning motion on a canonical face makes the method invariant to person-specific face geometry; DeepSpeech features additionally alleviate problems caused by different accents and noise. Together, these choices let our method learn motion from speech robustly and adapt to unknown speech. Next, we impose eye blinks predicted by a separate network and transfer the learned canonical landmark motion to person-specific landmark motion using Procrustes alignment [29]. Subsequently, we train another GAN for texture generation conditioned on the person-specific landmarks. For better adaptation to unknown subjects and head orientations, we meta-learn this GAN using the Model-Agnostic Meta-Learning (MAML) algorithm [12]. At test time, we fine-tune the meta-learned model on a few samples (20 images) to adapt quickly (approx. 100 s of fine-tuning) to the unseen subject. Our method produces significantly better results (Fig. 1, second row), with more accurate lip synchronization, better identity preservation, and easier adaptation to unseen subjects than the state-of-the-art techniques. Figure 2 shows a conceptual diagram of our approach. The contributions of our work can be summarized as follows:

  1.

    We design a GAN for learning canonical facial landmark motion from speech using DeepSpeech features. The GAN helps learn subtle lip deformations accurately, while DeepSpeech features and motion learning on a canonical face alleviate the difficulties caused by the variety of person-specific faces and speech. The method is therefore more robust to noise, accents, and differing face geometries.

  2.

    We use model-agnostic meta-learning to train another GAN for texture generation conditioned on the person-specific landmarks. The GAN produces high-fidelity face images from the given landmarks, and because the network is meta-learned, it adapts quickly and more accurately to an unseen subject using only a few examples at the fine-tuning stage.

2 Related Work

Speech-Driven Face Animation: In recent years, many researchers have focused on synthesizing 2D talking-face videos from audio input [3, 4, 7, 28, 30, 32, 36]. The methods most relevant to ours are [4, 7, 28, 31, 32, 36, 37], which animate an entire face from speech. Earlier methods that learn subject-specific 2D facial animation [11, 13, 30] require a large amount of training data for the target subject. The first subject-independent learning method [7] achieves good lip synchronization, but its generated images require additional de-blurring. Hence, GAN-based methods [4, 5, 28, 31, 32, 36] were proposed to generate sharp facial texture in speech-driven 2D facial animation. Although these methods animate the entire face, they mainly target lip synchronization with audio [4, 5, 28, 36], learning disentangled audio representations [22] for robustness to noise and emotional content, and disentangled audio-visual representations [36] to separate identity information from speech [4, 36]. However, these methods do not address other aspects of realism in synthesized face video, such as natural expressions and preservation of the target's identity.

Beyond Lip Synchronization - Realistic Facial Animation: The absence of spontaneous movements such as eye blinks in synthesized face videos can easily be perceived as fake [21]. Recent works [31, 32] address video realism by adversarially learning spontaneous facial gestures such as blinks. However, the generated videos, despite natural expressions, may still imperfectly resemble the target identity, which can also be perceived as fake. To retain facial identity from a given identity image of the target, image attention has been learned with the help of facial landmarks in a hierarchical approach [4]: the audio is used to generate motion on 2D facial landmarks, and the image texture is generated conditioned on the landmarks. Although the generated texture in static facial regions can retain the texture of the identity image, the texture in moving regions, especially the eyes and mouth, can differ from the target identity. Hence, identity-specific texture generation is needed for realistic rendering of a target's talking face.

Fig. 2.
figure 2

Block diagram of our proposed method for speech-driven facial animation.

3 Proposed Methodology

Given an arbitrary speech segment and a set of images of a target face, our objective is to synthesize a realistic, speech-synchronized animation of the target face. Inspired by [4], we capture facial motion in a lower-dimensional space represented by 68 facial landmark points and synthesize texture conditioned on the motion of the predicted landmarks. To this end, we use a GAN-based cascaded learning approach consisting of the following: (1) learning speech-driven motion on 2D facial landmarks independent of identity, (2) learning eye blink motion on landmarks, (3) landmark retargeting to generate a target-specific facial shape along with its motion, and (4) generating facial texture from the motion of the landmarks. Figure 2 shows our overall approach.

3.1 Speech-Driven Motion Generation on Facial Landmarks

Let A be an audio signal represented by a series of overlapping audio windows \(\{W_{t}\,|\,t \in [0,T]\}\) with corresponding feature representations \(\{F_{t}\}\). Our goal is to generate a sequence of facial landmarks \(\{\ell _{t} \in \mathbb {R}^{68 \times 2}\}\) whose motion is driven by the speech. We learn a mapping \(\mathcal {M}_{L}:F_{t} \xrightarrow {} \delta \ell ^{m}_{t}\) that generates speech-induced displacements \(\{\delta \ell ^{m}_{t} \in \mathbb {R}^{68 \times 2}\}\) of a canonical (person-independent) landmark in neutral pose \(\ell ^{m}_{p}\). Learning the speech-related motion on the canonical landmark \(\ell ^{m}_{p}\), which represents the average shape of a face, is effective because it is invariant to any specific facial structure. To generalize well over different voices, accents, etc., we use a pre-trained DeepSpeech [15] model to extract the features \(F_{t} \in \mathbb {R}^{6 \times 29}\).

Adversarial Learning of Landmark Motion. We use an adversarial network l-GAN to learn the speech-induced landmark displacement \(\mathcal {M}_{L}\). The generator network \(G_{L}\) generates displacements \(\{\delta \ell ^{m}_{t}\}\) of a canonical landmark from a neutral pose \(\ell ^{m}_{p}\). Our discriminator \(D_{L}\) takes the resultant canonical landmarks \(\{\ell ^{m}_{t} = \ell ^{m}_{p}+\delta \ell ^{m}_{t}\}\) and the ground-truth canonical landmarks as inputs and learns to distinguish real from fake. The loss functions used for training l-GAN are as follows:

Distance loss: This is \(L_{2}\) loss between generated canonical landmarks \(\{\ell ^{m}_{t}\}\) and ground-truth landmarks \(\{\ell ^{m*}_{t}\}\) for each frame t.

$$\begin{aligned} \mathcal {L}_{dist} = ||\ell ^{m}_{t} - \ell ^{m*}_{t}||^2_2 \end{aligned}$$
(1)

Regularization loss: We use \(L_{2}\) loss between consecutive frames for ensuring temporal smoothness in predicted landmarks.

$$\begin{aligned} \mathcal {L}_{reg} = ||\ell ^{m}_{t} - \ell ^{m}_{t-1}||^2_2 \end{aligned}$$
(2)

Direction Loss: We also impose consistency in the direction of the motion vectors \(v_{t} = \ell ^{m}_{t} - \ell ^{m}_{t-1}\) and \(v^{*}_{t} = \ell ^{m*}_{t} - \ell ^{m*}_{t-1}\) between consecutive frames by:

$$\begin{aligned} \mathcal {L}_{dir} = 1 - \frac{\langle v_{t},\, v^{*}_{t} \rangle }{\Vert v_{t}\Vert _2 \, \Vert v^{*}_{t}\Vert _2} \end{aligned}$$
(3)

GAN Loss: We use an adversarial loss for capturing detailed mouth deformations.

$$\begin{aligned} \mathcal {L}_{gan} = \mathbb {E}_{\ell ^{m*}_{t}} [\log (D_{L}(\ell ^{m*}_{t}))] + \mathbb {E}_{F_{t}}[\log (1-D_{L}(G_{L}(\ell ^{m}_{p},F_{t})))] \end{aligned}$$
(4)

The final objective function which is to be minimized is as follows:

$$\begin{aligned} \mathcal {L}_{motion} = \lambda _{dist} \mathcal {L}_{dist} + \lambda _{reg} \mathcal {L}_{reg} + \lambda _{dir} \mathcal {L}_{dir} + \lambda _{gan} \mathcal {L}_{gan} \end{aligned}$$
(5)

where \(\lambda _{dist}\), \(\lambda _{reg}\), \(\lambda _{dir}\), and \(\lambda _{gan}\) are experimentally set to 1, 0.5, 0.5, and 1, respectively, as presented in the ablation study (Sect. 4.3). A sketch of this combined objective is given below.
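As a concrete illustration, the following is a minimal PyTorch sketch of the generator-side objective in Eq. 5. The tensor shapes, the cosine form of the direction loss (Eq. 3), and the use of only the fake term of Eq. 4 for the generator update are assumptions rather than the exact implementation.

```python
import torch
import torch.nn.functional as F

def motion_loss(pred, gt, d_fake, w_dist=1.0, w_reg=0.5, w_dir=0.5, w_gan=1.0):
    # pred, gt: generated / ground-truth canonical landmark sequences, (T, 68, 2)
    # d_fake: discriminator outputs D_L(G_L(.)) in (0, 1), shape (T,) or (T, 1)

    # Eq. 1: L2 distance between generated and ground-truth landmarks
    l_dist = F.mse_loss(pred, gt)

    # Eq. 2: temporal smoothness between consecutive predicted frames
    l_reg = F.mse_loss(pred[1:], pred[:-1])

    # Eq. 3: direction consistency of per-frame motion vectors (cosine form)
    v_pred = (pred[1:] - pred[:-1]).flatten(1)
    v_gt = (gt[1:] - gt[:-1]).flatten(1)
    l_dir = (1.0 - F.cosine_similarity(v_pred, v_gt, dim=1)).mean()

    # Generator-side term of Eq. 4 (the real-sample term does not depend on G_L)
    l_gan = torch.log(1.0 - d_fake + 1e-8).mean()

    # Eq. 5: weighted sum with the weights reported in Sect. 3.1
    return w_dist * l_dist + w_reg * l_reg + w_dir * l_dir + w_gan * l_gan
```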

3.2 Spontaneous Eye Blink Generation on Facial Landmarks

Eye blinks are essential for the realism of synthesized face animation but do not depend on speech. Therefore, we propose an unsupervised method for generating realistic eye blinks by learning a mapping \(\mathcal {M}_{B}:Z_{t} \xrightarrow {} \delta \ell ^{e}_{t}\) from random noise \(\{Z_{t} \sim \mathcal {N}(\mu ,\,\sigma ^{2})\,|\,t \in [0,T]\}\) to eye landmark displacements \(\{\delta \ell ^{e}_{t} \in \mathbb {R}^{22 \times 2}\}\). Our blink generator network \(G_{B}\) learns the blink pattern and duration through the mapping \(\mathcal {M}_{B}\) and generates a sequence of eye landmark displacements \(\{\delta \ell ^{e}_{t}\}\) on the canonical face by minimizing the Maximum Mean Discrepancy (MMD) [14] loss defined as follows:

$$\begin{aligned} L_{MMD}=\mathbb {E}_{X,X^\prime \sim p}\mathcal {K}(X,X^\prime ) + \mathbb {E}_{Y,Y^\prime \sim q}\mathcal {K}(Y,Y^\prime )- 2\mathbb {E}_{X \sim p, Y \sim q}\mathcal {K}(X,Y) \end{aligned}$$
(6)

where \(\mathcal {K}(x,y)\) is defined as \(\exp (-\frac{|x-y|^2}{2\sigma })\), and X and Y denote samples drawn from the distributions p and q of the ground-truth \(\{\delta \ell ^{e*}_{t}\}\) and generated \(\{\delta \ell ^{e}_{t}\}\) eye landmark motion, respectively. We also use a min-max regularization to ensure that the range of the generated landmark displacements matches the average range of displacements present in the training data. We augment the eye blink with the speech-driven canonical landmark motion (Sect. 3.1) and retarget (Sect. 3.3) the combined landmarks \(\ell ^{M}_{t} = \{\ell ^{m}_{t} \bigcup \ell ^{e}_{t}\}\), where \(\{\ell ^{e}_{t} = \ell ^{e}_{p} + \delta \ell ^{e}_{t} \}\), to obtain the person-specific landmarks \(\{\ell _{t}\}\) used subsequently for texture generation.
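For reference, a minimal PyTorch sketch of the MMD objective in Eq. 6 with the Gaussian kernel defined above; flattening each displacement sequence into a single vector per sample is an assumption about how p and q are sampled.

```python
import torch

def gaussian_kernel(x, y, sigma=1.0):
    # K(x, y) = exp(-|x - y|^2 / (2 * sigma)), evaluated for all pairs
    d2 = torch.cdist(x, y) ** 2            # (n, m) pairwise squared distances
    return torch.exp(-d2 / (2.0 * sigma))

def mmd_loss(real_disp, fake_disp, sigma=1.0):
    # real_disp, fake_disp: ground-truth / generated eye-landmark displacement
    # sequences, flattened to (num_samples, feature_dim); Eq. 6 (biased estimate)
    k_xx = gaussian_kernel(real_disp, real_disp, sigma).mean()
    k_yy = gaussian_kernel(fake_disp, fake_disp, sigma).mean()
    k_xy = gaussian_kernel(real_disp, fake_disp, sigma).mean()
    return k_xx + k_yy - 2.0 * k_xy
```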

3.3 Landmark Retargeting

We retarget the canonical landmarks \(\{\ell ^{M}_{t}\}\) generated by \(G_{L}\) and \(G_{B}\), to person-specific landmarks \(\{\ell _{t}\}\) (used for texture generation) as follows:

$$\begin{aligned} \ell _{t}=\ell _{p} + \delta \ell _{t} \ \text {where,} \ \delta \ell _{t} = \delta \ell ^{'}_{t}* S(\ell _{p})/S(\mathcal {T}(\ell ^{M}_{t})) \ \text {;} \ \delta \ell ^{'}_{t} = \mathcal {T}(\ell ^{M}_{t} )-\mathcal {T}(\ell ^{m}_{p}) \end{aligned}$$
(7)

where \(\ell _{p}\) is the person-specific landmark in neutral pose (extracted from the target image), \(S(\ell ) \in \mathbb {R}^2\) is the scale (height \(\times \) width) of \(\ell \), and \(\mathcal {T}:\ell \xrightarrow {} \ell ^{'}\) denotes a Procrustes (rigid) alignment of \(\ell \) with \(\ell _{p}\).
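A NumPy sketch of the retargeting step in Eq. 7 follows. The closed-form rigid alignment, the per-axis extent used for \(S(\cdot )\), and the variable names are illustrative assumptions, not the exact implementation.

```python
import numpy as np

def rigid_align(src, dst):
    # Procrustes (rigid) alignment T: rotation + translation of `src` onto `dst`
    mu_s, mu_d = src.mean(0), dst.mean(0)
    s, d = src - mu_s, dst - mu_d
    U, S, Vt = np.linalg.svd(s.T @ d)
    D = np.eye(2)
    if np.linalg.det(U @ Vt) < 0:          # avoid reflections
        D[-1, -1] = -1.0
    return s @ (U @ D @ Vt) + mu_d

def scale_hw(lm):
    # S(l): per-axis extent (height, width) of a landmark set
    return lm.max(0) - lm.min(0)

def retarget(l_M_t, l_m_p, l_p):
    # Eq. 7: align the canonical landmarks to the person-specific neutral pose,
    # rescale the resulting displacement to the person's face size, and add it
    # to the person-specific neutral pose l_p.
    aligned_t = rigid_align(l_M_t, l_p)
    aligned_p = rigid_align(l_m_p, l_p)
    delta = (aligned_t - aligned_p) * scale_hw(l_p) / scale_hw(aligned_t)
    return l_p + delta
```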

Fig. 3.
figure 3

State transitions of the fast weights (FW) and global weights (GW) of t-GAN during meta-training. The training schedule: (1) copy the FW to the GW to keep the global state unchanged during training, (2)–(3) update the FW over iterations, (4)–(5) compute the validation loss using the FW, (6) update the GW using the total validation loss, (7) copy the GW for fine-tuning, (8)–(9) update the GW using K sample images, (10) use the updated GW to produce the target subject's face.

3.4 Image Generation from Landmarks

We use the person-specific landmarks \(\{\ell _{t}\}\) containing motion due to the speech and the eye blink to synthesize animated face images \(\{I_{t}\}\) by learning a mapping \(\mathcal {M}_{T}:(\ell _{t},\{\mathcal {I}^{n}\}) \xrightarrow {} I_{t}\) using given target images \(\{\mathcal {I}^{n} | n \in [0,N]\}\).

Adversarial Generation of Image Texture. We use an adversarial network t-GAN to learn the mapping \(\mathcal {M}_{T}\). Our generator network \(G_{T}\) consists of a texture encoder \(E_{I}\) and a landmark encoder-decoder \(E_{L}\) modulated by \(E_{I}\). \(E_{I}\) encodes the texture representation \(e=E_{I}(\mathcal {I}^{n})\) from the N input images. We use Adaptive Instance Normalization [17] to modulate the bottleneck of \(E_{L}\) using e. Finally, we use a discriminator network \(D_{T}\) to distinguish real images from fake ones. The losses for training the t-GAN are as follows:

Reconstruction Loss: \(L_{2}\) distance between synthesized \(\{I_{t}\}\) and GT images \(\{I^{*}_{t}\}\),

$$\begin{aligned} \mathcal {L}_{pix} = ||I_{t} - I^{*}_{t}||^2_2 \end{aligned}$$
(8)

Adversarial Loss: To obtain sharp texture, an adversarial loss is minimized.

$$\begin{aligned} \mathcal {L}_{adv} = \mathbb {E}_{I^{*}_{t}} [\log (D_{T}(\mathcal {I}^{n},I^{*}_{t}))] + \mathbb {E}_{\ell _{t}}[\log (1-D_{T}(\mathcal {I}^{n},G_{T}(\ell _{t},\mathcal {I}^{n})))] \end{aligned}$$
(9)

Perceptual Loss: We use a perceptual loss [18], the difference between feature representations \(vgg_{1}\) and \(vgg_{2}\) of the generated and ground-truth images, obtained from pre-trained VGG19 and VGGFace [26] networks, respectively.

$$\begin{aligned} \mathcal {L}_{feat} = \alpha _{1}||vgg_{1}(I_{t}) - vgg_{1}(I^{*}_{t})||^2_2 + \alpha _{2} ||vgg_{2}(I_{t}) - vgg_{2}(I^{*}_{t})||^2_2 \end{aligned}$$
(10)

The total loss minimized for training \(G_T\) network is defined as,

$$\begin{aligned} \mathcal {L}_{texture} = \lambda _{pix}\mathcal {L}_{pix} + \lambda _{adv}\mathcal {L}_{adv} + \lambda _{feat}\mathcal {L}_{feat} \end{aligned}$$
(11)
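A minimal PyTorch sketch of the texture objective in Eq. 11 is given below. The VGG19 cut-off layer, the assumption of ImageNet-normalized inputs, and the omission of the VGGFace branch of Eq. 10 (which would be handled analogously with a face-recognition network) are simplifications, and the helper names are illustrative.

```python
import torch
import torch.nn.functional as F
from torchvision.models import vgg19

# VGG19 features for the first perceptual term of Eq. 10
# (inputs assumed already ImageNet-normalized; the cut at relu4_4 is an assumption)
_vgg = vgg19(pretrained=True).features[:27].eval()
for p in _vgg.parameters():
    p.requires_grad_(False)

def texture_loss(fake, real, d_fake, w_pix=0.5, w_adv=1.0, w_feat=0.3, a1=0.1):
    l_pix = F.mse_loss(fake, real)                           # Eq. 8
    l_adv = torch.log(1.0 - d_fake + 1e-8).mean()            # generator term of Eq. 9
    l_feat = a1 * F.mse_loss(_vgg(fake), _vgg(real))         # VGG19 part of Eq. 10
    return w_pix * l_pix + w_adv * l_adv + w_feat * l_feat   # Eq. 11
```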

Meta-Learning. We use model-agnostic meta-learning (MAML) [12] to train our t-GAN so that it can quickly adapt to an unknown face at inference time using only a few images. MAML trains on a set of tasks, called episodes. For each task, the numbers of training and validation samples are \(d_{trn}\) and \(d_{qry}\), respectively. For our problem, we define the subject-specific task as \(T^{s} = \{(I^s_{i},l^s_{j}), \cdots ,(I^s_{i_{d_{trn}+d_{qry}}},l^s_{j_{d_{trn}+d_{qry}}})\}\) of the task set \(\{T^{s}\}\), where s is the subject index, \(I^s_{i}\) is the \(i^{th}\) face image of subject s, and \(l^s_{j}\) is the \(j^{th}\) landmark of the same subject. During meta-training, MAML stores the current weights of the t-GAN as global weights and trains the t-GAN with the \(d_{trn}\) samples for m iterations using a constant step size. In each iteration, it measures the loss \(L^i\) on the \(d_{qry}\) validation samples. The total loss \(L=L^1+L^2+ \cdots +L^m\) is then used to update the global weights, as shown in Fig. 3. The resulting global weights encode information shared across all tasks and are used to initialize the network for fine-tuning at inference. A first-order sketch of this schedule is given below.
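The sketch follows the schedule in Fig. 3; `loss_fn` and the task batching are placeholders, and full MAML would additionally back-propagate through the inner updates, which is omitted here for brevity.

```python
import copy
import torch

def meta_train_step(t_gan, tasks, loss_fn, meta_opt, inner_lr=1e-3, m=5):
    # tasks: iterable of (d_trn, d_qry) batches, one pair per subject-specific
    # task T^s; loss_fn(model, batch) evaluates Eq. 11 on that batch (placeholder)
    params = list(t_gan.parameters())
    global_state = copy.deepcopy(t_gan.state_dict())        # (1) keep GW fixed
    meta_grads = [torch.zeros_like(p) for p in params]
    for d_trn, d_qry in tasks:
        t_gan.load_state_dict(global_state)                 # fast weights start at GW
        for _ in range(m):
            loss = loss_fn(t_gan, d_trn)                    # (2)-(3) inner update of FW
            grads = torch.autograd.grad(loss, params, allow_unused=True)
            with torch.no_grad():
                for p, g in zip(params, grads):
                    if g is not None:
                        p.sub_(inner_lr * g)                # constant step size
            val_loss = loss_fn(t_gan, d_qry)                # (4)-(5) loss L^i on d_qry
            vgrads = torch.autograd.grad(val_loss, params, allow_unused=True)
            for mg, g in zip(meta_grads, vgrads):           # accumulate total loss L
                if g is not None:
                    mg.add_(g)
    t_gan.load_state_dict(global_state)                     # restore GW
    meta_opt.zero_grad()
    for p, mg in zip(params, meta_grads):
        p.grad = mg
    meta_opt.step()                                         # (6) update GW
```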

During fine-tuning, we initialize the t-GAN from the global weights and update them by minimizing the loss described in Eq. 11. We use a few (\(K=20\)) example images of the target face for fine-tuning.

4 Experimental Results

In this section, we present experimental results of our proposed method on different datasets, along with a network ablation study. We also show that the accuracy of our cascaded GAN-based approach is considerably higher than that of an alternative regression-based motion and texture generation. Our meta-learning-based texture generation strategy makes our method more adaptable to unknown faces. The combined result is facial animation from speech that is significantly better than the state-of-the-art, both quantitatively and qualitatively. In what follows, we present detailed experiments for each building block of our pipeline.

4.1 Datasets

We use the TCD-TIMIT [16], GRID [9], and VoxCeleb [24] datasets for our experiments. We train our model only on TCD-TIMIT and test it on GRID as well as on our own recorded data to show the efficacy of our method in cross-dataset settings with completely unknown faces. Our training split contains 3378 videos from 49 subjects, with around 6913 sentences uttered in a limited variety of accents. The test splits (same as [32]) of TCD-TIMIT and GRID contain 1631 and 9957 videos, respectively.

4.2 Motion Generation on Landmarks

Network Architecture of l-GAN: The generator network \(G_L\) of l-GAN builds on the encoder-decoder architecture used in [10] for generating mesh vertices. A LeakyReLU [33] activation follows each layer of the encoder network. The input DeepSpeech features are encoded into a 33-dimensional vector (PCA coefficients), which is decoded to obtain the canonical landmark displacements from the neutral pose. The discriminator network \(D_{L}\) consists of 2 linear layers, which re-encode the predicted or ground-truth landmarks into PCA coefficients to discriminate between real and fake. We initialize the weights of the last layer of the decoder in \(G_L\) and the first layer of \(D_{L}\) with the 33 PCA components computed over the landmark displacements in the training data.

Network Architecture of Blink Generator \(G_{B}\): We use an RNN to predict a sequence of displacements \(\mathbb {R}^{n \times 75 \times 44}\), i.e., the x, y coordinates of the eye landmarks \(\{\ell ^{e}_{t} \in \mathbb {R}^{22 \times 2}\}\) over 75 timesteps, from a given noise vector \(z \sim \mathcal {N}(\mu ,\,\sigma ^{2})\) with \(z \in \mathbb {R}^{n \times 75 \times 10}\). As for \(G_L\) of our \(\textit{l-GAN}\) network, the weights of the last linear layer are initialized with PCA components (capturing 99% of the variance) computed over the ground-truth eye landmark displacements.

Training Details: We extract audio features from the second-to-last layer (before the softmax) of the DeepSpeech [15] network. We consider sliding windows of \(\varDelta t\) features to provide a temporal context to each video frame (see the sketch below). To compute the accurate ground-truth facial landmarks required for training, we experimented with several state-of-the-art methods [1, 19, 34] and found the combination of OpenFace [1] and face segmentation [34] to be the most effective for our purpose. Our speech-driven motion generation network is trained on the TCD-TIMIT dataset. The canonical landmarks used for training l-GAN are generated by inverting the landmark retargeting procedure described in Sect. 3.3. We train our \(\textit{l-GAN}\) network with a batch size of 6. The losses saturate after 40 epochs, which takes around 3 h on a single GPU of a Quadro P5000 system. We use Adam [20] optimization with a learning rate of \(2e-4\) for training both the \(\textit{l-GAN}\) and the blink generator network.
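A small NumPy sketch of this temporal-context windowing; it assumes the per-frame DeepSpeech features (29-dimensional logits) have already been resampled to the video frame rate, so that each frame receives a \(\varDelta t \times 29\) window, matching \(F_{t} \in \mathbb {R}^{6 \times 29}\) for \(\varDelta t=6\).

```python
import numpy as np

def frame_context_windows(ds_feats, delta_t=6):
    # ds_feats: per-video-frame DeepSpeech features, shape (T, 29)
    # returns one (delta_t, 29) context window F_t per video frame, edge-padded
    pad = delta_t // 2
    padded = np.pad(ds_feats, ((pad, delta_t - pad - 1), (0, 0)), mode='edge')
    return np.stack([padded[t:t + delta_t] for t in range(len(ds_feats))])
```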

Fig. 4.
figure 4

Performance comparison of l-GAN using only the generator (third row) and the complete GAN (fifth row). The regression-based approach cannot capture finer details of lip motion, such as "a" and "o", without the help of the discriminator.

Quantitative Results: We present our quantitative results in Tables 1 and 2. For comparative analysis, we use the publicly available pre-trained models of the state-of-the-art methods [4, 32, 36]. Our model is trained on TCD-TIMIT [16], while the models of [4] and [36] are pre-trained on the LRW [8] dataset. The model of [32] is trained on TCD-TIMIT and GRID separately.

For evaluating and comparing the accuracy of lip synchronization produced by our method, we use a) LMD, the Landmark Distance (as used in [3, 4], sketched below), and b) audio-visual synchronization metrics (AV offset and AV confidence produced by SyncNet [6]). For all methods, LMD is computed using lip landmarks extracted from the final generated frames; a lower LMD and AV offset with higher AV confidence indicate better lip synchronization. Our method shows better accuracy than the state-of-the-art methods. Our models trained on TCD-TIMIT also show good generalization capability in cross-dataset evaluation on the GRID dataset (Table 2). Although [4] also generates facial landmarks from audio features (MFCC), unlike their regression-based approach, our use of DeepSpeech features, landmark retargeting, and adversarial learning results in improved accuracy of landmark generation.
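A minimal sketch of how the LMD metric can be computed from the mouth landmarks; the absence of any normalization is an assumption, as normalization conventions differ across papers.

```python
import numpy as np

def landmark_distance(pred_lips, gt_lips):
    # pred_lips, gt_lips: (T, 20, 2) mouth points of the 68-point layout
    # LMD: Euclidean distance averaged over landmarks and frames
    return np.linalg.norm(pred_lips - gt_lips, axis=-1).mean()
```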

Fig. 5.
figure 5

Statistics of average blink duration.

Moreover, our facial landmarks contain natural eye blink motion for added realism. We detect eye blinks from a sharp drop in the EAR (Eye Aspect Ratio) signal [32], calculated from the landmarks of the eye corners and eyelids; a sketch of this detection follows. Blink duration is the number of consecutive frames between the start and end of the sharp drop in the EAR. The average blink duration and blink frequency generated by our method are similar to those of natural human blinks. Our method produces blink rates of 0.3 blinks/s and 0.38 blinks/s (Tables 1 and 2) on the TCD-TIMIT and GRID datasets, respectively, which is similar to the average human blink rate of 0.28–0.4 blinks/s. We also achieve average blink durations of 0.33 s and 0.4 s, similar to those of the ground truth (Tables 1 and 2). In Fig. 5, we present the distribution of blink durations (in number of frames) in the synthesized videos of the GRID and TCD-TIMIT datasets. Hence, our method produces realistic eye blinks similar to [32], but with better identity-preserved texture, owing to our decoupled learning of eye blinks on landmarks.
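The EAR-based detection can be sketched as follows; the specific threshold value is a typical choice and an assumption, not necessarily the one used in our experiments.

```python
import numpy as np

def eye_aspect_ratio(eye):
    # eye: (6, 2) landmarks of one eye in the standard 68-point ordering (p1..p6)
    # EAR = (|p2 - p6| + |p3 - p5|) / (2 |p1 - p4|)
    v1 = np.linalg.norm(eye[1] - eye[5])
    v2 = np.linalg.norm(eye[2] - eye[4])
    h = np.linalg.norm(eye[0] - eye[3])
    return (v1 + v2) / (2.0 * h)

def detect_blinks(ear_seq, thresh=0.2):
    # Count blinks and their durations as runs of consecutive frames where the
    # EAR drops below a threshold (sharp drop in the EAR signal).
    below = ear_seq < thresh
    durations, run = [], 0
    for b in below:
        if b:
            run += 1
        elif run > 0:
            durations.append(run)
            run = 0
    if run > 0:
        durations.append(run)
    return len(durations), durations
```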

Ablation Study: An ablation study of the window size \(\varDelta t\) (Fig. 6) indicates that \(\varDelta t=6\) frames (a duration of around 198 ms) yields the lowest LMD. In Fig. 6, we also present an ablation study of the different losses used for training our motion prediction network; the proposed loss \(\mathcal {L}_{motion}\) achieves the best accuracy. The \(L_{2}\) regularization loss helps achieve temporal smoothness and consistency of the predicted landmarks over consecutive frames. The direction loss (Eq. 3) captures the relative movement of landmarks over consecutive frames and helps the landmark prediction network converge faster. DeepSpeech features help us achieve robust lip synchronization even for audio with noise, different accents, and different languages (please refer to the supplementary video). We evaluate the robustness of our l-GAN to different levels of noise by adding synthetic noise to the audio input: Fig. 6 shows that up to \(-30\) dB the lip motion is unaffected by the noise, and it starts degrading afterward. In Fig. 4, we present a qualitative result of the landmark generation network on the TCD-TIMIT dataset, showing the effectiveness of the discriminator in l-GAN.

Fig. 6.
figure 6

Left: Landmark Distance (LMD) with varying context window (\(\varDelta t\)) of DeepSpeech features. Middle: LMD with different losses used for training the speech-driven motion generation network. Right: Error in lip synchronization (LMD) at different noise levels.

4.3 Texture Generation from Landmark Motion

Network Architecture of t-GAN: We adapt the image-to-image translation approach proposed in [18] to implement our texture generator \(G_{T}\). Our landmark encoder-decoder network \(E_{L}\) takes the generated person-specific landmarks, represented as images of size \(\mathbb {R}^{3 \times 256 \times 256}\), and \(E_{I}\) takes the face images of the target subject channel-wise concatenated with the corresponding landmark images. We use six downsampling layers for \(E_{I}\) and for the encoder of \(E_{L}\), and six upsampling layers for the decoder of \(E_{L}\). To generate high-fidelity images, we use residual blocks in the downsampling and upsampling layers, similar to [2]. We apply instance normalization in the residual blocks and adaptive instance normalization at the bottleneck layer of \(E_{L}\) (sketched below), using the activation produced by the last layer of \(E_{I}\). Moreover, to generate sharper images, we use a self-attention mechanism similar to [35] at the \(32\times 32\) activations of the downsampling and upsampling layers. Our discriminator network \(D_{T}\) consists of 6 residual blocks similar to \(E_{I}\), followed by max pooling and a fully connected layer. To stabilize GAN training, we use spectral normalization [23] in both the generator and the discriminator.
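The AdaIN modulation can be realized as in the following minimal sketch, in which the texture code e is mapped to per-channel scale and shift by two linear layers; this parameterization and the layer names are assumptions rather than the exact design.

```python
import torch.nn as nn

class AdaIN(nn.Module):
    # Adaptive Instance Normalization: the texture code e = E_I(I^n) modulates
    # the bottleneck features of the landmark encoder-decoder E_L.
    def __init__(self, feat_channels, embed_dim):
        super().__init__()
        self.norm = nn.InstanceNorm2d(feat_channels, affine=False)
        self.to_scale = nn.Linear(embed_dim, feat_channels)   # illustrative names
        self.to_shift = nn.Linear(embed_dim, feat_channels)

    def forward(self, x, e):
        # x: (B, C, H, W) bottleneck features, e: (B, embed_dim) texture code
        gamma = self.to_scale(e)[:, :, None, None]
        beta = self.to_shift(e)[:, :, None, None]
        return gamma * self.norm(x) + beta
```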

Training and Testing Details: We meta-train our t-GAN network using ground-truth landmarks, following a teacher-forcing strategy. We use a fixed step size [12] of \(1e-3\) and Adam as the meta-optimizer [12] with a learning rate of \(1e-4\). The values of \(\alpha _{1}\), \(\alpha _{2}\), \(\lambda _{pix}\), \(\lambda _{adv}\), and \(\lambda _{feat}\) are experimentally set to \(1e-1\), \(2e-3\), 0.5, 1.0, and 0.3, respectively. At test time, we use 5 images of the target identity and the person-specific landmarks generated by the l-GAN to produce the output images. Before testing, we fine-tune the meta-trained network using 20 images of the target person and the corresponding ground-truth landmarks. We use a clustered GPU of NVIDIA Tesla V100 for meta-training and a Quadro P5000 for fine-tuning our network.

Table 1. Comparative results on TCD-TIMIT [16].
Table 2. Comparative results on GRID [9] (our cross-dataset evaluation).

Quantitative Results: Here, we present the comparative performance of our GAN-based texture generation network against the most recent state-of-the-art methods [4, 36] and [32]. As with the l-GAN, the t-GAN is trained on TCD-TIMIT and evaluated on the test splits of GRID and TCD-TIMIT and on unknown subjects. We compute the performance metrics PSNR, SSIM (Structural Similarity), CPBD (Cumulative Probability of Blur Detection) [25], ACD (Average Content Distance) [32], and the similarity between FaceNet [27] features of the reference identity image (first frame of the ground-truth video) and the predicted frames. Our method outperforms (Tables 1 and 2) the state-of-the-art methods on all datasets, indicating better image quality. Due to the inaccessibility of the LRW [8] dataset, we evaluated our texture generation method on the VoxCeleb [24] dataset, obtaining average PSNR, SSIM, and CPBD of 25.2, 0.63, and 0.11, respectively. Our method does not produce head motion and synthesizes texture for a frontal face; hence, on VoxCeleb it performs worse than on TCD-TIMIT and GRID.

Qualitative Results: Figure 10 shows a qualitative comparison against [4, 36] and [32]. It can be seen that [32] and [36] fail to preserve the identity of the test subject over frames in the synthesized video. Although [4] can preserve the identity, there is significant blur, especially around the mouth region, and the output lacks any natural movement over the face except lip or jaw motion, yielding unrealistic face animation. In contrast, our method synthesizes high-fidelity images (\(256 \times 256\)) with preserved identity and natural eye motion. Figure 7 compares our GAN-based texture generation against the output of a regression-based (without discriminator) network; it is evident that our GAN-based network gives more accurate lip deformation, with motion similar to the ground truth.

Fig. 7.
figure 7

Qualitative comparison between our t-GAN based method (Row 2) and the regression-based generator \(G_{T}\) (Row 3). Using a GAN results in more accurate mouth shapes.

Table 3. Ablation study of our model. CC = channel-wise concatenation.
Table 4. Epoch-wise quantitative analysis in fine-tuning.
Fig. 8.
figure 8

Comparison between the fine-tuning stage of meta-learning and transfer learning. Meta-learning (black) provides a better initialization than transfer learning (blue). (Color figure online)

Fig. 9.
figure 9

Ablation study on the number of images used during fine-tuning on the GRID dataset.

Fig. 10.
figure 10

Qualitative comparison with the latest state-of-the-art methods on the TCD-TIMIT dataset (upper 10 rows) and the GRID dataset (lower 5 rows). Our results show improved identity preservation of the subject, good lip synchronization, detailed texture (such as teeth), less blur, and the presence of randomly introduced eye blinks.

Ablation Study: We show a detailed ablation study on the TCD-TIMIT dataset to examine the effect of the different losses (Table 3). Between channel-wise concatenation and adaptive instance normalization, the two common conditioning approaches in neural style transfer, adaptive instance normalization works better for our problem. Figure 7 and the quantitative results (Table 3) show that the GAN-based method produces more accurate lip deformation than the regression-based method, which always produces an overly smooth outcome. Figure 9 shows the ablation study on the number of images required for fine-tuning; using a single image for fine-tuning yields average PSNR, SSIM, and CPBD values of 27.95, 0.82, and 0.27, respectively, on the GRID dataset. Our method produces accurate motion and texture after 10 epochs (Table 4) of fine-tuning with \(K=20\) sample images.

Meta-Learning vs. Transfer-Learning: We compare the performance of MAML [12] and transfer learning for our problem. To this end, we train a model with the same architecture until it converges to loss values similar to those of meta-learning. After 10 epochs of fine-tuning with 20 images, the loss of the meta-learned model is much lower (Fig. 8b) than that of transfer learning (fine-tuning), and it produces significantly better visual results (Fig. 8a). Moreover, fine-tuning the meta-learned network takes only about 10 epochs with 20 images, far fewer than transfer-learning-based fine-tuning requires.

User Study: We assess the realism of our animations through a user study in which 25 users rated 30 synthesized videos (10 from each method), randomly selected from TCD-TIMIT and GRID, on a scale from 1 (fake) to 10 (real). Our method achieves a better realism score, with an average of \(72.76\%\), compared to the state-of-the-art methods [4] and [32], with average scores of \(58.48\%\) and \(61.29\%\), respectively.

5 Conclusion

In this paper, we present a novel strategy for speech-driven facial animation. Our method produces realistic facial animation for unknown subjects across different languages and accents in speech, demonstrating its generalization capability. We attribute this advantage to our separate learning of the motion and texture generation GANs, combined with meta-learning. As a result, our method significantly outperforms state-of-the-art methods. In the future, we would like to study the effect of meta-learning for learning landmark motion from speech to mimic personalized speaking styles.