Speech-Driven Facial Animation Using Cascaded GANs for Learning of Motion and Texture

Das, Dipanjan; Biswas, Sandika; Sinha, Sanjana; Bhowmick, Brojeshwar

doi:10.1007/978-3-030-58577-8_25

Dipanjan Das¹²,
Sandika Biswas¹²,
Sanjana Sinha¹² &
…
Brojeshwar Bhowmick¹²

Part of the book series: Lecture Notes in Computer Science ((LNIP,volume 12375))

Included in the following conference series:

European Conference on Computer Vision

3946 Accesses
53 Citations

Abstract

Speech-driven facial animation methods should produce accurate and realistic lip motions with natural expressions and realistic texture portraying target-specific facial characteristics. Moreover, the methods should also be adaptable to any unknown faces and speech quickly during inference. Current state-of-the-art methods fail to generate realistic animation from any speech on unknown faces due to their poor generalization over different facial characteristics, languages, and accents. Some of these failures can be attributed to the end-to-end learning of the complex relationship between the multiple modalities of speech and the video. In this paper, we propose a novel strategy where we partition the problem and learn the motion and texture separately. Firstly, we train a GAN network to learn the lip motion in a canonical landmark using DeepSpeech features and induce eye-blinks before transferring the motion to the person-specific face. Next, we use another GAN based texture generator network to generate high fidelity face corresponding to the motion on person-specific landmark. We use meta-learning to make the texture generator GAN more flexible to adapt to the unknown subject’s traits of the face during inference. Our method gives significantly improved facial animation than the state-of-the-art methods and generalizes well across the different datasets, different languages, and accents, and also works reliably well in presence of noises in the speech.

D. Das and S. Biswas—Equal contribution.

You have full access to this open access chapter, Download conference paper PDF

Modular Joint Training for Speech-Driven 3D Facial Animation

Synthesis of Photo-Realistic Facial Animation from Text Based on HMM and DNN with Animation Unit

3D facial animation driven by speech-video dual-modal signals

Article Open access 23 May 2024

Keywords

1 Introduction

Speech-driven facial animation can be used for many applications such as video games, virtual assistants, animation movies, etc. and has thus garnered broad interest. The problem of generating such facial animation is multifaceted, requiring accurate lip-sync, a natural expression like eye blinks, head orientations, capturing subject-specific traits like identity, lip deformations, etc. Also, the generation of such animation should not be overly dependent on the training set, and the method, therefore, should be adaptable to unknown faces and speeches quickly. Existing end-to-end learning methods [32, 36] show poor adaptability given an unknown speech or face resulting in implausible animation. In order to overcome the problems of generating images directly from speech, Chen et al. [4] learn an intermediate high-level representation of motion from audio followed by texturing. Although this method preserves the identity but fails to produce accurate and realistic lip synchronization, as shown in Fig. 1 (the last row). On the other hand, [32] produces plausible lip motion but renders incorrect identity, as shown in Fig. 1 (third row). Therefore, the key challenges existing in the talking face problem are i) accurate lip synchronization along with identity preservation, ii) presence of natural expression like eye blinks, iii) fast adaptation to unknown subjects, and speeches for all practical purposes. Figure 1 shows that none of the most recent state-of-the-art methods produce animations which solves all the above challenges.

In this paper, we propose a novel strategy to solve the above-mentioned challenges. In essence, our method partitions the problem into four stages. First, we design a GAN network for learning motion on canonical (person-independent) landmark from DeepSpeech features obtained from audio. GAN is powerful in learning the subtle deformations in lips due to speech, and learning motion in a canonical face makes the method invariant to the person-specific face geometry. Along with this, DeepSpeech features alleviates the problems due to different accents and noises. With all these together, our method is able to learn motion from speech robustly and also adaptable to the unknown speech. Next, we impose eye blinks predicted from a separate network and transfer this learned canonical facial landmark motion to person-specific landmark motion using Procrustes alignment [29]. Subsequently, we train another GAN network for texture generation conditioning with the person-specific landmark. For better adaptation to the unknown subject and unknown head orientation, we meta-learn this GAN network using Model-Agnostic-Meta-Learning (MAML) algorithm [12]. At test time, we fine-tune the meta-learned model with few samples (20 images) to adapt quickly (approx. 100 secs fine-tuning) to the unseen subject. Our method produces significantly better results (Fig. 1, second row) with more accurate lip synchronization, better identity preservation, and easy adaptation to the unseen subjects over the state-of-the-art techniques. Figure 2 shows a conceptual diagram of our approach. The contributions of our work can be summarized as follows:

1.
We design a GAN network for learning canonical facial landmark motion from a speech by using DeepSpeech features. The use of GAN helps to learn subtle deformations in lips accurately. DeepSpeech and motion learning in canonical face alleviates the problems in learning due to the variety of person-specific faces and speeches. Therefore the method is more robust to noises, accent, and different face geometry.
2.
We use model-agnostic-meta-learning to train another GAN for texture generation conditioned on the person-specific texture. GAN produces high fidelity face images from given landmarks and because the network is meta-learned it provides quick as well as a better adaptation to the unseen subject using a few examples at the fine-tuning stage.

2 Related Work

Speech-Driven Face Animation: In recent years many researchers have focused on the synthesis of 2D talking face video from audio input [3, 4, 7, 28, 30, 32, 36]. The methods which are most relevant to us are [4, 7, 28, 31, 32, 36, 37] which animate an entire face from speech. Earlier methods that learn subject-specific 2D facial animation [11, 13, 30] require a large amount of training data of the target subject. The first subject-independent learning method [7] achieves good lip synchronization, but images generated require additional de-blurring. Hence GAN-based methods [4, 5, 28, 31, 32, 36] were proposed for generating sharp facial texture in speech-driven 2D facial animation. Although these methods animate the entire face, they mainly target lip synchronization with audio [4, 5, 28, 36], by learning disentangled audio representations [22] for robustness to noise and emotional content in audio, and disentangled audio-visual representations [36] to segregate identity information from speech [4, 36]. However, these methods have not addressed the other aspects for the realism of synthesized face video, such as natural expressions, identity preservation of target, etc.

Beyond Lip Synchronization - Realistic Facial Animation: The absence of spontaneous movements such as eye blinks in synthesized face videos can be easily perceived as fake [21]. Recent works [31, 32] have tried to address the problem of video realism by using adversarial learning of spontaneous facial gestures such as blinks. However, the generated videos with natural expressions may still imperfectly resemble the target identity, which can also be perceived as being fake. To retain facial identity information from the given identity image of target, image attention has been learnt with the help of facial landmarks in a hierarchical approach [4]. In this approach [4] the audio is used to generate motion on 2D facial landmarks, and the image texture is generated by conditioning on the landmarks. Although the generated texture in static facial regions can retain the texture from the identity image, the generated texture in regions of motion, especially the eyes and mouth, can differ from the target identity. Hence identity-specific texture generation is needed for realistic rendering of a target’s talking face.

3 Proposed Methodology

Given an arbitrary speech and a set of images of a target face, our objective is to synthesize speech synchronized realistic animation of the target face. Inspired by [4], we capture facial motion in a lower dimension space represented by 68 facial landmark points and synthesize texture conditioned on the motion of predicted landmarks. To this end, we use a GAN based cascaded learning approach consisting of the following: (1) Learning speech-driven motion on 2D facial landmarks independent of identity, (2) Learning eye blink motion on landmarks, (3) Landmark retargeting to generate target-specific facial shape along with motion, (4) Generating facial texture from the motion of landmarks. Figure 2 shows our overall approach.

3.1 Speech-Driven Motion Generation on Facial Landmarks

Let, A be an audio signal represented by a series of overlapping audio windows $\{W_{t}|t \in [0,T]\}$ with corresponding feature representations $\{F_{t}\}$. Our goal is to generate a sequence of facial landmarks $\{\ell _{t} \in \mathbb {R}^{68 \times 2}\}$ corresponding to the motion driven by speech. We learn a mapping $\mathcal {M}_{L}:F_{t} \xrightarrow {} \delta \ell ^{m}_{t}$ to generate speech-induced displacement $\{\delta \ell ^{m}_{t} \in \mathbb {R}^{68 \times 2}\}$ on a canonical landmark (person-independent) in neutral pose $\ell ^{m}_{p}$. Learning the speech-related motion on a canonical landmark $\ell ^{m}_{p}$, which represents the average shape of a face, is effective due to the invariance of any specific facial structure. In order to generalize well over different voices, accent etc. we use a pre-trained DeepSpeech [15] model to extract the feature $F_{t} \in \mathbb {R}^{6 \times 29}$.

Adversarial Learning of Landmark Motion. We use an adversarial network l-GAN to learn the speech-induced landmark displacement $\mathcal {M}_{L}$. The generator network $G_{L}$ generates displacements $\{\delta \ell ^{m}_{t}\}$ of a canonical landmark from a neutral pose $\ell ^{m}_{p}$. Our discriminator $D_{L}$ takes the resultant canonical landmarks $\{\ell ^{m}_{t} = \ell ^{m}_{p}+\delta \ell ^{m}_{t}\}$ and the ground-truth canonical landmarks as inputs to learn the real against fake. The loss functions used for training l-GAN are as follows:

Distance loss: This is $L_{2}$ loss between generated canonical landmarks $\{\ell ^{m}_{t}\}$ and ground-truth landmarks $\{\ell ^{m*}_{t}\}$ for each frame t.

$$\begin{aligned} \mathcal {L}_{dist} = ||\ell ^{m}_{t} - \ell ^{m*}_{t}||^2_2 \end{aligned}$$

(1)

Regularization loss: We use $L_{2}$ loss between consecutive frames for ensuring temporal smoothness in predicted landmarks.

$$\begin{aligned} \mathcal {L}_{reg} = ||\ell ^{m}_{t} - \ell ^{m}_{t-1}||^2_2 \end{aligned}$$

(2)

Direction Loss: We also impose a consistency in the motion vectors () by:

(3)

GAN Loss: We use an adversarial loss for capturing detailed mouth deformations.

$$\begin{aligned} \mathcal {L}_{gan} = \mathbb {E}_{\ell ^{m*}_{t}} [log(D_{L}(\ell ^{m*}_{t})] + \mathbb {E}_{F_{t}}[log (1-D_{L}(G_{L}(\ell ^{m}_{p},F_{t}))] \end{aligned}$$

(4)

The final objective function which is to be minimized is as follows:

$$\begin{aligned} \mathcal {L}_{motion} = \lambda _{dist} \mathcal {L}_{dist} + \lambda _{reg} \mathcal {L}_{reg} + \lambda _{dir} \mathcal {L}_{dir} + \lambda _{gan} \mathcal {L}_{gan} \end{aligned}$$

(5)

where, $\lambda _{dist}$, $\lambda _{reg}$, $\lambda _{dir}$, $\lambda _{gan}$ are experimentally set to 1, 0.5, 0.5 and 1, as presented in the ablation study (Sect. 4.3).

3.2 Spontaneous Eye Blink Generation on Facial Landmarks

Eye blinks are essential for realism of synthesized face animation, but not dependent on speech. Therefore, we propose an unsupervised method for generation of realistic eye blinks through learning a mapping $\mathcal {M}_{B}:Z_{t} \xrightarrow {} \delta \ell ^{e}_{t}$ from a random noise ${Z_{t} \sim \mathcal {N}(\mu ,\,\sigma ^{2})|t \in (0,T) }$ to eye landmark displacements $\{\delta \ell ^{e}_{t} \in \mathbb {R}^{22 \times 2}\}$. Our blink generator network $G_{B}$ learns the blink pattern and duration through the mapping $\mathcal {M}_{B}$ and generates a sequence of eye landmark displacements $\{\delta \ell ^{e}_{t}\}$ on the canonical face by minimizing the MMD (Maximum Mean Discrepancy) [14] loss defined as follows:

$$\begin{aligned} L_{MMD}=\mathbb {E}_{X,X^\prime \sim p}\mathcal {K}(X,X^\prime ) + \mathbb {E}_{Y,Y^\prime \sim q}\mathcal {K}(Y,Y^\prime )- 2\mathbb {E}_{X \sim p, Y \sim q}\mathcal {K}(X,Y) \end{aligned}$$

(6)

where, $\mathcal {K}(x,y)$ is defined as $exp(-\frac{|x-y|^2}{2\sigma })$, p and q represents samples from distributions X and Y of GT $\{\delta \ell ^{e*}_{t}\}$ and generated eye landmark motion $\{\delta \ell ^{e}_{t}\}$ respectively. We also use a Min-max regularization to ensure that the range of the generated landmarks matches with the average range of average displacements present in the training data. We augment the eye blink with the speech-driven canonical landmark motion (Sect. 3.1) and retarget (Sect. 3.3) the combined landmarks $\ell ^{M}_{t}$ = $\{\ell ^{m}_{t} \bigcup \ell ^{e}_{t}\} $, where $\{\ell ^{e}_{t} = \ell ^{e}_{p} + \delta \ell ^{e}_{t} \}$, to generate the person-specific landmarks $\{\ell _{t}\}$ for subsequent use for texture generation.

3.3 Landmark Retargeting

We retarget the canonical landmarks $\{\ell ^{M}_{t}\}$ generated by $G_{L}$ and $G_{B}$, to person-specific landmarks $\{\ell _{t}\}$ (used for texture generation) as follows:

$$\begin{aligned} \ell _{t}=\ell _{p} + \delta \ell _{t} \ \text {where,} \ \delta \ell _{t} = \delta \ell ^{'}_{t}* S(\ell _{t})/S(\mathcal {T}(\ell ^{M}_{t})) \ \text {;} \ \delta \ell ^{'}_{t} = \mathcal {T}(\ell ^{M}_{t} )-\mathcal {T}(\ell ^{m}_{p}) \end{aligned}$$

(7)

where, $\ell _{p}$ is the person-specific landmark in neutral pose (extracted from the target image), $S(\ell ) \in \mathbb {R}^2$ is the scale (height $\times $ width) of $\ell $ and $\mathcal {T}:\ell \xrightarrow {} \ell ^{'}$ represents a Procrustes (rigid) alignment of $\ell $ with $\ell _{p}$.

3.4 Image Generation from Landmarks

We use the person-specific landmarks $\{\ell _{t}\}$ containing motion due to the speech and the eye blink to synthesize animated face images $\{I_{t}\}$ by learning a mapping $\mathcal {M}_{T}:(\ell _{t},\{\mathcal {I}^{n}\}) \xrightarrow {} I_{t}$ using given target images $\{\mathcal {I}^{n} | n \in [0,N]\}$.

Adversarial Generation of Image Texture. We use an adversarial network t-GAN to learn the mapping $\mathcal {M}_{T}$. Our generator network $G_{T}$ consists of a texture encoder $E_{I}$ and landmark encoder-decoder $E_{L}$ influenced by $E_{I}$. $E_{I}$ encodes the texture representation as $e=E_{I}(\mathcal {I}^{n})$ for the input N images. We use Adaptive Instance Normalization [17] to modulate the bottleneck of $E_{L}$ using e. Finally we use a discriminator network $D_{T}$ to discriminate the real images from the fake. The losses for training the t-GAN are as follows:

Reconstruction Loss: $L_{2}$ distance between synthesized $\{I_{t}\}$ and GT images $\{I^{*}_{t}\}$,

$$\begin{aligned} \mathcal {L}_{pix} = ||I_{t} - I^{*}_{t}||^2_2 \end{aligned}$$

(8)

Adversarial Loss: For sharpness of the texture an adversarial loss is minimized.

$$\begin{aligned} \mathcal {L}_{adv} = \mathbb {E}_{I^{*}_{t}} [log(D_{T}(\mathcal {I}^{n},I^{*}_{t})] + \mathbb {E}_{\ell _{t}}[log (1-D_{T}(\mathcal {I}^{n},G_{T}(\ell _{t},\mathcal {I}^{n}))] \end{aligned}$$

(9)

Perceptual Loss: We use a perceptual loss [18] which is the difference in feature representations $vgg_{1}$ and $vgg_{2}$ of generated and ground truth images obtained using pre-trained VGG19 and VGGFace [26] respectively.

$$\begin{aligned} \mathcal {L}_{feat} = \alpha _{1}||vgg_{1}(I_{t}) - vgg_{1}(I^{*}_{t})||^2_2 + \alpha _{2} ||vgg_{2}(I_{t}) - vgg_{2}(I^{*}_{t})||^2_2 \end{aligned}$$

(10)

The total loss minimized for training $G_T$ network is defined as,

$$\begin{aligned} \mathcal {L}_{texture} = \lambda _{pix}\mathcal {L}_{pix} + \lambda _{adv}\mathcal {L}_{adv} + \lambda _{feat}\mathcal {L}_{feat} \end{aligned}$$

(11)

Meta-Learning. We use model-agnostic meta-learning (MAML) [12] to train our t-GAN for quick adaptation to the unknown face at inference time using few images. MAML trains on a set of tasks T called episodes. For each task, the number of samples for training and validation is $d_{trn}$ and $d_{qry}$, respectively. For our problem, we define subject specific task as $T^{s} ={(I^s_{i},l^s_{j}) \cdots (I^s_{{i}_{d_{trn}+d_{qry}}},l^s_{{j}_{d_{trn}+d_{qry}}})}$ of task set $\{T^{s}\}$, where s is the subject index, $I^s_{i}$ is the $i^{th}$ face image for subject s, $l^s_{j}$ is the $j^{th}$ landmark for the same subject s. During meta-training, MAML store the current weights of the t-GAN into global-weights and train the t-GAN with $d_{trn}$ samples for m iteration using a constant step size. During each iteration, it measures the loss $L^i$ with the validations samples $d_{qry}$. Then the total loss $L=L^1+L^2 \cdots +L^m$ is used to update global-weights as shown in Fig 3. The resultant direction of the global-weights encodes a global information of the t-GAN network for all the tasks, which is used as an initialization for fine-tuning during inference.

During fine-tuning, we initialize the t-GAN from the global-weights and update the weights by minimizing the loss as described in Eq. 11. We use a few ($K =20$) example images of the target face for the fine-tuning.

4 Experimental Results

In this section, we present the experimental results of our proposed method on different datasets along with the network ablation study. We also show that the accuracy of our cascaded GAN based approach is quite higher than an alternate regression-based motion and texture generation. Our meta-learning based texture generation strategy makes our method to be more adaptable to any unknown faces. The combined result is a significantly better facial animation from speech than the state-of-the-art methods in terms of both quantitative and qualitative results. In what follows, we present detailed experiments for each of the building blocks of our pipeline.

4.1 Datasets

We use TCD-TIMIT [16], GRID [9], and Voxceleb [24] datasets for our experiments. We train our model only on TCD-TIMIT and test the model on GRID as well as our own recorded data for showing the efficacy of our method on cross datasets with completely unknown faces. Our training split contains 3378 videos from 49 subjects with around 6913 sentences uttered in a limited variety of accents. Test split (same as [32]) of TCD-TIMIT and GRID datasets contains 1631 and 9957 videos respectively.

4.2 Motion Generation on Landmarks

Network Architecture of l-GAN: The architecture of the generator network $G_L$ of l-GAN is built upon the encoder-decoder architecture used in [10] for generating mesh vertices. LeakyReLU [33] activation is used after each layer of the encoder network. The input DeepSpeech features are encoded to a 33 dimensional vector (PCA coefficients), which is decoded to obtain the canonical landmark displacements from the neutral pose. The discriminator network $D_{L}$ consists of 2 linear layers, which re-encodes the predicted or ground-truth landmarks into PCA coefficients to discriminate between real and fake. We initialize weights of the last layer of the decoder in $G_L$ and the first layer of $D_{L}$ with 33 PCA components computed over the landmark displacements in training data.

Network Architecture of Blink Generator $G_{B}$: We use RNN to predict a sequence of displacements $\mathbb {R}^{n \times 75 \times 44}$, i.e x, y coordinates of eye landmarks $\{\ell ^{e}_{t} \in \mathbb {R}^{22 \times 2}\}$ over 75 timestamps from given noise vector $z \sim \mathcal {N}(\mu ,\,\sigma ^{2})$ with $z \in \mathbb {R}^{n \times 75 \times 10}$. Similar to the $G_L$ of our $\textit{l-GAN}$ network, the last linear layer weights are initialized with PCA components (with 99% variants) computed over ground-truth eye landmark displacements.

Training Details: We extract audio features from the second last layer (before softmax) of the DeepSpeech [15] network. We consider sliding windows of $\varDelta t$ features for providing a temporal context to each video frame. To compute accurate ground-truth facial landmark required for our training, we experiment with different existing state-of-the-art methods [1, 19, 34] and find that the combination of OpenFace [1] and face segmentation [34] to be most effective for our purpose. Our speech-driven motion generation network is trained on the TCD-TIMIT dataset. The canonical landmarks used for training l-GAN are generated by an inverse process of the landmark retargeting method, as described in Sect. 3.3. We train our $\textit{l-GAN}$ network with a batch size of 6. Losses saturate after 40 epochs, which takes around 3 h on a single GPU of Quadro P5000 system. We use Adam [20] optimization with a learning rate of $2e-4$ for training both of our $\textit{l-GAN}$ and blink generator network.

Quantitative Results: We present our quantitative results in Table 1 and 2. For comparative analysis we use publicly availabe pre-trained models of state-of-the-art methods [4, 32, 36]. Our model is trained on TCD-TIMIT [16], while models of [4] and [36] are pre-trained on LRW [8] dataset. [32] is trained on both TCD-TIMIT and GRID separately.

For evaluating and comparing the accuracy of lip synchronization produced by our method, we use a) LMD, Landmark Distance (as used in [3, 4]) and b) Audio-Visual synchronization metrics (AV Offset and AV confidence produced by Syncnet [6]). For all methods, LMD is computed using lip landmarks extracted from the final generated frames. A lower value of LMD and AV offset with higher AV confidence indicates better lip synchronization. Our method shows better accuracy compared to state-of-the-art methods. Our models trained on TCD-TIMIT also shows good generalization capability in cross-dataset evaluation on GRID dataset (Table 2). Although [4] also generates facial landmarks from audio features (MFCC), unlike their regression-based approach, our use of DeepSpeech features, landmark retargeting, and adversarial learning results in improved accuracy of landmark generation.

Moreover, our facial landmarks contain natural eye blink motion for added realism. We detect eye blinks using a sharp drop in EAR (Eye Aspect Ratio) signal [32] calculated using landmarks of eye corners and eyelids. Blink duration is calculated as the number of consecutive frames between the start and end of the sharp drop in the EAR. The average blink duration and blink frequencies generated from our method is similar to that of natural human blinks. Our method produces a blink rate of 0.3 blinks/s and 0.38 blinks/s (Table 1 and 2) for TCD-TIMIT and GRID datasets respectively which is similar to the average human blink rate of 0.28–0.4 blinks/s. Also, we achieve an average blink duration of 0.33 s and 0.4 s, which is similar to as reported in ground-truth (Table 1 and 2). In Fig. 5 we present distribution of blink durations (in no. of frames) in synthesized videos of GRID and TCD-TIMIT datasets. So, our method can produce realistic eye blinks similar to [32], but with better identity-preserved texture, due to our decoupled learning of eye blinks on landmarks.

Ablation Study: An ablation study of window size $\varDelta t$ (Fig. 6) has indicated a value of $\varDelta t=6$ frames (duration of around 198 ms) results in the lowest LMD. In Fig. 6 we also present an ablation study for different losses used for training our motion prediction network. It is seen that the proposed loss $L_{motion}$ achieves the best accuracy. Use of $L_{2}$ regularization loss helps to achieve temporal smoothness and consistency on predicted landmarks over consecutive frames. We use direction loss (Eq. 3) to capture the relative movements of landmarks over consecutive frames. Using direction loss helps to achieve faster convergence of our landmark prediction network. Use of DeepSpeech features helps us to achieve robustness in lip synchronization even for audios with noise, different accents, and different languages (Please refer to supplementary video). We experiment to evaluate the robustness of our l-GAN with different levels of noise by adding synthetic noise in the audio input. Figure 6 shows upto $-30$ dB, the lip motion does not get affected by the noise and starts degrading afterward. In Fig. 4 we present a qualitative result of the landmark generation network on the TCD-TIMIT dataset. It shows the effectiveness of using discriminator in l-GAN.

4.3 Texture Generation from Landmark Motion:

Network Architecture of t-GAN: We adapt a similar approach of an image-to-image translation method proposed by [18] for implementation of our texture generator $G_{T}$. Our landmark encoder-decoder network $E_{L}$ takes generated person-specific landmarks represented as images of size $\mathbb {R}^{3 \times 256 \times 256}$ and $E_{I}$ takes channel-wise concatenated face images with corresponding landmark images of the target subject. We use six downsampling layers for both $E_{I}$ and the encoder of $E_{L}$ and six upsampling layers for the decoder of the $E_{L}$. To generate high fidelity images, we use residual block for our downsampling and upsampling layers similar to [2]. We use instance normalization for the residual blocks and adaptive instance normalization on the bottle-neck layer of the $E_{L}$ using the activation produced by the last layer of $E_{I}$. Moreover, to generate sharper images, we use a similar self-attention method as [35] at the $32\times 32$ layer activation of downsampling and upsampling layers. Our discriminator network $D_{T}$ consists of 6 residual blocks similar to $E_{I}$, followed by a max pooling and a fully connected layer. To stabilize our GAN training, we use spectral normalization [23] for both generator and discriminator network.

Training and Testing Details: We meta-train our t-GAN network using ground-truth landmarks following the teacher forcing strategy. We use fixed step size [12] $1e-3$ and Adam as the meta-optimizer [12] with learning rate $1e-4$. The values of $\alpha _{1}$, $\alpha _{2}$, $\lambda _{pix}$, $\lambda _{adv}$ $\lambda _{feat}$ and are experimentally set to $1e-1$, $2e-3$, 0.5, 1.0 and 0.3 respectively. At test time, we use 5 images of the target identity, and the person-specific landmark generated by the l-GAN to produce the output images. Before testing, we perform a fine-tuning of the meta-trained network using 20 images of the target person and the corresponding ground-truth landmarks. We use a clustered GPU of NVIDIA Tesla V100 for meta-training and Quadro P5000 for fine-tuning our network.

Table 1. Comparative results on TCD-TIMIT [16].

Full size table

Table 2. Comparative results on GRID [9] (our cross-dataset evaluation).

Full size table

Quantitative Results: Here, we present the comparative performance of our GAN-based texture generation network with the most recent state-of-the-art methods [4, 36] and [32]. Similar to l-GAN, the t-GAN is trained on TCD-TIMIT and evaluated on the test split of GRID, TCD-TIMIT and the unknown subjects. We compute the performance metrics PSNR, SSIM (Structural Similarity), CPBD (Cumulative Probability Blur Detection) [25], ACD (Average Content Distance) [32] and similarity between FaceNet [27] features for reference identity image (1st frame of ground truth video) and the predicted frames. Our method outperforms (Table 1 and 2) the state-of-the-art methods for all the datasets indicating better image quality. Due to inaccessibility of LRW [8] dataset we have evaluated our texture generation method on Voxceleb [24] dataset which gives average PSNR, SSIM and CPBD of 25.2, 0.63, 0.11 respectively. Our method does not produce head motion and synthesizes texture with frontal face. Hence, for Voxceleb, our method gives poor performance than that of TCD-TIMIT and GRID.

Qualitative Results: Figure 10 shows qualitative comparison against [4, 36] and [32]. It can be seen that [32] and [36] fail to preserve the identity of the test subject over frames in the synthesized video. Although [4] can preserve the identity, there is a significant blur, especially around the mouth region. Also, it lacks any natural movements over face except lip or jaw motion yielding an unrealistic face animation. On the other hand, our method can synthesize high fidelity images ($256 \times 256$) with preserved identity and natural eye motions. Figure 7 shows the qualitative comparison of our GAN based texture generation against a regression-based (without discriminator) network output where it is evident that our GAN based network gives more accurate lip deformation with similar motion as ground-truth.

Table 3. Ablation Study of our model. CC = channel wise concatenation.

Full size table

Table 4. Epoch-wise quantitative analysis in fine-tuning.

Full size table

Ablation Study: We show a detailed ablation study on the TCD-TIMIT dataset to find out the effect of different losses (Table 3). Among channel-wise concatenation and adaptive instance normalization, which are the two different approaches in neural style transfer, adaptive instance normalization works better for our problem. Figure 7 and quantitative result (Table 3) show that GAN based method produces more accurate lip deformation than the regression-based method, which always produces an overly smooth outcome. Figure 9 shows the ablation study for the number of images required for fine-tuning. Using single image for fine-tune yields average PSNR, SSIM and CPBD values of 27.95, 0.82, and 0.27 respectively for GRID dataset. Our method can produce accurate motion and texture after 10 epochs (Table 4) of fine-tuning with $K=20$ sample images.

Meta-Learning vs. Transfer-Learning: We compare the performance of MAML [12] and transfer-learning for our problem. To this end, we train a model with the same model architecture until it converges to similar loss values as meta-learning. After 10 epochs of fine-tuning with 20 images, the loss of meta-learning is much lower (Fig. 8b) than the transfer-learning (fine-tuning) and produces significantly better visual results (Fig. 8a) than transfer-learning. Moreover, fine-tuning of meta-learned network takes nearly 10 epochs with 20 images, which is much smaller than the transfer-learning based fine-tuning.

User Study: We assess the realism of our animation through a user study, where 25 users are asked to rate between 1(fake)–10(real) for 30 (10 videos from each of the methods) synthesized videos randomly selected from TCD-TIMIT and GRID. Our method achieves better realism scores with an average of $72.76\%$ compared to state-of-the-art methods [4] and [32] with average scores $58.48\%$ and $61.29\%$ respectively.

5 Conclusion

In this paper, we present a novel strategy for speech driven facial animation. Our method produces realistic facial animation for unknown subjects with different languages and accents in speech showing generalization capability. We attribute this advantage due to our separate learning of motion and texture generator GANs with meta-learning capability. As a combined result, our method outperforms state-of-the-art methods significantly. In the future, we would like to study the effect of meta-learning for learning landmark motion from speech to mimic personalized speaking styles.

References

Baltrusaitis, T., Zadeh, A., Lim, Y.C., Morency, L.P.: Openface 2.0: facial behavior analysis toolkit. In: 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018), pp. 59–66. IEEE (2018)
Google Scholar
Brock, A., Donahue, J., Simonyan, K.: Large scale GAN training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096 (2018)
Chen, L., Li, Z., Maddox, R.K., Duan, Z., Xu, C.: Lip movements generation at a glance. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11211, pp. 538–553. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01234-2_32
Chapter Google Scholar
Chen, L., Maddox, R.K., Duan, Z., Xu, C.: Hierarchical cross-modal talking face generation with dynamic pixel-wise loss. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition pp. 7832–7841 (2019)
Google Scholar
Chen, L., Srivastava, S., Duan, Z., Xu, C.: Deep cross-modal audio-visual generation. In: Proceedings of the on Thematic Workshops of ACM Multimedia. pp. 349–357. ACM (2017)
Google Scholar
Chung, J.S., Zisserman, A.: Out of time: automated lip sync in the wild. In: Workshop on Multi-view Lip-reading, ACCV (2016)
Google Scholar
Chung, J.S., Jamaludin, A., Zisserman, A.: You said that? arXiv preprint arXiv:1705.02966 (2017)
Chung, J.S., Zisserman, A.: Lip reading in the wild. In: Lai, S.-H., Lepetit, V., Nishino, K., Sato, Y. (eds.) ACCV 2016. LNCS, vol. 10112, pp. 87–103. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-54184-6_6
Chapter Google Scholar
Cooke, M., Barker, J., Cunningham, S., Shao, X.: An audio-visual corpus for speech perception and automatic speech recognition. J. Acoust. Soc. Am. 120(5), 2421–2424 (2006)
Article Google Scholar
Cudeiro, D., Bolkart, T., Laidlaw, C., Ranjan, A., Black, M.J.: Capture, learning, and synthesis of 3d speaking styles. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 10101–10111 (2019)
Google Scholar
Fan, B., Wang, L., Soong, F.K., Xie, L.: Photo-real talking head with deep bidirectional lstm. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4884–4888. IEEE (2015)
Google Scholar
Finn, C., Abbeel, P., Levine, S.: Model-agnostic meta-learning for fast adaptation of deep networks. In: Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1126–1135. JMLR. org (2017)
Google Scholar
Garrido, P., et al.: VDUB: modifying face video of actors for plausible visual alignment to a dubbed audio track. In: Computer Graphics Forum, vol. 34, pp. 193–204. Wiley Online Library (2015)
Google Scholar
Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.J.: A kernel method for the two-sample-problem. In: Advances in Neural Information Processing Systems, pp. 513–520 (2007)
Google Scholar
Hannun, A., et al.: Deep speech: scaling up end-to-end speech recognition. arXiv preprint arXiv:1412.5567 (2014)
Harte, N., Gillen, E.: TCD-TIMIT: an audio-visual corpus of continuous speech. IEEE Trans. Multimedia 17(5), 603–615 (2015)
Article Google Scholar
Huang, X., Belongie, S.: Arbitrary style transfer in real-time with adaptive instance normalization. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1501–1510 (2017)
Google Scholar
Johnson, J., Alahi, A., Fei-Fei, L.: Perceptual losses for real-time style transfer and super-resolution. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9906, pp. 694–711. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46475-6_43
Chapter Google Scholar
Kazemi, V., Sullivan, J.: One millisecond face alignment with an ensemble of regression trees. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1867–1874 (2014)
Google Scholar
Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
Li, Y., Chang, M.C., Lyu, S.: In ICTU oculi: exposing AI generated fake face videos by detecting eye blinking. arXiv preprint arXiv:1806.02877 (2018)
Mittal, G., Wang, B.: Animating face using disentangled audio representations. In: The IEEE Winter Conference on Applications of Computer Vision, pp. 3290–3298 (2020)
Google Scholar
Miyato, T., Kataoka, T., Koyama, M., Yoshida, Y.: Spectral normalization for generative adversarial networks. arXiv preprint arXiv:1802.05957 (2018)
Nagrani, A., Chung, J.S., Zisserman, A.: VoxCeleb: a large-scale speaker identification dataset. arXiv preprint arXiv:1706.08612 (2017)
Narvekar, N.D., Karam, L.J.: A no-reference perceptual image sharpness metric based on a cumulative probability of blur detection. In: 2009 International Workshop on Quality of Multimedia Experience, pp. 87–91. IEEE (2009)
Google Scholar
Parkhi, O.M., Vedaldi, A., Zisserman, A., et al.: Deep face recognition. In: BMVC, vol. 1, 6 (2015)
Google Scholar
Schroff, F., Kalenichenko, D., Philbin, J.: FaceNet: a unified embedding for face recognition and clustering. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 815–823 (2015)
Google Scholar
Song, Y., Zhu, J., Wang, X., Qi, H.: Talking face generation by conditional recurrent adversarial network. arXiv preprint arXiv:1804.04786 (2018)
Srivastava, A., Joshi, S.H., Mio, W., Liu, X.: Statistical shape analysis: clustering, learning, and testing. IEEE Trans. Pattern Anal. Mach. Intell. 27(4), 590–602 (2005)
Article Google Scholar
Suwajanakorn, S., Seitz, S.M., Kemelmacher-Shlizerman, I.: Synthesizing Obama: learning lip sync from audio. ACM Trans. Graphics (TOG) 36(4), 95 (2017)
Article Google Scholar
Vougioukas, K., Center, S.A., Petridis, S., Pantic, M.: End-to-end speech-driven realistic facial animation with temporal GANs. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 37–40 (2019)
Google Scholar
Vougioukas, K., Petridis, S., Pantic, M.: Realistic speech-driven facial animation with GANs. Int. J. Comput. Vision pp. 1–16 (2019)
Google Scholar
Xu, B., Wang, N., Chen, T., Li, M.: Empirical evaluation of rectified activations in convolutional network. arXiv preprint arXiv:1505.00853 (2015)
Yu, C., Wang, J., Peng, C., Gao, C., Yu, G., Sang, N.: BiSeNet: bilateral segmentation network for real-time semantic segmentation. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11217, pp. 334–349. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01261-8_20
Chapter Google Scholar
Zhang, H., Goodfellow, I., Metaxas, D., Odena, A.: Self-attention generative adversarial networks. arXiv preprint arXiv:1805.08318 (2018)
Zhou, H., Liu, Y., Liu, Z., Luo, P., Wang, X.: Talking face generation by adversarially disentangled audio-visual representation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 9299–9306 (2019)
Google Scholar
Zhu, H., Zheng, A., Huang, H., He, R.: High-resolution talking face generation via mutual information approximation. arXiv preprint arXiv:1812.06589 (2018)

Download references

Acknowledgement

We would like to thank Dr. Angshul Majumdar, Professor at IIIT Delhi, India for helping us to get access of TCD-TIMIT dataset required for this research. We would also like to thank Govind Gopal from our infrastructure team for his immense support for creating a clustered GPU setup for training.

Author information

Authors and Affiliations

Embedded Systems and Robotics, TCS Research, Kolkata, India
Dipanjan Das, Sandika Biswas, Sanjana Sinha & Brojeshwar Bhowmick

Authors

Dipanjan Das
View author publications
You can also search for this author in PubMed Google Scholar
Sandika Biswas
View author publications
You can also search for this author in PubMed Google Scholar
Sanjana Sinha
View author publications
You can also search for this author in PubMed Google Scholar
Brojeshwar Bhowmick
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Sandika Biswas .

Editor information

Editors and Affiliations

University of Oxford, Oxford, UK
Andrea Vedaldi
Graz University of Technology, Graz, Austria
Horst Bischof
University of Freiburg, Freiburg im Breisgau, Germany
Thomas Brox
University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
Jan-Michael Frahm

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Das, D., Biswas, S., Sinha, S., Bhowmick, B. (2020). Speech-Driven Facial Animation Using Cascaded GANs for Learning of Motion and Texture. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, JM. (eds) Computer Vision – ECCV 2020. ECCV 2020. Lecture Notes in Computer Science(), vol 12375. Springer, Cham. https://doi.org/10.1007/978-3-030-58577-8_25

Download citation

DOI: https://doi.org/10.1007/978-3-030-58577-8_25
Published: 24 September 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-58576-1
Online ISBN: 978-3-030-58577-8
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics

Speech-Driven Facial Animation Using Cascaded GANs for Learning of Motion and Texture

Abstract

Similar content being viewed by others

Modular Joint Training for Speech-Driven 3D Facial Animation

Synthesis of Photo-Realistic Facial Animation from Text Based on HMM and DNN with Animation Unit

3D facial animation driven by speech-video dual-modal signals

Keywords

1 Introduction

2 Related Work

3 Proposed Methodology

3.1 Speech-Driven Motion Generation on Facial Landmarks