Emotional Voice Puppetry
Ye Pan, Ruisi Zhang, Shengran Cheng, Shuai Tan, Yu Ding, Kenny Mitchell, Xubo Yang
Fig. 1. Given an audio clip, Emotional Voice Puppetry is capable of generating emotion-controllable talking faces for stylized characters
in a geometrically consistent and perceptually valid way. In addition, live mood dynamics can also blend smoothly.
Abstract—This paper presents emotional voice puppetry, an audio-based facial animation approach for portraying characters with vivid emotional changes. The lip motion and the surrounding facial areas are controlled by the content of the audio, while the facial dynamics are driven by the emotion category and intensity. Our approach is distinctive in that it accounts for perceptual validity as well as geometry, rather than relying on purely geometric processing. Another highlight of our approach is its generalizability to multiple characters. Our findings show that training new secondary characters with the rig parameters categorized into eye, eyebrows, nose, mouth, and signature wrinkles achieves significantly better generalization than joint training. User studies demonstrate the effectiveness of our approach both qualitatively and quantitatively. Our approach is applicable to AR/VR and 3DUI, namely virtual reality avatars/self-avatars, teleconferencing, and in-game dialogue.
Index Terms—Virtual reality, audio, emotion, character animation
1 Introduction
Many researchers are entering the metaverse race with immersive meetings, and they are highly interested in animating expressive talking avatars or self-avatars [4, 28]. A common challenge in virtual reality has been the difficulty of supporting users in controlling their avatars' facial expressions, due to the obstruction of a significant part of the users' faces by the headsets [20].
• Ye Pan is with Shanghai Jiao Tong University. E-mail:
whitneypanye@sjtu.edu.cn
• Ruisi Zhang is with UC San Diego. E-mail: ruz032@ucsd.edu
• Shengran Cheng is with Shanghai Jiao Tong University. E-mail:
SR-Cheng@sjtu.edu.cn
• Shuai Tan is with Shanghai Jiao Tong University. E-mail:
tanshuai0219@sjtu.edu.cn
• Yu Ding is with Virtual Human Group, Netease Fuxi AI Lab. E-mail:
dingyu01@corp.netease.com.
• Kenny Mitchell is with Roblox & Edinburgh Napier University. E-mail:
k.mitchell2@napier.ac.uk.
• Xubo Yang is with Shanghai Jiao Tong University. E-mail:
yangxubo@sjtu.edu.cn (Corresponding author).
This obstruction often prevents effective facial capture with conventional video-based methods [29]. Consequently, the proposed audio-based facial animation offers strengths complementary to video-based methods, although it does not quite match unobstructed, high-quality vision systems.
Audio-based facial animation methods can generate lip movements that are seamlessly synchronized with the speech [42], as well as personalized eye blinks and head movements [25, 34, 38, 39, 41]. However, these approaches seldom consider the emotional state. Emotion is a strong feeling arising from a user's circumstances, and mood is often expressed on the face through muscle motion [21]. Recently, Wang et al. proposed an emotional talking face generation baseline (MEAD) that enables the manipulation of emotion and intensity [33]. The MEAD system focuses primarily on animating realistic human faces rather than stylized characters, whose facial geometry may go beyond that of a real human face. Applying such facial expression generation tools to stylized characters often lacks the expressive quality and perceptual validity of artist-created animations.
Our paper proposes an emotion-controllable talking face generation framework for stylized characters. Inspired by previous work on human talking face generation, we aim to control stylized characters that go beyond the normal human look and act as expressive proxies for the users. The MEAD algorithms are applied at the initial stage to map the audio to lip motion. Then the character rig parameters that complement the mouth shape are retrieved while taking the emotion category and intensity into account. The last stage is a new multiple-character generalization network that enables the transfer of expressions between characters.
We also introduce a data-efficient multiple-character generalization network based on the previous ExprGen [2], which automatically learns a function to map the rig parameters of the primary character to the secondary characters. However, ExprGen requires over 5k samples to train the mapping network. By carefully studying the Facial Action Coding System (FACS) [40] and consulting with our in-house artist, we collected the character rig parameters and categorized them into five groups, namely eye, eyebrows, nose, mouth, and signature wrinkles. The rig parameters are then trained in parallel and retargeted concurrently on the secondary characters. Note that our approach only requires a small number of training examples for retargeting.
We demonstrate the effectiveness of our method by comparing it to the state-of-the-art methods on recognition, perceived intensity, synchronization, and naturalness & attractiveness, as these are crucial factors for audience engagement [16, 36]. Results show that our method significantly improves expression recognition and intensity scores while maintaining the same level of lip sync quality, naturalness, and attractiveness compared to MakeItTalk. The proposed technology is highly applicable in impactful fields, including VR, in-game dialogue, and telepresence.
The main contributions of this work include the following:
• To the best of our knowledge, we present the first emotional talking heads specifically designed for 3D stylized characters in a geometrically consistent and perceptually valid way.
• We developed our framework on the FERG-3D-DB dataset by
adding intensity labels for each character, and extensive user
studies validated the effectiveness.
• Our multiple-character generalization network significantly improved the generalization and efficiency of retargeting on new
characters.
• We propose new metrics to evaluate the recognition, intensity, synchronization, naturalness, and attractiveness of different talking
head animation approaches. Extensive experiments demonstrated
their effectiveness.
2 Related Work

2.1 Audio-driven facial animation
This literature review focuses on audio-based facial animation. Several researchers have studied video-based facial animation and found that video-driven animations can create more realistic facial expressions by capturing the facial performance of a human actor. The primary downside of performance capture, compared to animator-generated animation, is that it is visually restricted by the performance of the actor and in most cases lacks expressive quality and perceptual validity [2]. Our objective is to generate expressive and plausible 3D facial animations based solely on audio. Audio-based techniques can be organized into facial reenactment, which aims at creating photo-realistic videos of an existing human including their idiosyncrasies, and facial animation, which focuses on expression prediction that can be applied to a predefined simulator or avatar [30].
There are many previous studies on audio-based facial animation; however, most of them concentrate on the relationship between speech content and the shape of the mouth [11, 23, 44]. Brand pioneered Voice Puppetry to generate full facial animation from an audio track [6]. Karras et al. proposed a neural network, which stacks several convolution layers, to generate the 3D vertex coordinates of a face model from the audio and a known emotion [18]. Zhou et al. developed speaker-aware talking head animations from a single image and an audio clip by decomposing the input audio into speaker and content information and then applying a deep learning-based method [43]. Guo et al. adopt the neural radiance field (NeRF) representation for talking head scenes [14]. However, emotion is not addressed in these works.
Additionally, some studies have looked at head motions and eye blinks. Chen et al. synthesize talking face videos with natural head movements through the explicit generation of head motions and facial expressions [8]. One of the most challenging aspects of synthesizing talking face videos is that the natural poses of a human result in head motions that are either in-plane or out-of-plane. Yi et al. reconstructed a 3D facial animation and then rendered it into synthesized frames [37]. Hao et al. present a two-stage approach for generating talking-face videos with realistic, controllable eye blinking [15]. Liu et al. produced talking faces with controllable eye blinking driven by joint features of identity, audio, and blinking [22]. Given the extensive research on synthesizing talking face videos, this paper focuses on the animation of 3D stylized characters, where the head and eyes are animated by rig parameters.
A few studies have looked at emotions on human faces [27]. For instance, Wang et al. developed a large-scale emotional audio-visual dataset (MEAD) that contains talking face videos with varied emotions at varying intensity levels. In addition, the researchers proposed an emotional talking head generation baseline that is essential for manipulating emotions and their strength [33]. Ji et al. attained emotional control by editing video-based talking face generation approaches [17]. Again, the developed systems were used to animate human faces, not 3D stylized characters.
2.2 Facial expression for stylized characters
The successful creation of an animated story depends on the emotional state of a character, which must always be staged unambiguously [19, 32]. Keyframing is a prevalent method for animating characters with clear emotions and artistic expressions [1]. It is a simple method of animating a character but is often time-consuming. Recently, character expressions have been generated by motion capture systems that use modeled features on human faces and geometric markers [35]. Nevertheless, these features do not precisely match stylized character expressions. Therefore, facial geometric features alone cannot produce the desired, perceptually valid stylized character expression.
On the other hand, photographic precision (for instance, precise drawing of facial wrinkles) does not guarantee accurate communication of emotion. Underlying every emotion, only a limited set of elements forms the actual basis of our recognition. Primitive artists and cartoonists have invented unexpected and extraordinary graphic substitutes for actions and features. Generally, the stylized or abstracted interpretation of expressions must still rely on the actual nature of the human face [12].
Aneja et al. proposed ExprGen, a multi-stage deep learning system that can generate stylized character expressions from human face images in a geometrically consistent and perceptually valid way [2, 3]. This idea inspired our study and thus the proposed Emotional Voice Puppetry system, which generates character expressions using audio as the sole input.
3 Preliminary
3.1 Emotion Categories and Intensities
Categories We utilize four 3D stylized characters, namely Mery, Bonnie, Ray, and Malcolm, acquired from the Facial Expression Research Group 3D Database (FERG-3D-DB). The database provides annotated facial expressions categorized into seven groups: anger, fear, disgust, sadness, joy, surprise, and neutral.
Intensities The original dataset does not label intensities, so we label them according to the changes in facial expressions. When a facial expression is categorized under anger, fear, or surprise, the expression is expected to become more pronounced with enhanced eye opening, while the eyes close and the face becomes lackluster in the opposite case. The intensity label for each expression category is approximated by first disentangling the face into five parts, namely eye, eyebrows, nose, mouth, and signature wrinkle, and then calculating the offset of the expression relative to the neutral expression for every part. Equal weights are given to the five parts; the resulting rig-parameter offsets are then ranked into three intensity levels.

Fig. 2. Resulting images generated from MEAD and our method for seven emotion categories and three intensities.
Based on these assumptions, we then define three levels of emotion intensity:

• WEAK describes slight or gentle but detectable facial motion.

• MEDIUM describes the normal emotion state or the typical emotion expression.

• STRONG describes exaggerated facial expressions characterized by intense emotion in the face area.
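To make the labeling procedure concrete, the following is a minimal sketch, assuming each expression is stored as a flat NumPy vector of rig parameters and that the per-part index ranges are known; the part slices and the two thresholds are placeholders, since the paper ranks the offsets over the whole dataset rather than thresholding them.

```python
import numpy as np

# Hypothetical per-part index ranges into the rig-parameter vector;
# the real rig layout of the FERG-3D-DB characters may differ.
PARTS = {
    "eye": slice(0, 10),
    "eyebrows": slice(10, 18),
    "nose": slice(18, 22),
    "mouth": slice(22, 40),
    "signature_wrinkle": slice(40, 44),
}

def intensity_level(expr, neutral, weak_thr, strong_thr):
    """Label an expression as WEAK / MEDIUM / STRONG by its offset from neutral.

    expr, neutral: 1-D rig-parameter vectors of the same length.
    weak_thr, strong_thr: thresholds standing in for the dataset-wide ranking.
    """
    # Equal-weighted average of per-part L2 offsets from the neutral pose.
    offsets = [np.linalg.norm(expr[s] - neutral[s]) for s in PARTS.values()]
    score = float(np.mean(offsets))
    if score < weak_thr:
        return "WEAK"
    if score < strong_thr:
        return "MEDIUM"
    return "STRONG"
```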
4 Method
Overview As summarized in Figure 3, we propose an emotional talking-head generation method for stylized characters that can automatically manipulate emotion and intensity. We use a three-branch architecture to process the audio and the emotion separately. We first map the audio to lip movements of the base character and retrieve the desired emotion on the upper face from the preprocessed FERG-3D-DB dataset of Section 3.1, and then add the head pose & eye blink. Lastly, the acquired expression parameters are utilized to generate expressions on multiple secondary 3D stylized characters.
4.1 Lip

Audio-to-Landmarks The input audio is converted to lip landmarks of the talking face by first extracting Mel-Frequency Cepstral Coefficients (MFCC) [24] from the audio. We pair the video frames and audio features using a one-second temporal sliding window with the sample rate set to 30. Based on the audio temporal properties, a long short-term memory (LSTM) network and a fully connected layer are applied to predict the lip motion. The L2 loss function defines the audio-to-landmark task:

$\mathrm{Loss}_{a2l} = \lVert F(x) - l_{gt} \rVert_2$ ,   (1)

where $x$ and $l_{gt}$ are the input audio and the corresponding ground-truth lip landmarks, respectively, and $F(\cdot)$ represents the audio-to-landmark module.
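To illustrate this branch, the sketch below shows one plausible PyTorch realization. The MFCC dimensionality, hidden size, and landmark count per frame are our assumptions, not the paper's exact configuration; the loss shown is a squared-L2 variant of Equation 1.

```python
import torch
import torch.nn as nn

class AudioToLandmark(nn.Module):
    """LSTM + fully connected layer mapping MFCC windows to lip landmarks (a sketch)."""

    def __init__(self, n_mfcc=28, hidden=256, n_landmarks=20):
        super().__init__()
        self.lstm = nn.LSTM(input_size=n_mfcc, hidden_size=hidden, batch_first=True)
        self.fc = nn.Linear(hidden, n_landmarks * 2)  # (x, y) per landmark

    def forward(self, mfcc):           # mfcc: (batch, time, n_mfcc)
        feat, _ = self.lstm(mfcc)      # (batch, time, hidden)
        out = self.fc(feat)            # (batch, time, n_landmarks * 2)
        return out.view(mfcc.size(0), mfcc.size(1), -1, 2)

model = AudioToLandmark()
criterion = nn.MSELoss()               # squared-L2 loss against ground-truth landmarks
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

mfcc = torch.randn(8, 30, 28)          # one-second window at 30 samples per second
gt_lips = torch.randn(8, 30, 20, 2)
loss = criterion(model(mfcc), gt_lips) # stand-in for Loss_a2l in Equation 1
loss.backward()
optimizer.step()
```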
4.2 Emotional rig parameter & Matching
We bridge the generated mouth landmarks with rig parameters of Mery by performing the following steps. We first render Mery's 2D images with the given rig parameters in FERG-3D-DB. Then, we apply a face detection algorithm [7] to obtain 20 landmarks associated with the mouth. Finally, we use a two-stage matching method to obtain the best-matched rig parameter and mouth landmark pairs: (1) We first filter five rig parameters based on the L2 distance in Equation 2, where $l_{render}^i$ denotes landmark $i$ of the rendered image. (2) Then, we calculate the open-mouth similarity between the generated and detected landmarks based on Equation 3. The open-mouth distance is given in Equation 4, where upper and lower denote the indices of the upper and lower lip, respectively.
$D_{stage1} = \sum_{i=1}^{20} \lVert F(x)^i - l_{render}^i \rVert_2$ ,   (2)
Fig. 3. The overview of emotional voice puppetry. Our method includes three branches to process the mouth shape, the emotion, and the head pose & eye blink, respectively. The first branch maps the audio to lip movements of the base character and retrieves the desired emotion on the upper face from the preprocessed FERG-3D-DB dataset of Section 3.1. The second branch adds the head pose & eye blink. The third branch utilizes the acquired expression parameters to generate expressions on multiple secondary 3D stylized characters. Finally, a multiple character adaptation network performs input-character to target-characters expression transfer.
$D_{stage2} = \lVert mouth(F(x)) - mouth(l_{render}) \rVert_2$ ,   (3)

$mouth(y) = \lVert y_{upper} - y_{lower} \rVert_2$ ,   (4)
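The two-stage matching can be viewed as a nearest-neighbour search over the rendered FERG-3D-DB frames. In the minimal NumPy sketch below, the candidate pool layout, the number of stage-1 survivors (five, as in the text), and the upper/lower lip landmark indices are assumptions about the data layout rather than the paper's exact implementation.

```python
import numpy as np

UPPER_LIP, LOWER_LIP = 3, 9   # hypothetical indices of upper/lower lip landmarks

def mouth_openness(lm):
    """Open-mouth distance of Equation 4 for a (20, 2) mouth landmark array."""
    return np.linalg.norm(lm[UPPER_LIP] - lm[LOWER_LIP])

def match_rig_parameters(pred_lm, rendered_lms, rig_params, k=5):
    """Two-stage matching: Equation 2 filter, then Equation 3 open-mouth similarity.

    pred_lm:      (20, 2) landmarks predicted from audio, F(x).
    rendered_lms: (N, 20, 2) landmarks detected on rendered FERG-3D-DB images.
    rig_params:   (N, P) rig parameters corresponding to each rendered image.
    """
    # Stage 1: sum of per-landmark L2 distances (Equation 2); keep the k closest frames.
    d_stage1 = np.linalg.norm(rendered_lms - pred_lm, axis=2).sum(axis=1)
    candidates = np.argsort(d_stage1)[:k]

    # Stage 2: pick the candidate whose mouth openness best matches (Equation 3).
    target_open = mouth_openness(pred_lm)
    d_stage2 = [abs(mouth_openness(rendered_lms[i]) - target_open) for i in candidates]
    best = candidates[int(np.argmin(d_stage2))]
    return rig_params[best]
```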
4.3 Head motions, Eye gaze & Eye blink

We employ LSTM-based generators to model the relationship between head motions and the input audio. We first construct an audio-to-head-motion dataset, adapted from the Multi-view Emotional Audio-visual Dataset (MEAD). We then employ OpenFace [5] to create corresponding 3D rig parameters from the MEAD data. The LSTM network is then trained to generate head movements from the audio. The same methodology is also used to generate eye movements.
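The head-motion and eye-blink branches can reuse the same LSTM recipe as the lip branch, only with a different output dimensionality. The sketch below is a guess at how such a generator could look; the output sizes (e.g., three head-rotation parameters per frame) and the decision to train one generator per signal are assumptions, not specified at this level of detail in the text.

```python
import torch
import torch.nn as nn

class AudioToMotion(nn.Module):
    """LSTM generator mapping audio features to per-frame motion parameters
    (head rotation or eye/blink rig values), mirroring the lip branch."""

    def __init__(self, n_mfcc=28, hidden=128, n_out=3):  # n_out=3: e.g. head pitch/yaw/roll
        super().__init__()
        self.lstm = nn.LSTM(n_mfcc, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, n_out)

    def forward(self, mfcc):          # (batch, time, n_mfcc)
        feat, _ = self.lstm(mfcc)
        return self.fc(feat)          # (batch, time, n_out)

# One generator per signal: head pose and eye blink trained separately.
head_model = AudioToMotion(n_out=3)
blink_model = AudioToMotion(n_out=1)
```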
4.4 Smoothing Optimization
After obtaining the rig parameter and mouth landmark pairs, we can feed them into the second branch. Because the matching step may produce abrupt changes between retrieved keyframes, we introduce smoothing to make the frames more consistent and natural. Here, we simply use linear interpolation to add frames between two neighboring keyframes based on their rig parameters.
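As a minimal illustration of this smoothing, linear interpolation between two neighbouring keyframes of rig parameters can be written as follows; the number of in-between frames is an assumption.

```python
import numpy as np

def interpolate_keyframes(rig_a, rig_b, n_inbetween=4):
    """Linearly interpolate rig-parameter vectors between two neighbouring keyframes."""
    rig_a, rig_b = np.asarray(rig_a), np.asarray(rig_b)
    # t runs strictly between 0 and 1; the endpoints are the keyframes themselves.
    ts = np.linspace(0.0, 1.0, n_inbetween + 2)[1:-1]
    return [(1.0 - t) * rig_a + t * rig_b for t in ts]
```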
4.5 Multiple character adaptation
Generalizing facial expressions to multiple characters plays an important role in the animation field. A prior method, ExprGen [2], employs two steps to transfer animation to multiple characters. It first resolves training pairs by measuring the feature distance between two characters. Then, it uses the controller values of the input character and the target character to train MLP models. Even though the feature distance between a character pair is small, the two characters' facial geometry details might differ. For example, the input character and the target character are both smiling and we retrieve the best-matched target character expression; the retrieved expression might be similar in the upper face but different in the lower face. Additionally, to control a character's facial expression more precisely, we usually need several hundred parameters for both the input and target characters. When training MLP networks, we then need a larger dataset to match these pairs (about 10k examples). This requires more manual effort overall and, thus, is harder to generalize to another character.

Fig. 4. Matching samples of different parts (brow, eye, mouth, nose, and cheek).

Facial expressions are controlled by different independent action units. We divide the face of a character into five parts: brow, eye, mouth, cheek, and nose (Figure 4). For each part, we use the rig parameters as feature vectors and train a separate MLP network to generalize the character's expressions.
We propose a new model to generalize input character expressions to multiple target characters. In the matching step, we split the character face into different parts and match their geometry. In the training step, we use different MLP networks to train the facial components separately. Then, the MLP networks' outputs are combined to control the target characters.

Fig. 5. Multiple character adaptation to new characters. The first row is the input character Mery; the second and third rows are the target characters Waitress and Miosha.
Matching To match the input character with the target characters, we first split the input character's face into five parts, namely mouth, nose, brow, eye, and cheek. For each part, we perform a two-step filtering similar to the matching algorithm in Section 4.1. In the first step, we directly filter images with the same emotion annotation as the input character. In the second step, we use the following landmark distances to formulate geometry feature vectors. Mouth: mouth width (distance from the left mouth corner to the right mouth corner) and closed-mouth height (vertical distance between the upper and lower lip). Nose: nose width (horizontal distance between the leftmost and rightmost nose landmarks). Brow: left/right eyebrow height (vertical distance between the top of the eyebrow and the center of the eye). Eye: left/right eyelid height (vertical distance between the top and the bottom of the eye) and left/right lip height (vertical distance from the lip corner to the lower eyelid). Cheek: the distances from the nose to the leftmost point of the face and to the lowest point of the face.
Then, for each facial part at the i-th frame, let the input character's landmarks be denoted as $l_{input}^i$ and the target character's landmarks as $l_{target}^i$; we retrieve the closest landmark by minimizing the L2 distance $D$ between input and target landmarks, as shown in Equation 5.

$D = \arg\min_i \lVert l_{input}^i - l_{target}^i \rVert_2$   (5)
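A compact way to view this per-part matching is as a nearest-neighbour lookup on small hand-crafted geometry features. The sketch below assumes 2D facial landmarks are available for both characters and uses a simplified feature set (only mouth width and closed-mouth height for the mouth part); the landmark indices are placeholders, not the detector's actual indexing.

```python
import numpy as np

# Hypothetical landmark indices; the real indexing depends on the detector used.
L_MOUTH_CORNER, R_MOUTH_CORNER, UPPER_LIP, LOWER_LIP = 48, 54, 51, 57

def mouth_features(lm):
    """Geometry feature vector for the mouth part: width and closed-mouth height."""
    width = np.linalg.norm(lm[L_MOUTH_CORNER] - lm[R_MOUTH_CORNER])
    height = abs(lm[UPPER_LIP][1] - lm[LOWER_LIP][1])
    return np.array([width, height])

def retrieve_target_frame(input_lm, target_lms, target_emotions, emotion):
    """Per-part retrieval: filter by emotion label, then minimize the feature distance (Eq. 5)."""
    candidates = [i for i, e in enumerate(target_emotions) if e == emotion]
    feats_in = mouth_features(input_lm)
    dists = [np.linalg.norm(feats_in - mouth_features(target_lms[i])) for i in candidates]
    return candidates[int(np.argmin(dists))]
```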
Training We create a separate multilayer perceptron (MLP) for each facial part, consisting of M input nodes, N output nodes, and a hidden layer with ReLU activation, where M and N are the dimensions of the facial part's parameters for the input and target characters. These dimensions change with different characters and different facial regions. When training the model, gradient descent is used with a mini-batch size of 10 and a learning rate of 0.01 to minimize the squared loss between the ground truth and the output parameters, where the ground truth is obtained from the matching step.
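A minimal sketch of one such per-part mapping network in PyTorch follows the hyperparameters given in the text (one ReLU hidden layer, mini-batch size 10, learning rate 0.01, squared loss); the hidden width, the use of plain SGD, and the example dimensions are our assumptions.

```python
import torch
import torch.nn as nn

def make_part_mlp(m_in, n_out, hidden=64):
    """One MLP per facial part, mapping input-character rig parameters to the target character."""
    return nn.Sequential(nn.Linear(m_in, hidden), nn.ReLU(), nn.Linear(hidden, n_out))

def train_part_mlp(model, src_params, tgt_params, epochs=200):
    """src_params: (K, M) input-character parameters; tgt_params: (K, N) matched targets."""
    opt = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        perm = torch.randperm(src_params.size(0))
        for start in range(0, len(perm), 10):          # mini-batch size 10
            idx = perm[start:start + 10]
            opt.zero_grad()
            loss = loss_fn(model(src_params[idx]), tgt_params[idx])
            loss.backward()
            opt.step()
    return model

# One network per part; their outputs are concatenated to drive the target rig.
brow_mlp = make_part_mlp(m_in=8, n_out=6)   # example dimensions, character-dependent
```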
5 Evaluation

5.1 Comparison to the state-of-the-art
We compare our emotion-controllable talking face generation approach with MEAD and MakeItTalk. We include the MEAD system because our system also uses the Multi-view Emotional Audio-visual Dataset and their emotional talking-face generation baseline; however, our system is developed for stylized characters instead of real humans. Additionally, we include MakeItTalk to demonstrate that our system achieves comparable performance to state-of-the-art models in terms of lip synchronization and facial geometry, although emotion is not addressed in their approach.

5.1.1 Participants

We recruited 20 participants from Shanghai Jiao Tong University. The average age of participants was 21, with an age range of 20-23 years; 10 were female and 10 were male. All of the participants are majoring in engineering. They were naïve to the purposes of the experiment.

5.1.2 Design
The experiment used 6 animated characters (Human created via MEAD, MakeItTalk, Mery, Bonnie, Ray, & Malcolm) × 7 emotions (Neutral, Anger, Sadness, Fear, Disgust, Happiness, & Surprise) × 3 intensities × 4 audio clips in a mixed design, with a between-subject design for audio but a within-subject design for characters, emotions, and intensities.

Each participant completed 126 experimental trials (6 characters × 7 emotions × 3 intensities = 126 trials); the video clips were presented in random order to reduce fatigue.
5.1.3 Procedure
The participants all signed a consent form before engaging in the trial. They were presented with a video clip that they viewed and afterward answered the related questions. The questions were close-ended, and the participants had to choose from pre-defined responses.

• Which expression did the character depict? Participants were asked to select one of the following responses: neutral, anger, sadness, fear, disgust, happiness, surprise, or other.

• How intense was the indicated emotion depicted by the character? Participants rated the intensity on a scale from 1 to 3, where 1 is weak and 3 is strong.

• Did the lip motion synchronize with the speech? Participants rated the lip sync quality on a scale from 1 to 7, where 1 is not synchronized at all and 7 is synchronized extremely well.

• How natural was the talking head overall? Participants rated naturalness on a scale from 1 to 7, where 1 is not natural at all and 7 is extremely natural.

• How attractive was the character overall? Participants rated attractiveness on a scale from 1 to 7, where 1 is not attractive at all and 7 is extremely attractive.

The participants first completed one practice trial during which asking questions was allowed, and then completed the 126 measured trials.

The participants were paid 50 RMB. The experiment took about 30 minutes and was approved by the Shanghai Jiao Tong University Research Ethics Committee.
5.1.4 Results & Discussion
We applied separate repeated-measures analyses of variance (ANOVA) to the results on recognition, intensity, synchronization, naturalness, and attractiveness, with within-participant factors of emotion (7), character (6), and intensity (3). Using the Shapiro-Wilk test, the data were normally distributed for all assessed conditions, and the boxplots showed no outliers. Furthermore, Mauchly's test was conducted to evaluate sphericity; where sphericity was violated, we applied the Greenhouse-Geisser correction and marked the result with an asterisk "∗". Bonferroni post hoc tests were used for multiple comparisons of means.
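For reference, a repeated-measures ANOVA of this kind can be run as sketched below, assuming the responses are stored in a long-format table with one row per participant × character × emotion × intensity cell; the file name and column names are placeholders, and the Greenhouse-Geisser correction reported in the paper would be applied separately.

```python
import pandas as pd
from statsmodels.stats.anova import AnovaRM

# Long-format responses: one aggregated score per participant and condition cell.
df = pd.read_csv("responses_long.csv")  # columns: participant, character, emotion, intensity, score

# Repeated-measures ANOVA with character, emotion, and intensity as within-subject factors.
res = AnovaRM(
    data=df,
    depvar="score",
    subject="participant",
    within=["character", "emotion", "intensity"],
    aggregate_func="mean",               # average over repeated stimuli within a cell
).fit()
print(res)
```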
Fig. 6. Mean for each controlled intensity level and character on recognition, perceived intensity, synchronization, naturalness & attractiveness. Error
bars show standard deviation.
Recognition For the recognition of expressions, responses were converted to scores, "1" for correct or "0" for incorrect, and then averaged over stimuli repetitions. Figure 6(a) shows the comparison of average scores obtained for the three intensities across the 6 characters.

First, MakeItTalk had the lowest mean score (M = .193, SE = .016). The main effect of character was significant, F(5, 95) = 80.175, p < .001. Bonferroni post-hoc comparisons indicated that the mean for MakeItTalk was significantly lower than for Human, p < .001. This is expected, because MakeItTalk animates characters based on geometric markers only.

Second, the average score for Human (M = .519, SE = .03) was significantly lower than the average scores for all stylized characters, e.g., the primary character Mery (M = .764, SE = .031), p < .001. This could be due to the characters' simpler geometry and stylization, which makes the expressions easier to discern.

Third, Bonferroni post-hoc comparisons also showed that the means for Mery, Bonnie (M = .714, SE = .027), Ray (M = .705, SE = .033), and Malcolm (M = .624, SE = .031) did not significantly differ from each other, p > .05. This demonstrates the effectiveness of our multiple character generalization network.
Additionally, Figure 6(b) also shows that the mean ratings for MakeItTalk are the lowest among all characters. The main effect of character was significant, F(2.892, 54.942) = 47.626, p < .001∗. Bonferroni post-hoc comparisons indicated that the mean rating for MakeItTalk (M = 1.371, SE = .084) was significantly lower than for Human (M = 1.99, SE = .065), Mery (M = 2.257, SE = .075), Bonnie (M = 2.067, SE = .07), Ray (M = 1.871, SE = .065), and Malcolm (M = 2.062, SE = .066), p < .001. This is expected, as MakeItTalk's main contribution is focused on better lip synchronization, head motions, and personalized facial expressions, rather than on generating expressive emotions for stylized characters.
Intensity Figure 6(b) shows user-perceived intensity ratings for the three labeled intensities (WEAK, MEDIUM, & STRONG) across the 6 characters. Firstly, the main effect of controlled intensity was significant, F(1.154, 21.918) = 26.266, p < .001∗. Bonferroni post-hoc comparisons indicated that the mean rating for WEAK (M = 1.745, SE = .078) was significantly lower than for MEDIUM (M = 1.973, SE = .06), p = .016. The mean for MEDIUM also differed significantly from STRONG (M = 2.092, SE = .055), p = .022. This shows that our labeled emotion intensities are well distinguished.
Naturalness & Attractiveness Figure 6(d) shows the naturalness ratings for the 3 intensities across the 6 characters. Results show that the main effects of intensity and character were not significant, F(1.311, 24.913) = 4.393, p = .057∗, and F(2.439, 46.339) = .935, p = .415∗, respectively.

Figure 6(e) shows the attractiveness ratings for the 3 intensities across the 6 characters. Results show that the main effects of intensity and character were also not significant, F(2.748, 52.218) = 1.556, p = .214∗, and F(1.453, 27.6) = 3.897, p = .063∗, respectively.
Synchronization Figure 6(c) shows the lip synchronization scores for the 3 intensities across the 6 characters. The mean lip sync scores of our four stylized characters are similar to those of both Human and MakeItTalk. Results reveal that the main effects of intensity and character were not significant, F(2, 38) = 3.598, p = .037, and F(2.602, 49.44) = 1.806, p = .165∗, respectively. This is expected, because we used the same audio-to-lip method as the Human condition to generate the mouth shape.