In Pursuit of Visemes
Sarah Hilder, Barry-John Theobald, Richard Harvey
School of Computing Sciences, University of East Anglia, UK
{s.hilder, b.theobald, r.w.harvey}@uea.ac.uk
Abstract
We describe preliminary work towards an objective
method for identifying visemes. Active appearance
model (AAM) features are used to parameterise a
speaker’s lips and jaw during speech. The temporal behaviour of AAM features between automatically identified salient points is used to represent visual speech gestures, and visemes are created by clustering these gestures using dynamic time warping (DTW) as a cost function. This method produces a significantly more
structured model of visual speech than if a typical
phoneme-to-viseme mapping is assumed.
Index Terms: Visemes, visual speech encoding.
1. Introduction
Phonemes are the linguistic units used to represent acoustic speech. Replacing one phoneme with another will
change the meaning of an utterance. Visemes (visual
phonemes) [1] are the supposed equivalent for visual
speech. However, visemes are less well defined and less
well understood. Our goal is to identify a representation
for visual speech equivalent to the phonemes of acoustic
speech.
Previous efforts to define visemes have proved inconclusive. Many make the assumption of a many-to-one relationship between phonemes and visemes [2, 3, 4, 5, 6],
but there are a number of limitations with this approach.
Firstly, this does not take into account the asynchrony between the acoustic and the visual modalities of speech,
where the onset of movement does not always correspond
to the onset of the acoustic realisation of a phone. Secondly, there are some phones that do not require the use
of the visual articulators, and so phonemes such as /k/ or
/g/, which are velar consonants articulated with the back of
the tongue against the soft palate, are unlikely to have an associated viseme.
Finally, previous approaches generally ignore coarticulation effects. The allophones of a phoneme often appear
very different visually, yet these obviously different visual gestures are assigned the same meaning in a visual
sense.
To overcome these limitations, we break the assumption of a formal link between the acoustic and the visual
representations of speech. Instead, we will use machine
learning algorithms to analyse visual speech and automatically identify patterns of visual behaviour. This paper
describes preliminary work towards this goal.
The remainder of this paper is organised as follows.
Section 2 reviews previous methods used to identify
visemes, and Section 3 describes our approach. Section 4
discusses our findings, and Section 5 provides detail of
future work.
2. Background and Related Work
The sounds of speech are dependent on the formation of
the articulatory organs, such as the nasal cavity, tongue,
teeth, velum and lips. Since only a small number of these
articulators are visible (lips and partially the teeth and
tongue) it is apparent that a one-to-one mapping between
phonemes and visemes results in redundancy. Speech
sounds that differ in their voicing or nasality tend to appear visually similar and so are assigned to the same
visemic class. For example, /f/ is voiceless and /v/ is
voiced, but both have the same place and manner of articulation (labiodental fricative). As they appear the same,
they often are considered to form a viseme.
Typically phonemes are mapped to visemes using
some form of subjective assessment, based on analysing
patterns of confusion in a stimulus-response matrix [2].
Visemes are defined by clustering phonemes such that
the within-cluster response accounts for at least 75%
of all responses. Auer and Bernstein [3] compiled
a set of consonant visemes and vowel visemes (referred
to as phonemic equivalence classes or PECs) by grouping phonemes based on their similarity. Their data were
taken from the Eberhardt et al. consonant recognition
task, where participants were asked to lip-read from C-/a/
contexts [4] and from Montgomery and Jackson’s vowel
recognition task, where participants were asked to lip-read the vowel in /h/-V-/g/ contexts [7]. They find that
twelve PECs best approximate the data.
Similarly, Lesner and Kricos [6] asked subjects to lip-read vowels and diphthongs in /h/-V-/g/ contexts spoken
by different speakers. They find that the number and
composition of visemes differs with different speakers,
and that speakers who are easier to speech-read generally
produce a larger number of visemes. A similar experiment was performed by Jiang et al. [8] in which participants were asked to lip-read the consonant in C-V words
uttered by four speakers. They also find that visemes are
both speaker dependent and context dependent, where the
number of visemes varies from four to six.
Fisher [1] asked participants to lip-read the initial and
final consonants of an utterance using a ‘forced-error’ approach, where the correct response was omitted from a
closed-set of possible responses. The results suggest that
viseme groupings for initial and final consonants differ,
and that initial consonants contain directional confusions.
For example, /m/ is significantly confused with /b/, but /b/
is not significantly confused with /m/.
For experiments that require human participation, it
is necessary for stimuli to be kept simple. Many of
the methods used to identify visemes use data where
phonemes are presented only in a single context [3, 7].
Natural speech production is much less constrained than
this, and visual articulation is based on what is necessary
for acoustic distinctiveness, but not for visual distinctiveness [14]. The physical constraints enforced by the human muscular system prevent the articulators switching
position instantaneously, which results in blurring across
gesture boundaries [10]. Perkell and Matthies [9] describe this as ‘the superposition of multiple influences on
the movement of an articulator’. Montgomery and Jackson [7] state this as the reason for selecting the /h/-V-/g/
context in their work — this context produces minimal
coarticulation effects.
Coarticulation can be anticipatory (forwards coarticulation) or can reflect the influence from previous gestures
(backwards coarticulation). For example, the appearance
of /s/ in the words ‘sue’ and ‘sea’ may be different as the
anticipatory labial rounding of /u/ in ‘sue’ begins during or before the articulation of /s/ (forwards coarticulation). Equally, the appearance of /k/ in the words ‘spook’
and ‘speak’ may be different due to the labial rounding of
the preceding vowel in ‘spook’ and widening in ‘speak’
(backwards coarticulation). Figure 1 illustrates the effect of backwards coarticulation on lip shape. The images
were taken during the articulation of /t/ in the word ‘bat’
(left) and ‘jot’ (right). In the latter case the lip roundedness of /dZ/ and /6/ from ‘jot’ continues to influence /t/.
It is worth noting that coarticulation effects are not functions only of directly neighbouring phones: influence has been
found from phones up to six segments away in
either direction [11]. Löfqvist incorporated this idea into
a theory of speech production [20], whereby each speech
segment has a set of dominance functions — one for each
articulator — and speech production involves concatenating and overlapping these dominance functions.
The dominance and deformability of a gesture depends on whether fully reaching the articulatory targets
is necessary to produce the required sound. This means
that not all visual phones are equally affected by coarticulation as the organs that are deemed necessary for
producing a sound may or may not be visually apparent.
Figure 1: A frame from /t/ in the word ‘bat’ (left) and
‘jot’ (right). The difference in lip shape highlights the
influence of coarticulation on visual speech production.
For example, the consonants /f/ and /v/ are far less deformable than /k/ and /g/. The former are fricative consonants that are articulated using the upper teeth and lower
lip — granting minimal freedom to the shape of the lips
— whereas the latter are velar consonants that are articulated with the back of the tongue against the soft palate — granting more freedom to the shape of the lips. The articulation of the vowel
/u/ is a dominant gesture as lip rounding and protrusion
are both essential to produce the sound. Consequently,
in consonant recognition tasks, Owens and Blazek [15]
and Benguerel and Pichora-Fuller [16] find that the most
apparent coarticulation effects are in /u/-C-/u/ contexts,
where C is a more deformable consonant (such as /s, z,
t, d, l, n, k, g, h, j/). Benguerel and Pichora-Fuller [16]
also find that in VCV contexts /u/ attains a near perfect
recognition score whereas /æ/ scores the lowest. Perkell
and Matthies [9] measured coarticulation in /i/-C-/u/ utterances by recording the vertical displacement of a point
on the upper lip. They find that many speakers begin lip protrusion for /u/ directly after the acoustic offset of /i/.
To date there has been no unequivocal agreement regarding the number of visemic classes, nor how the set of
phonemes are clustered to form the visemes. This may
be due to the subjective nature of the methods employed,
where small variations in stimuli and different participants have an influence on the resulting visemes.
For the purposes of automatic lip-reading, Goldschen
et al. [12] used an objective approach for identifying
visemes using sentences as stimuli. A selection of static
and dynamic lip features were extracted from video and
manually segmented into phones. These were then clustered using a hidden Markov model similarity measurement and the average linkage hierarchical clustering algorithm [13]. The resulting visemes are consistent with
results from perceptual experiments [1], but the notion of
a viseme was extended to include lip opening/closing for
the consonants /b/, /p/ and /m/, forming the groups /bcl,
m, pcl/ and /b, p, r/ (where ’cl’ indicates closure).
To account for variation in visual articulation due to
phonetic context, a many-to-many relationship between
phonemes and visemes is required [17]. Mattys et al. [18]
are one of the few that have attempted to model visemes
in this way, where different viseme classes for initial and
non-initial consonants are used. However, a limitation of
this approach is that consonants are assumed to have no
influence on the articulation of vowels.
We propose to identify visual units of speech independently of a phonetic/acoustic representation of
speech. Instead, patterns of behaviour of the articulators
will be used to identify visual meaning, and by clustering behaviours that appear similar we will identify a set
of visemes. This will overcome three of the major shortcomings identified with previous work. Firstly, the allophones of a phoneme will not be required to have the
same visual label. Secondly, the onset and offset of the
visual gestures will be identified in the visual modality,
thus we do not require visemes to align with acoustic labels, as is usually the case [12, 21]. Thirdly, our analysis
will be objective in nature, and the visual units will be
derived from continuous speech (sentences).
3. Viseme extraction
To produce a set of visual gestures that we will refer to
as visemes, we use continuous speech to acknowledge
the influence of coarticulation and adopt a data-driven
approach to avoid prior assumptions regarding phonetic
alignment or labels.
3.1. Stimuli
The stimuli used in this work are drawn from the
LIPS2008 audio-visual corpus [22]. This contains 278
phonetically balanced sentences spoken by a single, female speaker. It was recorded at 50 frames per second in
standard definition. The speaker maintained a neutral expression throughout the recording and spoke at a steady
pace. The camera captured a full frontal image of the
face.
3.2. Feature extraction and preprocessing
In this work, active appearance models (AAMs) [23] are
used to encode visual speech. AAMs provide a compact
statistical description of the variation in the shape and appearance of a face. The shape of an AAM is defined by
the two-dimensional vertex locations

s = (x_1, y_1, x_2, y_2, . . . , x_n, y_n)^T

of a mesh that delineates the inner and outer lip contours and the jaw. A set of model parameters that control
the non-rigid variation allowed by the mesh are derived
by hand-labelling a set of training images, then applying
principal components analysis (PCA) to give a compact
model of the form:

s = s_0 + Σ_{i=1}^{m} s_i p_i,    (1)

where s_0 is the mean shape, the vectors s_i are the eigenvectors of the covariance matrix corresponding to the m
largest eigenvalues (see Figure 2), and the coefficients p_i
are the shape parameters, which define the contribution
of each mode to the encoding of s.

Figure 2: An example of the shape (top row) and appearance (bottom row) components of an AAM. Shown are the
first two modes of variation overlaid onto the mean shape,
the mean appearance, and the first mode of variation of
the appearance component of the model.

The appearance of the AAM is an image defined over
the pixels that lie inside the base mesh, x = (x, y)^T ∈ s_0,
and the set of model parameters that control the variation
allowed by the image. The appearance is constructed by
warping each training image from the manually annotated
mesh locations to the base shape, then applying PCA (to
these shape-normalised images) to give a compact model
of appearance variation of the form:

A(x) = A_0(x) + Σ_{i=1}^{l} λ_i A_i(x)    ∀ x ∈ s_0,    (2)

where the coefficients λ_i are the appearance parameters,
A_0(x) is the base appearance, and the appearance images A_i(x) are the eigenvectors corresponding to the
l largest eigenvalues of the covariance matrix (see Figure 2).

To encode the visual speech information within the
video sequences, the face is tracked using the inverse
compositional project-out AAM algorithm [24]. Next,
Equations 1 and 2 are solved for the shape and appearance parameters respectively, which are then concatenated and smoothed to reduce the effects of noise.
Smoothing is achieved using a cubic smoothing spline
with a weighting of 0.7 — a value determined by subjectively analysing the curves. The smoothed features are
normalised as follows:

b = [(W p)^T, λ^T]^T,    (3)

with

W = sqrt( Σ_{i=1}^{l} σ²_{λ_i} / Σ_{i=1}^{m} σ²_{p_i} ),    (4)

where λ and p are column vectors of appearance and
shape parameters respectively, l and m are the number of
dimensions corresponding to the appearance and shape
respectively, and σ²_{λ_i} and σ²_{p_i} represent the variance captured by each dimension of the respective model. A third
PCA is applied to these features to model the correlated
variation in the shape and appearance parameters, providing a compact, low-dimensional feature vector describing
the shape and appearance variation of the lips and jaw
during speech.
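As a concrete sketch of Equations 1, 3 and 4, the snippet below builds a linear PCA shape model from hand-labelled landmark vectors and applies the variance-balancing weight used before the third, combined PCA. It is a minimal illustration under our own naming (fit_shape_model, combine, and so on), not the authors' implementation.

```python
import numpy as np

def fit_shape_model(shapes, n_modes):
    """shapes: (N, 2n) array of hand-labelled landmark vectors (Eq. 1 model)."""
    s0 = shapes.mean(axis=0)                         # mean shape s_0
    X = shapes - s0
    # eigenvectors of the covariance matrix, via SVD of the centred data
    _, sing, Vt = np.linalg.svd(X, full_matrices=False)
    variances = sing ** 2 / (len(shapes) - 1)        # variance per mode
    return s0, Vt[:n_modes], variances[:n_modes]

def encode(s, s0, basis):
    return basis @ (s - s0)                          # shape parameters p_i

def decode(p, s0, basis):
    return s0 + basis.T @ p                          # s = s_0 + sum_i s_i p_i

def combine(p, lam, var_p, var_lam):
    # Eq. 4: scale the shape parameters so that shape and appearance
    # contribute comparable variance to the concatenated vector (Eq. 3)
    W = np.sqrt(var_lam.sum() / var_p.sum())
    return np.concatenate([W * p, lam])
```

The concatenated vectors b would then be passed through the further PCA to give the final low-dimensional features.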
3.3. Speech segmentation
To produce the gestures we define to be visemes, the visual speech is first segmented into units. Typically this
is done using the phonetic boundaries derived from the
acoustic speech [12, 21], but here we use a data-driven
approach whereby we locate segment boundaries using
only the visual signal.
During speech, articulators do not move at a constant
rate. Instead they tend to accelerate towards articulatory
targets and decelerate as they approach or realise the
targets. Consequently, we make the assumption that a
salient lip pose is that in which the lips are at their most
still, and that a gesture is the transition from one salient
lip pose to the next. This involves calculating the acceleration (∆∆) coefficients from the AAM features and extracting the frames where the sign changes from negative
to positive.
Using this method we extract 8421 gestures from the LIPS2008 corpus. This value falls between the number
of phones articulated (11850) and the number of syllables
produced (≈ 4500).
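The segmentation rule can be sketched as below. The text does not state how the sign change is evaluated across a multi-dimensional feature trajectory, so reducing the trajectory to its frame-to-frame speed before differencing is our assumption, and the function names are ours.

```python
import numpy as np

def salient_frames(features):
    """Frames where the delta-delta (acceleration) signal crosses from
    negative to positive, i.e. where the lips are at their most still.
    features: (T, d) smoothed AAM parameter trajectory."""
    speed = np.linalg.norm(np.diff(features, axis=0), axis=1)  # delta
    accel = np.diff(speed)                                     # delta-delta
    crossings = np.where((accel[:-1] < 0) & (accel[1:] >= 0))[0]
    return crossings + 2            # offset back to original frame indices

def gestures(features):
    """A gesture is the transition between consecutive salient poses."""
    cuts = salient_frames(features)
    return [features[a:b] for a, b in zip(cuts[:-1], cuts[1:])]
```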
3.4. Clustering
Pair-wise distances between gestures are obtained by
measuring the cost associated with performing a dynamic
time warp (DTW). Dynamic time warping is a method
for measuring the similarity between two time series that
may vary in length or speed: a non-linear warp
along the time axis is applied to one of the sequences to
align it to the other. This warp is optimised to minimise
a cost function. The reader is referred to [25] for a more
in-depth discussion regarding DTW.
We use the clustering toolbox, CLUTO [26] (version
2.1.2), to cluster the gesture space. Empirically we find
that the graph-based clustering algorithm results in the
best clustering. This is possibly because it is able to
model non-spherical clusters.
To determine the number, k, of clusters required, we
measure the silhouette coefficient (SC) [27], Dunn's Index (DI) [28] and the Davies-Bouldin Index (DBI) [29] at n
clusters, for n = 2, ..., 200. We find that k = 58 is
optimal: this is where the SC and DI were maximised
and the DBI was minimised.
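The silhouette criterion can be evaluated directly from the precomputed pairwise DTW distance matrix, as sketched below. This is our own implementation; the labelling passed in would come from the clustering algorithm, and singleton clusters are not specially handled.

```python
import numpy as np

def silhouette(D, labels):
    """Mean silhouette coefficient from a precomputed distance matrix.
    D: (N, N) symmetric distances; labels: (N,) integer cluster ids."""
    scores = np.empty(len(D))
    for i in range(len(D)):
        same = labels == labels[i]
        same[i] = False                               # exclude self-distance
        a = D[i, same].mean() if same.any() else 0.0  # cohesion
        b = min(D[i, labels == c].mean()              # separation
                for c in np.unique(labels) if c != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores.mean()
```

Sweeping the number of clusters and scoring each candidate clustering this way (alongside the Dunn and Davies-Bouldin indices) gives the kind of model-selection curve used above to settle on k.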
3.5. Results
Figure 3A shows the distance matrix produced by finding the DTW cost between each pair of gestures and ordering the samples by their cluster ID. In this image, the
colours range from blue to red representing, respectively,
the smallest to largest values. The cluster boundaries are
highlighted with black boxes along the leading diagonal.
A perfect clustering would produce a series of blue boxes
down the diagonal on a red background.
For comparison, Figure 3B presents a visualisation
of the distance matrix produced by finding the cost of
performing DTW between each pair of visemes, i.e. visual
phonemes. To produce this image, the AAM features
were segmented via the acoustic phone boundaries. Labels were then assigned to each segment based on the
phoneme to viseme mapping taken from [30]. In the image, the samples are ordered by viseme group. For example, all segments with phoneme labels /p/, /b/ and /m/
are arranged sequentially in the distance matrix. Again,
the group boundaries are outlined with a black box. We
can see from this image that this distance matrix appears
to lack structure as the distances between grouped items
appear to be no smaller than the distances between ungrouped
items. The one exception to this is the group that appears
in the lower right-hand corner of the image. This group
represents silence. As the speaker was asked to maintain a neutral expression prior to and after the utterance,
the examples of silence would be almost unaffected by
coarticulation.
It is apparent that the distance matrix in Figure 3A
shows more structure than in Figure 3B. To quantify this
we measure the average silhouette coefficient for each
cluster [27]. The silhouette coefficient is a measure of
cluster separation and cohesion that falls between −1
and +1. A value of ≈ 1 denotes perfectly clustered
data. From the data we record average silhouette values
of −0.32 when using the typical ‘viseme’ approach and
−0.11 for our gestural approach. The latter approach attained both higher minimum and maximum scores and
the difference between the approaches is statistically
significant (p < 0.01).
4. Discussion
In this paper we have presented a review of the methods that are traditionally used for viseme classification
and highlighted some of the many problems associated
with them. We noted that when using subjective methods,
there is little explicit agreement as to which phonemes are
grouped to form which visemes, and how many visemes
(A)
(B)
Figure 3: Distance matrices obtained using the DTW cost between each pair of units. The units were found using:
(A) the visual gestural approach described in this paper, and (B) the acoustic phone boundaries and the phoneme-to-viseme
mapping from [30] (that is, the traditional approach). Each cluster/viseme group is delineated by a black square along the
leading diagonal. Red regions correspond to large distances, whereas blue regions correspond to small distances. Ideally for both
images, the areas bounded by the black squares would be more blue (small intra-class distance) whilst the other areas
would appear more red (larger inter-class distance).
there should be. However, most have agreed that the relationship between phonemes and visemes is many-to-one
and all studies have ignored the matter of audio-visual
asynchrony. The visual labels usually are assumed to
align with the underlying acoustic (phone) boundaries.
In Section 3 we introduced our proposed method of
viseme classification: a data-driven approach that makes
no prior assumption regarding the segmentation of speech
or the phonetic labels associated with each segment. We
described a novel method of segmenting the speech according to the acceleration of the lips during articulation.
We clustered each segment based on the cost of performing DTW between each pair of gestures. We have found
that using this approach we can produce a significantly
more structured model of visual speech production than if
we assume a many-to-one phoneme to viseme mapping.
5. Further work
We are applying machine learning algorithms first to discover whether we can extract a set of gestures that describes the speech space of a single speaker, and then to expand to multiple speakers in the hope that we can produce a generic set of parameterised visemes. This will
then be used to compare traditional approaches to visual speech representation employed both in our visual
speech synthesisers (talking heads) and our automatic lipreading systems.
6. Acknowledgements
The authors gratefully acknowledge EPSRC (EP/E028047/1) for funding.
7. References
[1] C. G. Fisher, “Confusions among visually perceived
consonants,” Journal of Speech, Language, and
Hearing Research, vol. 11, pp. 796–804, December
1968.
[2] T. Chen and R. R. Rao, “Audio-visual integration in
multimodal communication,” in Proceedings of the
IEEE, 1998, pp. 837–852.
[3] E. T. Auer and L. E. Bernstein, “Speechreading and
the structure of the lexicon: Computationally modeling the effects of reduced phonetic distinctiveness
on lexical uniqueness,” Journal of the Acoustical
Society of America, vol. 102, pp. 3704–3710, Dec.
1997.
[4] S. P. Eberhardt, L. E. Bernstein, and M. H.
Goldstein, “Speechreading sentences with single-channel vibrotactile presentation of voice fundamental frequency,” Journal of the Acoustical Society of America, vol. 88, no. 3, pp. 1274–1285, 1990.
[5] T. Ezzat and T. Poggio, “Visual speech synthesis by
morphing visemes,” International Journal of Computer Vision, vol. 38, no. 1, pp. 45–57, 2000.
[6] S. A. Lesner and P. B. Kricos, “Visual vowel and
diphthong perception across speakers,” Journal of
the Academy of Rehabilitative Audiology, vol. 14,
pp. 252–258, 1981.
[7] A. A. Montgomery and P. L. Jackson, “Physical
characteristics of the lips underlying vowel lipreading performance,” Journal of the Acoustical Society
of America, vol. 73, no. 6, pp. 2134–2144, 1983.
[8] J. Jiang, A. Alwan, L. E. Bernstein, E. T. Auer, Jr., and
P. A. Keating, “Similarity structure in perceptual
and physical measures for visual consonants across
talkers,” in Proc. IEEE International Conference on
Acoustics, Speech and Signal Processing, vol. 1,
2002, pp. 441–444.
[9] J. S. Perkell and M. L. Matthies, “Temporal measures of anticipatory labial coarticulation for the
vowel /u/: Within- and cross-subject variability,” Journal of the Acoustical Society of America,
vol. 91, no. 5, pp. 2911–2925, 1992.
[10] A. Turkmani, “Visual analysis of viseme dynamics,” Ph.D. dissertation, University of Surrey, 2008.
[11] R. D. Kent and F. D. Minifie, “Coarticulation in recent speech production models,” Journal of Phonetics, vol. 5, no. 2, pp. 115–133, April 1977.
[12] A. J. Goldschen, O. N. Garcia, and E. Petajan,
“Continuous optical automatic speech recognition
by lipreading,” in Proceedings of the 28th Asilomar Conference on Signals, Systems, and Computers, 1994, pp. 572–577.
[13] C. Ding and X. He, “Cluster merging and splitting
in hierarchical clustering algorithms,” in Proceedings of the Second IEEE International Conference
on Data Mining, 2002, pp. 139–146.
[14] J. Luettin, “Visual speech and speaker recognition,”
Ph.D. dissertation, University of Sheffield, 1997.
[15] E. Owens and B. Blazek, “Visemes observed by
hearing-impaired and normal-hearing adult viewers,” Journal of Speech and Hearing Research,
vol. 28, pp. 381–393, 1985.
[16] A. P. Benguerel and M. K. Pichora-Fuller, “Coarticulation effects in lipreading,” Journal of Speech
and Hearing Research, vol. 25, pp. 600–607, 1982.
[17] P. L. Jackson, “The theoretical minimal unit for
visual speech perception: Visemes and coarticulation,” Volta Review, vol. 90, no. 5, pp. 99–115,
September 1988.
[18] S. L. Mattys, L. E. Bernstein, and E. T. Auer,
“Stimulus-based lexical distinctiveness as a general
word-recognition mechanism,” Perception and Psychophysics, vol. 64, no. 4, pp. 667–679, 2002.
[19] M. M. Cohen and D. W. Massaro, “Modeling coarticulation in synthetic visual speech,” in Models and
Techniques in Computer Animation. Springer-Verlag, 1993, pp. 139–156.
[20] A. Löfqvist, Speech as audible gestures. Kluwer
Academic Publishers, 1990.
[21] J. Melenchón, J. Simó, G. Cobo, and E. Martínez,
“Objective viseme extraction and audiovisual uncertainty: estimation limits between auditory and
visual modes,” in International conference on
auditory-visual speech processing, 2007.
[22] B. Theobald, S. Fagel, G. Bailly, and F. Elisei,
“Lips2008: Visual speech synthesis challenge,” in
Interspeech, 2008, pp. 2310–2313.
[23] T. Cootes, G. Edwards, and C. Taylor, “Active appearance models,” IEEE Trans. Pattern Analysis
and Machine Intelligence, vol. 23, no. 6, pp. 681–
685, June 2001.
[24] I. Matthews and S. Baker, “Active appearance models revisited,” International Journal of Computer Vision, vol. 60, no. 2, pp. 135–164, November 2004.
[25] K. Wang and T. Gasser, “Alignment of curves by
dynamic time warping,” The Annals of Statistics,
vol. 25, no. 3, pp. 1251–1276, June 1997.
[26] G. Karypis, “CLUTO - a clustering toolkit,” University of Minnesota, Department of Computer Science, Minneapolis, Tech. Rep., April 2002.
[27] P. J. Rousseeuw, “Silhouettes: a graphical aid to
the interpretation and validation of cluster analysis,”
Journal of Computational and Applied Mathematics, vol. 20, pp. 53–65, 1987.
[28] J. C. Dunn, “Well-separated clusters and optimal
fuzzy partitions,” Cybernetics and Systems, vol. 4,
pp. 95–104, 1974.
[29] D. L. Davies and D. W. Bouldin, “A cluster separation measure,” IEEE Trans. Pattern Analysis and
Machine Intelligence, vol. 1, pp. 224–227, 1979.
[30] B. Walden, R. Prosek, and A. Montgomery, “Effects
of training on the visual recognition of consonants,”
Journal of Speech and Hearing Research, vol. 20,
pp. 130–145, 1977.