Visual Speech Recognition: a solution from feature extraction to words classification
LUCIANA GONÇALVES DA SILVEIRA 1, JACQUES FACON 2, DÍBIO LEANDRO BORGES 2

1 FACULDADE CAMBURY – Cambury College, Goiânia, GO, Brazil
lugon@cultura.com.br

2 PUCPR – Pontifical Catholic University of Paraná, Laboratory for Vision and Image Science, CCET – PPGIA,
Rua Imaculada Conceição 1155 – Prado Velho, Curitiba, PR, Brazil
{facon,dibio}@ppgia.pucpr.br
Abstract. Audio-visual speech recognition has been an active area of research lately. A difficult, and as yet unsolved, part of this problem is visual-only recognition, or lip reading. Considering an image sequence of a person pronouncing a word, a full image analysis solution would have to segment the mouth area, extract relevant features, and use them to classify the word from those visual features. In this paper we approach this problem by proposing a segmentation technique for the lips contours, together with a set of features based on the extracted contours, which is able to perform lip reading with promising results. We have collected visual speech sequences in our lab and show here the results for a set of ten words in Brazilian Portuguese, spoken by different speakers in 150 samples. The approach can be extended and applied to other spoken languages as well.
1. Introduction

Audio-visual speech recognition is an area which embraces the tasks of lip reading and audition. It is a psychological finding, see Summerfield in [12], that automatic speech recognition might be made more robust if visual speech information could be incorporated. However, recognition of fluent speech based solely on visual information is limited, since lips movements alone do not provide enough distinctions for classification purposes when one speaks using a regular-size vocabulary. Because of the wider availability of commercial dictation systems and of multimedia interfaces with visual input capacity, the problem of designing visual speech recognition solutions has received great research interest lately. For a recent review of the area see Chen in [3].

Providing a computational solution to visual speech recognition can be divided into three main tasks, or stages. First, given a sequence of images, the mouth area has to be detected automatically; second, features suitable for the identification and classification of the speech have to be extracted from the images; and finally, recognition is to be achieved based on the visual features with regard to a particular vocabulary. In this paper we report a novel solution for visual speech recognition covering the three tasks mentioned. The solution was tested on a set of words from Brazilian Portuguese; however, it can be extended to work with different sets of words and other languages as well.

We review here some important results found in the recent literature most related to the problem and to the approach proposed in this paper.

Caplier in [2] proposes an active shape model to describe the mouth area. The approach works first by training the deformations using spatiotemporal sets. Parameters of a Kalman filter are defined at specific points and then used to integrate the information coming from the training. The mouth area is not automatically located, and the points are given for the training. Experiments are shown for tracking the mouth in some image sequences.

Faruquie et al. report an integrated audio-visual speech recognition system in [5]. They use active shape models with 46 hand-marked knot points in the mouth area, and train Point Distribution Models (PDM) in order to obtain the principal deformation modes. The lip contour is represented using five parabolic curves fitted to the sets of 46 points. Features are computed for the interior and exterior lips contours and then passed to classification. The highest figure achieved for video-only recognition was reported as a 31% success rate.

A geometric lip model based on a quadratic curve is presented by Liew et al. in [7], to be used as a complete model for lip detection in gray-level images. An algorithm employing a stochastic cost function to find lip and non-lip regions in the images is then implemented in order to fit the model to the images. The authors report detecting
the lip regions, although no recognition is mentioned after the regions are detected.
Matthews [8] proposed variations on the active shape model to work within an integrated system of audio-visual speech recognition. A Multiscale Spatial Analysis (MSA) approach gave the best results, achieving reported rates of 77% on a particular database of words. The approach
is heavily dependent on the training of the parameters in
order to find the correct locations in the scale histograms.
Results of the work with MSA are also reported in [9].
Sadeghi et al. in [11] proposed a model using
Gaussian mixtures and general covariance matrix
functions in order to estimate a mouth area from a set of
examples. Results are given showing the approach is
robust for segmenting mouth areas and lips for a set of
gray level images. Neither tracking nor recognition is
performed or mentioned in the work.
As can be seen from the literature on visual speech recognition, the proposed methods work mainly by tracking the mouth of a specific speaker, and then extracting features for comparison with trained sequences of a finite vocabulary. In this paper we propose a method that first finds the mouth and lips area in the image, then estimates and extracts special features for use in the classification stage; in the experiments none of these stages is restricted to a specific speaker or to pre-trained sequences. So, besides the original details of the method, which will be given in the next sections, the conditions of the experimental setup and the restrictions of our method are different from most of the work we found in the literature.
This paper is further organized as follows. Section 2 describes the approach proposed here. Section 3 explains the experimental setup used to test the approach and gives the classification results. Section 4 outlines the major conclusions and gives directions for future lines of work.

2. The Approach

The approach we propose here is organized into three modules: 1) Segmentation of Lips Contours; 2) Lips Features and Extraction; 3) Recognition. The following sections explain these modules in detail.

2.1 Segmentation of Lips Contours

In order to provide an automatic solution for visual speech recognition, the mouth area in an image sequence has to be detected, and the contours of the lips segmented and followed apart from the background and the rest of the face. This is a difficult segmentation problem, particularly because of lighting conditions, speed of speech, and differing aspects of someone's mouth such as the teeth and lips area. A solution intended to work for more than one speaker has to take these issues into consideration. We have designed a solution which does not aim to extract the exact lips contours at each frame of the speech, but rather to provide stable points of the lips contours for the next stage of feature extraction.
The input is a sequence of gray-level images spanning the start to the end of the pronunciation of a word, and for segmentation of the lips contours we propose the following steps: 1) Movement detection; 2) Enhancement filtering; 3) Entropy thresholding; 4) Mouth region detection; and 5) Lips contours identification.
Movement detection works by computing a difference between two consecutive frames of a sequence: the frame at time t minus an average-filtered version of the frame at time t-1. The output frame computed this way gives a more stable account of the region that moved, following Lie and Hsieh [6].
\[ IE_{t-1} = I_{t-1} * \mathrm{Mask} \tag{1} \]

\[ \mathrm{Mask} = \frac{1}{9}\begin{bmatrix} 1 & 1 & 1 \\ 1 & 1 & 1 \\ 1 & 1 & 1 \end{bmatrix} \tag{2} \]

\[ DI = I_t - IE_{t-1} \tag{3} \]
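A minimal sketch of this movement-detection step, assuming 8-bit gray-level frames and the NumPy/OpenCV stack (the function and variable names are ours, not from the original):

```python
import cv2
import numpy as np

def movement_detection(frame_t: np.ndarray, frame_t1: np.ndarray) -> np.ndarray:
    """Compute DI as in Eqs. (1)-(3): the current frame minus an
    average-filtered version of the previous frame."""
    mask = np.full((3, 3), 1.0 / 9.0, dtype=np.float32)          # Eq. (2)
    ie_t1 = cv2.filter2D(frame_t1.astype(np.float32), -1, mask)  # Eq. (1)
    di = frame_t.astype(np.float32) - ie_t1                      # Eq. (3)
    return cv2.convertScaleAbs(di)  # absolute value, back to 8 bits for display
```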
Figure 1 Consecutive frames of a sequence and the compensated movement image. (a) Original frame at instant t-1, $I_{t-1}$; (b) original frame at instant t, $I_t$; (c) detected frame $DI$ computed from (a) and (b).
Figure 1 shows two consecutive frames and the new frame computed with movement compensation. It can be seen in Figure 1(c) that lighting became more uniform and strong details were enhanced.
Enhancement filtering is designed to enhance the lips and mouth regions, and to diminish noise smaller than a structuring element. It works by applying to each detected frame DI a gray-level morphological opening with the structuring element shown in (4), iterated 8 times.
(4)
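A rough sketch of this filtering stage follows; the exact structuring element of (4) is not reproduced in this copy, so a 3×3 square element is assumed here for illustration:

```python
import cv2
import numpy as np

def enhance(di: np.ndarray, iterations: int = 8) -> np.ndarray:
    """Gray-level morphological opening of the detected frame DI;
    opening removes bright details smaller than the structuring element."""
    se = cv2.getStructuringElement(cv2.MORPH_RECT, (3, 3))  # assumed shape
    return cv2.morphologyEx(di, cv2.MORPH_OPEN, se, iterations=iterations)
```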
Figure 2 shows a result of applying the process of
enhancement.
Figure 2 Result of morphological filtering to enhance the lips and mouth region. (a) Original frame; (b) enhanced image after application of (4).
Entropy thresholding was the way we tested to remove the background of the image and most of the face features, in order to leave the mouth area as a more uniform region in the image. We have implemented the algorithm of Abutaleb [1] and applied it to each frame of the sequence. Figure 3 shows this process applied to one frame.
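Abutaleb's algorithm maximizes the entropy of a 2-D histogram of gray level versus local average gray level. The sketch below shows only the simpler 1-D entropic-thresholding idea it builds on, as a stand-in rather than the authors' implementation:

```python
import numpy as np

def entropy_threshold(img: np.ndarray) -> int:
    """Return the threshold maximizing the summed entropies of the
    background and foreground gray-level distributions (1-D version)."""
    hist = np.bincount(img.ravel(), minlength=256).astype(np.float64)
    p = hist / hist.sum()

    def entropy(part: np.ndarray) -> float:
        q = part[part > 0] / part.sum()
        return float(-np.sum(q * np.log(q)))

    scores = [entropy(p[:t]) + entropy(p[t:])
              if p[:t].sum() > 0 and p[t:].sum() > 0 else -np.inf
              for t in range(1, 256)]
    return int(np.argmax(scores)) + 1  # thresholds were tested for t = 1..255
```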
Figure 3 Result after entropy thresholding. (a) Enhanced frame from the last stage; (b) resulting image after applying Abutaleb thresholding.
Mouth region detection separates the mouth from the rest of the image, working on the uniform regions already in binary form from the last stage. The hypothesis that the mouth is the largest region left works well at this stage. Figure 4 gives an example of mouth detection.
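A minimal sketch of this largest-region hypothesis, assuming an 8-bit binary frame and OpenCV's connected-components routine:

```python
import cv2
import numpy as np

def detect_mouth(binary: np.ndarray) -> np.ndarray:
    """Keep only the largest connected region of the binary frame,
    under the hypothesis that the mouth is the largest region left."""
    n, labels, stats, _ = cv2.connectedComponentsWithStats(binary)
    if n <= 1:                                   # only background found
        return np.zeros_like(binary)
    areas = stats[1:, cv2.CC_STAT_AREA]          # skip label 0 (background)
    mouth_label = 1 + int(np.argmax(areas))
    return np.where(labels == mouth_label, 255, 0).astype(np.uint8)
```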
Figure 4 Result for mouth detection. (a) Frame from the last stage, already in binary form; (b) result after applying the mouth detection stage.

The lips contour identification stage uses a structuring element to erode the mouth region; the difference between the eroded version and the mouth image then provides the lips contours. Cleaning of the remaining artifacts is performed so that the final output contains only the major, i.e. external, lips contour. Figure 5 shows results of lips contour identification for one frame of our data.

Figure 5 Result of contour identification after mouth detection. (a) Mouth region from the previous stage; (b) result with lips contour. Image (b) is the negative, since it is the direct result of the eroded version minus the mouth image; (c) final contour after chain coding is performed.
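A sketch of this erosion-difference step; we compute the mouth region minus its eroded version, which is the positive of the negative image shown in Figure 5(b), and again assume a 3×3 square structuring element:

```python
import cv2

def lips_contour(mouth):
    """Morphological contour: subtracting the eroded mouth region
    from the region itself leaves a thin external contour."""
    se = cv2.getStructuringElement(cv2.MORPH_RECT, (3, 3))  # assumed element
    eroded = cv2.erode(mouth, se)
    return cv2.subtract(mouth, eroded)
```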
2.2 Lips Features and Extraction
The output from the last stage of lips contour identification is passed to a chain-code marking algorithm, Davies [4], in order to provide a more efficient data structure for feature extraction (see Figure 5(c)). We have used eight directions in the code, which also cleared some jagging on the lips contours. Most solutions in visual speech recognition, see Matthews [8] and Matthews et al. [9] for recent examples, choose to approximate the lips contours using splines, or active contours, and to perform frame-by-frame tracking of the lips movements. This solution has a high computational cost, and it usually needs pre-defined starting control points. In this work we propose to use a different and much smaller set
of features for the lips than the closed 2-D contours. They are easier to extract and more stable, and since the final aim is to perform recognition, we tested them and reached results comparable with the best published results to date.
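For reference, a minimal Freeman chain-code pass with eight directions might look as follows (the names are ours; consecutive points are assumed to be 8-neighbors, as produced by a border-following routine):

```python
# Freeman 8-direction offsets (row, col): 0 = E, 1 = NE, 2 = N, ..., 7 = SE
DIRS = [(0, 1), (-1, 1), (-1, 0), (-1, -1), (0, -1), (1, -1), (1, 0), (1, 1)]

def chain_code(points):
    """Chain code of an ordered contour given as (row, col) points."""
    return [DIRS.index((r1 - r0, c1 - c0))
            for (r0, c0), (r1, c1) in zip(points, points[1:])]
```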
Since the significant differences that can be detected from someone's mouth while speaking a word are the opening of the mouth and its speed, we propose as features for the lips four (4) distances: 1) H1, the largest horizontal distance; 2) V1, a vertical distance exactly at the midpoint of H1; 3) V2, a vertical distance at the midpoint of the left part of H1 from the crossing of V1; and 4) V3, a vertical distance at the midpoint of the right part of H1 from the crossing of V1. Figure 6 shows schematically the four distances proposed as features on the lips contours.
Figure 6 Image showing the features H1, V1, V2, and V3 to be extracted and used for visual speech recognition.

This set of lips features is then computed for each frame of a sequence, and the feature vector passed to recognition is then a list of the number of frames times the four features of each frame.
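A sketch of one way to measure the four distances from the external contour points, under our reading of the definitions above (the function names and the ±1-pixel column tolerance are our assumptions):

```python
import numpy as np

def lip_features(contour_pts: np.ndarray) -> np.ndarray:
    """Compute [H1, V1, V2, V3] from an (N, 2) array of (x, y) contour points."""
    xs = contour_pts[:, 0]

    def vertical_extent(x: float) -> float:
        near = contour_pts[np.abs(xs - x) <= 1]   # contour points on that column
        return float(near[:, 1].max() - near[:, 1].min()) if len(near) else 0.0

    h1 = float(xs.max() - xs.min())               # H1: largest horizontal distance
    mid = (xs.max() + xs.min()) / 2.0
    v1 = vertical_extent(mid)                     # V1: vertical at midpoint of H1
    v2 = vertical_extent((xs.min() + mid) / 2.0)  # V2: midpoint of the left half
    v3 = vertical_extent((mid + xs.max()) / 2.0)  # V3: midpoint of the right half
    return np.array([h1, v1, v2, v3])
```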
2.3 Words Classification based on the Visual Features

The number of frames in the sequences is not constant, since even for the same word a different speaker may take longer to pronounce it, and we did not want to constrain our solution to a fixed number of frames. For the words classification stage we compare all the feature vectors for the different words and speakers, and the vectors with fewer frames than the others are filled with zeros at the missing frame positions.

Nearest neighbor with Euclidean distance was used in order to classify each feature vector into one of the ten proposed classes, as sketched below. The next section explains the experimental setup used and gives the main results achieved by the proposed method.
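A compact sketch of this zero-padding plus nearest-neighbor decision; `centers` is assumed to hold one flattened feature vector per word:

```python
import numpy as np

def pad(vec, length):
    """Zero-fill a flattened feature vector up to a common length."""
    out = np.zeros(length)
    out[:len(vec)] = vec
    return out

def classify(sample, centers):
    """Return the word whose center is nearest in Euclidean distance."""
    length = max(len(sample), *(len(c) for c in centers.values()))
    s = pad(np.asarray(sample), length)
    return min(centers,
               key=lambda w: np.linalg.norm(s - pad(centers[w], length)))
```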
3. Experiments

We have collected for the experiments 15 samples for each word of the vocabulary set. The words chosen are {“zero”, “um”, “dois”, “três”, “quatro”, “cinco”, “seis”, “sete”, “oito”, “nove”}, related to the numerals {0,1,2,3,4,5,6,7,8,9} respectively. The database includes more than twenty (20) different speakers pronouncing one or more sequences.

Figure 7 shows part of an original sequence of a speaker pronouncing the word “um” (1). The camera sampled at a frame rate of 30 frames per second. Figure 8 shows the first three frames of the sequence on the top row and the last three on the bottom row.

Figure 7 Some input frames of a sequence pronouncing word 1 (“um”).
Figure 8 shows, for the same sequence from Figure 7, the results after the stages of movement detection, enhancement filtering, and entropy thresholding. The images in Figure 8 are already in binary form, showing mainly the mouth region, the nostrils, a part of the background, and still some noise.
Figure 8 Frames of a sequence pronouncing word 1 (“um”): results after motion detection and morphological enhancement.
Figure 9 gives the outputs, for the sequence in Figure 8, after mouth region detection and lips contour identification. It can be seen in the image frames of Figure 9 that there are more contours than the external lips, and this is why cleaning of the remaining non-closed contours is needed. The next stage, see the output in Figure 10, performs this cleaning and chain-codes the lips contours for further processing.
Figure 9 Frames of a sequence pronouncing word 1 (“um”): results after lips contours detection.

Figure 10 Frames of a sequence pronouncing word 1 (“um”): results with the final contour detected.
Figures 11, 12, 13, and 14 show, respectively, frames of a sequence of the word “oito”: the original data, the output at the point of entropy thresholding, the lips identification, and the lips contours. We have chosen to show images of these two words so that the reader can see the difficulty of classifying the speech based on the visual data at each frame. Our chosen features helped, since they were more stable than using the whole contour.

Figure 11 Some input frames of a sequence pronouncing word 8 (“oito”).

Figure 12 Frames of a sequence pronouncing word 8 (“oito”): results after motion detection and morphological enhancement.

Figure 13 Frames of a sequence pronouncing word 8 (“oito”): results after lips contours detection.
Figure 14 Frames of a sequence pronouncing word 8 (“oito”): results with the final contour detected.
From a total of 150 samples, or sequences, 15 for each word, the system was run directly on the original data to extract the 150 feature vectors for classification. A typical sequence has 45 frames, although this number varies depending on the word and the speaker of each sample. For testing the recognition accuracy we randomly chose 20% of the samples and labeled them in order to compute the centers of the clusters for each word. The remaining 80% were then classified using nearest neighbor with Euclidean distance. A rank was made from the closest to the most distant. Five (5) batteries of tests were made following this procedure, always changing the samples in the 20%-80% subsets. This way all the samples were tested in both situations and the results were averaged. Table 1 gives the confusion matrix for the ten (10) classes.
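One battery of this hold-out procedure could be sketched as below; a simplification under our assumptions that the feature vectors are already padded to equal length and that every word appears in the 20% training split:

```python
import numpy as np

def run_battery(features, labels, train_frac=0.2, seed=None):
    """20% of the samples define per-word cluster centers (their mean
    vector); the remaining 80% are classified by the nearest center."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(features))
    n_train = int(train_frac * len(features))
    train, test = idx[:n_train], idx[n_train:]
    centers = {w: np.mean([features[i] for i in train if labels[i] == w], axis=0)
               for w in {labels[i] for i in train}}
    hits = sum(min(centers, key=lambda w: np.linalg.norm(features[i] - centers[w]))
               == labels[i] for i in test)
    return hits / len(test)

# Averaging five batteries, as in the protocol described above:
# accuracy = np.mean([run_battery(F, y, seed=s) for s in range(5)])
```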
It can be seen from the results in Table 1 that the words “um”, “dois”, “quatro”, and “cinco” were classified correctly as first choices, and the words “seis” and “nove” were classified as second choices. Those figures are encouraging, since the testing conditions were challenging: different speakers, automatic location of the mouth and lips region, an unfixed number of frames for the speech, and a relatively small sample size with a difficult vocabulary (i.e. the words are either one or two pitches apart, in the sense that one could separate the sound wave into two major sounds).
The recent literature in the area [3, 9] indicates that for a small-size vocabulary, mostly tested under controlled conditions and with one speaker only, positive classification figures of 40% were the highest. Considering the top four words classified correctly as shown in Table 1, the average success is 64%, and for the complete set of experiments the success rate is 35%.
Table 1 Confusion matrix for the ten classes of words tested. Values indicate the percentage of positive classification, averaged over five batteries of tests in a hold-out scheme. Rows are the spoken words; columns are the assigned classes.

        “0”    “1”    “2”    “3”    “4”    “5”    “6”    “7”    “8”    “9”
“0”    0,05   0,00   0,00   0,22   0,20   0,15   0,13   0,00   0,02   0,05
“1”    0,00   0,97   0,03   0,00   0,00   0,00   0,00   0,00   0,00   0,00
“2”    0,00   0,00   0,93   0,00   0,00   0,00   0,07   0,00   0,00   0,00
“3”    0,10   0,00   0,00   0,07   0,13   0,00   0,38   0,00   0,32   0,00
“4”    0,02   0,00   0,00   0,22   0,25   0,08   0,18   0,03   0,15   0,07
“5”    0,02   0,00   0,00   0,00   0,10   0,42   0,15   0,22   0,00   0,10
“6”    0,03   0,00   0,05   0,38   0,13   0,05   0,17   0,02   0,17   0,00
“7”    0,17   0,00   0,00   0,00   0,10   0,22   0,00   0,20   0,05   0,27
“8”    0,15   0,00   0,00   0,00   0,15   0,18   0,05   0,03   0,15   0,28
“9”    0,12   0,00   0,00   0,00   0,02   0,48   0,02   0,07   0,00   0,30
4. Conclusions and Future Work
In this paper we have shown a novel solution we
developed for visual speech recognition. The method
embraces the tasks of automatic lips detection, contour
and feature extraction, and recognition. More specifically, the automatic lips contour detection and the features designed and tested for word-set recognition are original contributions of this work.
We have collected 150 samples using a digital
camera with more than 20 speakers pronouncing the set of
numerals from “zero” to “nove” in Brazilian Portuguese.
The experimental results reached a top success rate of
over 64% for the correctly classified words, and on
average the success rate was 35%. Recently published results from the literature, Matthews in [8], indicate a range of 30-40% success rates for state-of-the-art systems. The results we achieved are promising, and there is room for improvement by adding more features for the lips and perhaps synchronizing the acquired speech.
Visual speech recognition is an important area of
application for dictation and interface systems, especially
considering the new standards for digital video and
television being adopted in the marketplace. It is important to say that visual speech recognition is not meant to achieve recognition on its own; it is rather a complementary tool in the broader context of audio-visual speech recognition. Besides being a novel and competitive approach, the work presented here appears to be one of the first evaluations regarding Brazilian Portuguese. Lines of future work include evaluating the approach with more data and a
larger vocabulary, as well as designing visual features
more sensitive to a specific set of words and visemes. We
plan to use video data, as in TV broadcast news, as well
as a vocabulary of 3K words in Brazilian Portuguese. For
this extension we think that an HMM (Hidden Markov Model) would improve the recognition rate, since it would be more precise in distinguishing the movement and speed constraints; see Rabiner [10] for more details on HMMs.
Acknowledgements
We thank the people who volunteered as anonymous
speakers for the database we collected, and all friends
from the former Laboratory for Intelligent Systems (LIS)
(1998-2000).
References
[1] A. Abutaleb, “Automatic Thresholding of Gray-level Pictures using Two-dimensional Entropy”, Computer Graphics and Image Understanding, 41(1) (1989), 22--32.
[2] A. Caplier, “Lip Detection and Tracking”, in Proceedings of the 11th Conference on Image Analysis and Processing (CIAP) (2001), IEEE CS Press, 8--13.
[3] T. Chen, “Audiovisual Speech Processing: lip reading and lip synchronization”, IEEE Signal Processing Magazine, January (2001), 9--21.
[4] E. Davies, “Machine Vision: theory, algorithms, practicalities”, 2nd ed., Academic Press, USA, 1997.
[5] T. Faruquie, A. Majundar, N. Rajput, and L. Subramaniam, “Large Vocabulary Audio-Visual Speech Recognition using Active Shape Models”, in Proceedings of the 15th International Conference on Pattern Recognition (ICPR) (2000), IEEE CS Press, 106--109.
[6] W. Lie and H. Hsieh, “Lip Detection by Morphological Image Processing”, in Proceedings of the International Conference on Signal Processing (ICSP) (1998), 7--13.
[7] A. Liew, S. Leung, and W. Lau, “Region-based Approach to Robust Lip Contour Extraction”, Electronics Letters 22(15) (2000), 1272--1274.
[8] I. Matthews, “Features for Audio-Visual Speech
Recognition”, Ph.D. Thesis, School of Information
Systems, University of East Anglia, UK, 1998.
[9] I. Matthews, T. Cootes, J. Bangham, S. Cox and R.
Harvey, “Extraction of Visual Features for
Lipreading”, IEEE Transactions on Pattern Analysis
and Machine Intelligence 24 (2) (2002), 779--789.
[10] L. R. Rabiner, “A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition”, Proceedings of the IEEE 77(2) (1989), 257--286.
[11] M. Sadeghi, J. Kittler, and K. Messer, “Modelling and Segmentation of Lip Area in Face Images”, IEE Proceedings - Vision, Image and Signal Processing 149(3) (2002), 179--184.
[12] Q. Summerfield, “Lipreading and audio-visual speech perception”, Philosophical Transactions of the Royal Society of London B, 335 (1992), 71--78.