A Segment-Based Audio-Visual Speech Recognizer:
Data Collection, Development, and Initial Experiments
Timothy J. Hazen, Kate Saenko, Chia-Hao La, and James R. Glass
MIT Computer Science and Artificial Intelligence Laboratory
32 Vassar Street
Cambridge, Massachusetts, USA
{hazen,saenko,chiahao,jrg}@sls.csail.mit.edu
ABSTRACT
This paper presents the development and evaluation of a
speaker-independent audio-visual speech recognition (AVSR)
system that utilizes a segment-based modeling strategy. To
support this research, we have collected a new video corpus,
called Audio-Visual TIMIT (AV-TIMIT), which consists of 4
total hours of read speech collected from 223 different speakers. This new corpus was used to evaluate our new AVSR
system which incorporates a novel audio-visual integration
scheme using segment-constrained Hidden Markov Models
(HMMs). Preliminary experiments have demonstrated improvements in phonetic recognition performance when incorporating visual information into the speech recognition
process.
Categories and Subject Descriptors
I.2.M [Artificial Intelligence]: Miscellaneous
General Terms
Algorithms, Design, Experimentation.
Keywords
Audio-visual speech recognition, audio-visual corpora.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
ICMI'04, October 13-15, 2004, State College, Pennsylvania, USA.
Copyright 2004 ACM 1-58113-954-3/04/0010 ...$5.00.

1. INTRODUCTION
Visual information has been shown to be useful for improving the accuracy of speech recognition in both humans and machines [1, 15]. These improvements are the result of the complementary nature of the visual and aural modalities. For example, many sounds that are confusable by ear, such as [n] and [m], are easily distinguishable by eye. The improvements from adding the visual modality are also more pronounced in noisy conditions, where the audio signal-to-noise ratio is reduced [1, 17].

In this paper, we describe our efforts in developing an audio-visual speech recognition (AVSR) system. It is hoped that this speech recognition technology can be deployed in systems located in potentially noisy environments where visual monitoring of the user is possible. These locales include automobiles, public kiosks, and offices. Our new AVSR system is built upon our existing segment-based speech recognizer [5]. This AVSR system incorporates information collected from visual measurements of the speaker's lip region using a novel audio-visual integration mechanism which we call a segment-constrained Hidden Markov Model (HMM). Our AVSR system is described in detail in Section 3.

To help develop and evaluate our new AVSR system, we have collected a corpus containing 4 total hours of audio-visual speech data collected from 223 different speakers. The corpus contains read-speech utterances of TIMIT SX sentences [22]. The video contains frontal views of the speaker's head in front of a solid blue backdrop under two different lighting conditions. This corpus is described in Section 2. Phonetic recognition experiments using this corpus are reported in Section 4. We summarize our efforts in Section 5 and propose future work in Section 6.

2. DATA COLLECTION
2.1 Existing Audio-Visual Corpora
A variety of audio-visual corpora have been created by researchers in order to obtain experimental results for specific
tasks. Corpora available for public use have originated primarily from universities, and tend to be less extensive than
the ones collected by private research labs. Many of the former contain recordings of only one subject, e.g. [2]. Even
those with multiple subjects are usually limited to small
tasks, such as isolated letters [12], digits [3, 16], or a short
list of fixed phrases [14]. Only two of the A/V corpora published in the literature (including English, French, German
and Japanese) contain both a large vocabulary and a significant number of subjects. The first is IBM’s proprietary,
290-subject, large-vocabulary AV-ViaVoice database of approximately 50 hours in duration [8]. The second is the
VidTIMIT database [19], which was recently made available by the Linguistic Data Consortium (LDC). It consists
of 43 subjects reciting 10 TIMIT sentences each, and has
been used in multi-modal person verification research [20].
This corpus was not yet publicly available at the onset of
our own data collection effort.
Figure 1: Sample still shots from the AV-TIMIT
corpus showing the two lighting conditions used in
the recording.
Figure 2: Examples of tracked mouth regions from
the AV-TIMIT corpus. The bottom row shows
tracking failures.
2.2 AV-TIMIT Data Collection
To provide an initial corpus for our research in audiovisual speech recognition we collected a new corpus of video
recordings called the Audio-Visual TIMIT (AV-TIMIT) corpus. It contains read speech and was recorded in a relatively quiet office with controlled lighting, background, and audio noise levels. The main design goals for this corpus were: 1) continuous, phonetically balanced speech, 2) multiple speakers, 3) a controlled office environment, and 4) high-resolution video. The following sections describe each aspect of the data collection in detail.
2.2.1 Linguistic Content
Because of size and linguistic flexibility requirements, we
decided to create a corpus of phonetically rich and balanced
speech. We used the 450 TIMIT-SX sentences originally
designed to provide a good coverage of phonetic contexts of
the English language in as few words as possible [22]. Each
speaker was asked to read 20 sentences. The first sentence
was the same for all speakers, and is intended to allow them
to become accustomed to the recording process. The other
19 sentences differed for each round. In total, 23 different
rounds of utterances were created that test subjects were
rotated through. Each of the 23 rounds of utterances was
spoken by at least nine different speakers.
2.2.2 Recording Process
Recording was completed during the course of one week.
The hardware setup included a desktop PC, a GN Netcom
voice array microphone situated behind the keyboard, and
a high-quality SONY DCR-VX2000 video camcorder. The
camera was mounted on a tripod behind the computer display to record a frontal view of each subject. A blue curtain was hung behind the chair to reduce image background
noise; however, users were not told to restrict their movements. The audio quality was generally clean, but the microphone did pick up some noise from a computer fan. The average signal-to-noise ratio within individual utterances was
approximately 25 dB, with a standard deviation of 4.5 dB.
After being seated in front of the computer, the user
was instructed to press and hold the “Record” button on
the interface while reading each prompted utterance from
the screen. Upon button release, the program echoed the
recorded waveform back, so that the user could hear his/her
own recording. To help ensure that the speech matched the
orthographic transcription, an observer was present in the
room to ask the user to re-record a sentence if necessary.
For the last five sentences, extra side lighting was added in
order to simulate different lighting conditions (see Figure 1).
2.2.3 Database Format
Full color video was stored in uncompressed digital video
(DV) AVI format at 30 frames per second and 720x480 resolution. In addition to the audio track contained in the video
files, the audio was also saved into separate WAV files, sampled at 16 kHz. The total database duration is approximately 4 hours.
2.2.4 Demographics
The majority of volunteers came from our organization’s
community. The final audio-visual TIMIT corpus contained
223 speakers, of which 117 were male and 106 were female.
All but 12 of the subjects were native speakers of English.
Different ages and ethnicities were represented, as well as
people with/without beards, glasses and hats.
2.3 AV-TIMIT Annotation
2.3.1 Audio Processing
Time-aligned phonetic transcriptions of the data were created automatically using a word recognition system configured for forced-path alignment. This recognizer allowed
multiple phonetic pronunciations of the spoken words. Alternate pronunciation paths could result either from a set of
phonological variation rules or from alternate phonemic pronunciations specified in a lexical pronunciation dictionary.
The acoustic models for the forced-path alignment process were seeded from models generated from the TIMIT
corpus [22]. Because the noise level of the AV-TIMIT corpus was higher than that of TIMIT (which was recorded
with a noise-canceling close-talking microphone), the initial
time-aligned transcriptions were not as accurate as we had
desired (as determined by expert visual inspection against
spectrograms). To correct this, the acoustic models were iteratively retrained on the AV-TIMIT corpus from the initial
transcriptions. After two re-training iterations, a final set of
transcriptions were generated and deemed acceptable based
on expert visual inspection. These transcriptions serve as
the reference transcriptions used during the phonetic recognition evaluation presented in Section 4.
2.3.2 Video Processing
The video was annotated in two different ways. First, the
face region was extracted using a face detector. In order
to eliminate any translation of the speaker’s head, the face
sequence was stabilized using correlation tracking of the nose
region. However, since we also needed a reliable way to
locate the speaker’s mouth, we then used a mouth tracker
to extract the mouth region from the video. The mouth
tracker is part of the visual front end of the open source
AVCSR toolkit available from Intel [9].
Figure 3: Illustration of a search network created for segment-based recognition. The best segment path is highlighted in the segment network.
Although the front end algorithms were trained on different corpora than our own, they performed relatively well on
the AV-TIMIT corpus. The mouth tracker uses two classifiers (one for mouth and the other for mouth-with-beard)
to detect the mouth within the lower region of the face. If
the mouth was detected successfully in several consecutive
frames, the system entered the tracking state, in which the
detector was applied to a small region around the previously
detected mouth. Finally, the mouth locations over time were
smoothed and outliers were rejected using a median filter.
For more details about the algorithm, see [11]. The system performed well on most speakers; however, for some it
produced unacceptable tracking results (see Figure 2). Two
possible reasons for such failures are side lighting and rotation of the speaker’s head, both of which the system had
difficulty handling. Facial expressions, e.g. smiling, also
seemed to have a negative effect on tracking. Another possibility is that the fixed parameters used in the search did
not generalize well to some speakers’ facial structure. To
obtain better tracking in such cases, the search area for the
mouth in the lower region of the face was adjusted manually,
as were the relative dimensions of the mouth rectangle.
With these measures, most of the remaining tracking failures were in the first few frames of the recording, before the
speaker started reading the sentence. The final tracking results, consisting of a 100x70 pixel rectangle centered on the
mouth in each frame, were saved to a separate file in raw
AVI format.
3. SEGMENT-BASED AVSR
3.1 Segment-Based Recognition
Our audio-visual speech recognition approach builds upon
our existing segment-based speech recognition system [5].
One of our recognizer’s distinguishing characteristics is its
use of segment-based networks for processing speech. Typical speech recognizers use measurements extracted from
frames processed at a fixed rate (e.g., every 10 milliseconds). In contrast, segment networks are based on the idea
that speech waveforms can be broken up into variable length
segments that each correspond to an acoustic unit, such as
a phone.
Our recognizer initially processes the speech using standard frame-based processing. Specifically, 14 Mel-frequency cepstral coefficients (MFCCs) are extracted from the acoustic
waveform every 5 milliseconds. However, unlike frame-based
hidden Markov models (HMMs), our system hypothesizes
points in time where salient acoustic landmarks might exist. These hypothesized landmarks are used to generate a
network of possible segments. The acoustic modeling component of the system scores feature vectors extracted from
the segments and landmarks present in the segment network
(rather than on individual frames). The search then forces
a one-to-one mapping of segments to phonetic events. The
end result of recognition is a path through the segment network in which all selected segments are contiguous in time
and are assigned an appropriate phone. Figure 3 illustrates
an example segment network constructed for a waveform of
the phrase “computers that talk”, where the optimal path
determined by the recognizer has been highlighted.
3.2 Visual Feature Extraction
There are two main approaches to visual feature extraction for speech recognition. The first is an appearance-based, or bottom-up, approach, in which the raw image pixels are compressed, for example, using a linear transform, such as a discrete cosine transform (DCT), a principal component analysis (PCA) projection, or a linear discriminant
analysis (LDA) projection. The second is a model-based,
or top-down, approach, in which a pre-determined model,
such as the contour of the lips, is fitted to the data. Some
approaches combine both appearance and model-based features. It has been found that, in general, bottom-up methods perform better than top-down methods, because the latter tend to be sensitive to model-fitting errors [15].
For this experiment, appearance-based visual features were
extracted from the raw images of the mouth region using the
visual front end of the AVCSR Toolkit. First, each image
was normalized for lighting variation using histogram equalization. Then, a PCA transform was applied, and the top 32
coefficients retained as feature vectors. Figure 4 shows the
lip images re-constructed from the means of the PCA coefficient distributions for the middle frame of each phoneme. In
order to capture the dynamics of the signal, three consecutive vectors were concatenated together, resulting in one
96-dimensional vector per frame.
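As a concrete illustration, this front end (histogram equalization, projection onto 32 PCA coefficients, and stacking of three consecutive vectors) could be sketched as below. This is not the AVCSR Toolkit's code; the function names are ours, and the PCA mean and basis are assumed to have been estimated offline from training images.

```python
import numpy as np

def histogram_equalize(img):
    # Normalize lighting in a 2-D uint8 grayscale mouth image.
    hist = np.bincount(img.ravel(), minlength=256)
    cdf = hist.cumsum().astype(float)
    span = cdf.max() - cdf.min()
    if span == 0:  # constant image; nothing to equalize
        return img.copy()
    cdf = (cdf - cdf.min()) * 255.0 / span
    return cdf[img].astype(np.uint8)

def pca_features(frames, mean, components, n_coeffs=32):
    # components: (k, H*W) PCA basis assumed trained offline.
    feats = [components[:n_coeffs] @ (f.ravel().astype(float) - mean)
             for f in frames]
    # Concatenate 3 consecutive 32-dim vectors -> one 96-dim vector per frame.
    return np.array([np.concatenate(feats[i - 1:i + 2])
                     for i in range(1, len(feats) - 1)])
```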
3.3 Visual Recognition Units
Figure 4: Image representation generated from the mean value of the 32-dimension feature vector extracted from video frames for each of 50 different phonetic labels.

Typical audio-only speech recognition systems use phones (i.e., the acoustic realizations of phonemes) as the basic units for speech recognition. Words are represented as strings
of phones, which can then be mapped to acoustic observations using context-dependent acoustic models. When
adding visual information, most systems incorporate visual
units called visemes, which correspond roughly to phones,
and describe the visual realization of phonemes.
In general, one can only see a speaker’s lips and jaws, while
the other articulators (e.g., the tongue and the glottis) are
typically hidden from sight. Therefore, some visemes can actually correspond to more than one phone, resulting in a one-to-many mapping. For example, the phones [b] and [p] differ
from each other only in that [b] is voiced. Since voicing occurs at the glottis, which is not visible, these two phones are
visually indistinguishable, and can thus be mapped to the
same viseme unit. From a probabilistic modeling viewpoint,
the use of viseme units is essentially a form of model clustering that allows visually similar phonetic events to share
a tied model.
To help us determine a useful set of visemic units for our
AVSR system, we performed bottom-up clustering experiments using models created from phonetically labeled visual
frames. We use the Bhattacharyya distance as a measure of
similarity between two Gaussian distributions, which is defined as follows:
D_B = (1/8) (µ1 − µ2)^T [(Σ1 + Σ2)/2]^{−1} (µ1 − µ2) + (1/2) ln( |(Σ1 + Σ2)/2| / (|Σ1|^{1/2} |Σ2|^{1/2}) )
We started from 50 visual feature distributions corresponding to 50 phonetic units (there are 54 total phonetic units
used in our recognizer, but during this clustering we pre-clustered the two silence units into one unit, as well as merging [em] with [m], [en] with [n], and [zh] with [sh]). We then
used a standard agglomerative hierarchical clustering algorithm to successively merge clusters based on the maximum
distance between them. The resulting cluster tree based on
the 96-dimensional stacked PCA feature vectors is shown in
Figure 5. In almost all cases the clusterings are obvious and
expected. For example the cluster of [s], [z], [n], [tcl] and
[dcl] represents all of the phones with coronal closures or
constrictions.
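The clustering procedure above can be sketched as follows. This is an illustrative re-implementation, not our experimental code: bhattacharyya computes the distance defined earlier, and complete_linkage performs agglomerative merging using the maximum (complete-link) inter-cluster distance.

```python
import numpy as np

def bhattacharyya(mu1, S1, mu2, S2):
    # Bhattacharyya distance between two Gaussians N(mu1, S1) and N(mu2, S2).
    S = (S1 + S2) / 2.0
    d = np.asarray(mu1, float) - np.asarray(mu2, float)
    term1 = d @ np.linalg.solve(S, d) / 8.0
    term2 = 0.5 * np.log(np.linalg.det(S) /
                         np.sqrt(np.linalg.det(S1) * np.linalg.det(S2)))
    return term1 + term2

def complete_linkage(models, n_clusters):
    # models: list of (mu, Sigma) pairs; merge until n_clusters remain.
    clusters = [[i] for i in range(len(models))]

    def dist(a, b):  # complete-link: max pairwise distance between clusters
        return max(bhattacharyya(*models[i], *models[j]) for i in a for j in b)

    while len(clusters) > n_clusters:
        _, i, j = min((dist(clusters[i], clusters[j]), i, j)
                      for i in range(len(clusters))
                      for j in range(i + 1, len(clusters)))
        clusters[i] += clusters.pop(j)
    return clusters
```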
Table 1 shows the list of viseme units used in our system,
along with the set of phonetic units corresponding to each
viseme (as roughly based on the clusterings in Figure 5).
As mentioned earlier, the entire phonetic unit set contains 54
different units. This set of units roughly mimics the set of
units used in TIMIT.
3.4 Audio-Visual Integration
3.4.1 Early vs. Late Integration
One of the key questions to ask when integrating audio and visual data is “When should the information be
combined?” In early integration, feature vectors from both
modalities are concatenated into one large vector. This resulting vector is then processed by a joint audio-visual classifier, which uses the combined information to assign likelihoods to the recognizer’s phonetic hypotheses.
Figure 5: Bottom-up clustering of phonetic events based on similarity of Gaussian models constructed from the central visual frame of phonetic segments using 96-dimension stacked PCA feature vectors.

In late integration, the audio data and video data are analyzed by separate classifiers. Each classifier processes its
own data stream, and the two sets of outputs are combined
in a later stage to produce the final hypothesis. We have
chosen to integrate two independent audio and visual classifiers at the segment-level. In other words, the audio and
visual streams are processed independently until they are
merged during the segment-based search to produce joint
audio-visual scores for each segment.
We chose to perform late integration instead of early integration for two primary reasons. First, the feature concatenation used in early integration would result in a high-dimensional data space, making a large multi-modal database necessary for robust statistical model training.
Second, late integration provides greater flexibility in modeling. With late integration, it is possible to train the audio
and visual classifiers on different data sources. Thus, when
training the audio classifier, we could use audio-only data
sources to supplement the available joint audio-visual data.
Also, late integration allows adaptive channel weighting between the audio and visual streams based on environmental
conditions, such as the signal-to-noise ratio. Additionally,
late integration allows asynchronous processing of the two
streams, as discussed in the next subsection.
Viseme Label   Phone Set
Sil            (merged silence units)
OV             ax ih iy dx
BV             ah aa
FV             ae eh ay ey hh
RV             aw uh uw ow ao w oy
L              el l
R              er axr r
Y              y
LB             b p
LCl            bcl pcl m em
AlCl           s z epi tcl dcl n en
Pal            ch jh sh zh
SB             t d th dh g k
LFr            f v
VlCl           gcl kcl ng

Table 1: Mapping of phonetic units to visemes for our experiments.
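In code, the tying in Table 1 reduces to a many-to-one lookup from phone to viseme. The sketch below is illustrative only (the silence entry is omitted, and the triviseme label format follows the current−left+right convention of Section 3.4.3):

```python
# Many-to-one phone-to-viseme mapping transcribed from Table 1.
VISEME_OF = {}
for viseme, phones in {
    "OV": "ax ih iy dx", "BV": "ah aa", "FV": "ae eh ay ey hh",
    "RV": "aw uh uw ow ao w oy", "L": "el l", "R": "er axr r", "Y": "y",
    "LB": "b p", "LCl": "bcl pcl m em", "AlCl": "s z epi tcl dcl n en",
    "Pal": "ch jh sh zh", "SB": "t d th dh g k", "LFr": "f v",
    "VlCl": "gcl kcl ng",
}.items():
    for p in phones.split():
        VISEME_OF[p] = viseme

def triviseme(current, left, right):
    # Map a triphone (current phone, left context, right context)
    # to its triviseme label, e.g. ao-t+kcl -> RV-SB+VlCl.
    return f"{VISEME_OF[current]}-{VISEME_OF[left]}+{VISEME_OF[right]}"
```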
3.4.2 Audio-Visual Asynchrony
There is an inherent asynchrony between the visual and
audio cues of speech. Speech is produced via the closely
coordinated movement of several articulators. In some cases,
such as the [b] burst release, the visual and audio cues are
well synchronized. However, due to co-articulation effects
and articulator inertia, the audio and visual cues may not be
precisely synchronized at any given time. The articulators
such as the lips and tongue sometimes move in anticipation
of a phonetic event tens or even hundreds of milliseconds
before the phone is actually produced [1]. In these cases,
the visual evidence of the phonetic event may be evident
before the acoustic evidence is produced.
To provide an example, consider the /g/ to /m/ transition in the word segment. Typically, the /g/ in this context
is unreleased with only the voiced velar closure [gcl] being
realized. Because this closure is produced with the tongue,
the lips are free to form the closure for the [m] during the
[gcl] segment. The labial closure for [m] does not affect
the acoustics of the velar closure [gcl] because velar closures
precede labial closures in the vocal tract. As a result, the
visual evidence of the [m] can be present before its acoustic
evidence.
A variety of methods for modeling audio-visual asynchrony
have been proposed in the literature including multi-stream
hidden Markov models [4], product hidden Markov models [4], and multi-state time-delay neural networks [13]. In
all of the previous work we have examined, frame-based
modeling is used on both streams, and asynchrony is controlled via joint constraints placed on the finite-state networks processing each stream.
3.4.3 Segment-Constrained HMMs
In our AVSR recognizer, we have implemented asynchronous
modeling of the audio and visual streams using an approach
we call segment-constrained HMMs. This modeling is implemented with three primary steps. First, fixed-length video
frames are mapped to variable-length audio segments defined by the audio recognition process. The mapping is performed such that any path through the segment network
will incorporate each video frame exactly once. Second, each context-dependent phonetic segment is mapped to a context-dependent segment-constrained viseme HMM. Finally, the segment-constrained viseme HMM uses a frame-based Viterbi search over the visual frames in the segment to generate a segment-based score for the visemic model. These steps are described in the following paragraphs.

Figure 6: Example segment-constrained triviseme HMM from our system. The triphone ao-t+kcl maps to the triviseme RV-SB+VlCl, whose states S1, S2, and S3 use the observation models SB|RV, RV, and RV|VlCl.
The first step of our audio-visual integration is identifying the visual frames corresponding to each segment. Since
the audio sampling rate is often not an integer multiple of
the video sampling rate, a convention must be defined to
systematically map frames to segments. In our approach,
the beginning and ending times of any visual frame are averaged to obtain the frame’s midpoint. Then, the frame is
assigned to any segment whose start time lies before the
frame’s midpoint, and whose end time lies after it. This
scheme has the desirable property that, for any given path
of non-overlapping segments covering all times in the utterance, each video frame is mapped to exactly one segment.
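The midpoint convention can be sketched as the following illustrative function (not our recognizer's internal code); using a half-open containment test gives the exactly-once property described above:

```python
def assign_frames_to_segments(frame_times, segments):
    """Map fixed-rate video frames to variable-length audio segments.

    frame_times: list of (start, end) times for each video frame.
    segments: sorted, non-overlapping (start, end) audio segments.
    A frame belongs to the segment containing its midpoint, so any
    contiguous segment path uses each video frame exactly once."""
    assignment = []
    for fs, fe in frame_times:
        mid = (fs + fe) / 2.0
        idx = next((i for i, (ss, se) in enumerate(segments)
                    if ss <= mid < se), None)
        assignment.append(idx)
    return assignment
```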
The audio recognizer uses triphone-based context-dependent
modeling for each segment. The mapping of segment-based
triphone acoustic models to frame-based viseme models is
accomplished through our segment-constrained HMM approach. Each triphone acoustic model is mapped to a corresponding triviseme visual model. Each visual model is represented by a three-state, left-to-right HMM which allows any of the three states to be skipped. Figure 6 shows
an example triviseme HMM model used for the phonetic
triphone ao-t+kcl (where ao is the current phone, t is the
left context, and kcl is the right context). This triphone
is mapped to the triviseme RV-SB+VlCl (where RV is a
rounded vowel, SB is a stop burst, and VlCl is a velar closure.)
Figure 6 also shows the mapping from each triviseme
state to the label of the observation density function it uses.
In our approach, the left state of every triviseme HMM is
mapped to a diviseme model (e.g., SB|RV) based on its left
context, the middle state uses a context independent model
for that viseme (e.g., RV), and the right state is handled by
the right side diviseme (e.g., RV|VlCl). Having established
a model structure for aligning visual frames with an audio
segment, the optimal frame alignment is determined using
a Viterbi search over the frames in a segment.
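This frame alignment step can be sketched as a standard Viterbi recursion over the frames of one segment, with skips realized by allowing entry and exit at any state. The code is illustrative; state and transition scores are assumed to be given in the log domain, and the function assumes at least one frame.

```python
def viterbi_segment_score(frame_logprobs, trans_logprobs):
    """Score the visual frames of one segment with a 3-state
    left-to-right HMM that may skip states.

    frame_logprobs[t][s]: log-likelihood of frame t under state s.
    trans_logprobs[p][s]: log transition score for p -> s (p <= s)."""
    n_states = 3
    # Entry at any state (earlier states skipped).
    v = [frame_logprobs[0][s] for s in range(n_states)]
    for t in range(1, len(frame_logprobs)):
        # Left-to-right: state s can be reached from any p <= s.
        v = [max(v[p] + trans_logprobs[p][s] for p in range(s + 1))
             + frame_logprobs[t][s]
             for s in range(n_states)]
    # Exit from any state (later states skipped).
    return max(v)
```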
At this point, it is important to note that the third state
(S3) of a triviseme model will use the same diviseme observation density function as the first state (S1) of the
triviseme model for the following segment. This is illustrated
in Figure 7, where state 3 of the triviseme RV-SB+VlCl and
state 1 of the triviseme VlCl-RV+SB both use the same diviseme observation model, RV|VlCl, for their output probability function. This figure demonstrates how a constrained
form of asynchronous modeling is introduced by our implementation of the viseme HMMs.

Figure 7: Using segment-constrained HMMs to represent audio-visual asynchrony for a given audio segment sequence.

For any sequence of two triphone segments, the diviseme observation model capturing the visual transition between these audio segments is
allowed to extend an arbitrary number of visual frames into
either of the preceding or following audio segments. Also
note that the segment constrained HMM is allowed to skip
any of the three states in a triviseme model. An example of this is shown in Figure 7 where the first state of the
triviseme RV-SB+VlCl is skipped and the segment begins
immediately with state 2 of the HMM.
The specific example in Figure 7 shows how our approach can model asynchrony during the [ao] to
[kcl] transition in the word talk. In the visual signal there
will be a smooth and gradual transition from the rounded
vowel [ao] into the velar closure [kcl]. However, in the audio
signal an abrupt acoustic transition occurs at the moment
the velar closure is realized. The acoustic signal then contains silence during the velar closure [kcl] until it is released
with a [k] burst. Thus, the transitional movement of the
lips from a rounded to an unrounded position occurs during
a period of time when no acoustic change is evident. This
asynchrony is handled in our model by allowing the visual
frames assigned to RV|VlCl viseme transition to straddle
the acoustic segment boundary separating the [ao] acoustic
segment from the [kcl] acoustic segment.
4. EXPERIMENTAL RESULTS
For our initial experiments with our AVSR design, we
have implemented a phonetic AVSR recognizer that we have
trained and evaluated on the AV-TIMIT corpus. The following sections provide the details and results of our experiments.
4.1 Experimental Data Sets
For our experiments we sub-divided the AV-TIMIT corpus into three subsets: a training set, a development test
set, and a final test set. The training set consisted of 3608
utterances from 185 speakers. To help constrain our initial experiments, we elected to evaluate using frontal lighting conditions only (and ignore the side-lighting condition).
Under this constraint, the training set is reduced to 2751 utterances. The full 3608 utterances are still used for training
the acoustic models, while the visual classification models
are trained from only the reduced set of 2751 frontal lighting utterances.
For evaluation, our development test set contains 284 ut-
terances from 19 speakers. The final test set contains 285
utterances from another 19 speakers. There is no overlap in
speakers between any of the three data sets.
4.2 Modeling Details
For acoustic modeling our system utilizes two types of
acoustic measurements, segment features and landmark features. Details about these measurement types can be found
in [5] and [21]. These models were trained from time-aligned
phonetic transcriptions described in Section 2.3.1. In all, the
segment model contains 834 context-dependent triphone-based segment models, and 629 diphone- and triphone-based
landmark models.
For visual modeling, the time-aligned phonetic transcriptions were mapped to viseme labels using the mapping contained in Table 1. Visual frames were mapped to the time-aligned segments using each frame's mid-point. To initialize
the triviseme observation models, each frame was mapped
to a specific triviseme state (either the first, second or third
state) based on its relative position in the segment (i.e., in
the first, second, or final third). Each triviseme state was
then mapped to its appropriate observation model (i.e., a
diviseme transition model for the first and third states, or
a context-independent viseme model for the middle state).
The trained models obtained from this initial mapping were
then subjected to two rounds of iterative Viterbi retraining. In total the viseme observation models contained 15
context-independent viseme models and 203 diviseme transition models. In addition to the viseme observation models, the transition probabilities in the triviseme segment-constrained HMMs were also estimated during the Viterbi
training process. In total, there are 2690 unique trivisemes
resulting in 8070 states in the segment-constrained HMMs.
In addition to the audio and visual classifiers, a phonetic
bigram was estimated from the training utterances to provide the language model constraint. The recognizer also
controls the trade-off between phonetic insertions and phonetic deletions using a single segment transition weight. The
scores from the visual classifier, audio classifiers, and phonetic bigram are integrated using linear weighted combination. The weighting factors for the linear combination were
tuned to maximize phonetic recognition performance on the
development test.
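The score integration described above amounts to a weighted linear combination in the log domain; a minimal sketch follows (the weight names are illustrative, not the paper's notation, and the actual values are tuned on the development set):

```python
def combined_score(audio_score, visual_score, lm_score,
                   visual_weight, lm_weight,
                   seg_transition_weight, n_segments):
    """Linear log-score combination of the audio classifier, the visual
    classifier, and the phonetic bigram, plus a single segment
    transition weight controlling the insertion/deletion trade-off."""
    return (audio_score
            + visual_weight * visual_score
            + lm_weight * lm_score
            + seg_transition_weight * n_segments)
```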
4.3 Results
Our initial AVSR results are reported in Table 2. Results are reported in phonetic recognition error rates over
the standard 39 phonetic classes typically used in TIMIT
experiments [10]. As can be seen in the table, when incorporating the visual information into the recognizer, a relative reduction of 5% in phonetic error rate was obtained on
the development set, when the system’s scaling weights were
tuned appropriately. On the final test set, our AVSR system
produced a relative 2.5% reduction in errors from our ASR
system. Although these gains appear small, previous studies in quiet environments have also found relatively small
improvements in performance when adding visual information [17]. With an average signal-to-noise ratio of 25 dB, the
AV-TIMIT corpus is relatively noise-free. We expect a bigger benefit from the visual channel when we move to noisier
environments. We should also note that previous “small”
improvements in phonetic recognition have translated into
larger improvements in word recognition when our modeling techniques are transitioned from phonetic recognition to word recognition tasks [6, 7].

Test   Type of         Phonetic Error Rates (%)
Set    Recognizer      Subs.   Ins.   Del.   Total
Dev    Audio Only      23.6    6.8     9.9    39.8
Dev    Audio-Visual    21.9    6.4     9.4    37.8
Test   Audio Only      20.9    4.9    10.4    36.1
Test   Audio-Visual    20.2    4.6    10.4    35.2

Table 2: Phonetic recognition error rates using audio-only and audio-visual recognition.

Phone Pair   Errors Corrected
m n          16
k p          16
l r          10
ae ax        10
ae ao         7

Table 3: Phone pairs which accounted for the largest number of ASR substitution errors corrected by the AVSR system.
To examine what types of errors are corrected when the
visual information is added, we extracted the list of phonetic
confusions that were most often corrected when moving from
our audio-only recognizer to our AVSR system. Table 3
shows the five phonetic confusions most commonly corrected
by the AVSR system. In each of these five phone pairs, it
is clear that visual information can be exploited to improve
recognition between the pairs. For example, the [m]/[n] and
[k]/[p] pairs can be distinguished by the presence or absence
of a labial closure.
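The confusion analysis above can be tabulated as in the following simplified sketch. It assumes the reference, ASR, and AVSR phone sequences have already been aligned (the paper does not specify its alignment procedure, and this helper is our own illustrative construction):

```python
from collections import Counter

def corrected_confusions(ref, asr_hyp, avsr_hyp):
    """Count substitution errors made by the ASR hypothesis but fixed by the
    AVSR hypothesis, keyed by (reference phone, ASR phone).
    Assumes all three sequences are already segment-aligned."""
    counts = Counter()
    for r, a, v in zip(ref, asr_hyp, avsr_hyp):
        if a != r and v == r:  # ASR substituted the wrong phone; AVSR got it right
            counts[(r, a)] += 1
    return counts

# Toy aligned sequences: AVSR fixes two [m]->[n] substitutions
# but not the [k]->[p] one.
ref  = ["m", "k", "ae", "m"]
asr  = ["n", "p", "ae", "n"]
avsr = ["m", "p", "ae", "m"]
top = corrected_confusions(ref, asr, avsr).most_common(1)
```

Ranking the resulting counts over the full test set is what produces a table like Table 3.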
5. SUMMARY
In this paper we have presented details about our initial efforts in developing an audio-visual speech recognition
system. This research contains two primary contributions.
First, we have collected a new audio-visual speech corpus
based on TIMIT utterances. This corpus contains 4 hours
of video from 223 different speakers. It is our plan to make
all (or at least portions) of this corpus publicly available to
the research community once its annotation has been fully
completed and verified. As of the publication time of this
paper, we are in the final stages of verifying the video annotations (i.e., the face and lip tracking results), and are
beginning to investigate the potential avenues for distributing this data. The most obvious mechanism is to distribute
the data via the Linguistic Data Consortium, but no final decision has yet been made in this regard.
The second major contribution of this work is the development of an audio-visual speech recognition system that utilizes a new approach to audio-visual integration, which we call segment-constrained hidden Markov modeling. A preliminary implementation of this approach using standard acoustic and visual processing techniques yielded a 2.5% relative reduction in phonetic error rate in our experiments with the AV-TIMIT corpus.
6. FUTURE WORK
As with all development work in its initial phases, there
is still much to be done on our project. From the scientific point of view, we are currently exploring new visual
features for efficiently and robustly representing relevant lip
information [18]. For real-world applications, such as kiosks,
visual features need to be robust to additional difficulties
such as visual occlusions, rotated or tilted head positions,
and general visual noise (e.g., variable lighting conditions,
video compression artifacts, etc.).
From an engineering point of view, we plan to explore
a variety of techniques for improving performance, including replacing our current manually determined set of viseme
clusters with a clustering determined using a top-down decision tree approach. We also plan to investigate the effect of
different noise conditions on our approach, and specifically
how it affects the weighting of the audio and visual scores.
From a development point of view, we plan to migrate this
system to word recognition tasks, and eventually to deploy
this system within real applications in difficult environments
such as automobiles, public kiosks, and noisy offices. To
this end, we have already undertaken a data collection effort
within moving automobiles, and future data collections in
other challenging environments are anticipated.
7. ACKNOWLEDGMENTS
This work was sponsored in part by the Ford-MIT Alliance. The authors also wish to thank Marcia Davidson for
her efforts during the data collection process, and all of the
subjects who contributed data for the AV-TIMIT corpus.
8. REFERENCES
[1] C. Benoit. The intrinsic bimodality of speech
communication and the synthesis of talking faces. In
Journal on Communications of the Scientific Society
for Telecommunications, Hungary, number 43, pages
32–40, September 1992.
[2] M. T. Chan, Y. Zhang, and T. S. Huang. Real-time lip
tracking and bimodal continuous speech recognition.
In Proc. of the Workshop on Multimedia Signal
Processing, pp. 65–70, Redondo Beach, CA, 1998.
[3] S. Chu and T. Huang. Bimodal speech recognition
using coupled hidden Markov models. In Proc. of the
International Conference on Spoken Language
Processing, vol. II, Beijing, October 2000.
[4] S. Dupont and J. Luettin. Audio-visual speech
modeling for continuous speech recognition. In IEEE
Transactions on Multimedia, number 2, pages
141–151, September 2000.
[5] J. Glass. A probabilistic framework for segment-based
speech recognition. To appear in Computer Speech and
Language, 2003.
[6] A. Halberstadt and J. Glass. Heterogeneous
measurements and multiple classifiers for speech
recognition. In Proceedings of ICSLP 98, Sydney,
Australia, November 1998.
[7] T. J. Hazen and A. Halberstadt. Using aggregation to improve the performance of mixture Gaussian acoustic models. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, Seattle, May 1998.
[8] IBM Research - Audio Visual Speech Technologies: Data Collection. Accessed online at http://www.research.ibm.com/AVSTG/data.html, May 2003.
[9] Intel's AVCSR Toolkit source code can be downloaded from http://sourceforge.net/projects/opencvlibrary/.
[10] K. F. Lee and H. W. Hon. Speaker-independent phone recognition using hidden Markov models. In IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 37, no. 11, pp. 1641-1648, November 1989.
[11] L. H. Liang, X. X. Liu, Y. Zhao, X. Pi, and A. V. Nefian. Speaker independent audio-visual continuous speech recognition. In Proc. of the IEEE International Conference on Multimedia and Expo, vol. 2, pp. 25-28, 2002.
[12] I. Matthews, J. A. Bangham, and S. Cox. Audio-visual speech recognition using multiscale nonlinear image decomposition. In Proc. of the International Conference on Spoken Language Processing, pp. 38-41, Philadelphia, PA, 1996.
[13] U. Meier, R. Stiefelhagen, J. Yang, and A. Waibel. Towards unrestricted lip reading. In International Journal of Pattern Recognition and Artificial Intelligence, vol. 14, pages 571-585, August 2000.
[14] K. Messer, J. Matas, J. Kittler, and K. Jonsson. XM2VTSDB: The extended M2VTS database. In Audio- and Video-based Biometric Person Authentication, AVBPA'99, pages 72-77, Washington, D.C., March 1999. IDIAP-RR 99-02.
[15] C. Neti, et al. Audio-visual speech recognition. Technical Report, Center for Language and Speech Processing, The Johns Hopkins University, Baltimore, Maryland, 2000.
[16] S. Pigeon and L. Vandendorpe. The M2VTS multimodal face database. In Proc. of the Audio- and Video-based Biometric Person Authentication Workshop, Germany, 1997.
[17] G. Potamianos and C. Neti. Audio-visual speech recognition in challenging environments. In Proc. of EUROSPEECH, pp. 1293-1296, Geneva, Switzerland, September 2003.
[18] K. Saenko, T. Darrell, and J. Glass. Articulatory features for robust visual speech recognition. In these proceedings, ICMI'04, State College, Pennsylvania, 2004.
[19] C. Sanderson. The VidTIMIT Database. IDIAP Communication 02-06, Martigny, Switzerland, 2002.
[20] C. Sanderson. Automatic Person Verification Using Speech and Face Information. PhD Thesis, Griffith University, Brisbane, Australia, 2002.
[21] N. Ström, L. Hetherington, T. J. Hazen, E. Sandness, and J. Glass. Acoustic modeling improvements in a segment-based speech recognizer. In Proc. 1999 IEEE ASRU Workshop, Keystone, CO, December 1999.
[22] V. Zue, S. Seneff, and J. Glass. Speech database development: TIMIT and beyond. Speech Communication, vol. 9, no. 4, pp. 351-356, 1990.