Audiovisual
Chalapathy Neti, IBM T. J. Watson Research Center, Yorktown Heights
Gerasimos Potamianos, IBM T. J. Watson Research Center, Yorktown Heights
Juergen Luettin, Institut Dalle Molle d'Intelligence Artificielle Perceptive, Martigny
Iain Matthews, Carnegie Mellon University, Pittsburgh
Herve Glotin, Institut de la Communication Parlée, Grenoble; and Institut Dalle Molle d'Intelligence Artificielle Perceptive, Martigny
Dimitra Vergyri, Center for Language and Speech Processing, Baltimore
June Sison, University of California, Santa Cruz
Azad Mashari, University of Toronto, Toronto
Jie Zhou, The Johns Hopkins University, Baltimore
Abstract
We have made significant progress in automatic speech recognition (ASR) for well-defined applications like dictation and medium vocabulary transaction processing tasks in relatively controlled environments. However, for ASR to approach human levels of performance and for speech to become a truly pervasive user interface, we need novel, nontraditional approaches that have the potential of yielding dramatic ASR improvements. Visual speech is one such source for making large improvements in high noise environments with the potential of channel and task independence. It is not affected by the acoustic environment and noise, and it possibly contains the greatest amount of complementary information to the acoustic signal. In this workshop, our goal was to advance the state of the art in ASR by demonstrating the use of visual information in addition to the traditional audio for large vocabulary continuous speech recognition (LVCSR). Starting with an appropriate audio-visual database, collected and provided by IBM, we demonstrated for the first time that LVCSR performance can be improved by the use of visual information in the clean audio case. Specifically, by conducting audio lattice rescoring experiments, we showed a 7% relative word error rate (WER) reduction in that condition. Furthermore, for the harder problem of speech contaminated by "speech babble" noise at 10 dB SNR, we demonstrated that recognition performance can be improved by 27% in relative WER reduction, compared to an equivalent audio-only recognizer matched to the noise environment. We believe that this paves the way to seriously address the challenge of speech recognition in high noise environments and to potentially achieve human levels of performance. In this report, we detail a number of approaches and experiments conducted during the summer workshop in the areas of visual feature extraction, hidden Markov model based visual-only recognition, and audio-visual information fusion. The latter was our main concentration: in the workshop, a number of feature fusion as well as decision fusion techniques for audio-visual ASR were explored and compared.
Contents
1 Introduction
2 Database, Experimental Framework, and Baseline System
    2.1 The Audio-Visual Database
    2.2 Experiment Framework
    2.3 Baseline ASR System Training Using HTK
3.1 Discriminant DCT Based Visual Features
    3.1.1 Face Detection and Mouth Location Estimation
    3.1.2 Region of Interest Extraction
    3.1.3 Stage I: DCT Based Data Compression
    3.1.4 Stage II: Linear Discriminant Data Projection
    3.1.5 Stage III: Maximum Likelihood Data Rotation
    3.1.6 Cascade Algorithm Implementation
    3.1.7 DCT-Feature Visual-Only Recognition Results
3.2 Active Appearance Model Visual Features
    3.2.1 Shape Modeling
    3.2.2 Shape Free Appearance Modeling
    3.2.3 Combined Shape and Appearance Model
    3.2.4 Learning to Fit
    3.2.5 Training Data and Features
    3.2.6 Tracking Results
    3.2.7 AAM-Feature Visual-Only Recognition Results
3.3 Summary
4.1 Visual Clustering
    4.1.1 Viseme Classes
    4.1.2 Visual Context Questions
    4.1.3 Phone Tree Root Node Inspection
    4.1.4 Visual Clustering Experiments
4.2 Visual Model Adaptation
    4.2.1 MLLR Visual-Only HMM Adaptation
    4.2.2 Adaptation Results
4.3 Conclusions
5.1 Feature Fusion
    5.1.1 Concatenative Feature Fusion
    5.1.2 Hierarchical Fusion Using Feature Space Transformations
    5.1.3 Feature Fusion Results
5.2 State Synchronous Decision Fusion
    5.2.1 The Multi-Stream HMM
    5.2.2 Multi-Stream HMM Training
    5.2.3 State Synchronous Fusion Results
5.3 Phone Synchronous Decision Fusion
    5.3.1 The Product HMM
    5.3.2 Product HMM Training
    5.3.3 Phone Synchronous Fusion Results
5.4 Class and Utterance Dependent Stream Exponents
    5.4.1 Class Dependent Exponents: Silence Versus Speech
    5.4.2 Utterance Dependent Stream Exponents
5.5 Utterance Level Discriminative Combination of Audio and Visual Models
    5.5.1 Static Combination
    5.5.2 Dynamic Combination - Phone Dependent Weights
    5.5.3 Optimization Issues
    5.5.4 Experimental Results
5.6 Summary
Acknowledgements
Bibliography
Chapter 1 Introduction
We have made significant progress in automatic speech recognition (ASR) for well-defined applications like dictation and medium vocabulary transaction processing tasks in relatively controlled environments. However, for speech to be a pervasive user interface in the same league as, for example, graphical user interfaces, it is necessary to make ASR far more robust to variations in the environment and channel. Recent studies [55] have shown that ASR performance is far from human performance in a variety of tasks and conditions. Indeed, ASR to date is very sensitive to variations in the channel (desktop microphone, telephone handset, speakerphone, cellular, etc.), environment (non-stationary noise sources such as speech babble, reverberation in closed spaces such as a car, multi-speaker environments, etc.), and style of speech (whispered, Lombard speech, etc.) [24]. At present, the most effective approach for achieving robustness to the environment focuses on obtaining a clean signal through a head-mounted or hand-held directional microphone. However, this is neither tether-free nor hands-free, and it makes speech-based interfaces very unnatural. Moving the speech source away from the microphone can degrade the speech recognition performance due to the contamination of the speech signal by other extraneous sound sources. For example, using monitor microphones for far-field input can severely degrade performance in the presence of noise, whereas using directional desktop microphones constrains the extent of movement of the speaker, thus making the interaction unnatural. The research work in robust ASR in noise may be classified into three broad areas:
- Filtering of the noisy speech prior to classification [50]. In this class of techniques, represented by spectral subtraction, an estimate of the clean speech spectrum is obtained by subtracting an average noise spectrum from the noisy speech [6]. A disadvantage of such techniques is that crucial speech information may be removed during the filtering process.
- Adaptation of the speech models to include the effects of noise [36, 68]. In this class of techniques, speech models are adapted to include the effects of noise in an attempt to obtain models that would have been obtained in matched conditions.
- Use of features that are robust to noise [38, 46, 70]. In this class of techniques, an attempt has been made to incorporate temporal and cross-spectral correlation between speech features modeled after the mammalian auditory processing [38, 70].
These signal-based and model-based techniques to make speech recognition independent of channel and environment have been attempted with limited success [35, 50]. Most of these methods make strict assumptions on the environment characteristics and require a sizable sample of the environment to get small improvements in speech recognition performance. Furthermore, modeling reverberation is a hard problem. In summary, current techniques are not designed to work well in severely degraded conditions. We need novel, nontraditional approaches that use sources of information orthogonal to the acoustic input, which not only significantly improve the performance in severely degraded conditions, but are also independent of the type of noise and reverberation.
Visual speech is one such source, obviously not perturbed by the acoustic environment and noise. It is well known that humans have the ability to lipread: we combine audio and visual information in deciding what has been spoken, especially in noisy environments [92]. A dramatic example is the so-called McGurk effect, where a spoken sound /ga/ is superimposed on the video of a person uttering /ba/. Most people perceive the speaker as uttering the sound /da/ [65]. In addition, the visual modality is well known to contain some complementary information to the audio modality [62]. For example, using visual cues to decide whether a person said /ba/ rather than /ga/ can be easier than making the decision based on audio cues, which can sometimes be confusing. On the other hand, deciding between /ka/ and /ga/ is more reliably done from the audio than from the video channel.
The above facts have recently motivated significant interest in the area of audio-visual speech recognition (AVSR), also known as automatic lipreading, or speechreading [45]. Work in this field aims at improving automatic speech recognition by exploring the visual modality of the speaker's mouth region, in addition to the traditional audio modality. Not surprisingly, automatic speechreading has been shown to outperform audio-only ASR over a wide range of conditions [1, 29, 76, 86, 93]. Such performance gains are particularly impressive in noisy environments, where traditional ASR performs poorly. Coupled with the diminishing cost of quality video capturing systems, this fact makes automatic speechreading tractable for achieving robust ASR in certain scenarios [45].
However, to date, all automatic speechreading studies have been limited to small vocabulary tasks and, in most cases, to a very small number of speakers [15, 45]. In addition, the diverse algorithms suggested in the literature for automatic speechreading are very difficult to compare, as they are hardly ever tested on any common audio-visual database. Furthermore, most such databases are of very small duration, which casts doubt on the generalizability of reported results to larger populations and tasks. As a result, to date, no definite answers exist on the two issues that are of paramount importance to the design of speaker independent audio-visual large vocabulary continuous speech recognition (LVCSR) systems: (a) the choice of appropriate visual features that are informative about unconstrained, continuous visual speech; and (b) the design of audio-visual information fusion algorithms that demonstrate significant gains over traditional audio-only LVCSR systems, under all possible audio-visual channel conditions.
In the summer 2000 workshop, our goal was to advance the state of the art in audio-visual ASR by seriously tackling the problem of speaker independent LVCSR for the first time. To achieve this goal, we gathered a team of senior researchers in the area of automatic speechreading with expertise in both visual feature extraction and information fusion [29, 63, 71, 76], assisted by a number of graduate and undergraduate students [39, 97]. In addition, the IBM participants provided a one-of-a-kind audio-visual database appropriate for LVCSR experiments, recently collected at the IBM Thomas J. Watson Research Center [2, 80].
The major concentration of the summer workshop team was on audio-visual fusion strategies; however, visual feature extraction and certain aspects of visual modeling, as well as visual model adaptation, have also been investigated. In more detail, two algorithms for visual feature extraction have been considered by our workshop team. The first technique belongs to the so-called low-level, video pixel based category of visual features [45]. It consists of a cascade of linear transformations of the video pixels representing the speaker's mouth region [80], and it requires successful face and mouth region tracking as a first step [89]. The second technique considered uses a combination of low-level and higher-level, shape based face information [45]. In this approach, both face tracking and feature extraction are based on an active appearance model face representation [19, 30, 63]. High-level shape features have not been considered by themselves in this work, as it is generally agreed that they result in lower speechreading performance [16, 29, 78].
Both feature sets have been used to train hidden Markov model (HMM) based statistical classifiers for recognizing visual-only speech. It is worth mentioning that the use of the visual front end is not limited to automatic speechreading: lip region visual features can readily be used in multimodal biometric systems [33, 49, 100], as well as to detect speech activity and intent to speak [23], among others.
In addition to visual feature extraction, we have investigated various aspects relevant to visual-only HMM training. One important aspect in any HMM based LVCSR system is the clustering of (typically triphone) context dependent units (state or phone models) [82, 103]. Since not all phones are visually distinguishable, but rather they cluster in so-called viseme classes [45, 62], it is of interest to investigate whether clustering on the basis of visemic instead of phonetic context is advantageous. The design of appropriate visemic questions for tree based HMM state clustering has been addressed in the summer workshop. Another visual modeling issue studied was the problem of visual-only HMM adaptation to unseen subjects. Although visual HMM adaptation has been considered before in small vocabulary tasks [79], this constitutes the first time that successful visual-only model adaptation has been demonstrated in the LVCSR domain.
As stated above, the main concentration of our team has been the audio-visual integration problem. As with visual modeling, only HMM based fusion techniques have been considered in the workshop, although alternative statistical classification methods, such as neural networks, can also be used to address both the speech classification and fusion problems [8, 45, 47]. Two simple feature fusion approaches have been tried first.
The first one uses the concatenation of synchronous audio and visual feature vectors as the joint audio-visual feature vector, whereas an improved algorithm uses a hierarchical linear discriminant analysis (HiLDA) technique to discriminatively project the audio-visual feature vector to a lower dimension. Subsequently, a number of decision fusion algorithms have been investigated. Such algorithms combine the class conditional likelihoods of the audio and visual feature vector streams using an appropriate scoring function at various possible stages of integration. The main model investigated in this approach has been the multi-stream HMM. Its class conditional observation likelihood is the product of the observation likelihoods of its audio-only and visual-only stream components, raised to appropriate stream exponents that capture the reliability of each modality. Such a model has been considered in multi-band audio-only ASR, among others [7, 39, 73]. Although it has been extensively used in small-vocabulary audio-visual ASR tasks [28, 29, 48, 76, 86], this work constitutes its first application to the LVCSR domain. Furthermore, to our knowledge, our joint audio-visual multi-stream HMM training by means of maximum likelihood estimation has not been considered before. Notice that the multi-stream HMM corresponds in its simplest form to a state level integration strategy. By considering the likelihood combination at the HMM phone level, we obtain the asynchronous multi-stream composite, or product, HMM [10, 29, 96], also implemented during the workshop.
In both state and phone level integration strategies, the estimation of appropriate HMM stream exponents is of paramount importance to the resulting model performance. We first considered exponents that depend on the modality only and are constant over the entire database. Such exponents were estimated by directly minimizing the word error rate on a held-out data set, since maximum likelihood approaches are inappropriate for training them [76, 103]. Alternative discriminative training techniques can also be used for that task [17, 18, 48, 76]. Motivated by the fact that the audio of various speakers and utterances is characterized by varying signal to noise ratio and thus audio channel reliability, we subsequently refined the stream exponents by making them utterance dependent as well. We used a harmonicity index [4, 39, 105] to estimate the average voicing per utterance, and we estimated exponents based on this index. Finally, a late integration, decision fusion technique has been explored based on rescoring N-best recognition hypotheses using the general framework of multiple knowledge source integration for ASR developed in [97]. Global, viseme-, and phone-dependent audio-visual weights were explored in this approach, all estimated by means of minimum error training on a held-out data set.
In this report, we discuss in detail our summer work. Specifically, in chapter 2, we present the audio-visual database, our general experiment framework, as well as our audio-only baseline system and its training procedure.
In chapter 3, we discuss the two visual feature extraction techniques considered at the workshop, and we present visual-only LVCSR results. In chapter 4, we concentrate on two issues relevant to visual modeling, namely visual-only clustering and visual model adaptation. In chapter 5, we report our work on HMM based audio-visual fusion. We first present two feature fusion algorithms, followed by a number of decision fusion techniques at the state, phone, and utterance level. Finally, in chapter 6, we summarize our most important results, and we discuss plans for future work.
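The multi-stream combination rule described in this chapter (a product of audio and visual likelihoods raised to stream exponents, with exponents chosen by minimizing error on held-out data) can be illustrated with a minimal sketch. It makes several simplifying assumptions that are not taken from the workshop system: scores are combined for isolated frames rather than inside an HMM decoder, the two exponents are constrained to sum to one, and the held-out criterion is a classification error count rather than WER. All function names and the toy scores are hypothetical.

```python
def multistream_log_likelihood(log_audio, log_visual, audio_exponent):
    """Combine audio and visual log-likelihoods for one class.

    A product of likelihoods raised to exponents becomes a weighted sum
    in the log domain. Assumption of this sketch: the two exponents sum
    to one, so only the audio exponent is a free parameter.
    """
    if not 0.0 <= audio_exponent <= 1.0:
        raise ValueError("audio_exponent must lie in [0, 1]")
    visual_exponent = 1.0 - audio_exponent
    return audio_exponent * log_audio + visual_exponent * log_visual


def classify(class_scores, audio_exponent):
    """Return the class with the highest combined score.

    class_scores maps a class label to a (log_audio, log_visual) pair.
    """
    return max(
        class_scores,
        key=lambda c: multistream_log_likelihood(*class_scores[c], audio_exponent),
    )


def tune_audio_exponent(held_out, candidates):
    """Grid search: pick the candidate with the fewest held-out errors.

    held_out is a list of (class_scores, true_label) pairs. Ties are
    resolved in favour of the earliest candidate.
    """
    def errors(exponent):
        return sum(classify(scores, exponent) != label for scores, label in held_out)

    return min(candidates, key=errors)


# Toy held-out data (hypothetical numbers). In the first item the video
# separates /ba/ from /ga/ better than the audio; in the second the audio
# separates /ka/ from /ga/ better than the video.
held_out = [
    ({"ba": (-5.0, -1.0), "ga": (-4.0, -6.0)}, "ba"),
    ({"ka": (-1.0, -3.0), "ga": (-5.0, -2.9)}, "ka"),
]
best_exponent = tune_audio_exponent(held_out, [0.0, 0.5, 1.0])
print("selected audio exponent:", best_exponent)
```

In this toy case neither stream alone classifies both items correctly, while an intermediate exponent does, which is the behaviour the stream exponents are meant to capture.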
Figure 2.1: Example video frames of the IBM ViaVoice™ audio-visual database.
is approximately 50 hours. It is worth mentioning that, to date, this is the largest audio-visual database collected, and it constitutes the only one suitable for the task of continuous, large vocabulary, speaker independent audio-visual speech recognition, as all other existing audio-visual databases are limited to a small number of subjects and/or small vocabulary tasks [1, 8, 13, 15, 45, 64, 66, 67, 75, 93].
In addition to the IBM ViaVoice™ audio-visual database, a much smaller broadcast news dataset was also obtained, both at the IBM Thomas J. Watson Research Center and at the Johns Hopkins University, preceding the workshop. This database contains audio-visual sequences of frontal anchor speech, and it has been digitized from CNN and C-SPAN broadcast news tapes, kindly provided by the Linguistic Data Consortium (LDC). The entire duration of the database is approximately 5 hours, and it has been collected with the intent of performing audio-visual speaker adaptation experiments, using HMMs trained on the ViaVoice™ data. However, the short duration of the summer workshop did not allow us to complete visual feature extraction for this data. We hope to perform such experiments in the future.
Scenario  Set         Utter.  Duration  Subj.
SI, MS    Training    17111   34.9 hrs  239
SI        Held-out     2277    4.8 hrs   25
SI        Adaptation    855    2.1 hrs   26
SI        Test         1038    2.5 hrs   26
MS        Held-out     1944    4.0 hrs  239
MS        Test         1100    2.3 hrs  239
Total                 24325   50.6 hrs  290
Table 2.1: Database partitioning for speaker independent (SI) and multi-speaker (MS) experiments. Number of utterances, duration, and number of subjects are depicted for each set. A single training set is used in both SI and MS scenarios (SI only experiments are reported in this work).
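As a quick consistency check on Table 2.1, the following short sketch recomputes the totals. The figures are copied from the table; the variable names are hypothetical. The only assumption is how the subject total arises: utterances and durations add up over all six sets, whereas subjects are counted once per distinct group (239 training subjects, 25 SI held-out subjects, and 26 SI test subjects, who also provide the adaptation data).

```python
# (scenario, set, utterances, hours, subjects), copied from Table 2.1.
PARTITIONS = [
    ("SI, MS", "Training", 17111, 34.9, 239),
    ("SI", "Held-out", 2277, 4.8, 25),
    ("SI", "Adaptation", 855, 2.1, 26),
    ("SI", "Test", 1038, 2.5, 26),
    ("MS", "Held-out", 1944, 4.0, 239),
    ("MS", "Test", 1100, 2.3, 239),
]

total_utterances = sum(row[2] for row in PARTITIONS)
total_hours = round(sum(row[3] for row in PARTITIONS), 1)

# Distinct subject groups: the MS sets reuse the training subjects, and the
# SI adaptation set reuses the SI test subjects, so neither adds new people.
total_subjects = 239 + 25 + 26

print(total_utterances, total_hours, total_subjects)
```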
testing: A held-out data set of close to 5 hours of data from 25 subjects and a test set of 2.5 hours from 26 subjects. The first is used to train HMM parameters relevant to audio-visual decision fusion (see section 5), while the second is used for testing (evaluation) of the trained models. Of course, all three sets consist of disjoint subjects. Furthermore, an adaptation set is provided to allow speaker adaptation experiments (see section 4.2). This set contains an additional 2 hours of data from the 26 test set subjects. In addition to the above mentioned sets, two more sets are available for multi-speaker (MS) HMM refinement and testing, namely a 4 hour held-out data set and a 2.3 hour test set, both containing data from all 239 training set subjects. The latter were created in case speaker-independent visual models provided poor generalization to unseen subjects. Our results during the initial weeks of the workshop indicated that this was not the case; therefore, only speaker-independent experiments are reported in this report.
To assess the benefits of the visual modality to LVCSR for both clean and noisy audio, two audio conditions have been considered: the original database clean wideband audio, and a degraded one, where the database audio is artificially corrupted by additive "babble" noise at a 10 dB SNR level. Sixty-dimensional acoustic feature vectors are extracted for both conditions at a rate of 100 Hz. These features are obtained by a linear discriminant analysis (LDA) data projection, applied on a concatenation of nine consecutive feature frames, each consisting of a 24-dimensional discrete cosine transform (DCT) of mel-scale filter bank energies. LDA is followed by a maximum likelihood linear transform (MLLT) based data rotation (see section 3.1 for details on these transforms). Cepstral mean subtraction (CMS) and energy normalization [56, 103] are applied to the DCT features at the utterance level, prior to the LDA/MLLT feature projection. It is worth mentioning that, for both clean and noisy audio, the LDA and MLLT matrices are estimated using the training set data in the matched condition. Similarly, all audio-only test set results are reported for HMMs trained on matched audio. For the noisy audio-only system, this is clearly an ideal scenario, which results in improved audio-only performance over systems that use noise compensation techniques when trained on unmatched data.
In addition to the audio features, visual features need to be extracted in order to perform audio-visual speech recognition experiments. As mentioned in the Introduction and discussed in detail in chapter 3, two types of visual features have been considered in this work. The baseline ones consist of a discrete cosine image transform of the subject's mouth region, followed by an LDA projection and an MLLT feature rotation [80]. They have been provided by the IBM participants for the entire database, are of dimension 41, and are synchronous to the audio features at a rate of 100 Hz (see section 3.1). These baseline features are exclusively used in our audio-visual ASR experiments. Alternative visual features based on active appearance models are presented in section 3.2, and preliminary visual-only recognition results are reported there.
Notice that, in contrast to the audio, no noise has been added to the video channel or features. Many such cases of "visual noise" could have been considered, for example additive white noise on the video frames, blurring, frame rate reduction, and extremely high compression factors, among others. Some preliminary studies on the effects of video degradation on speechreading can be found in [22, 78, 101].
Given the training set utterance transcriptions, the corresponding appropriate features, and the pronunciation dictionary, we can train an HMM based ASR system [82, 103].
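Two steps of the audio processing described above lend themselves to a short illustration: corrupting clean audio with additive noise at a prescribed SNR, and concatenating nine consecutive 24-dimensional frames into the 216-dimensional vector that is passed to LDA. The sketch below is a simplified stand-in, not the workshop code: the function names are hypothetical, Gaussian noise replaces real babble noise, the SNR is computed over the whole signal, and edge frames are padded by repetition, none of which is specified in the text.

```python
import math
import random


def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech after scaling the noise to the requested SNR (dB)."""
    if len(noise) < len(speech):
        raise ValueError("noise must be at least as long as speech")
    noise = noise[:len(speech)]
    speech_power = sum(s * s for s in speech) / len(speech)
    noise_power = sum(n * n for n in noise) / len(noise)
    # Choose gain g so that speech_power / (g^2 * noise_power) = 10^(snr_db / 10).
    gain = math.sqrt(speech_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return [s + gain * n for s, n in zip(speech, noise)]


def stack_frames(frames, context=4):
    """Concatenate each frame with `context` neighbours on either side.

    With context=4 this yields nine consecutive frames per output vector.
    Assumption of this sketch: edges are padded by repeating the first or
    last frame.
    """
    last = len(frames) - 1
    stacked = []
    for t in range(len(frames)):
        window = []
        for k in range(t - context, t + context + 1):
            window.extend(frames[min(max(k, 0), last)])
        stacked.append(window)
    return stacked


# Synthetic example: a sinusoid as "speech", Gaussian noise as "babble".
rng = random.Random(0)
speech = [math.sin(0.1 * i) for i in range(1000)]
noise = [rng.gauss(0.0, 1.0) for _ in range(1000)]
noisy = mix_at_snr(speech, noise, snr_db=10.0)

# Twenty dummy 24-dimensional frames; frame t holds the value t in every slot.
frames = [[float(t)] * 24 for t in range(20)]
stacked = stack_frames(frames)
print(len(stacked), len(stacked[0]))
```

An LDA projection estimated on labeled training data would then map each 216-dimensional stacked vector to 60 dimensions; that step is omitted here.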
Because of the large memory and speed requirements of HTK for LVCSR decoding, and in order to allow fast experimentation, we have decided to follow a lattice rescoring based decoding strategy. Namely, using a well trained HMM system, we first generate appropriate ASR lattices offline, which contain the "most probable" decoding paths. Subsequently, we rescore these lattices using various HTK-trained HMM systems of interest based on a number of feature sets, fusion strategies, etc. Baseline HTK systems are trained as discussed in section 2.3. For rescoring, we employ the HTK decoder HVite, which runs efficiently, since the search is constrained by the lattice grammar [103]. Notice that the generated lattices are trigram lattices [82], and that, on every lattice arc, the log-likelihood value of the trigram language model used to generate them has been provided by IBM. During rescoring, the language model weight and the word insertion penalty are roughly optimized by seeking minimum word error rate (WER) on the held-out set. Test set results are reported based on the NIST scoring standard [103].
Lattices  Best   Oracle  Anti-oracle  LM-only  Depth
Lat       14.24   5.53    46.83       29.57     64.7
NLat      45.43  26.81    96.12       58.31    164.5
NAVLat    37.15  16.84   103.69       52.02    271.2
Table 2.2: Word error rate (WER), in %, of the IBM generated lattices on the SI test set. WER for best path, oracle, anti-oracle, and best path based on language model information alone (LM-only) are depicted. Average lattice depth in words per reference transcription length is also shown.
For the summer workshop experiments, we have generated three sets of lattices for all database utterances not belonging to the training set, using the IBM LVCSR recognizer and appropriately trained HMM systems at IBM (cross-word pentaphone systems, with about 50,000 Gaussian mixtures each). The three sets of lattices are:
"Lat": Lattices based on the IBM system with clean audio features.
"NLat": Lattices based on the IBM system with noisy audio features (matched training).
"NAVLat": Lattices based on the IBM system with noisy audio-visual features, using the HiLDA feature fusion technique reported in section 5.1.2.
Table 2.2 depicts the lattice word error rates, as well as other useful lattice information. Lattices "Lat" and "NLat" are rescored by HTK trained systems on clean and noisy audio features, respectively, to provide the baseline clean and noisy audio-only ASR performance. For visual-only recognition experiments, lattices "NLat" are used, because they have the worst accuracy (see Table 2.2). Such experiments are used to investigate the relative performance of the visual features of sections 3.1 and 3.2 and of the various visual modeling and adaptation techniques in chapter 4. The absolute visual-only recognition numbers reported there are clearly meaningless, as they are based on rescoring lattices that contain audio information! Finally, audio-visual fusion experiments are reported by rescoring the "Lat" lattices in the clean audio case. However, the "NAVLat" lattices are used in the noisy audio-visual fusion experiments, because, in this case, performance improves significantly by adding the visual modality (see Table 2.2).
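The rescoring procedure of this section (combine an acoustic score with a weighted language model score and a word insertion penalty, then tune the weight and penalty for minimum held-out WER) can be sketched as follows. This is a deliberately simplified illustration and not a description of HVite: each lattice is replaced by a short explicit list of candidate paths, the score combination formula is the conventional one and is assumed rather than quoted from the report, and all names and numbers are hypothetical.

```python
def word_errors(reference, hypothesis):
    """Levenshtein distance between word lists (substitutions + deletions + insertions)."""
    prev = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def wer(references, hypotheses):
    """Corpus-level word error rate in percent."""
    errors = sum(word_errors(r, h) for r, h in zip(references, hypotheses))
    words = sum(len(r) for r in references)
    return 100.0 * errors / words


def rescore(paths, lm_weight, insertion_penalty):
    """Return the word sequence of the best-scoring path.

    Each path is (words, acoustic_log_likelihood, lm_log_probability).
    Assumed score: acoustic + lm_weight * lm + insertion_penalty * word count.
    """
    def total(path):
        words, acoustic, lm = path
        return acoustic + lm_weight * lm + insertion_penalty * len(words)

    return max(paths, key=total)[0]


def tune(held_out, lm_weights, penalties):
    """Grid search for the (WER, lm_weight, penalty) triple with the lowest held-out WER.

    held_out is a list of (reference_words, paths) pairs.
    """
    references = [ref for ref, _ in held_out]
    best = None
    for weight in lm_weights:
        for penalty in penalties:
            hyps = [rescore(paths, weight, penalty) for _, paths in held_out]
            score = wer(references, hyps)
            if best is None or score < best[0]:
                best = (score, weight, penalty)
    return best


# One toy held-out utterance with two candidate paths (hypothetical scores).
paths = [
    (["the", "cat"], -10.0, -2.0),
    (["the", "cat", "sat"], -9.5, -4.0),
]
held_out = [(["the", "cat"], paths)]
best_wer, best_weight, best_penalty = tune(held_out, [0.0, 1.0], [0.0])
print(best_wer, best_weight, best_penalty)
```

The oracle and anti-oracle figures of Table 2.2 correspond, in this simplified picture, to picking the candidate path with the fewest and the most word errors, respectively.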
[Figure: flowchart of the baseline HMM training procedure. Recovered box text:]
Training data.
Initialise monophone HMMs with single Gaussian densities. Set all means and variances to the global mean and variance of the training data. 41 phonemes + silence model, each with 3 states.
Pick the first pronunciation for every word in the training transcription.
Create a one-state "short pause model". Add skip states to the "silence model". Tie the "short pause model" to the middle state of the "silence model". Perform 2 training iterations of embedded reestimation.
Triphone transcriptions.
Iterate (N = 2, 4, 8, 12): Increase the number of mixtures to N. Perform 2 iterations of embedded reestimation.
Condition     Lattices  HTK    IBM
Clean audio   Lat       14.44  14.24
Noisy audio   NLat      48.10  45.43
Table 2.3: HTK baseline audio-only WER, in %, obtained by rescoring the IBM generated lattices on the SI test set. Performance of the IBM system lattices is also depicted.
and variances of the training data. Monophones are trained by embedded reestimation using the first pronunciation variant in the pronunciation dictionary. A short pause model ("sp") is subsequently added and tied to the center state of the silence model ("sil"), followed by another 2 iterations of embedded reestimation. Forced alignment is then performed to find the optimal pronunciation in case of multiple pronunciation variants in the dictionary. The resulting transcriptions are used from then on in all further training steps. Another 2 iterations of embedded reestimation lead to the trained monophone models. Context dependent phone models are obtained by first cloning the monophone models into context dependent phone models, followed by 2 training iterations using triphone based transcriptions. Decision tree based clustering is then performed to cluster phonemes with similar context and to obtain a smaller set of context dependent phonemes. This is followed by 2 training iterations. Finally, Gaussian mixture models are obtained by iteratively increasing the number of mixtures to 2, 4, 8, and 12, and by performing two training iterations after each increase. The training procedure has been the same for all parameter sets, whether audio-only, visual-only, or audio-visual. The resulting baseline clean and noisy audio-only system performance, obtained by rescoring lattices "Lat" and "NLat", respectively, was 14.44% and 48.10% WER (see also Table 2.3). These numbers are quite close to the ones obtained by the IBM system; therefore, our goal of obtaining comparable baseline performance between the IBM and HTK systems has been achieved.
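Two steps of the training recipe above can be made concrete with a small sketch: the flat-start initialisation (every state of every monophone model receives the global mean and variance of the training data) and the stepwise increase of the number of mixture components to 2, 4, 8, and 12. The data structures and function names are hypothetical and this is not HTK code. In particular, the splitting rule used here (split the heaviest component, halve its weight, and move the two copies apart by 0.2 standard deviations) is an assumed, commonly used heuristic; the report does not state which rule was applied. Embedded reestimation is omitted.

```python
import math


def flat_start(features, num_models=42, states_per_model=3):
    """Create single-Gaussian states that all share the global mean and variance.

    42 models correspond to 41 phonemes plus silence, each with 3 states.
    """
    n, dim = len(features), len(features[0])
    mean = [sum(f[d] for f in features) / n for d in range(dim)]
    var = [sum((f[d] - mean[d]) ** 2 for f in features) / n for d in range(dim)]

    def new_state():
        # Each state gets its own copies so that later updates stay independent.
        return {"weights": [1.0], "means": [list(mean)], "variances": [list(var)]}

    return [[new_state() for _ in range(states_per_model)] for _ in range(num_models)]


def split_heaviest(state, perturbation=0.2):
    """Split the component with the largest weight into two (assumed heuristic)."""
    weights = state["weights"]
    k = max(range(len(weights)), key=weights.__getitem__)
    half = weights[k] / 2.0
    mean, var = state["means"][k], state["variances"][k]
    offset = [perturbation * math.sqrt(v) for v in var]
    weights[k] = half
    weights.append(half)
    state["means"][k] = [m - o for m, o in zip(mean, offset)]
    state["means"].append([m + o for m, o in zip(mean, offset)])
    state["variances"].append(list(var))


def increase_mixtures(state, target):
    """Split components one at a time until the state has `target` of them."""
    while len(state["weights"]) < target:
        split_heaviest(state)


# Toy two-dimensional training data (hypothetical values).
models = flat_start([[0.0, 2.0], [2.0, 4.0]])
state = models[0][0]
for target in (2, 4, 8, 12):
    increase_mixtures(state, target)
    # Two iterations of embedded reestimation would follow each increase.
print(len(models), len(models[0]), len(state["weights"]))
```

Splitting one component at a time, rather than doubling all components, is what allows the final step from 8 to 12 mixtures.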
in order to improve visual-only discrimination among the speech classes of interest, or to provide better maximum likelihood modeling of the visual data. The techniques of this kind considered in this work are linear discriminant analysis (LDA) [83], as well as a maximum likelihood linear transformation (MLLT) of the data, which is aimed at optimizing the observed data likelihood under the assumption of a class conditional multivariate normal distribution with diagonal covariance [42]. For visual speech extraction, LDA has been used as a stand-alone visual front end in [27, 77], and as the second and final visual front end stage following the application of PCA in [2, 100]. The visual front end used in the workshop is a cascade of a DCT of the mouth ROI, followed by LDA and MLLT, as in [80]. The three stages of this visual front end are described in the following section. Note that both LDA and MLLT are general pattern recognition and modeling techniques, and, as such, they have also been used in the AAM feature visual-only recognition experiments (see section 3.2.7), as well as in our audio-visual feature fusion work (section 5.1.2).
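The first stage of this cascade can be sketched in a few lines: a two-dimensional DCT of the mouth ROI, followed by retention of the coefficients that carry the most energy over a set of training frames. The sketch below is an illustration under stated assumptions, not the workshop implementation: the function names are hypothetical, an orthonormal DCT-II is assumed, energy is summed over the training frames without any mean removal, and a tiny 4 x 4 ROI is used in place of the 64 x 64 ROI and 24 retained coefficients of the actual front end. The LDA and MLLT stages, which require labeled training data, are not shown.

```python
import math


def dct_1d(signal):
    """Orthonormal DCT-II of a sequence."""
    n = len(signal)
    out = []
    for k in range(n):
        scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
        out.append(scale * sum(
            x * math.cos(math.pi * (i + 0.5) * k / n) for i, x in enumerate(signal)
        ))
    return out


def dct_2d(image):
    """Separable 2-D DCT (rows, then columns), flattened in row-major order."""
    rows = [dct_1d(row) for row in image]
    cols = [dct_1d(col) for col in zip(*rows)]
    # cols[c][r] holds the coefficient with vertical index r, horizontal index c.
    return [cols[c][r] for r in range(len(image)) for c in range(len(image[0]))]


def select_high_energy(train_images, keep):
    """Indices of the `keep` DCT coefficients with the largest total energy."""
    coeffs = [dct_2d(img) for img in train_images]
    energy = [sum(c[j] ** 2 for c in coeffs) for j in range(len(coeffs[0]))]
    ranked = sorted(range(len(energy)), key=lambda j: energy[j], reverse=True)
    return sorted(ranked[:keep])


def stage_one(image, selected):
    """Compress one ROI to the selected DCT coefficients."""
    coeffs = dct_2d(image)
    return [coeffs[j] for j in selected]


# Toy example: a constant 4 x 4 ROI has all its energy in the DC coefficient.
flat_roi = [[1.0] * 4 for _ in range(4)]
selected = select_high_energy([flat_roi], keep=1)
features = stage_one(flat_roi, selected)
print(selected, features)
```

Because the transform is orthonormal, the total energy of the coefficients equals that of the pixels, so discarding low-energy coefficients discards correspondingly little of the ROI energy.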
The schematic of the algorithm is depicted in Figure 3.1. Implementation details, including some DCT feature post-processing following Stage I, are presented in section 3.1.6. Visual-only recognition experiments are reported in section 3.1.7.

[Figure 3.1 block diagram. Stage I (DCT): ROI vector of dimension $d^{(I)} = MN = 4096$, with $M = N = 64$, reduced to $D^{(I)} = 24$ features. Stage II (LDA): $J = 15$ consecutive feature vectors concatenated, $d^{(II)} = D^{(I)} J = 360$ (see (3.3)), reduced to $D^{(II)} = 41$. Stage III (MLLT): $d^{(III)} = D^{(III)} = 41$.]
Figure 3.1: The DCT based cascade algorithm block diagram of the visual front end used in our audio-visual ASR experiments.

The algorithm requires the use of a highly accurate face and mouth region detection system (e.g., [43, 89]) as its first step. Subsequently, for every video frame $\{V_t(m,n)\}$, at time $t$, the two-dimensional ROI centered around the speaker's mouth center $(m_t, n_t)$ is extracted, as discussed in the following sections. The ROI video pixel values are then placed into the vector¹

¹ Throughout this work, boldface lowercase symbols denote column vectors, and boldface capital symbols denote matrices.
video frames from 8 database subjects, with detected facial features superimposed. Lower row: Corresponding extracted mouth regions of interest.

for $l = 1, \ldots, L$.
such facial features are marked on the frames of Figure 3.2. The search for these features occurs hierarchically. First, a few "high"-level features are located, and, subsequently, the 26 "low"-level features are located relative to the "high"-level feature locations. The feature locations at both stages are determined using a score combination of prior statistics, linear discriminant and DFFS [89]. The algorithm requires a training step to estimate the Fisher discriminant, face space eigenvectors, and prior statistics for face detection and facial feature estimation. Such training uses a number of frames labeled with the faces and their visible features (see also section 3.1.6).
where $\top$ denotes vector or matrix transpose. Then, matrix $\mathbf{P}^{(I)}$ contains as its rows the rows of $\mathbf{B}$ that maximize the transformed data energy

\[ \sum_{d=1}^{D^{(I)}} \sum_{l=1}^{L} \langle \mathbf{x}_l^{(I)} , \mathbf{b}_{j_d} \rangle^2 , \qquad (3.2) \]

where $j_d \in \{1, \ldots, d^{(I)}\}$ are disjoint, and $\langle \cdot , \cdot \rangle$ denotes vector inner product. Obtaining the optimal values of $j_d$, for $d = 1, \ldots, D^{(I)}$, that maximize (3.2) is straightforward. It is important to note that the DCT allows fast implementations [81] when $M$ and $N$ are powers of 2. It is therefore advantageous to choose such values in (3.1).
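The selection implied by (3.2) can be illustrated with a small sketch. It assumes an orthonormal DCT-II basis and tiny square blocks instead of the 64 x 64 ROI, and the function names (`dct_matrix`, `dct2`, `select_high_energy`) are hypothetical helpers introduced only for this illustration, not part of the workshop software. Because each DCT coefficient is an inner product with one row of the transform, ranking coefficients by their energy summed over the training blocks is one straightforward way to pick the rows that maximize (3.2).

```python
import math


def dct_matrix(n):
    """Orthonormal DCT-II matrix of size n x n (each row is one basis vector)."""
    rows = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        rows.append([scale * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                     for i in range(n)])
    return rows


def dct2(block):
    """Separable 2-D DCT of a square block: C * block * C^T."""
    n = len(block)
    c = dct_matrix(n)
    tmp = [[sum(c[k][i] * block[i][j] for i in range(n)) for j in range(n)]
           for k in range(n)]
    return [[sum(tmp[k][j] * c[l][j] for j in range(n)) for l in range(n)]
            for k in range(n)]


def select_high_energy(blocks, num_keep):
    """Indices (k, l) of the num_keep coefficients with the largest energy,
    where energy is the squared coefficient summed over all training blocks."""
    n = len(blocks[0])
    energy = [[0.0] * n for _ in range(n)]
    for block in blocks:
        coeffs = dct2(block)
        for k in range(n):
            for l in range(n):
                energy[k][l] += coeffs[k][l] ** 2
    ranked = sorted(((energy[k][l], (k, l)) for k in range(n) for l in range(n)),
                    reverse=True)
    return [position for _, position in ranked[:num_keep]]
```

The direct matrix form used here is quadratic in the block size and is meant only to make the selection rule explicit; a practical front end would rely on a fast DCT, which is why power-of-2 values of M and N are attractive.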
(3.3)

of length $d^{(II)} = D^{(I)} J$ (see also Figure 3.1). In general, LDA [83] assumes that a set of classes $C$ is given a priori, and that the training set data vectors $\mathbf{x}_l^{(II)}$, $l = 1, \ldots, L$, are labeled as $c(l) \in C$. LDA seeks a projection $\mathbf{P}^{(II)}$, such that the projected training sample $\{ \mathbf{P}^{(II)} \mathbf{x}_l^{(II)} , l = 1, \ldots, L \}$ is "well separated" into the set of classes $C$. Formally, $\mathbf{P}^{(II)}$ maximizes

\[ Q(\mathbf{P}^{(II)}) = \frac{\det( \mathbf{P}^{(II)} \mathbf{S}_B \mathbf{P}^{(II)\top} )}{\det( \mathbf{P}^{(II)} \mathbf{S}_W \mathbf{P}^{(II)\top} )} , \qquad (3.4) \]

where $\det(\cdot)$ denotes matrix determinant. In (3.4), $\mathbf{S}_W$, $\mathbf{S}_B$ denote the within-class scatter and between-class scatter matrices of the training sample. These matrices are given by
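\[ \mathbf{S}_W = \sum_{c \in C} \Pr(c) \, \boldsymbol{\Sigma}^{(c)} \quad \text{and} \quad \mathbf{S}_B = \sum_{c \in C} \Pr(c) \, ( \mathbf{m}^{(c)} - \mathbf{m} ) ( \mathbf{m}^{(c)} - \mathbf{m} )^\top , \qquad (3.5) \]

respectively. In (3.5), $\Pr(c) = L_c / L$, $c \in C$, is the class empirical probability mass function, where $L_c = \sum_{l=1}^{L} \delta_{c(l)}^{c}$, and $\delta_i^j = 1$, if $i = j$; 0, otherwise. In addition, each class sample mean is $\mathbf{m}^{(c)} = [ m_1^{(c)}, \ldots, m_{d^{(II)}}^{(c)} ]^\top$, where

\[ m_d^{(c)} = \frac{1}{L_c} \sum_{l=1}^{L} \delta_{c(l)}^{c} \, x_{l,d}^{(II)} , \]

and each class sample covariance $\boldsymbol{\Sigma}^{(c)}$ has elements

\[ \sigma_{d,d'}^{(c)} = \frac{1}{L_c} \sum_{l=1}^{L} \delta_{c(l)}^{c} \, ( x_{l,d}^{(II)} - m_d^{(c)} ) ( x_{l,d'}^{(II)} - m_{d'}^{(c)} ) , \quad \text{for } d, d' = 1, \ldots, d^{(II)} . \]

Finally, $\mathbf{m} = \sum_{c \in C} \Pr(c) \, \mathbf{m}^{(c)}$ denotes the total sample mean. To maximize (3.4), we subsequently compute the generalized eigenvalues and right eigenvectors of the matrix pair $(\mathbf{S}_B, \mathbf{S}_W)$ that satisfy $\mathbf{S}_B \mathbf{F} = \mathbf{S}_W \mathbf{F} \mathbf{D}$ [41, 83]. Matrix $\mathbf{F} = [ \mathbf{f}_1, \ldots, \mathbf{f}_{d^{(II)}} ]$ has as columns the generalized eigenvectors. Let the $D^{(II)}$ largest eigenvalues be located at the $j_1, \ldots, j_{D^{(II)}}$ diagonal positions of $\mathbf{D}$. Then, given data vector $\mathbf{x}_t^{(II)}$, we extract its feature vector of length $D^{(II)}$ as $\mathbf{y}_t^{(II)} = \mathbf{P}^{(II)} \mathbf{x}_t^{(II)}$, where $\mathbf{P}^{(II)} = [ \mathbf{f}_{j_1}, \ldots, \mathbf{f}_{j_{D^{(II)}}} ]^\top$. Vectors $\mathbf{f}_{j_d}$, for $d = 1, \ldots, D^{(II)}$, are the linear discriminant "eigensequences" that correspond to the directions where the data vector projection yields high discrimination among the classes of interest.

We should note that the rank of $\mathbf{S}_B$ is at most $|C| - 1$, hence we consider $D^{(II)} \le |C| - 1$. In addition, the rank of $\mathbf{S}_W$ cannot exceed $L - |C|$, therefore insufficient training data is a potential problem. In our case, however, first, the input data dimensionality is significantly reduced by using Stage I of the proposed algorithm, and, second, the available training data are of the order $L = O(10^6)$. Therefore, in our experiments, $L - |C| \gg d^{(II)}$ (see also section 3.1.6).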
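As an illustration of the criterion above, the following sketch treats the simplest possible case: two classes in two dimensions. With $|C| = 2$ the rank of $\mathbf{S}_B$ is 1, so there is a single discriminant direction, and it is proportional to $\mathbf{S}_W^{-1} ( \mathbf{m}^{(1)} - \mathbf{m}^{(0)} )$. This closed form is a standard special case and is not the general eigen-solver used in the workshop; the function names are hypothetical.

```python
import math


def mean(vectors):
    """Sample mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]


def scatter(vectors, center):
    """Sum of outer products (x - center)(x - center)^T for 2-D vectors."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for v in vectors:
        diff = [v[0] - center[0], v[1] - center[1]]
        for a in range(2):
            for b in range(2):
                s[a][b] += diff[a] * diff[b]
    return s


def fisher_direction(class0, class1):
    """Unit-norm discriminant direction for two classes of 2-D points."""
    m0, m1 = mean(class0), mean(class1)
    total = len(class0) + len(class1)
    s0, s1 = scatter(class0, m0), scatter(class1, m1)
    # S_W = sum_c Pr(c) * Sigma_c, which equals the pooled scatter divided by L.
    sw = [[(s0[a][b] + s1[a][b]) / total for b in range(2)] for a in range(2)]
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
    inv = [[sw[1][1] / det, -sw[0][1] / det],
           [-sw[1][0] / det, sw[0][0] / det]]
    diff = [m1[0] - m0[0], m1[1] - m0[1]]
    w = [inv[0][0] * diff[0] + inv[0][1] * diff[1],
         inv[1][0] * diff[0] + inv[1][1] * diff[1]]
    norm = math.hypot(w[0], w[1])
    return [w[0] / norm, w[1] / norm]


def project(direction, point):
    """One-dimensional LDA feature of a point."""
    return direction[0] * point[0] + direction[1] * point[1]
```

In the toy data used to check this sketch, the classes differ only along the second coordinate while most of the variance lies along the first, so the discriminant direction is close to the second axis rather than the axis of largest variance. This is the property that distinguishes LDA from a PCA projection.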
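$D^{(III)} = d^{(III)}$, and it is derived as $\mathbf{y}_t^{(III)} = \mathbf{P}^{(III)} \mathbf{x}_t^{(III)}$. MLLT considers the observation data likelihood in the original feature space, under the assumption of diagonal data covariance in the transformed space. The desired matrix $\mathbf{P}^{(III)}$ is obtained by maximizing the original data likelihood, namely [42]

\[ \mathbf{P}^{(III)} = \arg\max_{\mathbf{P}} \Big\{ \det(\mathbf{P})^{L} \prod_{c \in C} \big( \det( \mathrm{diag}( \mathbf{P} \boldsymbol{\Sigma}^{(c)} \mathbf{P}^\top ) ) \big)^{-L_c / 2} \Big\} , \]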
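where $\mathrm{diag}(\cdot)$ denotes matrix diagonal. Differentiating the logarithm of the objective function with respect to $\mathbf{P}$ and setting it to zero, we obtain [42]

\[ \sum_{c \in C} L_c \, \big( \mathrm{diag}( \mathbf{P} \boldsymbol{\Sigma}^{(c)} \mathbf{P}^\top ) \big)^{-1} \mathbf{P} \boldsymbol{\Sigma}^{(c)} = L \, ( \mathbf{P}^\top )^{-1} . \]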
Condition                    WER (%)
Visual-only, with LM          51.08
LM-only (no features)         58.31
Visual-only, with no LM       61.06
Random lattice path           78.14
Noisy audio-only              48.10

Table 3.1: "NLat" lattice rescoring results in WER (%), obtained with or without the use of visual-only trained HMM scores and language model (LM) scores. The baseline noisy audio-only performance is also depicted.
Recognition results are reported in Table 3.1. Recall that lattices "NLat" were obtained using noisy audio-only HMMs (section 2.2), therefore the absolute visual-only recognition results reported here are meaningless. Instead, these experiments were carried out to demonstrate that DCT features do provide useful speech information, and, in addition, to allow a preliminary comparison to the AAM features presented next. Indeed, as depicted in Table 3.1, the visual-only WER of 51.08% is significantly lower than the 58.31% WER of the best path through the "NLat" lattices using the language model information alone. Similarly, if we do not use any lattice language model information, the visual-only WER becomes 61.06%, which is much lower than the 78.14% WER of the random path through the "NLat" lattices, obtained when no HMM or language model scores are used. Clearly, therefore, the DCT visual features do provide useful speech information.
It is useful to introduce the concept of primary and secondary landmarks when manually labeling data. A primary landmark (shown in red) is one that should correspond to an easily identifiable image feature, such as the mouth corner. The secondary landmarks (shown in green) are equally spaced between primary landmarks to describe the shape. In this implementation, all landmarks are hand located, and all secondary landmarks are smoothed spatially along a spline. These can then be edited to accurately describe the shape and minimize the introduction of variance due to point mislocation along each curve. The notion of primary and secondary landmarks exists only to aid the labeling process. For all video data processing, shape is described simply by the $(x, y)$ coordinates of all the landmark points. Any shape, $\mathbf{s}$, is represented in two dimensions by the $2N$-dimensional vector of $N$ concatenated coordinates
\[ \mathbf{s} = \bar{\mathbf{s}} + \mathbf{P} \mathbf{b} , \]

where $\bar{\mathbf{s}}$ is the mean aligned shape, $\mathbf{P} = [ \mathbf{p}_1, \mathbf{p}_2, \ldots, \mathbf{p}_{2N} ]$ is the matrix of eigenvectors, and $\mathbf{b}$ is the vector of corresponding weights for each eigenvector (the principal components). The eigenvalues, $\lambda_i$, represent the variance accounted for by the corresponding $i$th eigenvector, $\mathbf{p}_i$. These allow sensible limits to be defined for each of the principal components. For example, they may be limited to lie within $\pm 3 \sqrt{\lambda_i}$, to force points in the model to lie within three standard deviations of the mean. If the eigenvectors are sorted in decreasing order according to the size of the corresponding eigenvalue, then the top $t$ eigenvectors can be used to approximate the actual shape. Typically, $t$ is chosen so that the eigenvalues of the top $t$ eigenvectors account for, say, 95% of the total variance. This reduces the dimensionality significantly, allowing valid shapes to be represented in a compact space

\[ \mathbf{s} \approx \bar{\mathbf{s}} + \mathbf{P}_s \mathbf{b}_s , \]

where $\mathbf{P}_s$ is the matrix of $t$ shape eigenvectors $[ \mathbf{p}_{s_1}, \mathbf{p}_{s_2}, \ldots, \mathbf{p}_{s_t} ]$, and $\mathbf{b}_s$ is the $t$-dimensional vector of corresponding weights. Figure 3.4 shows the mean face shape deformed by projecting up to $\pm 3$ standard deviations for the first six modes. This model uses 11 modes to describe 85% of the variance of 4072 labeled images from the IBM ViaVoice™ audio-visual database.

[Figure 3.4 panels: mode 1 to mode 6.]
Figure 3.4: Statistical shape model. Each mode is plotted at $\pm 3$ standard deviations around the mean. These six modes describe 74% of the variance of the training set.

² Hence the use of the term eigen-X, where X is the application of your choice.
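The two rules described above, choosing $t$ from the retained variance and limiting each weight to $\pm 3 \sqrt{\lambda_i}$, can be sketched as follows. The function names and the example eigenvalues used to check the sketch are hypothetical and are not taken from the workshop models.

```python
import math


def num_modes_for_variance(eigenvalues, fraction):
    """Smallest t such that the t largest eigenvalues account for at least
    `fraction` of the total variance."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    running = 0.0
    for t, value in enumerate(ordered, start=1):
        running += value
        if running >= fraction * total:
            return t
    return len(ordered)


def clamp_weights(weights, eigenvalues, num_std=3.0):
    """Limit each weight b_i to the interval +/- num_std * sqrt(lambda_i)."""
    clamped = []
    for b, lam in zip(weights, eigenvalues):
        limit = num_std * math.sqrt(lam)
        clamped.append(max(-limit, min(limit, b)))
    return clamped
```

Clamping keeps a synthesized shape within the range seen in training, at the cost of not being able to represent genuinely unusual shapes.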
\[ \mathbf{a} = [ l_1, l_2, \ldots, l_{NM} ] , \]

where $l_i$ is the $i$th luminance value in the image. The extension to a color image is simply to sample each color attribute for each pixel. For example, an RGB color image can be sampled to give the $3NM$-dimensional appearance vector
[Figure 3.5 panels: labeled image, face region, warped image.]
Figure 3.5: Appearance normalization. The landmark points define the region of interest. They form the input vertices of a Delaunay triangulation for a texture mapping operation. The output vertices are the mean shape.

ities at each triangle boundary, but, in practice, this approximation to an ideal continuous warping function produces reasonable results very quickly. A further post-processing step on the shape-normalized images is to normalize them all to have zero mean and unit variance. This removes the global lighting variation between images. PCA can now be used on the normalized appearances to identify the major modes of variation. Shape-normalized appearance is then approximated using the top $t$ eigenvectors as
\[ \mathbf{a} \approx \bar{\mathbf{a}} + \mathbf{P}_a \mathbf{b}_a , \]

where $\mathbf{P}_a$ is the matrix of $t$ shape-normalized appearance eigenvectors $[ \mathbf{p}_{a_1}, \mathbf{p}_{a_2}, \ldots, \mathbf{p}_{a_t} ]$, and $\mathbf{b}_a$ is the $t$-dimensional vector of corresponding weights. Figure 3.6 shows the mean shape-normalized appearance and projections at $\pm 3$ standard deviations for the first six modes. This model uses 186 modes to describe 85% of the variance of the 4072 labeled training images from the IBM ViaVoice™ audio-visual database.
[Figure 3.6 panels: mode 1 to mode 6; rows: $+3\sigma$, mean, $-3\sigma$.]
Figure 3.6: Shape free appearance. Center row: Mean appearance. Top row: Mean appearance +3 standard deviations ($+3\sigma$). Bottom row: Mean appearance −3 standard deviations ($-3\sigma$). The top six modes describe 41% of the training set variance.

cavity is seen and possibly the teeth and tongue. A third PCA can be used to decorrelate the individual shape and shape-normalized appearance eigenspaces and create a combined shape and appearance model. A combined shape and appearance space can be generated by concatenating the shape and appearance model parameters into a single vector
\[ \mathbf{c} = [ \mathbf{b}_s , \mathbf{b}_a ] . \]
As these models represent $(x, y)$ coordinates and pixel intensity values, respectively, PCA cannot be applied directly on the combined vectors. This is due to the PCA scaling problem [14]. PCA identifies the axes of most variance, so if the data is measured in different units, then scaling differences between them will dominate the analysis, and any correlation between the variables will be lost. This can be compensated for by introducing a weight to normalize the difference between the variance in shape and appearance parameters. The sum of the retained eigenvalues in the shape and appearance PCA calculation is the respective variance described by each model, so the required weight can be calculated using
\[ w = \sqrt{ \frac{ \sum_{i=1}^{t_a} \lambda_{a_i} }{ \sum_{i=1}^{t_s} \lambda_{s_i} } } , \]

where $\lambda_{a_i}$ is the $i$th eigenvalue from the appearance PCA, $\lambda_{s_i}$ is the $i$th eigenvalue from the shape PCA, and $t_s$ and $t_a$ are the number of retained eigenvectors in the shape and appearance PCA, respectively. A weight matrix to be applied to the shape parameters is then simply

\[ \mathbf{W} = w \mathbf{I} , \]
where $\mathbf{I}$ denotes the identity matrix. For all examples in the training set, the labeled landmark points are projected into their shape parameters $\mathbf{b}_s$, and the appearance into appearance parameters $\mathbf{b}_a$. These are concatenated using the variance normalizing weight to form combined shape and appearance vectors

\[ \mathbf{c} = [ \mathbf{W} \mathbf{b}_s , \mathbf{b}_a ] . \]

Then, PCA is used to calculate the combined eigenspace

\[ \mathbf{c} \approx \mathbf{P}_c \mathbf{b}_c , \]

where $\mathbf{P}_c$ is the matrix of $t$ shape and appearance eigenvectors $[ \mathbf{p}_{c_1}, \mathbf{p}_{c_2}, \ldots, \mathbf{p}_{c_t} ]$ and $\mathbf{b}_c$ is the $t$-dimensional vector of corresponding weights. There is no mean vector to add, as both $\mathbf{b}_s$ and $\mathbf{b}_a$ are zero mean. Again, $t$ is chosen so the retained eigenvectors model the desired percentage of variance. As the model is linear, shape and appearance can still be calculated from the combined shape and appearance model parameters
\[ \mathbf{s} \approx \bar{\mathbf{s}} + \mathbf{P}_s \mathbf{W}^{-1} \mathbf{P}_{c_s} \mathbf{b}_c , \]

and

\[ \mathbf{a} \approx \bar{\mathbf{a}} + \mathbf{P}_a \mathbf{P}_{c_a} \mathbf{b}_c , \]

where

\[ \mathbf{P}_c = \begin{bmatrix} \mathbf{P}_{c_s} \\ \mathbf{P}_{c_a} \end{bmatrix} . \]

Figure 3.7 shows the combined shape and appearance projections at $\pm 3$ standard deviations for the first six modes. This model uses 86 modes to describe 95% of the variance of the 4072 IBM ViaVoice™ dataset training images.

[Figure 3.7 panels: mode 1 to mode 6; rows: $+3\sigma$, mean, $-3\sigma$.]
Figure 3.7: Combined shape and appearance. Center row: Mean shape and appearance. Top row: Mean shape and appearance +3 standard deviations. Bottom row: Mean shape and appearance −3 standard deviations. The top 6 modes describe 55% of the combined shape and appearance variance.
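The variance-normalizing weight used when building the combined model can be sketched in a few lines. The eigenvalue lists passed to `variance_weight` stand for the retained shape and appearance eigenvalues; both function names and the numbers used to check the sketch are hypothetical.

```python
import math


def variance_weight(appearance_eigenvalues, shape_eigenvalues):
    """w = sqrt(sum of retained appearance eigenvalues /
                sum of retained shape eigenvalues)."""
    return math.sqrt(sum(appearance_eigenvalues) / sum(shape_eigenvalues))


def combine(shape_params, appearance_params, w):
    """Combined vector [W b_s, b_a] with W = w I."""
    return [w * b for b in shape_params] + list(appearance_params)
```

After this scaling, the shape block and the appearance block of the combined vectors carry the same total variance, so the third PCA is not dominated by whichever quantity happens to have the larger numerical range.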
perturbations from the actual fit of the model to a target image, a linear relationship exists between the difference in the model projection and image and the required updates to the model parameters. A similar approach was also used for model fitting by Sclaroff in [88]. All of the model parameters are grouped into a single vector with the pose values that define a similarity transform for projecting the model into the image
where $t_x$ and $t_y$ are translations in the $x$ and $y$ coordinates, respectively, $\theta$ is rotation, $s$ is scale, $g_o$ and $g_s$ are global appearance offset and scaling terms to model changes in lighting conditions, and $b_{c_i}$ is the $i$th combined shape and appearance model parameter. If the linear assumption is valid, then small perturbations in the total model parameter set, denoted by $\delta\mathbf{m}$, have a linear relationship to the difference between the current model projection and the image, denoted by $\delta\mathbf{a} = \mathbf{a}_i - \mathbf{a}$, where $\mathbf{a}$ is the image appearance and $\mathbf{a}_i$ is the current model appearance. Clearly, to remove the effects of shape and pose, this difference must be calculated at some reference shape. The model shape-free appearance is calculated for a specific shape (generally the mean shape), so the image at the current model projection is warped back to the same shape to create the image appearance vector $\mathbf{a}_i$. Given a training set of model perturbations $\delta\mathbf{m}$, and corresponding difference appearances $\delta\mathbf{a}$, the linear fit model

\[ \delta\mathbf{m} = \mathbf{R} \, \delta\mathbf{a} , \]

can be solved for $\mathbf{R}$, using multiple linear regression. The training set can be synthesized to an arbitrary size using random perturbations of the model parameters and recording the resulting difference appearance. The fitting algorithm is then a process of iterative refinement:

• Calculate the current difference image $\delta\mathbf{a}$, and current fit error $E_c = \langle \delta\mathbf{a} , \delta\mathbf{a} \rangle$;
• Calculate the predicted update $\delta\mathbf{m} = \mathbf{R} \, \delta\mathbf{a}$;
• Apply a weighted predicted update $\mathbf{m}_p = \mathbf{m} - k \, \delta\mathbf{m}$, where initially $k = 1.0$;
• Calculate the predicted difference image $\delta\mathbf{a}_p$ and predicted fit error $E_p = \langle \delta\mathbf{a}_p , \delta\mathbf{a}_p \rangle$;
• Iterate for values of $k = 1.0, 0.5, \ldots$, until $E_p < E_c$, or a maximum number of iterations is reached.
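The refinement loop above can be sketched as follows. This is a minimal illustration under stated assumptions: `residual_fn` is a hypothetical callable that returns the difference appearance for a given parameter vector (in a real tracker it would involve warping the image to the reference shape), `regression` holds the rows of the matrix R, the list of step sizes is illustrative, and repeating the whole procedure in an outer loop until no step reduces the error is a common way to run such a search rather than something stated in the text.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def fit(params, residual_fn, regression, step_sizes=(1.0, 0.5, 0.25), max_iters=20):
    """Iteratively refine model parameters using the linear update dm = R da."""
    for _ in range(max_iters):
        delta_a = residual_fn(params)
        error_c = dot(delta_a, delta_a)              # current fit error E_c
        delta_m = [dot(row, delta_a) for row in regression]
        improved = False
        for k in step_sizes:                         # try smaller and smaller steps
            candidate = [m - k * dm for m, dm in zip(params, delta_m)]
            delta_ap = residual_fn(candidate)
            error_p = dot(delta_ap, delta_ap)        # predicted fit error E_p
            if error_p < error_c:
                params = candidate
                improved = True
                break
        if not improved:                             # no step helped: keep parameters
            break
    return params
```

When no step size lowers the error, the parameters are returned unchanged, which produces features that stay constant across frames whenever the search stalls.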
model the entire face region. This represents a significant amount of labor, as each image can take several minutes to label. However, in total, this covers only 2 min 13 s out of the approximately 50 hours of the full database. Some example labeled images are shown in Figure 3.8. This training data was used to build a point distribution model retaining 85% of the total shape variance, giving 11 modes of variation (see Figure 3.4). A shape-free appearance model was calculated using the mean shape as the reference shape, but scaled to contain 6000 pixels. This model required 186 modes to describe 85% of the shape-free appearance variance (see also Figure 3.6). These were combined to form the combined shape and appearance model by taking the 86 modes that described 95% of the concatenated shape and shape-free appearance model variance (Figure 3.7). Features were extracted by applying the AAM fitting algorithm described in section 3.2.4 and recording the final model parameters. Model pose information (translation, rotation, scaling, and global appearance lighting transformation) was ignored, as it is scene dependent. The 86-dimensional model parameter vectors were then either used directly as features, or further transformed using the methods described in sections 3.1.4 and 3.1.5. Models were also built taking only the "beard" region of the face (the lower jaw and up to the bottom of the nose), or only the lip region. In both cases, poor tracking performance from the less detailed model prevented investigation of lipreading performance.
(b) Example frame from a bad fit.
Figure 3.9: AAM tracking result examples. A well fitted frame is shown in (a) and a poorly fitted frame in (b), alongside the original image.

facial motions. In practice, the tracker was more effective at locating the face region than accurately modeling facial expression. Given the small size of the AAM training set, this is perhaps to be expected.
Feature set                                                          WER (%)
AAM: 86-dim                                                           65.69
AAM: 30-dim                                                           65.66
AAM: 30-dim + Δ + ΔΔ (90-dim)                                         65.90
AAM: 86-dim + LDA (24-dim) + LDA over 15 frames + MLLT (41-dim)       64.00
DCT: 18-dim + Δ + ΔΔ (54-dim)                                         61.80
DCT: 24-dim + LDA over 15 frames + MLLT (41-dim)                      58.14
Noise: 30-dim                                                         61.37

Table 3.2: "NLat" lattice rescoring results on a subset of the SI test set, expressed in WER (%), obtained with visual-only HMMs trained on various visual feature sets.

The rescoring results are summarized in Table 3.2. All results are given in percentage word error rate (WER).⁴ The top row is the result using all 86 AAM features. The second row is the result using only the top 30 of the 86 AAM features. The third row is obtained by appending first and second derivatives (denoted by Δ and ΔΔ, respectively) to these 30. The fourth row is the AAM result obtained after using an LDA feature projection to a 24-dimensional space, followed by the LDA/MLLT projection described in section 3.1. The fifth row is the result using DCT features with their first and second derivatives appended, and the sixth row is the result using the LDA/MLLT transformed DCT features. Finally, the bottom row is the result obtained by training models on 30-dimensional uniform random noise features. It is interesting to note that the only features that give a lower word error rate than the random noise features are the LDA/MLLT transformed DCT features. All of the AAM feature variants performed worse than the random noise features, which are effectively exploiting information in the language model combined with the restricted depth lattices.
3.3 Summary
In this chapter, we presented two visual front ends for automatic speechreading, namely features based on the DCT of an appropriately tracked mouth ROI, discussed in section 3.1, and features based on a joint shape and appearance model of the face ROI, by means of AAMs, presented in section 3.2. Both feature sets can be further transformed by using LDA and MLLT, discussed in section 3.1. Noisy audio lattice rescoring experiments show that using AAM features results in worse recognition performance than simply using uniform random noise as visual features. The AAM features also perform worse than DCT features on the same subset of the ViaVoice™ dataset. Therefore, the DCT based visual feature representation discussed in section 3.1 is exclusively used in all experiments reported in the following chapters.

There are two reasons for the poor AAM performance: modeling errors and tracking errors. The first may be due to a poor choice of model or insufficient training data to generalize the model to the test data. The second may also be due to insufficient training data, as the AAM algorithm also uses this data to learn how to fit. The poor recognition performance is related to the significant number of poorly tracked sequences. The tracking algorithm used does not update model parameters if no better fit is found between successive images. This introduces sections where the features remain constant over many frames. As a direct transformation of the image, the DCT method always gives a dynamic feature, even if the face tracking has failed on a given frame. Given the small amount of labeled AAM training data, it may not be surprising that the resulting model is unable to capture all facial changes during speech. Only snapshots of speech were modeled, and this does not appear to be enough to generalize to continuous visual speech. Note also that the tedious task of hand labeling training data images is a significant drawback of the AAM approach.

⁴ As mentioned in section 3.1.7, these results cannot be interpreted as visual-only recognition, due to the rescoring of the noisy audio-only lattices.
Silence                      sil, sp
Lip-rounding based vowels    ao, ah, aa, er, oy, aw, hh
                             uw, uh, ow
                             ae, eh, ey, ay
                             ih, iy, ax
Alveolar-semivowels          l, el, r, y
Alveolar-fricatives          s, z
Alveolar                     t, d, n, en
Palato-alveolar              sh, zh, ch, jh
Bilabial                     p, b, m
Dental                       th, dh
Labio-dental                 f, v
Velar                        ng, k, g, w

Table 4.1: The 13 visemes considered in this work.

acoustic similarity. Such a question is referred to as a context question. In the workshop, we investigated the design of context questions that are based on visual similarity. Specifically, we first defined thirteen visemes, i.e., visually similar phone groupings (see section 4.1.1). Visual context questions based on these visemes were subsequently developed to guide binary tree partitioning during triphone state clustering (section 4.1.2). The resulting phone trees were inspected in order to observe the importance of visual context questions and possibly reveal similar linguistic contextual behavior between phones that belong to the same viseme (section 4.1.3). Finally, visual-only HMMs were trained based on the resulting context trees, and they were compared to ones trained using decision tree clustering on the basis of acoustic phonetic questions only (section 4.1.4).
Figure 4.1: Decision tree based HMM state clustering (Figure 10.3 of [103]).
visual context questions needed for decision tree based triphone state clustering, as described next.
Figure 4.2: Decision tree root questions for the three emitting states (states 2, 3, and 4) of the HMMs for phones p, b, and m, that make up the bilabial viseme.
Dec. tree    WER (%)
"AA"          51.24
"VA"          51.08
"VV"          51.16

Table 4.2: Visual-only HMM recognition performance based on three different decision trees.

at their root node, 74 had audio context root node partitions, and the remaining 16 had root node partitions obtained by single phone context questions. Clearly, visual context questions played an important role in the decision tree based triphone state clustering. Further inspection of the decision trees, however, did not reveal similar linguistic contextual behavior between phone trees within the same viseme class. Rather, the results appeared unbalanced, and any pattern seemed to be an artifact of the specific data corpus and not driven by linguistic rules (see also Figure 4.2).
Figure 4.3: Absolute recognition performance difference between the "VA" and "VV" clustered visual-only HMMs, expressed in WER (%), for each of the 26 test subjects. Positive values indicate subjects where the "VV" system is superior.

The three visual-only context dependent HMMs were trained based on clustering by means of three different decision trees. These trees were obtained using various front ends (features) and questions, and are denoted as follows:

AA: Uses audio features and "audio" questions (i.e., the decision tree is identical to the one used for audio-only HMM training);
VA: Uses visual features, but "audio" questions;
VV: Uses visual features and "visual" questions.
The performance of the resulting visual-only HMMs trained using the "AA", "VA", and "VV" decision trees is depicted in Table 4.2, expressed in WER (%). Clearly, there was no significant difference in the performance of the three models. The "AA" based system performed somewhat worse than the other two models, whereas, surprisingly, the "VA" system was the best. We further investigated the "VA" and "VV" system differences on a per subject basis, for each of the 26 subjects of the SI test set. The results are depicted in Figure 4.3. Notice that, although there were no significant overall differences resulting from incorporating the new set of questions in decision tree design, for particular individuals, absolute WER differences were almost as great as 3%. It is worth mentioning that visual-only recognition results by both the "VA" and "VV" systems followed the noisy audio-only HMM recognition performance, per subject. This was an artifact of the "NLat" lattice rescoring experiments, which severely restricted decoding (see also sections 3.1.7 and 3.2.7).

We also performed an analysis of how many times each question was used in the "VV" decision trees. This revealed that the 76 introduced viseme based questions were used quite frequently: Within the top 20 questions used in the "VV" tree, 11 were viseme based, thus affecting the relative frequency with which the traditional audio questions were used. Some such audio questions that did not rank high in the "VA" decision tree were used further up the trees in the "VV" based clustering. It is also worth noticing that all three decision trees ("AA", "VA", and "VV") formed approximately 7000 clusters. However, the set of visually distinguishable classes (visemes) is much smaller than the number of phones, thus we considered it of interest to investigate smaller "VV" decision tree sizes. We constructed such a decision tree of the "VV" type with about 2500 clusters, by increasing the minimum likelihood gain threshold to 900. However, this resulted in some performance degradation of the corresponding visual-only HMM.

These results indicate that viseme based context questions for decision tree based clustering do not appear to improve system performance. We view these experiments, however, as only a first investigation of visual model clustering. Further work is merited in this area, including full decoding experiments, improvements in the decision tree clustering algorithm, and a possible redesign of visual context questions.
task in traditional audio-only ASR [37, 54, 72]. Such common algorithms include maximum likelihood linear regression (MLLR) [54], maximum-a-posteriori (MAP) adaptation [37], and methods that combine both [72]. The first is especially useful when the adaptation data is of very small duration (rapid adaptation). In contrast to audio-only HMM adaptation, in the visual-only and audio-visual ASR domains, speaker adaptation has only been considered for small vocabulary tasks [79]. In this section, we investigate the use of MLLR for visual-only adaptation in the LVCSR domain.
\[ \mathbf{m}_{cj}^{(SA)} = \mathbf{W}_p \, [ 1 , \mathbf{m}_{cj}^\top ]^\top , \quad \text{where } (c, j) \in p , \qquad (4.1) \]

to maximize the adaptation data likelihood. In (4.1), matrices $\mathbf{W}_p$ are of size $D \times (D + 1)$, where $D$ is the mean vector dimension. Hence, MLLR also adds a bias term to the SI Gaussian means [54]. To avoid overtraining, matrices $\mathbf{W}_p$ are often block-diagonal.
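The mean update in (4.1) amounts to one matrix-vector product on the extended mean vector. The sketch below assumes the transform has already been estimated and only shows how it is applied; `mllr_adapt_mean` and the toy matrices used to check it are hypothetical.

```python
def mllr_adapt_mean(w_matrix, si_mean):
    """Apply a D x (D+1) MLLR transform to a speaker-independent mean:
    m_SA = W [1, m_SI]^T, where the first column of W acts as a bias."""
    extended = [1.0] + list(si_mean)
    if any(len(row) != len(extended) for row in w_matrix):
        raise ValueError("each row of W must have D + 1 entries")
    return [sum(w * x for w, x in zip(row, extended)) for row in w_matrix]
```

A transform whose first column is zero and whose remaining block is the identity leaves the mean unchanged, which is a convenient starting point before estimation.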
Subject    SI       SA
AXK        44.05    41.92
BAE        36.81    36.17
CNM        84.73    83.89
DJF        71.96    71.15
JFM        61.41    59.23
JXC        62.28    60.48
LCY        31.23    29.32
MBG        83.73    83.56
MDP        30.16    29.89
RTG        57.44    55.73

Table 4.3: Visual-only HMM adaptation experiments using MLLR: Speaker-independent (SI) and speaker-adapted (SA) visual-only HMM performance is reported in WER (%), per subject, obtained by rescoring "NLat" lattices.

It is clear from Table 4.3 that adaptation consistently improved visual-only HMM performance for all subjects. For several individual subjects (e.g., AXK, JXC, JFM, LCY, and RTG), we actually observed significant improvements. These results could likely be further improved by using multiple block-diagonal MLLR transformation matrices, and possibly by applying MAP adaptation, following MLLR [72].
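The per-subject gains in Table 4.3 can be checked with a few lines of arithmetic. The WER values below are copied from the table; the 1.5 point threshold used to single out the larger gains is an arbitrary choice made for this illustration and does not come from the report.

```python
# (SI, SA) visual-only WER in percent, per subject, from Table 4.3.
wer = {
    "AXK": (44.05, 41.92), "BAE": (36.81, 36.17), "CNM": (84.73, 83.89),
    "DJF": (71.96, 71.15), "JFM": (61.41, 59.23), "JXC": (62.28, 60.48),
    "LCY": (31.23, 29.32), "MBG": (83.73, 83.56), "MDP": (30.16, 29.89),
    "RTG": (57.44, 55.73),
}

# Absolute WER reduction obtained by adaptation, per subject.
gains = {subject: si - sa for subject, (si, sa) in wer.items()}

mean_gain = sum(gains.values()) / len(gains)

# Subjects whose absolute gain exceeds the illustrative threshold.
larger_gains = sorted(s for s, g in gains.items() if g > 1.5)
```

Every gain is positive, the mean absolute reduction is about 1.25 WER points, and the subjects above the illustrative threshold are AXK, JFM, JXC, LCY, and RTG, the same five named in the text.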
4.3 Conclusions
As pointed out above, modeling context dependence is a key element of the progress that has been made in audio-based speech recognition. Most of the speech community has converged on using triphone contexts, while others (including IBM) use pentaphone contexts. In both cases, it is essential to discover the most meaningful contexts. This is often done by automatically grouping (using decision trees, for instance) phonetic contexts that are similar along some acoustic dimension. Acoustic similarities, however, are not necessarily the most appropriate for training visual-only systems. So, in this chapter, we explored ways to develop visually meaningful phone groupings based on the place of articulation, and we designed a set of decision tree questions to develop viseme based triphone contexts. Analysis of the resulting decision trees indicated that questions about visually relevant groupings do get used at high levels in the decision trees. However, preliminary experiments using visually clustered HMMs did not show any improvements relative to the baseline acoustically clustered HMM system. This is a somewhat surprising result. However, we believe that more work is needed in this direction before we can draw any conclusions. In particular, we did not adequately optimize the parameters that guide the process of developing decision tree triphone clusters. Also, we used the visual questions as a complement to the acoustic questions. Instead, it may have been more appropriate to use the visual questions by themselves.

In this chapter, we also considered supervised adaptation of visual-only HMMs to new subjects in the LVCSR domain. A simple implementation of the MLLR adaptation algorithm in this domain showed some expected, but small, improvements.
In all cases, estimation of appropriate log-likelihood combination weights is of paramount importance to the resulting model performance. Weight estimation for multi-stream and product HMMs is discussed in section 5.4, and for the discriminative model combination approach in section 5.5. A summary of the best audio-visual fusion results is given in section 5.6.
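[Figure 5.1 block diagram. AUDIO (24 x 9) passes through LDA/MLLT to give 60 audio features; VIDEO (24 x 15) passes through LDA/MLLT to give 41 visual features. AV-concat: the two are concatenated into 101 features. AV-HiLDA: a further LDA/MLLT stage on the concatenated vector gives 60 audio-visual features.]
Figure 5.1: Two types of feature fusion considered in this section: Plain audio-visual feature concatenation (AV-concat) and hierarchical LDA/MLLT feature extraction (AV-HiLDA). Feature vector dimensions are also depicted.

feature vector is then simply the concatenation of the two, namely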
\[ \mathbf{o}_t = [ \mathbf{o}_t^{(A)} , \mathbf{o}_t^{(V)} ] \in \mathbb{R}^{D} , \qquad (5.1) \]

where $D = D_A + D_V$. We model the generation process of a sequence of such features, $\mathbf{O} = [ \mathbf{o}_1, \mathbf{o}_2, \ldots, \mathbf{o}_T ]$, by a traditional, single-stream HMM, with emission (class conditional observation) probabilities given by

\[ \Pr( \mathbf{o}_t \mid c ) = \sum_{j=1}^{J_c} w_{cj} \, N_D( \mathbf{o}_t ; \mathbf{m}_{cj} , \mathbf{s}_{cj} ) . \qquad (5.2) \]

In (5.2), $c \in C$ denote the HMM context dependent states (classes). In addition, mixture weights $w_{cj}$ are positive, adding up to one, $J_c$ denotes the number of mixtures, and $N_d( \mathbf{o} ; \mathbf{m} , \mathbf{s} )$ is the $d$-variate normal distribution with mean $\mathbf{m}$ and a diagonal covariance matrix, its diagonal being denoted by $\mathbf{s}$. As depicted in Figure 5.1, in our experiments, the concatenated audio-visual observation vector (5.1) is of dimension 101. This is rather high, compared to the audio- and visual-only feature sizes, and can cause inadequate modeling in (5.2) due to the curse of dimensionality. To avoid this, we next seek lower dimensional representations of (5.1).
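The emission probability (5.2) is usually evaluated in the log domain to avoid underflow, which becomes more pressing as the feature dimension grows. The following sketch assumes diagonal covariances stored as lists of variances and uses the log-sum-exp trick; the function names are hypothetical and the code is not taken from HTK or the IBM system.

```python
import math


def log_gaussian_diag(obs, mean, var):
    """Log density of a normal distribution with diagonal covariance."""
    return sum(-0.5 * (math.log(2.0 * math.pi * v) + (o - m) ** 2 / v)
               for o, m, v in zip(obs, mean, var))


def log_emission(obs, weights, means, variances):
    """log Pr(o | c) = log sum_j w_j N(o; m_j, s_j), computed via log-sum-exp."""
    terms = [math.log(w) + log_gaussian_diag(obs, m, v)
             for w, m, v in zip(weights, means, variances)]
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))
```

With a diagonal covariance, the log density is a sum over dimensions, so each additional feature dimension adds one more variance to estimate per Gaussian. This is the modeling burden that motivates reducing the 101-dimensional concatenated vector.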
Matrices $\mathbf{P}_{LDA}$ and $\mathbf{P}_{MLLT}$ denote the LDA projection and MLLT rotation matrices. In our experiments, their dimensions are $60 \times 101$ and $60 \times 60$, respectively: We have chosen to obtain a final audio-visual feature vector of the same size as the audio-only one, in order to avoid high-dimensionality modeling problems.
Table 5.1: Audio-visual feature fusion performance on the SI test set using concatenated (AV-concat) and hierarchical LDA (AV-HiLDA) audio-visual features: Clean and noisy audio conditions are considered. Both "NLat" and "NAVLat" lattices are rescored in the noisy audio case. All results are in WER (%).
In the noisy audio case, we first rescored "NLat" lattices, generated by the IBM system on the basis of noisy audio-only observations and a matched-trained HMM. Both feature fusion techniques resulted in substantial gains over the noisy audio-only baseline performance, with HiLDA again being the best method. As discussed in section 2.2, the "NLat" lattices contain audio-only information that, in the noisy audio case, is very unreliable. It is therefore more appropriate to rescore lattices that contain audio-visual information. Such are the "NAVLat" lattices, generated by training an HMM on HiLDA audio-visual features in the noisy audio case. As expected, the results improved significantly. The HiLDA algorithm yielded a 36.99% WER, compared to the baseline noisy audio-only 48.10% WER. This amounts to a 24% relative WER reduction. Notice that "NAVLat" lattice rescoring provides the fair result to report for the HiLDA technique. However, the concatenative feature fusion result is "boosted" by its superior, HiLDA-obtained "NAVLat" lattices. Its actual, free decoding performance is expected to be somewhat worse than the 40.00% WER but better than the 44.97% WER reported in Table 5.1. In the remaining decision fusion experiments, "NAVLat" lattices were exclusively used in the noisy audio case.

It is of course not surprising that HiLDA outperformed plain feature concatenation. In our implementation, concatenated audio-visual features were of dimension 101, which is rather high compared to audio-only and HiLDA features, which were both of dimension 60. HiLDA uses a discriminative feature projection to efficiently "compact" the concatenated audio-visual features. The curse of dimensionality and undertraining are possibly also to blame for the performance degradation compared to the clean audio-only system when plain audio-visual feature concatenation is used.
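$$ \Pr[\, \mathbf{o}_t \mid c \,] = \prod_{s \in \{A, V\}} \Big[ \sum_{j=1}^{J_{sc}} w_{scj} \, \mathcal{N}_{D_s}(\mathbf{o}_t^{(s)} ; \mathbf{m}_{scj}, \mathbf{s}_{scj}) \Big]^{\lambda_{sct}} . \qquad (5.3) $$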
In (5.3), $\lambda_{sct}$ are the stream exponents, which are non-negative and, in general, depend on the modality $s$, the HMM state (class) $c \in \mathcal{C}$, and, locally, on the utterance frame (time) $t$. Such time-dependence can be used to capture the "local" reliability of each stream, and can be estimated on the basis of stream confidences [1,63,85,93], for example, or acoustic signal characteristics [1], an approach which we consider in section 5.4, below. In this section, we consider global, modality-dependent weights, i.e., two stream exponents constant over the entire database
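$$ \lambda_{sct} = \lambda_s , \quad \text{for all } c \in \mathcal{C} \text{ and } t , \qquad (5.4) $$

$$ \lambda_A + \lambda_V = 1 . \qquad (5.5) $$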
Clearly (see (5.3)), and in contrast to feature fusion techniques, the multi-stream HMM assumes class conditional independence of the audio and visual stream observations. This appears to be an unrealistic assumption.
              Clean audio   Noisy audio
Audio-only       14.44         48.10
AV-HiLDA         13.84         36.99
AV-MS-1          14.62         36.61
AV-MS-2          14.92         38.38

Table 5.2: Audio-visual decision fusion performance on the SI test set by means of the multi-stream HMM, separately trained as two single-stream models (AV-MS-2), or jointly trained (AV-MS-1). For reference purposes, audio-only and AV-HiLDA feature fusion WER results are also depicted.
(5.3) assumes that the HMM stream components are state synchronous. The alternative is to train the whole model at once, in order to enforce state synchrony. Because (5.3) combines the stream log-likelihoods linearly, the EM algorithm carries over to the multi-stream HMM case with only minor changes [103]. The only modification is that the state occupation probabilities (or the forced alignment, in the case of Viterbi training) are computed on the basis of the joint audio-visual observations and the current set of multi-stream HMM parameters. Clearly, this approach requires an a priori choice of stream exponents. Such stream exponents cannot be obtained by maximum likelihood estimation [76]. Instead, discriminative training techniques have to be used, such as the generalized probabilistic descent (GPD) algorithm [17,76], or maximum mutual information (MMI) training [18,48]. The simple technique of directly minimizing the WER on a held-out data set can also be used. Clearly, a number of HMM stream parameter and stream exponent training iterations can be alternated. Finally, decoding using the multi-stream HMM does not introduce additional complications, since (5.3) allows a frame-level likelihood computation, as in any typical HMM decoder.
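The frame-level computation and the held-out selection of exponents mentioned above can be sketched as follows. This is a minimal illustration, assuming global exponents that sum to one; `held_out_wer` is a hypothetical caller-supplied function that would decode a held-out set and return its WER, and all numbers are made up.

```python
import numpy as np

def multistream_log_likelihood(log_p_audio, log_p_visual, lambda_audio):
    """Log of eq. (5.3) for global exponents that sum to one (assumption)."""
    if not 0.0 <= lambda_audio <= 1.0:
        raise ValueError("lambda_audio must lie in [0, 1]")
    lambda_visual = 1.0 - lambda_audio
    return lambda_audio * np.asarray(log_p_audio) + lambda_visual * np.asarray(log_p_visual)

def pick_exponent(candidates, held_out_wer):
    """Return the audio exponent with the lowest held-out WER."""
    return min(candidates, key=held_out_wer)

# Toy per-frame stream log-likelihoods for one state (made-up numbers).
log_pa = np.array([-40.0, -42.5, -39.0])
log_pv = np.array([-20.0, -19.0, -25.0])
scores = multistream_log_likelihood(log_pa, log_pv, lambda_audio=0.7)

# Toy WER curve with its minimum at 0.7, standing in for real decoding runs.
grid = [round(0.1 * k, 1) for k in range(11)]
best = pick_exponent(grid, lambda lam: (lam - 0.7) ** 2)
```

A grid search of this kind is the "simple technique" of direct WER minimization; GPD or MMI training would replace `pick_exponent` with a gradient-based update.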
[Figure 5.3 diagram: a three-state audio model (A1, A2, A3) and a three-state video model (V1, V2, V3).]

[Figure 5.4 diagram: composite states A1V1, A1V2, A2V1, A2V2, A2V3, A3V2, and A3V3, grouped by audio streams A1 to A3 and video streams V1 to V3.]

Figure 5.4: Stream tying in a product HMM with limited state asynchrony.

to the multi-stream HMM (see Figure 5.3).
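The composite state set of Figure 5.4 can be enumerated with a few lines of Python. This is a sketch under stated assumptions: asynchrony is limited to one state, each composite state AiVj is read as tying its audio emission to stream Ai and its visual emission to stream Vj, and the left-to-right transition rule in `allowed_transitions` is a hypothetical topology, not necessarily the one used in the workshop.

```python
from itertools import product

def composite_states(n_states=3, max_async=1):
    """Enumerate (audio, visual) state pairs whose indices differ by at most max_async."""
    return [(a, v) for a, v in product(range(1, n_states + 1), repeat=2)
            if abs(a - v) <= max_async]

def state_name(pair):
    """Name a composite state after its tied audio and visual streams."""
    a, v = pair
    return f"A{a}V{v}"

def allowed_transitions(states):
    """Left-to-right moves: each stream either stays or advances by one state."""
    allowed = set(states)
    return {s: [(s[0] + da, s[1] + dv) for da in (0, 1) for dv in (0, 1)
                if (s[0] + da, s[1] + dv) in allowed]
            for s in states}

states = composite_states()
names = [state_name(p) for p in states]
trans = allowed_transitions(states)
```

With three states per stream and one state of allowed asynchrony, this yields the seven composite states shown in the figure, rather than the nine of an unconstrained product.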
              Clean audio   Noisy audio
Audio-only       14.44         48.10
AV-HiLDA         13.84         36.99
AV-MS-1          14.62         36.61
AV-PROD          14.19         35.21

Table 5.3: Audio-visual decision fusion performance on the SI test set by means of the product HMM (AV-PROD). For reference purposes, audio-only, AV-HiLDA feature fusion, and AV-MS-1 decision fusion performance is also depicted. All results are in WER (%).
[Figure 5.5 plots. Left panel: WER (%) for the multi-stream model on clean speech, 16 sentences of the speaker independent development (dvp) test set. Right panel: the same for noisy speech.]

Figure 5.5: Effect of the variation of speech/silence dependent stream exponents on the WER of a 16-utterance subset of the SI held-out set. Audio stream exponents at a resolution of 0.1 have been considered for silence (ordinates) versus speech states (abscissa). Left: Clean audio. Right: Noisy audio.
refinements of stream exponent dependence. First, we consider exponents that depend on the HMM phone class, in addition to the modality. We investigate a very coarse such dependence, namely silence (sil, sp) versus non-silence (phone) state stream exponents. A finer such dependence has been considered in [48], with no definite conclusions. Subsequently, we consider exponents that are utterance dependent. Such exponents are estimated on the basis of the degree of voicing present in the audio signal. Voicing is considered an indication of the reliability of the audio stream, and as such, this approach follows the concept of audio-visual adaptive weights used in [85,86].
[Figure 5.6 plot: histogram of Ri against local SNR in dB.]

Figure 5.6: Histogram of R1R0 (Ri) of low frequency cells ([115, 629] Hz) computed on 128 ms speech windows, and for 60 sentences, versus SNR in [-21, 39] dB and increments of 3 dB, for white additive noise. Notice the clear nonlinear correlation between SNR and R1R0 (after [39]).
tions, for clean and noisy audio, respectively, do lie in the optimal (minimum WER) region. Furthermore, lower WERs are obtained for higher values of the audio silence exponent $\lambda_{A,\mathrm{sil}}$ in both conditions. This suggests that silence is better modeled in the audio stream than by the video observations. Notice, however, that these results have been obtained on a very small number of sentences. At this point, no conclusions can be drawn about whether phone class dependent stream exponents are useful in state synchronous decision fusion by means of the multi-stream HMM. No such experiments have been carried out for the product HMM.
[Figure 5.7 plots; horizontal axis of the top panel: time in bins.]

Figure 5.7: Top: Local estimates of R1R0 for clean (R1R0clean) and noisy (R1R0noisy) speech, and XNR, for a database utterance. All calculations are performed on 128 ms speech windows shifted by 64 ms. Bottom: Noisy audio spectrogram of the same utterance.
Voicing Estimation
We use the autocorrelogram of a demodulated signal as a basis for differentiating between a harmonic signal and noise. The peaks in the autocorrelogram isolate the various harmonics in the signal. The autocorrelogram can also be used to separate a mixture of harmonic noises and a dominant harmonic signal. An interesting property is that such separation can be efficiently accomplished using a time window in the same range as the average phoneme duration [4,39], and in a frequency domain divided into four subbands, leading to the concept of multi-band speech recognition [7]. A correlogram of a noisy cell is less modulated than that of a clean one. We use this fact to estimate the reliability of a cell [40], for which the time and frequency definitions are compatible with the recognition process (128 ms of duration). Before the autocorrelation, we compute the demodulated signal by half-wave rectification, followed by band-pass filtering in the pitch domain ([90, 350] Hz). For each cell, we calculate the ratio Ri = R1/R0, where R1 is the local maximum in the time delay segment corresponding to the fundamental frequency, and R0 is the cell energy. This measure is comparable to the HNR index [105]. Furthermore, it is strongly correlated with SNR in the 5-20 dB range, as demonstrated in Figure 5.6 [4]. In Figure 5.7, we plot R1R0 estimates on 128 ms speech windows of a noisy database utterance against the R1R0 estimates in the clean audio case, as well as an SNR-like measure, defined as XNR = 10 log10 (S / (S + N)). We observe that the biggest difference in R1R0 between the clean and noisy conditions occurs during silent frames. Notice that R1R0 and XNR do not give strictly the same kind of information, but they are quite strongly correlated. Indeed, their correlation factor is 0.84, computed over the entire SI test set. Locally, R1R0 is higher than XNR on voiced parts, and it is lower on other parts. This local divergence could be well exploited if we further refine the stream exponent dependence at the frame level.

[Figure 5.8 plot: HNR-based exponent value against utterance index.]

Figure 5.8: Utterance dependent $\lambda_{At}$ for the first 500 utterances of the SI test set, representing 14 speakers (nearly 40 consecutive utterances for each speaker are considered).
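The R1/R0 voicing measure described in this subsection can be sketched for a single 128 ms window as follows. Several choices here are assumptions made for illustration: the band-pass filter is a crude FFT mask, the subband split into frequency cells is omitted, R0 is taken as the zero-lag autocorrelation of the demodulated signal, and the 8 kHz sampling rate and the function name `r1r0` are hypothetical.

```python
import numpy as np

def r1r0(cell, fs, f_lo=90.0, f_hi=350.0):
    """Ratio of the pitch-range autocorrelation peak (R1) to the energy (R0)."""
    # Demodulation: half-wave rectification, then band-pass in the pitch domain.
    rectified = np.maximum(cell, 0.0)
    spectrum = np.fft.rfft(rectified)
    freqs = np.fft.rfftfreq(len(rectified), d=1.0 / fs)
    spectrum[(freqs < f_lo) | (freqs > f_hi)] = 0.0   # crude FFT-mask filter
    demod = np.fft.irfft(spectrum, n=len(rectified))
    # Autocorrelation at non-negative lags; lag 0 is the energy R0.
    ac = np.correlate(demod, demod, mode="full")[len(demod) - 1:]
    r0 = ac[0]
    if r0 <= 0.0:
        return 0.0
    # R1: maximum over the lags that correspond to pitch periods in [f_lo, f_hi].
    lag_min = int(fs / f_hi)
    lag_max = int(fs / f_lo)
    return float(np.max(ac[lag_min:lag_max + 1]) / r0)

fs = 8000
n = 1024                      # 128 ms at 8 kHz
t = np.arange(n) / fs
voiced = np.sin(2 * np.pi * 125.0 * t) + 0.5 * np.sin(2 * np.pi * 250.0 * t)
noise = np.random.default_rng(0).normal(size=n)
r_voiced = r1r0(voiced, fs)   # harmonic signal: ratio close to one
r_noise = r1r0(noise, fs)     # white noise: weaker pitch-range peak
```

The synthetic harmonic signal gives a ratio near one because its autocorrelation repeats at the 125 Hz pitch period, whereas white noise has no such periodic peak.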
               Clean audio   Noisy audio
Audio-only        14.44         48.10
AV-HiLDA          13.84         36.99
AV-MS-1           14.62         36.61
AV-PROD           14.19         35.21
AV-MS-UTTER       13.47         35.27

Table 5.4: Audio-visual decision fusion performance on the SI test set by means of the multi-stream HMM with utterance level, HNR-estimated stream exponents (AV-MS-UTTER). For reference purposes, audio-only, AV-HiLDA feature fusion, AV-MS-1, and AV-PROD decision fusion performance is also depicted. All results are in WER (%).
exponents $\lambda_{At}$, constant for all $t$ within the utterance, to be the mean of all R1R0 values higher than 0.5. We assume this to be an adequate estimate of voicing within the utterance. Then, $\lambda_{Vt} = 1 - \lambda_{At}$ (see (5.5)). As demonstrated by Figure 5.8, $\lambda_{At}$ is mostly speaker dependent and, to a smaller extent, utterance dependent as well. For the entire SI test data set, the average $\lambda_{At}$ is calculated to be 0.79 and 0.73 for the clean and noisy audio case, respectively.
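The utterance-level rule just described is short enough to sketch directly. The function name and the `fallback` value used when no window exceeds the threshold are assumptions; the text does not say how that case was handled.

```python
import numpy as np

def utterance_exponents(r1r0_values, threshold=0.5, fallback=0.5):
    """Audio exponent = mean of the R1R0 values above the threshold; visual = 1 - audio."""
    values = np.asarray(r1r0_values, dtype=float)
    voiced = values[values > threshold]
    # fallback is a hypothetical choice for utterances with no value above threshold.
    lambda_audio = float(voiced.mean()) if voiced.size else fallback
    return lambda_audio, 1.0 - lambda_audio

# Made-up per-window R1R0 values for one utterance.
lam_a, lam_v = utterance_exponents([0.2, 0.6, 0.9, 0.75, 0.4])
```

For the toy values above, only 0.6, 0.9, and 0.75 pass the threshold, so the audio exponent is their mean, 0.75, and the visual exponent is 0.25.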
5.6
where ZI is a normalization factor so that the probabilities for all possible lattice hypotheses h 2 H add to one. The weights in this formulation are constant for every model.
i=1
5.7
where $h_i$ is the $i$th phone in hypothesis $h$. The weights can be tied for different classes of segments. For example, we can have the same weight for all the consonants and the same for all the vowels, as was examined in [12]. In the case of the visual model, we can examine the case of having one weight for each of the different visemic classes.
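A minimal sketch of model combination with one static weight per model is given below, assuming the usual log-linear form in which weighted log-scores are summed and then normalized over the hypotheses. The choice of three models (for example acoustic, language, and visual), the weights, and all scores are made-up illustrations, and `dmc_posteriors` is a hypothetical name.

```python
import numpy as np

def dmc_posteriors(log_scores, weights):
    """Combine per-model log-scores log-linearly and normalize over hypotheses.

    log_scores: (n_hypotheses, n_models); weights: (n_models,)
    """
    combined = np.asarray(log_scores, dtype=float) @ np.asarray(weights, dtype=float)
    combined = combined - combined.max()   # shift for numerical stability
    unnorm = np.exp(combined)
    return unnorm / unnorm.sum()           # division by the normalization factor

# Three lattice hypotheses scored by three models (made-up numbers).
log_scores = np.array([[-10.0, -4.0, -6.0],
                       [-11.0, -3.5, -5.0],
                       [-9.5, -5.0, -7.5]])
weights = np.array([1.0, 0.8, 0.3])
post = dmc_posteriors(log_scores, weights)
best = int(np.argmax(post))
```

Phone-level or viseme-level weights would replace the single weight per model with a sum of per-segment weighted scores, with weights shared across the tied classes.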
                                                  SI held-out   SI test
Baseline acoustic                                    12.8        13.65
DMC: Static acoustic + visual weights                12.5        13.35
DMC: 1 acoustic + 13 visemic weights                 12.2        13.22
DMC: 43 phonemic-acoustic + 13 visemic weights       11.8        12.95

Table 5.5: Discriminative model combination fusion WER results in the clean audio case.

One global weight is still used for the audio model scores, but we use 13 different weights for visual models, corresponding to each of the 13 visemic classes of Table 4.1. Different weights are used for each of the 43 audio phone-models and each of the 13 visemic classes. The results are depicted in Table 5.5. Significant gains have been obtained in the clean audio case. The DMC technique has outperformed all other decision fusion techniques, albeit with the caveat of a lower audio-only baseline (see also Table 5.4).
5.6 Summary
In this chapter, a number of feature fusion and decision fusion techniques have been applied to the problem of large vocabulary continuous audio-visual speech recognition. Some of these techniques have been tried before in small vocabulary audio-visual ASR tasks, such as concatenative feature fusion, as well as state- and phone-level decision fusion by means of the multi-stream and product HMMs, respectively. However, none of these methods has been applied to the LVCSR domain before. Furthermore, new fusion techniques were introduced in the workshop: the hierarchical LDA feature fusion technique, an HNR-based, utterance dependent, stream exponent estimation algorithm, as well as the composite model joint maximum likelihood training based on bimodal observations. Finally, the discriminative model combination approach has never before been considered for audio-visual ASR.

We have conducted fusion experiments in both clean and noisy audio conditions. In both cases, we were able to obtain significant performance gains over state-of-the-art baseline audio systems by incorporating the visual modality. Thus, we demonstrated for the first time that speaker independent audio-visual ASR in the large vocabulary continuous speech domain is beneficial. A summary of all workshop audio-visual fusion results is depicted in Table 5.6. A novel and simple feature fusion technique, namely the hierarchical LDA approach, gave us significant gains in both audio conditions considered. More complicated decision fusion techniques, by means of the multi-stream HMM with utterance dependent stream exponents, the product HMM, and the discriminative model combination for rescoring n-best hypotheses, resulted in additional gains. Overall, we achieved up to a 7% relative WER reduction in the clean audio case, and a 27% relative WER reduction in the noisy case. It is worth noticing that the nature of lattice rescoring experiments places limits on these improvements. It is worth conducting full decoding experiments with some of the decision fusion techniques considered. Furthermore, it is of interest to consider local stream exponent estimation schemes at the frame level, in conjunction with multi-stream, as well as product, HMMs.

               Clean audio   Noisy audio
Audio-only        14.44         48.10
AV-concat         16.00         40.00
AV-HiLDA          13.84         36.99
AV-MS-1           14.62         36.61
AV-MS-2           14.92         38.38
AV-MS-UTTER       13.47         35.27
AV-PROD           14.19         35.21
AV-DMC            12.95

Table 5.6: Audio-visual feature and decision fusion results in WER (%) on the SI test set in both clean and matched noisy audio conditions.
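The headline relative reductions can be reproduced from the Table 5.6 entries with a one-line formula. The pairing of systems below is an assumption about which rows the summary figures refer to: the audio-only baseline against AV-MS-UTTER for clean audio and against AV-PROD for noisy audio.

```python
def relative_wer_reduction(baseline_wer, system_wer):
    """Relative WER reduction in percent, with respect to the baseline."""
    return 100.0 * (baseline_wer - system_wer) / baseline_wer

clean = relative_wer_reduction(14.44, 13.47)   # audio-only vs. AV-MS-UTTER
noisy = relative_wer_reduction(48.10, 35.21)   # audio-only vs. AV-PROD
```

These evaluate to about 6.7% and 26.8%, which round to the 7% and 27% quoted in the summary.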
Visual speech representation: What portion of the face provides all the visually relevant speech information? A simple, low-level, video pixel based approach representing a rectangular box around the subject's mouth (the baseline in our experiments) appears to take us a long way. However, perceptual [92] and other experiments [53] suggest that more of the face region, including the cheeks and the jaw, carries useful information. In our experiments, we did some preliminary investigation by using representations of the whole face (active appearance models, in section 3.2), with limited success. However, we believe that the results are preliminary and the jury is still out on this thread of experimentation. Also, 3-D aspects of the face during speech production appear to provide additional information, in particular for languages like French. Such 3-D representations could also provide a greater degree of pose-invariance. Thus, 3-D visual speech representations are a potential direction for future pursuit.

Visual modeling: Modeling context dependence is a key element of the progress that has been made in audio-based speech recognition. Most of the speech community has converged on using triphone contexts, while others (including IBM) use pentaphone contexts. In both cases, it is essential to discover the most meaningful contexts. This is often done by automatically discovering (using decision trees, for instance) similar contexts, by grouping together phonetic contexts that are similar along some acoustic dimension. Obviously, acoustic similarities are not the most appropriate for visemes. So, we explored ways to develop visually meaningful groupings based on the place of articulation of phones, and their use in developing triphone contexts that are similar. Our preliminary results did not show any improvements due to visually meaningful modeling (section 4.1.4). However, the investigation is too preliminary to come to any conclusions.

Audio-visual integration: This, we believe, is a wide open area for research, with implications transcending the audio-visual speech recognition problem [71]. In audio-visual speech recognition, the key questions are:

What is the right granularity for combining the decisions between the audio and visual sources of information? A useful source of information that influences the decision is the experimentally observed asynchrony between the two streams [9]. Being the easiest from an implementation point of view, synchronous feature level fusion was the baseline in our experiments. Feature level fusion (synchronous fusion) using discriminant joint representations (HiLDA, see section 5.1.2) bought us most of the gains during the workshop. We experimented with state-level decision fusion. Although this framework does not allow for asynchrony between the audio and visual streams, it does allow for weighting the decisions independently (section 5.2). We did not see any improvements over discriminant feature fusion for clean speech (in fact, it was slightly worse). We partially modeled the asynchrony between the streams by creating HMM topologies that permit asynchrony within a phone (section 5.3). Although this does not adequately address the asynchrony at onset, the approach used in the workshop lays the foundation for more general asynchronous models at the word and utterance level. We did see some additional improvements over feature level fusion by using this approach. However, our ability to investigate this further was limited by what we could implement in HTK in 6 weeks. Carefully modeling the asynchrony between the two streams, by taking into account the sampling rates and the timing of information-bearing events, is an area of research with a lot of potential.

How do you measure the reliability of the audio and visual information sources to weight the influence of the decisions in the combination? Reliability of the audio and visual streams can be obtained by measures of the signal (such as the amount of noise, using SNR), by knowledge-based (perceptual or linguistic) aspects of the two streams, or by data-driven approaches (discriminative training). We pursued two different lines of investigation. The first was based on perceptual and acoustic-phonetic knowledge. We used the fact that voicing is only available in the audio stream to define an utterance level voicing estimator to determine the relative weights. We did see improvements (section 5.4.2). Although we used utterance level weighting schemes, more local (at the frame level or unit level) weighting schemes may be more appropriate. The second approach was a data-driven approach where individual stream weights were estimated at the appropriate unit level (phones for audio and visemes for visual) using a discriminative technique. Small improvements (of the order of 5% relative in clean audio) were observed (section 5.5). A combination of the two approaches may be a fruitful direction.
Acknowledgements
We would like to acknowledge a number of people for contributions to this work: First and foremost, Michael Picheny and David Nahamoo (IBM) for encouragement and support of the proposal of including audio-visual ASR in the summer workshop; Giridharan Iyengar and Andrew Senior (IBM) for their help with face and mouth region detection for the IBM ViaVoice™ audio-visual data; Eric Helmuth (IBM) for his help in data collection; Asela Gunawardana and Murat Saraclar (CLSP, Johns Hopkins University) for their help with the HTK software toolkit. We would like to thank Jie Zhou, Eugenio Culurciello, and Andreas Andreou (Johns Hopkins University) for help with some of the CNN data collection and setup. Further, we would like to thank the Center for Language and Speech Processing staff for their help with arrangements during the workshop. Finally, we would like to thank Frederick Jelinek, Sanjeev Khudanpur and Bill Byrne (CLSP, Johns Hopkins University) for continuing to carry on the tradition of hosting the summer workshops. We believe that the workshop is a unique and beneficial concept, and we hope that funding institutions such as the NSF, DARPA and others will continue supporting it.
Bibliography
[1] A. Adjoudani and C. Benoît. On the integration of auditory and visual parameters in an HMM-based ASR. In Stork and Hennecke [91], pages 461-471.
[2] S. Basu, C. Neti, N. Rajput, A. Senior, L. Subramaniam, and A. Verma. Audio-visual large vocabulary continuous speech recognition in the broadcast domain. In Proc. IEEE 3rd Workshop on Multimedia Signal Processing, pages 475-481, Copenhagen, 1999.
[3] S. Basu, N. Oliver, and A. Pentland. 3D modeling and tracking of human lip motions. In Proc. International Conference on Computer Vision, 1998.
[4] F. Berthommier and H. Glotin. A new SNR-feature mapping for robust multistream speech recognition. In Proc. International Congress on Phonetic Sciences (ICPhS), volume 1, pages 711-715, San Francisco, 1999.
[5] P. Beyerlein. Discriminative model combination. In Proc. International Conference on Acoustics, Speech and Signal Processing, volume 1, pages 481-484, Seattle, 1998.
[6] S. F. Boll. Suppression of acoustic noise in speech using spectral subtraction. IEEE Transactions on Acoustics, Speech, and Signal Processing, 27:113-120, 1979.
[7] H. Bourlard and S. Dupont. A new ASR approach based on independent processing and recombination of partial frequency bands. In Proc. International Conference on Spoken Language Processing, volume 1, pages 426-429, Philadelphia, 1996.
[8] C. Bregler, H. Hild, S. Manke, and A. Waibel. Improving connected letter recognition by lipreading. In Proc. International Conference on Acoustics, Speech and Signal Processing, volume 1, pages 557-560, Minneapolis, 1993.
[9] C. Bregler and Y. Konig. 'Eigenlips' for robust speech recognition. In Proc. International Conference on Acoustics, Speech and Signal Processing, pages 669-672, Adelaide, 1994.
[10] N. Brooke. Talking heads and speech recognizers that can see: The computer processing of visual speech signals. In Stork and Hennecke [91], pages 351-371.
[11] N. M. Brooke and S. D. Scott. PCA image coding schemes and visual speech intelligibility. Proc. Institute of Acoustics, 16(5):123-129, 1994.
[12] W. Byrne, P. Beyerlein, J. M. Huerta, S. Khudanpur, B. Marthi, J. Morgan, N. Peterek, J. Picone, and W. Wang. Towards language independent acoustic modeling. Technical report, Center for Language and Speech Processing, The Johns Hopkins University, Baltimore, 1999.
[13] M. T. Chan, Y. Zhang, and T. S. Huang. Real-time lip tracking and bimodal continuous speech recognition. In Proc. IEEE 2nd Workshop on Multimedia Signal Processing, pages 65-70, Redondo Beach, 1998.
[14] C. Chatfield and A. J. Collins. Introduction to Multivariate Analysis. Chapman and Hall, London, 1991.
[15] C. C. Chibelushi, F. Deravi, and J. S. D. Mason. Survey of audio visual speech databases. Technical report, Department of Electrical and Electronic Engineering, University of Wales, Swansea, 1996.
[16] G. Chiou and J.-N. Hwang. Lipreading from color video. IEEE Transactions on Image Processing, 6(8):1192-1195, 1997.
[17] W. Chou, B.-H. Juang, C.-H. Lee, and F. Soong. A minimum error rate pattern recognition approach to speech recognition. Journal of Pattern Recognition and Artificial Intelligence, III:5-31, 1994.
[18] Y.-L. Chow. Maximum mutual information estimation of HMM parameters for continuous speech recognition using the N-best algorithm. In Proc. International Conference on Acoustics, Speech and Signal Processing, volume 1, pages 701-704, Albuquerque, 1990.
[19] T. F. Cootes, G. J. Edwards, and C. J. Taylor. Active appearance models. In Proc. European Conference on Computer Vision, pages 484-498, 1998.
[20] T. F. Cootes, C. J. Taylor, D. H. Cooper, and J. Graham. Training models of shape from sets of examples. In D. Hogg and R. Boyle, editors, Proc. British Machine Vision Conference, pages 9-18. BMVA Press, 1992.
[21] T. F. Cootes, C. J. Taylor, D. H. Cooper, and J. Graham. Active shape models - their training and application. Computer Vision and Image Understanding, 61(1):38-59, 1995.
[22] F. Davoine, H. Li, and R. Forchheimer. Video compression and person authentication. In J. Bigun, G. Chollet, and G. Borgefors, editors, Audio- and Video-based Biometric Person Authentication, pages 353-360, Berlin, 1997. Springer.
[23] P. De Cuetos, C. Neti, and A. Senior. Audio-visual intent to speak detection for human computer interaction. In Proc. International Conference on Acoustics, Speech and Signal Processing, Istanbul, 2000.
[24] J. R. Deller, Jr., J. G. Proakis, and J. H. L. Hansen. Discrete-Time Processing of Speech Signals. Macmillan Publishing Company, Englewood Cliffs, 1993.
[25] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, 39(1):1-38, 1977.
[26] I. L. Dryden and K. V. Mardia. Statistical Shape Analysis. Wiley, 1998.
[27] P. Duchnowski, M. Hunke, D. Büsching, U. Meier, and A. Waibel. Toward movement-invariant automatic lip-reading and speech recognition. In Proc. International Conference on Spoken Language Processing, pages 109-112, 1995.
[28] S. Dupont and J. Luettin. Using the multi-stream approach for continuous audio-visual speech recognition: Experiments on the M2VTS database. In Proc. International Conference on Spoken Language Processing, Sydney, 1998.
[29] S. Dupont and J. Luettin. Audio-visual speech modeling for continuous speech recognition. IEEE Transactions on Multimedia, 2(3):141-151, 2000.
[30] G. J. Edwards, T. F. Cootes, and C. J. Taylor. Face recognition using active appearance models. In Proc. European Conference on Computer Vision, pages 582-595, 1998.
[31] J. G. Fiscus. A post-processing system to yield reduced word error rates: Recognizer output voting error reduction (ROVER). In Proc. Workshop on Automatic Speech Recognition and Understanding, 1997.
[32] J. D. Foley, A. van Dam, S. K. Feiner, and J. F. Hughes. Computer Graphics: Principles and Practice. Addison-Wesley, 1996.
[33] B. Fröba, C. Küblbeck, C. Rothe, and P. Plankensteiner. Multi-sensor biometric person recognition in an access control system. In Proc. 2nd International Conference on Audio and Video-based Biometric Person Authentication (AVBPA), pages 55-59, Washington, 1999.
[34] K. Fukunaga. Introduction to Statistical Pattern Recognition. Morgan Kaufmann, 1990.
[35] M. J. F. Gales. 'Nice' model based compensation schemes for robust speech recognition. In Proc. ESCA-NATO Workshop on Robust Speech Recognition for Unknown Communication Channels, pages 55-59, Pont-à-Mousson, 1997.
[36] M. J. F. Gales and S. J. Young. An improved approach to hidden Markov model decomposition. In Proc. International Conference on Acoustics, Speech and Signal Processing, pages 729-734, San Francisco, 1992.
[37] J.-L. Gauvain and C.-H. Lee. Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains. IEEE Transactions on Speech and Audio Processing, 2:291-298, 1994.
[38] O. Ghitza. Auditory nerve representation as a front end for speech recognition in noisy environments. Computer, Speech and Language, 1:109-130, 1986.
[39] H. Glotin. Elaboration et étude comparative d'un système adaptatif de reconnaissance robuste de la parole en sous-bandes: Incorporation d'indices primitifs F0 et ITD. PhD thesis, Doctorat de l'Institut National Polytechnique de Grenoble, Grenoble, 2000.
[40] H. Glotin, F. Berthommier, E. Tessier, and H. Bourlard. Interfacing of CASA and multistream recognition. In Proc. Text, Speech and Dialog International Workshop (TSD), pages 207-212, Brno, 1998.
[41] G. H. Golub and C. F. Van Loan. Matrix Computations. The Johns Hopkins University Press, Baltimore, 1983.
[42] R. A. Gopinath. Maximum likelihood modeling with Gaussian distributions for classification. In Proc. International Conference on Acoustics, Speech and Signal Processing, volume 2, pages 661-664, Seattle, 1998.
[43] H. P. Graf, E. Cosatto, and G. Potamianos. Robust recognition of faces and facial features with a multi-modal system. In Proc. International Conference on Systems, Man, and Cybernetics, pages 2034-2039, Orlando, 1997.
[44] M. S. Gray, J. R. Movellan, and T. J. Sejnowski. Dynamic features for visual speechreading: A systematic comparison. In M. C. Mozer, M. I. Jordan, and T. Petsche, editors, Advances in Neural Information Processing Systems, volume 9, pages 751-757, Cambridge, 1997. MIT Press.
[45] M. E. Hennecke, D. G. Stork, and K. V. Prasad. Visionary speech: Looking ahead to practical speechreading systems. In Stork and Hennecke [91], pages 331-349.
[46] H. Hermansky and N. Morgan. RASTA processing of speech. IEEE Transactions on Speech and Audio Processing, 2(4):578-589, 1994.
[47] A. K. Jain, R. P. W. Duin, and J. Mao. Statistical pattern recognition: A review. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(1):4-37, 2000.
[48] P. Jourlin. Word dependent acoustic-labial weights in HMM-based speech recognition. In Proc. European Tutorial Workshop on Audio-Visual Speech Processing (AVSP), pages 69-72, Rhodes, 1997.
[49] P. Jourlin, J. Luettin, D. Genoud, and H. Wassner. Acoustic-labial speaker verification. Pattern Recognition Letters, 18(9):853-858, 1997.
[50] B. H. Juang. Speech recognition in adverse environments. Computer, Speech and Language, 5:275-294, 1991.
[51] M. Kass, A. Witkin, and D. Terzopoulos. Snakes: Active contour models. International Journal of Computer Vision, 1(4):321-331, 1988.
[52] R. Kaucic, B. Dalton, and A. Blake. Real-time lip tracking for audio-visual speech recognition applications. In B. Buxton and R. Cipolla, editors, Proc. European Conference on Computer Vision, volume II of Lecture Notes in Computer Science, pages 376-387, Cambridge, 1996. Springer-Verlag.
[53] T. Kuratate, H. Yehia, and E. Vatikiotis-Bateson. Kinematics based synthesis of realistic talking faces. In Proc. Workshop on Audio Visual Speech Processing, Terrigal, 1998.
[54] C. J. Leggetter and P. C. Woodland. Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models. Computer Speech and Language, 9:171-185, 1995.
[55] R. Lippmann. Speech recognition by machines and humans. Speech Communication, 22(1), 1997.
[56] F. Liu, R. Stern, X. Huang, and A. Acero. Efficient cepstral normalization for robust speech recognition. In Proc. ARPA Human Language Technologies Workshop, 1993.
[57] J. Luettin. Towards speaker independent continuous speechreading. In Proc. of the European Conference on Speech Communication and Technology, pages 1991-1994, Rhodes, 1997.
[58] J. Luettin. Visual Speech and Speaker Recognition. PhD thesis, University of Sheffield, 1997.
[59] J. Luettin and N. A. Thacker. Speechreading using probabilistic models. Computer Vision and Image Understanding, 65(2):163-178, 1997.
[60] J. Luettin, N. A. Thacker, and S. W. Beet. Active shape models for visual feature extraction. In Stork and Hennecke [91], pages 383-390.
[61] J. Luettin, N. A. Thacker, and S. W. Beet. Speechreading using shape and intensity information. In Proc. International Conference on Spoken Language Processing, volume 1, pages 58-61, 1996.
[62] D. W. Massaro and D. G. Stork. Speech recognition and sensory integration. American Scientist, 86(3):236-244, 1998.
[63] I. Matthews. Features for Audio-Visual Speech Recognition. PhD thesis, School of Information Systems, University of East Anglia, Norwich, 1998.
[64] I. Matthews, T. Cootes, S. Cox, R. Harvey, and J. A. Bangham. Lipreading using shape, shading and scale. In Proc. Workshop on Audio Visual Speech Processing, pages 73-78, Terrigal, 1998.
[65] H. McGurk and J. MacDonald. Hearing lips and seeing voices. Nature, 264:746-748, 1976.
[66] K. Messer, J. Matas, J. Kittler, J. Luettin, and G. Maitre. XM2VTS: The extended M2VTS database. In Proc. 2nd International Conference on Audio and Video-based Biometric Person Authentication (AVBPA), pages 72-76, Washington, 1999.
[67] J. R. Movellan and G. Chadderdon. Channel separability in the audio visual integration of speech: A Bayesian approach. In Stork and Hennecke [91], pages 473-487.
[68] A. Nadas, D. Nahamoo, and M. Picheny. Speech recognition using noise adaptive prototypes. IEEE Transactions on Acoustics, Speech, and Signal Processing, 37:1495-1503, 1989.
[69] J. A. Nelder and R. Mead. A simplex method for function minimisation. Computer Journal, 7(4):308-313, 1965.
[70] C. Neti. Neuromorphic speech processing for noisy environments. In Proc. IEEE International Conference on Neural Networks, pages 4425-4430, Orlando, 1994.
[71] C. Neti, G. Iyengar, G. Potamianos, A. Senior, and B. Maison. Perceptual interfaces for human computer interaction: Joint processing of audio and visual information for human computer interaction. In Proc. International Conference on Spoken Language Processing, Beijing, 2000.
[72] L. Neumeyer, A. Sankar, and V. Digalakis. A comparative study of speaker adaptation techniques. In Proc. European Conference on Speech Communication and Technology (EUROSPEECH), pages 1127-1130, Madrid, 1995.
[73] S. Okawa, T. Nakajima, and K. Shirai. A recombination strategy for multi-band speech recognition based on mutual information criterion. In Proc. European Conference on Speech Communication and Technology (EUROSPEECH), volume 2, pages 603-606, Budapest, 1999.
[74] E. D. Petajan. Automatic lipreading to enhance speech recognition. In Proc. Global Telecommunications Conference (GLOBECOM), pages 265-272, Atlanta, 1984.
[75] G. Potamianos, E. Cosatto, H. P. Graf, and D. B. Roe. Speaker independent audiovisual database for bimodal ASR. In Proc. European Tutorial Workshop on Audio-Visual Speech Processing (AVSP), pages 65-68, Rhodes, 1997.
[76] G. Potamianos and H. P. Graf. Discriminative training of HMM stream exponents for audio-visual speech recognition. In Proc. International Conference on Acoustics, Speech and Signal Processing, volume 6, pages 3733-3736, Seattle, 1998.
[77] G. Potamianos and H. P. Graf. Linear discriminant analysis for speechreading. In Proc. IEEE 2nd Workshop on Multimedia Signal Processing, pages 221-226, Redondo Beach, 1998.
[78] G. Potamianos, H. P. Graf, and E. Cosatto. An image transform approach for HMM based automatic lipreading. In Proc. IEEE International Conference on Image Processing, volume I, pages 173-177, Chicago, 1998.
[79] G. Potamianos and A. Potamianos. Speaker adaptation for audio-visual speech recognition. In Proc. European Conference on Speech Communication and Technology (EUROSPEECH), volume 3, pages 1291-1294, Budapest, 1999.
[80] G. Potamianos, A. Verma, C. Neti, G. Iyengar, and S. Basu. A cascade image transform for speaker independent automatic speechreading. In Proc. International Conference on Multimedia and Expo, volume II, pages 1097-1100, New York, 2000.
[81] W. H. Press, B. P. Flannery, S. A. Teukolsky, and W. T. Vetterling. Numerical Recipes in C. The Art of Scientific Computing. Cambridge University Press, Cambridge, 1988.
[82] L. Rabiner and B.-H. Juang. Fundamentals of Speech Recognition. Prentice Hall, Englewood Cliffs, 1993.
[83] C. R. Rao. Linear Statistical Inference and Its Applications. John Wiley and Sons, New York, 1965.
[84] R. R. Rao and R. M. Mersereau. Lip modeling for visual speech recognition. In 28th Annual Asilomar Conference on Signals, Systems, and Computers, volume 1, pages 587-590, 1994.
[85] A. Rogozan. Étude de la fusion des données hétérogènes pour la reconnaissance automatique de la parole audiovisuelle. PhD thesis, University of Orsay-Paris XI, Paris, 1999.
[86] A. Rogozan, P. Deléglise, and M. Alissali. Adaptive determination of audio and visual weights for automatic speech recognition. In Proc. European Tutorial Workshop on Audio-Visual Speech Processing (AVSP), pages 61-64, Rhodes, 1997.
[87] M. U. R. Sánchez, J. Matas, and J. Kittler. Statistical chromaticity-based lip tracking with B-splines. In Proc. International Conference on Acoustics, Speech and Signal Processing, Munich, 1997.
[88] S. Sclaroff and J. Isidoro. Active blobs. In Proc. International Conference on Computer Vision, 1998.
[89] A. W. Senior. Face and feature finding for a face recognition system. In Proc. 2nd International Conference on Audio and Video-based Biometric Person Authentication (AVBPA), pages 154-159, Washington, 1999.
[90] P. L. Silsbee. Motion in deformable templates. In Proc. IEEE International Conference on Image Processing, volume 1, pages 323-327, 1994.
[91] D. G. Stork and M. E. Hennecke, editors. Speechreading by Humans and Machines: Models, Systems and Applications, volume 150 of NATO ASI Series F: Computer and Systems Sciences. Springer-Verlag, Berlin, 1996.
[92] A. Q. Summerfield. Some preliminaries to a comprehensive account of audio-visual speech perception. In B. Dodd and R. Campbell, editors, Hearing by Eye: The Psychology of Lip-Reading, pages 97-113, Hillsdale, 1987. Lawrence Erlbaum Associates.
[93] P. Teissier, J. Robert-Ribes, and J. L. Schwartz. Comparing models for audiovisual fusion in a noisy-vowel recognition task. IEEE Transactions on Speech and Audio Processing, 7(6):629-642, 1999.
[94] M. Turk and A. Pentland. Eigenfaces for recognition. Journal of Cognitive Neuroscience, 3(1):71-86, 1991.
[95] O. Vanegas, A. Tanaka, K. Tokuda, and T. Kitamura. HMM-based visual speech recognition using intensity and location normalization. In Proc. International Conference on Spoken Language Processing, pages 289-292, Sydney, 1998.
[96] P. Varga and R. K. Moore. Hidden Markov model decomposition of speech and noise. In Proc. International Conference on Acoustics, Speech and Signal Processing, pages 845-848, Albuquerque, 1990.
[97] D. Vergyri. Integration of Multiple Knowledge Sources in Speech Recognition Using Minimum Error Training. PhD thesis, Center for Language and Speech Processing, The Johns Hopkins University, Baltimore, 2000.
[98] D. Vergyri. Use of word level side information to improve speech recognition. In Proc. International Conference on Acoustics, Speech and Signal Processing, Istanbul, 2000.
[99] D. Vergyri, S. Tsakalidis, and W. Byrne. Minimum risk acoustic clustering for multilingual acoustic model combination. In Proc. International Conference on Spoken Language Processing, Beijing, 2000.
[100] T. Wark and S. Sridharan. A syntactic approach to automatic lip feature extraction for speaker identification. In Proc. International Conference on Acoustics, Speech and Signal Processing, volume 6, pages 3693-3696, Seattle, 1998.
[101] J. J. Williams, J. C. Rutledge, D. C. Garstecki, and A. K. Katsaggelos. Frame rate and viseme analysis for multimedia applications. In Proc. IEEE 1st Workshop on Multimedia Signal Processing, pages 13-18, Princeton, 1997.
[102] M. Woo, J. Neider, T. Davis, and D. Shreiner. OpenGL Programming Guide: The Official Guide to Learning OpenGL, version 1.2. Addison-Wesley, third edition, 1999.
[103] S. Young, D. Kershaw, J. Odell, D. Ollason, V. Valtchev, and P. Woodland. The HTK Book. Entropic Ltd., Cambridge, 1999.
[104] A. L. Yuille, P. W. Hallinan, and D. S. Cohen. Feature extraction from faces using deformable templates. International Journal of Computer Vision, 8(2):99-111, 1992.
[105] E. Yumoto, W. J. Gould, and T. Baer. Harmonic to noise ratio as an index of the degree of hoarseness. Journal of the Acoustical Society of America, 71(6):1544-1550, 1982.