University of Sheffield, Electronic Systems Group Report No. 95/47. Also appears in IEEE ICASSP, Atlanta, GA, May 1996.

VISUAL SPEECH RECOGNITION USING ACTIVE SHAPE MODELS AND HIDDEN MARKOV MODELS

Juergen Luettin, Neil A. Thacker and Steve W. Beet
Department of Electronic and Electrical Engineering
University of Sheffield, Sheffield, UK
J.Luettin@sheffield.ac.uk

ABSTRACT

This paper describes a novel approach to visual speech recognition. The shape of the mouth is modelled by an Active Shape Model, which is derived from the statistics of a training set and used to locate, track and parameterise the speaker's lip movements. The extracted parameters representing the lip shape are modelled as continuous probability distributions, and their temporal dependencies are modelled by Hidden Markov Models. We present recognition tests performed on a database covering a broad variety of speakers and illumination conditions. The system achieved an accuracy of 85.42% on a speaker-independent recognition task over the first four digits, using lip shape information only.

1. INTRODUCTION

It has been shown that the robustness and accuracy of automatic speech recognition can be improved by using visual information about the speaker's lip movements in addition to the acoustic speech signal [1]. The main difficulty in incorporating visual information into an acoustic speech recognition system is finding a robust and accurate method for extracting the important visual speech features.

The two main approaches for extracting speech information from image sequences are the image-based approach [1, 2, 3] and the model-based approach [4, 5]. In the image-based approach the image intensities are pre-processed and then used directly as the feature vector; pre-processing normally consists of filtering and dimension reduction. The advantage of this approach is that no data is thrown away. The disadvantages are that the classifier is left with the non-trivial task of learning to generalise over translation, scaling, rotation, illumination and linguistic variability, and that the feature vector is high-dimensional and highly redundant.

In the model-based approach a model of the visible speech articulators, mainly the lip contours, is built, and its configuration is described by a small set of parameters. The advantage of the model-based approach is that the important features are represented in a low-dimensional space and are normally invariant to translation, rotation, scale and illumination. A disadvantage is that a particular model may not capture all relevant speech information. The main difficulty in the model-based approach is building a model which represents the lip shape efficiently and which is able to locate and track the lip contours of different speakers under different illumination conditions.

We describe a model-based speechreading system in which a model of the lips is constructed from a training set. The model is subsequently used to locate, track and parameterise lip contours in image sequences. We show how Hidden Markov Models (HMMs) can be used to model visual speech, and we describe recognition tests based purely on lip shape features.

2. LOCATING AND TRACKING LIPS

Deformable templates [6] have been proposed [4, 5] for locating and tracking lip contours, but because deformation of the model is constrained by the initial choice of the polynomials representing the contour, they are often unable to represent the variety of lip shapes in fine detail.
"Snakes" [7], on the other hand, are able to resolve fine contour detail, but shape constraints are difficult to incorporate [8], and one has to compromise between the degree of elasticity and the ability to resolve fine contour detail. Image search with deformable templates and snakes is normally performed by fitting the model to edges in the image, on the assumption that there are strong edges along the lip contours. This assumption often does not hold: lip edges vary across speakers and depend on illumination, the visibility of teeth and the degree of mouth opening. Edges on the lower outer lip contour are particularly hard to distinguish, and edges inside the mouth often originate from the teeth.

We use an approach based on Active Shape Models (ASMs) [9] to model, locate and track lip contours; it is described in detail in [10]. ASMs are flexible models which represent the boundary or other significant locations of an object by a set of labelled points. They use a priori knowledge about shape deformation, obtained from the statistics of a hand-labelled training set. The main modes of shape variation are projected into a linear subspace obtained by Principal Component Analysis (PCA), so any shape can be approximated by a linear combination of the mean shape and the first few main modes of variation. No heuristic limits on shape deformation are used; instead we constrain each shape parameter to lie within ±3 standard deviations of the training set, which accounts for about 99% of the variation. We built two models of the lips, one representing the outer lip contour only and one describing both the inner and outer lip contours. Figure 1 shows the first three principal modes of deformation captured in the training set for the double contour model.

[Figure 1: Mean shape and the first three principal modes of variation at ±2 standard deviations.]

The models are then used to locate and track lips in image sequences. During image search a cost function measures the fit between the model and the image. We have found image gradients inappropriate for representing lip boundaries. Instead we use a profile model which learns typical intensity values around the lip contours from the training set. We sample one-dimensional intensity profiles g_ij of length n, perpendicular to the contour and centred at model point i, for each training image j, as described in [9], but we concatenate the profiles of all model points of a training image j to form a global profile vector h_j. As with the shape deformation, we constrain the main modes of profile variation captured in the training set to lie in a low-dimensional linear subspace obtained by PCA. Any profile in the training set can now be approximated by

    h = \bar{h} + P b ,                                 (1)

where \bar{h} is the mean profile, P is the matrix whose columns are the first few eigenvectors, corresponding to the largest eigenvalues, and b is a vector containing the weight for each eigenvector. The motivation for this approach is to build a model which describes the mean intensity profile of the training set and its main modes of variation, which originate from different speakers, different lighting conditions and different "mouth states". For example, the profile across the inside of the mouth shows large intensity variation, depending on the mouth opening and the visibility of teeth and tongue.

We use the Downhill Simplex Method [11], a multi-dimensional minimisation procedure, for image search. The model is first placed at an initial position in the image; the mean profile is then aligned as closely as possible to the image profile h using the first few modes of profile variation. The profile weight vector is found using

    b = P^T (h - \bar{h}) .                             (2)

The cost E at a particular location and shape is calculated as the mean square error (MSE) between the image profile and the aligned profile model:

    E = (h - \bar{h})^T (h - \bar{h}) - b^T b .         (3)

We assume equal prior probabilities for all shapes within the deformation constraints and therefore do not include a shape deformation term in the cost function.
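To make the search concrete, the following is a minimal sketch of the profile model and cost of Eqs. (1)-(3), assuming NumPy and SciPy; build_profile_model, profile_cost, locate and the sample_profiles helper are illustrative names, not code from the paper. SciPy's 'Nelder-Mead' option implements the Downhill Simplex Method [11].

```python
# Illustrative sketch (not the authors' code) of the PCA profile model and
# matching cost. 'sample_profiles' is a hypothetical helper that extracts
# the concatenated grey-level profile for a given pose/shape parameter vector.
import numpy as np
from scipy.optimize import minimize

def build_profile_model(H, n_modes):
    """H: (n_images, profile_dim) matrix of concatenated training profiles."""
    h_bar = H.mean(axis=0)
    # PCA via SVD of the centred profile matrix; columns of P are the
    # eigenvectors with the largest eigenvalues (Eq. 1: h ~ h_bar + P b).
    _, _, Vt = np.linalg.svd(H - h_bar, full_matrices=False)
    P = Vt[:n_modes].T
    return h_bar, P

def profile_cost(h, h_bar, P):
    """Eq. (2): b = P^T (h - h_bar); Eq. (3): E = (h-h_bar)^T(h-h_bar) - b^T b."""
    d = h - h_bar
    b = P.T @ d
    return d @ d - b @ b

def locate(image, params0, h_bar, P, sample_profiles):
    # Downhill Simplex search over the pose/shape parameters [11].
    obj = lambda p: profile_cost(sample_profiles(image, p), h_bar, P)
    return minimize(obj, params0, method='Nelder-Mead').x
```

Note that E in Eq. (3) is the residual energy left after the first few profile modes have explained as much of h - \bar{h} as they can, so minimising E favours image locations whose profiles the trained model can reconstruct well.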
Locating the lips in the first frame of an image sequence is performed as described above. For subsequent frames, the estimated position and shape of the lips in the previous frame are used as the initial estimate for the search algorithm.

3. VISUAL SPEECH FEATURE EXTRACTION

The parameters describing the shape of the lips are extracted at each time frame and used as visual speech feature vectors. The parameters are invariant to scale, rotation, translation and illumination and can be used directly by the recognition network. The translation and rotation parameters are not used for recognition because they are unlikely to carry speech information.

Much speech information is contained in the dynamics of the lip movements rather than in the actual shape, and the dynamics of lip movements may also be less sensitive to linguistic variability. We therefore performed some recognition tests including the temporal differences of each feature (delta shapes). Scale might contain relevant speech information, but absolute values are hard to estimate and may vary from speaker to speaker. We omitted absolute scale information but performed some recognition tests including scale differences (delta scale).

Table 1: Word accuracy using one shape mode (sm) with optional delta shape mode (dsm) and delta scale (ds).

    Coefficients            sm        sm + dsm   sm + dsm + ds
    Single Contour Model    58.33%    68.75%     80.21%
    Double Contour Model    67.71%    79.17%     85.42%

[Figure 2: Examples of image sequences with lip tracking results.]

4. VISUAL SPEECH MODELLING

Visual speech is modelled by representing each utterance as a sequence of visual speech vectors. The emission probabilities are modelled by continuous Gaussian distributions and the temporal changes are modelled by Hidden Markov Models. We used whole-word HMMs and trained one HMM for each word class to be recognised. The models are trained using the Baum-Welch re-estimation algorithm. Recognition is performed using the Viterbi algorithm, which estimates the likelihood of each HMM having generated the observed sequence; the model with the highest likelihood is chosen as the recognised word. This is a standard approach in acoustic speech recognition systems [12].

The shape features contain some information which contributes to class discriminability and some information which describes between- and within-speaker variability (linguistic variability). Given sufficient training data, we assume that the recognition network will learn which features contribute to class discriminability and which do not. Since the database we used was very small, we performed a variety of recognition tests using only the first few shape parameters, corresponding to the largest variances, on the assumption that these parameter estimates are more robust and contain most of the speech information.
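Purely as an illustration of this modelling pipeline, here is a hedged sketch of whole-word Gaussian HMM training and Viterbi-based recognition using the hmmlearn library; this is an assumption for illustration, not the authors' toolkit (the paper's HMMs follow standard HTK-style practice [12]), and add_deltas mirrors the delta features of Section 3.

```python
# Illustrative sketch only: whole-word Gaussian HMMs via hmmlearn (the paper
# used HTK [12]); transition topology and training details here are assumptions.
import numpy as np
from hmmlearn.hmm import GaussianHMM

def add_deltas(seq):
    # Append frame-to-frame differences (the 'delta' features of Section 3).
    return np.hstack([seq, np.diff(seq, axis=0, prepend=seq[:1])])

def train_word_models(train_data, n_states=6):
    """train_data: dict mapping word -> list of (T_i, n_features) sequences."""
    models = {}
    for word, seqs in train_data.items():
        m = GaussianHMM(n_components=n_states, covariance_type='diag')
        # Baum-Welch re-estimation over the concatenated training sequences.
        m.fit(np.concatenate(seqs), lengths=[len(s) for s in seqs])
        models[word] = m
    return models

def recognise(models, seq):
    # Viterbi log-likelihood under each word model; pick the best-scoring word.
    return max(models, key=lambda w: models[w].decode(seq)[0])
```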
5. EXPERIMENTS

Experiments were performed using the Tulips1 database [3], which consists of grey-level image sequences of the first four digits. Each digit was spoken twice by 12 individuals (9 males, 3 females). The database reflects a broad variety of speakers and illumination conditions. The lip locating and tracking experiments were evaluated separately and are described in detail in [13]. Figure 2 shows examples of lip tracking results using the double contour model. The examples demonstrate that the profile model has learned how the profile at the inner lip contour changes with mouth opening and with the visibility of teeth and tongue. The second row also shows that the model is able to track lips which extend beyond the image boundaries.

We performed speaker-independent recognition tests, using different speakers for training and testing, to assess how well the system generalises to new speakers. Because of the small size of the database, recognition tests were performed using the 'jack-knife' or 'leave-one-out' method: 11 subjects were used for training and the 12th for testing, and the whole procedure was repeated 12 times, each time leaving a different subject out for testing. The results were averaged over all speakers. A large variety of visual front ends and HMM architectures was evaluated.
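The leave-one-out protocol can be summarised in a few lines. The sketch below is again illustrative, reusing the hypothetical train_word_models and recognise helpers from the previous sketch; the data layout is an assumption.

```python
# Jack-knife / leave-one-out evaluation over speakers (Section 5), reusing
# the hypothetical helpers sketched above. 'data' maps each of the 12
# speakers to a list of (word, feature_sequence) utterances.
def jackknife_accuracy(data):
    per_speaker = []
    for test_speaker in data:
        train = {}
        for speaker, utterances in data.items():
            if speaker == test_speaker:
                continue
            for word, seq in utterances:
                train.setdefault(word, []).append(seq)
        models = train_word_models(train)        # train on the other 11 speakers
        hits = [recognise(models, seq) == word
                for word, seq in data[test_speaker]]
        per_speaker.append(sum(hits) / len(hits))
    return sum(per_speaker) / len(per_speaker)   # average over all 12 speakers
```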
6. RESULTS

Word accuracies of 80.21% were achieved using the single contour model and 85.42% using the double contour model. These results demonstrate that lip contours are a rich source of speech information. This is contrary to Bregler and Omohundro [14], who found the outer lip contour not distinctive enough to give reasonable recognition performance.

Best results were achieved with HMMs of 5 or 6 states and a single diagonal covariance matrix. This suggests that the training set, consisting of 22 training instances per class, was not large enough to estimate the parameters of HMMs with a full covariance matrix or more than one diagonal mixture component. Using only one shape parameter, together with its delta coefficient and delta scale, gave the best recognition rate; this might also indicate that the training set was not large enough to model more than the first main shape mode reliably. Table 1 shows results using one shape mode with optional "delta shape" and "delta scale" for 6-state HMMs with one diagonal mixture component. Figure 3 summarises results for different numbers of shape modes included in the feature vector.

[Figure 3: Recognition accuracy (%) for different numbers of shape modes (1 to 6) using combinations of basic shape modes (sm), delta shape modes (dsm) and delta scale (ds).]

7. CONCLUSIONS

We have described a new approach to visual speech recognition based on a data-driven lip model and HMMs. The experiments demonstrate high recognition performance using only very low-dimensional shape information. The recognition task described is relatively simple, since it covers only four word classes and deals only with isolated words. Nevertheless, the recognition tests were speaker independent and demonstrate the high recognition accuracy and generalisation ability of the system. More extensive tests with more speakers and subword classes are necessary to estimate the discrimination ability of shape features across all phonemes.

Our results are not as good as those reported in [3], where 89.58% correct was achieved, roughly equivalent to the performance of untrained humans on the same task. One reason for this might be the absence of additional intensity information, particularly about the visibility of teeth and tongue. In the future we plan to extract this information from the profile weight vector and incorporate it into the visual feature vector. The ability to locate and track lips accurately also opens up several other potential applications, for example model-based image coding, facial animation, facial expression recognition and audio-visual person identification.

ACKNOWLEDGEMENTS

Juergen Luettin is funded by a University of Sheffield Scholarship and the German Academic Exchange Service (DAAD).

REFERENCES

[1] C. Bregler, H. Hild, S. Manke and A. Waibel, "Improved Connected Letter Recognition by Lipreading", Proc. IEEE ICASSP, pp. 557-560, 1993.
[2] B. P. Yuhas, M. H. Goldstein and T. J. Sejnowski, "Integration of Acoustic and Visual Speech Signals using Neural Networks", IEEE Communications Magazine, pp. 75-81, 1989.
[3] J. R. Movellan, "Visual Speech Recognition with Stochastic Networks", in G. Tesauro, D. Touretzky and T. Leen (eds.), Advances in Neural Information Processing Systems 7, MIT Press, Cambridge, 1995.
[4] M. E. Hennecke, K. V. Prasad and D. G. Stork, "Using Deformable Templates to Infer Visual Speech Dynamics", 28th Annual Asilomar Conference on Signals, Systems and Computers, 1994.
[5] R. R. Rao and R. M. Mersereau, "Lip Modeling for Visual Speech Recognition", 28th Annual Asilomar Conference on Signals, Systems and Computers, 1994.
[6] A. L. Yuille, P. Hallinan and D. S. Cohen, "Feature extraction from faces using deformable templates", Int. J. Computer Vision, Vol. 8, pp. 99-112, 1992.
[7] M. Kass, A. Witkin and D. Terzopoulos, "Snakes: active contour models", Int. J. Computer Vision, pp. 321-331, 1988.
[8] C. Bregler and S. Omohundro, "Surface Learning with Applications to Lip-Reading", in J. D. Cowan, G. Tesauro and J. Alspector (eds.), Advances in Neural Information Processing Systems 6, Morgan Kaufmann, 1994.
[9] T. F. Cootes, A. Hill, C. J. Taylor and J. Haslam, "Use of active shape models for locating structures in medical images", Image and Vision Computing, Vol. 12, No. 6, pp. 355-365, 1994.
[10] J. Luettin, N. A. Thacker and S. W. Beet, "Active Shape Models for Visual Speech Feature Extraction", in D. G. Stork (ed.), Speechreading by Man and Machine: Models, Systems and Applications (NATO Advanced Study Institute), Springer-Verlag, in press.
[11] J. A. Nelder and R. Mead, "A simplex method for function minimization", Computer Journal, Vol. 7, No. 4, pp. 308-313, 1965.
[12] S. J. Young, "HTK Version 1.4: User, Reference & Programmer Manual", Cambridge University Engineering Department, Cambridge, 1992.
[13] J. Luettin, N. A. Thacker and S. W. Beet, "Locating and Tracking Facial Speech Features", submitted to ECCV'96.
[14] C. Bregler and S. M. Omohundro, "Nonlinear Manifold Learning for Visual Speech Recognition", Proc. ICCV, 1995.