AVSP 2003 - International Conference
on Audio-Visual Speech Processing
St. Jorioz, France
September 4-7, 2003
ISCA Archive
http://www.isca-speech.org/archive
Visual Feature Analysis for Automatic Speechreading
Patricia Scanlon, Richard Reilly, Philip de Chazal
Dept. of Electronic and Electrical Engineering
University College Dublin, Belfield, Dublin 4, Ireland
{patricias,richard.reilly,philip}@ee.ucd.ie
Abstract
This paper proposes a novel method of visual feature
extraction for automatic speechreading. While current
methods extract delta or difference features by computing the
difference between adjacent frames, the proposed method
provides information on how the visual features evolve over a
time period longer than that between adjacent frames, with
the period set relative to the length of the utterance. These
new features provide a visual memory capability for improved
system performance. Good visual discrimination is achieved
by maintaining a base level of detail in the features, and a
frame rate of 30 frames per second provides rapid visual
recognition of speech. The combination of the novel visual
memory features, good visual discrimination and rapid visual
recognition of speech movements is shown to improve visual
speech recognition. Using this method, an isolated word
accuracy of 28.1% was achieved for a vocabulary of 78 words
over a database of 10 speakers.
1. Introduction
Speechreading may be defined as “the ability to understand a
speaker’s thoughts by watching the movements of the face and
body and using information provided by the situation and
language” [3]. It is well known that visual information from
the face of a speaker provides related speech information that
enhances intelligibility of speech utterances under difficult
listening conditions. Even under noiseless conditions,
speechreading is known to improve the accuracy of speech
perception in people with normal hearing [1]. Audio-Visual
Automatic Speech Recognition (AVASR) systems use visual
information to enhance Automatic Speech Recognition (ASR)
systems [4-10].
Auditory and visual speech provide two independent sources
of information, but they can also be considered
complementary: characteristics that are weak in one modality
may be strong in the other, i.e. some phonemes that are
difficult to distinguish acoustically are easily distinguished
visually, and vice versa. Visemes are defined as visually
distinguishable speech units. Most visemes cannot be
uniquely associated with a single phoneme, so there is often a
many-to-one mapping of phonemes to visemes. It is this
complementary nature of speech that drives the use of visual
information in ASR.
Since little is known about exactly how humans perceive
speech, it is beneficial to study human speech perception for
clues on how to optimise speech recognition by machines. It
is clear from the McGurk effect [2] that humans integrate the
audio and visual modalities, but at what stage is not so
apparent. Consequently, significant research has focused on
the integration of the audio and visual modalities [6], [7];
another active area of research is determining which features
should be extracted for classification of speech [8-10].
The skills required for speechreading are visual recognition,
visual discrimination, visual memory and flexibility. Visual
recognition is the ability to recognize different speech
movements and to be able to do this rapidly. Visual
discrimination is more concerned with the ability to see fine
differences in lips, teeth, and jaw positions in order to
distinguish between different speech movements. Another
important skill is visual memory, which is the ability to recall
previous visual patterns of speech. As the speechreader may
need to change their interpretation of the utterance as the
conversation unfolds, a capacity to remember the previous
visual patterns of speech is vital. Flexibility is a further
required skill, allowing the speechreader to make quick
changes in the perception of the utterance as it is being
spoken.
In the next section, visual feature extraction methods are
discussed, and in Section 3 the implementation of the
proposed method is described. Section 4 presents the results,
where the level of detail required for visual speech
recognition was investigated and more traditional methods of
computing delta features are compared with the proposed
method of computing delta features over longer time periods.
2. Visual feature extraction methods
The objective of video processing for AVASR is to preprocess facial images into a set of meaningful parameters for
subsequent classification into words. A region of interest
(ROI), consisting of the mouth area and perhaps the jaw, is
then located in the acquired facial images. Image preprocessing is employed within the ROI to minimize the effects
of variable lighting conditions, using techniques such as
histogram flattening and balancing of the left-to-right
brightness distribution. The ROI images are also often
downsampled at this stage, to reduce their dimensionality and
therefore the complexity of the feature extraction process. The
downsampling step also transforms the image into a square
matrix of pixel values of dimension NxN to allow ease of
manipulation. Downsampling the ROI image reduces the
resolution (number of pixels) and as a result can reduce the
level of detail in the image.
Using pixel-based methods, every pixel in the entire raw
image is considered a feature. This approach ensures that no
information is lost, but the size of the feature vector can be
very large and may also contain considerable redundancy.
Image transformations transform NxN pixels into NxN
transform coefficients. The transform coefficients are ordered
according to the importance of their information [11]. The
transformation process compacts most of the image’s energy
in a relatively small number of coefficients thus removing the
linguistically redundant information. The low-frequency
coefficients represent the low-level detail of the image, with
the finer image details in the higher-frequency coefficients. A
subset of the transform coefficients is used as the features.

Figure 1: Visual subsystem. (Face image, 720x480 pixels →
location of region of interest, 264x216 pixels → histogram
flattening → downsampling, 64x64 pixels → image transform
→ selection of pertinent coefficients and/or computation of
delta features → formation of feature vector → normalisation
→ feature vector for classification.)
The Discrete Cosine Transform (DCT) is the most widely
used transformation in transform coding; it performs well for
highly correlated data and has excellent energy compaction.
The Hadamard and Haar transforms are also used in image
processing applications [11]. Unlike the DCT, these are
non-trigonometric transforms. The Hadamard transform basis
functions consist of +1's and -1's, so the transformation
requires no multiplications. The Hadamard transform
provides good energy compaction, though not as efficiently as
the DCT [11].
The Haar transform coefficients are the differences along the
rows and columns of the local averages of pixels in the
image, which results in good edge extraction. The Haar
transform provides relatively poor energy compaction [11].
These two transforms have enormous benefit for embedded
systems, as their computational requirements are far less than
those of the DCT.
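The energy-compaction comparison above can be sketched numerically. The following NumPy example is not from the paper (the smooth test image and function names are ours for illustration): it builds orthonormal DCT-II and Hadamard bases and measures what fraction of a correlated image's energy the largest coefficients capture.

```python
import numpy as np

def dct_matrix(n):
    # Orthonormal DCT-II basis (rows are cosine basis vectors).
    k, x = np.arange(n)[:, None], np.arange(n)[None, :]
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * x + 1) * k / (2 * n))
    c[0, :] = np.sqrt(1.0 / n)
    return c

def hadamard_matrix(n):
    # Orthonormal Hadamard basis (n a power of two); the unscaled entries
    # are all +1/-1, so the raw transform needs no multiplications.
    h = np.array([[1.0]])
    while h.shape[0] < n:
        h = np.kron(h, np.array([[1.0, 1.0], [1.0, -1.0]]))
    return h / np.sqrt(n)

def transform2d(img, basis):
    # Separable 2D transform: coefficients = B @ img @ B.T.
    return basis @ img @ basis.T

def energy_compaction(coeffs, m):
    # Fraction of total energy captured by the m largest coefficients.
    e = np.sort((coeffs ** 2).ravel())[::-1]
    return e[:m].sum() / e.sum()

# Smooth synthetic 64x64 "ROI": highly correlated pixels, the regime in
# which transform coding compacts energy into few coefficients.
x = np.linspace(0.0, 1.0, 64)
img = np.outer(np.sin(np.pi * x), np.cos(np.pi * x)) + 1.0
dct_c = transform2d(img, dct_matrix(64))
had_c = transform2d(img, hadamard_matrix(64))
```

Because both bases are orthonormal, total energy is preserved, and `energy_compaction(dct_c, 28)` shows how much of the image a 28-coefficient feature vector retains.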
The number of transform coefficients used reflects the level
of detail in the feature vector. While smaller feature vector
dimensions result in better use of the available training data,
larger feature vectors result in more discriminating detail
being included. What is needed is a realistic feature vector
dimension that maintains a good level of discriminatory
detail for the application at hand.
In speech, the acoustic and visual information necessary for
the recognition of a word is not present simultaneously. It is
therefore the dynamic nature of speech that provides the
information on what was said, so information further back in
the utterance needs to be taken into account.
An investigation into geometric features required for visual
speech recognition described in [12] concluded that the most
discriminative features were primarily dynamic. Another
method that was used to include dynamic information
involved concatenating features from a number of
neighboring frames and then performing linear discriminant
analysis to reduce the number of features used [13]. Another
way to include dynamic information is to include the
temporal difference between frames. Current methods of
extracting delta features involve computing the difference
between adjacent frames [14].
However, it can be reasoned that since visual speech is
generally sampled at around 30 frames per second (fps),
twice the minimum requirement for efficient human speech
reception [15], the temporal difference between two adjacent
frames will not be great. This temporal difference between
adjacent frames therefore does not provide significantly
discriminating features for classification.
However, taking the difference between the current frame
and a frame much further back in the image sequence should
provide information on how the visual features evolve over
periods of time longer than 1/30sec.
Rapidly analysing the dynamics of the speech articulators,
faster than the human visual system can detect, should enable
a computer-based system to accurately identify all utterances
visually. However the confusability of some visemes reduces
this accuracy. Also, unless all image pixels are used, some
level of detail is sacrificed, which could provide assistance in
distinguishing speech articulator dynamics. By using a
combination of a good level of image detail, a fast frame rate
and an account of previous visual patterns in the utterance, a
good level of accuracy can be achieved.
3. Implementation
The original acquired face image is of size 720x480 pixels.
Location of the ROI is the first process and in this study the
ROI is centred about the mouth area using a window of size
264x216 pixels [17]. The images are then pre-processed with
histogram flattening to minimise the effects of variable
lighting conditions. The images are subsequently
downsampled to a resolution of 64x64 pixels.
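This preprocessing stage might be sketched as below. This is a hedged illustration, not the authors' code: the paper does not specify its exact equalization or downsampling algorithm, the function names are ours, and a 256x256 ROI is used so that block averaging divides evenly (the paper's 264x216 ROI would need interpolation instead).

```python
import numpy as np

def equalize_histogram(img):
    # "Histogram flattening": map grey levels through the normalised
    # cumulative histogram so their distribution is roughly uniform.
    hist = np.bincount(img.ravel(), minlength=256)
    cdf = hist.cumsum().astype(float)
    cdf = (cdf - cdf.min()) / (cdf.max() - cdf.min())  # scale to [0, 1]
    return (cdf[img] * 255).astype(np.uint8)

def downsample(img, size):
    # Reduce the ROI to size x size by averaging equal blocks
    # (assumes the ROI dimensions are multiples of `size`).
    h, w = img.shape
    return img.reshape(size, h // size, size, w // size).mean(axis=(1, 3))

# Synthetic ROI standing in for the mouth region of a face image.
roi = np.random.default_rng(0).integers(0, 256, (256, 256)).astype(np.uint8)
roi_eq = equalize_histogram(roi)
roi_small = downsample(roi_eq, 64)   # 64x64 ROI, as in the paper
```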
The downsampled ROI images are image transformed and
the transform coefficients used as the feature vector for
classification. The transforms investigated included the
Discrete Cosine, Haar and Hadamard, with subsets of the
transform coefficients used to examine the level of detail
required for good visual speech recognition. The visual
feature vector was formed by taking 15, 28 or 36 of the
highest-energy components produced by applying the image
transform.
In addition to the transform coefficient features, temporal
difference information, or delta features, between frames is
also calculated. Given a frame offset k, delta features are
computed between the current frame n and frame n-k. To
account for the different speaking rates of each instance of an
utterance, k was set relative to the total number of frames N
within the utterance, e.g. k = 1, N/6, N/3, N/2, N*2/3,
N*5/6, N-1. In this way, the delta
features are more reliably computed between the same
segments in each instance of the utterance. With these delta
features included in the feature vectors, the feature vector
dimension increases in length to 30, 56 and 72 respectively.
For frames numbered 1 to k the delta features appended to
the feature vector are zero.
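The delta-feature computation described above can be sketched as follows. This is an illustrative NumPy rendering (the function name and `frac` parameter are ours, not the paper's); k = N/6 is shown since the results in Section 4 find it best.

```python
import numpy as np

def delta_features(frames, frac=1/6):
    # frames: (N, d) array of static features for the N frames of one
    # utterance. The offset k is set relative to the utterance length N,
    # so deltas are taken between the same relative segments of an
    # utterance regardless of speaking rate.
    n = len(frames)
    k = max(1, round(n * frac))
    deltas = np.zeros_like(frames)
    deltas[k:] = frames[k:] - frames[:-k]  # frames 1..k keep zero deltas
    return np.hstack([frames, deltas])     # static + delta feature vector
```

For a 30-frame utterance with 28 static features per frame, the output is a (30, 56) array: 28 static coefficients plus 28 deltas, with the first k = 5 frames carrying zero deltas as described above.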
Figure 2: Visual memory features (e.g. N = 30 frames
shown): delta features are computed between the current
frame n and earlier frames along the time axis, from n-(N/6)
back to n-(N-1).
All features in the feature vector X were normalized prior to
classification. The normalization procedure for each feature
x, required calculation of the mean and standard deviation
across the training samples and these were used to rescale the
feature values. After rescaling each feature across the
training sample had a mean of zero and a standard deviation
of one. The same normalisation values were used to
normalise the test data.
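This z-score normalisation might be sketched as follows (function names are illustrative; the training statistics are reused on the test data exactly as described above):

```python
import numpy as np

def fit_normaliser(train):
    # Per-feature mean and standard deviation over the training samples.
    return train.mean(axis=0), train.std(axis=0)

def normalise(x, mean, std):
    # Rescale so each training feature has zero mean and unit standard
    # deviation; the SAME training statistics are applied to test data.
    return (x - mean) / std

train = np.random.default_rng(1).normal(5.0, 2.0, (100, 28))
mean, std = fit_normaliser(train)
train_n = normalise(train, mean, std)
```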
The audio pre-processing was based on extraction of
mel-frequency cepstral coefficients of order 12, in
conjunction with the log energy of the speech frame. The first
and second derivatives are also included in the feature vector,
which has a dimension of 39 [16].
Hidden Markov Models (HMMs) were used for classification
of both audio and visual isolated word recognition. The
Hidden Markov Model Toolkit, HTK 3.0 was used to
implement the HMM topologies [16]. Both the audio and
visual HMMs contained 10 states with one mixture per state.
The audio and visual modalities were integrated following
separate classification. The fusion scheme chosen is
multiplicative using probabilistic rules. The scheme initially
selects the candidate that maximises the cross product of the
N-best output probabilities of the audio and visual
modalities; N was set to 10 in this study. This process is
known as late integration [7].
The audio and visual N-best output probabilities are
subsequently weighted according to the dispersion or
variances of their N-best output probabilities. These adaptive
weights account for the confusability of phonemes visually
and also the confusability of phonemes acoustically for
varying levels of SNR. This weighting indicates the
reliability of the modalities [6].
λ = σv / (σv + σa)    (1)
The weighting is carried out using Equation 1, where σa and
σv are the variances of the audio and visual modality’s N-best
output probabilities, respectively. The visual N-best output
probabilities are weighted using λ and the audio N-best
output probabilities using 1-λ. However, when one of the
modalities becomes corrupt, it can mask reliable output
recognition from the other modality. Placing thresholds on
the integration process can avert this corruption. When λ falls
below a threshold t, the candidate that maximizes the visual
output probabilities is selected; when λ is greater than 1-t,
the candidate that maximizes the audio output probabilities
is selected. Through empirical testing, t = 0.3 was found to
maximize audio-visual recognition.
4. Results
The audio-visual database employed in this study consisted
of 10 speakers (7 male and 3 female) [17]. For each speaker,
ten full frontal recordings were taken for a vocabulary of 78
words. The audio signals were contaminated with white
Gaussian noise, giving audio recognition across varying
SNRs of -12, -6, 0, 6, 12, 18 and 24 dB.
n-fold cross-validation was used in all results to maximise
the use of the available data. The data was divided into n
subsets of equal size, where n is the number of speakers and
each subset contains all the recordings of one speaker. The
system is trained and tested n times, each time leaving one
subset out of training and using the omitted subset for
testing. The results are obtained using 4 recordings of the 10
speakers uttering the entire 78-word vocabulary.
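The leave-one-speaker-out procedure can be sketched as follows (the data layout and names are illustrative, not from the paper):

```python
def speaker_folds(recordings):
    # recordings: dict mapping speaker id -> list of that speaker's
    # samples. Each fold trains on all other speakers and tests on the
    # held-out speaker, so train and test speakers never overlap.
    speakers = sorted(recordings)
    for held_out in speakers:
        train = [s for spk in speakers if spk != held_out
                 for s in recordings[spk]]
        test = recordings[held_out]
        yield train, test

# 10 speakers with a few recordings each, as in the database used here.
data = {f"spk{i}": [f"spk{i}_utt{j}" for j in range(3)] for i in range(10)}
folds = list(speaker_folds(data))   # 10 train/test splits
```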
The usefulness of the delta features is first shown in Table 1,
which compares static features alone, delta features alone,
and static plus delta features, where 28 static and 28 delta
features were used alone or in combination.

Features used                  Recognition accuracy (%)
Static only (28 features)       7.2
Delta only (28 features)       28.1
Static + Delta (56 features)   20.2

Table 1: Recognition results using Static, Delta and
Static+Delta features
The effect of varying k in the delta(n-k) feature set on visual
speech recognition was considered first. Static transform
coefficients of dimension 15, 28 or 36 were first extracted.
Delta features were then computed as the difference between
the current frame and the kth previous frame, increasing the
feature vector dimension to 30, 56 or 72.
Figure 3 shows the variation of visual recognition for
different feature vector dimensions. Visual recognition was
found to be maximized for delta(n-k) where k is N/6 over all
feature vector dimensions, i.e. the delta or difference features
are computed between the current frame n and the frame
n-N/6. A feature vector dimension of 56 was found to provide
higher accuracy across all k than dimensions of 30 and 72.
The best visual recognition rate was 20.2%, which occurred
at k = N/6 with a feature vector dimension of 56.
Figure 3: Variation of visual recognition (%) as a function of
k for delta(n-k), for feature vector dimensions of 30, 56 and
72. The feature vectors include DCT coefficients and delta
features.

In Figure 4 we examine the use of feature vectors composed
of no transform coefficient features but only delta(n-k)
features. The exception is the No Delta case, where only 15,
28 or 36 transform coefficients were used as features. Results
are again found to be maximized for k = N/6, for all feature
vector dimensions. The maximum visual recognition rate
obtained was 28.1%, which occurred at k = N/6 with a
feature vector dimension of 28. Table 2 indicates the most
frequently confused words for k = 1, N/2 and N-1.

Figure 4: Variation of visual recognition (%) as a function of
k for delta(n-k), for feature vector dimensions of 15, 28 and
36. The feature vectors include only delta features.

In Figure 5 the different image transforms implemented, the
Discrete Cosine, Haar and Hadamard transforms, are
compared. Again, the results are obtained using 4 recordings
of the 10 speakers uttering the entire 78-word vocabulary.
The feature vector consisted of 28 delta features. As in
Figures 3 and 4, visual recognition was maximized at
k = N/6 for all three image transforms.

Figure 5: Variation of visual recognition (%) as a function of
k for delta(n-k), for the different transformations (DCT, Haar
and Hadamard). Feature vector dimension is 28, including
only delta features.

The results of AVASR across all SNR values are shown in
Figure 6. The visual recognition was performed on feature
vectors composed of 28 delta features.

Figure 6: Variation of audio, visual and audio-visual
recognition (%) as a function of audio SNR (-12 to 24 dB).
5. Discussion
The results in Table 1 show that the visual delta features
provide a much more consistent description of the isolated
utterance than either the static features or the combined static
and delta feature vector. Also, choosing the value of k in
delta(n-k) relative to the number of frames in the utterance,
N, consistently yields difference features between the same
segments of the utterance, for all instances of the utterance.
A number of factors contribute to the rest of the visual
speech recognition results obtained in Section 4. Firstly, the
effects on recognition of the level of detail included in the
feature vector were examined. Figure 3 illustrates the need
for a base level of detail to be contained in the feature vector.
A realistic feature vector dimension that maintains a good
level of discriminatory detail is required. Increasing the
number of transform coefficients increases the visual
recognition accuracy. However, the size of training data
available limits the possible feature vector dimensions for
good recognition. It is observed in Figure 3 that a feature
vector dimension of 56 outperforms that of 30 and 72. There
are insignificant differences in the recognition accuracies of
feature vector lengths of 30 and 72. Therefore 56 is a realistic
feature vector dimension that maintains good accuracy for
this database size.
The addition of the new delta features shows an increase in
visual speech recognition for all feature vector dimensions.
The number of frames per utterance in the database varies
considerably, so difference or delta features delta(n-k) with a
fixed k would be taken between different positions within the
utterances. What was required was a method of choosing the
delta frames relative to the number of frames per utterance.
Figures 3 and 4 show the results of choosing the value of k
relative to the number of frames N, e.g. k = 1, N/6, N/3, N/2,
N*2/3, N*5/6, N-1. Visual recognition is observed to be
maximal at k = N/6, i.e. for an utterance 30 frames in length,
k = 5.
As speech is analysed rapidly at 30 fps, the difference in
features between two consecutive frames, occurring only
0.03 seconds apart, does not provide sufficiently
discriminatory features. However, from Figure 3, visual
recognition is maximised where k is N/6. For an utterance 30
frames long, this corresponds to generating a feature vector
based on two frames approximately 0.17 seconds apart.
The different transforms implemented in Figure 5 show
highly comparable results. The DCT, the most widely
implemented image transform, performs slightly better than
the Haar and Hadamard transforms. This indicates that it is
not necessarily the type of transform chosen that provides
useful visual information; rather, it is how the features evolve
over time, through analysis with delta features, that provides
good discriminatory information. It is useful to note that
while the Haar and Hadamard transforms are much simpler
to implement than the DCT, they provide highly comparable
visual recognition results to those of the DCT.
Figure 4 examines the effects of using delta features only.
The transform coefficient features are not included in the
feature vector, which halves the feature vector dimension. It
can be seen that it is the delta features that provide robust
visual features for recognition, as the visual recognition
results in Figure 4 are significantly higher than in Figure 3:
the maximum recognition for feature vectors including both
transform coefficient features and delta features is 20.2%,
while from Figure 4 the maximum recognition for feature
vectors composed of delta features only is 28.1%.
The audio and visual modalities were integrated using
thresholded adaptive weighting. With a threshold of t = 0.3,
the audio-visual recognition is greater than both the
audio-only and visual-only recognition across all SNRs. In
highly noisy conditions (-12 dB), the audio-visual
recognition is 9% higher than audio-only recognition. Even
in the least noisy conditions (24 dB), the integrated
audio-visual results show an improvement of 1.5% over
audio-only recognition.
The audio-visual database employed in this study, while
containing a wide vocabulary, does not contain many visually
similar words. Further investigation is required into what
values of k for delta(n-k) should be used for more robust
visual speech recognition to distinguish between utterances
with visually similar phonemes. A feature vector including a
combination of delta feature sets with different values of k
could provide more robust recognition for visually similar
words.
6. Conclusion
Skills required for speechreading in humans have been
applied to automatic speechreading. The combination of
visual memory, good visual discrimination and rapid visual
recognition of speech movements was shown to improve
visual speech recognition. The novel delta features, which
provide the visual memory, were shown to improve visual
recognition over static features alone and over static plus
delta features computed between consecutive frames. The
maximum recognition for a vocabulary of 78 words using
feature vectors including both transform coefficient features
and delta features is 20.2%, while the maximum recognition
for feature vectors composed of delta features only was found
to be 28.1%. The different transforms implemented also show
highly comparable results, indicating that it is not necessarily
the type of transform chosen that provides useful visual
information, but rather the new delta features that improve
recognition.
The delta features are computed based on the number of
frames in the utterance; while this works well for isolated
word recognition, it does not translate directly to the task of
continuous speech recognition. However, these results
strongly indicate the importance of delta features for visual
speech recognition, and further investigation is required
comparing results using different delta/derivative kernels.
Future work will also involve testing these features on a
larger database and in continuous speech recognition
experiments.
7. Acknowledgements
The authors are grateful to Professor Tsuhan Chen, Carnegie
Mellon University, Pittsburgh, PA, for providing the
audio-visual database used in this study. The support of the
Informatics Research Initiative of Enterprise Ireland is
gratefully acknowledged.
8. References
[1] Summerfield, Q., "Lipreading and audio-visual
perception", Phil. Trans. R. Soc. London, 1972.
[2] McGurk, H., MacDonald, J., "Hearing lips and seeing
voices", Nature, vol. 264, 1976.
[3] Kaplan, H., Bally, S.J., Garretson, C., "Speechreading: A
way to improve understanding", revised 2nd edition,
Gallaudet University Press, 1985.
[4] Petajan, E.D., "Automatic lipreading to enhance speech
recognition", Proc. IEEE Comm. Society Global
Telecommunications Conference, Atlanta, Georgia, 1984.
[5] Hennecke, M., Stork, D., Prasad, K., "Visionary Speech:
Looking Ahead to Practical Speechreading Systems", in
Speechreading by Humans and Machines, Springer, 1996.
[6] Adjoudani, A., Benoit, C., "On the Integration of Auditory
and Visual Parameters in an HMM-based ASR", in
Speechreading by Humans and Machines, Springer, 1996.
[7] Teissier, P., Robert-Ribes, J., Schwartz, L.,
Guérin-Dugué, A., "Comparing models for audio-visual
fusion in a noisy-vowel recognition task", IEEE Trans. on
Speech and Audio Processing, vol. 7, no. 6, 1999.
[8] Potamianos, G., Cosatto, E., Graf, H.P., Roe, D.B.,
"Speaker independent audio-visual database for bimodal
ASR", Proc. European Tutorial Workshop on Audio-Visual
Speech Processing, Rhodes, pp. 65-68, 1997.
[9] Potamianos, G., Graf, H.P., Cosatto, E., "An Image
Transform Approach for HMM Based Automatic
Lipreading", Proc. Int. Conf. on Image Processing, Rhodes,
Greece, pp. 173-177, 1998.
[10] Matthews, I., Cootes, T.F., Bangham, J.A., Cox, S.,
Harvey, R., "Extraction of Visual Features for Lipreading",
IEEE Transactions on Pattern Analysis and Machine
Intelligence, 24(2):198-213, February 2002.
[11] Jain, A., "Fundamentals of Digital Image Processing",
Prentice Hall, 1989.
[12] Goldschen, A., Garcia, O., Petajan, E., "Continuous
optical automatic speech recognition by lipreading", 28th
Asilomar Conf. on Signals, Systems and Computers, 1994.
[13] Potamianos, G., Neti, C., "Improved ROI and within
frame discriminant features for lipreading", Proc. Int. Conf.
Image Processing, Thessaloniki, Greece, 2001.
[14] Gray, M.S., Movellan, J.R., Sejnowski, T.J., "Dynamic
features for visual speechreading: A systematic comparison",
Advances in Neural Information Processing Systems, vol. 9,
pp. 751-757, 1997.
[15] Frowein, H.W., et al., "Improved speech recognition
through video telephony: experiments with the hard of
hearing", IEEE Journal on Selected Areas in Communication,
vol. 9, no. 4, 1991.
[16] Young, S., Kershaw, D., Odell, J., Ollason, D., Valtchev,
V., Woodland, P., "The HTK Book (for HTK Version 3.0)",
Microsoft, 2000.
[17] Audio-Visual data corpus, Advanced Multimedia
Processing Lab, Carnegie Mellon University, Pittsburgh, PA,
USA.