Classification of visemes using visual cues

Nazeeh Alothmany


2010

Psycho-acoustic tests have indicated that human vision classifies the visemes (visual representation of phonemes) into different classes. This study shows that visual features extracted from 2-D images of lip motion can be used to design an automatic classifier for visemes. Audio-visual recordings from 18 native speakers of American English for 12 Vowel-Consonant-Vowel (VCV) sounds were obtained using the consonants /b,v,w,ð,d,z/ and the vowels /a,i/. The lip height, lip width, motion of the upper lip and the rate at which lips move while producing the VCV words were visual features used to represent each VCV sound. Features extracted from nine of the speakers were used to define Linear Discriminant Analysis functions to classify the visemes and features extracted from the remaining nine speakers were used in testing the classifiers. When the VCV sounds were divided into five classes consistent with those obtained from psycho-acoustic tests, the percentage of correct classification ...
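The speaker-based design described above (discriminant functions defined on nine speakers and tested on the other nine) can be illustrated with a minimal sketch. Everything numeric here is an assumption made for illustration: the data are synthetic, the six features are placeholders rather than the lip measurements used in the study, and the helper names `fit_lda`, `predict_lda` and `make_synthetic_recordings` are hypothetical. The classifier is a generic linear discriminant with a pooled covariance matrix, not necessarily the exact procedure used in the dissertation.

```python
import numpy as np

def fit_lda(X, y):
    """Fit linear classification functions using a pooled within-class covariance."""
    classes = np.unique(y)
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    centered = X - means[np.searchsorted(classes, y)]
    pooled = centered.T @ centered / (len(X) - len(classes))
    inv = np.linalg.pinv(pooled)
    priors = np.array([np.mean(y == c) for c in classes])
    weights = means @ inv  # one row of coefficients per class
    bias = -0.5 * np.sum(weights * means, axis=1) + np.log(priors)
    return classes, weights, bias

def predict_lda(model, X):
    """Assign each row of X to the class with the largest linear score."""
    classes, weights, bias = model
    scores = X @ weights.T + bias
    return classes[np.argmax(scores, axis=1)]

def make_synthetic_recordings(rng, n_speakers=18, n_classes=12, n_features=6, reps=5):
    """Synthetic stand-in for the recordings: class centre + speaker offset + noise."""
    centers = rng.normal(scale=2.0, size=(n_classes, n_features))
    rows, labels, speakers = [], [], []
    for s in range(n_speakers):
        speaker_offset = rng.normal(scale=0.5, size=n_features)
        for c in range(n_classes):
            for _ in range(reps):
                rows.append(centers[c] + speaker_offset + rng.normal(size=n_features))
                labels.append(c)
                speakers.append(s)
    return np.array(rows), np.array(labels), np.array(speakers)

rng = np.random.default_rng(0)
X, y, speaker = make_synthetic_recordings(rng)

# Speaker-based split: the first nine speakers train, the remaining nine test.
train = speaker < 9
test = ~train

model = fit_lda(X[train], y[train])
train_acc = np.mean(predict_lda(model, X[train]) == y[train])
test_acc = np.mean(predict_lda(model, X[test]) == y[test])
print(f"training accuracy: {train_acc:.3f}, testing accuracy: {test_acc:.3f}")
```

Splitting by speaker rather than by utterance keeps every test speaker unseen during training, which is why the testing percentage is the more meaningful figure for a speaker-independent classifier.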

CLASSIFICATION OF VISEMES USING VISUAL CUES

by

Nazeeh Shuja Alothmany

BSc, King Abdulaziz University, 1993
MS, University of Michigan, 1998

Submitted to the Graduate Faculty of Swanson School of Engineering in partial fulfillment of the requirements for the degree of Doctor of Philosophy

University of Pittsburgh
2009

UNIVERSITY OF PITTSBURGH
SWANSON SCHOOL OF ENGINEERING

This dissertation was presented by Nazeeh Shuja Alothmany

It was defended on April 17, 2009 and approved by

Ching-Chung Li, Professor, Department of Electrical and Computer Engineering
Luis F. Chaparro, Associate Professor, Department of Electrical and Computer Engineering
Amro El-Jaroudi, Associate Professor, Department of Electrical and Computer Engineering
John D. Durrant, PhD, Professor and Vice Chair of Communication Science and Disorders
Susan Shaiman, PhD, Associate Professor, Department of Communication Science and Disorders
Dissertation Director: J. Robert Boston, Professor, Department of Electrical and Computer Engineering

Copyright © by Nazeeh Shuja Alothmany 2009

CLASSIFICATION OF VISUAL VISEMES USING VISUAL CUES

Nazeeh Shuja Alothmany, Ph.D
University of Pittsburgh, 2009

Studies have shown that visual features extracted from the lips of a speaker (visemes) can be used to automatically classify the visual representation of phonemes. Different visual features were extracted from the audio-visual recordings of a set of phonemes and used to define Linear Discriminant Analysis (LDA) functions to classify the phonemes. Audio-visual recordings from 18 native speakers of American English for 12 Vowel-Consonant-Vowel (VCV) sounds were obtained using the consonants /b,v,w,ð,d,z/ and the vowels /ɑ,i/. The visual features used in this study were related to the lip height, the lip width, the motion of the upper lip, and the rate at which the lips move while producing the VCV sequences.
Features extracted from half of the speakers were used to design the classifier and features extracted from the other half were used in testing the classifiers. When each VCV sound was treated as an independent class, resulting in 12 classes, the percentage of correct recognition was 55.3% in the training set and 43.1% in the testing set. This percentage increased as classes were merged based on the level of confusion appearing between them in the results. When the same consonants with different vowels were treated as one class, resulting in 6 classes, the percentage of correct classification was 65.2% in the training set and 61.6% in the testing set. This is consistent with psycho-visual experiments in which subjects were unable to distinguish between visemes associated with VCV words with the same consonant but different vowels. When the VCV sounds were grouped into 3 classes, the percentage of correct classification was 84.4% in the training set and 81.1% in the testing set.

In the second part of the study, linear discriminant functions were developed for every speaker, resulting in 18 different sets of LDA functions. For every speaker, five VCV utterances were used to design the LDA functions, and 3 different VCV utterances were used to test these functions. For the training data, the range of correct classification for the 18 speakers was 90-100% with an average of 96.2%. For the testing data, the range of correct classification was 50-86% with an average of 68%. A step-wise linear discriminant analysis evaluated the contribution of different features towards the discrimination problem. The analysis indicated that classifiers using only the top 7 features in the analysis had a performance drop of 2-5%. The top 7 features were related to the shape of the mouth and the rate of motion of lips when the consonant in the VCV sequence was being produced.
Results of this work showed that visual features extracted from the lips can separate the visual representation of phonemes into different classes.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
PREFACE
1.0 INTRODUCTION
2.0 LITERATURE REVIEW
    2.1 AUDIO-VISUAL CORRELATION
    2.2 LIP READING
        2.2.1 Visemes and Phonemes
        2.2.2 Visual Perception of Phonemes
        2.2.3 Lip reading in Speech Recognition
    2.3 MISCELLANEOUS APPLICATIONS OF AUDIO-VISUAL SIGNAL PROCESSING
    2.4 SUMMARY
3.0 EXPERIMENTAL METHOD
    3.1 DATA RECORDING
        3.1.1 VP110 Motion Analyzer
        3.1.2 The Recording Procedure
        3.1.3 The Recorded Audio-Visual Data
    3.2 PRE-PROCESSING THE WAVEFORMS
        3.2.1 Generating Path files from Centroid files
        3.2.2 Removing the effect of head motion from every frame
        3.2.3 Generate the distance waveforms
        3.2.4 Obtaining Single Utterances
        3.2.5 Time-And-Amplitude Normalization
    3.3 FEATURE SELECTION AND EXTRACTION
        3.3.1 Detecting Extremas in the waveforms
        3.3.2 Features from the upper and lower lips distance waveform
        3.3.3 Features from the left and right lip corners distance waveform
        3.3.4 Features from the upper-lip distance waveform
    3.4 LINEAR DISCRIMINANT ANALYSIS
        3.4.1 Discriminant Analysis Model
        3.4.2 Linear Discriminant Analysis for Two Groups
        3.4.3 Linear Discriminant Analysis, C-classes
        3.4.4 Stepwise Discriminant Analysis
4.0 RESULTS
    4.1 TRAINING AND TESTING THE CLASSIFIER
        4.1.1 Speaker Based Training
            4.1.1.1 Speaker-Based Training with 12 Classes
            4.1.1.2 Speaker-Based Training with 6 Classes
            4.1.1.3 Speaker-Based Training with 5 Classes
            4.1.1.4 Speaker-Based Training with 3 Classes
        4.1.2 Testing Models developed by Speaker-based Training
        4.1.3 Word-Based Training
    4.2 STEP-WISE ANALYSIS
    4.3 SPEAKER SPECIFIC DISCRIMINATION
5.0 DISCUSSION
    5.1 SPEAKER-BASED VERSES WORD-BASED TRAINING
    5.2 PERFORMANCE FOR DIFFERENT NUMBER OF CLASSIFICATION CLASSES
    5.3 TESTING THE MODELS
    5.4 STEPWISE ANALYSIS
    5.5 SPEAKER SPECIFIC DISCRIMINATION
6.0 CONCLUSION
7.0 FUTURE WORK
APPENDIX A WORD-BASED TRAINING
    A.1 WORD-BASED TRAINING WITH 12 CLASSES
    A.2 WORD-BASED TRAINING WITH 6 CLASSES
    A.3 WORD-BASED TRAINING WITH 5 CLASSES
    A.4 WORD-BASED TRAINING WITH 3 CLASSES
APPENDIX B WORD-BASED TESTING
    B.1 WORD BASED TESTING WITH 12-CLASSES
    B.2 WORD BASED TESTING WITH 6-CLASSES
    B.3 WORD BASED TESTING WITH 5-CLASSES
    B.4 WORD BASED TESTING WITH 3-CLASSES
BIBLIOGRAPHY

LIST OF TABLES

Table 2-1 Acoustic and visual features used by Roland
Table 2-2 Faruqui’s phoneme-to-viseme mapping rule
Table 2-3 Kate et al. viseme to phoneme mapping rule
Table 2-4 Viseme to feature mapping
Table 2-5 Jintao phonemic equivalence classes
Table 2-6 Recognition rate of French vowels
Table 2-7 Phonemes visually confused with each other for different speakers (Kricos)
Table 2-8 Phonemes visually confused with each other (Benguerel)
Table 2-9 Visemes associated with different vowels (Owen)
Table 2-10 Dodd’s Viseme groups for English consonants
Table 2-11 Common viseme-to-phoneme mapping
Table 2-12 VCV sounds to be rerecorded
Table 3-1 Number of utterances for each word by all speakers
Table 3-2 Path assignment in the first frame of the centroid file
Table 3-3 Summary of the extracted visual features
Table 4-1 Testing equality of means for speaker based training analysis
Table 4-2 Test of equality of covariance matrix between groups
Table 4-3 Tests null hypothesis of equal population covariance matrices.
Table 4-4 Classification results and the confusion matrix 12-class speaker based
Table 4-5 Contribution of the discriminant functions towards the classification problem
Table 4-6 The contribution of each feature towards the discrimination (Structural matrix)
Table 4-7 Classification function coefficients
Table 4-8 Six classes resulting from combining words for the same vowel
Table 4-9 Classification results and the confusion matrix 6 class speaker based
Table 4-10 The contribution of each feature towards the discrimination (Structural matrix)
Table 4-11 Five classes resulting from combining /VdV/ with /VzV/
Table 4-12 Classification results and the confusion matrix 5 class speaker based
Table 4-13 The contribution of features towards the discrimination (Structural matrix)
Table 4-14 Three classes resulting from combining /VdV/ with /VzV/
Table 4-15 Classification results and the confusion matrix 3 class speaker based
Table 4-16 The contribution of each features towards the discrimination (Structural matrix)
Table 4-17 Testing the Fisher functions developed in speaker based training
Table 4-18 Confusion matrix for the 12 class testing phase
Table 4-19 Confusion matrix for the 6 class testing phase
Table 4-20 Confusion matrix for the 5 class testing phase
Table 4-21 Confusion matrix for the 3 class testing phase
Table 4-22 Structural matrix for word-based training with 12 classes
Table 4-23 Structural matrix for word-based training with 6 classes
Table 4-24 Structural matrix for word-based training with 5 classes
Table 4-25 Structural matrix for word-based training with 3 classes
Table 4-26 Comparing performance results between speaker based and word based training
Table 4-27 Testing speaker-based and word-based LDA functions
Table 4-28 Features in the order of their importance in different classes
Table 4-29 Classification performance with the top 7 features
Table 4-30 Range and average of correct discrimination for 18 speaker-specific models
Table 5-1 Intuitive meaning of the features used in the analysis
Table 7-1 Classification Function Coefficients
Table 7-2 Classification results for word-based training with 12-classes
Table 7-3 Classification function coefficients
Table 7-4 Classification results for word-based training with 6-classes
Table 7-5 Classification function coefficients
Table 7-6 Classification results for word-based training with 5-classes
Table 7-7 Classification function coefficients
Table 7-8 Classification results for word-based training with 3-classes
Table 7-9 Classification results for word-based testing with 12 classes
Table 7-10 Classification results for word-based testing with 6 classes
Table 7-11 Classification results for word-based testing with 5 classes
Table 7-12 Classification results for word-based testing with 3 classes

LIST OF FIGURES

Figure 1-1 Points of focus around the lips in Chen's audio-visual data base
Figure 1-2 Steps of designing an automatic viseme classifier based on visual cues
Figure 2-1 Placement of 20 optical markers on speaker face
Figure 2-2 Representative images for six major viseme classes
Figure 3-1 Experimental setup for audio-visual data recording
Figure 3-2 Location of optical reflectors and coordinate system for tracking of lip movements
Figure 3-3 Waveforms associated with each reflector
Figure 3-4 Motion of forehead reflector in consecutive frames
Figure 3-5 Upper/Lower distance waveform for “aba” with the audio signal superimposed
Figure 3-6 Four distance waveforms associated with the VCV word “ɑbɑ”
Figure 3-7 Four distance waveforms associated with the VCV word “ɑðɑ”
Figure 3-8 Four distance waveforms associated with the VCV word “iði”
Figure 3-9 Four distance waveforms associated with the VCV word “ɑwɑ”
Figure 3-10 Broken utterances for the word /ɑbɑ/ together with the mean for each speaker
Figure 3-11 Distance waveforms associated with ten utterances of /ɑbɑ/ before normalization
Figure 3-12 Amplitude normalization by dividing over the maximum value
Figure 3-13 Ten word utterances after applying the Ann Smith normalization technique
Figure 3-14 Standard deviation at each point of the 8 utterances in the 3 distance waveforms
Figure 3-15 Features extracted from upper/lower distance waveform
Figure 3-16 Features extracted from lip corners distance waveform
Figure 3-17 Amplitude features extracted from the upper-lips waveform
Figure 3-18 Projection of data on a line (a) Poor separability (b) Good separability
Figure 4-1 Effect of adding features on the classification performance
Figure 4-2 Training and testing results for every speaker (3 class configuration)
Figure 4-3 Training and testing results for every speaker (5 class configuration)
Figure 4-4 Training and testing results for every speaker (6 class configuration)
Figure 4-5 Training and testing results for every speaker (12 class configuration)

PREFACE

I write these words wishing that my father had lived to see this day. He was my main motivation for pursuing graduate work and he always wished that all of his children would manage to obtain high degrees. His dream became a reality and I hope that his soul will rest in peace.

I would like to thank my advisor Dr. Robert Boston for the support and guidance he provided me throughout my Ph.D. I would also like to express my gratitude to all my research committee members for the time and valuable feedback they have given me.
I would like to thank the Electrical and Computer Engineering Department at King Abdulaziz University in Jeddah, Saudi Arabia, for providing me with a scholarship to pursue my graduate degree, and I look forward to joining the department as a faculty member.

I would like to thank my older brother Dr. Dheya Alothmany for his constant support and encouragement during my stay in the USA. I would also like to thank my mother Sabiha, who kept praying for my success day and night.

I would not have been able to complete my work if it had not been for God’s blessings first, and then the enormous support of my wife Souad Rahmatullah. She stood by me through difficult times and encouraged me to keep my spirit high. My lovely children Danyah, Hamzah, Mouaz and Zayd can now get ready to go back home.

I dedicate this dissertation to all those who contributed towards the success of this work.

1.0 INTRODUCTION

A human being has five senses: sight, hearing, touch, smell, and taste. One can acquire more exact information on the surrounding environment by integrating cues obtained from different senses. This merging of cues from different senses in humans has led many researchers to investigate how combining several modalities or sensor outputs affects the performance of automated systems that currently use a single modality.

Hearing impairment is a common handicap, and amplification devices to compensate for it date back several centuries. These amplification devices are often called hearing aids, and their aim is to maximize speech understanding for individuals with hearing impairment.
However, some researchers believe that the performance of these devices in noisy environments has not yet reached a satisfactory level that justifies the cost to the patient [1]. Alternatively, a person skilled in lip reading is able to infer the meaning of spoken sentences by looking at the configuration and the motion of the visible articulators of the speaker, such as the tongue, lips, and teeth, together with cues from the context. Lip reading is widely used by hearing impaired persons for speech understanding. In addition to lip reading, facial expressions and body language can be used to assist in aural communication [2]. Sumby and Pollack [3] showed that adding visual information to acoustic waveforms is equivalent to a 12 dB increase in the signal-to-noise ratio (SNR).

One of the important studies that attempted to investigate the relation between acoustic and visual information was conducted by McGurk [4]. The study showed that presenting a viewer with conflicting audio-visual recordings of a certain word results in the wrong perception of the sound. This study demonstrates that vision can play a role in speech perception. Calvert [5] confirmed the McGurk effect by using functional Magnetic Resonance Imaging (fMRI) to show that the speech perception center in the brain analyzes speech-like lip motions even when no sound is present.

It is not yet clear how the brain combines visual information with audio information to understand speech. It is also not clear what kind of visual cues are utilized by the brain in this process. Therefore, many studies have attempted to explore different methods of combining visual cues with acoustic information, as well as different types of visual cues to represent the speech [6-10].

In some applications, such as lip reading and screening of security camera recordings, there is no access to audio. In other applications, an audio signal might be present but severely corrupted.
Having automatic viseme classifiers based on visual cues will help in narrowing down the list of possible spoken phonemes. Furthermore, the identification of visemes might be useful to adjust the parameters of hearing aid filters for better performance in situations where more than one person is talking.

The initial objective of the present study was to investigate whether utilizing visual information in conjunction with a hearing aid device would enhance the performance of the device in noisy environments. To further investigate this objective, audio-visual recordings were obtained from the Advanced Multimedia Lab at Carnegie Mellon University [11]. The data consisted of recordings of the change in the x-y coordinates of the lip corners, together with the lower and upper lip heights, in consecutive frames of video recordings of subjects repeating different words. The points of interest on the lips in that database are shown in Figure 1-1. The consistency of the visual information accompanying the utterance of specific words across different speakers was evaluated.

Figure 1-1 Points of focus around the lips in Chen's audio-visual data base

The data set showed that a repeated pattern for the same word exists within the same speaker and even across speakers. However, since these data were for connected speech, it was difficult to quantify and model these patterns. In order to build and systematically evaluate models of lip motion patterns, models should first be built for smaller blocks of speech, then extended to connected speech.

Phonemes are the basic units of speech in the acoustic/auditory domain. The term “viseme” has been introduced as an analogy to represent the visual representation of the phoneme [12]. A viseme describes the particular facial and oral movements that occur with the voicing of a phoneme, and visemes are considered the smallest distinguishable visual units of speech [13-17].
Researchers concerned with speech production and lip reading have obtained different viseme-to-phoneme mappings and identified phonemes that are visually confused with one another. One of the important studies in this area was conducted by Owen and Blazek [18]. The group presented video recordings of vowel-consonant-vowel (VCV) sounds for 23 English consonants to 10 subjects. Subjects observed video sequences of speakers (without audio) saying a certain sound. Based on the motion patterns of the visual articulators (lips, tongue, throat, teeth, and facial emotions) they observed, the subjects attempted to identify the produced sound. Results from all subjects were gathered in matrices with rows representing the actual VCV sequences and columns representing the responses of the subjects. These matrices were called confusion matrices. When a cluster of confused sequences appeared in the matrix, a 75% response criterion was used as a requirement for considering these phonemes to have the same visual representation (i.e. viseme). The viseme-to-phoneme mapping obtained by Owen is consistent with mappings obtained by many other researchers working in speech production as well as audio-visual signal processing, as discussed in Section 2.2.

The objective of the current study is to analyze the feasibility of using a set of visual features extracted from the 2D images of lip motion in an automated classification system to classify sounds into different visemes. Different visual features from the audio-visual recordings of a set of phonemes were extracted and used in a stepwise linear discriminant analysis to identify which visual features are most effective in distinguishing between the different visemes.

Figure 1-2 Steps of designing an automatic viseme classifier based on visual cues

Figure 1-2 shows a block diagram representing the steps involved in reaching the objective. The figure has three major blocks, with feedback loops between them.
The first step in designing a visual classifier of phonemes is to develop an audio-visual database consisting of recordings of speakers uttering different phonemes. Visual features are extracted from the video images and used to train and test a visual classifier. The parameters of the classifier were modified based on the feedback coming from the results of analyzing the classification output.

Chapter Two of this thesis reviews the literature available on different audio-visual applications and concludes with a summary of the objective of the study based on that review. Chapter Three describes the experimental setup. Chapter Four presents the results of the experiments conducted. Chapter Five discusses the results. The conclusion is presented in Chapter Six and suggestions for future work are presented in Chapter Seven.

2.0 LITERATURE REVIEW

There has been a wide range of studies in the area of audio-visual signal processing. In each study, different visual features as well as different classifiers have been chosen and tested for performance. Despite this extensive research, many research questions remain open in this area, for example [13]:

1. Which facial features are important? How are geometric features such as lip height and width related to non-geometric features such as discrete cosine transform of the mouth image?
2. What methods offer an effective means of using visual information and audio information together for speech comprehension?
3. How can visual cues such as face pose and gaze be effectively used to direct the attention of audio speech recognition to enhance the robustness of the audio signal?

This chapter summarizes the different approaches used by many researchers working in audio-visual applications. Section 2.1 discusses work done in studying the correlation between the audio and visual signals. Section 2.2 focuses on work in the area of mapping visemes to phonemes.
Section 2.3 presents some miscellaneous audio-visual applications and Section 2.4 summarizes the review and re-states the objectives of this study based on the literature review.

2.1 AUDIO-VISUAL CORRELATION

Many researchers have investigated the correlation between the visual articulators and the sounds they produce. Many parameters have been used in these studies to represent both visual and audio signals.

Hani et al. [19] examined the linear association between the vocal tract configuration, facial behavior and speech acoustics. They applied linear estimation techniques to support the claim that facial motion during speech is largely a byproduct of producing the speech acoustics. The experimental data they used included measurements of speech acoustics and of the motion of markers placed on the face and in the vocal tract for two subjects. The numerical results showed that, for both subjects, 91% of the total variance observed in the facial motion data could be explained by vocal-tract motion by means of simple linear estimators. For the inverse path, i.e. recovery of vocal-tract motion from facial motion, their results indicated that about 80% of the variance observed in the vocal tract can be estimated from the face. Regarding speech acoustics, they observed that, in spite of the nonlinear relation between vocal-tract geometry and acoustics, linear estimators are sufficient to explain between 72 and 85% (depending on subject and utterance) of the variance observed in the RMS amplitude of the spectral envelope.

J. Barker [20] showed that there is a correlation between the linear estimate of acoustics from lip and jaw configuration and the speech acoustics itself. In his study, the lip and jaw movements were characterized by measurements taken from video images of the speaker’s face, and the acoustics were characterized using spectral pair parameters and a measure of RMS energy.
The speech acoustics estimated from the lip and jaw configurations had a correlation of 0.75 with the actual speech acoustics.

Ezzat and Poggio [21] designed a system built on a set of images spanning a wide range of mouth shapes. They attempted to correlate those images with the phonemes of a speech signal, with the aim of using the correlation results to animate lips that move according to the speech signal. Their system takes a sentence typed on a keyboard and produces an audio-visual movie of a face enunciating that sentence. The system is called Mike Talk, and videos of the results are available on their website [22]. Their results indicated a correlation between the lips and the acoustics produced.

Roland et al. [23] investigated the statistical relationship between acoustic and visual speech features for vowels. Their study used an audio-visual speech corpus recorded in Australian English. The acoustic features were the voice source excitation frequency F0, the formant frequencies F1-F3, and the RMS energy, while the visual features were extracted from the 3D positions of the two lip corners and the midpoints of the upper and lower lips, as shown in Table 2-1. Several strong correlations were reported between acoustic and visual features. In particular, F1, F2, and mouth height were strongly correlated.

Table 2-1 Acoustic and visual features used by Roland
Acoustic feature | Visual feature
Voice source excitation F0 | Mouth height
Formant frequency F1 | Mouth width
Formant frequency F2 | Lip protrusion
Formant frequency F3 |

The studies presented in this section indicate that the audio and visual signals are correlated, which justifies the attempt to model the visual representation of phonemes. Most of these studies focused on visual cues related to the mouth and lips, which led us to focus on visual features related to the mouth area in designing the first block shown in Figure 1-2.
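The linear estimation results summarized in this section are reported as a share of variance explained. The following is a minimal sketch, not taken from any of the cited studies, of how a one-dimensional least-squares estimator and its explained variance could be computed. The function names and the sample values are hypothetical and serve only as illustration; the cited work used multidimensional estimators.

```python
def fit_line(x, y):
    """Least-squares fit of y ~ a * x + b for two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


def variance_explained(x, y):
    """Fraction of the variance of y accounted for by the linear estimate."""
    a, b = fit_line(x, y)
    mean_y = sum(y) / len(y)
    ss_total = sum((yi - mean_y) ** 2 for yi in y)
    ss_residual = sum((yi - (a * xi + b)) ** 2 for xi, yi in zip(x, y))
    return 1.0 - ss_residual / ss_total


# Hypothetical values: a visual measure (e.g. mouth height) per frame and
# an acoustic measure (e.g. RMS energy) for the same frames.
mouth_height = [1.0, 2.0, 3.0, 4.0, 5.0]
rms_energy = [2.1, 3.9, 6.2, 7.8, 10.1]
share = variance_explained(mouth_height, rms_energy)
```

A value of `share` close to 1 would indicate that the visual measure predicts the acoustic measure almost completely under a linear model.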
2.2 LIP READING

A person skilled in lip reading is usually able to infer the meaning of spoken sentences by looking at the configuration and motion of the speaker's visible articulators, such as the tongue, lips, and teeth. Lip reading is widely used by hearing-impaired persons for speech understanding. However, lip reading is effective only if the speaker is observed from the frontal view. In addition, lip reading becomes difficult if more than one person is talking at the same time, because a lip reader can focus on only one speaker at a time. This section reviews some of the concepts involved in lip reading as well as research on the visual identification of phonemes.

2.2.1 Visemes and Phonemes

The Webster English dictionary defines phonemes as abstract units of the phonetic system of a language that correspond to a set of similar speech sounds which are perceived to be a single distinctive sound in the language [24]. An example of a phoneme is the /t/ sound in the words "tip", "stand", "water", and "cat". Since the number of consonants in the world's languages is larger than the number of consonant letters in any one alphabet, linguists have devised systems such as the International Phonetic Alphabet (IPA) to assign a unique symbol to each consonant [25]. The Longman Pronunciation Dictionary by John C. Wells [24], for example, uses the symbols of the International Phonetic Alphabet and notes that American English has 25 consonants and 19 vowels, with one additional consonant and three additional vowels for foreign words.

The term "viseme" combines the words "visual" and "phoneme" [12]. Visemes are the visual representations of lip movements corresponding to speech segments (phonemes), and they are considered the smallest unit of speech that can be visually distinguished [13], [14], [15], [16], [17].
The mapping between visemes and phonemes is many-to-many, meaning that one viseme may correspond to more than one phoneme and the same phoneme can correspond to multiple visemes. This happens because the neighboring phonemic context in which a sound is uttered influences the lip shape for that sound. For example, the viseme associated with /t/ differs depending on whether the speaker is uttering the word "two" or the word "tea". In the former case, the /t/ viseme assumes a rounded shape in anticipation of the upcoming /uu/ sound, while in the latter it assumes a more spread shape in anticipation of the upcoming /ii/ sound [18, 26]. Researchers have developed many mappings between visemes and phonemes, which are discussed in the remainder of this section.

Faruqui et al. [27] used a map between Hindi phonemes and 12 visemes, where several phonemes were mapped to one viseme. The mapping shown in Table 2-2 was used to animate a face with lips moving in synchrony with an incoming audio stream: once the incoming audio signal was recognized, the mapping selected the corresponding viseme to be shown to the observer. The paper did not explain how this mapping was obtained, but it stated that the animated faces were shown to different observers and that the perception rates were promising. Verma et al. [28] also used this mapping with a modified scheme for synchronizing the audio and visual signals. They applied the speech to a recognizer that generated a phonetic sequence, which was then converted to the corresponding viseme sequence using the mapping in Table 2-2. Verma also stated that he would attempt to extend this work to English phonemes.

Table 2-2 Faruqui's phoneme-to-viseme mapping rule
Phoneme | Viseme No. | Phoneme | Viseme No.
a,h | 1 | g,k,d,n,t,y | 7
e,i | 2 | f,v,w | 8
l | 3 | h,j,s,z | 9
r | 4 | sh,ch | 10
o,u | 5 | th | 11
p,b,m | 6 | Silence | 12

Saenko et al. [29] attempted to use articulatory features to model visual speech.
They presented another mapping of English phonemes to 14 visemes, as shown in Table 2-3.

Table 2-3 Saenko et al. viseme-to-phoneme mapping rule
Viseme Index | Corresponding Phoneme | Viseme Index | Corresponding Phoneme
1 | ax ih iy dx | 8 | b p
2 | ah aa | 9 | bcl pcl m em
3 | ae eh ay ey hh | 10 | s z epi tcl dcl n en
4 | aw uh uw ow ao w oy | 11 | ch jh sh zh
5 | el l | 12 | t d th dh g k
6 | er axr r | 13 | f v
7 | y | 14 | gcl kcl ng

Saenko's group wanted to design a classifier that would identify some of the phonemes using the four visual features related to lip shape shown in Table 2-4.

Table 2-4 Viseme to feature mapping
Viseme | Lip-Open | Lip-Round
/ao/ | Wide | Yes
/ae/ | Wide | No
/uw/ | Narrow | Yes
/dcl/ | Narrow | No

Saenko's group developed an audio-visual database consisting of two speakers and conducted preliminary experiments to classify the phonemes using the features listed in Table 2-4. They obtained classification rates above 85%. The system they developed focused on vowel identification, and the testing was done on two subjects only.

Jintao et al. [30, 31] worked on visually classifying consonants based on physical visual measures. They developed an audio-visual database consisting of four speakers producing 69 consonant-vowel (CV) syllables. The video recordings from the database were presented to six viewers with average or better lip reading abilities, and visual confusion matrices were obtained. Consonants that were most commonly visually confused with each other were grouped together as one visual unit, as shown in Table 2-5. Studies conducted by this group included placing 20 optical markers on the face of a speaker, as shown in Figure 2-1. A motion detector designed by Qualisys tracked the 3D positions of these markers. The output of the detector had 51 points for every marker per frame. These points were arranged in matrices, and Euclidean distances between the points in consecutive frames were calculated and used as visual features.
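The inter-frame distance feature described above can be illustrated with a short sketch. It assumes one marker tracked as a list of (x, y, z) positions, one per frame; the function name and the data layout are hypothetical and are not the representation used by the cited group.

```python
import math


def interframe_distances(track):
    """Euclidean distance travelled by one marker between consecutive frames.

    `track` is a list of (x, y, z) tuples, one per video frame.
    The result has one entry fewer than the number of frames.
    """
    return [math.dist(p, q) for p, q in zip(track, track[1:])]


# Hypothetical three-frame track of a single marker.
marker_track = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (3.0, 4.0, 12.0)]
distances = interframe_distances(marker_track)
```

Repeating this per marker and stacking the results gives one feature vector per frame transition, which is one plausible way to feed a clustering-based classifier.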
These features were then used to train a clustering-based classifier using the classes shown in Table 2-5. The recognition accuracy was 38.4% for the spoken CV sequence /Ca/, 36.1% for /Ci/, and 36.0% for /Cu/. This study is very close to the objectives of our project, but it was done only on two speakers. In addition, it required the 3D coordinates of the points of focus shown in Figure 2-1. This limits the applicability of the approach, since 3D coordinates are not available in many applications, such as video conferencing, telephony, and satellite images.

Table 2-5 Jintao phonemic equivalence classes
Visual Unit | Confused Consonants
1 | /m,b,p/
2 | /f,v/
3 | /r/
4 | /w/
5 | /θ,ð/
6 | /ʒ,ʃ,dʒ,tʃ/
7 | /t,d,s,z/
8 | /l,n/
9 | /k,g,y,h/

Figure 2-1 Placement of 20 optical markers on speaker face

Warda et al. [32] attempted to classify the visemes associated with three French syllables, /ba/, /bi/, and /bou/. The group used the lip corners and the center points of both the upper and lower lips as visual features, which were entered into a neural network to distinguish between the video recordings of the three sounds. The study did not mention the number of speakers involved. The results are shown in Table 2-6.

Table 2-6 Recognition rate of French vowels
Recognition Rate | Training Set | Testing Set
ba | 63.33% | 63.64%
bi | 73.33% | 72.73%
bou | 83.33% | 81.82%
Average | 73.33% | 72.73%

Leszcynski et al. [33, 34] used three classification algorithms for visemes obtained from the CORPORA database, which consists of audio-visual recordings in Polish. The group used two different sets of features to describe the visemes. The first was based on a normalized triangle covering the mouth area and the color image texture vector indexed by barycentric coordinates. The second performed a 2D Discrete Fourier Transform (DFT) on the rectangular image including the mouth area with respect to small blocks of DFT coefficients.
The classifiers in their work were based on Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA). The group reported that DFT+LDA exhibits higher recognition rates than the MESH+LDA and MESH+PCA methods: 97.6% versus 94.4% and 90.2%, respectively. It is also much faster than MESH+PCA. The group obtained a 94% recognition rate for the six classes shown in Figure 2-2, but did not associate those visemes with corresponding phonemes.

Figure 2-2 Representative images for six major viseme classes

Huang and Chen [35] used Gaussian mixture models (GMM) and hidden Markov models (HMM) to map an audio parameter set to a visual parameter set, in a technique aiming to synthesize mouth movements from acoustic speech. In this technique, the visual information was represented by lip location (width and height of the outer contour of the mouth), while 13 spectral coefficients extracted from the acoustic speech represented the audio data. Audio and visual features were combined to form a single feature vector that was applied to the GMM and HMM. Huang and Chen reported smooth and realistic lip movements with this system. However, the system assumed that both audio and visual information are available, which might not always be the case.

The studies reviewed in this section presented different viseme-phoneme mappings and their applications. The common visual cues in these applications were related to the motions and positions of the lips.

2.2.2 Visual Perception of Phonemes

Researchers concerned with speech production and lip reading have obtained different viseme-phoneme mappings and identified phonemes that are visually confused with one another. This section covers the work done in this area.

Kricos [36] conducted a study with 12 female students who had normal hearing and no experience in lip reading.
These subjects were presented with black and white video recordings of six different speakers repeating VCV sounds involving 23 English phonemes and 3 vowels. Subjects were asked to identify the consonant being shown. The responses were used to generate confusion matrices for every speaker. Phonemes that were confused with each other for more than 75% of the utterances were grouped together and considered to have the same visual representation (i.e. viseme). Results from this study are shown in Table 2-7.

Another study, by Benguerel [37], used video recordings of VCV sounds including the consonants /p,t,k,tʃ,f,θ,s,ʃ,w/ and the vowels /i/, /æ/, or /u/. These recordings, obtained from a single female speaker, were presented to 10 subjects. Five of those subjects were hearing impaired, while the remaining five had normal hearing. All subjects were asked to identify the consonant being shown on the video monitor. Consonants that were confused with each other for more than 75% of total utterances were considered to have the same visual representation. Their results are shown in Table 2-8.

Table 2-7 Phonemes visually confused with each other for different speakers (Kricos)
Speaker 1 Speaker 2 Speaker 3 Speaker 4 Speaker 5 Speaker 6
/p,b,m/ /p,b,m/ /p,b,m/ /p,b,m/ /p,b,m/ /p,b,m/ /f,v/ /f,v/ /f,v/ /f,v/ /f,v,s,z/ /w,r,ð,θ/ /w,r/ /w,r/ /w,r/ /w,r/ /w,r/ /ʃ,ʒ,tʃ,dʒ/ /ð,θ/ /ð,θ/ /ð,θ/ /ð,θ/ /ʃ,ʒ,tʃ,dʒ/ /t,d,s,z,n,l,j,h/ /ʃ,ʒ,tʃ,dʒ/ /ʃ,ʒ,tʃ,dʒ/ /ʃ,ʒ,tʃ,dʒ/ /ʃ,ʒ,tʃ,dʒ/ /t,d,s,z/ /t,d,s,z/ /k,g/ /t,d,s,z/ /l/ /l/ /k,n,j,h/ /k,g,n,j,h/

Table 2-8 Phonemes visually confused with each other (Benguerel)
Normal Hearing | Hearing Impaired
/p/ | /p/
/f/ | /f/
/w/ | /w/
/θ/ | /θ/
/tʃ,ʃ/ | /tʃ,ʃ/
/t,k,s/ | /t,k,s/

Owen and Blazek [18] extended the studies made by Benguerel and Kricos. They used 10 subjects, 5 hearing impaired and 5 with normal hearing. All subjects were presented with video recordings of vowel-consonant-vowel (VCV) sequences without sound.
They used 23 English consonants /p,b,m,f,v,t,k,tʃ,f,θ,ð,s,ʃ,w,r,dʒ,ʒ,d,s,z,g,n,l,h,j/ and 4 vowels /ɑ/, /i/, /u/, and /ʌ/. Subjects were asked to identify the consonant shown on video. The highest overall correct score was 46%. Results from all subjects were gathered in confusion matrices. When a cluster appeared in the matrix, a 75% response criterion was used as the requirement for considering these phonemes to have the same visual representation (i.e. viseme). This criterion resulted in different viseme classes, as shown in Table 2-9.

Table 2-9 Visemes associated with different vowels (Owen)
Viseme Class | /ɑ/C/ɑ/ | /ʌ/C/ʌ/ | /i/C/i/ | /u/C/u/
Class 1 | /p,b,m/ | /p,b,m/ | /p,b,m/ | /p,b,m/
Class 2 | /f,v/ | /f,v/ | /f,v/ | /f,v/
Class 3 | /θ,ð/ | /θ,ð/ | /θ,ð/ |
Class 4 | /w,r/ | /w,r/ | /w,r/ |
Class 5 | /tʃ,dʒ,ʃ,ʒ/ | /tʃ,dʒ,ʃ,ʒ/ | /tʃ,dʒ,ʃ,ʒ/ |
Class 6 | /k,g,n,l/ | | |
Class 7 | /h/ | /t,d,s,z/ | /t,d,s,z/ |

The viseme clustering was consistent across all four vowels for the first two classes. For the vowels /ɑ/, /ʌ/, and /i/, the clustering was consistent across all three vowels for the first five classes.

Dodd [38] reviewed the literature available on viseme classification and concluded that visemes can generally be classified into nine distinct groups, as shown in Table 2-10.

Table 2-10 Dodd's viseme groups for English consonants
Viseme # | Consonants | Viseme # | Consonants
1 | /f,v/ | 6 | /w/
2 | /th,dh/ | 7 | /r/
3 | /s,z/ | 8 | /g,k,n,t,d,y/
4 | /sh,zh/ | 9 | /l/
5 | /p,b,m/ | |

The studies presented in this section show some of the attempts made to obtain mappings between phonemes and visemes. Most of the work was based on human responses. The consistency of these mappings across different subjects motivated us to investigate whether this consistency can be captured by an automatic classifier based on visual cues.

A comparison of the results in Tables 2-7 through 2-10 with the viseme-to-phoneme mappings in Tables 2-2, 2-3, and 2-5 shows that many phonemes are consistently grouped together in one viseme.
These common assignments are summarized in Table 2-11.

Table 2-11 Common viseme-to-phoneme mapping
Viseme Class | Associated phonemes
1 | /p,b,m/
2 | /f,v/
3 | /θ,ð/
4 | /w,r/
5 | /tʃ,dʒ,ʃ,ʒ/
6 | /t,d,s,z/
7 | /l/

The objective of this study is to classify visemes based on visual cues. The study involves conducting experiments on a set of audio-visual recordings of specific sounds. Since the viseme classes for some phonemes are consistent across different studies, as shown in Table 2-11, a sample representing each class in Table 2-11 will be included in the audio-visual data set used for this study.

2.2.3 Lip reading in Speech Recognition

One of the major areas of application for visual cues is audio-visual speech recognition. Several researchers have worked on incorporating lip motion information into systems that were originally based on audio information, hoping to enhance the performance of these systems. This section covers some of the work done in this area.

In a study of audio-visual signal processing, Chen [13] stated that an automatic lip reading system involves three core components: visual feature extraction, audio feature extraction, and the recognizer or classifier. The automatic lip reading system proposed by Chen used both visual and audio information. Lip movements were used as the visual features. The audio signal was divided into frames and converted into sixteenth-order linear prediction coding (LPC) coefficients to create the audio features. The audio and visual features were combined in one vector and applied to a hidden Markov model (HMM) for final recognition. For the lip-tracking phase, Chen modeled the color distribution of the face pixels and of the background and then used a Gaussian function to extract the face of the speaker. After extracting the face, a template resembling the shape of the lips was used to extract the corners and height of both the upper and lower lips. This process was repeated for every image frame.
One of the limitations of this technique was that the speaker needed to be in front of the camera. Chen compared the performance of an HMM-based speech recognizer with audio-only input, audio-visual input, and visual-only input. The audio signals were corrupted with additive white Gaussian noise at various SNRs ranging from 32 dB to almost 16 dB. The study showed that at an SNR of 16 dB the recognition rate of the audio-visual system was almost four times higher than that of the audio-based system. The difference between the recognition rates decreased as the SNR increased, but the audio-visual system consistently had higher values. At an SNR of 32 dB, both systems performed at about the same rate. Chen concluded that automatic lip reading could enhance the reliability of speech recognition. He added that, through lip synchronization with acoustics, it would be possible to render realistic talking heads with lip movements synchronized with the voice.

Luettin and Dupont [39] stated that the main approaches for extracting visual speech information from image sequences can be grouped into image-based, geometric-feature-based, visual-motion-based, and model-based approaches. In the image-based approach, the gray-level image of the mouth is used as a feature vector, either directly or after some preprocessing. The visual-motion-based approach assumes that visual motion during speech production, such as lip movement, contains relevant speech information. The geometric-feature-based approach assumes that certain measures, such as the width or height of the mouth opening, are important features. In the model-based approach, a model of the visible speech articulators, usually the lip contours, is built and described by a small set of parameters.
The group developed a large-vocabulary continuous audio-visual speech recognizer for Dutch using different representations of visual cues and showed that a combined audio-visual recognition system improves upon audio-only recognition in the presence of noise.

Petjan [40] also developed an audio-visual speech recognizer, which used lip height and width as visual cues applied to the recognizer together with the acoustic waveform. Petjan's results confirmed Chen's claim that higher recognition rates are obtained with the addition of visual cues.

Some researchers have used the image of the entire mouth area as a visual feature applied to a speech recognizer together with audio cues [41, 42]. Li et al. [42] used eigenvector analysis in lip reading. In the training part of their approach, they formed a vector consisting of all gray-level values of the pixels representing the mouth in all frames of a sequence representing one spoken letter of the English alphabet. Next, they formed a training matrix containing several such vectors and computed eigenvectors for each letter in the alphabet. In the classification stage, a sequence representing an unknown letter was projected onto the eigenspace model of each letter, and a projection close to 1 represented a match. Li and his group applied their technique to ten spoken letters (A-J) using one person only. The recognition success rate varied from 90% to 100% depending on the letter to be recognized.

Potaminos and Chalapathy [43] investigated the use of visual mouth-region information to improve automatic speech recognition. The visual information in the system was represented by the highest 24 coefficients from a Discrete Cosine Transform (DCT) of a 64x64 pixel region of interest containing the speaker's mouth. The audio part of the signal consisted of 24 cepstral coefficients of the acoustic speech signal. Both sets of features were combined in a vector that was then applied to an HMM.
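The DCT-based visual front end and the feature-level fusion described above can be sketched as follows. This is an illustrative sketch only and not the cited system: it uses an unnormalized DCT-II, reads "highest" as "largest magnitude" (an assumption), and works on a small hypothetical 8x8 block in place of the 64x64 region. All function names are made up for this example.

```python
import math


def dct_1d(values):
    """Unnormalized DCT-II of a 1-D sequence."""
    n = len(values)
    return [
        sum(v * math.cos(math.pi * (i + 0.5) * k / n) for i, v in enumerate(values))
        for k in range(n)
    ]


def dct_2d(block):
    """Separable 2-D DCT-II: transform every row, then every column."""
    rows = [dct_1d(row) for row in block]
    cols = [dct_1d(col) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]


def top_k_dct_features(block, k):
    """Keep the k coefficients of largest magnitude (assumed reading of 'highest')."""
    coefficients = [c for row in dct_2d(block) for c in row]
    return sorted(coefficients, key=abs, reverse=True)[:k]


def fuse(audio_features, visual_features):
    """Feature-level fusion: concatenate the audio and visual features of one frame."""
    return list(audio_features) + list(visual_features)


# Hypothetical 8x8 gray-level block standing in for the mouth region.
mouth_region = [[(r * c) % 7 for c in range(8)] for r in range(8)]
visual = top_k_dct_features(mouth_region, 24)
audio = [0.0] * 24  # placeholder for 24 cepstral coefficients
observation = fuse(audio, visual)
```

A practical front end would fix which coefficient positions are kept, so that each entry of the vector has the same meaning in every frame; the sketch keeps only the values to stay short.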
Incorporating the visual information improved the SNR by 61% over audio-only processing.

Another popular visual feature used in speech recognition was based on visual cues from lip and jaw movements. Paul et al. [15] designed a speech recognition system that incorporated lip reading information with the acoustic signal to improve speech recognition. The image of the face was supplied to a neural network that extracted the mouth corners and lip movement information. This extracted information was then applied, together with the acoustic information, to a Multi-State Time Delay Neural Network (MS-TDNN) to perform the final recognition. Paul and his team stated that, compared to audio-alone recognition, the combined audio-visual system achieved a 20-50% error rate reduction for various signal-to-noise conditions.

Baig and Gilles [44] presented a new neural architecture, called a spatio-temporal neural network (STNN), and used it in visual speech recognition. Baig and his group chose four points on the lips and generated a time signal by tracking these four points in successive image frames. These time signals, together with the acoustic signal, were used as inputs to the STNN for recognition. They tested their system on 510 audio and visual sequences of numbers spoken in French. They used 260 sequences for training the network, while the remaining 250 sequences were used for testing the performance. Although the test and training samples were from the same person, their results showed a success rate of 77.6%. The study did not compare the performance of the system with that for audio-only input.

Luettin and Dupont [39] combined the inner and outer lip contour positions with the lip intensity information to form a vector of 24 elements representing the visual features. The audio features were obtained by choosing 24 linear prediction coefficients. The audio and visual features were applied to an HMM for speech recognition.
They tested their system on clean speech and reported an error rate of 48% with visual features only and 3.4% with the audio signal only. When both audio and visual features were used, the error rate dropped to 2.6%.

Goldschen [45] applied hidden Markov models (HMM) as classifiers in a speech recognizer with both audio and visual input data. He also studied which features led to better speech classification decisions. The feature set he preferred was associated with the dominant mouth movement in terms of the upper and lower lips, rather than the lip positions. Mase and Pentland [46] reached the same conclusion.

The examples in this section covered many speech recognizers that utilized both audio and visual information. The performance of speech recognizers improved when visual cues were included in the system. Visual features extracted from the lips were used in most applications, which is consistent with the studies discussed in Sections 2.1 and 2.2.

2.3 MISCELLANEOUS APPLICATIONS OF AUDIO-VISUAL SIGNAL PROCESSING

Human beings can use the visual information available in lip movements or facial expressions to separate two sounds coming from two different sources. Okuno et al. [47] attempted to design a system that would use visual information to enhance sound source separation. They used the movement of the mouth as an indication of a sound source. This information was sent to a module that checked whether the image information and the sound information came from the same source. If they did, the position information was recorded. Otherwise, the visual module was moved to focus on another sound source. Okuno stated that adding the visual information increased the dimension of the problem. However, it provided an accuracy of a few degrees for a point source at a distance of 2 meters, which was higher than the accuracy of audio-only sound source separation.

Speaker detection is another area in which audio-visual information has been combined.
Speaker detection is a useful tool in video conferencing, where a camera needs to focus on the person identified as the speaker. Cutler et al. [48] proposed a measure of the correlation between mouth movements and speech and used it in a time-delayed neural network (TDNN) to search for a certain speaker. The system was able to successfully locate a single speaker. Their results did not provide a quantitative measure of the accuracy of the device with and without the visual information.

Another area in which visual information has been used is sound sensing. Takahashi and Yamazaki [49] proposed a sound sensing system that used audio-visual information. The system was divided into two subsystems: an audio subsystem and a visual subsystem. The audio subsystem extracted a target signal with a digital filter composed of tapped delay lines and adjustable weights. These weights were modified by a special adaptive algorithm called the "cue signal method". For the purpose of adaptation, the cue signal algorithm needed only a narrow-bandwidth signal correlated with the power level of the target signal. This narrow-band signal was called the "cue" and was generated by the visual subsystem. The authors stated that fluctuations in sound power due to lip movement correspond to a visual stimulus in the image. Therefore, by locating the lips in an image, the system obtained a visual estimate of the sound power from the squared absolute time difference between successive image frames. Another estimate of the sound power was obtained directly from the audio signal. The audio and visual estimates of the sound power were multiplied together to form the cue signal. With the visual cues, their results showed a 96% improvement in object localization over audio-only object localization.

Facial expressions are another area of focus for researchers, since many people can understand emotions based on the facial expressions of the people around them [50].
Craig et al. [51] stated that facial expressions can indicate pain. Katsikitis and Pilowsky [52] mentioned that facial expressions reveal brain functions and pathology. Ekman and Friesen [53] developed an anatomically based Facial Action Coding System (FACS), and many researchers [54-56] working in this area state that FACS is the most comprehensive method of coding facial displays. This coding system was obtained by viewing videotapes of a large sample of recorded facial expressions in slow motion and then coding those expressions to form action units. The FACS contained more than 7000 facial expressions. In a later study [54], Ekman and Friesen proposed that emotion codes (i.e. fear, joy, sadness, anger, disgust, and surprise) can be obtained from specific combinations of FACS action units. Hegely and Nagel [56] collected these emotions together to form the Emotions Face Action Coding System (EMFACS).

Izard [57] developed another anatomically based system that requires slow-motion viewing of videotapes. He called it the Maximally Discriminate Facial Movement Action Coding System (MAX). Compared with FACS, MAX is less comprehensive and is intended only to code emotion-based facial displays, while FACS is also intended for displays that are not emotion related.

Essa and Pentland [58], Mase and Pentland [59], and Yacoob and Davis [60] attempted to use optical-flow-based approaches to discriminate facial displays (e.g. fear, joy, surprise). Such approaches were based on the assumption that muscle contraction causes skin deformation. This skin deformation changes the optical spectrum appearing on the face of the speaker. In a digitized image sequence, algorithms for optical flow extract motion from the texture changes in the skin, and the pattern of this motion can be used to discriminate facial displays.

Pantic and colleagues [61] designed an expert system they called the Integrated System for Facial Expression Recognition (ISFER).
This system performs recognition and emotional classification of human facial expressions from a still, full-face image. At the time their work was published, the system was capable of automatically classifying face actions into six emotion categories (happiness, anger, surprise, fear, disgust, and sadness).

This discussion supports the claim that visual cues may contribute to speech perception. However, the applications of facial expressions were limited to identifying emotions and not speech. Despite this limitation, identifying emotions may help in speech perception, since different emotions involve the use of different vocabulary.

2.4 SUMMARY

Despite the wide range of audio-visual applications reviewed in this chapter, little work has been done on the automatic recognition of visemes based only on visual cues. The work done on recognizing visemes was based on the visual responses of subjects, as detailed in Section 2.2.2. The consistency of viseme grouping in the speech psychology test results shown in Table 2-11 motivated this attempt to design an automated viseme classifier based on a set of visual cues only. The outline of this work was shown in Figure 1-2.

Many factors must be considered in choosing which phonemes to focus on when developing the audio-visual data needed for this study. Since most audio-visual applications utilize 2-D imaging devices, this study focuses on visual features that can be extracted from 2-D images of speakers. The recording device used in this study was a Motion Analysis system (ExpertVision, Inc) VP110, which traces the 2-D motion of optical reflectors placed anywhere on the face of the subject. The VP110 is manufactured by the Motion Analysis Corporation, located in Santa Rosa, California. The optical reflectors used with the device were circular, with a reflective side and an adhesive side that sticks to the point of interest.
This limits the ability to distinguish phonemes that involve the inside of the mouth, such as the sounds within classes 5 and 7 in Table 2-11. The voiced consonants in English are /b/, /d/, /g/, /v/, /ð/, /n/, /l/, /w/, and /j/. In addition, the English sounds that involve the lips, jaw, and teeth in production are /b,p,m,f,v,th,w/. One sound from each of the classes of the common viseme-to-phoneme mapping (Table 2-11) was chosen to represent the class to be distinguished. These phonemes are shown in Table 2-12.

Table 2-12 VCV sounds to be rerecorded
Viseme Class | VCV sound
Class 1 | /b/
Class 2 | /v/
Class 3 | /ð/ (as in "the")
Class 4 | /w/
Class 5 | /d/, /z/

The remaining two classes of Table 2-11 were not studied, since they are sounds produced inside the mouth. The phonemes /d/ and /z/ were chosen together from the same class to test whether the first and second derivatives of the lip motion waveform can capture the difference between the two sounds, even though subjects could not distinguish them consistently.

The common viseme-to-phoneme mappings showed that the viseme classes did not change when the vowels /ɑ,i,ʌ/ were presented to subjects in association with different consonants. In order to test whether the classifier can distinguish between vowels, the classifiers were designed and tested on VCV recordings using both vowels /ɑ,i/. This resulted in a total of 12 VCV sequences, where the consonants are those shown in Table 2-12 and the vowels are /ɑ,i/. The second vowel in the VCV sequence was emphasized during sound production.

The work on the correlation between acoustics and visual cues discussed in Section 2.1 showed that lip motion is highly correlated with the acoustic signal being produced. In addition, work on audio-visual speech recognition showed that visual features related to lip motion were the most widely used. The visual features representing the first block of Figure 1-2 were therefore extracted from the 2-D images of lip motion.
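The stimulus set and the derivative features mentioned above can be sketched in a few lines. The sketch assumes that each VCV sequence uses the same vowel on both sides of the consonant, which the count of 12 sequences implies, and it approximates derivatives by finite differences between frames. The function names, the sample waveform, and the 60 Hz frame rate passed in the example are assumptions for illustration.

```python
from itertools import product

CONSONANTS = ["b", "v", "ð", "w", "d", "z"]
VOWELS = ["ɑ", "i"]


def vcv_sequences():
    """All vowel-consonant-vowel strings with the same vowel on both sides."""
    return [f"{v}{c}{v}" for c, v in product(CONSONANTS, VOWELS)]


def derivative(samples, frame_rate_hz):
    """Forward finite-difference derivative of a uniformly sampled waveform."""
    dt = 1.0 / frame_rate_hz
    return [(b - a) / dt for a, b in zip(samples, samples[1:])]


# Hypothetical lip-height waveform (arbitrary units), one value per frame.
lip_height = [10.0, 10.5, 12.0, 14.5, 15.0, 13.0]
velocity = derivative(lip_height, frame_rate_hz=60.0)
acceleration = derivative(velocity, frame_rate_hz=60.0)
```

Each differentiation shortens the waveform by one sample, so `velocity` has five values and `acceleration` has four for the six-frame example.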
The study assumes that the location of the lips is already determined, and it does not consider techniques to extract lips from recorded video sequences.

3.0 EXPERIMENTAL METHOD

This chapter explains the experimental setup employed in the study. The first section presents the data recording method. Section 3.2 explains the pre-processing applied to waveforms. Section 3.3 discusses the feature extraction. The final section describes the linear discriminant analysis method.

3.1 DATA RECORDING

3.1.1 VP110 Motion Analyzer

The VP110 Motion Analysis System (ExpertVision, Inc) analyzes the motion of objects within single or multiple video frames. The system features a real-time data acquisition system together with data analysis capabilities. The ExpertVision system consists of the following units: infra-red optical reflectors, video camera, array of infra-red LED lights, computer, and the VP110 unit [62].

Figure 3-1 shows how a subject with five optical markers placed around the face sits directly in front of the camera of the Analyzer. An array of infra-red LEDs is attached to the camera to ensure that the amount of infra-red light reflected off the optical markers is higher than the amount reflected off the remaining parts of the face. The Motion Analyzer has a threshold for the brightness of the infra-red light received from the camera. This threshold ensures that only objects with high brightness due to infra-red light reflections are detected by the camera.

Once the recording process starts, the Motion Analyzer tracks the optical reflectors at a frame rate of 60 Hz and identifies their outlines in consecutive frames. These outlines are used as input to the ExpertVision software to identify the x-y coordinates for the center point of individual reflectors in each frame. A microphone, used to record the audio signal spoken by subjects, is sampled at 22050 Hz by the computer. The recorded audio signal was not used in the study.
The ExpertVision software processes the recorded video sequence frame-by-frame, calculates the coordinates of the center of each optical reflector based on the coordinate system shown in Figure 3-2, and stores the coordinates in a file called the centroid file. The centroid files are processed further in Matlab to generate a path associated with each centroid in consecutive frames. The path files are used to generate distance waveforms, from which the visual features are then extracted. This process is discussed further in Section 3.3.

3.1.2 The Recording Procedure

Five reflectors were placed on the face of the subjects as shown in Figure 3-2. The motion of these reflectors represented the visual features associated with the VCV sequence as discussed in Section 3.3. The reflectors were placed on the mid points of the upper and lower lips, the lip corners and the upper nose bridge. The upper nose bridge is a point on the face that does not move while a subject is speaking. It was used as a correction factor for the effect of head movement during the recordings. Participants sitting in front of the LED lights repeated the desired VCV sequence. The recording of the audio-visual sounds involved the following steps:
o Participants’ hearing was screened using the Welch Allyn AudioScope 3 Screening Audiometer. This instrument provides the means for quickly checking that the ear canal is patent and that the subject can hear tones presented at 500, 1000, 2000, and 4000 Hz at 25 dB hearing level. These frequencies are of primary importance for hearing speech, and the screening helps to assure accurate hearing of speech at conversational levels. Participants were required to detect all four test tones, at least in one ear, to be included in the study.
o The required VCV word was presented to participants via a headphone, thus guiding them in producing the proper clear VCV sequence. The guide signal was recorded by a PhD audiology student capable of producing clear VCV sequences.
The guide signal emphasized the 2nd vowel in the sequence.
o Participants repeated the VCV word they heard in the headphone several times, closing their lips after each repetition.
o Participants repeated the desired VCV words while sitting directly in front of the camera.
o The recording time for each VCV sequence was 45 seconds.
o Once the recording of a certain VCV word was done, the collected data was saved and the recording of the next VCV word began.

Figure 3-1 Experimental setup for audio-visual data recording

Figure 3-2 Location of optical reflectors and coordinate system for tracking of lip movements

3.1.3 The Recorded Audio-Visual Data

Audio-visual recordings from 28 participants (26 females and 2 males) were obtained. The participants were monolingual native speakers of American English and capable of producing clear speech. The protocol for the recordings was approved by the University of Pittsburgh Institutional Review Board (IRB).

Table 3-1 lists the number of word utterances obtained for each VCV sequence for all 28 speakers after the processing of centroid and path files was completed. Some speakers produced more utterances than others. In addition, the Motion Analyzer sometimes failed to capture some of the optical reflectors in consecutive frames, reducing the number of available word utterances. For speakers 4, 10, 11, 19 and 21, multiple reflectors dropped out in consecutive frames for several seconds. This led to discarding the recordings of some of the VCV sequences associated with those speakers.

The brightness threshold set in the Motion Analyzer was not able to remove other bright spots on the face of some participants. Teeth and even cheeks of some participants resembled optical reflectors and were captured by the Motion Analyzer. This led to additional centroids appearing in each frame, causing confusion in the path assignments in consecutive frames.
A minimum of 8 word utterances per VCV sequence was required to include a speaker in the analysis. An exception was made for speakers 1, 7 and 14, who had 8 or more word utterances in 11 of the VCV sequences and 7 utterances in the 12th VCV word. The missing utterance for each of these speakers was compensated for by adding the average of the other 7 utterances. This resulted in including audio-visual data from the 18 speakers shown in bold font in Table 3-1. Audio-visual data from the remaining 10 participants was not used in the study.

Table 3-1 Number of utterances for each word by all speakers

Ab Av At Aw Az Ad Ib Iv It Iw Iz Id
# 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12 13 14. 15. 16. 17. 18. 19. 20. 21. 22. 23. 24. 25. 26. 27. 28. Total
1 2 3 4 5 6 7 8 9 10 11 12
11 14 10 0 12 12 12 12 12 12 11 9 10 9 10 9 10 10 11 16 10 14 15 14 9 10 12 7 13 9 8 10 10 11 8 10 8 9 10 8 8 13 14 12 12 10 11 10 12 11 10 13 10 12 0 9 6 12 11 9 15 12 11 14 9 8 11 12 7 17 12 10 9 11 11 11 14 9 12 10 10 8 11 8 12 11 8 17 13 11 6 10 8 11 8 11 16 13 10 6 10 8 8 10 11 17 12 6 8 10 10 8 14 11 11 11 8 7 10 9 11 15 10 17 11 0 6 8 10 11 11 10 16 11 10 9 10 10 12 13 10 17 9 10 9 11 8 12 14 8 16 8 10 7 10 11 12 12 10 13 13 7 13 13 13 11 12 10 12 13 13 13 13 13 13 14 13 13 12 11 10 9 9 7 8 8 8 12 7 12 8 8 12 9 12 12 10 10 9 10 11 11 9 8 8 12 14 18 17 14 13 10 13 12 8 12 19 15 16 0 0 14 11 8 11 14 13 13 9 7 11 12 8 11 15 6 10 28 2 12 0 14 11 9 11 15 9 9 10 12 15 11 9 11 14 9 9 9 12 8 12 10 14 12 9 11 10 9 15 10 8 14 13 10 11 12 7 15 11 9 15 9 11 9 11 9 16 12 7 13 12 13 9 12 12 15 10 7 12 14 10 8 11 8 10 10 8 9 13 12 8 10 9 13 11 9 12 11 10 8 9 10 15 10 8 13 11 9 6 8 12 15 11 8 13 12 10 5
Total 134 123 116 134 131 144 114 184 132 108 89 119 107 139 153 107 123 151 134 123 107 162 131 100 148 151 119 103
313 315 303 305 305 310 299 307 315 306 304 3664

3.2 PRE-PROCESSING THE WAVEFORMS

The recording system produces two waveforms per reflector.
Each waveform shows the position of the reflector on one axis (x or y) in every frame as the subject repeats a VCV word. Thus, the total number of waveforms to be processed for every VCV word is ten. Figure 3-3 shows an example of the waveforms associated with the lip markers for one subject repeating /ɑbɑ/ 8 times. These waveforms are generated by the ExpertVision system and stored as centroid files on the computer.

Figure 3-3 Waveforms associated with each reflector

3.2.1 Generating Path files from Centroid files

In every frame, the Motion Analyzer scans the image from top to bottom and left to right. When a reflector is detected, a number is assigned to it and the x-y coordinates for that reflector are recorded. There is no guarantee that the same number is assigned to every centroid in consecutive frames. A certain centroid might be assigned number 2 in one frame, but number 3 in the following frame. This confusion increased when the reflectors within a frame were along the same line in the x-direction. The assignment confusion is corrected by scanning the generated centroid files frame by frame to ensure that every reflector has the same assignment in consecutive frames.
This process involved developing Matlab code to perform the following tasks:
o Check the first frame and assign a path number for the x and y coordinates of each centroid as shown in Table 3-2.

Table 3-2 Path assignment in the first frame of the centroid file

Path 1    Reference path
Path 2    Upper lip
Path 3    Left lip corner
Path 4    Right lip corner
Path 5    Lower lip

o Check the next frame fj, 1 < j <= n, where n is the total number of frames:
   o Use the last coordinate location Pi for Path i (i = 1, 2, 3, 4, 5) from the previous frame fj-1.
   o Measure the Euclidean distance between all five centroids in fj and Pi, and assign the closest centroid in fj to path Pi.
   o Go to the next path Pi+1 in frame fj-1, measure the Euclidean distance between the remaining centroids in fj and Pi+1, and assign the centroid in fj with the shortest distance to path Pi+1.
   o Repeat the above steps for the remaining centroids in frame fj.
   o Go to the next frame.
o A flag is used to determine if all five centroids were detected in a frame. If one centroid is dropped in a frame, the path associated with that centroid is marked and a search is made for the closest centroid to the missing path in future frames. The algorithm then linearly interpolates points across the missing frames to keep the path connected between frames.
o If a certain path Pi is missing for more than 60 frames, that path is dropped during those frames while the other paths are preserved. Pi continues to be tracked when it reappears in later frames.
o If the number of detected centroids in a given frame fj is greater than five, the Euclidean distance between them and the five centroids detected in the previous frame fj-1 is calculated. Each centroid in fj is assigned to the closest centroid in fj-1, and the extra centroids from frame fj are discarded.
o If more than one centroid is missing in a given frame, the frame is dropped.

The corrected assignments are stored in path files.
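The frame-to-frame assignment step above can be illustrated with a short sketch. It is written in Python rather than the Matlab used in the study, and the function names `assign_paths` and `track` are hypothetical. It covers only the greedy nearest-centroid assignment and the discarding of extra centroids. The interpolation across dropped frames and the 60-frame rule are deliberately left out.

```python
import math


def assign_paths(prev_positions, centroids):
    """Greedy nearest-centroid assignment for one frame.

    prev_positions: last known (x, y) of each path, in path order.
    centroids: (x, y) points detected in the current frame, in any order.
    Returns one centroid per path (None if no centroid is left).
    Centroids that are never claimed (extra bright spots) are discarded.
    """
    remaining = list(centroids)
    assigned = []
    for px, py in prev_positions:
        if not remaining:
            assigned.append(None)
            continue
        nearest = min(remaining, key=lambda c: math.hypot(c[0] - px, c[1] - py))
        remaining.remove(nearest)
        assigned.append(nearest)
    return assigned


def track(frames):
    """Build paths over all frames; frames[0] is assumed to be in path order."""
    last = list(frames[0])
    paths = [list(last)]
    for centroids in frames[1:]:
        assigned = assign_paths(last, centroids)
        # Keep the previous location for a path that found no centroid.
        last = [new if new is not None else old for new, old in zip(assigned, last)]
        paths.append(assigned)
    return paths


frames = [
    [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)],
    [(0.5, 10.2), (0.1, 0.2), (9.8, 0.1)],  # same reflectors, shuffled order
]
paths = track(frames)
```

Because the assignment is greedy in path order, an early path can claim a centroid that belongs to a later one when a reflector drops out, which is one reason the original procedure adds the flagging and interpolation steps.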
There are 10 path files marking the x-y coordinates for each reflector in consecutive frames.

3.2.2 Removing the effect of head motion from every frame

The reference reflector location (at the upper nose bridge) was used to track the location of the head in every frame. Since the visual features to be extracted rely on the x-y coordinates of the reflectors in consecutive frames, the effect of head motion on the changes in these coordinates should be minimized. This correction was done by the following technique:
o Calculate the average location of the reference reflector in x and y coordinates in the first 60 frames, and use it as the reference point in that window.
o Find the difference between the above reference point and the actual location of the reflector in each frame within the 60-frame window in both x and y directions.
o Add the difference in the x-direction as well as the difference in the y-direction to the location of the four reflectors around the lips in all the frames within the window.
o Move to the next 60-frame window.

This correction reduces the effects of head motion in the x-y plane, which is parallel to the face of the camera lens. It does not compensate perfectly for forward or backward face tilting. To assess the error that could be introduced by tilting, the actual motion of the forehead reference centroid was monitored across speakers in consecutive frames. A sample waveform for this motion for one VCV sequence from one subject is presented in Figure 3-4. The motion of the forehead reference point in consecutive frames was between half a pixel and one pixel. The range of the overall motion of the forehead reference point in different speakers was between 1 and 2 pixels, which is close to the pixel noise level (0.5 pixels) of the Motion Analyzer system itself [62]. This indicates that the subjects had minimal head motion during the 45-second recording session, and that the error due to head tilting is not significant.
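A minimal Python sketch of the windowed correction described in Section 3.2.2 is given below. The function name `remove_head_motion` and the data layout are hypothetical. The sign convention (window mean minus actual reference location) is an assumption, chosen so that adding the difference cancels the head displacement.

```python
def remove_head_motion(ref, markers, window=60):
    """Shift the lip reflectors to cancel in-plane head motion.

    ref: (x, y) of the reference reflector in each frame.
    markers: for each frame, a list of (x, y) lip-reflector locations.
    window: number of frames per correction window (60 in the text).
    """
    corrected = []
    for start in range(0, len(ref), window):
        block = ref[start:start + window]
        # Reference point: mean location of the reference reflector in the window.
        mean_x = sum(x for x, _ in block) / len(block)
        mean_y = sum(y for _, y in block) / len(block)
        for offset, (rx, ry) in enumerate(block):
            # Assumed sign: reference point minus actual location.
            dx, dy = mean_x - rx, mean_y - ry
            frame = markers[start + offset]
            corrected.append([(x + dx, y + dy) for x, y in frame])
    return corrected


# The head drifts 2 units to the right between the two frames;
# after correction the lip marker no longer follows the drift.
ref = [(0.0, 0.0), (2.0, 0.0)]
markers = [[(10.0, 10.0)], [(12.0, 10.0)]]
corrected = remove_head_motion(ref, markers, window=2)
```

As in the text, this only handles translation parallel to the image plane. Tilting changes the apparent distances themselves and is not addressed by a shift.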
Figure 3-4 Motion of forehead reflector in consecutive frames (y-axis: pixels; x-axis: frames)

3.2.3 Generate the distance waveforms

It is usually desirable to simplify a problem by reducing the number of variables without jeopardizing the information stored in those variables. The number of variables in this problem can be reduced by decreasing the number of waveforms to be processed. This can be achieved by converting the path files into distance waveforms. The following distance waveforms were generated:
1. Euclidean distance between upper and lower lips in consecutive frames.
2. Euclidean distance between lip corners in consecutive frames.
3. Euclidean distance between upper lip and forehead reflector in consecutive frames.
4. Euclidean distance between lower lip and forehead reflector in consecutive frames.

The upper-lower lip distance waveform for eight repetitions of the word /ɑbɑ/ by one subject, superimposed on the audio signal of the utterances, is shown in Figure 3-5. The peaks of the waveform correspond to the largest distances between the upper and lower lips while speaking. The first peak is associated with producing the first vowel /ɑ/, the minimum is associated with producing the consonant /b/, and the second peak represents the mouth opening while producing the second vowel /ɑ/ in the sequence.

Figure 3-5 Upper/Lower distance waveform for “aba” with the audio signal superimposed

The focus on distance waveforms reduced the number of waveforms to consider. In addition, it further reduced the effect of any possible head motion remaining in the path files, since the distance between two points is independent of the location of the head as long as head tilting is not involved. Figures 3-6 through 3-9 show examples of these waveforms extracted from four different VCV sequences repeated by the same speaker.
The x-axis in these figures represents the frame number while the y-axis represents the amplitude of the distance waveform. There is a pattern repeating in every waveform around the extrema of the waveform. The objective now is to quantify this pattern by using a set of parameters (visual features) to characterize these waveforms.

In addition, it is noted that there is a correlation between the 1st waveform, representing the distance between the upper and lower lips, and the 4th waveform, representing the distance between the lower lip and the forehead reference point. This correlation shows that the role played by the upper lip in producing a word is much smaller than the role played by the lower lip, and that the lower lip dominates the upper/lower distance waveform. In the remaining analysis of the data, only the upper-lower lip waveform will be used, and the lower-lip waveform will be discarded.

Figure 3-6 and Figure 3-7 show the same speaker repeating two different words. Figure 3-6 shows that the speaker had 12 utterances of the word /ɑbɑ/, and Figure 3-7 shows 19 utterances of the word /ɑðɑ/. In addition to the differences in speaking rate, the amplitudes of the waveforms differ from one speaker to another. This demonstrates the need for amplitude as well as time normalization before visual features can be extracted to represent a specific word.
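As an illustration of how the four distance waveforms of Section 3.2.3 follow from the path files, a minimal Python sketch is shown below. The function name `distance_waveforms` and the dictionary keys are hypothetical, and each argument is assumed to hold one (x, y) pair per frame.

```python
import math


def distance_waveforms(ref, upper, lower, left, right):
    """Frame-by-frame Euclidean distances between pairs of reflectors.

    Each argument is a list with one (x, y) location per frame.
    """
    def dist(a, b):
        return [math.hypot(ax - bx, ay - by) for (ax, ay), (bx, by) in zip(a, b)]

    return {
        "upper_lower": dist(upper, lower),  # lip opening
        "corners": dist(left, right),       # lip width
        "upper_ref": dist(upper, ref),      # upper-lip motion
        "lower_ref": dist(lower, ref),      # lower-lip motion
    }


def shift(points, dx, dy):
    """Translate every point, mimicking in-plane head motion."""
    return [(x + dx, y + dy) for x, y in points]


ref, upper, lower = [(0.0, 0.0)], [(3.0, 4.0)], [(3.0, 10.0)]
left, right = [(0.0, 7.0)], [(8.0, 7.0)]
waves = distance_waveforms(ref, upper, lower, left, right)
moved = distance_waveforms(*(shift(p, 5.0, -2.0) for p in (ref, upper, lower, left, right)))
```

Comparing `waves` with `moved` shows the property noted in the text: a common translation of all reflectors leaves every distance unchanged, so only tilting can still leak into the waveforms.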
Figure 3-6 Four distance waveforms associated with the VCV word “ɑbɑ” (panels: UP LOW DISTANCE, UP DISTANCE, CORNERS DISTANCE, LOWER LIP DISTANCE; x-axis: frame number; panel label: aba)

Figure 3-7 Four distance waveforms associated with the VCV word “ɑðɑ” (same four panels; panel label: ata)

Figure 3-8 Four distance waveforms associated with the VCV word “iði” (same four panels; panel label: ibi)

Figure 3-9 Four distance waveforms associated with the VCV word “ɑwɑ” (same four panels; panel label: awa)

3.2.4 Obtaining Single Utterances

The distance waveforms in Figures 3-6 through 3-9 showed the need for applying time-and-amplitude normalization to the waveforms to compensate for speaking at different rates. Before the time-and-amplitude normalization can be applied, the distance waveforms associated with single utterances of every word need to be obtained for every speaker. The following steps were used to obtain single utterances for every VCV word:
o Start with the upper-lower lips as well as lip-corners distance waveforms calculated earlier.
o Use a threshold to determine the starting point of an utterance.
o The algorithm displayed the distance waveforms and requested the user to choose the word length as well as the number of points to go back from the starting point identified in the previous step in each waveform.
o The algorithm allowed five possible word lengths: 120, 108, 96, 84, and 60 samples. VCV words repeated at a fast rate require shorter word lengths, while VCV words repeated at slower rates require longer word lengths. If the selected word length is less than 120, then zeros are added before and after the word so that the word is stored in the middle of the record.
o Samples from the starting point determined by the threshold to the ending point, which depends on the length of the word, are stored in a vector of 120 samples (i.e. 2 sec).
o The resulting processed words are displayed. The algorithm waits for an input from the user to verify that the word length chosen was long enough to capture each complete word utterance. If more than one word utterance was captured, or if the samples captured were not enough to obtain one complete utterance, then the user can go back to the word-length selection step to modify the word length. If the parameters worked, then the files are stored.

An example of the results of breaking up the words into single utterances is shown in Figure 3-10 for two distance waveforms. The ninth waveform in every figure shows the average of all utterances together with plus and minus one standard deviation about the average. The waveforms show good consistency for the word spoken by the same speaker, which was common across speakers.

Figure 3-10 Broken utterances for the word /ɑbɑ/ together with the mean for each speaker

3.2.5 Time-And-Amplitude Normalization

The objective of time-and-amplitude normalization is to have the same time and amplitude scales for all word utterances before extracting the visual features. Subtracting the mean of the word utterance ensures that all utterances have zero mean. Several normalization schemes were evaluated.
One method for amplitude normalization is to divide each of the utterances by the maximum amplitude of the absolute value of the waveform. This forces all utterances to have a maximum amplitude of one. Another popular method for amplitude normalization involves dividing each utterance by the standard deviation of the waveform representing the utterance.

The visual features used in the study, discussed in Section 3.3, were related to the extrema present in the distance waveforms. Dividing by the maximum of the absolute value of the waveform forces the largest extremum to have an amplitude of one across the utterances of all speakers. This removes the variability in one of those extrema, which may result in losing one or more visual features. The second amplitude normalization method preserves the amplitude variability across the speakers, which makes it more practical in preserving the contribution of the extrema towards discrimination.

The time normalization can also be achieved in many ways. One way is to extract the visual features from the amplitude-normalized waveform as discussed in Section 3.3. The time normalization can then be applied to the extracted features by dividing the slope features by the utterance time-length, which is the number of frames between the first and last peaks as shown in Figures 3-13, 3-14, and 3-15 and presented in equation 3-1 and equation 3-2:

\[ \mathrm{Norm\_Slope1} = \frac{\left( \dfrac{y_2 - y_1}{t_2 - t_1} \right)}{t_3 - t_1} \qquad \text{(eq 3-1)} \]

\[ \mathrm{Norm\_Slope2} = \frac{\left( \dfrac{y_3 - y_2}{t_3 - t_2} \right)}{t_3 - t_1} \qquad \text{(eq 3-2)} \]

Where:
o Norm_Slope1: Normalized slope between the first and second extrema in the word utterance
o Norm_Slope2: Normalized slope between the second and third extrema in the word utterance
o y1, y2, y3: Amplitude of the distance waveform at the 1st, 2nd, and 3rd extrema respectively
o t1, t2, t3: Time location of the 1st, 2nd, and 3rd extrema respectively

Time normalization can also be done by mapping all utterances to the same time scale, and then extracting visual features.
The second time-normalization scheme ensures that the visual features are extracted after the time normalization is applied, rather than extracting the features and then applying the normalization as is the case in the first method.

One of the popular time-and-amplitude normalization techniques was introduced by Smith [63-65]. The popularity of the technique is attributed to its linearity and simplicity. The algorithm starts by applying a low pass filter with a cutoff frequency of 10 Hz to minimize the noise in the signal. The application of this low pass filter reduced the high frequency noise introduced by the pre-processing of the waveform. The amplitude normalization is achieved by dividing each word utterance by the standard deviation of the waveform. The time normalization starts with determining the starting and ending points for each word utterance and then resampling the word utterance on a 120-point time scale using linear interpolation.

Figure 3-11 Distance waveforms associated with ten utterances of /ɑbɑ/ before normalization (panels: upper-lower lips distance waveform and lip corners distance waveform; x-axis: 0 to 120 samples)

Figure 3-12 Amplitude normalization by dividing over the maximum value (panels: upper-lower lips distance waveform and lip corners distance waveform after amplitude normalization)

Figure 3-13 Ten word utterances after applying the Ann Smith normalization technique (panels: normalized upper-lower lips and lip corners distance waveforms for every word)

Figure 3-11 displays the upper-lower and lip-corners distance waveforms associated with 10 utterances from a speaker stating the word /ɑbɑ/. The figure shows the variations present between different utterances and affirms the need for time and amplitude normalization. Figure 3-12 displays the result of division by the maximum amplitude on these waveforms, while Figure 3-13 shows the results of applying Smith’s algorithm for both time and amplitude normalization to the waveforms. Figure 3-12 shows how the first amplitude normalization method forces one of the extrema to be one. This may make that feature redundant in the discrimination problem.

Figure 3-14 Standard deviation at each point of the 8 utterances in the 3 distance waveforms (panels: upper-lower lips, lip corners, up distance; curves: before normalization and after normalization)

Smith’s algorithm consistently resulted in good time and amplitude normalization as shown in Figure 3-13. Figure 3-14 displays three graphs. In each graph, the standard deviation for each of the 120 samples in the waveform between the 8 different utterances for the same speaker is shown before and after normalization. The solid line represents the standard deviation across the 10 utterances before normalization while the dotted line represents the standard deviation across those points after normalization. Smith’s algorithm reduced the calculated standard deviation in the upper-lower lips distance waveform and the upper lip waveform. The algorithm is not as effective in reducing the standard deviation in the lip-corners waveform, but it does as well as the first method described.
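The two core steps of the normalization described above (division by the standard deviation and linear resampling to 120 points) can be sketched in Python as follows. This is an illustration under stated assumptions, not the original Matlab implementation: the function names are hypothetical, the 10 Hz low pass filter is omitted, the mean is removed before scaling, and the population standard deviation is used because the text does not say which estimator was applied.

```python
import statistics


def amplitude_normalize(utterance):
    """Remove the mean and divide by the standard deviation of the utterance."""
    mean = statistics.fmean(utterance)
    sd = statistics.pstdev(utterance)  # assumption: population estimator
    return [(v - mean) / sd for v in utterance]


def resample_linear(utterance, n_points=120):
    """Map an utterance of any length (>= 2 samples) onto n_points samples."""
    last = len(utterance) - 1
    out = []
    for k in range(n_points):
        pos = k * last / (n_points - 1)  # fractional index into the input
        i = min(int(pos), last - 1)      # left neighbour, clamped at the end
        frac = pos - i
        out.append(utterance[i] * (1 - frac) + utterance[i + 1] * frac)
    return out


# An 84-sample utterance (one of the word lengths mentioned in Section 3.2.4)
# is stretched to the common 120-point time scale.
raw = [float((n % 42) - 21) for n in range(84)]
normalized = resample_linear(amplitude_normalize(raw), 120)
```

After this step every utterance has the same length and a comparable amplitude scale, so extrema located at the same sample index can be compared across utterances and speakers.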
In this study, the visual features were extracted from the distance waveforms after the application of Smith’s time-and-amplitude normalization.

3.3 FEATURE SELECTION AND EXTRACTION

Three normalized distance waveforms associated with each VCV word were obtained as shown in the previous section. This section discusses the process of selecting a set of parameters to capture the uniqueness in each of these waveforms. These parameters are the visual features needed for the classification problem stated in Chapter 1.

The distance waveforms capture the motion of the lips while a VCV sound is being produced. As part of preliminary work, spectral analysis and eigenspace analysis were evaluated as ways to extract features, but no patterns could be identified in either the spectrum or the eigenvector space of the distance waveforms. Therefore, the efforts to extract features concentrated on temporal features derived from the articulator motion patterns defined in the speech production literature.

The recording device used in this study can trace different points on the face of a speaker by placing optical reflectors on them. Since the visual features are expected to represent the produced sound, the points to track around the face need to be related to the sound production. Chen’s audio-visual library was based on lip-corners and lip-height as shown in Figure 1-1. Preliminary work on this library showed patterns repeating for the same word across different speakers. The literature review in Chapter 2 showed that visual features extracted from lip motion were widely used in different applications and produced satisfactory results in different areas. The articulators involved in producing the sounds selected for this study are detailed in the speech production literature [12, 66, 67]. This results in the following observations:
1. The vowel /ɑ/ in the VCV sequence involves wide opening of the mouth at the start of the word, then closure of the lips to produce the consonant, and then another opening of the mouth to produce the second /ɑ/ in the word.
2. The amount of closure of the mouth depends on the consonant being produced. Consonants /b/ and /v/, for example, involve complete mouth closure, while consonants /w/, /ð/ and /z/ involve partial closure of the mouth.
3. There is limited lip motion in producing the consonant /ð/, while there is more lip motion in producing the sound /w/.
4. The distance travelled by the upper lip in producing the consonant /v/ is greater than the distance travelled when producing the consonant /z/ [66, 67].
5. Different consonants need different speeds of lip motion in producing them. For example, the motion speed of the lower lip while producing the sound /z/ is different from the speed of the lower lip while producing the sound /ð/ [66, 67].

The first three points are related to the distance travelled by different points around the lips when the sound is produced. This distance was represented by the maximum and minimum values detected in the upper-lower and lip corners distance waveforms. The speech production literature suggested that producing some consonants requires more motion in the upper lip at the time of lip closure. This property was represented by the distance travelled by the upper lip at the time of lip closure. The speed of lip motion is another characteristic that is related to the sound production. It was captured by calculating the slope between consecutive extrema. The process of extracting the features is discussed in Sections 3.3.1 through 3.3.3.

3.3.1 Detecting Extrema in the waveforms

The extrema in the distance waveforms are related to the location of the lips during the production of the word. The distance, together with the rate at which this distance changes, are the two parameters used to characterize these waveforms.
The extrema in the upper-lower and lip-corners distance waveforms represent the maximum and minimum openings of the lips during the sound production. The first extremum is related to the production of the first vowel in the VCV sequence. The second extremum is related to the production of the consonant in the VCV sequence, while the third one represents the mouth opening to produce the second vowel. The amount of rounding in the mouth shape is represented by the extrema obtained from the lip corners waveform. These extrema are visually detected in each word utterance and selected using a mouse. A program stored the amplitude and the time of occurrence of the detected extrema and used these values to calculate the visual features that represent the spoken word. This process is described in the following sections.

3.3.2 Features from the upper and lower lips distance waveform

The Euclidean distance between the upper and lower lip, calculated frame-by-frame, is used to extract amplitude-related and time-related features. The amplitudes of the captured extrema are used as features representing the VCV sound, and the slopes between these extrema are used to represent the rate at which different points around the lips move while producing the sound.

Figure 3-15 Features extracted from upper/lower distance waveform

The upper graph in Figure 3-15 shows a sample upper-lower distance waveform for the sequence /ɑbɑ/. The lower graph in the figure shows the extrema extracted from the waveform. The slope of the waveform between the extrema is calculated to capture how quickly the lips moved while stating the VCV word. The slopes are calculated by the following formulas:

\[ \mathrm{Slope1} = \frac{y_2 - y_1}{t_2 - t_1} \qquad \text{(eq 3-3)} \]

\[ \mathrm{Slope2} = \frac{y_3 - y_2}{t_3 - t_2} \qquad \text{(eq 3-4)} \]

Where:
o y1, y2, y3: Amplitude of the distance waveform at the 1st, 2nd, and 3rd extrema respectively
o t1, t2, t3: Time of occurrence of the 1st, 2nd, and 3rd extrema respectively.
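As a worked example of equations 3-3 and 3-4, the following Python sketch computes the five features of one waveform (three amplitudes and two slopes) from three manually selected extrema. The function name `extrema_features`, the dictionary keys and the sample values are hypothetical.

```python
def extrema_features(extrema):
    """Amplitude and slope features from three (t, y) extrema in time order."""
    (t1, y1), (t2, y2), (t3, y3) = extrema
    return {
        "amp1": y1,  # first vowel
        "amp2": y2,  # consonant
        "amp3": y3,  # second vowel
        "slope1": (y2 - y1) / (t2 - t1),  # eq 3-3
        "slope2": (y3 - y2) / (t3 - t2),  # eq 3-4
    }


# Illustrative extrema on the 120-sample normalized time scale:
# open (vowel), closed (consonant), open again (vowel).
features = extrema_features([(10, 2.0), (40, -1.0), (70, 1.5)])
```

With these made-up values, `slope1` is negative (the lips close towards the consonant) and `slope2` is positive (the lips reopen for the second vowel), which is the sign pattern expected for a VCV sequence in the upper-lower distance waveform.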
3.3.3 Features from the left and right lip corners distance waveform

The Euclidean distance between the lip corners, calculated frame-by-frame, is used to extract amplitude-related and time-related features. The amplitudes of the captured extrema are used as features representing the VCV sound, and the slopes between these extrema are used to represent the rate at which different points around the lips move while producing the sound. The upper graph in Figure 3-16 shows a sample lip corners distance waveform for the sequence /ɑbɑ/. The lower graph in the figure shows the extrema extracted from the waveform. The slope of the waveform between the extrema is calculated to capture how quickly the lips moved while stating the VCV word. The slopes are calculated by equations 3-3 and 3-4.

Figure 3-16 Features extracted from lip corners distance waveform

3.3.4 Features from the upper-lip distance waveform

The speech production literature suggested that producing some consonants requires more motion in the upper lip at the time of lip closure. This property was represented by the distance travelled by the upper lip at the time of the lip closure. Therefore, the points of interest in this waveform are chosen to match the frames at which an extremum is detected in the upper-lower lip distance waveform. Whenever an extremum is detected in the upper-lower distance waveform, the corresponding amplitude value of the upper-lip distance waveform is taken as a feature. Figure 3-17 shows an example of these features for one subject speaking the word /ɑbɑ/. The upper graph in the figure marks the extrema extracted from the upper-lower lips distance waveform. The lower graph in the figure shows the corresponding signal values extracted from the upper-lip distance waveform. In addition to the amplitude information, the slope of the waveform between the extrema is calculated to capture how quickly the lips moved while stating the VCV sequence.
The slopes are calculated by Equations 3-3 and 3-4.

Figure 3-17 Amplitude features extracted from the upper-lips waveform (upper panel: extrema detected from the upper-lower lips distance; lower panel: extrema detected from the upper lip distance)

To summarize, three distance waveforms were used with every VCV word to extract the visual features. These waveforms are the upper-lower lips distance waveform, the lip corners distance waveform, and the waveform representing the distance traveled by the upper lip in consecutive frames. Five features are extracted from each of these waveforms, for a total of 15 features for every word. The extracted features are summarized in Table 3-3:

Table 3-3 Summary of the extracted visual features

The Extracted Feature                                                                               Abbreviation
Three upper-lower lip distance extrema                                                              UL1, UL2, UL3
Three upper lip amplitudes at locations matching the upper-lower distance extrema                   UP1, UP2, UP3
Three lip corners extrema                                                                           LR1, LR2, LR3
Two slopes between the consecutive extrema in the upper-lower lip distance waveform                 slope_UL1, slope_UL2
Two slopes between the consecutive extrema in the lip corners distance waveform                     slope_LR1, slope_LR2
Two slopes between the consecutive extrema in the upper lip distance waveform                       slope_UP1, slope_UP2

The usefulness of the above features may vary. Some features might be more important than others in the discrimination problem. The classification technique usually used to measure the contribution of different variables to the discrimination problem is linear discriminant analysis, which is described in the next section.
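The 15 features summarized in Table 3-3 could be assembled for one utterance roughly as follows. This is only a sketch under stated assumptions: the three waveforms are available as per-frame arrays, the extrema have already been picked as frame indices, and time is measured in frames. The names `vcv_features`, `ul_frames`, and `lr_frames` are hypothetical.

```python
import numpy as np


def slopes(values, frames):
    """Slopes between consecutive extrema (Equations 3-3 and 3-4)."""
    return np.diff(values) / np.diff(frames)


def vcv_features(ul, lr, up, ul_frames, lr_frames):
    """Build the 15-feature description of one VCV utterance.

    ul, lr, up -- per-frame upper-lower, lip-corner and upper-lip distances
    ul_frames  -- frame indices of the three upper-lower extrema
    lr_frames  -- frame indices of the three lip-corner extrema
    """
    ul, lr, up = (np.asarray(w, dtype=float) for w in (ul, lr, up))
    ul_frames = np.asarray(ul_frames)
    lr_frames = np.asarray(lr_frames)

    ul_amp = ul[ul_frames]  # UL1..UL3
    up_amp = up[ul_frames]  # UP1..UP3, read at the upper-lower extrema frames
    lr_amp = lr[lr_frames]  # LR1..LR3

    names = ["UL1", "UL2", "UL3", "UP1", "UP2", "UP3", "LR1", "LR2", "LR3",
             "slope_UL1", "slope_UL2", "slope_LR1", "slope_LR2",
             "slope_UP1", "slope_UP2"]
    values = np.concatenate([
        ul_amp, up_amp, lr_amp,
        slopes(ul_amp, ul_frames),
        slopes(lr_amp, lr_frames),
        slopes(up_amp, ul_frames),
    ])
    return dict(zip(names, values))
```

Note that the upper-lip amplitudes and slopes reuse the frames of the upper-lower extrema, mirroring the description in Section 3.3.4.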
3.4 LINEAR DISCRIMINANT ANALYSIS

Linear discriminant analysis is a statistical technique that can be used to examine whether two or more mutually exclusive groups can be distinguished from each other based on linear combinations of values of predictor variables or features (mutually exclusive means that a case can belong to only one group). The main purpose of discriminant function analysis is to predict group membership based on a linear combination of chosen variables or features. The procedure begins with a set of observations where both group membership and the values of the features are known. The end result of the procedure is a model that allows prediction of group membership when only the features are known. A second purpose of discriminant analysis is to provide an understanding of the data set, as a careful examination of the prediction model that results from the procedure can give insight into the relationship between group membership and the features used to predict it. A brief discussion of Fisher’s approach to discriminant analysis is given in the next section.

3.4.1 Discriminant Analysis Model

This section describes the development of Fisher’s discriminant analysis. The material is based on lecture notes by Gutierrez-Osuna [68] and Huberty’s book Applied Discriminant Analysis [69]. The concept of Fisher’s discriminant functions is that, given a set of independent variables or features, the analysis attempts to find linear combinations of those features that best separate the groups of cases. A set of cases separated from the others is considered to be a group. The combinations of the features are called discriminant functions and have the form

\[ d_{ik} = b_{0k} + b_{1k} x_{i1} + \cdots + b_{pk} x_{ip} \qquad \text{eq 3-5} \]

where:
o d_ik is the value of the kth discriminant function for the ith case
o p is the number of features
o b_jk is the value of the jth coefficient of the kth function
o x_ij is the value of the jth predictor for the ith case
o The number of functions is equal to min(#groups-1, #features).

The procedure automatically chooses a first function that separates the groups as much as possible. It then chooses a second function that is both uncorrelated with the first function and provides as much further separation as possible. The procedure continues adding functions in this way until reaching the maximum number of functions, as determined by the number of predictors and categories in the dependent variable. The discriminant model is based on the following assumptions:
o The features are not highly correlated with each other.
o The mean and variance of a given feature are not correlated.
o The correlation between two features is constant across groups.
o The values of each feature have a normal distribution.
o The variance-covariance matrices of the features across the various groups are the same in the population, i.e., homogeneous.

3.4.2 Linear Discriminant Analysis for Two Groups

We start with N1 and N2 observations drawn as two independent random samples from classes ω1 and ω2, where each observation x is p-dimensional, the class means are μ1 and μ2, and the classes share a common covariance matrix Σ. The objective of the analysis is to find a scalar function y by projecting the samples x onto a line in a way that maximizes the separability of the samples.

Figure 3-18 Projection of data on a line (a) Poor separability (b) Good separability

The projection of every observation is given by

\[ y = w^{T} x \qquad \text{eq 3-6} \]

where w is a vector containing the coefficients of the discriminant function. Figure 3-18 shows an example of projecting a set of data belonging to two different classes onto two different lines for the purpose of discriminating between the classes.
Part (a) of the figure shows the result of projecting the data onto a line without achieving good separability between the classes. Part (b) of the figure shows a projection that results in much better separability on the line.

To explain this concept further, a separability measure needs to be defined. To this end, the mean vector of each class ω_i in the x and y feature spaces is defined as

\[ \mu_i = \frac{1}{N_i} \sum_{x \in \omega_i} x \qquad \text{eq 3-7} \]

This means that the mean along the line of projection is given by

\[ \hat{\mu}_i = \frac{1}{N_i} \sum_{y \in \omega_i} y = \frac{1}{N_i} \sum_{x \in \omega_i} w^{T} x = w^{T} \mu_i \qquad \text{eq 3-8} \]

Now we could choose the distance between the projected means as our objective function

\[ J(w) = |\hat{\mu}_1 - \hat{\mu}_2| = |w^{T} (\mu_1 - \mu_2)| \qquad \text{eq 3-9} \]

The distance between the projected means is not a very good measure because it does not take the standard deviation within the classes into account. Fisher presented a solution to this problem by suggesting maximizing a function that represents the difference between the means, normalized by a measure of the within-class scatter. Fisher defined the scatter for each class as

\[ s_i^2 = \sum_{y \in \omega_i} (y - \hat{\mu}_i)^2 \qquad \text{eq 3-10} \]

Then he defined the within-class scatter of the projected samples to be

\[ \text{within-class scatter} = s_1^2 + s_2^2 \qquad \text{eq 3-11} \]

The Fisher linear discriminant is defined as the linear function w^T x that maximizes the criterion function

\[ J(w) = \frac{|\hat{\mu}_1 - \hat{\mu}_2|^2}{s_1^2 + s_2^2} \qquad \text{eq 3-12} \]

Therefore, we are looking for a projection where examples from the same class are projected very close to each other and, at the same time, the projected means of different classes are as far apart as possible. Fisher’s solution to the above problem is given by

\[ w^{*} = S_w^{-1} (\mu_1 - \mu_2) \qquad \text{eq 3-13} \]

and the within-class scatter matrix S_w is given by

\[ S_w = S_1 + S_2 \qquad \text{eq 3-14} \]

Equation 3-13 is usually known as the Fisher Linear Discriminant function, which represents a specific choice of direction for the projection of the data down to one line.
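The two-group solution can be checked numerically with a short NumPy sketch. The data below are synthetic and purely illustrative, the function names are hypothetical, and the class scatter matrices S_1 and S_2 are assumed to be the sums of outer products of the mean-centered observations of each class.

```python
import numpy as np


def fisher_direction(X1, X2):
    """Fisher's projection direction w* = Sw^{-1} (mu1 - mu2)."""
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    S1 = (X1 - mu1).T @ (X1 - mu1)  # scatter matrix of class 1
    S2 = (X2 - mu2).T @ (X2 - mu2)  # scatter matrix of class 2
    Sw = S1 + S2                    # within-class scatter matrix
    return np.linalg.solve(Sw, mu1 - mu2)


def fisher_criterion(w, X1, X2):
    """Squared distance of projected means over projected scatter, J(w)."""
    y1, y2 = X1 @ w, X2 @ w
    scatter = ((y1 - y1.mean()) ** 2).sum() + ((y2 - y2.mean()) ** 2).sum()
    return (y1.mean() - y2.mean()) ** 2 / scatter


# Synthetic two-class data with a common covariance matrix.
rng = np.random.default_rng(0)
cov = [[1.0, 0.8], [0.8, 1.0]]
X1 = rng.multivariate_normal([0.0, 0.0], cov, size=200)
X2 = rng.multivariate_normal([2.0, 1.0], cov, size=200)

w_star = fisher_direction(X1, X2)
j_star = fisher_criterion(w_star, X1, X2)
```

Evaluating `fisher_criterion` for any other direction should give a value no larger than `j_star`, and rescaling `w_star` leaves the criterion unchanged, because only the direction of the projection matters.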
This equation (Equation 3-13) can be generalized to C-class problems, and the next section discusses this process.

3.4.3 Linear Discriminant Analysis, C-classes

The solution presented by Fisher in Equation 3-13 can be extended to the general situation that involves C classes. In this case, we seek (C-1) projections [y_1, y_2, y_3, ..., y_{C-1}] by means of (C-1) projection vectors w_i, which can be arranged by columns into a projection matrix W = [w_1 | w_2 | w_3 | ... | w_{C-1}], so the problem becomes

\[ y_i = w_i^{T} x \;\Rightarrow\; y = W^{T} x \qquad \text{eq 3-15} \]

The solution to the above problem is given by the following equation

\[ W^{*} = [\, w_1^{*} \,|\, w_2^{*} \,|\, \cdots \,|\, w_{C-1}^{*} \,] \;\Rightarrow\; (S_B - \lambda_i S_w)\, w_i^{*} = 0 \qquad \text{eq 3-16} \]

where S_B is the generalized between-class scatter, given by

\[ S_B = \sum_{i=1}^{C} N_i (\mu_i - \mu)(\mu_i - \mu)^{T} \qquad \text{eq 3-17} \]

with

\[ \mu = \frac{1}{N} \sum_{\forall x} x = \frac{1}{N} \sum_{i=1}^{C} N_i \mu_i \qquad \text{eq 3-18} \]

S_w, which is the generalization of the within-class scatter, is

\[ S_w = \sum_{i=1}^{C} S_i \qquad \text{eq 3-19} \]

with

\[ S_i = \sum_{x \in \omega_i} (x - \mu_i)(x - \mu_i)^{T} \qquad \text{eq 3-20} \]

and

\[ \mu_i = \frac{1}{N_i} \sum_{x \in \omega_i} x \qquad \text{eq 3-21} \]

Equation 3-16 simply means that the projections with maximum class separation are the eigenvectors corresponding to the largest eigenvalues of S_w^{-1} S_B.

3.4.4 Stepwise Discriminant Analysis

Some variables may contribute more to the discrimination problem than others. Variables with very little contribution can be discarded to reduce the complexity and dimensionality of the problem. Stepwise discriminant analysis is a technique used to test which variables contribute more to the discrimination function. It can help in reducing the dimensionality of the problem by discarding variables that have an insignificant contribution to the discrimination function. Before the stepwise process, a statistical measure for evaluating each variable in the analysis, together with a significance level of F values that a variable must have to enter the model or be removed from it, must be developed.
Once the criteria and the F values are chosen, the Linear Discriminant Function (LDF) is estimated for all variables. The process proceeds as follows [70]:
o The variable that best meets the criteria is entered into the analysis.
o The remaining variables are tested again, and the variable with the best value for the selection criteria is added to the analysis.
o The variables in the model are tested to check whether any meet the removal criteria, and variables meeting the criteria are removed.
o The process of evaluating variables not in the model is repeated until all variables have been tested for entry or removal.
o The process is terminated when no more variables meet the entry or removal criteria.

In this work, the statistical measure chosen is Wilks' lambda, which is the ratio of the generalized within-class scatter given by Equation 3-19 to the generalized overall scatter given by Equation 3-17. The change in Wilks' lambda for a model when a variable is added or removed is calculated by the following formula

\[ F_{change} = \left( \frac{n - g - p}{g - 1} \right) \left( \frac{1 - \dfrac{\lambda_{p+1}}{\lambda_p}}{\dfrac{\lambda_{p+1}}{\lambda_p}} \right) \qquad \text{eq 3-22} \]

Where:
o λ_p is the Wilks' lambda value before adding the new variable
o λ_{p+1} is the Wilks' lambda value with the added variable
o g is the number of groups
o p is the number of independent variables entered in the stepwise analysis
o n is the total sample size

There are many software packages available to perform discriminant analysis. One of those packages is SPSS version 16, a statistical software package that can perform the above calculations and provide useful statistics to better understand the strength of the discrimination and the distribution of the data.

4.0 RESULTS

Experiments were conducted to evaluate the ability of visual cues to classify visemes associated with VCV words.
The first section of this chapter presents the results of developing the linear discriminant analysis (LDA) functions needed for the classification (training phase) with all speakers involved in the training phase. The second section presents the results of testing the developed functions. The third section shows the stepwise linear discriminant analysis results, and the final section presents the results of training and testing discrimination functions that were built for each individual speaker.

4.1 TRAINING AND TESTING THE CLASSIFIER

There are two methods for generating the training data by partitioning the VCV sequences coming from the 18 speakers shown in bold fonts in Table 3-1. These two methods are speaker-based training and word-based training. In speaker-based training, VCV words from 9 speakers are used to develop the LDA functions, and VCV words from the remaining 9 speakers are used to test the resulting functions. In word-based training, all the available VCV words are divided into two equal parts. One part is used for developing the LDA functions and the other is used for testing them.

4.1.1 Speaker Based Training

LDA functions were developed and tested for different numbers of discrimination classes using speaker-based data analysis.

4.1.1.1 Speaker-Based Training with 12 Classes

In this part, each VCV word was treated as a separate class, and the SPSS linear classification algorithm was applied to calculate the LDA functions needed to discriminate between those 12 classes. A prerequisite for SPSS analysis is to ensure the validity of the assumptions listed in Section 3.4.1. This can be achieved by several tests performed by the SPSS package. The Wilks' lambda test seeks to confirm the assumption of unequal means between the LDA functions. It tests the null hypothesis that the population means for all the discrimination functions are equal in all the classes.
If the hypothesis is accepted, then the discrimination functions represent nothing more than sampling variability. SPSS calculates the value of lambda for different functions. If the significance level for the function is small, then the null hypothesis is rejected. The first step of the test calculates Wilks' lambda for all 11 functions according to Equation 4-1

\[ \lambda = \prod_{\forall i} \frac{S_w}{S_B} \qquad \text{eq 4-1} \]

where:
o S_w is the within-class scatter, given by eq 3-19
o S_B is the overall class scatter, given by eq 3-17
o i is the ith discrimination function, i = 1:11

In the following steps, the test excludes one function at a time and calculates Wilks' lambda for the remaining functions. Table 4-1 shows the results of this test. The small significance values shown in the 3rd column of the table indicate that the null hypothesis is rejected for the first eight functions. The differences between the means in the remaining three functions are not sufficient, indicating that these functions make only a small contribution to the discrimination.

Table 4-1 Testing equality of means for speaker based training analysis

Test of Function(s)    Wilks' Lambda    Sig.
1 through 11           .011             .000
2 through 11           .059             .000
3 through 11           .189             .000
4 through 11           .309             .000
5 through 11           .486             .000
6 through 11           .690             .000
7 through 11           .838             .000
8 through 11           .932             .003
9 through 11           .968             .163
10 through 11          .984             .310
11                     .994             .396

Another assumption involved in the LDA analysis is that the samples represent a multivariate normal distribution with equal covariance matrices in the population. SPSS performs the Box M test to verify the validity of this assumption. Box M tests the null hypothesis that the covariance matrices for the features are equal. The SPSS literature states that for sample sizes of more than 40, the normality test may detect statistically significant but unimportant deviations from normality [70].
Table 4-2 Test of equality of covariance matrix between groups

Class                   Rank    Log Determinant
/ɑbɑ/                   15      -70.745
/ɑvɑ/                   15      -65.853
/ɑðɑ/                   15      -56.453
/ɑwɑ/                   15      -63.055
/ɑdɑ/                   15      -64.485
/ɑzɑ/                   15      -60.873
/ibi/                   15      -50.944
/ivi/                   15      -63.958
/iði/                   15      -67.769
/iwi/                   15      -57.317
/idi/                   15      -64.722
/izi/                   15      -53.074
Pooled within-groups    15      -51.497

Table 4-2 shows the results for the test of equality of covariance matrices between groups. The second column of Table 4-2 shows that the covariance matrices for each of the 12 classes are full rank. However, the 3rd column of Table 4-2 shows that the log determinant of the covariance matrix associated with each class is not always close to that of the pooled within-groups covariance matrix. The significance results for the Box M test presented in Table 4-3 show that the null hypothesis is rejected. The literature associated with the SPSS software states that small variability in the data available in a large sample can result in failing the normality test, but the LDA analysis can still be used to discriminate between the classes.

Table 4-3 Tests null hypothesis of equal population covariance matrices.

Box's M           8611.260
F    Approx.      5.996
     df1          1320.000
     df2          547550.691
     Sig.         .000

Table 4-4 shows how well the training process does in classifying all 12 classes. The average percent of correct classification is 55.3%, which is much higher than chance (8%). In addition, the features discriminated the following classes (/ɑbɑ/, /ɑwɑ/, /eðe/, /iwi/, /ɑðɑ/, /izi/, and /iði/) with a correct percent of classification above the average performance. The table shows that sounds are confused with each other at different rates. Two sequences VCV1 and VCV2 are mutually confused if utterances of VCV1 are mis-classified as VCV2 and utterances of VCV2 are mis-classified as VCV1. For example, VCV sequences with the same consonant and different vowels are often mutually confused, as shown in cells with thick border lines.
Sixteen of the 29 mis-classified sequences of /ɑvɑ/ were classified as /ivi/. In addition, 8 of the 34 mis-classified sequences of /ivi/ were classified as /ɑvɑ/. This mutual confusion is also present between other sequences such as /ɑðɑ/ and /iði/. This mutual confusion is consistent with the results of common viseme-to-phoneme mappings presented in Table 2-11. The Table shows other mutual confusions between sounds associated with the consonant pair /d/ and /z/, in addition to /ɑbɑ/ and /ɑvɑ/. The viseme-to-phoneme mappings in Table 2-11 were obtained based on visual observations of VCV sequences. Sequences that had similar visible articulators were confused with each other and assigned to the same visible class by different observers. The visual features used in this study were extracted from the motion of the visible articulators. The confusion results in Table 4-4 show that the classifier has confusion patterns similar to the ones shown by studies involving human identification of visemes. These confusions will be discussed further when the effect of merging mutually confused classes on the performance of the classifier is studied.

SPSS performs additional analyses that help in understanding the discrimination problem. Table 4-5 presents the contribution of each of the resulting 11 functions in discriminating between the classes. The first column shows the function number, the 2nd column shows the percentage of variance explained by the function across the data, and the 3rd column presents the cumulative explained variance obtained by adding the values for that function and all functions above it. Results of this test indicate that 97% of the variance in the data can be explained by the first six functions. There are two more important results generated by the SPSS package. The first one is called the structural matrix, which shows the contribution of each visual feature towards the discrimination problem.
The second important result is the coefficients of the discrimination functions. Tables 4-6 and 4-7 show these results respectively.

Table 4-4 Classification results and the confusion matrix 12-class speaker based (original count; rows: original class, columns: predicted group membership)

Class    /ɑbɑ/ /ɑvɑ/ /ɑðɑ/ /ɑwɑ/ /ɑdɑ/ /ɑzɑ/ /ibi/ /ivi/ /iði/ /iwi/ /idi/ /izi/  Total
/ɑbɑ/      66     5     0     0     0     0     1     0     0     0     0     0     72
/ɑvɑ/      11    43     0     0     1     0     1    16     0     0     0     0     72
/ɑðɑ/       3     4    20     0     6     6     1     8    10     1    10     3     72
/ɑwɑ/      12     0     0    56     0     0     0     1     0     2     0     1     72
/ɑdɑ/       1     5    16     0    17    12     0     5     3     0     6     7     72
/ɑzɑ/       4     6     6     1     8    24     0     4    10     0     8     1     72
/ibi/      13     0     2     1     4     0    49     1     1     0     0     1     72
/ivi/       4     8     0     0     4     5     0    38     8     0     0     5     72
/iði/       0     0    10     0     1     2     0     0    42     1    16     0     72
/iwi/       3     1     4     5     0     1     0     3     1    54     0     0     72
/idi/       1     0     4     0     7     8     0     3    15     1    33     0     72
/izi/       0     4     9     1     4     3     0     4     2     0     9    36     72
%        91.7  59.7  27.8  77.8  23.6  33.3  68.1  52.8  58.3  75.0  45.8  50.0

The columns of Table 4-6 represent the discrimination functions and the rows represent the features; each entry is the correlation between a feature and the score of a discriminating function. The higher the correlation, the more this feature contributes to the discrimination score provided by the function. The features with the highest contribution in each function are picked by the SPSS program and shown in bold fonts. The second extremum in the lip-corners distance is the most significant feature in the discrimination score provided by the 1st discriminant function. The order in which features are arranged for the 1st function is not the same for the remaining functions, which means that a certain feature may contribute more to the discrimination by a given function while playing little role in discriminating in other functions.
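To make the meaning of these correlations concrete, the sketch below computes a structure-type matrix from a feature matrix, class labels, and a matrix of discriminant coefficients. It assumes the entries are pooled within-group correlations between each feature and each discriminant score, which is how such a matrix is commonly defined; it is not a reproduction of the SPSS implementation, and the function and argument names are hypothetical.

```python
import numpy as np


def structure_matrix(X, labels, W):
    """Pooled within-group correlations of features with discriminant scores.

    X      -- (n_cases, n_features) feature matrix
    labels -- (n_cases,) class label of every case
    W      -- (n_features, n_functions) discriminant coefficients
    Returns an (n_features, n_functions) matrix of correlations.
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    Xc = X.copy()
    for g in np.unique(labels):
        mask = labels == g
        Xc[mask] -= X[mask].mean(axis=0)  # remove each group's own mean
    Zc = Xc @ W                           # group-centered discriminant scores
    cross = Xc.T @ Zc                     # pooled cross-products
    norm_x = np.sqrt((Xc ** 2).sum(axis=0))
    norm_z = np.sqrt((Zc ** 2).sum(axis=0))
    return cross / np.outer(norm_x, norm_z)
```

A feature whose entry is close to +1 or -1 in a given column moves almost in lockstep with that function's score inside every class, whereas an entry near zero indicates that the function carries little information from that feature.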
Table 4-5 Contribution of the discriminant functions towards the classification problem Function % of Variance Cumulative % 1 49.4 49.4 2 21.8 71.2 3 11.9 83.1 4 6.3 89.4 5 5.1 94.5 6 2.6 97.1 7 1.6 98.7 8 .5 99.2 9 .3 99.6 10 .3 99.8 11 .2 100.0 Table 4-5 showed that for a 97% overall accuracy with the training data, it would be enough to consider the first 6 functions and discard the remaining ones. This suggests that the following features (Slope_LR2, Slope_LR1, LR2, Slope_UP2, UP2, Slope_UP1, Slope_UL2, UL1, UL3, UL2, LR3) contribute more to the discrimination than the remaining ones (slope_UL1, UP1, UP3, and LR1). 77 Table 4-6 The contribution of each feature towards the discrimination (Structural matrix) Feature LR2 Function 1 2 3 4 5 6 7 8 9 10 -.563* -.522 .190 .401 -.158 .034 .087 -.023 -.213 .096 11 .149 Slope_UP2 -.529* .177 -.445 -.163 .022 .096 -.278 -.286 .019 .228 -.085 Slope_UP1 .528* -.351 .375 .050 -.116 .127 .081 -.133 .113 -.330 .045 UP2 .501* -.202 .306 -.121 -.290 .265 .346 .071 -.287 .250 -.306 Slope_LR2 .471 .690* -.136 .296 .028 .313 -.025 -.042 -.074 -.137 -.060 Slope_LR1 -.473 -.669* .126 -.146 -.008 -.014 .032 .168 -.243 -.061 .046 LR1 .168 .460* -.029 .211 .013 -.269 .119 -.028 -.209 .151 UL3 -.020 -.075 -.172 -.074 .639* .411 -.178 .262 -.054 -.175 .342 UL1 -.095 -.091 -.169 -.045 .601* .494 -.252 .020 .048 -.348 -.063 UL2 -.250 .291 .040 -.350 .409 .465* .056 .216 .087 -.369 .266 .296 * Slope_UL1 -.180 .366 -.109 -.261 -.130 .109 .475 .132 .241 -.147 .340 Slope_UL2 .284 -.263 .241 .352 .358 -.243 -.429* .080 .015 .307 .049 LR3 .042 .326 .147 .411 .161 .189 .190 -.606* -.265 -.001 .237 UP1 -.024 .124 .055 -.064 .019 .092 .417 .202 -.255 .687* -.404 UP3 .029 .042 .147 -.167 -.101 .191 .078 -.133 -.338 .356 -.514* Table 4-7 shows the coefficients of the 11 discrimination functions calculated using the training set. 
The performance of the discriminator is tested by obtaining the dot product between an unknown record and the coefficients of all the functions and then assigning that record to the class associated with the function that resulted in the highest score [70]. 78 Table 4-7 Classification function coefficients Function features 1 2 3 4 5 6 7 8 9 10 11 12 UL1 1.211 2.483 .560 2.500 .858 .236 -.609 1.831 -.560 -.110 .565 2.436 UL2 -5.624 -8.215 -1.015 -7.009 -1.785 -2.039 -2.108 -6.111 .652 -3.233 -.438 -3.672 UL3 6.229 1.600 4.675 1.797 5.249 LR1 -4.034 -2.710 -2.709 -1.492 -3.211 -2.910 -3.925 -2.135 -2.256 -1.469 -3.331 -2.922 LR2 2.044 2.511 1.309 1.541 LR3 -2.414 -2.680 -2.548 -1.055 -2.690 -2.149 -.966 -2.103 -1.813 -2.465 -1.875 -1.753 UP1 -.512 UP2 UP3 7.341 2.217 .935 2.892 7.034 3.244 1.776 -3.724 2.812 3.788 4.983 6.193 1.743 .319 3.285 2.480 .189 -.198 .757 -.656 1.153 1.611 .120 .106 .558 2.054 4.111 -1.888 .007 3.157 .077 .259 -.662 -1.900 .551 1.046 -.339 -2.021 -3.758 -.671 -2.863 -1.708 -.648 -.358 -.336 -1.221 -1.459 -.793 -.416 -.566 Slope_UL1 5.891 34.763 -22.292 23.503 -14.189 -34.393 -36.739 19.305 -26.315 -9.467 -18.495 17.989 Slope_UL2 17.023 -37.834 29.650 22.780 54.386 29.698 72.226 -22.957 42.830 52.321 32.218 12.492 Slope_LR1 -14.972 -13.460 -5.054 12.706 -20.396 -14.944 -22.420 -7.375 -13.218 -29.758 -29.647 -7.793 Slope_LR2 17.473 14.018 6.296 -13.222 16.678 12.961 -9.184 9.621 6.737 68.177 15.272 21.009 Slope_UP1 14.955 31.766 -6.844 20.020 13.488 -24.260 74.064 27.627 -11.702 18.793 6.345 38.253 Slope_UP2 48.609 -29.999 -6.792 27.674 28.107 7.123 -48.588 -27.771 9.683 -7.070 7.550 -.204 (Constant) -23.418 -19.112 -12.345 -23.949 -14.781 -13.717 -23.829 -13.721 -7.625 -18.136 -9.602 -16.592 4.1.1.2 Speaker-Based Training with 6 Classes Owen’s work summarized in Table 2-9 showed that the visual identification of the VCV sounds was not affected by the change in the vowel involved in the VCV sequence. 
This means that viewers in Owen’s experiments assigned both /ɑbɑ/ and /ibi/ to the same viseme class. Results for the 12 classes shown in Table 4-4 indicate that when the vowels were considered different classes, the classifier achieved 55.3% correct classification overall. However, the results showed a level of mutual confusion between consonants associated with both vowels. In this section, the speaker-based training is performed assuming that sounds with the same consonant and different vowels belong to the same class. This reduced the number of classes to the six shown in Table 4-8:

Table 4-8 Six classes resulting from combining words for the same vowel

Class #    VCV Sounds
1          /ɑbɑ,ibi/
2          /ɑvɑ,ivi/
3          /ɑðɑ,iði/
4          /ɑwɑ,iwi/
5          /ɑdɑ,idi/
6          /ɑzɑ,izi/

The classification results for this training set are shown in Table 4-9, and the average percentage of correct classification was 65.2%. Merging the vowels into one class increased the performance in recognizing some of the sounds. For example, the recognition rates for /ɑvɑ/ and /ɑðɑ/ in the 12-class case were 59.7% and 27.8% respectively. These percentages increased to 74.3% and 68.1% respectively when the vowels were merged. Classes associated with the consonants /b/ and /w/ had the highest recognition scores in both the 12-class and 6-class cases. Classes associated with the consonants /d/ and /z/ had the poorest performance in both the 6-class and 12-class configurations. In the 6-class configuration, 33 utterances involving the consonant /d/ were classified as /z/, and 22 utterances involving the consonant /z/ were classified as /d/. The total number of mis-classified utterances for both sounds was 173, and 55 of them (32%) were mutually confused. This confusion is shown in cells marked with thick borders in Table 4-9. The mutual confusion between visemes representing sequences /z/ and /d/ is a result of them having similar visual articulators, as discussed in the previous section.
Table 4-9 Classification results and the confusion matrix 6 class speaker based (rows: original class, columns: predicted group membership)

Classes      /ɑbɑ,ibi/  /ɑvɑ,ivi/  /ɑðɑ,iði/  /ɑwɑ,iwi/  /ɑdɑ,idi/  /ɑzɑ,izi/  Total
/ɑbɑ,ibi/       130          5          4          1          1          3      144
/ɑvɑ,ivi/        13        107         10          1         10          3      144
/ɑðɑ,iði/         4         10         98          2          9         21      144
/ɑwɑ,iwi/        19          8          4        113          0          0      144
/ɑdɑ,idi/         4         23         28          1         55         33      144
/ɑzɑ,izi/         2         17         41          2         22         60      144
%              90.3       74.3       68.1       78.5       38.2       41.7

Other confusion patterns appear in the table between classes involving the consonant /ð/ and the pair /d,z/. 99 of the 224 mis-classified utterances involving these sounds were mutually confused. In addition, although /b/ and /v/ had recognition rates higher than the overall performance, 18 of the 51 mis-classified utterances for /b/ and /v/ were mutually confused. The structural matrix associated with the resulting classification functions is shown in Table 4-10. Features that have a high correlation with the score of each function are shown in bold fonts. The top six significant variables for the first function are LR2, Slope_UP1, Slope_UP2, UP2, Slope_LR1, and Slope_LR2. In the second function, Slope_UL1 became important, while Slope_UP2 and UP2 did not have a high correlation with the score associated with the function. The results in Tables 4-9 and 4-10 are discussed further in Chapter 5.
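The summary figures quoted for the 6-class configuration can be recomputed directly from the counts in Table 4-9. The following sketch does so with NumPy; the variable and function names are hypothetical, and only the counts come from the table.

```python
import numpy as np

# Rows: original class; columns: predicted class (counts from Table 4-9).
consonants = ["b", "v", "ð", "w", "d", "z"]
cm = np.array([
    [130,   5,   4,   1,   1,   3],
    [ 13, 107,  10,   1,  10,   3],
    [  4,  10,  98,   2,   9,  21],
    [ 19,   8,   4, 113,   0,   0],
    [  4,  23,  28,   1,  55,  33],
    [  2,  17,  41,   2,  22,  60],
])

# Percent correct for each class and for all utterances together.
per_class = 100 * np.diag(cm) / cm.sum(axis=1)
overall = 100 * np.trace(cm) / cm.sum()


def mutual_confusion(cm, a, b):
    """Return (mutually confused count, total errors) for classes a and b."""
    i, j = consonants.index(a), consonants.index(b)
    mutual = cm[i, j] + cm[j, i]                     # a as b plus b as a
    errors = cm[[i, j]].sum() - cm[i, i] - cm[j, j]  # all errors in both rows
    return int(mutual), int(errors)


dz = mutual_confusion(cm, "d", "z")
bv = mutual_confusion(cm, "b", "v")
```

With these counts, `overall` rounds to 65.2, `dz` is (55, 173), and `bv` is (18, 51), matching the values reported in the text for the /d/-/z/ and /b/-/v/ pairs.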
Table 4-10 The contribution of each feature towards the discrimination (Structural matrix)

Feature       Function 1    2         3         4         5
Slope_UP2     -.506*        -.392     -.213     .197      .134
UP2           .502*         .374      .346      -.150     -.050
LR2           -.569         .683*     -.171     -.003     -.025
Slope_LR1     -.430         .577*     -.012     -.098     -.475
Slope_LR2     .412          -.524*    -.007     .291      .306
Slope_UP1     .513          .520*     .227      .018      -.003
Slope_UL1     -.163         -.402*    .196      -.312     -.002
LR1           .158          -.371*    -.072     -.111     .103
Slope_UL2     .252          .337*     -.161     .297      .011
UP3           .032          .011      .246*     .065      .042
UL1           -.090         -.049     -.141     .673*     -.352
UL3           -.020         -.069     -.142     .568*     -.555
LR3           .041          -.134     -.044     .354*     .218
UP1           -.019         -.074     .092      -.146*    .015
UL2           -.235         -.368     .317      .358      -.401*

4.1.1.3 Speaker-Based Training with 5 Classes

One of the objectives of this study was to evaluate how well these features could discriminate between the /d/ and the /z/ sounds. Both of these sounds were associated with the same class shown in Table 2-11. Table 4-9 shows that these two sounds are highly confused with each other. Utterances of these sounds are merged, resulting in the 5-class situation shown in Table 4-11. The classification results for this training set are shown in Table 4-12. The average percentage of correct classification was 72.1%.

Table 4-11 Five classes resulting from combining /VdV/ with /VzV/

Class #    VCV Sounds
1          /ɑbɑ,ibi/
2          /ɑvɑ,ivi/
3          /ɑðɑ,iði/
4          /ɑwɑ,iwi/
5          /ɑdɑ,idi,ɑzɑ,izi/

Table 4-12 Classification results and the confusion matrix 5 class speaker based (rows: original class, columns: predicted group membership)

Classes      /ɑbɑ,ibi/  /ɑvɑ,ivi/  /ɑðɑ,iði/  /ɑwɑ,iwi/  /VzV,VdV/  Total
/ɑbɑ,ibi/       128          4          3          1          8      144
/ɑvɑ,ivi/        13         98          8          1         24      144
/ɑðɑ,iði/         2          5         97          3         37      144
/ɑwɑ,iwi/        19          6          3        112          4      144
/VzV,VdV/         6         35         58          1        188      288
%              88.9       68.1       67.4       77.8       65.3

The performance of individual classes between the 6-class and 5-class configuration did not change except for the class involving consonants /d/ and /z/.
These two consonants had 38% and 42% correct classification in the 6-class case, and merging them raised the correct classification rate of the merged class to 65.3%. Classes associated with the consonants /b/ and /w/ continued to have the highest percentage of correct classification. The mutual confusion observed in the 6-class configuration between classes associated with consonants /d/, /z/ and /ð/ remained (69% of the mis-classified utterances were mutually confused).

Table 4-13 The contribution of features towards the discrimination (Structural matrix)

Feature       Function 1    2         3         4
Slope_UP2     -.491*        -.410     -.234     .244
UP2           .486*         .386      .358      -.224
LR2           -.600         .644*     -.149     -.053
Slope_UP1     .497          .540*     .244      -.043
Slope_LR1     -.465         .538*     .045      -.133
Slope_LR2     .446          -.477*    -.045     .242
Slope_UL1     -.134         -.411*    .196      -.337
UL2           -.214         -.367*    .347      .298
LR1           .179          -.354*    -.092     -.134
UP3           .045          .015      .244*     .014
UL1           -.098         -.040     -.115     .603*
UL3           -.038         -.061     -.106     .527*
Slope_UL2     .215          .348      -.169     .390*
LR3           .072          -.111     -.054     .199*
UP1           -.009         -.074     .092      -.159*

The structural matrix associated with the resulting discrimination functions is shown in Table 4-13. There was no change in the order of importance of the features in the first function between the 5-class and 6-class cases. Slope_UP2 became a significant feature in the second function, and LR2 remained the feature with the highest correlation with the discrimination score in the first function for all 3 class combinations presented. These results are discussed further in Chapter 5.

4.1.1.4 Speaker-Based Training with 3 Classes

Table 4-12 shows that the classes /ɑðɑ,iði/ and /VzV,VdV/ (consonants /ð/, /d/, and /z/) have a great deal of mutual confusion. The confusion between classes /ɑbɑ,ibi/ and /ɑvɑ,ivi/ continued to appear in all class configurations tested. In this section, these classes are combined and a discriminator is designed based on the resulting training data. Table 4-14 shows the resulting classes after combining the classes.
Table 4-14 Three classes resulting from combining /VdV/ with /VzV/

Class #    VCV Sounds
1          /ɑbɑ,ibi/, /ɑvɑ,ivi/
2          /ɑðɑ,iði/, /ɑdɑ,idi/, /ɑzɑ,izi/
3          /ɑwɑ/, /iwi/

The classification results for this training set are shown in Table 4-15. The average percentage of correct classification was 84.4%. Classes associated with consonants /b/ and /w/ continued to have the highest classification score. In addition, in the 3-class configuration, the class involving partial mouth closure (/w/) was confused with the class involving complete mouth contact, such as /b/.

Table 4-15 Classification results and the confusion matrix 3 class speaker based (rows: original class, columns: predicted group membership)

Classes                  /ɑb,vɑ,ib,vi/   /ɑwɑ/,/iwi/   /ɑð,d,zɑ/ /ið,d,z,i/   Total
/ɑb,vɑ,ib,vi/            240             1             47                     288
/ɑwɑ/,/iwi/              28              110           6                      144
/ɑð,d,zɑ/ /ið,d,z,i/     47              6             379                    432
%                        83.3            76.4          87.7

The structural matrix associated with the discrimination functions is shown in Table 4-16. The structural matrix shows that LR2, Slope_LR1, and Slope_LR2 have the highest correlation with the scores of the first and second functions. The second function’s score is also correlated with UL2, Slope_UP1, and Slope_UL1. The results associated with the speaker training are discussed further in the next chapter.

Table 4-16 The contribution of each feature towards the discrimination (Structural matrix)

Feature       Function 1    2
LR2           .666*         -.603
Slope_LR1     .529*         -.498
Slope_LR2     -.502*        .474
Slope_UP2     .399*         .364
UP2           -.377*        -.288
UL2           .227          .557*
Slope_UP1     -.371         -.415*
Slope_UL1     .122          .415*
Slope_UL2     -.216         -.361*
LR1           -.223         .277*
LR3           -.061         .164*
UL1           .084          .109*
UL3           .009          .095*
UP3           -.011         .087*
UP1           .015          .085*

4.1.2 Testing Models developed by Speaker-based Training

Discriminant models developed using the training set were tested with features extracted from word utterances coming from 9 speakers who were different from the speakers used to develop the discriminant models.
The results of testing are shown in Table 4-17, where the VCV sequences included in each of the classes are given in Tables 4-2, 4-8, 4-11, and 4-14 respectively. Performance of the developed LDA functions in the testing set for the 6-class, 5-class and 3-class configurations was close to the training results. The test results for the 12-class configuration were about 12 percentage points lower than the training results. This large drop in the test set can be attributed to the difficulty of distinguishing the two vowels from each other.

Table 4-17 Testing the Fisher functions developed in speaker based training

                      % correct Classification
Number of Classes     Training     Testing
Twelve                55.3         43.1
Six                   65.2         59.91
Five                  72.1         69.93
Three                 84.4         83.37

The confusion matrices associated with the testing phase are presented in Tables 4-18 through 4-21. Classes /ɑbɑ/, /ɑwɑ/ and /iwi/ had a high percentage of correct classification when the testing was done for 12 classes, which is consistent with training results. This indicates that the features captured the uniqueness of the visual cues associated with these sounds fairly well. The pairs of classes </ɑðɑ/, /iði/>, </ɑwɑ/, /iwi/>, </ɑdɑ/,/idi/>, </ibi/,/ɑbɑ/>, and </ivi/,/ɑvɑ/> were mutually confused. This confusion is consistent with the common viseme-to-phoneme mappings shown in Table 2-9, in which VCV sequences with the same consonant were assigned to the same viseme class. The mutual confusion between VCV sequences involving /d/ and /z/ appeared in the test set as it appeared in the training set. Comparing the testing results from Table 4-18 with the training results of Table 4-4, the confusion patterns between utterances in both sets are very similar to each other. When the number of classes was reduced to 6, the confusion patterns discussed in the 6-class training set continued to exist in the 6-class testing set. These results are shown in Table 4-19.
The mutual confusion between classes /d/ and /z/ is consistent with the results of the common viseme-to-phoneme mappings shown in Table 2-11. The overall percent of correct classification in the test set was 59.91%. In addition, classes involving consonants /b/ and /v/ were mutually confused as they were in the training set.

Table 4-18 Confusion matrix for the 12 class testing phase (correctly classified utterances and percentage correct for each original class)

Class     Correct   Total   %
/ɑbɑ/     63        72      87.5
/ɑvɑ/     41        72      56.94
/ɑðɑ/     17        72      23.61
/ɑwɑ/     48        72      66.67
/ɑdɑ/     10        72      13.89
/ɑzɑ/     16        72      22.22
/ibi/     7         72      9.72
/ivi/     19        72      26.39
/iði/     29        72      40.28
/iwi/     68        72      94.44
/idi/     32        72      44.44
/izi/     23        72      31.94

The classification results after merging the VCV sequences for the consonants /d/ and /z/ together are shown in Table 4-20. The percentage of correct classification with the test set in this case was 69.93%. The confusion patterns in the training set between the three classes /ɑðɑ/,/iði/, and classes associated with consonants /z/,/d/ continued in the test set. In addition, classes /ɑbɑ,ibi/ and /ɑvɑ,ivi/ had mutual confusion in both training and testing. The results of testing the 3-class models are presented in Table 4-21. The overall percentage of correct classification was 83.37%.
Table 4-19 Confusion matrix for the 6 class testing phase

                            Predicted Group Membership
Original Class   /ɑbɑ,ibi/   /ɑvɑ,ivi/   /ɑðɑ,iði/   /ɑwɑ,iwi/   /ɑdɑ,idi/   /ɑzɑ,izi/   Total   %
/ɑbɑ,ibi/        117         8           2           15          0           2           144     81.25
/ɑvɑ,ivi/        21          87          2           7           14          13          144     60.42
/ɑðɑ,iði/        3           12          98          0           3           28          144     68.1
/ɑwɑ,iwi/        19          1           2           122         0           0           144     84.72
/ɑdɑ,idi/        12          15          32          0           46          39          144     31.94
/ɑzɑ,izi/        11          28          24          2           31          48          144     33

In summary, the pattern of confusions between consonants was similar in both training and testing sets for the different class configurations used in this study. This indicates that the models developed using known utterances can effectively identify unknown utterances. It also indicates that the set of visual features used in this study is capable of representing the VCV sequences used in the study.

Table 4-20 Confusion matrix for the 5 class testing phase

                            Predicted Group Membership
Original Class   /ɑbɑ,ibi/   /ɑvɑ,ivi/   /ɑðɑ,iði/   /ɑwɑ,iwi/   /VzV,VdV/   Total   %
/ɑbɑ,ibi/        114         7           0           18          5           144     79.17
/ɑvɑ,ivi/        21          76          2           8           37          144     52.8
/ɑðɑ,iði/        2           8           104         1           29          144     72.2
/ɑwɑ,iwi/        18          0           1           124         1           144     86.1
/VzV,VdV/        27          43          56          0           162         288     56.25

Table 4-21 Confusion matrix for the 3 class testing phase

                                 Predicted Group Membership
Original Class             /ɑb,vɑ,ib,vi/   /ɑwɑ/,/iwi/   /ɑð,d,zɑ/,/ið,d,z,i/   Total   %
/ɑb,vɑ,ib,vi/              215             23            50                     288     74.65
/ɑwɑ/,/iwi/                19              123           2                      144     85.42
/ɑð,d,zɑ/,/ið,d,z,i/       39              4             389                    432     90.04

Class /ɑb,vɑ,ib,vi/ represents sounds that require complete mouth contact to produce them. Class /ɑwɑ,iwi/ involves sounds that require partial mouth closure in producing them. Both classes were mutually confused in both training and testing set results.

4.1.3 Word-Based Training

The training and testing sets in word-based training are formed by randomly dividing all the word utterances into two equal sized sets.
The first half of the utterances is used for training and developing the coefficients of the discrimination functions, and the second half is used for testing the coefficients developed in the training process. The LDA functions were calculated for 12-class, 6-class, 5-class, and 3-class configurations. The classification function coefficients with their classification results are shown in Appendix A for the four configurations. The structural matrix for the word-based training with 12 classes is shown in Table 4-22. SPSS identified the features that have high correlation with the score of each function, and those features are marked in bold in Table 4-22. Comparing those features with the features identified by SPSS in speaker-based training with 12 classes, shown in bold in Table 4-6, indicates that both training sets had the same features highly correlated with the score of the first and second LDA functions. This resemblance of features continued to exist between word-based and speaker-based training with 6 classes and 5 classes, as indicated in Table 4-10 and Table 4-23 for the 6-class training, and Table 4-13 and Table 4-24 for the 5-class training.
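The difference between the two ways of dividing the data can be sketched as follows. This is an illustration only: the record layout, the function names, the random seed, and the choice of speakers 1 to 9 as the training speakers are assumptions made here, not details taken from the study. The data set size (18 speakers, 12 VCV words, 8 utterances of each word) follows the description given in the text.

```python
import random


def speaker_based_split(utterances, train_speakers):
    """Train on utterances from the listed speakers and test on the rest."""
    train = [u for u in utterances if u["speaker"] in train_speakers]
    test = [u for u in utterances if u["speaker"] not in train_speakers]
    return train, test


def word_based_split(utterances, seed=0):
    """Shuffle all utterances and cut the list into two equal halves."""
    shuffled = list(utterances)
    random.Random(seed).shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


# Placeholder records: 18 speakers x 12 VCV words x 8 repetitions.
utterances = [
    {"speaker": s, "word": w, "rep": r}
    for s in range(1, 19)
    for w in range(12)
    for r in range(8)
]

# Assumption: speakers 1-9 train, speakers 10-18 test.
spk_train, spk_test = speaker_based_split(utterances, set(range(1, 10)))
wrd_train, wrd_test = word_based_split(utterances)
print(len(spk_train), len(spk_test), len(wrd_train), len(wrd_test))
```

In the speaker-based split no speaker appears in both sets, whereas in the word-based split the two halves share speakers, which is the distinction the comparison between the two training schemes rests on.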
94 Table 4-22 Structural matrix for word-based training with 12 classes Features Functions 1 2 3 4 5 * Slope_LR1 .580 -.534 .055 .204 .184 LR2 * .548 -.283 .426 .326 .026 6 7 8 .005 -.144 -.185 9 10 11 .267 .364 -.022 -.087 -.249 .216 .190 -.054 -.085 Slope_UP2 .492* .174 .054 -.393 -.222 -.003 .342 .303 .207 -.085 .150 Slope_UP1 -.432* -.340 -.140 .408 .302 -.093 .063 .105 -.245 Slope_LR2 -.547 .583* .281 .069 .041 .043 .284 -.042 .014 .079 -.132 .157 -.057 LR1 -.215 .443* .072 -.109 -.088 .023 -.355 .412 .318 -.057 -.196 UP2 -.406 -.201 -.219 .506* .405 .210 .058 .161 -.234 .387 * -.227 -.074 .163 -.007 .162 * -.199 .000 -.134 -.384 -.127 * -.220 -.067 -.021 -.376 Slope_UL2 -.266 -.242 .197 -.129 -.105 -.692 Slope_UL1 .160 .363 -.211 .029 .024 .442 .053 UL1 .097 -.054 .225 -.414 .457 -.160 .490 LR3 -.046 .353 .408 .202 .398 -.181 -.213 .485* UP3 -.021 .054 -.169 .205 .333 .049 .131 .243 -.321 .659* UP1 -.033 .101 -.087 .142 .195 .293 -.030 -.026 .477 -.252 .600* UL3 .023 -.064 .093 -.420 .236 -.189 .253 -.336 .254 -.355 -.534* UL2 .246 .327 -.209 -.190 .398 .060 .295 -.317 .020 -.288 -.507* .166 .048 .070 -.174 Table 4-25 shows the structural matrix for word-based training with 3 classes. Comparing the structural matrix for 3 class configurations in word-based training with the structural matrix for 3 class configurations in speaker-based training shown in Table 4-16, the variables that are highly correlated with the classification score of the LDA functions were different from those resulting from the 12, 6, and 5 class configurations. 
Table 4-23 Structural matrix for word-based training with 6 classes

Feature      Function 1   Function 2   Function 3   Function 4   Function 5
Slope_UP2     .505*       -.216        -.382         .096         .006
Slope_UP1    -.464*        .345         .448         .219        -.068
Slope_LR1     .482         .640*        .188        -.036         .380
LR2           .536         .616*       -.038         .030        -.094
Slope_LR2    -.402        -.469*       -.129         .197        -.064
LR1          -.159        -.400*       -.204        -.115        -.100
Slope_UL1     .182        -.393*        .196        -.359        -.005
UP2          -.433         .215         .639*        .089        -.049
UP3          -.022        -.058         .344*        .238        -.147
Slope_UL2    -.270         .263        -.312*        .281        -.237
UP1          -.028        -.085         .239*       -.086         .049
LR3          -.010        -.109        -.098         .340*       -.158
UL3           .013        -.007        -.263         .345         .599*
UL1           .081         .025        -.286         .556         .563*
UL2           .259        -.422         .139         .279         .545*

Table 4-26 summarizes the classification performance associated with word-based training and speaker-based training for different numbers of classes. The results of testing the LDA functions developed using word-based training with unknown word utterances were similar to speaker-based testing results. The correct classification percentages for the test set are shown in Table 4-27. The confusion matrices associated with testing word-based functions are included in Appendix A.

Table 4-24 Structural matrix for word-based training with 5 classes

Feature      Function 1   Function 2   Function 3   Function 4
Slope_UP2    -.505*       -.227        -.403         .076
Slope_UP1     .461*        .356         .441         .221
Slope_LR1    -.504         .613*        .169        -.037
LR2          -.548         .595*       -.053        -.037
Slope_LR2     .435        -.441*       -.084         .101
UL2          -.245        -.410*        .141         .288
Slope_UL1    -.168        -.400*        .248        -.348
LR1           .182        -.399*       -.166        -.185
UP2           .431         .226         .639*        .120
Slope_UL2     .250         .274        -.398*        .351
UP3           .033        -.045         .324*        .273
UP1           .034        -.081         .251*       -.065
UL1          -.078         .036        -.302         .456*
UL3          -.028        -.004        -.307         .353*
LR3           .044        -.090        -.059         .183*

In a typical classification problem, models are developed based on known parameters; the developed models are tested with parameters coming from unknown sources.
The analysis of the results for structural matrices, the classification results of Table 4-26 and the testing results of Table 4-27 suggest that the performance of the discrimination in the training and testing parts is not much affected by the way the data is divided. The speaker based training develops a discrimination model for utterances coming from known speakers, and then this model can be used to classify utterances coming from unknown speakers, which is a configuration that is closer to a typical classification problem. In the remaining part of this research, tests are performed on models developed by speaker based training.

Table 4-25 Structural matrix for word-based training with 3 classes

Feature      Function 1   Function 2
LR2           .598*       -.536
Slope_LR2    -.458*        .454
Slope_UP2     .454*        .264
Slope_UP1    -.377*       -.306
UP2          -.346*       -.200
Slope_UL2    -.263*       -.244
UL1           .083*        .067
Slope_LR1     .563        -.588*
UL2           .255         .502*
LR1          -.211         .350*
Slope_UL1     .155         .342*
LR3          -.013         .172*
UP3           .002         .117*
UP1          -.024         .074*
UL3           .006         .052*

Table 4-26 Comparing performance results between speaker based and word based training

                      % correct Classification
Number of Classes     Speaker based Training     Word based Training
Twelve                55.3                       53.9
Six                   65.2                       62.3
Five                  72.1                       71.8
Three                 84.4                       85.1

Table 4-27 Testing speaker-based and word-based LDA functions

                      % correct Classification
Number of Classes     Speaker-Based testing      Word-Based testing
Twelve                43.1                       49.42
Six                   59.91                      62.15
Five                  69.93                      71.32
Three                 83.37                      83.45

4.2 STEP-WISE ANALYSIS

In step-wise analysis, the features are applied to the discrimination problem one at a time. In every step of the analysis, the feature with the highest statistical value in discrimination is entered. This process is continued until all features are applied or the user-set thresholds for entering and removing variables are reached. Table 4-28 shows which features were entered in every step of the analysis for different class configurations.
The order in which these features enter into the analysis indicates how important those features are in the discrimination problem. Features admitted to the analysis at later steps have a smaller contribution towards the discrimination problem. Figure 4-1 shows the effect of adding one feature at a time on the classification performance for different class configurations with the training set. Adding more features in each classifier improves the performance. The increase in the performance becomes stable after adding the 7th feature into the analysis. In the 3 class case, the first two features LR2 and UL2 were enough to obtain a high classification score, and adding more features resulted in a small drop in the performance. Table 4-29 shows the training and testing results obtained when the discrimination problem included the top 7 features captured in the step-wise analysis. The LDA functions obtained using these seven features were tested to determine the impact of ignoring 8 features on the discrimination problem.

Figure 4-1 Effect of adding features on the classification performance (horizontal axis: Number of Features; vertical axis: % Correct Classification; curves: 12-Class, 6-Class, 5-Class, 3-Class)

Comparing these results with the summary of the classification performance when all 15 features are included (Table 4-17) shows that, in the training set, these 7 features performed almost as well as the 15 features. Ignoring those features resulted in a 4.8% drop in the training performance for the 12 and 6 class cases. The performance difference is almost negligible in the case of training for 5 and 3 classes. In the testing phase, ignoring 8 features resulted in a 3.8% drop for the 12 class case and 1% or less in the remaining classes.
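The idea of entering one feature per step can be sketched with a greedy forward selection. This is a simplified stand-in and not the SPSS step-wise procedure used in the study: the entry criterion here is the training accuracy of a nearest-class-mean rule rather than a statistical entry and removal test, no removal step is included, and the feature values and class labels below are invented toy data.

```python
def centroid_accuracy(rows, labels, features):
    """Training accuracy of a nearest-class-mean rule on the chosen features."""
    classes = sorted(set(labels))
    means = {}
    for c in classes:
        members = [r for r, l in zip(rows, labels) if l == c]
        means[c] = [sum(r[f] for r in members) / len(members) for f in features]
    correct = 0
    for r, l in zip(rows, labels):
        dist = {
            c: sum((r[f] - m) ** 2 for f, m in zip(features, means[c]))
            for c in classes
        }
        if min(classes, key=dist.get) == l:
            correct += 1
    return correct / len(rows)


def forward_selection(rows, labels, names, min_gain=0.0):
    """Enter one feature per step; stop when no candidate improves the score."""
    chosen, best = [], 0.0
    while len(chosen) < len(names):
        scored = [
            (centroid_accuracy(rows, labels, chosen + [n]), n)
            for n in names
            if n not in chosen
        ]
        score, name = max(scored, key=lambda t: t[0])
        if chosen and score - best <= min_gain:
            break
        chosen.append(name)
        best = score
    return chosen, best


# Toy data (invented): three classes, three features, two samples per class.
rows = [
    {"LR2": 0.1, "UL2": 0.0, "UP1": 0.5},
    {"LR2": 0.2, "UL2": 0.1, "UP1": 0.4},
    {"LR2": 0.9, "UL2": 0.0, "UP1": 0.5},
    {"LR2": 1.0, "UL2": 0.1, "UP1": 0.4},
    {"LR2": 0.9, "UL2": 0.9, "UP1": 0.5},
    {"LR2": 1.0, "UL2": 1.0, "UP1": 0.4},
]
labels = ["w", "w", "b", "b", "d", "d"]

order, score = forward_selection(rows, labels, ["LR2", "UL2", "UP1"])
print(order, score)
```

On this toy data the third feature is never entered because it adds nothing once the first two separate the classes, which mirrors the observation that later features contribute little to the discrimination.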
Table 4-28 Features in the order of their importance in different classes

Step number 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 12 Classes 6 Classes 5 Classes 3 Classes LR2 LR2 LR2 LR2 Slope_UP1 Slope_UP1 Slope_UP1 UL2 Slope_LR2 UL2 UL2 UL3 Slope_UL2 UL3 UL3 Slope_UP2 UL2 UP2 Slope_UP2 Slope_UL2 UL3 LR1 LR1 LR1 UP2 UP3 Slope_UL2 Slope_UP1 LR1 Slope_UL2 UP2 Slope_UL1 UP1 Slope_LR2 Slope_LR2 UL1 Slope_UP2 Slope_UP2 UP3 UP2 UP3 Slope_UL1 Slope_UL1 LR3 UL1 UL1 Slope_LR1 UP1 UP1 Slope_UL1 Slope_LR1 UL1 UP1

Table 4-29 Classification performance with the top 7 features

                      % correct Classification
Number of Classes     Training     Testing
Twelve                50.1         39.5
Six                   61.9         59.14
Five                  70.9         68.68
Three                 84.1         83.37

4.3 SPEAKER SPECIFIC DISCRIMINATION

The results presented so far were associated with models that included multiple speakers in their development. Despite the variability across speakers, the models developed were able to classify utterances into different groups. This section shows the results obtained by developing models for each individual speaker, i.e. models that can be used to identify unknown utterances coming from the same speaker. The data set used in this research consisted of 8 utterances of 12 different VCV words coming from 18 speakers. The utterances for every individual speaker are divided into training and testing sets. The training set consisted of 5 utterances of each VCV word and the testing set consisted of the 3 remaining utterances. Fisher discrimination functions were developed and tested for every speaker. The range of correct classification across speakers in the training and testing phases for different classes is summarized in Table 4-30. The percentages of correct classification for every speaker for models with different class configurations are presented in Figures 4-2 through 4-5.
Table 4-30 Range and average of correct discrimination for 18 speaker-specific models

# of classes # of classes # of classes # of classes 3 5 6 12 Trainin Trainin Testing Training Testing g Average % correct 88.3100 98.3 81.94 93.9 77.49 5090-100 80-100 100 Testing g 66.7- 50- 85-100 90.06 Testing g 55.6Range of % correct Trainin 86.1 91.7 92.87 75.31 96.29 68.06

Figure 4-2 Training and testing results for every speaker (3 class configuration) (horizontal axis: Speaker, 1 to 18; vertical axis: % correct; series: Training, Testing)

Figure 4-3 Training and testing results for every speaker (5 class configuration) (horizontal axis: Speaker, 1 to 18; vertical axis: % correct; series: Training, Testing)

Figure 4-4 Training and testing results for every speaker (6 class configuration) (horizontal axis: Speaker, 1 to 18; vertical axis: % correct; series: Training, Testing)

Figure 4-5 Training and testing results for every speaker (12 class configuration) (horizontal axis: Speaker, 1 to 18; vertical axis: % correct; series: Training, Testing)

5.0 DISCUSSION

The objective of this research is to investigate the feasibility of using a small set of features extracted from moving lips to distinguish between the sounds produced. Audio-visual data representing 12 different VCV sounds were obtained from 18 speakers. Several visual features, as shown in Table 3-2, were extracted from the lip motion to represent each of these VCV sounds, and many tests using linear discriminant analysis were conducted on these features to study how well they can classify the different sounds.

5.1 SPEAKER-BASED VERSUS WORD-BASED TRAINING

In speaker-based training, the classifier was designed using features from 9 of the 18 speakers.
In word-based training, the classifier was designed using features extracted from half of the utterances from all 18 speakers. Each of these configuration schemes models a different problem. In speaker-based training, a model was developed based on features extracted from known speakers. The resulting model was tested using features from utterances produced by speakers who were not included in the development of the discrimination model. In word-based training, a model was developed based on the contribution of all 18 speakers in the data set. The resulting model was tested with unknown utterances coming from the same speakers used to develop the discrimination model. In this modeling scheme, the training and testing data come from the same source. The performance of both models on the training data is shown in Table 4-26. The performance of both models on test data is shown in Table 4-27. Results suggest that there is not much difference in the performance of the two. The features highly correlated with the score of the first and second discrimination functions across different classification classes did not change between the two training methods. This indicates that the method of dividing the data did not result in significant differences in the classification performance.

5.2 PERFORMANCE FOR DIFFERENT NUMBER OF CLASSIFICATION CLASSES

The ideal performance of a classifier would be to distinguish between all 12 VCV words listed in section 3.5. However, Owen’s results shown in Table 2-9 suggest that changing the vowel does not affect the classification done visually. Table 4-4 shows the classification results when the training was done by associating every VCV word to a different class. The average percent of correct classification was 55.3%, and the correct classification percentage for each class was more than that of chance (8%).
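The chance levels quoted in this discussion follow from dividing 100% by the number of classes. A small sketch of that arithmetic is given below, using the overall training percentages from Table 4-17; it assumes uniform guessing over equally likely classes, which is a simplification for the 5 and 3 class configurations where the merged classes contain more utterances than the others. The function and variable names are illustrative.

```python
def chance_level(num_classes):
    """Percent correct expected from uniform random guessing."""
    return 100.0 / num_classes


# Overall training percentages as reported in Table 4-17.
training_pct = {12: 55.3, 6: 65.2, 5: 72.1, 3: 84.4}

for k, pct in training_pct.items():
    chance = chance_level(k)
    print(f"{k} classes: chance {chance:.1f}%, reported {pct}%, "
          f"ratio {pct / chance:.1f}x")
```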
The highest classification scores were associated with classes /ɑbɑ/ (91%), /ɑwɑ/ (77.8%), and /iwi/ (75%), while the lowest scores were associated with /ɑðɑ/ (27%), /ɑdɑ/ (23%), and /ɑzɑ/ (33%). This indicates that the features used to represent these sounds capture certain characteristics in sounds /ɑbɑ/, /ɑwɑ/ and /iwi/ in a way that distinguishes them well from the other 9 VCV sounds. However, these features do not perform as well in capturing the difference between the sounds /ɑðɑ/, /ɑdɑ/, and /ɑzɑ/. In addition, the classification scores indicate that the features used in the study were able to detect some difference between the VCV sounds associated with both vowels. The classification scores and confusions associated with consonants /d/ and /z/ shown in Table 4-4 are consistent with the common viseme-to-phoneme mapping of Table 2-11, in which consonants /d/ and /z/ were assigned to the same visual class. Twenty percent of the misclassified utterances for these sounds are mutually confused, and these two classes have the poorest classification score. The discussion in section 2.2.2 on the visual perception of phonemes showed that changing the vowel associated with the consonant did not have an effect on how the observer classified the sequence. To compare more directly to Owen’s results, the classification problem was re-defined to assign words with the same consonants to the same class, reducing the number of classes to six as shown in Table 4-8. The classification scores for the training data when divided into six classes are shown in Table 4-9. The overall average percent of correct classification is 65.2%. Merging the vowels resulted in an increase of about 7% in the overall classification performance. As for individual VCV sounds, the sounds with the highest scores continued to be those with consonants /b/ and /w/ in the 6 class case, as they were in the 12 class case.
In addition, consonants /d/ and /z/ continued to have the lowest discrimination scores in the 6 class configuration, as they did in the 12 class configuration. Table 4-9 shows that the discrimination scores for classes associated with /d/ and /z/ were 38.2% and 41.7% respectively. It also shows that 32% of the mis-classified utterances for these sounds were mutually confused. This is consistent with the common viseme-to-phoneme mapping shown in Table 2-9, as discussed in section 4.1.1.2. This led to testing the classification performance when both /d/ and /z/ are considered to belong to the same class, reducing the number of classes to 5 as shown in Table 4-11. The five-class training results presented in Table 4-12 showed that the classes associated with the consonants /b/ and /w/ continued to be detected very well. In addition, merging classes /d/ and /z/ increased the overall percentage of correct classification from 65.2% to 72.1%, with individual scores for each class still 3 to 4 times higher than chance (20%). The /d/ and /z/ classes had the poorest performance in the 6 class configuration. When both classes were combined, the recognition for the combined class jumped to 65.3%. There is still confusion among the classes /ɑðɑ,iði/, /ɑdɑ,idi/, and /ɑzɑ,izi/, as shown in Table 4-12. Sixty seven percent of the mis-classified sounds from these sequences were mutually confused. In addition, although classes /ɑbɑ,ibi/ and /ɑvɑ,ivi/ involve full closure of lips and have high percentages of correct classification, there is still mutual confusion between both classes. Merging these classes results in a three class problem as shown in Table 4-14. The three classes presented in Table 4-14 are directly related to the role of the upper and lower lips in producing those sounds. Class 1 (/b/, /v/) represents sounds that involve complete lip contact. Class 3 (/w/) represents sounds that involve partial lip closure. Class 2 (/ð/, /d/, /z/) involves sounds with little contribution coming from the lips when producing them.
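The effect of treating two classes as one can be illustrated by collapsing an existing confusion matrix. The sketch below sums the /d/ and /z/ rows and columns of the six-class test counts in Table 4-19. This is only an approximation of what merging does: in the study the discriminant functions were re-derived for each class configuration, so the collapsed counts are not expected to equal those of Table 4-20 exactly. The function name and the short labels are introduced here for illustration.

```python
def collapse(matrix, labels, groups):
    """Sum rows and columns of a confusion matrix according to groups.

    groups maps each new class name to the list of old labels it absorbs.
    """
    index = {label: i for i, label in enumerate(labels)}
    names = list(groups)
    merged = [[0] * len(names) for _ in names]
    for a, new_row in enumerate(names):
        for b, new_col in enumerate(names):
            merged[a][b] = sum(
                matrix[index[r]][index[c]]
                for r in groups[new_row]
                for c in groups[new_col]
            )
    return names, merged


# Six-class test counts from Table 4-19 (rows: original, columns: predicted).
labels6 = ["b", "v", "ð", "w", "d", "z"]
table_4_19 = [
    [117, 8, 2, 15, 0, 2],
    [21, 87, 2, 7, 14, 13],
    [3, 12, 98, 0, 3, 28],
    [19, 1, 2, 122, 0, 0],
    [12, 15, 32, 0, 46, 39],
    [11, 28, 24, 2, 31, 48],
]
groups5 = {"b": ["b"], "v": ["v"], "ð": ["ð"], "w": ["w"], "d,z": ["d", "z"]}

names5, merged5 = collapse(table_4_19, labels6, groups5)
dz_row = merged5[names5.index("d,z")]
print(dz_row, f"{100.0 * dz_row[-1] / sum(dz_row):.1f}%")
```

Once /d/ and /z/ are pooled, their mutual confusions move onto the diagonal, which is why the combined class scores higher than either class did separately.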
The training results of the 3 class configuration are shown in Table 4-15, with overall performance increasing to 84.4%. When the number of classes in the training phase was reduced to 3, the features highly correlated with the discrimination scores of the first and second LDA functions were LR2, UL2, UP2, Slope_LR1, Slope_LR2, Slope_UP1, and Slope_UL1. The intuitive meaning of these features is shown in Table 5-1. Features related to the upper and lower lips became important, while some of the slope features became less correlated with the score of the function. The features having high correlation with the discrimination score of the first and second discrimination functions were the same for the 12 class, 6 class, and 5 class cases. These seven features were UP2, LR2, Slope_LR1, Slope_LR2, Slope_UP1, Slope_UL1, and Slope_UP2. These features are related to contributions of the upper lip in the sound production. They capture how fast the upper lip and the lip corners are moving while producing the consonant. The speech production literature discussed in Section 3-3 stated that the upper lip moves at different rates while producing different sounds. This can explain the significance of slope features at higher numbers of classes. The results presented in this section indicate that the features chosen to represent these VCV sounds provide good discrimination between these sounds. The optical reflectors used with the Motion Analyzer could not be used to track the tongue because the adhesive side does not stick to moist surfaces. The tongue plays a role in producing the consonant /ð/ but it has no role in producing the consonants /d/ and /z/. In the context of two dimensional images, the tongue is expected to appear in the image sequence when /ð/ is produced. Capturing this property may reduce the confusion between the classes /ð/ and /d,z/ presented in Table 4-12.

5.3 TESTING THE MODELS

Four different models were developed in the speaker training method.
Each of these models had a different number of classes to discriminate between. The Fisher functions for each model were tested using the utterances coming from 9 speakers that were not used in developing that model, as described earlier in section 3.5. The results of testing the LDA models for different classes are shown in Table 4-17. The second column shows the classification performance with the training set, and the third column shows the testing results. The testing results are lower than the training results, but the difference between the training and testing results decreases as more classes are merged. For the 3 class configuration, the testing and training results are almost identical. For the 12 class case, the drop in the testing results is higher than the drop in other classes. This shows that the automatic classifier performance drops when comparing sequences from two different vowels. The confusion matrix associated with testing the 12 class case shown in Table 4-18 suggests the following:

o Classes /ɑbɑ/, /ɑwɑ/ and /iwi/ have the highest percentage of correct recognition, as was the case with the training data.
o Classes /ɑdɑ/ and /ɑzɑ/ had low scores, which is consistent with common viseme-to-phoneme mappings that assigned both of them to the same class.
o Class /ibi/ is almost at chance and most of the /ibi/ utterances were classified as /ɑbɑ/, i.e. the discrimination functions did not capture the visual differences between the vowels “ɑ” and “i” for the consonant /b/.

Table 4-19 shows the testing results when the consonants associated with two vowels are assigned to the same class, reducing the number of classes to six. Consonants /d/ and /z/ are mutually confused in testing as they were in training. The results of testing the models after merging these two classes are shown in Table 4-20. The performance of the functions with the testing set was almost equal to the performance with the training set.
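The narrowing of the gap between training and testing results as classes are merged can be read directly from Table 4-17. The short sketch below computes the difference for each configuration; the dictionary layout and function name are illustrative choices, and the values are those reported in the table.

```python
# Percent correct from Table 4-17: (training, testing) per number of classes.
table_4_17 = {
    12: (55.3, 43.1),
    6: (65.2, 59.91),
    5: (72.1, 69.93),
    3: (84.4, 83.37),
}


def train_test_gap(results):
    """Return training minus testing percent correct for each configuration."""
    return {k: round(train - test, 2) for k, (train, test) in results.items()}


gaps = train_test_gap(table_4_17)
for k in sorted(gaps, reverse=True):
    print(f"{k:>2} classes: gap {gaps[k]:.2f} percentage points")
```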
The results of testing the model consisting of three classes are shown in Table 4-21. The overall performance of the Fisher functions with the testing data is almost the same as the results obtained using the training data. In addition, the confusion patterns between VCV sequences in the testing set are similar to those of the training set. Furthermore, some of the confusion patterns resulting from the automatic classification are similar to those patterns obtained from visual classification of visemes shown in Table 2-9. The utterances in the test set came from unknown speakers. Yet, the performance of the developed LDA functions with the unknown data is comparable with the performance of these functions with the training set. This indicates that the developed functions can be used to classify VCV utterances coming from unknown speakers.

5.4 STEPWISE ANALYSIS

The stepwise analysis is a technique to identify which variables contribute more to the discrimination problem. This analysis may help in reducing the dimensionality of the problem by discarding variables with small contribution toward the discrimination. Details of this analysis were discussed in section 3.4.4, and the results of the step-wise analysis for different class configurations are shown in Table 4-28. The intuitive meaning of the visual features used in this study is shown in Table 5-1. The feature LR2 is associated with how much the lip corners close during an utterance. In other words, it reflects the rounding of the lips when the consonant in the VCV sequence is produced. For configurations involving more than 3 classes, the LR2 feature was consistently picked to be the most significant feature in the analysis. The 2nd feature to be picked by the step-wise analysis was Slope_UP1, which represents how fast the upper lip moved towards the lower lip at the beginning of the consonant production in the VCV sequence.
When the number of classes involved is higher than 3, slope features representing how fast different points around the lips move in producing a sound become significant. When the training involved 3 classes, the rounding of the mouth (LR2) remained significant; however, the slope information was entered at later stages. The 2nd and 3rd significant features for the 3 class case were UL2 and UL3. This is expected, since the 3 class case can be classified based on how far the lips close during the production of the consonant (complete, partial or little contribution), as discussed in section 5.2. The features associated with the production of the consonant are picked up at early stages of the stepwise analysis, while amplitude features related to the 2nd opening to produce the 2nd vowel are picked up at later steps of the analysis. In addition, the slope features become more significant when the number of classification classes increases. This is expected, since the rate at which lips move is different between consonants, and for a larger number of classes the amount of mouth opening is not enough to capture the variation between the visual articulators. This study suggests that the features associated with the consonant in the word (second extrema) play a bigger role in distinguishing between the sounds than the other features. Table 4-29 shows the results of training and testing new models based on the top seven features appearing in Table 4-28. When comparing the results of developing and testing the LDA models with the top seven features with the results shown in Table 4-17 with all 15 features, the percentage of correct classification using 7 features is very close to the corresponding percentage when all features are used. When the LDA functions were developed based on 12 classes, the percent of correct classification was 55.3% when all 15 features were used. This percentage dropped by 4.8% to 50.1% when only 7 features were used.
Across the different numbers of classes, the performance when all features were included in the analysis was better than the performance when 7 features were used by only 2-5% in both training and testing. This suggests that adding features beyond the top seven shown in Table 4-28 results in a small improvement of 2-5% in performance. Reducing the number of features reduces the complexity of the model with minimal loss in classification accuracy.

Table 5-1 Intuitive meaning of the features used in the analysis

Feature      Intuitive Meaning of the Feature in producing the VCV sequence
UL1          How far the upper and lower lips go apart to produce the initial vowel
UL2          How close the upper and lower lips get to produce the consonant
UL3          How far the upper and lower lips go apart to produce the second vowel
LR1          How wide is the mouth when producing the initial vowel
LR2          How wide is the mouth when producing the consonant
LR3          How wide is the mouth when producing the second vowel
UP1          How much the upper lips moved to produce the initial vowel
UP2          How much the upper lips moved to produce the consonant
UP3          How much the upper lips moved to produce the second vowel
Slope_UL1    How fast the upper and lower lips moved in producing the consonant
Slope_UL2    How fast the upper and lower lips moved in producing the second vowel
Slope_LR1    How fast the lip corners moved in producing the consonant
Slope_LR2    How fast the lip corners moved in producing the second vowel
Slope_UP1    How fast the upper lips moved in producing the consonant
Slope_UP2    How fast the upper lips moved in producing the second vowel

The top 7 features when the training was done for 12 classes, shown in Table 4-28, were LR2, Slope_UP1, Slope_LR2, Slope_UL2, UL2, UL3, and UP2. Some of these features are different from the top seven features that were highly correlated with the discrimination scores of the first and second LDA function of the model (LR2, Slope_UP2, Slope_UP1, UP2, Slope_LR1, Slope_LR2, and LR1).
This indicates that a certain feature might contribute towards the discrimination of a specific LDA function, but the same feature might not have a significant contribution across all classes. The stepwise analysis evaluates the contribution of features at a more global level than the correlation analysis presented in the structural matrix.

These features provide a new look at existing viseme-to-phoneme mappings. Mappings presented in chapter 2 were based on information observed by the eyes identifying visual representations of sounds, while the features used in this study are simple and easy to extract. The success of the features in achieving good discrimination suggests that they sample the information observed by the eyes of a person experienced in lip reading.

5.5 SPEAKER SPECIFIC DISCRIMINATION

The results associated with the analysis of speaker-specific discrimination are shown in Table 4-30, where the average and the range of the percentage of correct classification across all 18 speakers are shown. Some of the speakers scored 100% correct classification in the training and even in the testing set, while some speakers performed poorly, particularly in the testing part. The poor performance in the testing part can be attributed in part to the fact that the training data consisted of 5 utterances and the testing data consisted of only 3 utterances for every word. Jackknife training and testing might provide better insight into speaker-specific classifiers.

Results presented in Table 4-30 show the overall average performance across speakers in training and testing. The average percentage of correct classification when the training was based on 12 classes was 96.29%. The average for testing with 12 classes was 68.06%. When all speakers were included in the training and testing, the percentage of correct classification was 55.3% in training and 43.1% in testing.
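The percentages of correct classification quoted in this discussion are derived from classification (confusion) tables such as those in the appendices. As a minimal sketch, the calculation can be illustrated with the counts from Table 7-12 (word-based testing with 3 classes). The function names are hypothetical and are not part of the original analysis, which was carried out in SPSS. The overall figure printed in Table 7-12 (83.45) appears to match the unweighted mean of the per-class percentages rather than the pooled proportion of correct decisions, so both are computed here.

```python
# Confusion matrix copied from Table 7-12 (word-based testing, 3 classes).
# Rows are the original classes, columns are the predicted classes.
CONFUSION = [
    [225, 14, 49],   # /ɑb,vɑ,ib,vi/
    [25, 118, 1],    # /ɑwɑ/,/iwi/
    [39, 3, 390],    # /ɑð,d,zɑ/ /ið,d,z,i/
]


def percent_correct_per_class(matrix):
    """Percentage of each original class that was assigned to itself."""
    return [100.0 * row[i] / sum(row) for i, row in enumerate(matrix)]


def mean_percent_correct(matrix):
    """Unweighted mean of the per-class percentages."""
    per_class = percent_correct_per_class(matrix)
    return sum(per_class) / len(per_class)


def pooled_percent_correct(matrix):
    """Correct decisions divided by all decisions, ignoring class sizes."""
    correct = sum(row[i] for i, row in enumerate(matrix))
    total = sum(sum(row) for row in matrix)
    return 100.0 * correct / total


if __name__ == "__main__":
    print([round(p, 2) for p in percent_correct_per_class(CONFUSION)])
    print(round(mean_percent_correct(CONFUSION), 2))
    print(round(pooled_percent_correct(CONFUSION), 2))
```

When all classes contain the same number of utterances, as in the 12-class tables, the two overall measures coincide; they differ only when classes are merged into groups of unequal size.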
The speaker-specific training and testing performances are higher than the results obtained when more than one speaker is involved. This is expected, given the smaller within-speaker variability in producing these sounds. Figures 4-1 through 4-4 show the testing and training results for each of the speakers independently. No specific pattern was observed in the testing performance of the speaker-specific classifiers across different class configurations. Some of the testing results for speaker-specific models were close to the training results, while others were not.

6.0 CONCLUSION

The objective of this study was to investigate the feasibility of utilizing visual cues extracted from lip motion in distinguishing between different sounds. An audio-visual database with 12 VCV sequences was developed. Several features from the lip motion were extracted to represent the VCV sequence. The results suggest that the visual features used provide good discrimination between the VCV sequences used in this study.

Visual features representing the change in mouth shape while producing the consonants in the VCV sequence were the most significant ones. The minimum distance between the upper and lower lips, the distance between the lip corners (amount of rounding in the mouth), and the rate at which the upper lip moves while producing the consonant were the features that most effectively captured the differences between the VCV sounds in the study. These features are two-dimensional and very easy to extract from video sequences. These visual features capture the uniqueness of the VCV sounds produced by a single speaker at performance rates much higher than the rates obtained when multiple speakers are involved.

The results of this study contribute towards a better understanding of the viseme-to-phoneme mappings and the automatic visual classification of phonemes. Visemes are defined as the visual representation of phonemes.
The features used in this study represent a set of parameters that characterize visemes, and they probably represent a sample of what the eye observes and captures. These features should enable researchers to work on designing automatic classifiers to distinguish between different viseme classes.

7.0 FUTURE WORK

The success in classification of these sounds should encourage attempts to expand the audio-visual databases to include additional labial English sounds such as /m/, /p/, /f/, and /θ/, and to develop Fisher discriminant functions to identify unknown phonemes. Audio-visual recordings for the new sounds would need to be obtained. Then a lip tracking algorithm similar to the one used by Chen [9, 35] can extract images of the lip region from successive frames in the video sequence and trace several points around the face in consecutive frames, as shown in Figure 3-2. The next step would be to extract the visual features from the distance waveforms for these markers, as described in sections 3.2 and 3.3.

Grayscale-based image segmentation algorithms can be used to detect the appearance of the tongue and teeth in each frame of the image sequence. These represent two additional visual features that can be included in the analysis by assigning a binary value of zero or one representing whether the tongue or teeth were present in a specific frame or not. The extracted features can be used to calculate the Fisher discriminant functions associated with each phoneme. These functions can be used to identify unknown phonemes.

For hearing-impaired individuals, lip-reading is not universally well developed and is very limited in new hearing-aid users. This work provides a first step towards building automatic lip-reading systems that are based on visual information only. The speaker-specific classifiers discussed in Section 4.4 proved to have good classification. This indicates that the variability of these features for the same word is reduced within one speaker.
This becomes important in situations where a hearing-impaired person interacts with only a few people. In such situations, a visual classifier could be trained using utterances of important words from specific people. A signal could then be presented to the hearing-impaired person indicating the word being communicated. This can contribute towards improving communication between the hearing-impaired person and the people living with him or her.

Follow-up studies to this work might include studying the relation between the important peaks in the distance waveforms and the acoustic waveform itself. One of the common problems facing people with hearing aids is what is referred to as the cocktail party effect, which results from having more than one person speaking at the same time. In such situations, the noise coming from other speakers occupies the same frequency range as the speaker of interest, which makes it hard to attenuate without affecting the speech signal of interest itself.

This study suggests that the features associated with the consonant in the word (second extrema) play a bigger role in distinguishing between the sounds than the other features. Studying the acoustical behavior of the signal together with the lip waveforms that produced those acoustics may help in identifying instances and points of interest in the acoustical waveform that should be emphasized or de-emphasized. The overall impact of this might be to improve the intelligibility of speech.
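The identification step proposed in this chapter, in which classification functions are evaluated for an unknown utterance and the class with the largest score is chosen, can be sketched as follows. This is an illustration under stated assumptions rather than the procedure used in the study: the coefficients are copied from Table 7-7 in Appendix A (3-class, word-based training), the function and variable names are hypothetical, the feature values in the example are made up purely to exercise the code, and the correspondence between function numbers and viseme classes is assumed to follow the class order of Table 7-8.

```python
# Feature order used for every coefficient row below.
FEATURES = (
    "UL1", "UL2", "UL3", "LR1", "LR2", "LR3", "UP1", "UP2", "UP3",
    "Slope_UL1", "Slope_UL2", "Slope_LR1", "Slope_LR2",
    "Slope_UP1", "Slope_UP2",
)

# Classification function coefficients and constants copied from Table 7-7.
# Keys are the function numbers printed in the table (assumed to follow the
# class order of Table 7-8).
FUNCTIONS = {
    1: ((1.269, -4.478, 4.235, -2.770, 1.857, -1.266, 0.214, -0.509, 0.203,
         19.525, -10.866, -9.374, 21.230, 10.829, -31.927), -10.791),
    2: ((2.027, -4.742, 4.146, -0.503, -1.059, -1.364, -0.088, 0.307, -0.055,
         12.996, 7.556, -17.420, 24.734, 4.013, -29.103), -13.845),
    3: ((0.605, 0.247, 0.754, -3.028, 2.402, -1.309, -0.202, 0.046, -0.010,
         -15.486, 36.463, -13.519, 19.052, -15.530, -0.566), -6.825),
}


def classification_scores(features):
    """Evaluate every classification function for one utterance.

    `features` maps feature names to values. Features that are not given
    are treated as 0.0, a simplification made only for this sketch.
    """
    scores = {}
    for label, (coeffs, constant) in FUNCTIONS.items():
        weighted = sum(
            c * features.get(name, 0.0) for name, c in zip(FEATURES, coeffs)
        )
        scores[label] = constant + weighted
    return scores


def classify(features):
    """Return the number of the function with the largest score."""
    scores = classification_scores(features)
    return max(scores, key=scores.get)


if __name__ == "__main__":
    # Made-up feature values, not measurements from the database.
    example = {"UL1": 2.0, "UL3": 2.0}
    print(classification_scores(example))
    print(classify(example))
```

Extending the database to new phonemes would add one coefficient row and one constant per new class; the decision rule itself would stay the same.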
APPENDIX A

WORD-BASED TRAINING

A.1 WORD-BASED TRAINING WITH 12 CLASSES

Table 7-1 Classification function coefficients (columns are functions 1 to 12)

Feature     1  2  3  4  5  6  7  8  9  10  11  12
UL1         1.682  2.118  .351  2.715  .108  .793  1.093  2.373  .388  1.262  3.926  1.923
UL2         -5.974  -7.041  -.076  -6.716  -.063  -2.431  -5.229  -6.164  .353  -4.111  -3.849  -1.042
LR1         -4.550  -4.173  -3.657  -1.524  -3.963  -4.148  -3.519  -2.929  -2.670  -1.023  -3.719  -3.896
Slope_UL1   1.638  24.721  -30.842  6.991  -44.724  -15.928  .965  30.710  -12.520  -2.012  23.586  -3.317
Slope_UL2   -2.981  -26.217  28.585  4.478  72.496  31.513  10.332  -21.481  26.720  31.734  -6.787  37.200
Slope_LR1   -10.649  -15.863  -3.249  -3.723  -13.703  -16.661  1.204  -2.399  -11.180  -34.762  -4.836  -17.544
Slope_UP2   39.731  -18.219  6.212  6.109  51.007  6.892  -37.528  -20.083  12.731  -42.798  -23.259  16.368
(Constant)  -22.518  -17.881  -10.803  -20.352  -13.691  -13.820  -17.529  -12.784  -6.520  -19.047  -14.087  -10.024

Rows UL3, LR2, LR3, UP1, UP2, and UP3 (column alignment not recoverable; labels and values in printed order):
UL3 5.674 1.399 6.024 1.534 3.320 .717 4.105 LR2 1.902 LR3 -1.898 -1.496 -1.774 UP1 -.800 -.266 UP2 3.504 UP3 -2.317 6.095 1.911 1.698 -2.843 2.067 5.718 5.300 3.324 .911 2.060 .532 1.223 1.793 1.040 1.943 2.631 -.468 -1.203 -1.304 .089 -.957 -1.262 -2.251 -.385 -.855 -.752 -.763 -1.295 -.692 -.908 .762 -.180 -.585 .871 -.144 -.411 .848 2.217 2.060 .268 .760 -1.226 .657 .540 -1.979 .257 .037 -.289 -1.031 -1.285 .098 .531 .280 -.639 .318 1.222 -.325

Rows Slope_LR2 and Slope_UP1 (column alignment not recoverable; labels and values in printed order):
Slope_LR2 14.093 2.313 Slope_UP1 8.239 3.333 -22.472 1.748 .359 -3.046 3.480 4.336 76.920 10.298 10.672 4.158 -28.823 4.902 -32.880 -18.486 8.300 20.198 -16.681 -1.559 16.472 -4.259

Table 7-2 Classification results for word-based training with 12 classes (rows: original count; columns: predicted group membership)

Class   /ɑbɑ/  /ɑvɑ/  /ɑðɑ/  /ɑwɑ/  /ɑdɑ/  /ɑzɑ/  /ibi/  /ivi/  /iði/  /iwi/  /idi/  /izi/  Total
/ɑbɑ/   64  3  0  2  0  1  2  0  0  0  0  0  72
/ɑvɑ/   10  37  0  2  0  4  2  17  0  0  0  0  72
/ɑðɑ/   0  5  33  0  6  4  1  4  11  0  3  5  72
/ɑwɑ/   9  0  0  53  0  0  2  1  0  7  0  0  72
/ɑdɑ/   0  5  11  0  28  11  3  2  6  0  1  5  72
/ɑzɑ/   2  7  16  0  16  12  1  7  0  0  4  7  72
/ibi/   19  1  2  5  3  0  36  3  0  1  1  1  72
/ivi/   5  11  0  1  2  5  0  36  5  0  3  4  72
/iði/   0  1  7  0  1  2  0  4  46  0  0  11  72
/iwi/   1  1  1  5  0  0  1  3  1  59  0  0  72
/idi/   1  1  6  0  2  2  0  6  3  0  35  16  72
/izi/   0  0  9  0  11  2  3  3  12  0  5  27  72
%       88.9  51.4  45.8  73.6  38.9  16.7  50.0  50.0  63.9  81.9  48.6  37.5

A.2 WORD-BASED TRAINING WITH 6 CLASSES

Table 7-3 Classification function coefficients (columns are functions 1 to 6)

Feature     1  2  3  4  5  6
UL1         1.329  2.062  .242  2.237  1.736  1.218
UL2         -4.586  -5.539  .822  -5.033  -.981  -.923
LR1         -3.341  -2.942  -2.774  -.880  -3.373  -3.550
Slope_UL1   7.603  30.881  -19.132  8.418  -10.004  -6.788
Slope_UL2   11.116  -14.110  33.403  19.726  43.040  41.594
Slope_LR1   -8.826  -13.088  -10.788  -17.774  -13.801  -20.710
Slope_LR2   26.392  21.212  18.028  27.758  22.173  20.425
Slope_UP1   15.747  17.015  -19.590  8.276  -5.990  -7.462
Slope_UP2   -18.022  -34.086  -1.843  -21.331  5.250  .674
(Constant)  -15.592  -11.917  -6.562  -15.675  -10.807  -9.393

Rows UL3, LR2, LR3, UP1, UP2, and UP3 (column alignment not recoverable; labels and values in printed order):
UL3 4.686 .428 4.512 1.575 1.354 LR2 1.843 2.134 LR3 -.728 -1.086 -1.483 -1.005 UP1 -.397 UP2 UP3 4.702 .577 2.215 -1.002 2.557 2.826 -.776 -1.002 -.238 -.365 -.010 -.147 1.170 -1.560 .210 1.062 -.395 -.303 -.296 -.118 -.257 .219 .223 .622

Table 7-4 Classification results for word-based training with 6 classes (rows: original class; columns: predicted group membership)

Classes     /ɑbɑ,ibi/  /ɑvɑ,ivi/  /ɑðɑ,iði/  /ɑwɑ,iwi/  /ɑdɑ,idi/  /ɑzɑ,izi/  Total
/ɑbɑ,ibi/   124  9  3  5  1  2  144
/ɑvɑ,ivi/   22  98  6  2  6  10  144
/ɑðɑ,iði/   1  13  106  2  9  13  144
/ɑwɑ,iwi/   24  5  2  113  0  0  144
/ɑdɑ,idi/   5  22  33  1  55  28  144
/ɑzɑ,izi/   6  21  41  0  34  42  144
%           86.1  68.1  73.6  78.5  38.2  29.2

A.3 WORD-BASED TRAINING WITH 5 CLASSES

Table 7-5 Classification function coefficients (columns are functions 1 to 5)

Feature     1  2  3  4  5
UL1         1.213  1.857  .113  2.321  1.043
UL2         -4.793  -5.238  .672  -5.171  -1.288
LR1         -3.805  -3.254  -3.047  -1.000  -3.822
Slope_UL1   3.756  23.895  -21.662  4.663  -15.047
Slope_UL2   13.344  -2.569  32.435  25.742  44.294
Slope_LR1   -8.844  -14.227  -12.597  -15.901  -19.709
Slope_LR2   26.494  21.700  18.488  26.442  21.022
Slope_UP1   8.628  9.560  -23.201  3.025  -15.201
Slope_UP2   -25.823  -31.348  -8.008  -22.300  -4.432
(Constant)  -16.613  -12.637  -6.967  -16.450  -10.764

Rows UL3, LR2, LR3, UP1, UP2, and UP3 (column alignment not recoverable; labels and values in printed order):
UL3 5.146 .774 4.733 LR2 1.868 2.210 LR3 -.639 -1.181 -1.583 UP1 -.484 UP2 UP3 4.774 2.327 2.345 -1.242 2.741 -.902 -1.114 -.296 -.389 -.289 1.100 -1.179 .136 1.102 -.311 -.100 .040 -.316 .423 .427 .465

Table 7-6 Classification results for word-based training with 5 classes (rows: original class; columns: predicted group membership)

Classes     /ɑbɑ,ibi/  /ɑvɑ,ivi/  /ɑðɑ,iði/  /ɑwɑ,iwi/  /VzV,VdV/  Total
/ɑbɑ,ibi/   123  5  0  5  11  144
/ɑvɑ,ivi/   22  81  6  2  33  144
/ɑðɑ,iði/   1  8  102  2  31  144
/ɑwɑ,iwi/   24  2  2  113  3  144
/VzV,VdV/   11  25  50  1  201  288
%           85.4  56.2  70.8  78.5  69.8

A.4 WORD-BASED TRAINING WITH 3 CLASSES

Table 7-7 Classification function coefficients (columns are functions 1 to 3)

Feature     1  2  3
UL1         1.269  2.027  .605
UL2         -4.478  -4.742  .247
UL3         4.235  4.146  .754
LR1         -2.770  -.503  -3.028
LR2         1.857  -1.059  2.402
LR3         -1.266  -1.364  -1.309
UP1         .214  -.088  -.202
UP2         -.509  .307  .046
UP3         .203  -.055  -.010
Slope_UL1   19.525  12.996  -15.486
Slope_UL2   -10.866  7.556  36.463
Slope_LR1   -9.374  -17.420  -13.519
Slope_LR2   21.230  24.734  19.052
Slope_UP1   10.829  4.013  -15.530
Slope_UP2   -31.927  -29.103  -.566
(Constant)  -10.791  -13.845  -6.825

Table 7-8 Classification results for word-based training with 3 classes (rows: original class; columns: predicted group membership)

Classes                   /ɑb,vɑ,ib,vi/   /ɑwɑ/,/iwi/   /ɑð,d,zɑ/ /ið,d,z,i/   Total
/ɑb,vɑ,ib,vi/             235   8   45   288
/ɑwɑ/,/iwi/               29   112   3   144
/ɑð,d,zɑ/ /ið,d,z,i/      41   3   388   432
%                         74.65   85.42   90.04

APPENDIX B

WORD-BASED TESTING

B.1 WORD BASED TESTING WITH 12-CLASSES

Table 7-9 Classification results for word-based testing with 12 classes (rows: original count; columns: predicted group membership)

Class   /ɑbɑ/  /ɑvɑ/  /ɑðɑ/  /ɑwɑ/  /ɑdɑ/  /ɑzɑ/  /ibi/  /ivi/  /iði/  /iwi/  /idi/  /izi/  Total
/ɑbɑ/   62  2  0  1  0  0  7  0  0  0  0  0  72
/ɑvɑ/   9  33  0  4  2  4  3  15  0  0  1  1  72
/ɑðɑ/   0  5  28  0  9  5  4  3  9  0  2  7  72
/ɑwɑ/   10  0  0  52  0  0  3  1  0  6  0  0  72
/ɑdɑ/   2  3  15  0  20  11  4  2  3  0  1  11  72
/ɑzɑ/   2  8  10  0  15  8  4  8  4  0  6  7  72
/ibi/   24  1  0  5  7  0  30  2  0  2  0  1  72
/ivi/   1  14  1  0  2  3  2  34  4  0  7  4  72
/iði/   1  1  13  0  3  1  0  1  41  0  0  11  72
/iwi/   1  0  1  2  2  0  3  2  0  61  0  0  72
/idi/   1  2  5  0  4  0  0  5  4  0  32  19  72
/izi/   0  0  11  2  2  4  1  6  16  1  3  26  72
%       86.1  45.83  38.89  72.2  27.78  11.1  41.67  47.22  56.9  84.7  44.4  36.1  49.42

B.2 WORD BASED TESTING WITH 6-CLASSES

Table 7-10 Classification results for word-based testing with 6 classes (rows: original class; columns: predicted group membership)

Classes     /ɑbɑ,ibi/  /ɑvɑ,ivi/  /ɑðɑ,iði/  /ɑwɑ,iwi/  /ɑdɑ,idi/  /ɑzɑ,izi/  Total
/ɑbɑ,ibi/   127  5  0  6  0  6  144
/ɑvɑ,ivi/   21  95  4  2  13  9  144
/ɑðɑ,iði/   4  10  110  1  5  14  144
/ɑwɑ,iwi/   22  5  3  113  0  1  144
/ɑdɑ,idi/   9  21  33  0  50  31  144
/ɑzɑ,izi/   6  22  44  4  26  42  144
%           88.19  65.97  76.39  78.47  34.72  29.17  62.15

B.3 WORD BASED TESTING WITH 5-CLASSES

Table 7-11 Classification results for word-based testing with 5 classes (rows: original class; columns: predicted group membership)

Classes     /ɑbɑ,ibi/  /ɑvɑ,ivi/  /ɑðɑ,iði/  /ɑwɑ,iwi/  /VzV,VdV/  Total
/ɑbɑ,ibi/   122  7  0  9  6  144
/ɑvɑ,ivi/   25  71  2  4  42  144
/ɑðɑ,iði/   2  8  102  1  31  144
/ɑwɑ,iwi/   23  0  1  119  1  144
/VzV,VdV/   18  28  43  0  199  288
%           84.72  49.30  70.83  82.64  69.1  71.32

B.4 WORD BASED TESTING WITH 3-CLASSES

Table 7-12 Classification results for word-based testing with 3 classes (rows: original class; columns: predicted group membership)

Classes                   /ɑb,vɑ,ib,vi/   /ɑwɑ/,/iwi/   /ɑð,d,zɑ/ /ið,d,z,i/   Total
/ɑb,vɑ,ib,vi/             225   14   49   288
/ɑwɑ/,/iwi/               25   118   1   144
/ɑð,d,zɑ/ /ið,d,z,i/      39   3   390   432
%                         78.12   81.94   90.28   83.45

BIBLIOGRAPHY

[1] Sandlin R E, Handbook of Hearing Aid Amplification, 2nd ed.: Allyn & Bacon, 2000.
[2] Summerfield Q, "Lip-reading and Audio-Visual Speech Perception," Philosophical Transactions: Biological Sciences, vol. 335, pp. 71-78, 1992.
[3] Sumby W and Pollack I, "Visual Contributions to speech intelligibility in noise," JASA, vol. 26, pp. 212-215, 1954.
[4] McGurk H and MacDonald J, "Hearing Lips and Seeing voices," Nature, vol. 264, pp. 746-748, 1976.
[5] Calvert G A, Bullmore E T, Brammer M J, Campbell R, Williams S R, McGuire P K, Woodruff P R, Iversen S D, and David A S, "Activation of Auditory Cortex During Silent Lipreading," Science, vol. 276, pp. 593-596, April 25, 1997.
[6] Bernstein L E and Benoit C, "For speech perception by humans or machines, three senses are better than one," in Spoken Language, 1996. ICSLP 96. Proceedings., Fourth International Conference on, 1996, pp. 1477-1480 vol.3.
[7] Jialin Z, Chou W, and Petajan E, "Acoustic driven viseme identification for face animation," in Multimedia Signal Processing, 1997., IEEE First Workshop on, 1997, pp. 7-12.
[8] Matthews I, Cootes T F, Bangham J A, Cox S, and Harvey R, "Extraction of visual features for lipreading," Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 24, pp. 198-213, 2002.
[9] Chen T and Rao R R, "Audio-visual integration in multimodal communication," Proceedings of the IEEE, vol. 86, pp. 837-852, 1998.
[10] Chen T and Rao R R, "Audio-visual integration in multimodal communication," Proceedings of the IEEE, vol. 86, pp. 837-852, 1998.
[11] D. Y.-J. C. C. Tsuhan, Dr Simon Lucy, "Audio Visual Speech Data," Advanced Multimedia Lab at Carnegie Mellon University.
[12] Fisher C G, "Confusions among visually perceived consonants," Journal of Speech and Hearing Research, vol. 11, pp. 796-804, 1968.
[13] Chen T, "Audio Visual Speech Processing," IEEE Signal Processing Magazine, pp. 9-21, January 2001.
[14] Hans Peter Graf, Erik Casatto, and M. Potamianos, "Robust Recognition of Faces and Facial Features with a Multi-Modal System," IEEE International Conference on Systems, Man, and Cybernetics, 1997. Computational Cybernetics and Simulation, vol. III, pp. 2034-2039, 1997.
[15] Paul Duchnowski, Martin Hunke, and A. Waibel, "Towards Movement Invariant Automatic Lip-reading And Speech recognition," ICASSP, vol. I, pp. 109-112, 1995.
[16] Richard Harvey, Iain Mathews, and J. Andrew, "Lip-reading from scale-space measurements," Proceedings 1997 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 1997.
[17] Bothe H and Rieger F, "Visual Speech and Coarticulations Effects," ICASSP, vol. V, pp. 634-637, 1993.
[18] Owens E and Blazek B, "Visemes Observed by hearing impaired and normal hearing adult viewers," Journal of Speech and Hearing Research, vol. 28, pp. 381-393, September 1985.
[19] Hani Yehia, Philip Rubin, and Eric Vatikiotis-Bateson, "Quantitative association of vocal-tract and facial behavior," Speech Communication, vol. 26, pp. 23-43, 1998.
[20] J. Barker and F. Berthommier, "Evidence of Correlation between acoustic and visual features of speech," Proceedings of ICPhS, pp. 199-202, 1999.
[21] Ezzat T and Poggio T, "Miketalk: A talking facial display based on morphing visemes," Proc. of Computer Animation, pp. 96-102, June 1998.
[22] Ezzat T and Poggio T, "Miketalk: A Video Realistic Text-to Audio-Visual Speech Synthesizer," MIT Center for Biological and Computational Learning.
[23] Goecke Roland, Millar J Bruce, Zelinsky Alexander, and R.-R. Jordi, "Analysis of audio-video correlation in vowels in Australian English," In AVSP-2001, pp. 115-120, 2001.
[24] Wells J C, Longman Pronunciation Dictionary, Second ed.: Harlow: Pearson Education Limited, 2000.
[25] Ladefoged P, Vowels and Consonants, 2nd ed.: Wiley-Blackwell, 2005.
[26] Cohen M and Massaro D W, "Modeling coarticulation in synthetic visual speech," in N. M. Thalmann & D. Thalmann (Eds.), Models and Techniques in Computer Animation, Tokyo, Japan, pp. 139-156, 1993.
[27] Faruquie T A, Kapoor A, Kate R, Rajput N, and Subramaniam L V, "Audio Driven Facial Animation for Audio-Visual Reality," in Multimedia and Expo, 2001. ICME 2001. IEEE International Conference on, 2001, pp. 821-824.
[28] Verma A, Rajput N, and Subramaniam L V, "Using viseme based acoustic models for speech driven lip synthesis," in Proceedings of the 2003 International Conference on Multimedia and Expo (ICME '03), Volume 3: IEEE Computer Society, 2003.
[29] Kate Saenko, Trevor Darrell, and J. R. Glass, "Articulatory features for robust visual speech recognition," in Proceedings of the 6th international conference on Multimodal interfaces, State College, PA, USA: ACM, 2004.
[30] Jiang J, Alwan A, Bernstein L E, Auer E T, and Keating P A, "Similarity structure in perceptual and physical measures for visual consonants across talkers," in Acoustics, Speech, and Signal Processing, 2002. Proceedings. (ICASSP '02). IEEE International Conference on, 2002, pp. I-441-I-444 vol.1.
[31] Jiang Jintao, Alwan Abeer, Auer Edward T, and Bernstein Lynne E, "Predicting visual consonant perception from physical measures," In EUROSPEECH-2001, pp. 179-182, 2001.
[32] Salah Werda, Walid Mahdi, and Abdelmajid Ben Hamadou, "Lip Localization and Viseme Classification for Visual Speech Recognition," International Journal of Computing and Information Sciences, vol. 4, pp. 62-75, 2006.
[33] Leszczynski M and Skarbek W, "Viseme recognition - a comparative study," in Advanced Video and Signal Based Surveillance, 2005. AVSS 2005. IEEE Conference on, 2005, pp. 287-292.
[34] Leszczynski M, Skarbek W, and Badura S, "Fast Viseme Recognition for Talking Head Application," ICIAR 2005, LNCS 3656, pp. 516-523, 2005.
[35] Huang F J and Chen T, "Real-Time Lip-Synch Face Animation Driven by Human Voice," IEEE Workshop on Multimedia Signal Processing, December 1998.
[36] Kricos P B and Lesner S, "Differences in visual intelligibility across talkers," Volta Review, vol. 84, pp. 219-225, 1982.
[37] Benguerel M and Pichora-Fuller K, "Coarticulation effects in lipreading," Journal of Speech and Hearing Research, vol. 25, pp. 600-607, December 1982.
[38] Dodd B and Campbell R, Hearing by Eye: The Psychology of Lip-reading. London: Lawrence Erlbaum Associates, 1987.
[39] J. Luettin and S. Dupont, "Continuous Audio-Visual Speech Recognition," Proc. of Fifth European Conference on Computer Vision, Freiburg, Germany, June 1998.
[40] Petajan E, "Automatic Lip-reading to enhance speech recognition," Proceedings of IEEE Global Telecommunication Conference, pp. 265-272, Nov 1984.
[41] B. Yuhas, M. Goldstein, and T. Sejnowski, "Integration of acoustic and visual speech signals using neural network," IEEE Transactions Speech Audio Processing, pp. 337-351, 1989.
[42] Nan Li, Shawn Dettmer, and M. Shah, "Lip-reading using Eigenspace," Proceedings on workshop on automatic face and gesture recognition, pp. 30-35, 1995.
[43] Potamianos G and Neti C, "Automatic Speechreading for Impaired Speech," Proceedings of the Audio Visual Speech Processing Workshop, September 2001.
[44] R. S. Abdulrauf Biag, and Gilles Vaucher, "A Spatio-Temporal Neural Network Applied to Visual Speech Recognition," The 2001 IEEE International Symposium on Circuits and Systems, vol. 2, pp. 329-332, 2001.
[45] Goldschen A J, "Continuous Automatic Speech Recognition by lip-reading," PhD dissertation, Washington, DC: George Washington University, 1987.
[46] Mase K and Pentland A, "Automatic Lip-reading by Optical flow Analysis," Systems and Computers in Japan, vol. 22, pp. 67-75, 1991.
[47] Hiroshi G. Okuno, Yukiko Nakagawa, and H. Kitano, "Incorporating Visual Information into Sound Source Separation," Proc. of IJCAI-99 Workshop on Computational Auditory Scene Analysis (CASA'99), Stockholm, Sweden, 1999.
[48] Cutler R and Davis L, "Look Who's Talking: Speaker Detection Using Video and Audio Correlation," IEEE International Conference on Multimedia and Expo, vol. 3, pp. 1589-1592, 2000.
[49] Takahashi K and Yamasaki H, "Audio Visual sensor fusion system for intelligent sound sensing," Proceedings of IEEE International conference on multisensor fusion and integration for intelligent systems (MFI 94), Las Vegas, NV, Oct 2-5 1994.
[50] Ekman P, "Facial Expression and Emotion," American Psychologist, vol. 48, pp. 384-392, 1993.
[51] Craig K D, Hyde S A, and Patrick C J, "Genuine, suppressed and faked facial behaviour during exacerbation of chronic low back pain," Pain, vol. 46, pp. 161-171, 1991.
[52] Katsikitis M and Pilowsky I, "A study of facial expressions in Parkinson's disease using a novel microcomputer based method," Journal of Neurology, Neurosurgery, and Psychiatry, vol. 51, pp. 362-366, 1988.
[53] Ekman P and Friesen W V, "Facial Action Coding System," Palo Alto: Consulting Psychologists Press, vol. (a), 1978.
[54] Ekman P and Friesen W V, "Facial Action Coding System: Investigator's Guide," Palo Alto: Consulting Psychologists Press, vol. b, 1978.
[55] Oster H and Rosenstein, "Baby FACS: Analyzing facial movements in Infants," Unpublished Manuscript, New York University, 1993.
[56] Oster, Hegley, and Nagel, "Adult Adjustment and fine grain analysis of infant facial expressions: Testing the validity of priori coding formulas," Developmental Psychology, vol. 28, pp. 1115-1131, 1992.
[57] Izard C E, "The Maximally Discriminative Facial Movement Coding System," Unpublished Manuscript, University of Delaware, 1983.
[58] Essa I A and Pentland A, "A Vision System for Observing and Extracting Facial Action Parameters," Proceedings of the International Conference on Computer Vision and Pattern Recognition, pp. 76-83, 1994.
[59] Mase K and Pentland A, "Lip Reading by Optical Flow," IEICE of Japan, Transactions, vol. 6, pp. 796-803.
[60] Yacoob Y and Davis L, "Computing spatio-temporal Representations of Human Faces," Proceedings In Computer Vision and Pattern Recognition, pp. 70-70, 1994.
[61] M. Pantic and L. J. M. Rothkrantz, "Expert System for Automatic Analysis of Facial Expressions," Image and Vision Computing, vol. 18, pp. 881-905, March 2000.
[62] ExpertVision, ExpertVision Operation Manual. Santa Rosa, CA: Motion Analysis Corporation, 1990.
[63] Godinho Tara, Ingham Roger J, Davidow Jason, and C. John, "The Distribution of Phonated Intervals in the Speech of Individuals Who Stutter," J Speech Lang Hear Res, vol. 49, pp. 161-171, February 1, 2006.
[64] Smith A and Kleinow J, "Kinematic Correlates of Speaking Rate Changes in Stuttering and Normally Fluent Adults," J Speech Lang Hear Res, vol. 43, pp. 521-536, April 1, 2000.
[65] Wohlert A B and Smith A, "Spatiotemporal Stability of Lip Movements in Older Adult Speakers," J Speech Lang Hear Res, vol. 41, pp. 41-50, February 1, 1998.
[66] Stevens K, Acoustic Phonetics, First ed.: MIT Press, 2000.
[67] Cohen M, Walker R, and Massaro D, "Perception of synthetic visual speech," in Speechreading by Humans and Machines. New York: Springer, 1996, pp. 153-168.
[68] Gutierrez-Osuna R, "Fisher Discriminant Analysis," in Pattern Recognition and Intelligent Sensor Machines, College Station: Texas A&M University, Last accessed on June 26, 2008.
[69] Huberty C, Applied Discriminant Analysis: John Wiley & Sons, Inc, 1994.
[70] Norusis M J, SPSS 16.0 Statistical Procedures Companion: Prentice Hall, 2008.
