Automatic dictation software with reasonably high word recognition accuracy is now widely available to the general public. Many people with gross motor impairment, including some people with cerebral palsy and closed head injuries, have not enjoyed the benefit of these advances, because their general motor impairment includes a component of dysarthria: reduced speech intelligibility caused by neuromotor impairment. These motor impairments often preclude normal use of a keyboard. For this reason, case studies have shown that some dysarthric users may find it easier to use, instead of a keyboard, a small-vocabulary automatic speech recognition system, with code words representing letters and formatting commands, and with acoustic speech recognition models carefully adapted to the speech of the individual user. Development of each individualized speech recognition system remains extremely labor-intensive, because so little is understood about the general characteristics of dysarthric speech.
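As a rough illustration of the code-word spelling approach mentioned above, the sketch below maps recognized code words to letters and formatting commands. The NATO-style code words and the `capital`/`newline` commands are hypothetical placeholders; the study's actual vocabulary is not given here.

```python
# Minimal sketch of the code-word spelling idea: a small-vocabulary
# recognizer emits one code word at a time, and a lookup table turns
# it into a letter or formatting command. Code words and command
# names are illustrative assumptions, not the study's vocabulary.

CODE_WORDS = {
    "alpha": "a", "bravo": "b", "charlie": "c", "delta": "d",
    # ... one code word per letter ...
    "space": " ",
}
COMMANDS = {"newline": "\n", "capital": "SHIFT"}  # hypothetical commands

def decode(tokens):
    """Map a sequence of recognized code words to text."""
    out, shift = [], False
    for tok in tokens:
        if tok in COMMANDS:
            if COMMANDS[tok] == "SHIFT":
                shift = True
            else:
                out.append(COMMANDS[tok])
        elif tok in CODE_WORDS:
            ch = CODE_WORDS[tok]
            out.append(ch.upper() if shift else ch)
            shift = False
    return "".join(out)

print(decode(["capital", "bravo", "alpha", "delta"]))  # -> "Bad"
```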
This article presents a framework for a phonetic vocoder driven by ultrasound and optical images of the tongue and lips for a “silent speech interface” application. The system is built around an HMM-based visual phone recognition step which provides target phonetic sequences from a continuous visual observation stream. The phonetic target constrains the search for the optimal sequence of diphones that maximizes similarity to the input test data in visual space subject to a unit concatenation cost in the acoustic domain. The final speech waveform is generated using “Harmonic plus Noise Model” synthesis techniques. Experimental results are based on a one-hour continuous speech audiovisual database comprising ultrasound images of the tongue and both frontal and lateral views of the speaker’s lips.
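The constrained diphone search described above can be pictured as a dynamic-programming unit selection. The sketch below, assuming precomputed per-candidate visual target costs and pairwise acoustic join costs (random placeholders here), finds the minimum-cost unit sequence; the paper's actual distance measures and HNM synthesis stage are not reproduced.

```python
import numpy as np

# Sketch of diphone unit selection: for each target diphone there are
# several candidate units, each with a "target cost" (distance to the
# input in visual space) and pairwise "join costs" (concatenation cost
# in the acoustic domain). Dynamic programming picks the cheapest path.

def select_units(target_costs, join_costs):
    """target_costs[t][i]: visual cost of candidate i at step t.
    join_costs[t][i, j]: acoustic cost of joining unit i at step t
    with unit j at step t+1."""
    T = len(target_costs)
    best = [target_costs[0]]
    back = []
    for t in range(1, T):
        total = best[-1][:, None] + join_costs[t - 1] + target_costs[t][None, :]
        back.append(np.argmin(total, axis=0))
        best.append(np.min(total, axis=0))
    # Trace back the optimal candidate index at each step.
    path = [int(np.argmin(best[-1]))]
    for t in range(T - 2, -1, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]

rng = np.random.default_rng(0)
tc = [rng.random(4) for _ in range(5)]          # 5 diphones, 4 candidates each
jc = [rng.random((4, 4)) for _ in range(4)]
print(select_units(tc, jc))                     # one candidate index per diphone
```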
... Our eye detector is inspired by the work of Viola and Jones on the use of Haar-like features with integral images [27] ... The movements of mouth regions are described using local binary patterns from the XY, XT and YT planes, combining local features from pixel, block and volume ...
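The integral-image trick referenced above is straightforward to sketch: after one cumulative-sum pass over the image, the sum of any rectangular region costs four lookups, which is what makes Haar-like features cheap to evaluate at every scale and position. A minimal version with a toy two-rectangle feature:

```python
import numpy as np

# Integral image as used by Viola & Jones for Haar-like features.

def integral_image(img):
    # Pad with a zero row/column so rectangle sums need no boundary checks.
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1), dtype=np.int64)
    ii[1:, 1:] = np.cumsum(np.cumsum(img, axis=0), axis=1)
    return ii

def rect_sum(ii, r0, c0, r1, c1):
    """Sum of img[r0:r1, c0:c1] in O(1) via four corner lookups."""
    return ii[r1, c1] - ii[r0, c1] - ii[r1, c0] + ii[r0, c0]

img = np.arange(16).reshape(4, 4)
ii = integral_image(img)
# A two-rectangle Haar feature: left half minus right half of a window.
feat = rect_sum(ii, 0, 0, 4, 2) - rect_sum(ii, 0, 2, 4, 4)
print(feat, img[:, :2].sum() - img[:, 2:].sum())  # both -16
```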
In recent years, lip-reading systems have received much attention, since they play an important role in human-computer communication, especially for hearing-impaired or elderly people. In this paper, we introduce a new visual feature representation that combines the Hypercolumn Neural Network model (HCM) with a Hidden Markov Model (HMM) to achieve a complete lip-reading system. To evaluate the system's performance, we apply it to the Arabic language. To the best of our knowledge, this is the first time that a visual speech recognition system has been applied to Arabic. Experiments include different Arabic sentences gathered from different native speakers (male and female).
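A minimal sketch of the HMM back-end such a system might use is given below, with the HCM front-end replaced by random placeholder features: one Gaussian HMM is trained per utterance class, and a test sequence is assigned to the class whose model scores it highest (using the `hmmlearn` package).

```python
import numpy as np
from hmmlearn import hmm  # pip install hmmlearn

# One Gaussian HMM per class, classification by maximum log-likelihood.
# Features are random placeholders standing in for the HCM front-end.

def train_models(class_sequences, n_states=3):
    models = {}
    for label, seqs in class_sequences.items():
        X = np.vstack(seqs)
        lengths = [len(s) for s in seqs]
        m = hmm.GaussianHMM(n_components=n_states, covariance_type="diag",
                            n_iter=20, random_state=0)
        m.fit(X, lengths)
        models[label] = m
    return models

def classify(models, seq):
    return max(models, key=lambda label: models[label].score(seq))

rng = np.random.default_rng(1)
data = {lab: [rng.normal(mu, 1.0, (20, 6)) for _ in range(5)]
        for lab, mu in [("sentence_A", 0.0), ("sentence_B", 2.0)]}
models = train_models(data)
print(classify(models, rng.normal(2.0, 1.0, (20, 6))))  # likely "sentence_B"
```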
Magnetic motion trackers have been widely used for tracking user head/hand pose information owing to advantages such as their size, occlusion-free tracking, and high sample rate. Yet issues such as latency and jitter leave magnetic tracker technology unfavorable compared to other tracker technologies. While latency is due to the tracker hardware employed, jitter is due to magnetic field distortion caused by the presence of nearby metals. These issues give rise to registration errors in Virtual Environment (VE) systems, specifically in interactive Virtual Reality (VR) or Augmented Reality (AR) applications. This paper discusses the integration of prediction-smoothing algorithms to achieve accurate registration for a magnetic tracker-based immersive Augmented Reality - Computational Fluid Dynamics (CFD) environment. In this project, Kalman and Gaussian filters are utilized to remove latency and jitter effects. In addition, to allow efficient con...
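A minimal constant-velocity Kalman filter for one tracker axis, sketched below, illustrates the predict/correct loop such a prediction-smoothing stage relies on. The noise levels are illustrative; the paper's tuned parameters and its Gaussian smoothing stage are not reproduced.

```python
import numpy as np

# Constant-velocity Kalman filter on a single (pos, vel) state.
# The one-step-ahead prediction is what can hide tracker latency;
# the correction step suppresses measurement jitter.

def kalman_1d(zs, dt=0.01, q=1e-3, r=1e-2):
    F = np.array([[1, dt], [0, 1]])        # state transition
    H = np.array([[1.0, 0.0]])             # we only measure position
    Q = q * np.eye(2)                      # process noise
    R = np.array([[r]])                    # measurement (jitter) noise
    x, P, out = np.zeros(2), np.eye(2), []
    for z in zs:
        x = F @ x                          # predict
        P = F @ P @ F.T + Q
        K = P @ H.T @ np.linalg.inv(H @ P @ H.T + R)
        x = x + K @ (np.array([z]) - H @ x)  # correct with noisy measurement
        P = (np.eye(2) - K @ H) @ P
        out.append(x[0])
    return np.array(out)

t = np.linspace(0, 1, 100)
truth = np.sin(2 * np.pi * t)
noisy = truth + np.random.default_rng(2).normal(0, 0.1, t.size)
smooth = kalman_1d(noisy)
print(float(np.abs(noisy - truth).mean()), float(np.abs(smooth - truth).mean()))
```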
The latest results on continuous speech phone recognition from video observations of the tongue and lips are described in the context of an ultrasound-based silent speech interface. The study is based on a new 61-minute audiovisual database containing ultrasound sequences of the tongue as well as both frontal and lateral views of the speaker’s lips. Phonetically balanced and exhibiting good diphone coverage, this database is designed both for recognition and corpus-based synthesis purposes. Acoustic waveforms are phonetically labeled, and visual sequences are coded using PCA-based robust feature extraction techniques. Visual and acoustic observations of each phonetic class are modeled by continuous HMMs, allowing the performance of the visual phone recognizer to be compared to a traditional acoustic-based phone recognition experiment. The phone recognition confusion matrix is also discussed in detail.
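The PCA-based visual coding step can be sketched as follows: flatten each ultrasound or lip frame to a vector, learn a low-dimensional subspace from training frames ("EigenTongues"-style coding), and project each new sequence onto it to obtain HMM observations. Frame size and component count below are placeholder values.

```python
import numpy as np
from sklearn.decomposition import PCA  # pip install scikit-learn

# Project flattened image frames onto a learned visual subspace so
# that each frame becomes a compact feature vector for the HMMs.

rng = np.random.default_rng(3)
frames = rng.random((500, 64 * 64))        # 500 flattened 64x64 training frames
pca = PCA(n_components=30).fit(frames)     # learn the visual subspace

def encode(frame_seq):
    """Turn a (T, H*W) image sequence into a (T, 30) feature sequence."""
    return pca.transform(frame_seq)

feats = encode(rng.random((40, 64 * 64)))  # one 40-frame utterance
print(feats.shape)                         # (40, 30) -> HMM observations
```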
The need for an automatic lip-reading system is ever increasing. In fact, the extraction and reliable analysis of facial movements today make up an important part of many multimedia systems, such as videoconferencing, low-bit-rate communication, and lip-reading systems. In addition, visual information is imperative for people with special needs. We can imagine, for example, a dependent person commanding a machine with a simple lip movement or by pronouncing a simple syllable. Moreover, people with hearing problems compensate for their special needs by lip-reading as well as listening to the person with whom they are talking. We present in this paper a new approach to automatically localize lip feature points in a speaker’s face and to carry out spatial-temporal tracking of these points. The extracted visual information is then classified in order to recognize the uttered viseme (visual phoneme). We have developed our Automatic Lip Feature Extraction prototype (ALiFE). Experiments revealed t...
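One common way to realize the spatial-temporal tracking step, sketched below under the assumption that the initial lip points are already localized, is pyramidal Lucas-Kanade optical flow via OpenCV; the ALiFE detector itself is not reproduced, and the frames here are synthetic.

```python
import numpy as np
import cv2  # pip install opencv-python

# Carry lip feature points from frame to frame with pyramidal
# Lucas-Kanade optical flow; initial points are assumed given.

def track_points(frames, init_pts):
    """frames: list of grayscale uint8 images; init_pts: (N, 2) float32."""
    pts = init_pts.reshape(-1, 1, 2).astype(np.float32)
    trajectory = [pts.reshape(-1, 2).copy()]
    for prev, nxt in zip(frames, frames[1:]):
        pts, status, _ = cv2.calcOpticalFlowPyrLK(prev, nxt, pts, None)
        trajectory.append(pts.reshape(-1, 2).copy())
    return np.stack(trajectory)  # (T, N, 2) point positions over time

rng = np.random.default_rng(4)
base = (rng.random((120, 160)) * 255).astype(np.uint8)
frames = [np.roll(base, s, axis=1) for s in range(5)]  # fake 1-px/frame shift
init = np.array([[80.0, 60.0], [90.0, 60.0]], dtype=np.float32)
print(track_points(frames, init)[-1])  # points should drift ~4 px right
```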
Visual speech recognition aims at transcribing lip movements into readable text. There have been many strides in automatic speech recognition systems that can recognize words from combined audio and visual speech features, even under noisy conditions. This paper focuses only on the visual features, whereas a robust system would use visual features to support the acoustic features. We propose the concatenation of visemes (lip movements) for text classification rather than a classic individual viseme mapping. The results show that this approach achieves a significant improvement over state-of-the-art models. The system has two modules: the first extracts lip features from the input video, while the second is a neural network trained to process the viseme sequence and classify it as text.
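A minimal sketch of the second module, under illustrative assumptions about the architecture (a small GRU over viseme IDs) and vocabulary sizes, might look as follows; the paper's actual network and training setup are not shown.

```python
import torch
import torch.nn as nn

# A small recurrent network that reads a sequence of viseme IDs and
# classifies it as one of a fixed set of words/phrases. All sizes are
# illustrative placeholders.

class VisemeClassifier(nn.Module):
    def __init__(self, n_visemes=14, n_classes=10, emb=16, hidden=32):
        super().__init__()
        self.emb = nn.Embedding(n_visemes, emb)
        self.rnn = nn.GRU(emb, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_classes)

    def forward(self, viseme_ids):           # (batch, seq_len) int64
        h, _ = self.rnn(self.emb(viseme_ids))
        return self.out(h[:, -1])            # logits from the last timestep

model = VisemeClassifier()
batch = torch.randint(0, 14, (8, 20))        # 8 sequences of 20 visemes
logits = model(batch)
print(logits.shape, logits.argmax(dim=1))    # (8, 10) and predicted classes
```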
Lip reading is a main part of audio-visual speech recognition systems, which often face redundancy in the extracted features. In this paper, a new approach is proposed to increase lip-reading performance through the extraction of discriminant features. First, faces are detected; then, lip key points are extracted, from which four cubic curves characterize the lip contours. Next,
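The cubic-curve contour description can be sketched with a simple polynomial fit: a few coefficients then summarize each of the four lip curves. The boundary points below are synthetic.

```python
import numpy as np

# Fit y = a*x^3 + b*x^2 + c*x + d to detected lip boundary points,
# so four coefficients characterize one of the four lip curves.

def fit_lip_curve(xs, ys):
    return np.polyfit(xs, ys, deg=3)

# Fake points along an upper-lip-like arc between two mouth corners.
xs = np.linspace(-1.0, 1.0, 15)
ys = 0.3 * (1 - xs**2) + 0.05 * xs**3
coeffs = fit_lip_curve(xs, ys)
print(np.round(coeffs, 3))                   # ~[0.05, -0.3, 0.0, 0.3]
resampled = np.polyval(coeffs, xs)           # reconstruct the contour
print(float(np.abs(resampled - ys).max()))   # near-zero fit error
```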
In this paper we propose a visual speech recognition network based on support vector machines. Each word of the dictionary is described as a temporal sequence of visemes. Each viseme is modeled by a support vector machine, and the temporal character of speech is modeled by integrating the support vector machines as nodes into a Viterbi decoding lattice. Experiments conducted on a small visual speech recognition task show a word recognition rate on the level of the best rates previously reported, even without training the state transition probabilities in the Viterbi lattice and using very simple features. This demonstrates the suitability of support vector machines for visual speech recognition.
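The decoding idea can be sketched as a Viterbi alignment of per-frame scores to a left-to-right viseme sequence per word. In the sketch below, random placeholder scores stand in for the per-viseme SVM outputs; the paper's conversion of SVM outputs into lattice scores is not reproduced.

```python
import numpy as np

# Each word is a fixed viseme sequence; per-frame viseme scores
# (placeholders for SVM outputs) are aligned left-to-right by Viterbi,
# and the word with the best total score wins.

WORDS = {"one": [3, 7, 2], "two": [5, 9], "three": [4, 6, 0]}  # viseme IDs

def word_log_score(frame_scores, visemes):
    """Best alignment of T frames to a left-to-right viseme sequence."""
    T, S = len(frame_scores), len(visemes)
    dp = np.full((T, S), -np.inf)
    dp[0, 0] = frame_scores[0][visemes[0]]
    for t in range(1, T):
        for s in range(S):
            stay = dp[t - 1, s]
            move = dp[t - 1, s - 1] if s > 0 else -np.inf
            dp[t, s] = max(stay, move) + frame_scores[t][visemes[s]]
    return dp[-1, -1]

rng = np.random.default_rng(5)
frame_scores = rng.normal(size=(12, 10))     # 12 frames x 10 viseme scores
best = max(WORDS, key=lambda w: word_log_score(frame_scores, WORDS[w]))
print(best)
```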
This paper reports on a visual speech recognition method that is invariant to translation, rotation and scale. Dynamic features representing the mouth motion are extracted from the video data using a motion segmentation technique termed the motion history image (MHI). The MHI is generated by applying an accumulative image-differencing technique to the sequence of mouth images. Invariant features are derived
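A motion history image is simple to sketch: pixels where the frame changed are set to a maximum value and everything else decays, so recent motion appears bright and older motion fades. Threshold and decay values below are illustrative.

```python
import numpy as np

# Accumulative image differencing: moving pixels are stamped with a
# maximum value, all others decay toward zero frame by frame.

def motion_history(frames, thresh=30, duration=255, decay=25):
    mhi = np.zeros_like(frames[0], dtype=np.float32)
    for prev, cur in zip(frames, frames[1:]):
        moving = np.abs(cur.astype(np.int16) - prev.astype(np.int16)) > thresh
        mhi = np.where(moving, duration, np.maximum(mhi - decay, 0))
    return mhi

frames = [np.zeros((60, 80), dtype=np.uint8) for _ in range(6)]
for i, f in enumerate(frames):
    f[20:40, 10 + 10 * i: 30 + 10 * i] = 200   # a block sweeping right
mhi = motion_history(frames)
print(mhi.max(), (mhi > 0).sum())  # bright trail where the mouth/block moved
```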