IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Mar 1, 2008
To reproduce a face motion from an image sequence, natural motion parameters provide a semantically meaningful and efficient representation. State-of-the-art techniques for 3D face motion estimation employ a limited set of predefined key-shapes of the face structure, and thereby restrict the possible face motions, which can cause distortions. We propose a new approach in which such distortions are avoided by augmenting a 3D structural surface face model with a physical motion model originating from continuum mechanics. An implementation with a displacement-based finite element method (FEM) not only describes but also explains the motion of the face's skin tissue caused by muscle-force-parameterized actuations. The motion model, together with the mapping of the 3D scene flow to the 2D optical flow, allows posing the 3D deformation estimation as an inverse problem, which is solved with numerical solvers.
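
As a rough illustration of the inverse-problem formulation, the sketch below solves a regularized linear least-squares system for muscle activation parameters from observed optical flow. It assumes a precomputed, linearized FEM response (the flow_jacobian matrix) and synthetic data; it is a minimal sketch under those assumptions, not the paper's actual solver.

    import numpy as np

    def estimate_muscle_activations(flow_jacobian, observed_flow, reg=1e-3):
        """Regularized least-squares estimate of muscle activation parameters.

        flow_jacobian : (2N, M) array mapping M muscle activations to the 2D
                        optical-flow displacements of N image points (assumed
                        precomputed from a linearized FEM; hypothetical input).
        observed_flow : (2N,) stacked optical-flow measurements.
        reg           : Tikhonov regularization weight.
        """
        J = np.asarray(flow_jacobian)
        f = np.asarray(observed_flow)
        # Regularized normal equations: (J^T J + reg*I) a = J^T f
        return np.linalg.solve(J.T @ J + reg * np.eye(J.shape[1]), J.T @ f)

    # Toy usage with random data (shapes only, not real face measurements).
    rng = np.random.default_rng(0)
    J = rng.standard_normal((200, 18))      # 100 tracked points, 18 muscles
    a_true = rng.standard_normal(18)
    flow = J @ a_true + 0.01 * rng.standard_normal(200)
    print(np.allclose(estimate_muscle_activations(J, flow), a_true, atol=0.1))
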
IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2010
In this paper, we propose an approach to convert acoustic speech to video-realistic mouth animation based on an articulatory dynamic Bayesian network model with constrained asynchrony (AF_AVDBN). Conditional probability distributions are defined to control the asynchronies between articulators such as the lips, tongue and glottis/velum. An EM-based conversion algorithm is also presented to learn the optimal visual features given an auditory input and the trained AF_AVDBN parameters. In the training of the AF_AVDBN ...
This paper presents a novel audio-visual diviseme (viseme pair) instance selection and concatenation method for speech-driven photo-realistic mouth animation. First, an audio-visual diviseme database is built, consisting of the audio feature sequences, intensity sequences and visual feature sequences of the instances. In the Viterbi-based diviseme instance selection, we set the accumulative cost as the weighted ...
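
A minimal sketch of this kind of Viterbi-based unit selection is given below, assuming hypothetical target_cost and concat_cost functions and toy scalar candidates; the paper's actual cost terms (audio, intensity and visual distances) are not reproduced here.

    import numpy as np

    def select_instances(candidates, target_cost, concat_cost,
                         w_target=1.0, w_concat=1.0):
        """Viterbi-style unit selection over per-diviseme candidate lists.

        candidates  : list of lists; candidates[t] holds the candidate
                      instances for the t-th diviseme slot (hypothetical).
        target_cost : f(t, candidate) -> float, match cost against the input.
        concat_cost : f(prev_candidate, candidate) -> float, join smoothness.
        Returns the index of the chosen candidate for each slot.
        """
        T = len(candidates)
        best = [np.array([w_target * target_cost(0, c) for c in candidates[0]])]
        back = []
        for t in range(1, T):
            costs = np.array([w_target * target_cost(t, c) for c in candidates[t]])
            trans = np.array([[w_concat * concat_cost(p, c) for c in candidates[t]]
                              for p in candidates[t - 1]])
            total = best[-1][:, None] + trans          # (prev, cur)
            back.append(total.argmin(axis=0))
            best.append(total.min(axis=0) + costs)
        # Backtrack the cheapest accumulated-cost path.
        path = [int(best[-1].argmin())]
        for t in range(T - 2, -1, -1):
            path.append(int(back[t][path[-1]]))
        return path[::-1]

    # Toy example: candidates are scalar "visual features"; the target cost
    # matches a reference trajectory, the concatenation cost penalizes jumps.
    ref = [0.0, 0.5, 1.0]
    cands = [[0.1, 0.8], [0.4, 0.9], [0.2, 1.1]]
    print(select_instances(cands,
                           target_cost=lambda t, c: abs(c - ref[t]),
                           concat_cost=lambda p, c: abs(c - p)))   # [0, 0, 1]
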
This paper addresses the problem of animating a talking figure, such as an avatar, using speech input only. The proposed system is based on hidden Markov models for the acoustic observation vectors of the speech sounds that correspond to each of 16 visually distinct mouth shapes (called visemes). This case study illustrates that it is indeed possible to obtain visually relevant speech segmentation data directly from a purely acoustic observation sequence.
The broadcasting sector is facing challenging years. With the exponential growth of media content, and with consumers expecting a high-quality service, it is becoming progressively more challenging to offer a personalized media experience. Media companies are continuously exploring novel ways to target their viewers with such a service. In this context, it is crucial for a broadcaster to gain a better understanding of the targeted audience. In this paper we highlight how empathy is an essential ingredient for future personalized and interactive services and how it brings the consumer experience to an unprecedented level. Knowing how people experience different kinds of content is strategic information for a broadcaster. (Re)acting on this information potentially delivers better content targeting, enrichment and adaptation. We present an overview of the state-of-the-art and how it is being applied in broadcasting. We illustrate this with a concrete example case of an empathic product and dis...
In this paper we describe an MPEG-4 AFX compliant Skin&Bone based animation technique applied to a novel surface representation method, called MESHGRID, which is characterized by a connectivity description defined in terms of a regular 3D grid of points, i.e. the reference grid. The approach is based on deforming the regular reference grid in order to obtain the animation of the vertices of ...
Proceedings of the 2003 International Conference on Neural Networks and Signal Processing, 2003
This paper presents an acoustic-viseme-based continuous speech recognition system for speech-driven talking face animation. The system is developed using viseme HMMs with acoustic speech as the only input. Triseme HMMs are adopted to reflect the mouth shape contexts. Visual decision trees are introduced to obtain robust parameter training for the triseme HMMs with the limited training data. In the tree-building process, methods based on lip rounding and the similarity of viseme shapes are introduced to design the visual questions. The results from objective and subjective evaluations show that the talking face animation based on the speech recognition system presented in this paper outperforms the conventional phoneme-based one, and that it is possible to obtain visually relevant speech segmentation information from the acoustic speech signal only.
Proceedings EC-VIP-MC 2003. 4th EURASIP Conference focused on Video/Image Processing and Multimedia Communications (IEEE Cat. No.03EX667), 2003
This paper addresses the problem of animating a talking figure, such as an avatar, using speech input only. The system that was developed is based on hidden Markov models for the acoustic observation vectors of the speech sounds that correspond to each of 16 visually distinct mouth shapes (visemes). The acoustic variability with context was taken into account by building acoustic viseme models that depend on the left and right viseme contexts. Our experimental results show that it is indeed possible to obtain visually relevant speech segmentation data directly from the purely acoustic speech signal.
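
The sketch below is a drastically simplified stand-in for viseme-based acoustic segmentation: one diagonal Gaussian per viseme (instead of the context-dependent viseme HMMs used in the paper) and a Viterbi pass with a fixed switching penalty. All data and parameters are synthetic and hypothetical.

    import numpy as np

    def fit_viseme_gaussians(frames, labels, n_visemes):
        """Fit one diagonal Gaussian per viseme from labelled acoustic frames."""
        means, variances = [], []
        for v in range(n_visemes):
            X = frames[labels == v]
            means.append(X.mean(axis=0))
            variances.append(X.var(axis=0) + 1e-6)
        return np.array(means), np.array(variances)

    def segment(frames, means, variances, switch_penalty=4.0):
        """Per-frame viseme labels: Gaussian log-likelihoods plus a Viterbi
        pass that penalizes switching viseme between consecutive frames."""
        T, V = len(frames), len(means)
        diff = frames[:, None, :] - means[None, :, :]
        loglik = -0.5 * np.sum(diff ** 2 / variances
                               + np.log(2 * np.pi * variances), axis=2)
        delta = loglik[0].copy()
        back = np.zeros((T, V), dtype=int)
        for t in range(1, T):
            trans = delta[:, None] - switch_penalty * (1 - np.eye(V))
            back[t] = trans.argmax(axis=0)
            delta = trans.max(axis=0) + loglik[t]
        path = [int(delta.argmax())]
        for t in range(T - 1, 0, -1):
            path.append(int(back[t][path[-1]]))
        return np.array(path[::-1])

    # Toy data: 2-D "acoustic" frames drawn around three viseme centres.
    rng = np.random.default_rng(1)
    centres = np.array([[0., 0.], [3., 0.], [0., 3.]])
    labels = np.repeat([0, 1, 2], 50)
    frames = centres[labels] + 0.3 * rng.standard_normal((150, 2))
    m, v = fit_viseme_gaussians(frames, labels, 3)
    seg = segment(frames, m, v)
    print(seg[:5], seg[-5:])
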
Proceedings of the 2003 International Conference on Machine Learning and Cybernetics (IEEE Cat. No.03EX693), 2003
This paper presents an audio-visual bimodal continuous speech recognition system. The visual features of the mouth movements are the numbers of granules obtained by applying a datasieve. Multi-stream HMMs are introduced to combine the audio and visual modalities using time-synchronous audio-visual features. Experimental results show that the recognition system presented in this paper is suitable for continuous speech recognition tasks in noisy environments, and that the datasieve-based visual features outperform the conventional DCT and DWT features.
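
The core fusion rule of a state-synchronous multi-stream HMM is a stream-weighted combination of the per-state observation log-likelihoods. A minimal sketch with a hypothetical fixed stream weight:

    import numpy as np

    def combined_log_likelihood(audio_loglik, visual_loglik, audio_weight=0.7):
        """Stream-weighted observation score for a multi-stream HMM state.

        audio_loglik, visual_loglik : per-state log-likelihoods of the current
            audio and visual feature vectors (arrays of equal length).
        audio_weight : stream exponent in [0, 1]; the visual stream gets
            (1 - audio_weight). A fixed weight is assumed here.
        """
        return audio_weight * np.asarray(audio_loglik) \
            + (1.0 - audio_weight) * np.asarray(visual_loglik)

    # In clean audio the acoustic stream dominates; in noise a lower
    # audio_weight lets the visual stream compensate.
    print(combined_log_likelihood([-10.0, -2.0], [-3.0, -8.0], audio_weight=0.3))
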
In this paper, we consider fitting a 3D wireframe face model to continuous video sequences for the tasks of simultaneously tracking rigid head motion and non-rigid facial animation. We propose a two-level integrated model for accurate 3D face alignment. At the low level, the 2D shape is accurately extracted via a regularized shape model relying on a cascaded parameter/constraint ...
This work presents a new method to automatically locate frontal facial feature points under large scene variations (illumination, pose and facial expressions). First, we use a kernel-based tracker to detect and track the facial region in an image sequence. Then the results of the face tracking, i.e. the face region and face pose, are used to constrain prominent facial feature detection and tracking. In our case, the eyes and mouth corners are considered as prominent facial features. In a final step, we propose an improvement to the Bayesian Tangent Shape Model for the detection and tracking of the full shape model. A constrained regularization algorithm is proposed that uses the head pose and the accurately aligned prominent features to constrain the deformation parameters of the shape model. Extensive experiments demonstrate the accuracy and effectiveness of the proposed method.
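
A constrained, regularized shape-parameter fit can be sketched as a ridge-regularized projection onto the shape basis with the usual +/- 3 standard deviation clamp per mode. This is a generic point-distribution-model illustration with hypothetical inputs, not the paper's pose- and feature-constrained algorithm.

    import numpy as np

    def fit_shape_params(y, mean_shape, basis, eigenvalues, lam=0.1):
        """Regularized fit of shape-model parameters to observed landmarks.

        y           : (2N,) observed landmark coordinates (stacked x, y).
        mean_shape  : (2N,) mean shape of the point distribution model.
        basis       : (2N, K) shape eigenvectors.
        eigenvalues : (K,) variances of each mode, used to clamp parameters.
        lam         : regularization weight pulling parameters toward zero.
        All inputs are placeholders for a trained shape model.
        """
        P = np.asarray(basis)
        r = np.asarray(y) - np.asarray(mean_shape)
        b = np.linalg.solve(P.T @ P + lam * np.eye(P.shape[1]), P.T @ r)
        # Constrain each mode to +/- 3 standard deviations, as is common
        # for point distribution models.
        limit = 3.0 * np.sqrt(np.asarray(eigenvalues))
        return np.clip(b, -limit, limit)

    # Toy usage with a random orthonormal 2-mode basis.
    rng = np.random.default_rng(2)
    P = np.linalg.qr(rng.standard_normal((10, 2)))[0]
    mean = rng.standard_normal(10)
    y = mean + P @ np.array([0.5, -0.2])
    print(fit_shape_params(y, mean, P, eigenvalues=[1.0, 0.5]))
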
2009 Fifth International Conference on Image and Graphics, 2009
This paper presents an audio-visual multi-stream DBN model (Asy_DBN) for emotion recognition with constrained asynchrony, in which the audio and visual states transition independently in their corresponding streams, but the transitions are constrained by the maximum allowed audio-visual asynchrony. Emotion recognition experiments with Asy_DBN under different asynchrony constraints are carried out on an audio-visual speech database of four emotions, and compared with a single-stream HMM, a state-synchronous HMM (Syn_HMM), a state-synchronous DBN model, and a state-asynchronous DBN model without an asynchrony constraint. The results show that, by setting the appropriate maximum asynchrony constraint between the audio and visual streams, the proposed audio-visual asynchronous DBN model achieves the highest emotion recognition performance, with an improvement of 15% over Syn_HMM.
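
The asynchrony constraint itself can be illustrated by enumerating which joint (audio_state, visual_state) pairs remain reachable when the two streams may drift apart by at most max_async states. The helper below is a hypothetical illustration of that constraint, not the full DBN.

    def allowed_state_pairs(n_states, max_async):
        """Joint (audio_state, visual_state) pairs permitted when the streams
        may not drift apart by more than max_async states."""
        return [(a, v) for a in range(n_states) for v in range(n_states)
                if abs(a - v) <= max_async]

    # With 5 states per stream: a fully synchronous model (max_async = 0)
    # allows only 5 joint states, whereas max_async = 1 allows 13.
    pairs = allowed_state_pairs(5, 1)
    print(len(pairs), pairs[:5])
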
2009 Fifth International Conference on Image and Graphics, 2009
This paper presents a mouth animation construction method based on DBN models with articulatory features (AF_AVDBN), in which the articulatory features of the lips, tongue and glottis/velum can be asynchronous within a maximum asynchrony constraint, describing the speech production process more realistically. Given an audio input and the trained AF_AVDBN models, the optimal visual feature learning algorithm is derived based on the maximum likelihood estimation criterion. The learned visual features are then ...
Papers by Ilse Ravyse