
Frédéric Berthommier

The articulatory-acoustic relationship is many-to-one and nonlinear, which is a major limitation for studying speech production. A simplification is proposed to set a bijection between the vowel space (f1, f2) and the parametric space of different vocal tract models. The generic area function model is based on mixtures of cosines, allowing the generation of the main vowels with two formulas. The mixture function is then transformed into a coordination function able to deal with articulatory parameters. It is shown that the coordination function acts similarly with Fant's model and with the 4-Tube DRM derived from the generic model.
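For readers who want to experiment with the tube-model side of this, the Python sketch below builds an illustrative area function from a two-term cosine mixture and derives rough formant estimates via the standard reflection-coefficient/LPC correspondence. The cosine weights, tube dimensions and the omission of glottal and lip terminations are assumptions for illustration, not the paper's formulas.

import numpy as np

def cosine_mixture_area(n_sections=20, w1=0.6, w2=0.4, amp=0.8):
    """Illustrative area function (glottis to lips): a baseline tube modulated
    by a mixture of two cosine terms. The coefficients are hypothetical, not
    the ones used in the paper."""
    x = np.linspace(0.0, np.pi, n_sections)
    profile = w1 * np.cos(x) + w2 * np.cos(2.0 * x)
    return 4.0 * np.exp(amp * profile)                  # cm^2, always positive

def rough_formants(area, tract_length_cm=17.0, c_cm_s=35000.0, n_formants=3):
    """Equal-length sections -> junction reflection coefficients -> LPC
    polynomial via the Levinson step-up recursion -> formants from the
    polynomial roots. Glottal and lip terminations are ignored."""
    n = len(area)
    k = (area[:-1] - area[1:]) / (area[:-1] + area[1:])
    a = np.array([1.0])
    for km in k:
        a = np.concatenate([a, [0.0]]) + km * np.concatenate([[0.0], a[::-1]])
    fs = c_cm_s / (2.0 * tract_length_cm / n)           # tube-model sampling rate
    roots = np.roots(a)
    freqs = np.angle(roots[np.imag(roots) > 0]) * fs / (2.0 * np.pi)
    freqs = np.sort(freqs[freqs > 100.0])               # drop near-DC roots
    return freqs[:n_formants]

print(np.round(rough_formants(cosine_mixture_area())))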
We have found a significant relationship between consonant inventory size and the presence of the labiodentals /f v/. The geographical distribution of languages with small inventories overlaps with that of hunter-gatherer populations, leading to a possible confusion. The particular characteristics of Australia are detailed because they are symptomatic of this confusion.
Doctoral thesis (diplôme de doctorat) of the Université de Lyon, awarded by the Université Claude Bernard Lyon 1, École doctorale M.E.G.A.
The predominant way to synthesize stop consonants is currently to use an articulatory model controlled by vocal tract parameters. We propose a new method to perform this synthesis in various vocalic contexts. To generate the formant transitions, the basic principle is to apply an opening function to the (equal-length section) area function derived from the linear predictive (LP) model of speech signals. The definition of this opening function is empirically based on morphological considerations, and its main parameter is the place of articulation. Syllabic sounds with /b d g/ in /a i u/ vowel contexts are generated using LP synthesis with reflection coefficients corresponding to the interpolated area function. We show that the general structure of the formant transitions can be well represented using this model, and provide intelligible sound examples. Index Terms: syllable synthesis, co-articulation, stop consonants, place of articulation, acoustic tube model, formant
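A minimal sketch of the general idea, assuming a Gaussian-shaped constriction that is released exponentially at the place of articulation; the opening-function shape, time constants and residual area below are illustrative choices, not the paper's empirical definition.

import numpy as np

def opening_function(t, n_sections, place_idx, width=3, t_release=0.02, tau=0.03):
    """Hypothetical opening function: 0 = fully constricted, 1 = vowel shape.
    A localized constriction around place_idx is released exponentially
    after t_release (all constants are illustrative)."""
    release = 0.0 if t < t_release else 1.0 - np.exp(-(t - t_release) / tau)
    dip = np.exp(-0.5 * ((np.arange(n_sections) - place_idx) / width) ** 2)
    return 1.0 - dip * (1.0 - release)

def interpolated_area(vowel_area, t, place_idx):
    """Area function at time t: the vowel area scaled by the opening function,
    with a small residual area so sections never close completely."""
    g = opening_function(t, len(vowel_area), place_idx)
    return np.maximum(vowel_area * g, 0.05)

def to_reflection_coefficients(area):
    """Equal-length sections -> reflection coefficients for LP synthesis."""
    return (area[:-1] - area[1:]) / (area[:-1] + area[1:])

vowel_a = 4.0 + 2.0 * np.cos(np.linspace(0, np.pi, 20))   # crude /a/-like tube
for t in (0.0, 0.03, 0.08):
    k = to_reflection_coefficients(interpolated_area(vowel_a, t, place_idx=16))
    print(t, np.round(k[12:18], 2))                       # junctions near the constriction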
This paper addresses a method for the blind separation of delayed and superimposed sources. We employ a linear feedforward temporal network as the separation system and present a simple associated learning algorithm. We adopt the blind separation technique as a front-end processor for a robust cocktail-party speech recognition task. An experimental study of the proposed algorithm is given on two different sets of real-world data.
It has been shown that the integration of acoustic and visual information, especially in noisy conditions, yields improved speech recognition results. This raises the question of how to weight the two modalities in different noise conditions. Throughout this paper we develop a weighting process that adapts to various background noise situations. In the presented recognition system, audio and video data are combined following a Separate Integration (SI) architecture. A hybrid Artificial Neural Network/Hidden Markov Model (ANN/HMM) system is used for the experiments. The neural networks were in all cases trained on clean data. Firstly, we evaluate the performance of different weighting schemes in a manually controlled recognition task with different types of noise. Next, we compare different criteria to estimate the reliability of the audio stream. Based on this, a mapping between the measurements and the free parameter of the fusion process is derived and its applicability is demonstrated. Fi...
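A minimal sketch of the kind of weighted combination used in a Separate Integration architecture, where per-class audio and video scores are combined log-linearly with a stream weight lambda; the posterior values and the weights below are purely illustrative.

import numpy as np

def si_fusion(audio_post, video_post, lam, eps=1e-12):
    """Separate Integration style fusion: weighted log-linear combination
    of per-class audio and video scores, renormalized over the classes.
    lam = 1.0 trusts audio only, lam = 0.0 trusts video only."""
    log_p = lam * np.log(audio_post + eps) + (1.0 - lam) * np.log(video_post + eps)
    p = np.exp(log_p - np.max(log_p))
    return p / p.sum()

# Illustrative posteriors over three classes for one frame.
audio = np.array([0.2, 0.5, 0.3])   # noisy audio, weak evidence
video = np.array([0.7, 0.2, 0.1])   # clean video, strong evidence
for lam in (0.9, 0.5, 0.1):
    print(lam, np.round(si_fusion(audio, video, lam), 3))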
We propose and test a technique for speech enhancement based on the computation of a harmonicity index, which is nonlinearly related to the SNR. We assume this method is close to a "segregation" of speech and noise, in line with the aim of the CASA approach. To carry out the performance evaluation, we quantify the accuracy of reconstruction of the target speech source. We vary factors including the size of the time-frequency regions in which the enhancement process is applied and the use of demodulation. We conclude that these factors have little effect on reconstruction accuracy, but that demodulation improves the reconstruction, and that a process applied in 4 sub-bands with a 128 ms frame duration is satisfactory. Then, using an HMM/ANN model, we evaluate the recognition scores in comparison with those obtained with unprocessed noisy speech, J-RASTA-PLP pre-processing and training on a clean signal. A gain of 3-4 dB is observed in loud noise with GWN, and 3 dB with car noise, at...
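A rough sketch of one possible enhancement scheme in this spirit: a per-frame, per-subband harmonicity index computed from the normalized autocorrelation is used as a soft gain. The band edges, filter design and the direct use of the index as a gain are assumptions; the paper's exact index and processing chain may differ.

import numpy as np
from scipy.signal import butter, sosfilt

def harmonicity_index(frame, fs, fmin=60.0, fmax=400.0):
    """Peak of the normalized autocorrelation in a plausible pitch-lag range:
    close to 1 for strongly voiced (harmonic) frames, near 0 for noise."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    if ac[0] <= 0:
        return 0.0
    lo, hi = int(fs / fmax), int(fs / fmin)
    return float(np.max(ac[lo:hi]) / ac[0])

def enhance(x, fs, n_bands=4, frame_s=0.128):
    """Split into n_bands, weight each frame of each band by its harmonicity
    index (a crude soft mask), and sum the bands back together."""
    edges = np.linspace(100.0, 0.45 * fs, n_bands + 1)
    frame_len = int(frame_s * fs)
    out = np.zeros_like(x)
    for lo, hi in zip(edges[:-1], edges[1:]):
        sos = butter(4, [lo, hi], btype="band", fs=fs, output="sos")
        band = sosfilt(sos, x)
        for start in range(0, len(x) - frame_len + 1, frame_len):
            seg = band[start:start + frame_len]
            out[start:start + frame_len] += seg * harmonicity_index(seg, fs)
    return out

# Example: a noisy 200 Hz harmonic tone is passed through the enhancer.
fs = 16000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 200 * t) + 0.5 * np.random.randn(fs)
y = enhance(x, fs)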
synthesis driven by geometrical contours of the
Perceptual experiments on audio-visual consonant recognition based on the spectral reduction of speech (SRS) have been carried out with coherent and incoherent (McGurk) audio-visual pairs. The main interest of SRS in four sub-bands is to obtain a partial suppression of the information transmitted for the place of articulation. The integration of manner, restricted to the fricative/occlusive contrast, is also addressed, and a new 'crossmanner' combination is tested. As expected, we find a good audiovisual complementarity for SRS and a high number of McGurk responses, but new interesting effects are observed. For the interpretation of human confusions about place of articulation, the Bayesian model proposed by Massaro and Stork [8] is compared to a new place identification model which is based on averaging as well as on the separate identification of articulatory features. This decomposition is a promising way for the development of multistream speech recognition models.
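The two combination rules being compared can be illustrated with a small numerical example: a multiplicative (FLMP-style) rule versus a weighted averaging rule applied to per-place probabilities from the two modalities. The probability values below are invented for illustration and do not come from the experiments.

import numpy as np

def flmp_fusion(audio_p, video_p):
    """Multiplicative (FLMP / naive-Bayes style) combination of per-place
    probabilities, renormalized over the response categories."""
    p = audio_p * video_p
    return p / p.sum()

def averaging_fusion(audio_p, video_p, w=0.5):
    """Averaging-based combination: a weighted mean of the two modalities."""
    p = w * audio_p + (1.0 - w) * video_p
    return p / p.sum()

# Illustrative probabilities over places (bilabial, alveolar, velar)
# for a McGurk pair: audio favours bilabial, video favours velar.
audio = np.array([0.6, 0.3, 0.1])
video = np.array([0.05, 0.35, 0.6])
print("FLMP:     ", np.round(flmp_fusion(audio, video), 3))
print("Averaging:", np.round(averaging_fusion(audio, video), 3))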
For speech segregation, a blind source separation (BSS) model is tested together with a CASA model based on the localisation cue and the evaluation of the time delay of arrival (TDOA). The test database is composed of 332 binary mixtures of sentences recorded in stereo with a static set-up. These are truncated at 1 second for the simulations. To apply the two models, we split the frequency domain into a variable number of subbands, which are processed independently. Then, we evaluate the gain, using reference signals recorded in isolation. Without using this reference, a coherence index is also established for the BSS model, which measures the degree of convergence. After a careful analysis, we find gains of about 1-3 dB for the two methods, which are smaller than those published for the same task. Varying the number of subbands allows an optimisation, and we obtain a significant peak at 4 subbands for the CASA model, and a smaller maximum at 2 subbands for the BSS model.
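The localisation cue underlying the CASA model can be illustrated with a PHAT-weighted generalized cross-correlation TDOA estimate on a stereo pair; in the study this kind of estimate is applied per subband. The implementation below is a generic sketch, not the model used in the paper.

import numpy as np

def gcc_phat_tdoa(left, right, fs, max_delay_s=1e-3):
    """PHAT-weighted generalized cross-correlation TDOA estimate.
    Positive result: the right channel lags the left."""
    n = len(left) + len(right)
    spec = np.fft.rfft(right, n) * np.conj(np.fft.rfft(left, n))
    spec /= np.abs(spec) + 1e-12              # PHAT weighting
    cc = np.fft.irfft(spec, n)
    max_lag = int(max_delay_s * fs)
    cc = np.concatenate([cc[-max_lag:], cc[:max_lag + 1]])
    return (np.argmax(np.abs(cc)) - max_lag) / fs

# Example: a 0.5 ms inter-channel delay is recovered from a noise-like source.
fs = 16000
src = np.random.randn(fs)
delay = int(0.0005 * fs)
left, right = src, np.concatenate([np.zeros(delay), src[:-delay]])
print(gcc_phat_tdoa(left, right, fs))   # ~ 0.0005 s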
enhancement and segregation based on the localisation
A possible neurophysiological correlate of audiovisual binding and unbinding in speech perception. doi: 10.3389/fpsyg.2014.01340
In this paper we present a system for audio-visual speech recognition based on a hybrid Artificial Neural Network/Hidden Markov Model (ANN/HMM) approach. To set up the system it was necessary to record a new audio-visual database; we describe the recording and labeling of this database. The fusion of audio and video data is a key aspect of the paper. Three conditions attract our attention: when only the audio data is reliable, when only the video data is reliable, and when both are equally reliable. A method to combine the video and audio information based on these three conditions is presented, and an implementation of this method as an automatic fusion depending on the noise level in the audio channel is developed. The performance of the complete system is demonstrated using two types of additive noise at varying SNR.
Abstract: Audio-visual speech recognition leads to significant improvements compared to pure audio recognition, especially when the audio signal is corrupted by noise. In this article we investigate the consequences of additional degradations in the video signal on the audio-visual recognition process. We degrade the images with noise, JPEG compression, and errors in the localization of the mouth region. The first question we address is how the noise in the video stream influences the recognition scores. Therefore we added noise to both the audio and the video signal at different SNR levels. The second question is how the adaptation of the fusion parameter, which controls the contribution of the audio and video streams to the recognition, is affected by the additional noise in the video stream. We compare the results we obtain when we adapt the fusion parameter to the noise in the audio and video streams to those we get when it is only adapted to the noise in the audio stream and hence a ...
During the fusion of audio and video information for speech recognition, the estimation of the reliability of the noise-affected audio channel is crucial to obtain meaningful recognition results. In this paper we compare two types of reliability measures: one uses the statistics of the phoneme a-posteriori probabilities and the other analyses the audio signal itself. We implemented the entropy and the dispersion of the probabilities and, among the audio-based criteria, the so-called Voicing Index. To test the criteria, a hybrid ANN/HMM audio-visual recognition system was used and 5 different types of noise at 12 SNR levels each were added to the audio signal. The best sigmoidal fit between the fusion parameter and the value of each criterion over all noise types and SNR values was performed. The resulting individual errors and the corresponding averaged relative errors are given.
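A minimal sketch of the posterior-based reliability measures and of a sigmoid mapping from the measure to the stream weight; the entropy and dispersion definitions follow common usage and the sigmoid constants are illustrative, not the fitted values reported in the paper.

import numpy as np

def entropy(post, eps=1e-12):
    """Entropy of a frame's phoneme posterior vector (high = unreliable audio)."""
    return float(-np.sum(post * np.log(post + eps)))

def dispersion(post, k=3):
    """Mean pairwise difference between the k largest posteriors
    (low dispersion = flat, unreliable posteriors)."""
    top = np.sort(post)[::-1][:k]
    return float(np.mean([top[i] - top[j] for i in range(k) for j in range(i + 1, k)]))

def weight_from_entropy(h, h0=1.0, slope=4.0):
    """Illustrative sigmoid mapping from entropy to the audio stream weight."""
    return 1.0 / (1.0 + np.exp(slope * (h - h0)))

clean = np.array([0.85, 0.1, 0.03, 0.02])   # peaked posteriors (clean audio)
noisy = np.array([0.3, 0.25, 0.25, 0.2])    # flat posteriors (noisy audio)
for post in (clean, noisy):
    print(round(entropy(post), 2), round(dispersion(post), 2),
          round(weight_from_entropy(entropy(post)), 2))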
Audio-visual (AV) speech interaction has been considered for its redundancy and complementarity properties at the phonetic level, but a few experiments have shown a significant role in early auditory analysis. A new paradigm is proposed which uses the pre-voicing component (PVC) excised from a true /b/. When the so-called target PVC is added to a /p/, this leads to the clear perception of /b/. Moreover, varying the amplitude of the target PVC allows the building of a perceptual continuum between /p/, when the amplitude is set to 0, and /b/ at the original amplitude. In the audio channel, adding a series of PVCs at a fixed low amplitude before and after the target allows the creation of a stream of regular sounds which are not related to visible events. On the contrary, the bilabial aperture of the /p/ is a specific speech gesture visible in the video channel. The target PVC and the visible gesture are thus not redundant events. Then, depending on its intensity level, the target PVC added to an audio /...
When several speakers are present simultaneously, what is the influence of the interaural time difference (ITD) on speech discrimination? We replicate Shackleton and Meddis' experiment (1992) and propose an AM map model enhanced by ITD information for a double-vowel segregation task. Coupling such primitive representations globally improves the recognition rate by 10%. This agrees with the grouping of spectral information by the auditory system according to the delay cue.
In a series of experiments we showed that the McGurk effect may be modulated by context: applying incoherent auditory and visual material before an audiovisual target made of an audio "ba" and a video "ga" significantly decreases the McGurk effect. We interpreted this as showing the existence of an audiovisual "binding" stage controlling the fusion process. Incoherence would produce "unbinding" and result in a decreased weight of the visual input in the fusion process. In this study, we further explore this binding stage through two experiments. Firstly, we test the "rebinding" process by presenting a short period of either coherent material or silence after the incoherent "unbinding" context. We show that coherence provides "rebinding", resulting in a recovery of the McGurk effect. On the contrary, silence provides no rebinding and hence "freezes" the unbinding process, resulting in no recovery of the McGurk effect.
Keywords: speech. Reference: EPFL-CONF-82465. URL: http://publications.idiap.ch/downloads/reports/1998/glotin-nsi98.pdf (record created 2006-03-10, modified 2017-05-10).
The goal of this work is to describe a mode of descending control of activity in the olfactory bulb, considered as an intermediate processing stage. We assume that this control is not only achieved by modifying the mean discharge rate of the principal cells, but by acting on the correlations of their activities. The peripheral spatio-temporal olfactory image is pre-processed in a first linear filtering layer and then in a second, nonlinear stage receiving descending influences. The latter is a network of probabilistic oscillators directly delivering instantaneous discharge probabilities. The connection operator, named co-reset, is also probabilistic in nature. We thus establish a strict equivalence between a stochastic spiking network and a network whose neurons exchange and transmit a continuous signal. It offers the possibility of integrating the spatio-temporal information...
This paper examines the degree of correlation between lip and jaw configuration and speech acoustics. The lip and jaw positions are characterised by a system of measurements taken from video images of the speaker’s face and profile, and the acoustics are represented using line spectral pair parameters and a measure of RMS energy. A correlation is found between the measured acoustic parameters and a linear estimate of the acoustics recovered from the visual data. This correlation exists despite the simplicity of the visual representation and is in rough agreement with correlations measured in earlier work by Yehia et al. using different techniques. However, analysis of the estimation errors suggests that the visual information, as parameterised in our experiment, offers only a weak constraint on the acoustics. Results are discussed from the perspective of models of early audio-visual integration.
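The linear estimation and correlation analysis can be sketched as an ordinary least-squares mapping from visual measurements to acoustic parameters, followed by per-parameter correlations between measured and estimated acoustics. The data below are random stand-ins for the video-derived lip/jaw measurements and the LSP + RMS-energy parameters; only the procedure is illustrated.

import numpy as np

rng = np.random.default_rng(0)

# Stand-in data: 500 frames, 6 visual measurements, 10 acoustic parameters.
n_frames, n_vis, n_ac = 500, 6, 10
V = rng.normal(size=(n_frames, n_vis))
true_map = rng.normal(size=(n_vis, n_ac))
A = V @ true_map + 0.5 * rng.normal(size=(n_frames, n_ac))   # "measured" acoustics

# Least-squares linear estimate of the acoustics from the visual features
# (an appended column of ones makes the fit affine).
V1 = np.hstack([V, np.ones((n_frames, 1))])
W, *_ = np.linalg.lstsq(V1, A, rcond=None)
A_hat = V1 @ W

# Per-parameter correlation between measured and estimated acoustics.
corr = [np.corrcoef(A[:, j], A_hat[:, j])[0, 1] for j in range(n_ac)]
print(np.round(corr, 2))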
Recent findings demonstrate that audiovisual fusion during speech perception may involve pre-phonetic processing. The aim of the current experiment is to investigate this hypothesis using a pairing task between auditory sequences of vowels and non-speech visual cues. The audio sequences are composed of 6 auditory French vowels alternating in pitch (or not) in order to build 2 interleaved streams of 3 vowels each. Various elementary visual displays are mounted in synchrony with one vowel stream out of the two. Our hypothesis is that, in a forced-choice pairing task, the AV-synchronized vowels will be found more frequently if such a perceptual binding operates. We show that the most efficient visual feature for increasing pairing performance is movement. Surprisingly, some of the features we manipulated do not increase pairing performance. The visual cue of contrast variation is not correctly paired with the synchronized auditory vowels. Moreover, the auditory segregation, ba...
Perceptual experiments on audio-visual consonant recognition based on the spectral reduction of speech (SRS) have been carried out with coherent and incoherent (McGurk) audio-visual pairs. The main interest of SRS in four sub-bands is to obtain a partial suppression of the information transmitted for the place of articulation. The integration of manner, restricted to the fricative/occlusive contrast, is also of concern, and a new 'crossmanner' combination is tested. As expected, we find a good audiovisual complementarity for SRS and a high number of McGurk responses, but new interesting effects are observed. For the interpretation of human confusions about place of articulation, the Bayesian model proposed by Massaro and Stork [8] is compared to a new place identification model which is based on averaging as well as on the separate identification of articulatory features. This decomposition is a promising way for the development of multistream speech recognition models.
Recently, we proposed a new technique for facilitating the extraction of vocal tract contours from complete sequences of large existing cineradiographic databases. The articulators (tongue, tongue tip, velum, lips, etc.) are processed independently before being combined to reconstruct the whole vocal tract. Applied to one sequence of the ATR database, Laval43, the method allows us to estimate the shape of the complete vocal tract and the corresponding midsagittal sections. These are compatible with standard articulatory synthesis models. The formant trajectories are synthesized using the transfer functions calculated from the estimated area functions. A comparison between estimated and original formants is carried out. Then, by introducing the 2-subband amplitude modulation extracted from the original audio signal, the synthesis of intelligible speech is realized and spectral distances are evaluated.
We study the effects of additive white noise on the cepstral representation of speech signals. The distribution of each individual cepstrum coefficient of speech is shown to depend strongly on noise and to overlap significantly with the cepstrum distribution of noise. Based on these studies, we suggest a scalar quantity, V, equal to the sum of weighted cepstral coefficients, which is able to classify frames containing speech against noise-like frames. The distributions of V for speech and noise frames are reasonably well separated above SNR = 5 dB, demonstrating the feasibility of a robust speech detector based on V.
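A minimal sketch of a frame-level detector built on this idea: a weighted sum of real-cepstrum coefficients, compared between a harmonic (speech-like) frame and a white-noise frame. The uniform weights and the toy signals are illustrative; the paper derives the weights from the noise analysis.

import numpy as np

def cepstrum(frame, n_coef=12):
    """Real cepstrum of one windowed frame, keeping the first n_coef
    coefficients (c0 excluded)."""
    spec = np.abs(np.fft.rfft(frame * np.hamming(len(frame)))) + 1e-12
    c = np.fft.irfft(np.log(spec))
    return c[1:n_coef + 1]

def speech_score(frame, weights=None):
    """Scalar V: weighted sum of cepstral coefficients. Speech-like frames
    have larger low-quefrency cepstral values than white noise."""
    c = cepstrum(frame)
    if weights is None:
        weights = np.ones_like(c)        # illustrative uniform weights
    return float(np.dot(weights, c))

fs = 16000
t = np.arange(512) / fs
speech_like = np.sin(2 * np.pi * 150 * t) * (1 + 0.5 * np.sin(2 * np.pi * 8 * t))
noise = np.random.randn(512)
print("speech-like V:", round(speech_score(speech_like), 3))
print("noise V:      ", round(speech_score(noise), 3))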
This paper examines the degree of correlation between lip and jaw configuration and speech acoustics. The lip and jaw positions are characterised by a system of measurements taken from video images of the speaker's face and profile, and the acoustics are represented using line spectral pair parameters and a measure of RMS energy. A correlation is found between the measured acoustic parameters and a linear estimate of the acoustics recovered from the visual data. This correlation exists despite the simplicity of the mapping and is in rough agreement with correlations measured in earlier work by Yehia et al. The linear estimates are also compared to estimates made using nonlinear models. In particular it is shown that although performance of the two models is remarkably similar for static visual features, non-linear models are better able to handle dynamic features.
1. Neighboring mitral cells in the rat olfactory bulb have been previously shown to give similar response profiles to a series of odorants. We now analyze their temporal patterns of activity before and during stimulation to evaluate to what extent soma proximity may act on their temporal correlation and to what extent olfactory stimulation may force two close cells to fire with similar patterns. 2. In anesthetized adult rats, we recorded simultaneously the extracellular single-unit activities of two mitral cells with the use of twin micropipettes with tips separated by less than 40 microns. These activities were recorded before and during stimulation by a series of five odorants. 3. Activities were classified into nine types according to their temporal pattern along the respiratory cycle. These types comprised nonrhythmic patterns and rhythmic ones, the latter being simple or complex. A phase parameter was also calculated to compare the positions of maximal activity within respiratory cycles.
The present work is concerned with the processing of spectral information by stochastic coding, at the output of the cochlea. We study the collective behaviour of a large number of cells based on a measure of the correlation of sequences of spikes between neighbouring units. Our first results show that we can obtain significant discrimination of spectral components of vowels without any kind of cabled lateral inhibition or second filter device. This collective behaviour is made possible by the synchronization of inputs of adjacent units. This could provide an interesting and rather new way of taking advantage of both geographical coding (tonotopy) and temporal coding (synchronization), particularly useful for formant detection.
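The pairwise spike-correlation measure can be sketched as a normalized cross-correlogram between two binned spike trains, whose central peak reflects synchronization between neighbouring units. The bin size and the toy spike trains below are assumptions for illustration, not the recorded data.

import numpy as np

def bin_spikes(spike_times, duration_s, bin_s=0.002):
    """Convert spike times (in seconds) into a binary binned spike train."""
    n_bins = int(duration_s / bin_s)
    train = np.zeros(n_bins)
    idx = np.clip((np.asarray(spike_times) / bin_s).astype(int), 0, n_bins - 1)
    train[idx] = 1.0
    return train

def cross_correlogram(a, b, max_lag_bins=25):
    """Normalized (circular) cross-correlogram between two binned spike trains;
    a peak near lag 0 indicates synchronization between the two units."""
    a0, b0 = a - a.mean(), b - b.mean()
    denom = np.sqrt(np.sum(a0 ** 2) * np.sum(b0 ** 2)) + 1e-12
    return np.array([np.sum(a0 * np.roll(b0, lag)) / denom
                     for lag in range(-max_lag_bins, max_lag_bins + 1)])

# Toy example: two units that fire on overlapping subsets of shared input events.
rng = np.random.default_rng(1)
shared = rng.uniform(0.0, 2.0, 300)                       # shared event times (s)
unit1 = shared[rng.random(300) < 0.6]
unit2 = shared[rng.random(300) < 0.6]
cc = cross_correlogram(bin_spikes(unit1, 2.0), bin_spikes(unit2, 2.0))
print("peak lag (bins):", int(np.argmax(cc)) - 25, "value:", round(float(cc.max()), 2))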
Introduction: During phylogenesis, the primate vocal tract increased its pharyngeal portion, usually attributed to the "descent" of the larynx, and decreased its oral length following the reduction of prognathism. The aim of the study is to revisit the "descent" of the larynx in a non-human primate (the baboon, Papio papio), and to discuss its implications for swallowing and the emergence of speech. Materials and methods: The anatomical study was based on anatomical sections, dissections, CT scans (2 subjects) and standard radiography (6 subjects). The position of the hyoid bone relative to the vertebrae was determined in a plane parallel to the occlusal plane. The dissection made it possible to study the anatomy of the hyoid bone, the larynx and the extrinsic muscles of the tongue. Results: The position of the baboon's hyoid bone is comparable to that of humans (level C3-C4). In contrast, the thyroid cartilage is positioned behind the body of the hyoid bone: the larynx is, as it were, nested within the hyoid bone. This arrangement places the vocal folds at the level of the body of the hyoid bone, whereas in humans the vocal folds face C5-C6. Apart from the length of the tongue, the dissection reveals stylohyoid and digastric muscles oriented much more horizontally than in the adult human, but comparably to the child. Discussion: There was indeed a descent of the larynx during phylogenesis: the increase of the pharyngeal portion of the vocal tract occurred through a vertical dissociation, with the larynx disengaging from the hyoid bone, but with preservation of the position of the hyoid bone itself. This gives the base of the tongue a configuration close to that of humans, and swallowing was thus preserved. Moreover, a similar production potential is observed for speech gestures in humans and vocalizations in the baboon. This corroborates the hypothesis that speech may have evolved by exaptation of the swallowing function.
The decomposition principle was first proposed by Varga and Moore [1] and applied to Automatic Speech Recognition (ASR) in noise. We show a new adaptation of this principle to model the schema-based streaming process inferred from psychoacoustical studies [2]. We address here the classical problem of double-vowel segregation. The signal decomposition is enabled by an internal, statistical model of vowel spectra. We apply this decomposition model, which is able to reconstruct the spectra of superimposed signals after identification of only the dominant member, or of both members, of the pair. Three stages are involved. The first is a module performing identification when the input is a mixture of interfering signals; prior identification of the dominant spectrum prevents combinatorial reconstruction. The second step is an evaluation of the mixture coefficient, also based on an internal representation of spectra. Finally, the reconstruction of the spectra is probabilistic, by way of likelihood maximisation, and uses the labels and the mixture coefficient. This is tested on a large database of synthetic vowels.
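A much-simplified sketch of the three-stage decomposition: identify the dominant vowel against internal spectral templates, then estimate the partner vowel and the mixture coefficient by least squares. The formant-peaked templates and the deterministic least-squares step below stand in for the paper's statistical internal model and likelihood maximisation.

import numpy as np

# Internal model: one smooth spectral template per vowel, built from two
# formant-like peaks (hypothetical formant values, 64 spectral channels).
freqs = np.linspace(0.0, 4000.0, 64)
formant_table = {"a": (700, 1200), "e": (500, 1900), "i": (300, 2300),
                 "o": (450, 900), "u": (350, 800)}

def template(f1, f2, width=150.0):
    return (np.exp(-0.5 * ((freqs - f1) / width) ** 2)
            + np.exp(-0.5 * ((freqs - f2) / width) ** 2))

templates = {v: template(*f) for v, f in formant_table.items()}

def decompose(mixture):
    """Identify the dominant vowel first, then find the best partner vowel and
    the mixture coefficient by least squares against the internal templates."""
    dominant = max(templates, key=lambda v: np.dot(mixture, templates[v])
                                            / np.linalg.norm(templates[v]))
    d = templates[dominant]
    best = None
    for other, o in templates.items():
        if other == dominant:
            continue
        # model: mixture ~ c * d + (1 - c) * o, solved for c in [0, 1]
        c = np.clip(np.dot(mixture - o, d - o) / np.dot(d - o, d - o), 0.0, 1.0)
        err = np.linalg.norm(mixture - (c * d + (1.0 - c) * o))
        if best is None or err < best[0]:
            best = (err, other, float(np.round(c, 2)))
    return dominant, best[1], best[2]

# Example: a 0.7 / 0.3 mixture of the /a/ and /i/ templates is decomposed.
print(decompose(0.7 * templates["a"] + 0.3 * templates["i"]))   # ('a', 'i', 0.7)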
