Characterizing multi-person interactions in meetings, for example successive speaking turns, is useful for many concrete applications, e.g. in multimedia. During the course of a meeting, the active speaker is intuitively detected from voice activity. However, additional information extracted from video streams, or models of human interactions, is likely to strengthen the detection process. These aspects can thus form an original active speaker detection modality that mixes audiovisual percepts with the social behaviors inherent in the meeting context. Visual percepts are inferred using a Convolutional Neural Network (CNN) that captures spatio-temporal relationships in video clips of the meeting participants' faces. We thus compare several CNN architectures with two types of visual input data: [Figure 1: Example of active speaker detection using the Talking Face video made available by the face and gesture recognition working group.] 2. RGB i...
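The abstract names the ingredients (a CNN capturing spatio-temporal relationships in short face clips) but not the architecture. Below is a minimal sketch of what such a classifier can look like, assuming PyTorch, 16-frame 64x64 RGB face clips, and a binary speaking/not-speaking output; all layer sizes and input dimensions are illustrative assumptions, not the paper's configuration.

```python
# Hedged sketch: a small spatio-temporal CNN classifying face clips as
# "speaking" vs "not speaking". Sizes are illustrative, not the authors'.
import torch
import torch.nn as nn

class SpeakingClipCNN(nn.Module):
    """Minimal spatio-temporal CNN over a face clip (C, T, H, W)."""
    def __init__(self, num_classes: int = 2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(3, 16, kernel_size=(3, 3, 3), padding=1),  # joint space-time conv
            nn.ReLU(),
            nn.MaxPool3d((1, 2, 2)),          # pool spatially, keep temporal resolution
            nn.Conv3d(16, 32, kernel_size=(3, 3, 3), padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),          # global space-time pooling
        )
        self.classifier = nn.Linear(32, num_classes)

    def forward(self, clips: torch.Tensor) -> torch.Tensor:
        # clips: (batch, 3, frames, height, width), e.g. RGB face crops
        h = self.features(clips).flatten(1)
        return self.classifier(h)             # logits: speaking / not speaking

model = SpeakingClipCNN()
dummy = torch.randn(4, 3, 16, 64, 64)         # 4 clips of 16 frames, 64x64 RGB
print(model(dummy).shape)                     # torch.Size([4, 2])
```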
This paper presents the first results of the PIA "Grands Défis du Numérique" research project LinTO. The goal of this project is to develop a conversational assistant to help a company's employees, particularly during meetings. LinTO is an interactive device equipped with microphones, a screen and a 360° camera, which allows users to control the room and query the company's information system, helps facilitate the meeting, and provides an environment to aid minute writing. Distributed under an open model that respects private data, LinTO is the first open-source enterprise assistant designed to comply with GDPR requirements.
Ambient intelligence raises, among other issues, the problem of detecting human activities; what is at stake is, for example, automatic energy management as well as the analysis of interactions between the users sharing a space. To characterize the interactions between individuals, or between an individual and a building's infrastructure, a task of re-identifying the users of the space as they move around is necessary, and the use of multimodal models clearly makes this re-identification more robust. In this article, we propose an audio-visual fusion method, introducing a new confidence index over audio-video saliency zones, for learning a person's audiovisual signature. Keywords: audiovisual signature, audio-video fusion, person re-identification. Abstract: In intelligent environments, activity detection is a necessary pre-processing step for adaptive energy management and ...
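The confidence index over saliency zones is the paper's contribution and is not detailed in this abstract; the sketch below only illustrates the generic shape of confidence-weighted late fusion for building and matching an audiovisual signature, assuming Python/NumPy with made-up embeddings and confidence values.

```python
# Hedged sketch of confidence-weighted late fusion for an audiovisual
# signature. The paper's actual confidence index is not specified here;
# this uses a generic per-modality confidence in [0, 1].
import numpy as np

def fuse_signature(audio_emb: np.ndarray, video_emb: np.ndarray,
                   c_audio: float, c_video: float) -> np.ndarray:
    """Weight each L2-normalized modality embedding by its confidence."""
    a = audio_emb / np.linalg.norm(audio_emb)
    v = video_emb / np.linalg.norm(video_emb)
    w_a = c_audio / (c_audio + c_video + 1e-9)
    fused = w_a * a + (1.0 - w_a) * v
    return fused / np.linalg.norm(fused)

def reidentify(query: np.ndarray, gallery: dict[str, np.ndarray]) -> str:
    """Return the enrolled identity whose signature is closest (cosine)."""
    return max(gallery, key=lambda pid: float(query @ gallery[pid]))

# Toy usage: one enrolled person, one query.
rng = np.random.default_rng(0)
gallery = {"person_1": fuse_signature(rng.normal(size=128),
                                      rng.normal(size=128), 0.9, 0.6)}
query = fuse_signature(rng.normal(size=128), rng.normal(size=128), 0.4, 0.8)
print(reidentify(query, gallery))
```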
The comprehensibility of audiovisual documents can depend on factors specific to the listener/viewer (e.g. native language, cognitive performance) and on factors specific to the documents' content (e.g. linguistic complexity, speech intelligibility). In this work, we study the effects of content-specific factors on the comprehensibility of 55 dialogues extracted from films, presented to 15 experts (teachers of French as a foreign language) under five different modalities (transcript, transcript + audio, audio, audio + video, transcript + audio + video). The experts rated the dialogues in terms of overall comprehensibility, vocabulary complexity, grammatical complexity, and speech intelligibility. The analysis of their ratings shows that (1) vocabulary complexity, grammatical complexity, and speech intelligibility are significantly correlated with overall comprehensibility, and (2) that the ratings of...
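As a toy illustration of this kind of analysis: the abstract does not name the statistical test, so the sketch assumes Spearman's rank correlation (a common choice for ordinal expert ratings), Python with SciPy, and synthetic ratings in place of the paper's data.

```python
# Hedged sketch: correlating content factors with overall comprehensibility.
# Spearman and the 7-point toy scale are assumptions, not the paper's setup.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(1)
n_dialogues = 55
comprehensibility = rng.integers(1, 8, size=n_dialogues)   # e.g. 7-point scale
factors = {
    "vocabulary_complexity": rng.integers(1, 8, size=n_dialogues),
    "grammatical_complexity": rng.integers(1, 8, size=n_dialogues),
    "speech_intelligibility": rng.integers(1, 8, size=n_dialogues),
}
for name, ratings in factors.items():
    rho, p = spearmanr(ratings, comprehensibility)
    print(f"{name}: rho={rho:+.2f}, p={p:.3f}")
```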
In order to interact with humans and their environment, a service robot must be able to perceive visual and audio information about the scene it observes or takes part in. In particular, it must be able to spot salient elements in the various captured signals: spatial localization in an image, or temporal localization in an audio stream. The data-hungry nature of so-called deep learning methods, and the considerable cost of data annotation, argue for the use of semi-supervised methods, capable on the one hand of extracting information in a supervised manner, and on the other hand of predicting the spatial or temporal organization of the events present in the processed signal. In the vision domain, this concept has been used several times to perform spatial localization of objects or activities in images [1, 2, 3] from raw 2D signals (pixels). At the audio level, the trend of doing away with repr...
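One concrete instance of this weak-supervision idea in vision is a classifier trained only on image-level labels whose final convolutional map doubles as a localization map (the class-activation-map family of approaches; whether this matches the works cited as [1, 2, 3] is an assumption). A minimal sketch, assuming PyTorch and illustrative layer sizes:

```python
# Hedged sketch of weakly supervised spatial localization: the network is
# trained only with image-level labels, yet its per-location class scores
# can be read off as a localization map. Sizes are illustrative.
import torch
import torch.nn as nn

class WeaklySupLocalizer(nn.Module):
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
        )
        self.head = nn.Conv2d(64, num_classes, kernel_size=1)  # per-location scores

    def forward(self, images: torch.Tensor):
        cam = self.head(self.backbone(images))   # (B, classes, H, W) activation map
        logits = cam.mean(dim=(2, 3))            # global average pooling -> image label
        return logits, cam                       # train on logits, localize with cam

model = WeaklySupLocalizer()
logits, cam = model(torch.randn(1, 3, 64, 64))
print(logits.shape, cam.shape)   # torch.Size([1, 10]) torch.Size([1, 10, 64, 64])
```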
The MediaEval 2012 Genre Tagging Task is a follow-up to the MediaEval 2011 Genre Tagging Task and the MediaEval 2010 Wild Wild Web Tagging Task, designed to test and evaluate retrieval techniques for video content as it occurs on the Internet, i.e., for semi-professional user-generated content that is associated with annotations existing on the Social Web. The task uses the MediaEval 2012 Tagging Task (ME12TT) dataset, which is based on the whole blip10,000 collection, in contrast to the MediaEval 2010 Wild Wild Web (ME10WWW) set used in previous tasks. In this task overview paper, we describe the principal characteristics of the dataset, the task itself, and the evaluation metrics used to assess the participants' results.
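The abstract mentions evaluation metrics without restating them; mean average precision (MAP) is the customary metric for ranked retrieval tasks of this kind, so the sketch below assumes it. The function names and toy data are illustrative, not the official scoring tool.

```python
# Hedged sketch of MAP over per-genre ranked lists. Whether MAP was the
# exact official metric is an assumption; the abstract does not say.
def average_precision(ranked_ids: list[str], relevant: set[str]) -> float:
    """AP for one query: mean of precision@k at each relevant hit."""
    hits, precisions = 0, []
    for k, vid in enumerate(ranked_ids, start=1):
        if vid in relevant:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / max(len(relevant), 1)

def mean_average_precision(runs: dict[str, list[str]],
                           truth: dict[str, set[str]]) -> float:
    return sum(average_precision(runs[g], truth[g]) for g in truth) / len(truth)

# Toy usage: two "genres" with small ranked lists.
runs = {"sports": ["v1", "v3", "v2"], "news": ["v2", "v1"]}
truth = {"sports": {"v1", "v2"}, "news": {"v9"}}
print(mean_average_precision(runs, truth))  # ~0.42
```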
2021 International Conference on Content-Based Multimedia Indexing (CBMI), 2021
Meetings are a common activity in professional contexts, yet they remain difficult to analyze because they are not always structured and people cut each other off (in a debate of ideas, for example). A first step towards facilitating their analysis is to segment the meeting into zones that are homogeneous at the interaction level. To do so, we studied the typology of the non-speech segments (pauses and silences) in order to identify the different sequences within a meeting. Indeed, information such as the frequency and length of the non-speech segments will differ between a presentation and a debate. In this article, we propose an original approach to segmenting meetings using only the non-speech segments. We apply Voice Activity Detection (VAD) to find the non-speech segments, from which a set of parameters is extracted to study the typology of silence segments. We then slide a window over the whole meeting and apply an unsupervised approach to each of these windows. We validated our approach using purity and coverage metrics on part of the AMI corpus (38 meetings of about 28 minutes each). This approach is non-invasive and relies only on acoustic information; it does not analyze speech content, since moments containing speech, and potentially sensitive information, are not processed.
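As a rough illustration of this pipeline, here is a minimal sketch assuming Python with scikit-learn: toy VAD output, two silence statistics per window (frequency and mean length, the parameters the abstract mentions), and KMeans as the unsupervised step. The window size, feature set, and choice of clustering algorithm are illustrative assumptions, not the paper's exact settings.

```python
# Hedged sketch: non-speech segments from a VAD -> per-window silence
# statistics -> unsupervised clustering into interaction zones.
import numpy as np
from sklearn.cluster import KMeans

def silence_features(silences: list[tuple[float, float]],
                     t0: float, t1: float) -> list[float]:
    """Frequency and mean length of silences starting in window [t0, t1)."""
    durations = [e - s for s, e in silences if t0 <= s < t1]
    rate = len(durations) / (t1 - t0)
    mean_len = float(np.mean(durations)) if durations else 0.0
    return [rate, mean_len]

# Toy VAD output: (start, end) of non-speech segments, in seconds.
silences = [(1.0, 1.2), (3.0, 3.1), (40.0, 45.0), (50.0, 58.0)]
meeting_len, win = 60.0, 10.0
windows = np.arange(0.0, meeting_len, win)
X = np.array([silence_features(silences, t, t + win) for t in windows])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)  # e.g. presentation-like vs debate-like zones
```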