Characterizing multi-person interactions in meetings, for example successive speaking turns, is useful for many concrete applications, e.g. in multimedia. During the course of a meeting, the active speaker is intuitively detected from voice activity. However, additional information extracted from video streams, or models of human interactions, is likely to strengthen the detection process. These aspects can thus form an original active speaker detection modality that mixes audiovisual percepts with the social behaviors inherent in the meeting context. Visual percepts are inferred using a Convolutional Neural Network (CNN) that captures spatio-temporal relationships in video clips of the meeting participants' faces. We thus compare several CNN architectures with two types of visual input data: [Figure 1: Example of active speaker detection using the Talking Face video made available by the face and gesture recognition working group.] 2. RGB i...
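The abstract names the ingredients (a CNN capturing spatio-temporal relationships in short face clips) but not the architecture. Below is a minimal sketch of what such a classifier can look like, assuming PyTorch, 16-frame 64x64 RGB face clips, and a binary speaking/not-speaking output; all layer sizes and input dimensions are illustrative assumptions, not the paper's configuration.

```python
# Hedged sketch: a small spatio-temporal CNN classifying face clips as
# "speaking" vs "not speaking". Sizes are illustrative, not the authors'.
import torch
import torch.nn as nn

class SpeakingClipCNN(nn.Module):
    """Minimal spatio-temporal CNN over a face clip (C, T, H, W)."""
    def __init__(self, num_classes: int = 2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(3, 16, kernel_size=(3, 3, 3), padding=1),  # joint space-time conv
            nn.ReLU(),
            nn.MaxPool3d((1, 2, 2)),          # pool spatially, keep temporal resolution
            nn.Conv3d(16, 32, kernel_size=(3, 3, 3), padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),          # global space-time pooling
        )
        self.classifier = nn.Linear(32, num_classes)

    def forward(self, clips: torch.Tensor) -> torch.Tensor:
        # clips: (batch, 3, frames, height, width), e.g. RGB face crops
        h = self.features(clips).flatten(1)
        return self.classifier(h)             # logits: speaking / not speaking

model = SpeakingClipCNN()
dummy = torch.randn(4, 3, 16, 64, 64)         # 4 clips of 16 frames, 64x64 RGB
print(model(dummy).shape)                     # torch.Size([4, 2])
```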
This paper presents the first results of the PIA "Grands Défis du Numérique" research project LinTO. The goal of this project is to develop a conversational assistant to help a company's employees, particularly during meetings. LinTO is an interactive device equipped with microphones, a screen and a 360° camera, which allows users to control the room and query the company's information system, helps facilitate the meeting, and provides an environment to aid minute writing. Distributed under an open model that respects private data, LinTO is the first open-source enterprise assistant designed to comply with GDPR requirements.
Ambient intelligence raises, among other issues, the problem of detecting human activities; what is at stake is, for example, automatic energy management as well as the analysis of interactions between the users sharing a space. To characterize the interactions between individuals, or between an individual and a building's infrastructure, a task of re-identifying the users of the space as they move around is necessary, and the use of multimodal models clearly makes this re-identification more robust. In this article, we propose an audio-visual fusion method, introducing a new confidence index over audio-video saliency zones, for learning a person's audiovisual signature. Keywords: audiovisual signature, audio-video fusion, person re-identification. Abstract: In intelligent environments, activity detection is a necessary pre-processing step for adaptive energy management and ...
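The confidence index over saliency zones is the paper's contribution and is not detailed in this abstract; the sketch below only illustrates the generic shape of confidence-weighted late fusion for building and matching an audiovisual signature, assuming Python/NumPy with made-up embeddings and confidence values.

```python
# Hedged sketch of confidence-weighted late fusion for an audiovisual
# signature. The paper's actual confidence index is not specified here;
# this uses a generic per-modality confidence in [0, 1].
import numpy as np

def fuse_signature(audio_emb: np.ndarray, video_emb: np.ndarray,
                   c_audio: float, c_video: float) -> np.ndarray:
    """Weight each L2-normalized modality embedding by its confidence."""
    a = audio_emb / np.linalg.norm(audio_emb)
    v = video_emb / np.linalg.norm(video_emb)
    w_a = c_audio / (c_audio + c_video + 1e-9)
    fused = w_a * a + (1.0 - w_a) * v
    return fused / np.linalg.norm(fused)

def reidentify(query: np.ndarray, gallery: dict[str, np.ndarray]) -> str:
    """Return the enrolled identity whose signature is closest (cosine)."""
    return max(gallery, key=lambda pid: float(query @ gallery[pid]))

# Toy usage: one enrolled person, one query.
rng = np.random.default_rng(0)
gallery = {"person_1": fuse_signature(rng.normal(size=128),
                                      rng.normal(size=128), 0.9, 0.6)}
query = fuse_signature(rng.normal(size=128), rng.normal(size=128), 0.4, 0.8)
print(reidentify(query, gallery))
```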
The comprehensibility of audiovisual documents can depend on factors specific to the listener/viewer (e.g. native language, cognitive performance) and on factors specific to the documents' content (e.g. linguistic complexity, speech intelligibility). In this work, we study the effects of content-specific factors on the comprehensibility of 55 dialogues extracted from films, presented to 15 experts (teachers of French as a foreign language) under five different modalities (transcript, transcript + audio, audio, audio + video, transcript + audio + video). The experts rated the dialogues in terms of overall comprehensibility, vocabulary complexity, grammatical complexity, and speech intelligibility. The analysis of their ratings shows that (1) vocabulary complexity, grammatical complexity, and speech intelligibility are significantly correlated with overall comprehensibility, and (2) that the ratings of...
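As a toy illustration of this kind of analysis: the abstract does not name the statistical test, so the sketch assumes Spearman's rank correlation (a common choice for ordinal expert ratings), Python with SciPy, and synthetic ratings in place of the paper's data.

```python
# Hedged sketch: correlating content factors with overall comprehensibility.
# Spearman and the 7-point toy scale are assumptions, not the paper's setup.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(1)
n_dialogues = 55
comprehensibility = rng.integers(1, 8, size=n_dialogues)   # e.g. 7-point scale
factors = {
    "vocabulary_complexity": rng.integers(1, 8, size=n_dialogues),
    "grammatical_complexity": rng.integers(1, 8, size=n_dialogues),
    "speech_intelligibility": rng.integers(1, 8, size=n_dialogues),
}
for name, ratings in factors.items():
    rho, p = spearmanr(ratings, comprehensibility)
    print(f"{name}: rho={rho:+.2f}, p={p:.3f}")
```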
In order to interact with humans and their environment, a service robot must be able to perceive visual and audio information about the scene it observes or takes part in. In particular, it must be able to spot salient elements in the various captured signals: spatial localization in an image, or temporal localization in an audio stream. The data-hungry nature of so-called deep learning methods, and the considerable cost of data annotation, argue for the use of semi-supervised methods, capable on the one hand of extracting information in a supervised manner, and on the other hand of predicting the spatial or temporal organization of the events present in the processed signal. In the vision domain, this concept has been used several times to perform spatial localization of objects or activities in images [1, 2, 3] from raw 2D signals (pixels). At the audio level, the trend of doing away with repr...
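One concrete instance of this weak-supervision idea in vision is a classifier trained only on image-level labels whose final convolutional map doubles as a localization map (the class-activation-map family of approaches; whether this matches the works cited as [1, 2, 3] is an assumption). A minimal sketch, assuming PyTorch and illustrative layer sizes:

```python
# Hedged sketch of weakly supervised spatial localization: the network is
# trained only with image-level labels, yet its per-location class scores
# can be read off as a localization map. Sizes are illustrative.
import torch
import torch.nn as nn

class WeaklySupLocalizer(nn.Module):
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
        )
        self.head = nn.Conv2d(64, num_classes, kernel_size=1)  # per-location scores

    def forward(self, images: torch.Tensor):
        cam = self.head(self.backbone(images))   # (B, classes, H, W) activation map
        logits = cam.mean(dim=(2, 3))            # global average pooling -> image label
        return logits, cam                       # train on logits, localize with cam

model = WeaklySupLocalizer()
logits, cam = model(torch.randn(1, 3, 64, 64))
print(logits.shape, cam.shape)   # torch.Size([1, 10]) torch.Size([1, 10, 64, 64])
```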
The MediaEval 2012 Genre Tagging Task is a follow-up to the MediaEval 2011 Genre Tagging Task and the MediaEval 2010 Wild Wild Web Tagging Task, designed to test and evaluate retrieval techniques for video content as it occurs on the Internet, i.e., for semi-professional user-generated content that is associated with annotations existing on the Social Web. The task uses the MediaEval 2012 Tagging Task (ME12TT) dataset, which is based on the whole blip10,000 collection, in contrast to the MediaEval 2010 Wild Wild Web (ME10WWW) set used in previous tasks. In this task overview paper, we describe the principal characteristics of the dataset, the task itself, and the evaluation metrics used to assess the participants' results.
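The abstract mentions evaluation metrics without restating them; mean average precision (MAP) is the customary metric for ranked retrieval tasks of this kind, so the sketch below assumes it. The function names and toy data are illustrative, not the official scoring tool.

```python
# Hedged sketch of MAP over per-genre ranked lists. Whether MAP was the
# exact official metric is an assumption; the abstract does not say.
def average_precision(ranked_ids: list[str], relevant: set[str]) -> float:
    """AP for one query: mean of precision@k at each relevant hit."""
    hits, precisions = 0, []
    for k, vid in enumerate(ranked_ids, start=1):
        if vid in relevant:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / max(len(relevant), 1)

def mean_average_precision(runs: dict[str, list[str]],
                           truth: dict[str, set[str]]) -> float:
    return sum(average_precision(runs[g], truth[g]) for g in truth) / len(truth)

# Toy usage: two "genres" with small ranked lists.
runs = {"sports": ["v1", "v3", "v2"], "news": ["v2", "v1"]}
truth = {"sports": {"v1", "v2"}, "news": {"v9"}}
print(mean_average_precision(runs, truth))  # ~0.42
```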
2021 International Conference on Content-Based Multimedia Indexing (CBMI), 2021
Meetings are a common activity in professional contexts, yet they remain difficult to analyze because they are not always structured and people cut each other off (in a debate of ideas, for example). A first step towards facilitating their analysis is to segment the meeting into zones that are homogeneous at the interaction level. To do so, we studied the typology of the non-speech segments (pauses and silences) in order to identify the different sequences within a meeting. Indeed, information such as the frequency and length of the non-speech segments will differ between a presentation and a debate. In this article, we propose an original approach to segmenting meetings using only the non-speech segments. We apply Voice Activity Detection (VAD) to find the non-speech segments, from which a set of parameters is extracted to study the typology of silence segments. We then slide a window over the whole meeting and apply an unsupervised approach to each of these windows. We validated our approach using purity and coverage metrics on part of the AMI corpus (38 meetings of about 28 minutes each). This approach is non-invasive and relies only on acoustic information; it does not analyze speech content, since moments containing speech, and potentially sensitive information, are not processed.
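As a rough illustration of this pipeline, here is a minimal sketch assuming Python with scikit-learn: toy VAD output, two silence statistics per window (frequency and mean length, the parameters the abstract mentions), and KMeans as the unsupervised step. The window size, feature set, and choice of clustering algorithm are illustrative assumptions, not the paper's exact settings.

```python
# Hedged sketch: non-speech segments from a VAD -> per-window silence
# statistics -> unsupervised clustering into interaction zones.
import numpy as np
from sklearn.cluster import KMeans

def silence_features(silences: list[tuple[float, float]],
                     t0: float, t1: float) -> list[float]:
    """Frequency and mean length of silences starting in window [t0, t1)."""
    durations = [e - s for s, e in silences if t0 <= s < t1]
    rate = len(durations) / (t1 - t0)
    mean_len = float(np.mean(durations)) if durations else 0.0
    return [rate, mean_len]

# Toy VAD output: (start, end) of non-speech segments, in seconds.
silences = [(1.0, 1.2), (3.0, 3.1), (40.0, 45.0), (50.0, 58.0)]
meeting_len, win = 60.0, 10.0
windows = np.arange(0.0, meeting_len, win)
X = np.array([silence_features(silences, t, t + win) for t in windows])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)  # e.g. presentation-like vs debate-like zones
```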