We present a corpus of time-aligned spoken data of Wikipedia articles as well as the pipeline that makes it possible to generate such corpora for many languages. There are initiatives to create and sustain spoken Wikipedia versions in many languages; hence the data is freely available, grows over time, and can be used for automatic corpus creation. Our pipeline automatically downloads and aligns this data. The resulting German corpus currently totals 293h of audio, of which we align 71h in full sentences and another 86h of sentences with some missing words. The English corpus consists of 287h, of which we align 27h in full sentences and 157h with some missing words. Results are publicly available.
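The corpus statistics above distinguish sentences aligned in full from sentences with some missing words. A minimal sketch of that bookkeeping step, using hypothetical data structures (not the released pipeline's actual code): each sentence is a list of words with their aligned time spans, where `None` marks a word the aligner could not place.

```python
# Classify aligned sentences: a sentence counts as "full" only if every
# word received a time alignment; otherwise it is "partial" (some words
# missing) or "failed" (nothing aligned at all).

def classify_sentence(word_alignments):
    """word_alignments: list of (word, start, end) tuples; start/end are
    None when the aligner could not place the word."""
    aligned = [w for w in word_alignments if w[1] is not None and w[2] is not None]
    if word_alignments and len(aligned) == len(word_alignments):
        return "full"
    if aligned:
        return "partial"
    return "failed"

def corpus_stats(sentences):
    """Tally aligned audio duration (in seconds) per category."""
    stats = {"full": 0.0, "partial": 0.0, "failed": 0.0}
    for sent in sentences:
        spans = [(s, e) for _, s, e in sent if s is not None and e is not None]
        duration = sum(e - s for s, e in spans)
        stats[classify_sentence(sent)] += duration
    return stats
```

Summing such per-category durations over all articles yields hour totals like the 71h/86h split reported for German.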
Companion of the 2021 ACM/IEEE International Conference on Human-Robot Interaction, 2021
In this paper, we present a study in which a robot initiates interactions with people passing by in an in-the-wild scenario. The robot dynamically adapts the loudness of its voice to the distance of the person approached, thus indicating whom it is talking to. It furthermore tracks people based on body orientation and eye gaze, and autonomously adapts the text produced based on people's distance. Our study shows that the adaptation of the loudness of its voice is perceived as personalization by the participants, and that the likelihood that they stop by and interact with the robot increases when the robot incrementally adjusts its behavior.
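The abstract does not state which distance-to-loudness mapping the robot uses. One common baseline, sketched here purely as an assumption, is the free-field inverse-square law: the level arriving at a listener drops by roughly 6 dB per doubling of distance, so keeping the perceived level constant means raising the output gain by 20·log10(d/d_ref).

```python
import math

def output_gain_db(distance_m, ref_distance_m=1.0):
    """Gain (in dB) to add to a reference output level so that the level
    arriving at a listener at distance_m roughly matches what a listener
    at ref_distance_m would hear, assuming free-field inverse-square
    attenuation (~6 dB per doubling of distance). Hypothetical helper,
    not the paper's implementation."""
    if distance_m <= 0 or ref_distance_m <= 0:
        raise ValueError("distances must be positive")
    return 20.0 * math.log10(distance_m / ref_distance_m)
```

Under this assumption, addressing a person 2 m away calls for about +6 dB relative to the 1 m reference, and a person closer than the reference gets a negative gain, i.e. a quieter voice.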
10th International Conference on Speech Prosody 2020, 2020
Listeners typically provide feedback while listening to a speaker in conversation and thereby engage in the co-construction of the interaction. We analyze the influence of the listener on the speaker by investigating how the listener's verbal feedback signals help in modeling the speaker's language. We find that feedback from the listener may help in modeling the speaker's language, whether through the listener's feedback as transcribed or through the acoustic signal directly. We find the largest positive effects at the ends of sentences as well as at mid-utterance pauses, but also effects indicating that we successfully model elaborations of ongoing utterances that may result from the presence or absence of listener feedback.
Robots should appropriately give reasons for their actions when these actions affect a human's action or goal space. Communicating reasons may help the human understand the robot's intents and may initiate joint action, i.e., accepting the robot's goals and cooperating on the robot's actions. However, to be efficient, the communication of reasons should be limited to what is necessary rather than aiming for completeness, conforming to the Gricean Maxim of Quantity. Furthermore, what is necessary only becomes apparent as the situation evolves; hence, for seamless interaction, ongoing utterances must be adapted as they happen. We present a system that flexibly gives reasons in a reduced setting in which the robot needs to intrude into a human's personal space in order to reach its goal.
Abstract of paper 0582 presented at the Digital Humanities Conference 2019 (DH2019), Utrecht, the Netherlands, 9–12 July 2019.
Current-day spoken dialogue systems are tedious to interact with (Ward et al. 2005). Their naturalness and (measurable) quality of interaction can be improved through incremental (step-by-step) processing schemes that enable dialogue systems to interact continuously (Baumann 2013). However, incremental models have not yet adequately addressed the challenge of joint decision making and optimization of hypotheses across the multitude of components within a modularized system in real time, mostly because their data flows follow simple pipeline approaches. Ad-hoc integration of modules fails completely for distributed systems, which are preferred in robotics, for research systems, and in mobile applications. This shortcoming prevents incremental spoken dialogue systems from leveraging their full potential. This project proposes to design and implement an architecture for concurrent, distributed incremental processing and knowledge representation for spoken dialogue in which components share their un...
Spoken dialogue systems benefit from incremental processing. Incremental spoken dialogue systems (in which processing at all levels already begins while the input is still being produced) reach the goal faster and are rated better by users than non-incremental spoken dialogue systems (in which speech recognition and the subsequent modules only begin processing once the input is complete, or once the preceding module has finished its processing). [1],[14] In my doctoral research, I work on incrementalizing speech recognition and understanding for spoken dialogue systems. It has been shown many times that prosody can contribute, among other things, to speech recognition (ASR) [2], disfluency detection [13], end-of-turn detection [9] and prediction [4], and parsing [11]. Most of these applications, however, are carried out offline at experimental scale, i.e., the prosodic features and categories are computed in separate runs...
Our paper focuses on the computational analysis of "readout poetry" (German: Hördichtung), recordings of poets reading their own work, with regard to the most important type of this genre, modern "sound poetry" (German: Lautdichtung). Whereas "readout poetry" often uses normal words and sentences, "sound poetry", developed by Dadaist poets such as Hugo Ball and Kurt Schwitters or concrete poets such as Ernst Jandl, Oskar Pastior, or Bob Cobbing, combines the "microparticles of the human voice", like the segments in Ernst Jandl's sound poem "schtzngrmm" ("schtzngrmm / schtzngrmm / tttt / tttt / grrrmmmmm / tttt / sch / tzngrmm"). Within the genre of sound poetry, there are two main forms: lettristic and syllabic decomposition. A short anecdote illustrates the difference: the Dadaist Raoul Hausmann developed lettristic sound poetry in his early Dadaist poem "fmsbw" from 1918. This is said to have inspired his successor Schwitters, whose famous "Ursonate" [The Sona...
Proceedings of the Conference on Mensch und Computer, 2020
We analyze the addressee detection task for complexity-identical dialog, for both human conversation and device-directed speech. Our recurrent neural model performs at least as well as humans, who have problems with this task, including native speakers, who benefit from the relevant linguistic skills. We perform ablation experiments on the features used by our model and show that fundamental frequency variation is the single most relevant feature class. We therefore conclude that future systems can detect whether they are addressed based only on speech prosody, which does not (or only to a very limited extent) reveal the content of conversations not intended for the system.
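The ablation result singles out fundamental frequency (F0) variation as the most informative feature class. As a hedged illustration of what such a feature could look like (the paper's actual feature extraction is not shown here), one can summarize an utterance's pitch track by the spread of its voiced frames; pitch trackers commonly emit 0.0 for unvoiced frames, which must be excluded first.

```python
import statistics

def f0_variation(f0_track):
    """Summarize fundamental-frequency variation over an utterance.
    f0_track: per-frame F0 values in Hz, with 0.0 (or None) marking
    unvoiced frames. Returns (mean, stdev) over voiced frames only.
    Hypothetical feature sketch, not the paper's implementation."""
    voiced = [f for f in f0_track if f]  # drop unvoiced (0.0/None) frames
    if len(voiced) < 2:
        return (voiced[0] if voiced else 0.0, 0.0)
    return (statistics.mean(voiced), statistics.stdev(voiced))
```

A monotone utterance then yields a near-zero standard deviation, while a prosodically lively one yields a large value; a classifier can consume such summaries per utterance.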