This research proposes an approach to decision-making under uncertainty for designing a dialogue manager for an Amharic spoken dialogue system, Amharic being an under-resourced language. A prototype Amharic Spoken Dialogue System was implemented in the hotel and restaurant address information domain for experimentation. Data for this research was collected through the Wizard of Oz method, and domain knowledge was prepared using address search websites. The fundamental question of this research is how to design a dialogue manager that remains robust for languages with low-performing Automatic Speech Recognition. Previous studies on designing spoken dialogue systems for under-resourced languages focused mainly on the Automatic Speech Recognizer. We reviewed methods and frameworks for dialogue management. A dialogue manager based on Partially Observable Markov Decision Processes (POMDP), which provide a principled framework for planning under uncertainty, yields robustness. Low-performing Spoken Dialogue System (SDS) components, especially the Automatic Speech Recognizer (ASR), were considered the causes of uncertainty. Maintaining multiple hypotheses (evidence) improves the correctness of the dialogue manager. We conducted experiments to measure correctness score, error rate, and robustness. With a maximum of 6 N-best hypotheses and 20 partitions, the correctness score grew by 14.19%, and increasing the N-best list to 6 reduced the error rate by 5.78%. The belief updated in under 0.12 seconds with 20 partitions and a 6-item N-best list, and the dialogue manager was able to complete a task in an average of 8.75 turns at a 50% Word Error Rate. The findings illustrate that the POMDP-based design approach to dialogue management is robust and makes it possible to develop an improving spoken dialogue system for under-resourced languages.
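The abstract's core mechanism — updating a belief over partitions of user goals from an N-best ASR list — can be sketched as follows. This is a minimal illustration assuming a static user goal per dialogue; the goal names, the `obs_model` helper, and the confidence weighting are hypothetical, not the paper's actual model.

```python
# Minimal sketch of a POMDP-style belief update over user-goal partitions,
# using an N-best ASR list as the observation (all parameters hypothetical).

def update_belief(belief, nbest, obs_model):
    """belief: dict goal -> probability; nbest: list of (hypothesis, asr_confidence);
    obs_model(hyp, goal) -> P(hyp | goal), a hypothetical observation model."""
    new_belief = {}
    for goal, prior in belief.items():
        # Marginalize the observation likelihood over the N-best hypotheses,
        # weighting each hypothesis by its ASR confidence score.
        likelihood = sum(conf * obs_model(hyp, goal) for hyp, conf in nbest)
        new_belief[goal] = prior * likelihood
    total = sum(new_belief.values()) or 1.0
    return {g: p / total for g, p in new_belief.items()}
```

Maintaining the full N-best list (rather than only the top hypothesis) is what lets the belief recover when the ASR's first choice is wrong, which is the robustness property the abstract reports.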
This article addresses synchronous acquisition of high-speed multimodal speech data, composed of ultrasound and optical images of the vocal tract together with the acoustic speech signal, for a silent speech interface. Built around a laptop-based portable ultrasound machine (Terason T3000) and an industrial camera, an acquisition setup is described together with its acquisition software called Ultraspeech. The system is currently able to record ultrasound images at 70 fps and optical images at 60 fps, synchronously with the acoustic signal. An interactive inter-session re-calibration mechanism which allows recording of large audiovisual speech databases in multiple acquisition sessions is also described.
Virtual reality has sometimes been thought of as embodying a return to a 'natural' way of interacting by direct manipulation of objects in a world. However, in the everyday world we also act through language: speaking is a 'natural' way of communicating our goals to others and effecting changes in the world. In this paper, we discuss technical and design issues which need to be addressed in order to combine a direct manipulation interface to virtual reality with a speech interface. We then describe a prototype system based on intelligent agents which provide specialised functions in the virtual world. The agents have simple dialogue capabilities allowing users to control them directly with speech.
1. Introduction
A direct manipulation interface to a virtual world can be augmented with a spoken language interface so that users can give spoken commands to manipulate objects. In order to develop such a multimodal interface, a number of technical and design issues need to ...
DelosDLMS is a prototype of a next-generation Digital Library (DL) management system. It is the result of integrating various specialized DL services provided by partners of the DELOS network of excellence into the OSIRIS platform. OSIRIS is a middleware environment for the reliable and scalable distributed execution of processes. Processes, in turn, are DL applications that are built from the specialized services available in the integrated system. DelosDLMS provides support for content-based retrieval in image, ...
This paper presents a proposal for the development of laboratory assignments on the design and implementation of advanced interfaces for mobile robots using speech recognition. The main objective of these assignments is to analyse the possibilities of using speech interfaces as a complement to other interaction components of a mobile robot, such as artificial vision. The paper also
This article presents a framework for a phonetic vocoder driven by ultrasound and optical images of the tongue and lips for a “silent speech interface” application. The system is built around an HMM-based visual phone recognition step which provides target phonetic sequences from a continuous visual observation stream. The phonetic target constrains the search for the optimal sequence of diphones that maximizes similarity to the input test data in visual space, subject to a unit concatenation cost in the acoustic domain. The final speech waveform is generated using “Harmonic plus Noise Model” synthesis techniques. Experimental results are based on a one-hour continuous-speech audiovisual database comprising ultrasound images of the tongue and both frontal and lateral views of the speaker’s lips.
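The search the abstract describes — picking one unit per target position while trading a target (similarity) cost against a concatenation cost — is the classic unit-selection dynamic program. Below is a hedged sketch under simplified assumptions: discrete candidate units, abstract cost functions, and none of the paper's actual visual or acoustic features.

```python
# Sketch of unit-selection search by dynamic programming: at each target
# position, choose among candidate units, minimizing the sum of a target
# cost (similarity to the target spec) and a concatenation cost between
# adjacent units. All costs and unit types here are hypothetical.

def select_units(targets, candidates, target_cost, concat_cost):
    """targets: list of target specs; candidates: list (per position) of unit lists;
    target_cost(spec, unit) and concat_cost(prev_unit, unit) return floats."""
    # best[u] = (accumulated cost, path) for sequences ending in unit u
    best = {u: (target_cost(targets[0], u), [u]) for u in candidates[0]}
    for i in range(1, len(targets)):
        new_best = {}
        for u in candidates[i]:
            cost, path = min(
                (prev_cost + concat_cost(prev_u, u), prev_path)
                for prev_u, (prev_cost, prev_path) in best.items()
            )
            new_best[u] = (cost + target_cost(targets[i], u), path + [u])
        best = new_best
    return min(best.values())[1]  # lowest-cost unit sequence
```

In the paper's setting the target cost would be computed in visual space and the concatenation cost in the acoustic domain, but the dynamic-programming skeleton is the same.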
The paper describes a new architecture for accessing hyperlinked speech-accessible knowledge sources that are distributed over the Internet. The architecture, LRRP SpeechWeb, uses Local thin-client application-specific speech Recognition and Remote natural-language query Processing. Users navigate an LRRP SpeechWeb using voice-activated hyperlink commands, and query the knowledge sources through spoken natural-language using a speech browser executing on a local device (Frost, R.A. and Chitte, S., Proc. PACLING '99, Conf. of Pacific Association for Computational Linguistics, p.82-90, 1999). It differs from the use of speech interfaces to conventional Web HTML pages, from conventional telephone access to remote speech applications (as used in many call centers), and from the use of a network of hyperlinked VXML pages. The architecture is ideally suited for use when cell-phones become available with built-in speech-to-text and text-to-speech capabilities.
Exploiting a tissue-conductive sensor – a stethoscopic microphone – the system developed at NAIST, which converts non-audible murmur (NAM) to audible speech by GMM-based statistical mapping, is a very promising technique. The quality of the converted speech is, however, still insufficient for computer-mediated communication, notably because of the poor estimation of F0 from unvoiced speech and because of impoverished phonetic
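GMM-based statistical mapping of the kind mentioned here is commonly realized as a conditional expectation: each mixture component contributes a linear regression from source to target features, weighted by its posterior. The following toy 1-D sketch illustrates that form; the parameter values and the scalar feature space are hypothetical, not those of the NAIST system.

```python
import math

# Toy 1-D sketch of GMM-based feature mapping via conditional expectation.
# Each component holds: weight w, source mean/variance mu_x/var_x,
# target mean mu_y, and source-target covariance cov_xy (all hypothetical).

def gaussian(x, mu, var):
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)

def gmm_map(x, components):
    # Posterior responsibility of each mixture component for source frame x.
    posts = [c["w"] * gaussian(x, c["mu_x"], c["var_x"]) for c in components]
    total = sum(posts)
    posts = [p / total for p in posts]
    # Converted feature: posterior-weighted sum of per-component linear
    # regressions from source space to target space.
    return sum(p * (c["mu_y"] + c["cov_xy"] / c["var_x"] * (x - c["mu_x"]))
               for p, c in zip(posts, components))
```

One limitation visible even in this sketch is that the mapping only transforms spectral-style features frame by frame; it offers no mechanism for recovering F0 from unvoiced murmur, which is exactly the weakness the abstract points to.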
We studied, through a telephone survey, the opinions and ideas of 1004 Finns concerning domestic technology and controlling it with voice commands. There is distrust towards commanding one's home environment by voice; users especially doubt its functionality. Voice commands are regarded as unpleasant. The most positively perceived aspects of voice commands are the speed of control and the ability to free one's hands for something else. Voice feedback, however, is highly appreciated, especially as a replacement for alarm beeps and blinking lights. It is possible that users first need to get accustomed to a device speaking to them. Finally, our studies show major changes in user attitudes when they actually use speech applications. Index Terms: smart home, speech interface, voice command, consumer survey
Speech-based user interfaces are growing in popularity. Unfortunately, the technology expertise required to build speech UIs precludes many individuals from participating in the speech interface design process. Furthermore, the time and knowledge costs of building even simple speech systems make it difficult for designers to iteratively design speech UIs. SUEDE, the speech interface prototyping tool we describe in this paper, allows designers to rapidly create prompt/response speech interfaces. It offers an electronically supported Wizard of Oz (WOz) technique that captures test data, allowing designers to analyze the interface after testing. This informal tool enables speech user interface designers, even non-experts, to quickly create, test, and analyze speech user interface prototypes.
The auditory formation of visually oriented documents is a process that enables the delivery of a more representative acoustic image of documents via speech interfaces. We have set up an experimental environment for conducting a series of complex psycho-acoustic experiments to evaluate users' performance in recognizing synthesized auditory components that represent visual structures. For our purposes, we exploit an open, XML-based platform that drives a Voice Browser and transforms documents' visual meta-information into speech synthesis markup formalism. A user-friendly graphical interface allows the investigator to build acoustic variants of documents. First, a source de-compilation method builds a logical layer that abstractly classifies visual meta-information. Then, the investigator can freely define distinctive sound fonts by assigning combined prosodic parameters and non-speech audio sounds to the logical elements. Four blind and four sighted student subjects were asked...
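The pipeline described above — classify visual meta-information into logical elements, then assign each element a "sound font" of prosodic parameters — can be sketched minimally as a lookup that emits SSML-style markup. The element-to-prosody table and attribute values below are hypothetical placeholders, not the paper's actual mappings.

```python
# Toy sketch of mapping logical document elements to speech-synthesis
# markup (SSML-style prosody), mimicking the "sound font" idea; the
# element names and prosodic settings are hypothetical.

SOUND_FONTS = {
    "h1": {"rate": "slow", "pitch": "+15%"},   # top-level heading
    "h2": {"rate": "slow", "pitch": "+8%"},    # sub-heading
    "em": {"rate": "medium", "pitch": "+5%"},  # emphasized text
}

def to_ssml(element, text):
    style = SOUND_FONTS.get(element)
    if style is None:
        return text  # no distinctive sound font: render with plain prosody
    return ('<prosody rate="{rate}" pitch="{pitch}">{text}</prosody>'
            .format(text=text, **style))
```

In the paper's setup the investigator would define such mappings interactively through the graphical interface, possibly adding non-speech audio cues alongside the prosodic settings.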