This paper presents methodologies for text normalization and diphone preparation for Bangla Text to Speech (TTS) synthesis. A concatenation-based TTS system basically comprises two modules: natural language processing and Digital Signal Processing (DSP). Natural language processing deals with converting text to its pronounceable form, called text normalization, and the diphone selection method based on the normalized text is called Grapheme to Phoneme (G2P) conversion. Text normalization issues addressed in this paper include tokenization, conjuncts, null-modified characters, numerical words, abbreviations and acronyms. Issues related to diphone preparation include diphone categorization, corpus preparation, diphone labeling and diphone selection. Appropriate rules and algorithms are proposed to tackle all of these issues. We developed a speech synthesizer for Bangla using a diphone-based concatenative approach, which is demonstrated to produce natural-sounding synthetic speech.
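As an illustration of the kind of rule such a normalizer needs (not the paper's actual algorithm), the sketch below expands Bangla digit strings into spoken word forms before G2P conversion; the digit-to-word table is illustrative and covers only digit-by-digit reading.

```python
# Minimal sketch of numeral expansion for text normalization (illustrative only;
# the paper's own rules also cover conjuncts, abbreviations, acronyms, etc.).
import re

# Illustrative mapping from Bangla digits to their spoken word forms.
BANGLA_DIGIT_WORDS = {
    "০": "শূন্য", "১": "এক", "২": "দুই", "৩": "তিন", "৪": "চার",
    "৫": "পাঁচ", "৬": "ছয়", "৭": "সাত", "৮": "আট", "৯": "নয়",
}

DIGIT_RUN = re.compile("[০-৯]+")

def expand_digits(match: re.Match) -> str:
    # Digit-by-digit reading; a full normalizer would apply place-value rules.
    return " ".join(BANGLA_DIGIT_WORDS[d] for d in match.group(0))

def normalize(text: str) -> str:
    return DIGIT_RUN.sub(expand_digits, text)

if __name__ == "__main__":
    print(normalize("রুম ৩০২"))  # digits replaced by their word forms
```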
This paper introduces a new notice system that does not require reaching the display or pinning or pasting papers anywhere. The system consists of a voice-alert notice board built on a single-board computer, the Raspberry Pi, which includes a quad-core ARM processor from Broadcom. The entire development therefore runs on a Linux-based operating system, with the Raspberry Pi selected as the hardware module. The new system provides a text-to-voice feature, and messages can also be sent to it remotely through email.
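A minimal sketch of the email-to-speech idea described above, assuming an IMAP mailbox and the espeak command-line synthesizer installed on the Raspberry Pi; the host name, credentials and polling logic are placeholders, not the paper's implementation.

```python
# Minimal sketch: poll an IMAP inbox and speak new notices aloud with espeak.
# Host, credentials and the spoken phrasing are placeholders.
import email
import imaplib
import subprocess

IMAP_HOST = "imap.example.org"                 # placeholder
USER, PASSWORD = "notice-board", "change-me"   # placeholders

def fetch_unseen_notices():
    with imaplib.IMAP4_SSL(IMAP_HOST) as imap:
        imap.login(USER, PASSWORD)
        imap.select("INBOX")
        _, data = imap.search(None, "UNSEEN")
        for num in data[0].split():
            _, msg_data = imap.fetch(num, "(RFC822)")
            msg = email.message_from_bytes(msg_data[0][1])
            yield msg.get("Subject", "(no subject)")

def speak(text: str) -> None:
    # espeak is one of several TTS engines available on Raspberry Pi OS.
    subprocess.run(["espeak", text], check=False)

if __name__ == "__main__":
    for subject in fetch_unseen_notices():
        speak("New notice: " + subject)
```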
This paper gives updates on the development of two studies on grapheme-to-phoneme (G2P) conversion for the Khmer language. Their methodology, datasets and results are carefully reviewed and compared in order to evaluate which system performs better and why.
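One common way to make such a comparison concrete is phoneme error rate (PER); the sketch below computes PER as edit distance between predicted and reference phoneme sequences, normalized by reference length (a generic metric, not taken from either reviewed study).

```python
# Phoneme error rate (PER) as Levenshtein distance over phoneme sequences,
# normalized by the reference length; a generic metric for comparing G2P systems.
def edit_distance(a, b):
    dp = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, y in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (x != y))
    return dp[len(b)]

def phoneme_error_rate(predicted, reference):
    errors = sum(edit_distance(p, r) for p, r in zip(predicted, reference))
    total = sum(len(r) for r in reference)
    return errors / total

if __name__ == "__main__":
    ref = [["k", "m", "ae", "e"]]   # illustrative reference phoneme sequence
    hyp = [["k", "m", "a", "e"]]    # one substitution -> PER = 0.25
    print(phoneme_error_rate(hyp, ref))
```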
In this study, the framework of a concatenative text-to-speech system for Turkish is built, and its evaluation techniques, namely MOS, DRT and CT, are considered. The naturalness and intelligibility of the Turkish TTS system are tested by MOS and CT-DRT, respectively. Although the system uses simple techniques, it provides promising results for Turkish TTS, since the selected concatenative method is well suited to the structure of the Turkish language.
An unlimited-vocabulary text-to-speech engine has been developed which currently handles both Tamil and Kannada Unicode text. The input text is processed by a grapheme-to-phoneme converter module, which uses language-specific pronunciation rules to convert the text into an unambiguous phonetic representation. This representation is then parsed into demisyllable-like basic units. The occurrences of these basic units are searched for in a phonetically rich spoken database, which is segmented and annotated at the phone level. A unit selection algorithm then selects the best combination of the available speech units, which are concatenated to synthesize the speech and written out in .wav format.
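A minimal sketch of the two front-end steps described above, under the assumption that pronunciation rules can be applied as ordered string rewrites and that the unit inventory can be queried as a simple set; the rules and units shown are illustrative placeholders, not the engine's actual tables.

```python
# Sketch: ordered rewrite rules for grapheme-to-phoneme conversion, followed by
# greedy longest-match parsing of the phone string into available basic units.
# Rules and unit inventory are illustrative placeholders.
G2P_RULES = [("ksh", "k-sh"), ("aa", "A"), ("th", "T")]   # applied in order
UNIT_INVENTORY = {"kA", "k", "A", "T", "Ti", "i"}          # units present in the database

def graphemes_to_phones(text: str) -> str:
    for pattern, replacement in G2P_RULES:
        text = text.replace(pattern, replacement)
    return text

def parse_into_units(phones: str):
    units, i = [], 0
    while i < len(phones):
        # Prefer the longest unit that exists in the recorded inventory.
        for length in range(len(phones) - i, 0, -1):
            candidate = phones[i:i + length]
            if candidate in UNIT_INVENTORY:
                units.append(candidate)
                i += length
                break
        else:
            i += 1   # no unit covers this phone; a real system would back off
    return units

if __name__ == "__main__":
    print(parse_into_units(graphemes_to_phones("kaaTi")))   # ['kA', 'Ti']
```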
As part of a general framework for the development of global information systems, we include support for the development of aural interfaces. The framework uses an object-oriented database for the management of application, document content and presentation data. The access layer is based around an XML server and XSLT for document generation from default and customised templates. Specifically, aural interfaces are supported through a VoiceXML server that provides the speech recognition and synthesis mechanisms, together with XSLT templates for the generation of VoiceXML. In this paper, we describe the implementation of a generic voice browser for application databases as well as the development of a customised aural interface for a community diary managing appointments and events.
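To make the XSLT-to-VoiceXML step concrete, here is a minimal sketch using lxml: a toy XML document of events is turned into a VoiceXML prompt by a stylesheet. The document structure and template are invented for illustration and are not the framework's own templates.

```python
# Minimal sketch: transform an XML fragment into VoiceXML with an XSLT template
# (illustrative document and stylesheet, not the framework's actual templates).
from lxml import etree

EVENTS = etree.XML("""
<diary>
  <event><title>Village fair</title><date>Saturday</date></event>
  <event><title>Choir practice</title><date>Tuesday</date></event>
</diary>
""")

STYLESHEET = etree.XML("""
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/diary">
    <vxml version="2.0">
      <form>
        <block>
          <prompt>
            <xsl:for-each select="event">
              <xsl:value-of select="title"/> on <xsl:value-of select="date"/>.
            </xsl:for-each>
          </prompt>
        </block>
      </form>
    </vxml>
  </xsl:template>
</xsl:stylesheet>
""")

if __name__ == "__main__":
    transform = etree.XSLT(STYLESHEET)
    print(etree.tostring(transform(EVENTS), pretty_print=True).decode())
```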
In this paper, we present a Text to Speech (TTS) synthesis system for the Bangla language using the open-source Festival TTS engine. Festival is a complete TTS synthesis system, with components supporting front-end processing of the input text, language modeling, and speech synthesis using its signal processing module. The Bangla TTS system proposed here creates the voice data for Festival and additionally extends Festival through its embedded Scheme scripting interface to incorporate Bangla language support. Festival is a concatenative TTS system using diphone or other unit-selection speech units. Our TTS implementation uses two of the concatenative methods supported in Festival: unit selection and multisyn unit selection. The modules of such a TTS system are described in this paper, followed by an evaluation of the quality of the synthesized speech for acceptability and intelligibility.
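For readers unfamiliar with Festival's tooling, one typical way to drive a Festival voice from outside Scheme is its text2wave script; the sketch below wraps it with subprocess. The voice name used here is a standard English Festival voice given as a placeholder, not the name of the Bangla voice built in the paper.

```python
# Minimal sketch: synthesize a text file to a WAV using Festival's text2wave
# script. The voice name below is a placeholder; substitute the installed voice.
import subprocess
from pathlib import Path

def synthesize(text: str, wav_path: str, voice: str = "voice_kal_diphone") -> None:
    txt = Path("utterance.txt")
    txt.write_text(text, encoding="utf-8")
    subprocess.run(
        ["text2wave", str(txt), "-o", wav_path, "-eval", f"({voice})"],
        check=True,
    )

if __name__ == "__main__":
    synthesize("This is a test sentence.", "out.wav")
```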
All areas of language and speech technology, directly or indirectly, require handling of real (unrestricted) text. For example, text-to-speech systems need to work on real text directly, whereas automatic speech recognition systems depend on language models that are trained on text. This paper reports our ongoing effort on Hindi text normalization. We describe a novel approach to text normalization in which tokenization and initial token classification are combined into one stage, followed by a second level of token sense disambiguation. Tokenization and initial token classification are performed using a lexical analyser derived from various token definitions in the form of regular expressions. For the second level of token sense disambiguation, the application of decision lists and decision trees is explored. Token-to-word rules are then applied, which are specific to each token type and also to each format within a token type.
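A minimal sketch of the combined tokenization and initial-classification idea, assuming token classes can be written as ordered regular expressions; the patterns are illustrative and far simpler than the lexical analyser described in the paper.

```python
# Sketch: one-pass tokenization plus initial token classification from ordered
# regular-expression definitions (illustrative patterns only).
import re

TOKEN_DEFINITIONS = [
    ("TIME",   r"\d{1,2}:\d{2}"),
    ("NUMBER", r"\d+(?:\.\d+)?"),
    ("ABBREV", r"(?:[A-Za-z]\.){2,}"),
    ("WORD",   r"\w+"),
    ("PUNCT",  r"[^\w\s]"),
]
LEXER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_DEFINITIONS))

def tokenize(text: str):
    # Each match is classified by the first (highest-priority) pattern it satisfies;
    # a later token-sense-disambiguation stage would refine ambiguous classes.
    return [(m.lastgroup, m.group()) for m in LEXER.finditer(text)]

if __name__ == "__main__":
    print(tokenize("Train 4015 leaves at 18:30 i.e. evening."))
```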
In this study, a concatenative text-to-speech system for Turkish is built. The system uses simple techniques, and the concatenation units are obtained from atomic units. This approach is very well suited to the structure of the Turkish language and is flexible enough to allow the synthesis of all types of text. The Turkish TTS system is tested against the criteria of naturalness and intelligibility, and it is evaluated using MOS, DRT and CT: naturalness is tested by MOS and intelligibility by CT-DRT. Although the system uses simple techniques, it provides promising results for Turkish TTS, since the selected concatenative method matches the Turkish language structure well.
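The core of such a concatenative back end is joining the recorded unit waveforms in sequence; below is a minimal sketch that concatenates pre-recorded mono unit files with a short linear cross-fade. The file names, unit list and fade length are placeholders, not the system's actual data.

```python
# Minimal sketch: concatenate recorded unit waveforms with a short cross-fade.
# Assumes mono int16 WAV files; unit file names are placeholders.
import numpy as np
from scipy.io import wavfile

def crossfade_concat(waves, sample_rate, fade_ms=5):
    fade = int(sample_rate * fade_ms / 1000)
    out = waves[0].astype(np.float64)
    for w in waves[1:]:
        w = w.astype(np.float64)
        ramp = np.linspace(0.0, 1.0, fade)
        # Blend the tail of the running output with the head of the next unit.
        out[-fade:] = out[-fade:] * (1.0 - ramp) + w[:fade] * ramp
        out = np.concatenate([out, w[fade:]])
    return out.astype(np.int16)

if __name__ == "__main__":
    rate, units = None, []
    for name in ["units/me.wav", "units/r.wav", "units/ha.wav"]:   # placeholder units
        rate, data = wavfile.read(name)
        units.append(data)
    wavfile.write("word.wav", rate, crossfade_concat(units, rate))
```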
This paper explores the modeling of prosody parameters for improving the naturalness of Tamil speech synthesis, by studying recorded utterances of speech with and without explicit emotions. To begin with, we look at interrogative and exclamatory Tamil sentences. The prosody parameters, namely the pitch contour, energy and duration of each word in the sentences, were observed, analyzed and generalized from un-intonated and intonated interrogative and exclamatory human speech. Differences in energy level between the two sets of utterances were also analyzed in three different frequency bands. Pitch is modified in the LP residual domain using the DCT. Energy is modified by multiplying the signal by the hypothesized factor, and duration is modified as per the duration model by duplicating or removing an integer number of pitch periods as necessary. The model was implemented on speech synthesized from the Thirukkural TTS, developed by MILE LAB, and the results were found to be satisfactory.
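As a rough illustration of the duration and energy manipulations described above (not of the DCT-based pitch modification itself), the sketch below scales energy by a constant factor and lengthens or shortens a voiced segment by duplicating or dropping whole pitch periods delimited by pitch marks; the signal, pitch marks and factors are placeholders.

```python
# Rough sketch of two prosody manipulations: energy scaling by a constant factor,
# and duration change by duplicating or dropping whole pitch periods between
# given pitch marks. (The LP-residual/DCT pitch modification is not reproduced.)
import numpy as np

def scale_energy(signal: np.ndarray, factor: float) -> np.ndarray:
    return signal * factor

def modify_duration(signal: np.ndarray, pitch_marks, duration_factor: float) -> np.ndarray:
    # Split the voiced region into pitch periods using consecutive pitch marks.
    periods = [signal[a:b] for a, b in zip(pitch_marks[:-1], pitch_marks[1:])]
    target = max(1, round(len(periods) * duration_factor))
    # Repeat or drop periods so their count matches the target duration.
    indices = np.linspace(0, len(periods) - 1, target).round().astype(int)
    pieces = [signal[:pitch_marks[0]]] + [periods[i] for i in indices] + [signal[pitch_marks[-1]:]]
    return np.concatenate(pieces)

if __name__ == "__main__":
    t = np.arange(0, 0.2, 1 / 16000.0)
    voiced = np.sin(2 * np.pi * 120 * t)                   # placeholder voiced segment
    marks = list(range(0, len(voiced), 16000 // 120))      # idealized pitch marks
    longer = modify_duration(scale_energy(voiced, 1.3), marks, 1.5)
    print(len(voiced), len(longer))
```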
India is home to about 1600 languages, and less than 5% of its population is English literate. A consortium effort to develop a common framework for building text-to-speech synthesis (TTS) in various Indian languages has already been realized. In this paper, we propose a parallel effort to integrate the TTS systems developed into various applications. These applications are primarily aimed at bringing the marginalized sections of Indian society into the mainstream. The applications integrated include: i) screen readers for the visually challenged on desktop platforms; ii) a web-based plug-in to read highlighted text for the language challenged; iii) integration of TTS with optical character recognition systems; iv) various applications for the Android platform; v) on-line tests and tutorials for the visually challenged; and vi) conversion of gesture to speech for persons with cerebral palsy. The performance of the integrated systems has been extensively tested by visually challenged persons for the last three years. About 180 students have learnt to use the Office Suite, the Internet and e-mail over the last three years.
This article focuses on the systematic design of a segment database which has been used to support a time-domain speech synthesis system for the Greek language. A methodology is presented for the generation of a corpus containing all possible instances of the segments for the specific language. Issues such as phonetic coverage, sentence selection and iterative evaluation techniques employing custom-built tools are examined. Emphasis is placed on the comparison of the process-derived corpus with naturally-occurring corpora with respect to their suitability for use in time-domain speech synthesis. The proposed methodology generates a corpus of near-minimal size which provides complete coverage of the Greek language. Furthermore, within this corpus, the distribution of segmental units is similar to that of natural corpora, allowing for the extraction of multiple units in the case of the most frequently-occurring segments. The corpus creation algorithm incorporates mechanisms that enable the fine-tuning of the segment database's language-dependent characteristics and thus assists in the generation of high-quality text-to-speech synthesis.
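Selecting a near-minimal sentence set with full segment coverage is commonly sketched as a greedy set-cover pass; the example below does this with diphone-style segments extracted from phone strings. The sentences and phone transcriptions are illustrative, and the paper's own algorithm includes additional mechanisms beyond this greedy step.

```python
# Greedy sketch of near-minimal corpus selection: repeatedly pick the sentence
# that adds the most not-yet-covered segments (diphones here, for illustration).
def diphones(phones):
    return {(a, b) for a, b in zip(phones, phones[1:])}

def greedy_select(candidate_sentences):
    """candidate_sentences: list of (sentence_text, phone_sequence) pairs."""
    uncovered = set().union(*(diphones(p) for _, p in candidate_sentences))
    selected = []
    while uncovered:
        best = max(candidate_sentences, key=lambda sp: len(diphones(sp[1]) & uncovered))
        gain = diphones(best[1]) & uncovered
        if not gain:
            break                      # remaining segments are not coverable
        selected.append(best[0])
        uncovered -= gain
    return selected

if __name__ == "__main__":
    candidates = [
        ("sentence A", ["s", "a", "t", "o"]),
        ("sentence B", ["s", "a", "m", "o"]),
        ("sentence C", ["t", "o", "m", "o", "s", "a"]),
    ]
    print(greedy_select(candidates))
```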
We report the design and development of Thirukkural, the first text-to-speech converter for Tamil. Syllables of different lengths have been selected as units, since Tamil is a syllabic language. An automatic segmentation algorithm [8] has been devised for segmenting syllables into consonant and vowel parts. The units are pitch-marked using the Discrete Cosine Transform - Spectral Auto-correlation Function (DCTSAF) [6]. Prosodic information is captured in tables based on extensive observation of spoken Tamil. During synthesis, DCT-based pitch modification [3][7][11] is applied both for waveform interpolation and for modifying the pitch contour for different sentence modalities. Thirukkural is designed in VC++ and runs on Windows 95/98/NT. Perceptual evaluation by native listeners shows that the synthesized speech is intelligible and fairly natural.
This paper describes the TTTS (Turkish Text-To-Speech) synthesis system, developed at Fatih University for the Turkish language. TTTS is a concatenative TTS system aiming to advance the process of developing natural and human-sounding Turkish voices. The system is implemented by concatenating pieces of recorded speech that are stored in a database; such systems differ in the size of the stored speech units, which affects the output range, quality and clarity. The letters of the Turkish alphabet and the syllables that consist of at most two letters are used as the smallest units in this study, and syllables of more than two letters are derived from these smallest units. The results are evaluated using the Degradation Mean Opinion Score (DMOS) method.
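One reading of the unit scheme above is that any syllable longer than two letters is decomposed into one- and two-letter pieces for which recordings exist; the sketch below illustrates that decomposition over romanized, pre-syllabified input. The syllabification and the example word are placeholders, not TTTS internals.

```python
# Sketch of the unit scheme described above (one interpretation): syllables of
# at most two letters are used directly, longer syllables are decomposed into
# two-letter and one-letter pieces. Input is already syllabified.
def decompose_syllable(syllable: str):
    if len(syllable) <= 2:
        return [syllable]
    pieces, i = [], 0
    while i < len(syllable):
        pieces.append(syllable[i:i + 2])   # take two letters where possible
        i += 2
    return pieces

def units_for_word(syllables):
    units = []
    for syl in syllables:
        units.extend(decompose_syllable(syl))
    return units

if __name__ == "__main__":
    # "merhaba" syllabified as mer-ha-ba (romanized, for illustration).
    print(units_for_word(["mer", "ha", "ba"]))   # ['me', 'r', 'ha', 'ba']
```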
Our participation in the Blizzard Challenge 2014 is only for the Tamil language. We have a unit-selection-based concatenative speech synthesis system. A sentence-level Viterbi search is used to select reliable speech units from a set of candidate units. The given RD (reading), SUS (semantically unpredictable sentences) and ML (multilingual) test sentences are synthesized using the corpus made available to the participants. The listening test results reported by the Blizzard evaluation team are discussed. The letter code for the MILE TTS is “J”.
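For context, a sentence-level Viterbi search over candidate units typically minimizes a sum of target and concatenation (join) costs; the sketch below shows that dynamic program with placeholder cost functions and toy candidates, not MILE's actual costs or features.

```python
# Sketch of sentence-level Viterbi unit selection: for each target position there
# is a list of candidate units; the dynamic program minimizes target cost plus
# join cost between consecutive chosen units. Cost functions are placeholders.
def viterbi_select(candidates, target_cost, join_cost):
    best_cost = [target_cost(0, u) for u in candidates[0]]
    back = [[None] * len(candidates[0])]
    for t in range(1, len(candidates)):
        costs, pointers = [], []
        for u in candidates[t]:
            options = [
                best_cost[j] + join_cost(prev, u)
                for j, prev in enumerate(candidates[t - 1])
            ]
            j = min(range(len(options)), key=options.__getitem__)
            costs.append(options[j] + target_cost(t, u))
            pointers.append(j)
        best_cost, back = costs, back + [pointers]
    # Trace back the cheapest path.
    j = min(range(len(best_cost)), key=best_cost.__getitem__)
    path = []
    for t in range(len(candidates) - 1, -1, -1):
        path.append(candidates[t][j])
        j = back[t][j] if back[t][j] is not None else j
    return list(reversed(path))

if __name__ == "__main__":
    # Toy candidates: units represented by (name, pitch) pairs.
    cands = [[("a1", 100), ("a2", 130)], [("b1", 110), ("b2", 180)]]
    tc = lambda t, u: 0.0                              # flat target cost (placeholder)
    jc = lambda u, v: abs(u[1] - v[1]) / 100.0         # pitch-mismatch join cost
    print(viterbi_select(cands, tc, jc))               # [('a1', 100), ('b1', 110)]
```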
In this article, an evolutionary algorithm known as Genetic Programming (GP) was used to design a parametric speech quality estimation model. GP is nowadays one of the machine learning techniques employed in quality estimation. In principle, the set of quality-affecting parameters was used as input to the designed GP-based estimation model in order to estimate the quality of synthesized speech transmitted over an IP channel (a VoIP environment). The performance results obtained by the designed estimation model have confirmed the good properties of genetic programming, namely good accuracy and generalization ability, which makes it a promising approach to quality estimation for this type of speech in the corresponding environment. The developed model can be helpful for network operators and service providers, who can apply it in the planning phase or early development stage of telecommunication services based on synthesized speech.
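As background on the technique (a deliberately simplified toy, not the paper's model or data): genetic programming evolves expression trees over the quality-affecting parameters toward low prediction error. The sketch below uses random trees, mutation-only search on synthetic data, and mean squared error as fitness; a full GP would also use a population and crossover.

```python
# Toy illustration of the genetic-programming idea (mutation-only, synthetic
# data): evolve an expression tree mapping quality parameters to a MOS-like
# score by minimizing mean squared error.
import random
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}
VARS = ["loss_rate", "burst_ratio"]          # illustrative quality parameters

def random_tree(depth=3):
    if depth == 0 or random.random() < 0.3:
        return random.choice(VARS + [round(random.uniform(-2, 2), 2)])
    return (random.choice(list(OPS)), random_tree(depth - 1), random_tree(depth - 1))

def evaluate(tree, sample):
    if isinstance(tree, tuple):
        op, left, right = tree
        return OPS[op](evaluate(left, sample), evaluate(right, sample))
    return sample[tree] if isinstance(tree, str) else tree

def fitness(tree, data):
    return sum((evaluate(tree, x) - y) ** 2 for x, y in data) / len(data)

def mutate(tree):
    # Replace the whole tree or recurse into the left branch (kept simple on purpose).
    if random.random() < 0.5 or not isinstance(tree, tuple):
        return random_tree()
    return (tree[0], mutate(tree[1]), tree[2])

if __name__ == "__main__":
    random.seed(0)
    # Synthetic "measurements": quality drops with loss rate and burstiness.
    data = [({"loss_rate": l, "burst_ratio": b}, 4.2 - 6.0 * l - 0.5 * b)
            for l in (0.0, 0.05, 0.1) for b in (0.2, 0.5)]
    best = random_tree()
    for _ in range(3000):
        child = mutate(best)
        if fitness(child, data) < fitness(best, data):
            best = child
    print(best, round(fitness(best, data), 4))
```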
End-to-end text-to-speech (TTS) systems have been developed for European languages like English and Spanish with state-of-the-art speech quality, prosody, and naturalness. However, the development of end-to-end TTS for Indian languages is lagging behind in terms of quality. The challenges involved in such a task are: 1) scarcity of quality training data; 2) low efficiency during training and inference; and 3) slow convergence in the case of a large vocabulary size. In the work reported in this paper, we have investigated fine-tuning the English-pretrained Tacotron2 model with limited Sanskrit data to synthesize natural-sounding Sanskrit speech in a low-resource setting. Our experiments show encouraging results, achieving an overall MOS of 3.38 from 37 evaluators with good knowledge of spoken Sanskrit. This is a notably good result, considering that only 2.5 hours of speech data were used.
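The key mechanical step in this kind of fine-tuning is loading pretrained weights while re-initializing layers whose shapes change with the new symbol set (typically the character embedding). The sketch below shows that pattern on a deliberately tiny stand-in model, since the full Tacotron2 definition is far too large to reproduce here; the model, checkpoint name and layer sizes are placeholders.

```python
# Sketch of the weight-transfer step used when fine-tuning a pretrained model on
# a language with a different character set: copy every compatible tensor from
# the checkpoint, re-initialize the rest (here, the character embedding).
# The tiny model and file names are placeholders, not Tacotron2 itself.
import torch
import torch.nn as nn

class TinySynthesizer(nn.Module):
    def __init__(self, num_symbols):
        super().__init__()
        self.embedding = nn.Embedding(num_symbols, 16)   # depends on the symbol set
        self.encoder = nn.GRU(16, 32, batch_first=True)  # shape-compatible layers
        self.decoder = nn.Linear(32, 80)                  # e.g. mel-spectrogram frames

def load_compatible_weights(model, checkpoint_path):
    pretrained = torch.load(checkpoint_path, map_location="cpu")
    own = model.state_dict()
    kept = {k: v for k, v in pretrained.items()
            if k in own and v.shape == own[k].shape}
    own.update(kept)
    model.load_state_dict(own)
    return sorted(set(pretrained) - set(kept))   # layers that train from scratch

if __name__ == "__main__":
    english = TinySynthesizer(num_symbols=70)
    torch.save(english.state_dict(), "english_pretrained.pt")   # stand-in checkpoint
    sanskrit = TinySynthesizer(num_symbols=120)                 # larger symbol set
    print("re-initialized:", load_compatible_weights(sanskrit, "english_pretrained.pt"))
    # Fine-tuning would then proceed with a small learning rate on the limited data.
```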
The Medical Faculty of the University of Bern uses voice-over in picture-driven e-learning modules to avoid the split attention induced by the modality effect. To lower production costs, professional narrators have been replaced by computer-generated voices. The e-learning modules are produced with a content management system (CMS) offering text-to-speech functionality. 107 Swiss high school students passed a 20-minute e-learning sequence on cystic fibrosis. In a nested between-group design with four learning content presentation modalities (written text vs. human voice-over vs. artificial voice-over on 15-inch laptop screens vs. 2.8-inch smartphone screens), the learning outcome was assessed at three points in time: before, just after, and six weeks after the learning phase. All modalities led to a significant short-term and long-term increase in factual knowledge about cystic fibrosis. Our two hypotheses are supported: (1) presenting pictures with both human and artificial...
We present a system that provides an advanced telephone service for the dissemination of both national and regional avalanche forecasts in the Swiss Alps. The service enables members of the public and also mountain guides to access forecast information while travelling in mountain areas and, particularly, to be notified when entering regions of high risk. By telephone access, we include both voice and WAP-based access as well as combinations of both. The service is achieved through the integration of a special forecast content delivery database, including geographical data for location-dependent delivery, into the overall avalanche information system architecture. This database was implemented using the XIMA framework for adaptable content delivery, which is based on an XML server for the OMS Java data management system and XSLT presentation templates. The speech interface was implemented using VoiceXML.
This paper investigates the impact of independent and dependent losses and of coding on the speech quality predictions provided by PESQ and P.563 when both naturally-produced and synthesized speech are transmitted over various channels. Two synthesized speech signals generated with two different text-to-speech systems and one naturally-produced signal are investigated. In addition, we assess the variability of PESQ's and P.563's predictions with respect to the type of signal used (naturally-produced or synthesized) and the loss conditions. The results show that synthesized speech suffers more from packet loss than naturally-produced speech, which also results in larger prediction deviations for this type of speech. On the other hand, the impact of 'artificially' sounding codecs is higher for naturally-produced speech. In contrast, for a 'natural' sounding codec the quality predictions are, with the exception of the predictions provided by the P.563 model, very simil...
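The independent vs. dependent loss contrast above is usually modeled with a Bernoulli process versus a two-state Gilbert model; a minimal sketch of both, applied per speech frame, is shown below. The frame size, transition probabilities and the use of zeroed frames are placeholders, not the paper's test conditions.

```python
# Minimal sketch: independent (Bernoulli) vs. dependent (Gilbert, bursty) frame
# loss applied to a speech signal; lost frames are simply zeroed here.
import numpy as np

def independent_loss(n_frames, p_loss, rng):
    return rng.random(n_frames) < p_loss

def gilbert_loss(n_frames, p_good_to_bad, p_bad_to_good, rng):
    lost, bad = np.zeros(n_frames, dtype=bool), False
    for i in range(n_frames):
        # Stay in the bad (loss) state with probability 1 - p_bad_to_good.
        bad = rng.random() < (1 - p_bad_to_good if bad else p_good_to_bad)
        lost[i] = bad
    return lost

def apply_loss(signal, frame_len, lost):
    degraded = signal.copy()
    for i, is_lost in enumerate(lost):
        if is_lost:
            degraded[i * frame_len:(i + 1) * frame_len] = 0
    return degraded

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    speech = rng.standard_normal(16000)          # placeholder 1 s signal at 16 kHz
    frames = len(speech) // 160                  # 10 ms frames
    ind = apply_loss(speech, 160, independent_loss(frames, 0.05, rng))
    bursty = apply_loss(speech, 160, gilbert_loss(frames, 0.02, 0.4, rng))
    print(ind.shape, bursty.shape)
```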
Very good quality speech synthesis systems exist for languages like English and Chinese. However, only in the recent past has increased attention been paid to developing TTS for Indian languages. There have been several reasons for this: 1) lack of an adequate market, and 2) non-availability of quality training data. In this work, we have developed a human-like quality Kannada text-to-speech conversion system using about 44.8 hours of training data recorded in a studio from a Kannada teacher with good diction. We have used the transfer learning technique to continue training from the Tacotron2 and WaveGlow checkpoints pre-trained on English. Evaluation by thirty-five Kannada natives resulted in an overall MOS of 4.51 ± 0.52, whereas the original speech of the speaker was given an MOS of 4.62 ± 0.53. In another independent test, where a further set of 25 human evaluators were given ten pairs of the original utterances of the speaker and the synthesized speech of the same sentences, some of the synthesized speech samples were judged to be better than the original. In a final round of evaluation, five sentences were synthesized by our TTS, Google's Wavenet TTS and Nuance's TTS. Kannada natives were presented these outputs in random order and asked to choose the one they most preferred. Based on 55 human evaluators, RaGaVeRa's Kannada TTS obtained a mean preference score of 78.2%, whereas Google's and Nuance's TTS got scores of 13.1% and 5.1%, respectively. Thus, to the best of the authors' knowledge, this is the best quality TTS that has been achieved for Kannada so far.
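For reference, the reported numbers are simple aggregates of listener responses; a minimal sketch of computing a MOS with its standard deviation and a pairwise preference percentage from raw ratings is shown below (the ratings and system labels are made-up examples, not the study's data).

```python
# Sketch: aggregating listening-test responses into a MOS (mean ± standard
# deviation) and a preference percentage. Ratings below are made-up examples.
from statistics import mean, stdev

def mos(ratings):
    return mean(ratings), stdev(ratings)

def preference_share(choices, system):
    return 100.0 * sum(c == system for c in choices) / len(choices)

if __name__ == "__main__":
    ratings = [5, 4, 5, 4, 5, 4, 4, 5, 5, 4]             # 1-5 opinion scores
    m, s = mos(ratings)
    print(f"MOS = {m:.2f} ± {s:.2f}")
    choices = ["A", "A", "B", "A", "C", "A", "A", "B"]   # preferred system per trial
    print(f"Preference for A = {preference_share(choices, 'A'):.1f}%")
```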
This research aims to explore CBAR concepts to implement test oracles that support the testing activities of TTS systems, assisting humans in quality evaluation. In an automated software testing environment, test oracles are the mechanism used to evaluate whether an execution of the SUT (Software Under Test) is correct or not.
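One concrete way to realize such an oracle is to compare the synthesizer's output against a reference recording in a perceptually motivated feature space; the sketch below uses mean MFCC vectors and a distance threshold via librosa. The threshold and file names are placeholders, and this is one possible oracle rather than the CBAR formulation used in the research above.

```python
# Sketch of an audio-similarity test oracle: the test passes if the synthesized
# utterance is close to a reference recording in mean-MFCC space.
import numpy as np
import librosa

def mean_mfcc(path, n_mfcc=13):
    y, sr = librosa.load(path, sr=16000)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).mean(axis=1)

def oracle(reference_wav, synthesized_wav, threshold=25.0):
    # Threshold is a placeholder; in practice it would be calibrated on known-good runs.
    distance = float(np.linalg.norm(mean_mfcc(reference_wav) - mean_mfcc(synthesized_wav)))
    return distance <= threshold, distance

if __name__ == "__main__":
    passed, dist = oracle("reference/hello.wav", "sut_output/hello.wav")
    print("PASS" if passed else "FAIL", f"(distance = {dist:.1f})")
```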
Audio podcasting is increasingly present in the educational field and is especially appreciated as a ubiquitous/pervasive tool (“anywhere, anytime, at any pace”) for acquiring or expanding knowledge. We designed and implemented a Web-based Text To Speech (TTS) system for the automatic generation of a set of structured audio podcasts from a single text document. The system receives a document as input (doc, rtf, or txt) and produces as output a set of audio files that reflect the document's internal structure (one mp3 file for each document section), ready to be downloaded onto portable mp3 players. Structured audio files are useful for everyone but are especially appreciated by blind users, who must explore content aurally. Fully accessible for the blind, our system offers WAI-ARIA-based Web interfaces for easy navigation and interaction via screen reader and voice synthesizer, and produces a set of accessible audio files for Rockbox mp3 players (mp3 and talk files), allowing blind use...
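A minimal sketch of the one-audio-file-per-section idea, assuming plain-text input with a simple heading convention and the pyttsx3 offline synthesizer; the heading markers and file names are placeholders, and the output container depends on the platform's speech back end, so conversion to mp3 is not shown.

```python
# Minimal sketch: split a plain-text document into sections at heading markers
# and synthesize one audio file per section with pyttsx3.
import re
import pyttsx3

def split_sections(text):
    # Sections start at lines like "== Introduction ==" (illustrative convention).
    parts = re.split(r"^==\s*(.+?)\s*==\s*$", text, flags=re.MULTILINE)
    it = iter(parts[1:])                      # drop any preamble before the first heading
    return list(zip(it, it))                  # (title, body) pairs

def synthesize_sections(text, prefix="podcast"):
    engine = pyttsx3.init()
    for index, (title, body) in enumerate(split_sections(text), 1):
        engine.save_to_file(f"{title}. {body}", f"{prefix}_{index:02d}.wav")
    engine.runAndWait()

if __name__ == "__main__":
    document = """== Overview ==
A short overview of the course.
== Lesson one ==
The first lesson explains the basics."""
    synthesize_sections(document)
```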
This paper presents the current results of the development of TTS-MK, a speech synthesizer for the Macedonian language. The basic principles for designing and building a speech synthesizer for Macedonian, based on the concatenation of speech segments, are presented. Every language has its own specific speech norms and characteristics that should be observed during speech synthesis. Macedonian is a phonetic language, hence normative pronunciation does not present great difficulty, except in some special cases that should be taken into consideration. The presentation also focuses on the accent in Macedonian, which is dynamic and positioned on the third syllable. The rules for accent positioning in Macedonian can be easily derived, with some deviations that need to be resolved. There are two versions of the system, based on different segment corpora; both of them are presented, as well as their applications.
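For readers unfamiliar with the accent rule mentioned above, Macedonian stress generally falls on the third syllable from the end of the word (the antepenultimate). The sketch below marks the stressed vowel over romanized words by counting vowels from the end; it is a simplified placeholder that ignores the deviations the paper notes (loan words, clitic groups).

```python
# Sketch of a simple antepenultimate-stress rule over romanized words: stress
# falls on the third syllable from the end for words with three or more
# syllables, otherwise on the first.
VOWELS = set("aeiou")

def vowel_positions(word: str):
    return [i for i, ch in enumerate(word.lower()) if ch in VOWELS]

def mark_stress(word: str) -> str:
    positions = vowel_positions(word)
    if not positions:
        return word
    stressed = max(0, len(positions) - 3)        # index of stressed syllable, from the start
    pos = positions[stressed]
    return word[:pos] + word[pos].upper() + word[pos + 1:]

if __name__ == "__main__":
    for w in ["vodenica", "planina", "grad"]:
        print(mark_stress(w))   # vodEnica, plAnina, grAd
```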