Several factors affect the efficiency of bootstrapping approaches to the generation of pronunciation dictionaries. We focus on factors related to the underlying rule-extraction algorithms, and demonstrate variants of the Dynamically Expanding Context algorithm ...
Broadband speech corpus of approximately 10 hours and the corresponding transcriptions. The development process of the corpus involved the recording and transcribing of radio broadcasts. The transcriptions were used to generate the Sepedi code-switched prompts to re-record speech from multiple speakers. The following sub-directories are found in this directory:
Audio: Audio files for all the recorded code-switched speech
Transcriptions: The corresponding orthographic transcriptions
Metadata: Information about the speakers and the transcriptions
Documentation: The directory structure and the Sepedi prompt list
A robust theoretical framework that can describe and predict the generalization ability of DNNs in general circumstances remains elusive. Classical attempts have produced complexity metrics that rely heavily on global measures of compactness and capacity with little investigation into the effects of sub-component collaboration. We demonstrate intriguing regularities in the activation patterns of the hidden nodes within fully-connected feedforward networks. By tracing the origin of these patterns, we show how such networks can be viewed as the combination of two information processing systems: one continuous and one discrete. We describe how these two systems arise naturally from the gradient-based optimization process, and demonstrate the classification ability of the two systems, individually and in collaboration. This perspective on DNN classification offers a novel way to think about generalization, in which different subsets of the training data are used to train distinct classi...
Orthographically transcribed broadband speech corpus of approximately 56 hours, including a test suite of 8 speakers.
High quality audio and orthographic transcriptions, produced by a single, 29 year old, male speaker.
Abstract—Smartphones provide an efficient means for the collection of speech data; however, the quality of the corpora created in this fashion is not predictable. We describe an approach that allows us to post-process and rank utterances in a prompted speech corpus quickly and effectively. Utterance ranking makes it possible to both select those utterances with the highest likelihood of being correct and to evaluate the quality of the resulting corpus from a limited sample. This approach has been applied to a collection in the eleven official languages of South Africa, and we show that it naturally leads to the creation of stratified corpora from the same collection. Such corpora can be useful for different purposes, and corpus users are provided with the tools to extract these easily: from small, highly accurate corpora to larger corpora that are likely to contain more errors. Index Terms: speech corpora, automatic speech recognition, confidence scoring
Data preparation and selection affect systems across a wide range of complexities. A system built for a resource-rich language may be so large as to include borrowed languages. A system built for a resource-scarce language may be affected by how carefully the training data is selected and produced. Accuracy is affected by the presence of enough samples of qualitatively relevant information. We propose a method using the Kullback-Leibler divergence to solve two problems related to data preparation: the ordering of alternate pronunciations in a lexicon, and the selection of transcription data. In both cases, we want to guarantee that a particular distribution of n-grams is achieved. In the case of lexicon design, we want to ensure that phones will be present often enough. In the case of training data selection for scarcely resourced languages, we want to make sure that some n-grams are better represented than others. Our proposed technique yields encouraging results.
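The idea of steering selected data towards a desired n-gram distribution can be illustrated with a small sketch. This is not the authors' method, only one plausible reading of it: a greedy loop that repeatedly adds the sentence that most reduces the Kullback-Leibler divergence between a target distribution and the n-gram distribution of the data selected so far. The function names, the additive smoothing constant eps and the greedy strategy are all assumptions made for illustration.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def kl_divergence(target, counts, eps=1e-9):
    """D_KL(target || observed), where observed is estimated from counts.

    eps is a hypothetical smoothing constant that keeps the observed
    probability of an unseen n-gram above zero.
    """
    total = sum(counts.values())
    div = 0.0
    for gram, p in target.items():
        if p <= 0:
            continue
        q = (counts.get(gram, 0) + eps) / (total + eps * len(target))
        div += p * math.log(p / q)
    return div


def greedy_select(sentences, target, k, n=2):
    """Greedily pick up to k sentence indices that minimise the divergence."""
    selected, counts = [], Counter()
    pool = list(range(len(sentences)))
    for _ in range(min(k, len(pool))):
        # Score each candidate by the divergence after adding its n-grams.
        best = min(
            pool,
            key=lambda i: kl_divergence(
                target, counts + Counter(ngrams(sentences[i], n))
            ),
        )
        selected.append(best)
        counts += Counter(ngrams(sentences[best], n))
        pool.remove(best)
    return selected
```

With a uniform target over two unigrams, the sentence containing one of each is preferred over sentences that repeat a single symbol, since it matches the target exactly.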
We describe a new language-independent technique for automatically identifying errors in an electronic pronunciation dictionary by analyzing the source of conflicting patterns directly. We evaluate the effectiveness of the technique in two ways: we perform a controlled experiment using artificially corrupted data (allowing us to measure precision and recall exactly); and then apply the technique to a real-world pronunciation dictionary, demonstrating its effectiveness in practice. We also introduce a new freely available pronunciation resource (the RCRL Afrikaans Pronunciation Dictionary), the largest such dictionary that currently exists. Index Terms: pronunciation dictionaries, error detection, quality verification, Default&Refine, grapheme-to-phoneme, g2p
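The controlled experiment mentioned above relies on the fact that, when errors are injected artificially, the set of corrupted entries is known, so precision and recall can be computed exactly. A minimal sketch of that bookkeeping follows; the function name and the representation of entries as hashable identifiers are assumptions, and the error detector itself is not shown.

```python
def precision_recall(flagged, corrupted):
    """Compute precision and recall of an error detector.

    flagged:   entries the detector marked as erroneous
    corrupted: entries that were deliberately corrupted (ground truth)
    """
    flagged, corrupted = set(flagged), set(corrupted)
    true_positives = len(flagged & corrupted)
    # Guard against empty sets so the ratios are always defined.
    precision = true_positives / len(flagged) if flagged else 0.0
    recall = true_positives / len(corrupted) if corrupted else 0.0
    return precision, recall
```

For example, if four entries are flagged and two of them are among three corrupted entries, precision is 2/4 and recall is 2/3.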
Abstract: We improve on a piece-wise linear model of the trajectories of Mel Frequency Cepstral Coefficients, which are commonly used as features in Automatic Speech Recognition. For this purpose, we have created a very clean single-speaker corpus, which is ideal for the investigation of contextual effects on cepstral trajectories. We show that modelling improvements, such as continuity constraints on parameter values and more flexible transition models, systematically improve the robustness of our trajectory models. However, the parameter estimates remain unexpectedly variable within triphone contexts, suggesting interesting challenges for further exploration. I. INTRODUCTION Current approaches to automatic speech recognition (ASR) require large amounts of speech data to achieve high accuracies, since context-dependent modelling of phones is an important feature of these approaches. The requirement for context-dependent modelling results from the physical constraints of the human vo...
Abstract: When modelling code-switched speech (utterances that contain a mixture of languages), the embedded language often contains phones not found in the matrix language. These are typically dealt with by either extending the phone set or mapping each phone to a matrix language counterpart. We use acoustic log likelihoods to assist us in identifying the optimal mapping strategy at a context-dependent level (that is, at triphone, rather than monophone level) and obtain new insights in the way English/Sepedi code-switched vowels are produced. I. INTRODUCTION AND BACKGROUND Code switching – using words and phrases from more than one language within a single utterance – is a common phenomenon among multilingual speakers. There are a number of reasons why multilingual speakers engage in code switching.
Bootstrapping techniques can accelerate the development of language technology for resource-scarce languages. We define a framework for the analysis of a general bootstrapping process whereby a model is improved through a controlled series of increments, at each stage using the previous model to generate the next. We apply this framework to the task of creating pronunciation models for resource-scarce languages, iteratively combining machine learning and human knowledge in a way that minimizes the human intervention required during this process. We analyse the effectiveness of such an approach when developing a medium-sized (5000–10 000 word) pronunciation lexicon. We develop such an electronic pronunciation lexicon in Afrikaans, one of South Africa’s official languages, and provide initial results obtained for similar lexicons developed in Zulu and Sepedi, two other South African languages. We derive a mathematical model that can be used to predict the amount of time required for th...
Proceedings of the 13th Annual Symposium of the Pattern Recognition Association of South Africa (PRASA), Langebaan, South Africa, November 2003
Grapheme-based speech recognition systems are faster to develop but typically do not reach the same level of performance as phoneme-based systems. In this paper we introduce a technique for improving the performance of standard grapheme-based systems. We find that by handling a relatively small number of irregular words through phoneme-to-grapheme (P2G) transliteration (transforming the original orthography of irregular words to an 'idealised' orthography), grapheme-based accuracy can be improved. An analysis of speech recognition accuracy based on word categories shows that P2G transliteration succeeds in improving certain word categories in which grapheme-based systems typically perform poorly, and that the problematic categories can be identified prior to system development. We evaluate when category-based P2G transliteration is beneficial and discuss how the technique can be implemented in practice.
No framework exists that can explain and predict the generalisation ability of deep neural networks in general circumstances. In fact, this question has not been answered for some of the least complicated of neural network architectures: fully-connected feedforward networks with rectified linear activations and a limited number of hidden layers. For such an architecture, we show how adding a summary layer to the network makes it more amenable to analysis, and allows us to define the conditions that are required to guarantee that a set of samples will all be classified correctly. This process does not describe the generalisation behaviour of these networks, but produces a number of metrics that are useful for probing their learning and generalisation behaviour. We support the analytical conclusions with empirical results, both to confirm that the mathematical guarantees hold in practice, and to demonstrate the use of the analysis process.
Feedforward neural networks provide the basis for complex regression models that produce accurate predictions in a variety of applications. However, they generally do not explicitly provide any information about the utility of each of the input parameters in terms of their contribution to model accuracy. With this in mind, we develop the pairwise network, an adaptation to the fully connected feedforward network that allows the ranking of input parameters according to their contribution to model output. The application is demonstrated in the context of a space physics problem. Geomagnetic storms are multi-day events characterised by significant perturbations to the magnetic field of the Earth, driven by solar activity. Previous storm forecasting efforts typically use solar wind measurements as input parameters to a regression problem tasked with predicting a perturbation index such as the 1-minute cadence symmetric-H (Sym-H) index. We re-visit the task of predicting Sym-H from solar ...
15th Annual Symposium of the Pattern Recognition Association of South Africa, Grabouw, South Africa, 25 to 26 November 2004
Utilising the known language of origin of a name can be useful when predicting the pronunciation of the name. When this language is not known, automatic language identification (LID) can be used to influence which language-specific grapheme-to-phoneme (G2P) predictor is triggered to produce a pronunciation for the name. We investigate the implications when both the LID system and the G2P system generate errors: what influence does this have on a resulting speech recognition system? We experiment with different approaches to LID-based dictionary creation and report on results in four South African languages: Afrikaans, English, Sesotho and isiZulu.
Conference Proceedings of the 24th Annual Symposium of the Pattern Recognition Association of South Africa, Johannesburg, South Africa, 3 December 2013
The generalization capabilities of deep neural networks are not well understood, and in particular, the influence of activation functions on generalization has received little theoretical attention. Phenomena such as vanishing gradients, node saturation and network sparsity have been identified as possible factors when comparing different activation functions [1]. We investigate these factors using fully connected feedforward networks on two standard benchmark problems, and find that the most salient differences between networks with sigmoidal and ReLU activations relate to the way that class-distinctive information is propagated through a network.
Sixteenth Annual Symposium of the Pattern Recognition Association of South Africa, Langebaan, South Africa, 23-25 November 2005
The understanding of generalization in machine learning is in a state of flux. This is partly due to the relatively recent revelation that deep learning models are able to completely memorize training data and still perform appropriately on out-of-sample data, thereby contradicting long-held intuitions about generalization. The phenomenon was brought to light and discussed in a seminal paper by Zhang et al. [24]. We expand upon this work by discussing local attributes of neural network training within the context of a relatively simple and generalizable framework. We describe how various types of noise can be compensated for within the proposed framework in order to allow the global deep learning model to generalize in spite of interpolating spurious function descriptors. Empirically, we support our postulates with experiments involving overparameterized multilayer perceptrons and controlled noise in the training data. The main insights are that deep learning models are optimized fo...
20th Annual Symposium of the Pattern Recognition Association of South Africa (PRASA). Stellenbosch, South Africa, 30 November - 01 December 2009
Proceedings of the Workshop on Spoken Languages Technologies for Under-Resourced Languages (SLTU 2010), Penang, Malaysia, May 2010
School of Electrical, Electronic and Computer Engineering, North-West University, Potchefstroom, South Africa Human Language Technology, Competency Area, CSIR, Meraka Institute Multilingual Speech Technologies, North-West University, Vanderbijlpark, South Africa
We explore pattern recognition techniques for verifying the correctness of a pronunciation lexicon, focusing on techniques that require limited human interaction. We evaluate the British English Example Pronunciation (BEEP) dictionary [1], a popular public domain resource that is widely used in English speech processing systems. The techniques being investigated are applied to the lexicon and the results of each step are illustrated using sample entries. We find that as many as 5553 words in the BEEP dictionary are incorrect. We demonstrate the effect of correction techniques on a lexicon and implement the lexicon in an automatic speech recognition (ASR) system.
In this research, we use machine learning techniques to provide solutions for descriptive linguists in the domain of language standardization. With regard to the personal name construction in Afrikaans, we perform function learning from word pairs using the Default & Refine algorithm. We demonstrate how the extracted rules can be used to identify irregularities in previously standardized constructions and to predict new forms of unseen words. In addition, we define a generic, automated process that allows us to extract constructional schemas and present these visually as categorization networks, similar to what is often being used in Cognitive Grammar. We conclude that computational modeling of constructions can contribute to new descriptive linguistic insights, and to practical language solutions.