A Survey of Language and Dialect Identification Systems
Tanvira Ismail
Senior Assistant Professor, Assam Don Bosco University, India
tanvira.ismail@gmail.com

Abstract—As per the dictionary, a language is a system of communication which consists of a set of sounds and written symbols which are used by the people of a particular country or region for talking or writing. Besides languages, people also communicate through dialects. Dialect refers to a regional or social variety of a language distinguished by pronunciation, grammar or vocabulary. The quest to automate the ability to identify languages and dialects has never stopped, and hence the rise in research on automatic language and dialect identification systems. In this paper, we survey work done in the field of language and dialect identification.

Keywords—LID, dialect identification, GMM, GMM-UBM, SVM, HMM, neural network

I. INTRODUCTION

Humans are born with the ability to discriminate between spoken languages as part of human intelligence [1]. The first perceptual experiment measuring how well human listeners can perform language identification was reported by Muthusamy et al. [2], wherein it was concluded that human beings, with adequate training, are the most accurate language recognizers. For languages with which they are not familiar, human listeners can often make subjective judgments with reference to the languages they know, e.g., "it sounds like English". Though such judgments are too imprecise for the hard decisions required by an identification task, they show how human listeners apply linguistic knowledge at different levels to distinguish between certain broad language groups. The quest to automate such ability has never stopped. Just like other artificial intelligence technologies, automatic language identification (LID) aims to replicate this human ability through computational means [2].

Besides languages, people also communicate through dialects. A dialect is considered a variety of speech differing from the standard literary language or speech pattern of the culture in which it exists. Such a variety could be associated with a particular place or region. For example, American English (spoken by people from America) and British English (spoken by people from Britain) are dialects of English. Dialects of a specific language differ from each other, but they are still understandable to the speakers of another dialect of the same language [3]. Dialect identification is the task of recognizing a speaker's regional dialect within a predetermined language. The problem of automatic dialect identification is viewed as more challenging than that of language recognition due to the greater similarity between dialects of the same language [4].

LID based research has received much interest and attention due to its importance in the areas of machine translation, speech recognition, data mining, spam filtering, document summarization, etc.

On the other hand, developing a good method to detect dialect accurately helps in improving certain applications and services, such as the speech recognition systems which exist in most of today's electronic devices [5]. It will allow researchers to infer the speaker's regional origin and ethnicity and to adapt the features used in speaker identification to the regional origin [6]. An accurate dialect identification technique is also expected to help in providing new services in the field of e-health and telemedicine, which is especially important for older and homebound people [5].

II. BRIEF SURVEY OF LITERATURE ON LANGUAGE IDENTIFICATION

The task of identifying the language being spoken from a sample of speech by an unknown speaker is called LID. Sugiyama [6] proposed two language identification algorithms. The first algorithm was based on the standard vector quantization (VQ) algorithm and the second one was based on a VQ and histogram algorithm. In this work, acoustic features of the speech signal such as Linear Predictive Coding (LPC) coefficients, autocorrelation coefficients and delta cepstral coefficients were used. In the first algorithm, based on standard VQ, every language, denoted by k, was characterized by its own VQ codebook, Vk. The second algorithm, based on the VQ and histogram approach, consisted of a single universal VQ codebook, U = {vi}, for all languages and its occurrence probability histograms, hk; every language k was characterized by a histogram hk. The multilingual speech database used for the work consisted of 20 languages, namely, American, Arabic, Chinese, Danish, Dutch, English, Finnish, French, German, Greek, Hindi, Hungarian, Italian, Japanese, Norwegian, Polish, Portuguese, Russian, Spanish and Swedish. For each language, 16 sentences were uttered twice by four male speakers and four female speakers. The duration of each sentence was about 8 seconds. The experimental results showed that the recognition rates for the first and second algorithms were 65% and 80%, respectively. For each algorithm, just 8 sentences of unknown speech, i.e., a total of approximately 64 seconds, were used.
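
The codebook-per-language idea behind the first of these algorithms can be illustrated with a short sketch. The following Python fragment is only a minimal illustration under assumed inputs (a dictionary of per-language feature frames), not Sugiyama's exact implementation: one k-means codebook is trained per language, and a test utterance is labeled with the language whose codebook quantizes it with the lowest average distortion.

```python
# Illustrative sketch of codebook-per-language VQ scoring (not Sugiyama's exact
# implementation): one k-means codebook is trained per language from that
# language's feature frames, and a test utterance is assigned to the language
# whose codebook quantizes it with the lowest average distortion.
import numpy as np
from scipy.cluster.vq import kmeans2, vq

def train_codebooks(train_frames_by_lang, codebook_size=64, seed=0):
    """train_frames_by_lang: dict lang -> (n_frames, n_dims) array of features
    (e.g., LPC-derived cepstra). Returns dict lang -> codebook array."""
    codebooks = {}
    for lang, frames in train_frames_by_lang.items():
        centroids, _ = kmeans2(frames.astype(float), codebook_size,
                               minit="points", seed=seed)
        codebooks[lang] = centroids
    return codebooks

def identify_language(test_frames, codebooks):
    """Return the language whose codebook gives the minimum mean quantization error."""
    distortions = {}
    for lang, codebook in codebooks.items():
        _, dist = vq(test_frames.astype(float), codebook)  # per-frame distances
        distortions[lang] = float(np.mean(dist))
    return min(distortions, key=distortions.get)
```
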
Balleda et al. [7] presented an LID system that worked for four South Indian languages, viz., Kannada, Malayalam, Tamil and Telugu, and one North Indian language, viz., Hindi. The speech corpora consisted of speech utterances from read text. For each language, they collected speech from five native male speakers and five native female speakers, ensuring further that the text read by different speakers was different. The text of the sentences was chosen randomly and no attempt was made to choose a phonetically balanced set of sentences. The training utterances had an average length of ten seconds and the testing utterances had an average length of five seconds.


For each speaker, 60 seconds of speech was collected. It was also ensured that the text sentences used in training and testing were different. Each language was modeled using an approach based on VQ. The speech was segmented into different sounds and the performance of the system on each of the segments was studied. It was observed that the presence of some consonants and vowels (CVs) was crucial for each language, and for the same consonant and vowel combination the quality of the sound was different for different languages. This study also showed that once the speech signal was segmented into CVs, it was possible to perform automatic language identification on very short segments (ranging between 100-150 ms) of speech.

Nagarajan and Murthy [8] proposed an approach which used parallel syllable-like unit recognizers in a framework similar to Parallel Phone Recognition (PPR) for the language identification task. The difference between their proposed system and the PPR system was that unsupervised syllable models were built from the training data. The data were first segmented into syllable-like units. The syllable segments were then clustered using an incremental approach. This resulted in a set of syllable models for each language. These language-dependent syllable models were then used for identifying the language of the unknown test utterances. The Oregon Graduate Institute Multi-language Telephone Speech (OGI-MLTS) corpus, which consists of spontaneous speech in eleven languages, viz., English, Farsi, French, German, Hindi, Japanese, Korean, Mandarin, Spanish, Tamil and Vietnamese, was used in this approach. The utterances were recorded from 90 male and 40 female speakers in each language. For each language, 30 speakers were used for training and 20 speakers were used for testing. The initial results of their proposed system on the OGI-MLTS corpus showed a performance of 69.5%. They further showed that if only a subset of syllable models that were unique in some sense were considered, then the performance improved to 75.9%.

Padro and Padro [9] performed language identification using three different statistical methods based on Markov models, comparison of trigram frequency vectors and n-gram text categorization. The experiments were focused on studying the influence of the training set size, the amount of text to classify and the number of languages among which the system can choose, in order to determine the influence they have on the system performance. The corpora were formed by a set of daily newspaper news and consisted of six languages, viz., Catalan, Spanish, English, Italian, German and Dutch. For each of the language corpora, a random partition containing about 30,000 words was selected to be used as the test set. The rest of the corpora were used to randomly extract training samples of different sizes. The experiments involved training each system for all languages using a training set ranging from 2500 to 25000 words and evaluating their performance over the test data. The test was done by giving the system an amount of unclassified text ranging from 5 to 1000 characters. The process was repeated for all possible combinations of languages, from two to six languages. The experiments revealed that the influence of the training set size is not important when the size is bigger than approximately 50k words. These researchers [9] also proved that the amount of text to classify is crucial, but it is not necessary to have very long texts to achieve a good precision. All the systems achieved a precision higher than 95% for texts over 500 characters, while all the systems achieved a precision higher than 99% for texts of 5000 characters. It was observed that the number of languages the system could identify was a very relevant factor to take into account: the more languages the system has to recognize, the less precision it will have. Furthermore, it was concluded that if the language identification system has to be applied in a multilingual environment involving similar languages, the precision of the system would fall, and if the languages to be identified have different origins, the identification system would achieve a high precision.
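
The n-gram text categorization idea used in this comparison can be sketched as follows. The fragment below is a minimal illustration, not a reconstruction of Padro and Padro's systems: each language is represented by a character-trigram frequency profile built from training text (assumed to be available as plain strings), and a test text is assigned to the most similar profile by cosine similarity.

```python
# Minimal character-trigram profile classifier in the spirit of n-gram text
# categorization (a simplification, not Padro and Padro's exact systems).
import math
from collections import Counter

def trigram_profile(text):
    text = " ".join(text.lower().split())
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

def cosine(p, q):
    dot = sum(p[g] * q[g] for g in set(p) & set(q))
    norm = math.sqrt(sum(v * v for v in p.values())) * math.sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0

def identify(text, profiles):
    """profiles: dict lang -> trigram Counter built from training text."""
    return max(profiles, key=lambda lang: cosine(trigram_profile(text), profiles[lang]))

# Toy usage (illustrative training strings only):
profiles = {lang: trigram_profile(txt) for lang, txt in {
    "english": "the quick brown fox jumps over the lazy dog",
    "spanish": "el rapido zorro marron salta sobre el perro perezoso",
}.items()}
print(identify("the dog jumps", profiles))  # expected: english
```
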
Singh et al. [10] explored the use of a prosodic feature based sparse representation classification (SRC) system for the LID task. The prosodic features, i.e., intonation, rhythm and stress, were computed by extracting syllable-like units with the help of a vowel onset point detection algorithm and were mapped to the i-vector domain for SRC using an exemplar dictionary. For comparison, they also developed a contrast system based on cosine distance scoring (CDS). The system was evaluated on five Indian languages, viz., Assamese, Bengali, English, Hindi and Nepali. The test data consisted of 114 speech utterances from speakers in the age group 18 to 35 years, out of which 23 utterances were spoken in Assamese, 22 in Bengali, 22 in English, 23 in Hindi and 24 in Nepali. Each of the test utterances was of approximately 1 minute duration. The performance of the SRC based LID system, with and without channel compensation, was found to be significantly better than that of the CDS based system.

Martin et al. [11] used a syllable-length segmental framework to analyze how individual information sources contribute to overall language identification performance. The syllabic framework was achieved via a multilingual phone recognition system which used broad phonetic classes. Features derived to represent acoustic, prosodic and phonotactic information were then used to produce three separate models. The first series of experiments was conducted based on modeling acoustic features. When a baseline GMM system was compared with a GMM system which modeled the recognized segments, both systems achieved comparable levels of performance. A second series of experiments examined whether complementary phonotactic information could be extracted by using n-gram statistics over both short and extended segmental lengths. It was found that the use of unigram statistics for phone triplets provided significant improvements when used to complement existing Parallel Phone Recognition and Language Modeling (PPRLM) systems. Finally, a small set of fusion experiments was conducted in order to assess the degree of complementary information contained within the acoustic, phonotactic and prosodic systems. Martin et al. [11] combined the best performing acoustic system, based on Hidden Markov Model (HMM) models, with the pitch system and observed that it provided a minor improvement. The levels of performance achieved by the baseline prosodic system and that built under the syllabic framework were comparable, with results indicating that prosodic information can be used to obtain marginal improvements when combined with acoustic and phonotactic systems. However, the combination of the HMM acoustic, prosodic and phone-triplet unigram systems achieved similar levels of performance to the PPRLM system and, most importantly, the fusion of all systems resulted in an absolute improvement. In this study, two speech corpora were used, namely, the OGI-MLTS corpus and the Linguistic Data Consortium CallFriend (LDC CallFriend) database. From the OGI-MLTS corpus six languages were considered, namely, English, Hindi, Spanish, Mandarin, German and Japanese. On the other hand, the LDC CallFriend database consists of twelve languages, viz., Arabic, English, Farsi, French, German, Hindi, Japanese, Korean, Mandarin, Spanish, Tamil and Vietnamese, of which English, Mandarin and Spanish have a second dialect.


Zhang et al. [12] proposed an approach using Support Vector Machines (SVMs). In this approach a large background Gaussian Mixture Model (GMM) was used to perform the sequence mapping process, which maps a variable length sequence of feature vectors to one fixed length vector. Performance of the proposed approach was evaluated on speech from five languages, namely, English, German, Japanese, Mandarin and Spanish. Zhang et al. [12] too used the OGI-MLTS corpus for the speech utterances of the five languages. For each language, 50 speakers were selected as the training set and the duration of speech from each speaker was about 60 seconds. The testing set was made up of the rest of the speech utterances for the five languages from the same speech corpus. The feature vectors used consisted of 12 Mel-frequency Cepstral Coefficients (MFCCs). Experimental results demonstrated that their proposed system not only performed better than a GMM classifier but also outperformed the system using the Generalized Linear Discriminant Sequence (GLDS) kernel.
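
The sequence-mapping step can be illustrated with a simplified sketch. In the fragment below, a background GMM maps each variable-length utterance to a fixed-length vector of posterior-weighted mean statistics, and a linear SVM is trained on those vectors. This particular mapping is an assumption chosen for brevity and is not necessarily the one used by Zhang et al. [12]; the input variables are also assumed.

```python
# Hedged sketch of the "background GMM + SVM" idea: a background GMM maps a
# variable-length sequence of feature frames to one fixed-length vector
# (here: component-posterior-weighted mean statistics, a simplification),
# and an SVM is trained on those vectors.
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.svm import SVC

def train_background_gmm(all_training_frames, n_components=64, seed=0):
    """all_training_frames: (n_frames, n_dims) MFCC frames pooled over languages."""
    return GaussianMixture(n_components=n_components, covariance_type="diag",
                           random_state=seed).fit(all_training_frames)

def utterance_vector(frames, ubm):
    """Map one utterance (n_frames, n_dims) to a fixed-length vector."""
    post = ubm.predict_proba(frames)              # (n_frames, n_components)
    counts = post.sum(axis=0) + 1e-6              # soft occupancy per component
    means = post.T @ frames / counts[:, None]     # weighted mean per component
    return (means - ubm.means_).ravel()           # offsets from background means

def train_lid_svm(utterances, labels, ubm):
    """utterances: list of (n_frames, n_dims) arrays; labels: language tags."""
    X = np.vstack([utterance_vector(u, ubm) for u in utterances])
    return SVC(kernel="linear").fit(X, labels)
```
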
Noor and Aronowitz [13] showed that a speaker-specific anchor model representation could be used for language identification when combined with SVMs for performing the classification. The proposed system projected the given utterances in a language onto a speaker space using anchor modeling and then used an SVM to generalize them. They [13] too made use of the LDC CallFriend corpus, which consists of twelve languages, in order to evaluate their language identification system. Additional utterances in Russian were introduced only in the test data. One advantage of this method was that very little labeled data was required. The only labels used for training the SVM were taken from the National Institute of Standards and Technology Language Recognition Evaluation dataset (NIST LRE-03) development data, which consists of about 30-second utterances per language. This was found to be very helpful for automatic identification of languages that have few human-labeled examples available. Noor and Aronowitz [13] also proposed a more efficient way to calculate the speaker characterization vectors using test utterance parameterization instead of the classic Gaussian Mixture Model with Universal Background Model (GMM-UBM).

Manwani et al. [14] described an LID system using GMMs for the extracted features, trained using the Split and Merge Expectation Maximization (SMEM) algorithm. This approach improved the global convergence of the Expectation Maximization (EM) algorithm. A maximum likelihood classifier was used for identifying a language. Manwani et al. [14] tested their method on four languages, viz., Hindi, Telugu, Gujarati and English. For Hindi, the training speech corpus consisted of recordings from 27 speakers out of which 23 were male and 4 were female speakers, while the testing speech corpus consisted of recordings from 35 speakers out of which 31 were male and 4 were female speakers. The length of sentences ranged between 2 to 5 seconds. The total duration of training samples was 440 seconds and the number of training sentences was 135, while the total number of test utterances was 105. For Telugu, the training speech corpus consisted of recordings from 24 speakers out of which 20 were male and 4 were female speakers, while the testing speech corpus consisted of recordings from 22 speakers out of which 18 were male and 4 were female speakers. The length of sentences ranged between 3 to 9 seconds. The total duration of training samples was 440 seconds and the number of training sentences was 98, while the total number of test utterances was 62. The Gujarati training speech corpus consisted of recordings from 22 speakers out of which 18 were male speakers and 4 were female speakers, while the testing speech corpus consisted of recordings from 22 speakers out of which 18 were male and 4 were female speakers. The length of sentences ranged between 2 to 10 seconds. The total duration of training samples was 472 seconds and the number of training sentences was 132, while the total number of test utterances was 88. For English, the training speech corpus consisted of recordings from 25 speakers out of which 14 were male speakers and 11 were female speakers, while the testing speech corpus consisted of recordings from 28 speakers out of which 14 were male and 14 were female speakers. The length of sentences ranged between 2 to 10 seconds. The total duration of training samples was 420 seconds and the number of training sentences was 138, while the total number of test utterances was 91. The extracted features of the speech utterances were the MFCCs and their delta as well as delta-delta coefficients. These features were then modeled as GMMs and the SMEM algorithm was used to obtain the model parameters. It was observed that the use of SMEM overcame the difficulty of local maxima faced by the EM algorithm. Manwani et al. [14] basically showed that the accuracy of the system could be improved by using the split and merge EM algorithm.
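
The per-language GMM with maximum likelihood classification described above can be sketched as follows. Note that scikit-learn's GaussianMixture uses the plain EM algorithm, so the split-and-merge refinement of [14] is not reproduced here; the sketch only shows the baseline that SMEM improves upon, and the input variables are assumed.

```python
# Sketch of per-language GMM modeling with maximum-likelihood identification.
# scikit-learn's GaussianMixture uses plain EM; the split-and-merge EM (SMEM)
# refinement described above is not implemented here.
from sklearn.mixture import GaussianMixture

def train_language_gmms(frames_by_lang, n_components=16, seed=0):
    """frames_by_lang: dict lang -> (n_frames, n_dims) MFCC(+delta) frames."""
    return {lang: GaussianMixture(n_components=n_components,
                                  covariance_type="diag",
                                  random_state=seed).fit(frames)
            for lang, frames in frames_by_lang.items()}

def identify(test_frames, gmms):
    """Pick the language whose GMM gives the highest average frame log-likelihood."""
    scores = {lang: gmm.score(test_frames) for lang, gmm in gmms.items()}
    return max(scores, key=scores.get)
```
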
Kumar et al. [15] proposed a language identification system that uses hybrid robust feature extraction techniques. Speech recognizers use a parametric form of the signal in order to get the most important distinguishable features of the speech signal for the recognition task. They used MFCC and PLP coefficients along with two hybrid features: Bark Frequency Cepstral Coefficients (BFCC) and Revised Perceptual Linear Prediction (RPLP) coefficients, which were obtained from the combination of MFCC and PLP. Two different classifiers, namely, VQ with Dynamic Time Warping (DTW) and GMM, were used for the identification purpose. They evaluated the proposed system on three Indian languages, namely, Bengali, Hindi and Telugu. The speech corpora consisted of seven different speakers for each language and each speaker utterance was of one minute duration. All speakers of the respective languages uttered the same paragraph for one minute, recorded in a noise free environment. Apart from the Indian languages, they also worked on seven foreign languages, namely, Dutch, English, French, German, Italian, Russian and Spanish, which were downloaded from the internet. Thus, they worked on ten languages with seven different speakers for each language, giving a total of 70 utterances. The duration of the speech utterances of all languages ranged between 35 seconds to 70 seconds. They observed that all the feature extraction techniques that they had considered worked better with GMM as compared to VQ and DTW, since the Gaussian mixture language model falls into the implicit segmentation approach to language identification. It also provides a probabilistic model of the underlying sounds of a person's voice.


Reddy et al. [16] used both spectral and prosodic features for analyzing the language specific information present in speech. Spectral features extracted from frames of 20 milliseconds, individual pitch cycles and glottal closure regions were used for discriminating the languages. In addition to spectral features, prosodic features extracted at syllable, tri-syllable and multi-word levels were proposed for capturing the language specific information. The language specific prosody was represented by intonation, rhythm and stress features at the syllable and tri-syllable levels, whereas temporal variations in fundamental frequency, durations of syllables and temporal variations in intensities were used to represent the prosody at the multi-word level. GMM was used to capture the language specific information from the proposed features. Performance of the proposed features was analyzed on the Multi-lingual Indian Language Speech Corpus (IITKGP-MLILSC), which consists of 27 Indian languages, namely, Arunachali, Assamese, Bengali, Bhojpuri, Chattisgarhi, Dogri, Gojri, Gujrati, Hindi, Indian English, Kannada, Kashmiri, Konkani, Manipuri, Mizo, Malayalam, Marathi, Nagamese, Nepali, Oriya, Punjabi, Rajasthani, Sanskrit, Sindhi, Tamil, Telugu and Urdu. Every language in the database consists of speech from at least ten speakers. From each speaker, about 5-10 minutes of speech was collected. On the whole, each language contains a minimum of 1 hour of speech. Reddy et al. [16] also used the OGI-MLTS database, which consists of 11 languages, namely, English, Farsi, French, German, Japanese, Korean, Mandarin Chinese, Spanish, Tamil, Vietnamese and Hindi, for analyzing the language recognition accuracy.

Roy et al. [17] discussed the comparison of VQ and GMM classification techniques based on four Indian languages, viz., Assamese, Bengali, English and Hindi. The database used consisted of speech recorded from 50 speakers. Each speaker was asked to repeat each sentence 20 times in all four languages, resulting in 4000 samples for training. For testing, the 50 speakers spoke each sentence 5 times in each language, resulting in a total of 1000 test samples. Each sample was of 2 to 3 seconds duration. MFCC was used as the feature extraction technique and VQ was found to work better than GMM. The VQ model was more efficient for all the four languages at higher codebook sizes.

Sengupta and Saha [18] worked on identification of the major language families of India. They [18] extended the language identification framework to capture features common to language families and developed models which could efficiently represent the language families. Sengupta and Saha [18] used MFCC and Speech Signal based Frequency Cepstral Coefficients (SFCC) as the primary feature extraction tools, which were combined with Shifted Delta Coefficients (SDC) to obtain the final set of features. GMM and SVM were used as the modeling tools. Different combinations of the feature extraction methods and the modeling tools were used to develop four main systems, viz., MFCC+SDC+GMM, SFCC+SDC+GMM, MFCC+SDC+SVM and SFCC+SDC+SVM. These systems were applied to identify the two major language families of India, viz., Indo-Aryan and Dravidian, which in total consisted of 22 languages. The Indo-Aryan family consisted of 18 languages, viz., Assamese, Bengali, Bhojpuri, Chhattisgarhi, Dogri, English, Gujarati, Hindi, Kashmiri, Konkani, Manipuri, Marathi, Nagamese, Odia, Punjabi, Sanskrit, Sindhi and Urdu, while the Dravidian family consisted of 4 languages, viz., Kannada, Malayalam, Tamil and Telugu. The speech corpora were prepared from the All India Radio website, since the quality of speech was good and the speech was of sufficiently long duration. A large number of speakers, both male and female, were also available for every language. It was observed that all the four systems could identify the language families with high accuracy. The influence of one language family on the other was also evaluated and in most of the cases the neighboring languages were found to be influenced more by the other family.
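
Shifted delta coefficients are computed by stacking delta features taken at several shifted offsets. A small sketch of the standard N-d-P-k computation is given below; the 7-1-3-7 values are a commonly used configuration and an assumption here, not necessarily the setting used in [18].

```python
# Sketch of shifted delta coefficient (SDC) computation with the usual
# N-d-P-k parameterization. For each frame t and block i in 0..k-1 the
# delta c(t + i*P + d) - c(t + i*P - d) is taken over the first N cepstra;
# the k deltas are concatenated into one feature vector per frame.
import numpy as np

def sdc(cepstra, N=7, d=1, P=3, k=7):
    """cepstra: (n_frames, n_coeffs) array. Returns (n_frames, N*k) SDC features;
    frames that would index outside the utterance are zero-padded."""
    c = cepstra[:, :N]
    T = c.shape[0]
    padded = np.vstack([np.zeros((d, N)), c, np.zeros((d + (k - 1) * P, N))])
    blocks = []
    for i in range(k):
        shift = i * P
        delta = padded[shift + 2 * d: shift + 2 * d + T] - padded[shift: shift + T]
        blocks.append(delta)
    return np.hstack(blocks)
```
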
Muthusamy et al. [19] adopted a segment-based approach which is based on the idea that the acoustic structure of languages can be estimated by segmenting speech into broad phonetic categories. They [19] observed that LID could be achieved by computing features that describe the phonetic and prosodic characteristics of a language and by using these feature measurements to train a classifier to distinguish between languages. A multi-language, neural network-based segmentation and broad classification algorithm was first developed using seven broad phonetic categories. This algorithm was trained and tested on separate sets of speakers of American English, Japanese, Mandarin Chinese and Tamil. The speech corpora consisted of natural continuous speech from twelve native speakers for each of the four languages, out of which six were male speakers and six were female speakers. The recording was done in the laboratory and the age of the female speakers ranged between 15 to 70 years while that of the male speakers ranged between 18 to 71 years. Each speaker was told to speak 15 conversational sentences on any topic of personal choice, to ask two questions, and to recite the days of the week, the months of the year and the numbers 0 to 10, for a total of 20 utterances in his/her native language. This system gave an accuracy of 82.3% on the utterances of the test set.

Muthusamy et al. [20] continued the work further by developing a four-language LID system based on the same four languages they used previously [19]. The system used a neural network-based segmentation algorithm to segment speech into seven broad phonetic categories. Phonetic and prosodic features computed on these categories were then input to a second network that performed the language classification. The system was trained and tested on separate sets of speakers of American English, Japanese, Mandarin Chinese and Tamil. The training set contained 12 speakers from each language with 10 or 20 utterances per speaker, for a total of 440 utterances. The development test set contained a different group of 2 speakers per language with 20 utterances from each speaker, for a total of 160 utterances. The final test set had 6 speakers per language with 10 or 20 utterances per speaker, for a total of 440 utterances. The average duration of the utterances in the training set was 5.1 seconds and that of the test sets was 5.5 seconds. Approximately 15% of the utterances in the training and testing sets consisted of a fixed vocabulary of the days of the week, the months of the year and the numbers zero through ten. Their results indicated that the system performed better on longer utterances. Furthermore, their system gave an accuracy of 89.5% without using any spectral information in the classifier feature set.


Montavon [21], on the other hand, used two datasets, viz., VoxForge and RadioStream, each consisting of English, French and German, in order to evaluate his system. The VoxForge dataset contains 5-second speech samples associated with different metadata, including the language of the sample. Since the speech samples were recorded by speakers using their own microphones, quality varies significantly between different samples. This dataset contains 25420 English samples, 4021 French samples and 2963 German samples. The RadioStream dataset consists of samples ripped from several web radios. Furthermore, it has the advantage of containing a virtually infinite number of samples that are of excellent quality. Montavon [21] suggested a deep architecture that learnt features automatically for the language identification task. The classifier was trained and evaluated on balanced classes, i.e., 33% English samples, 33% French samples and 33% German samples. Each sample corresponded to a speech signal of 5 seconds. The proposed classifier mapped spectrograms into languages and was implemented as a Time-Delay Neural Network (TDNN) with two-dimensional convolutional layers as feature extractors. The implementation of the TDNN performs a simple summation on the outputs of the convolutional layers. The results showed that the deep architecture could identify the three languages with 83.5% accuracy on 5-second speech samples coming from RadioStream and with 80.1% accuracy on 5-second speech samples coming from VoxForge. The deep architecture was also compared with a shallow architecture and it was observed that the deep architecture was 5-10% more accurate than the shallow architecture.

Jothilakshmi et al. [22] also used the type of language family to which a language belonged as a distinguishing factor and proposed a hierarchical language identification system for Indian languages. Since nearly 98% of the people in India speak languages from the Aryan and Dravidian families, the system they proposed was designed to identify the languages of these two families. In the first level of the proposed system, the family of the spoken language was identified and then this information was given as input to the second level in order to identify the particular language in the corresponding family. A database consisting of nine languages was prepared. The Dravidian family consisted of Tamil, Telugu, Kannada and Malayalam, while the Indo-Aryan family consisted of Hindi, Bengali, Marathi, Gujarati and Punjabi. The database consisted of a total of nine hours of broadcast data from the Doordarshan television network. The performance of the system was analyzed for various acoustic features and different classifiers. The proposed system was modeled using HMM, GMM and Artificial Neural Networks (ANN). Jothilakshmi et al. [22] also studied the discriminative power of the system for the features, namely, MFCC, MFCC with delta and acceleration coefficients, and Shifted Delta Cepstral (SDC) coefficients. The GMM based LID system using MFCC with delta and acceleration coefficients was found to perform well, with an accuracy of 80.56%. The performance of the GMM based LID system with SDC was also considerable.
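
The two-level decision used in such hierarchical systems can be sketched as follows: a first classifier picks the language family, and a second classifier picks the language within that family. In the sketch below, per-family and per-language GMMs are used purely as stand-ins for the HMM, GMM and ANN classifiers studied in [22], and the data variables are assumed.

```python
# Hedged sketch of a two-level hierarchical LID decision: first pick the
# language family, then pick a language within that family.
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_gmm(frames, n_components=16, seed=0):
    return GaussianMixture(n_components=n_components, covariance_type="diag",
                           random_state=seed).fit(frames)

def train_hierarchy(frames_by_family_lang):
    """frames_by_family_lang: dict family -> dict lang -> (n_frames, n_dims) array."""
    family_gmms, language_gmms = {}, {}
    for family, langs in frames_by_family_lang.items():
        family_gmms[family] = fit_gmm(np.vstack(list(langs.values())))
        language_gmms[family] = {lang: fit_gmm(f) for lang, f in langs.items()}
    return family_gmms, language_gmms

def identify(test_frames, family_gmms, language_gmms):
    family = max(family_gmms, key=lambda f: family_gmms[f].score(test_frames))
    lang = max(language_gmms[family],
               key=lambda l: language_gmms[family][l].score(test_frames))
    return family, lang
```
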
Lopez-Moreno et al. [23] experimented with the use of Deep Neural Networks (DNNs) for the problem of language identification. They compared the proposed system with an i-vector based acoustic system and extracted 39 Perceptual Linear Prediction (PLP) features for both the DNN based and the i-vector based systems. In this experiment two different datasets were used, viz., the Google 5M LID corpus (a Google language identification corpus with 5 million utterances) and NIST LRE'09. The Google 5M LID corpus consists of twenty five languages and nine dialects. From the LRE'09 dataset, Lopez-Moreno et al. [23] selected eight representative languages, viz., US English, Spanish, Dari, French, Pashto, Russian, Urdu and Chinese Mandarin, for which at least 200 hours of audio were available. For the Google 5M LID corpus, 87.5 hours of speech per language were used, resulting in a total of 2975 hours of speech. When the systems were used on the Google 5M LID corpus, the i-vector system gave similar performance for the discriminative back-end, Logistic Regression (LR), and the generative ones, Linear Discriminant Analysis (LDA_CD) and the one based on a single Gaussian with a shared covariance matrix across the languages (1G_SC). Lopez-Moreno et al. [23] also observed that increasing to two Gaussians and allowing individual covariance matrices gave a relative improvement of 19%. However, the best performance was achieved by the DNN system, especially when the proposed eight hidden layer DNN architecture was used. Similar results were noticed when the developed systems were used on the LRE'09 dataset.
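
The frame-by-frame DNN approach can be illustrated with a minimal sketch: each context-stacked feature frame is classified independently and the per-frame log-posteriors are averaged over the utterance. In the fragment below, scikit-learn's MLPClassifier stands in for the much deeper network of [23], and the feature arrays are assumed inputs.

```python
# Minimal sketch of frame-level DNN language classification: each stacked
# feature frame is classified independently and the per-frame log-posteriors
# are averaged over the utterance.
import numpy as np
from sklearn.neural_network import MLPClassifier

def stack_context(frames, left=10, right=5):
    """Concatenate each frame with its neighbours (edge frames are repeated)."""
    T = frames.shape[0]
    idx = np.clip(np.arange(-left, right + 1)[None, :] + np.arange(T)[:, None], 0, T - 1)
    return frames[idx].reshape(T, -1)

def train_frame_dnn(train_utts, train_langs, hidden=(256, 256)):
    """train_utts: list of (n_frames, n_dims) arrays; train_langs: language labels."""
    X = np.vstack([stack_context(u) for u in train_utts])
    y = np.concatenate([[lang] * len(u) for lang, u in zip(train_langs, train_utts)])
    return MLPClassifier(hidden_layer_sizes=hidden, max_iter=50).fit(X, y)

def identify(test_frames, dnn):
    log_post = np.log(dnn.predict_proba(stack_context(test_frames)) + 1e-10)
    return dnn.classes_[int(np.argmax(log_post.mean(axis=0)))]
```
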
Lopez-Moreno et al. [24] further worked on DNN based automatic language identification systems and went on to propose two more systems. In the first one, the DNN acted as an end-to-end LID classifier, receiving as input the speech features and providing as output the estimated probabilities of the target languages. In the second approach, the DNN was used to extract bottleneck features that were then used as inputs for a state-of-the-art i-vector system. They evaluated their language identification systems on the NIST LRE 2009 dataset, consisting of 23 languages with significantly different amounts of available data for each target language, and on a subset of the Voice of America (VOA) data from LRE'09, consisting of eight languages with an equal quantity of data for each target language. Results for both datasets showed that the DNN based systems significantly outperformed a state-of-the-art i-vector system when dealing with short duration utterances. Furthermore, the combination of the DNN based and the classical i-vector systems led to additional performance improvements.


Gazeau and Varol [25] worked on language identification of four languages, viz., French, English, Spanish and German. As far as the speech corpus is concerned, they chose speech samples from Shtooka, VoxForge and Youtube. For testing, apart from the speech samples from the mentioned sources, they also used personally recorded voices. They used neural network architectures, SVM and HMM for the identification purpose and reached the conclusion that HMM yields the best results, with an accuracy of about 70%.

III. BRIEF SURVEY OF LITERATURE ON DIALECT IDENTIFICATION

Dialect identification is equally important as language identification because people usually communicate in dialects. Zissman et al. [26] showed that the Phone Recognition and Language Modeling (PRLM) approach yields good results classifying the Cuban and Peruvian dialects of the Spanish language. They introduced the Miami corpus, which was designed to be a new Spanish speech corpus specifically for the purpose of dialect identification. A variety of Spanish speech was collected, including: spontaneous speech, in which speakers gave answers to Spanish questions designed to elicit long stretches of uninterrupted Spanish speech; read paragraphs, in which each speaker read three paragraphs consisting of a phonetically balanced paragraph, a variable paragraph from a textbook about Spanish culture and a variable paragraph from a newspaper; fill-in-the-blank sentences, which were very simple questions designed to elicit predictable text responses; read sentences that were rich in words useful for analyzing dialects; and digits spoken twice from 0 to 10 in random order. After the Spanish speech was collected, the same type of speech was elicited in English. Altogether, each speaker in the corpus spoke for about 20 to 30 minutes. The dialect identification experiments were conducted using speech from 143 Cuban and Peruvian speakers. During training, three minutes of spontaneous Spanish speech from each speaker in the Cuban training set were processed by the English phone recognizer and the Cuban language model statistics were computed. This step was repeated for the Peruvian speakers, from which a Peruvian model was created. After the two language models were developed, test-speaker spontaneous speech was processed and a dialect identification decision was produced. The test utterances were also three minutes long. As mentioned above, the PRLM based dialect identification algorithm was used in this study.
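
The PRLM recipe can be sketched as follows: a single phone recognizer tokenizes the speech, and one smoothed phone-bigram model per dialect scores the resulting token stream. In the fragment below, recognize_phones is a purely hypothetical placeholder for a real phone recognizer; the rest shows only the language-model side of the approach.

```python
# Minimal PRLM-style sketch: one phone recognizer tokenizes speech, and a
# smoothed phone-bigram language model per dialect scores the phone stream.
# `recognize_phones` is a hypothetical placeholder for a real phone recognizer.
import math
from collections import Counter

def recognize_phones(utterance_audio):
    """Placeholder: return a list of phone labels for the utterance."""
    raise NotImplementedError("plug in a real phone recognizer here")

def train_bigram_lm(phone_sequences):
    bigrams, unigrams = Counter(), Counter()
    for seq in phone_sequences:
        unigrams.update(seq)
        bigrams.update(zip(seq, seq[1:]))
    vocab_size = len(unigrams)
    def logprob(seq):
        # add-one smoothed bigram log-probability of a phone sequence
        return sum(math.log((bigrams[(a, b)] + 1) / (unigrams[a] + vocab_size))
                   for a, b in zip(seq, seq[1:]))
    return logprob

def train_prlm(train_audio_by_dialect):
    return {d: train_bigram_lm([recognize_phones(u) for u in utts])
            for d, utts in train_audio_by_dialect.items()}

def identify_dialect(test_audio, dialect_lms):
    phones = recognize_phones(test_audio)
    return max(dialect_lms, key=lambda d: dialect_lms[d](phones))
```
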
Torres-Carrasquillo et al. [27] focused on applying some of the techniques developed for language identification to the area of dialect identification. They employed GMMs with shifted delta cepstral features (GMM-SDC) for the purpose of dialect identification. They used the CALLFRIEND corpus and the Miami corpus for their data. The CALLFRIEND corpus consists of twelve languages, including two dialects for three of the languages, recorded over domestic telephone lines. They basically used the dialects available for the English, Mandarin and Spanish languages. Each of these languages included two dialects, namely, North and South for English, Mandarin and Taiwanese for Chinese, and Caribbean and Non-Caribbean for Spanish. The training includes 20 conversations for each dialect in the three languages, resulting in 40 conversations per language, and each conversation is about 30 minutes long. It also consists of 80 testing utterances per dialect, except for English where an additional group of about 320 utterances is included, and each of these utterances is about 30 seconds long. The Miami corpus consists of two dialects of the Spanish language, namely, Cuban Spanish and Peruvian Spanish. They observed that the performance obtained by the GMM based system for the Miami corpus was lower than that obtained by Zissman et al. [26] when they used the Miami corpus for dialect identification. However, the system provided very good performance for two of the dialects in the CALLFRIEND corpus. They also observed that although the technique was ported from language identification without any specialization for the purpose of dialect identification, the results obtained were promising.

Torres-Carrasquillo et al. [28] continued their work on dialect identification and used three GMM based systems for identifying American vs. Indian English, four Chinese dialects and three Arabic dialects. Two of these tasks, i.e., the Chinese dialects and the English dialects, come from the NIST Language Recognition Evaluation (LRE) 2007 campaign, while the third dialect discrimination task comes from the LDC Arabic corpus. The Chinese dialects included Cantonese, Mandarin, MinNan and Wu, while the Arabic dialects included Gulf, Iraqi and Levantine. They developed three systems, viz., a baseline GMM-UBM, a GMM-UBM with feature compensation using Eigen-channel compensation, and a GMM-UBM with maximum mutual information (MMI) training along with feature compensation using Eigen-channel compensation. They observed that all the three systems showed similar behavior, with the third system showing the best results.
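
The baseline GMM-UBM recipe mentioned above can be sketched as follows: a universal background model is trained on pooled data, each dialect model is obtained by mean-only MAP adaptation of the UBM, and test utterances are scored by log-likelihood ratio against the UBM. The relevance factor of 16 is a conventional choice and an assumption here; the Eigen-channel and MMI refinements of [28] are not shown.

```python
# Hedged sketch of a baseline GMM-UBM dialect recognizer: a universal
# background model (UBM) is trained on pooled data, each dialect model is a
# mean-only MAP adaptation of the UBM, and test utterances are scored by
# average log-likelihood ratio against the UBM.
import numpy as np
from copy import deepcopy
from sklearn.mixture import GaussianMixture

def train_ubm(pooled_frames, n_components=256, seed=0):
    return GaussianMixture(n_components=n_components, covariance_type="diag",
                           random_state=seed).fit(pooled_frames)

def map_adapt_means(ubm, frames, relevance=16.0):
    """Return a copy of the UBM whose means are MAP-adapted to `frames`."""
    post = ubm.predict_proba(frames)                      # (T, C) responsibilities
    n_c = post.sum(axis=0)                                # soft counts per component
    ex = post.T @ frames / np.maximum(n_c, 1e-8)[:, None]
    alpha = (n_c / (n_c + relevance))[:, None]
    adapted = deepcopy(ubm)
    adapted.means_ = alpha * ex + (1.0 - alpha) * ubm.means_
    return adapted

def identify_dialect(test_frames, dialect_models, ubm):
    """dialect_models: dict dialect -> MAP-adapted GaussianMixture."""
    ubm_score = ubm.score(test_frames)
    scores = {d: m.score(test_frames) - ubm_score for d, m in dialect_models.items()}
    return max(scores, key=scores.get)
```
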
Ma et al. [29] worked on dialect identification of three Chinese dialects, namely, Mandarin, Cantonese and Shanghainese, using GMMs. Their corpora had about 10 hours of speech data in each of the three dialects as training data. In order to have a clear picture of the relationship between the amount of training data and identification accuracy, the GMM models were trained with different amounts of training data of 1, 2, 4, 6, 8 and 10 hours. For the evaluation, 1000 speech segments for each of the four durations of 3, 6, 9 and 15 seconds, respectively, were used as test data. They discussed that since Chinese is a tone rich language with multiple intonations, the intonations are important information for people to understand spoken Chinese. Different Chinese dialects have different numbers of intonations and different patterns of intonations, so they deduced that better performance on Chinese dialect identification could be achieved by making good use of such discriminative information. Instead of calculating fundamental frequency (F0) features explicitly, they extracted frame-based multi-dimensional tone relevant features based on the pitch flux in the continuous speech signal. Covariance coefficients between the autocorrelations of two adjacent frames were estimated to serve as such features. These pitch flux features were applied as a separate feature stream to provide additional discriminative information on top of the MFCC feature stream. Each of the two streams was modeled by a GMM. They observed that by fusing the pitch flux feature stream with the MFCC stream, the error rate was reduced by more than 30% as compared to when only the MFCC feature stream was used, even when the test speech segments were as short as 3 seconds.


Shen et al. [30] described a dialect recognition system that made use of adapted phonetic models per dialect, applied in a PRLM framework, to distinguish between American vs. Indian English and two Mandarin dialects (Mainland and Taiwanese). They trained systems for each language using data from the CALLFRIEND corpus, the Language Recognition Evaluation 2005 (LRE'05) test set, data from OGI's foreign accented English, and LDC's MIXER and FISHER corpora. In total, 104 and 20.14 hours of data were used to adapt and train the PRLM and adapted phonetic models for English and Mandarin, respectively. In each task, the performance of the adapted phonetic model system was compared with a baseline GMM model. They observed that the adapted phonetic model system was capable of good performance for the dialect recognition problem without phonetically word transcribed data. Furthermore, this model could be combined with PRLM to improve performance. It was noticed that the combination of this system with PRLM outperformed combinations of PRLM with GMM based models, and the combination of all three systems could further improve the performance.

Alorfi [31] used an ergodic Hidden Markov Model (HMM) to identify two Arabic dialects, namely, Gulf and Egyptian Arabic. Apart from using the CALLHOME Egyptian Arabic Speech corpus from the LDC database, he created an additional database for his work by recording TV soap operas containing both Egyptian and Gulf dialects. However, these recordings often contained background noises such as echoes, coughs, laughter and background music. The overall condition of this database was poor compared to other standard speech databases. Furthermore, the additional database consisted of recordings from only male speakers. For the Egyptian dialect, he used a combination of twenty male speakers from the CALLHOME database and twenty male speakers from the TV recordings database. The speech of ten speakers from each database was used for training and the speech from the other ten was used for testing. The speech for training from each speaker was one minute long. The speech used for the Gulf dialect was solely from the TV recordings database. The speech from 10 male speakers was used for training while a different set of 10 speakers was used for testing. He utilized many different combinations of speech features related to MFCC, such as time derivatives, energy and the shifted delta cepstra, in training and testing the system. Due to the similarities and differences between the Arabic dialects, he developed an ergodic HMM that had two states, viz., one of them represented the sounds common across Arabic dialects while the other represented the sounds unique to the specific dialect. The best result of the Arabic dialect identification system was 96.67% correct identification.
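
The two-state ergodic HMM idea can be sketched with the hmmlearn library: one HMM per dialect is trained on MFCC frame sequences, and a test utterance is assigned to the dialect whose model yields the higher log-likelihood. Diagonal Gaussian emissions and the data layout below are assumptions rather than Alorfi's exact configuration.

```python
# Sketch of two-state ergodic HMM dialect modeling with hmmlearn: one HMM per
# dialect is trained on concatenated MFCC frames, and a test utterance goes to
# the dialect whose model gives the higher log-likelihood.
import numpy as np
from hmmlearn import hmm

def train_dialect_hmms(utterances_by_dialect, n_states=2, seed=0):
    """utterances_by_dialect: dict dialect -> list of (n_frames, n_dims) arrays."""
    models = {}
    for dialect, utts in utterances_by_dialect.items():
        X = np.vstack(utts)
        lengths = [len(u) for u in utts]
        model = hmm.GaussianHMM(n_components=n_states, covariance_type="diag",
                                n_iter=20, random_state=seed)
        models[dialect] = model.fit(X, lengths)   # full transition matrix = ergodic
    return models

def identify_dialect(test_frames, models):
    return max(models, key=lambda d: models[d].score(test_frames))
```
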
Biadsy et al. [32] used the PPRLM framework with nine phone recognizers to distinguish between the Arabic dialects, namely, Gulf Arabic, Iraqi Arabic, Levantine Arabic, Egyptian Arabic and Modern Standard Arabic (MSA). They were able to obtain corpora from the Linguistic Data Consortium (LDC) with similar recording conditions for the first four Arabic dialects. These are corpora of spontaneous telephone conversations produced by native speakers of the dialects, speaking with family members, friends and unrelated individuals, sometimes about predetermined topics. They used the speech files of about 965 speakers, with a total of about 41.02 hours of speech, from the Gulf Arabic conversational telephone speech database, out of which 150 speakers, amounting to about 6.06 hours of speech, were set apart for testing. They used the Iraqi Arabic conversational telephone speech database for the Iraqi dialect, selecting 475 speakers with a total duration of about 25.73 hours of speech, out of which 150 speakers, amounting to about 7.33 hours of speech, were kept aside for testing. Their Levantine data consisted of 1258 speakers from the Arabic CTS Levantine Fisher Training Data Set with a total duration of 78.79 hours of speech. Here too they kept aside 150 speakers, amounting to about 10 hours of speech, for testing. For their Egyptian data, they used CallHome Egyptian and its Supplement. They used 398 speakers with a total duration of 75.7 hours of speech, keeping aside 150 speakers, amounting to 28.7 hours of speech, for testing. For MSA, they used TDT4 Arabic broadcast news, since no database similar to the other dialects was available, and used about 47.6 hours of speech. For testing, they again kept aside 150 speakers, amounting to about 12.06 hours of speech. They observed that their system gave good accuracy and that the most distinguishable dialect among the five variants was MSA.

Salameh et al. [33] also performed Arabic dialect identification, on a large-scale collection of parallel sentences that covered the dialects of 25 Arab cities in addition to English, French and MSA. They worked on two corpora, viz., Corpus-26 and Corpus-6. Corpus-26 consists of 2000 sentences translated into the dialects of 25 cities, while Corpus-6 has 10,000 additional sentences translated into the dialects of five cities, namely, Beirut, Cairo, Doha, Tunis and Rabat. They basically presented results on a fine-grained classification task, and the system they developed could identify the exact city of the speaker with an accuracy of 67.9% for sentences with an average length of seven words and with an accuracy of 90% for sentences with 16 words.

Bougrine et al. [34] worked on Spoken Algerian Arabic Dialect Identification (SAADID) and proposed a new system based on prosodic speech information, viz., intonation and rhythm. They used SVM as the modeling technique and worked on identification of six dialects spoken in the departments of Algiers, Adrar, Bousaâda, Djelfa, Laghouat and Oran. The speech corpus they developed consisted of speech from their own recording database (OR) and speech extracted from reports selected from regional radio and TV stations (RTV). The OR database consists of 1.5 hours of recordings from 34 speakers, and each speaker recorded 57 sentences. The RTV database consists of 10 sentences of MSA and 47 sentences of dialect speech, wherein the dialect speech consists of free responses, free translation of phrases, a short text story and a semi-guided narration obtained from images without text. Their results showed an accuracy of 69% when test utterances of 2 seconds were used.

Rao and Koolagudi [35] used both spectral and prosodic features to identify five Hindi dialects, viz., Chattisgharhi (spoken in Central India), Bengali (Bengali accented Hindi spoken in Eastern India), Marathi (Marathi accented Hindi spoken in Western India), General (Hindi spoken in Northern India) and Telugu (Telugu accented Hindi spoken in Southern India). For each dialect, their database consisted of data from 10 speakers, out of which 5 were male and 5 were female speakers, speaking spontaneously for about 5-10 minutes each, resulting in a total of 1-1.5 hours per dialect. The spectral features were represented by MFCCs, while the prosodic features were represented by durations of syllables, pitch and energy contours. They used Auto-associative Neural Network (AANN) models with SVM as the modeling technique. Their dialect identification system showed a recognition performance of 81%.


IV. CONCLUSION

From the survey discussed in this paper, it can be observed that while extensive work has been done on language identification, the same cannot be said for dialect identification. One of the reasons for this can be the lack of databases, which is also observed from the above discussion. While popular speech databases such as OGI-MLTS, LDC CallFriend, NIST LRE, IITKGP-MLILSC, VoxForge, RadioStream, Google 5M LID, etc. are available for research in language identification, the same is not the case in dialect identification. Most researchers have had to develop their own speech corpus, since very few speech databases are available for dialects.

Language and dialect identification systems also help in preserving a language/dialect. The number of languages spoken in the world is estimated to be between six and seven thousand [36]. However, as can be observed from the survey discussed in this paper, research in the field of language and dialect identification is restricted to a few languages/dialects. We need to diversify research in this field in order to include language and dialect identification systems for varied languages and dialects.

REFERENCES

[1] H. Li, B. Ma and K. A. Lee, "Spoken Language Recognition: From Fundamentals to Practice", Proceedings of the IEEE, vol. 101, pp. 1136-1159, 2013.
[2] Y. K. Muthusamy, N. Jain and R. A. Cole, "Perceptual Benchmarks for Automatic Language Identification", in Proc. of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Adelaide, 1994, pp. 333-336.
[3] H. Behravan, "Dialect and Accent Recognition", Master's Thesis, University of Eastern Finland, 2012.
[4] F. Biadsy, "Automatic Dialect and Accent Recognition and its Application to Speech Recognition", PhD Thesis, Columbia University, 2011.
[5] A. Etman and A. A. L. Beex, "Language and Dialect Identification: A Survey", in Proc. of the 2015 SAI Intelligent Systems Conference, London, 2015, pp. 220-231.
[6] M. Sugiyama, "Automatic Language Recognition Using Acoustic Features", in Proc. of the IEEE International Conference on Acoustics, Speech and Signal Processing, Toronto, 1991, pp. 813-816.
[7] J. Balleda, H. A. Murthy and T. Nagarajan, "Language Identification from Short Segments of Speech", in Proc. of the 6th International Conference on Spoken Language Processing, Beijing, 2000, pp. 1033-1036.
[8] T. Nagarajan and H. A. Murthy, "Language Identification Using Parallel Syllable Like Unit Recognition", in Proc. of the International Conference on Acoustics, Speech and Signal Processing, Canada, 2004, pp. 401-404.
[9] M. Padro and L. Padro, "Comparing Methods for Language Identification", Procesamiento del Lenguaje Natural, vol. 33, pp. 155-161, 2004.
[10] O. P. Singh, B. C. Harris, R. Sinha, B. Chetri and A. Pradhan, "Sparse Representation Based Language Identification Using Prosodic Features for Indian Languages", in Proc. of the 2013 Annual IEEE India Conference, Mumbai, 2013, pp. 1-5.
[11] T. Martin, B. Baker, E. Wong and S. Sridharan, "A Syllable-Scale Framework for Language Identification", Computer Speech & Language, vol. 20, pp. 276-302, 2006.
[12] W. Zhang, B. Li, D. Qu and B. Wang, "Automatic Language Identification Using Support Vector Machines", in Proc. of the 8th International Conference on Signal Processing, Beijing, 2006.
[13] E. Noor and H. Aronowitz, "Efficient Language Identification Using Anchor Models and Support Vector Machines", in Proc. of the Speaker and Language Recognition Workshop, Puerto Rico, 2006, pp. 1-6.
[14] N. Manwani, S. K. Mitra and M. V. Joshi, "Spoken Language Identification for Indian Languages Using Split and Merge EM Algorithm", in Proc. of the 2nd International Conference on Pattern Recognition and Machine Intelligence, Kolkata, 2007, pp. 463-468.
[15] P. Kumar, A. Biswas, A. N. Mishra and M. Chandra, "Spoken Language Identification Using Hybrid Feature Extraction Methods", Journal of Telecommunication, vol. 1, pp. 11-15, 2010.
[16] V. R. Reddy, S. Maity and K. S. Rao, "Identification of Indian Languages Using Multi-Level Spectral and Prosodic Features", International Journal of Speech Technology, vol. 16, pp. 489-511, 2013.
[17] P. Roy and P. K. Das, "Comparison of VQ and GMM Approach for Identifying Indian Languages", International Journal of Applied Pattern Recognition, vol. 1, pp. 99-107, 2013.
[18] D. Sengupta and G. Saha, "Identification of the Major Language Families of India and Evaluation of Their Mutual Influence", Current Science, vol. 110, pp. 667-681, 2016.
[19] Y. K. Muthusamy, R. A. Cole and M. Gopalakrishnan, "A Segment-Based Approach to Automatic Language Identification", in Proc. of the 1991 IEEE International Conference on Acoustics, Speech and Signal Processing, 1991, pp. 353-356.
[20] Y. K. Muthusamy and R. A. Cole, "A Segment-Based Automatic Language Identification System", in Advances in Neural Information Processing Systems 4, J. E. Moody, S. J. Hanson and R. P. Lippmann, Eds. San Mateo: Morgan Kaufmann, 1992, pp. 241-248.
[21] G. Montavon, "Deep Learning for Spoken Language Identification", in Proc. of the NIPS Workshop on Deep Learning for Speech Recognition and Related Applications, Vancouver, 2009, pp. 1-4.
[22] S. Jothilakshmi, V. Ramalingam and S. Palanivel, "A Hierarchical Language Identification System for Indian Languages", Digital Signal Processing, vol. 22, pp. 544-553, 2012.
[23] I. Lopez-Moreno, J. Gonzalez-Dominguez, O. Plchot, D. Martinez, J. Gonzalez-Rodriguez and P. Moreno, "Automatic Language Identification Using Deep Neural Networks", in Proc. of the IEEE International Conference on Acoustics, Speech and Signal Processing, Italy, 2014, pp. 5337-5341.
[24] I. Lopez-Moreno, J. Gonzalez-Dominguez, D. Martinez, O. Plchot, J. Gonzalez-Rodriguez and P. J. Moreno, "On the Use of Deep Feedforward Neural Networks for Automatic Language Identification", Computer Speech & Language, vol. 40, pp. 46-59, 2016.
[25] V. Gazeau and C. Varol, "Automatic Spoken Language Recognition with Neural Networks", International Journal of Information Technology and Computer Science, vol. 8, pp. 11-17, 2018.
[26] M. A. Zissman, T. Gleason, D. Rekart and B. Losiewicz, "Automatic Dialect Identification of Extemporaneous Conversational, Latin American Spanish Speech", in Proc. of the IEEE International Conference on Acoustics, Speech and Signal Processing, Atlanta, 1996, pp. 777-780.
[27] P. A. Torres-Carrasquillo, T. P. Gleason and D. A. Reynolds, "Dialect Identification Using Gaussian Mixture Models", in Proc. of the Speaker and Language Recognition Workshop, Toledo, 2004, pp. 297-300.
[28] P. A. Torres-Carrasquillo, D. E. Sturim, D. A. Reynolds and A. McCree, "Eigen-Channel Compensation and Discriminatively Trained Gaussian Mixture Models for Dialect and Accent Recognition", in Proc. of the 9th Annual Conference of the International Speech Communication Association, Brisbane, 2008, pp. 723-726.
[29] B. Ma, D. Zhu and R. Tong, "Chinese Dialect Identification Using Tone Features Based on Pitch Flux", in Proc. of the International Conference on Acoustics, Speech and Signal Processing, Toulouse, 2006, pp. 1029-1032.
[30] W. Shen, N. Chen and D. Reynolds, "Dialect Recognition Using Adapted Phonetic Models", in Proc. of the 9th Annual Conference of the International Speech Communication Association, Brisbane, 2008, pp. 763-766.
[31] F. S. Alorfi, "Automatic Identification of Arabic Dialects Using Hidden Markov Models", PhD Thesis, University of Pittsburgh, 2008.


[32] F. Biadsy, J. Hirschberg and N. Habash, "Spoken Arabic Dialect Identification Using Phonotactic Modeling", in Proc. of the EACL 2009 Workshop on Computational Approaches to Semitic Languages, Athens, 2009, pp. 53-61.
[33] M. Salameh, H. Bouamor and N. Habash, "Fine-Grained Arabic Dialect Identification", in Proc. of the 27th International Conference on Computational Linguistics, New Mexico, 2018, pp. 1332-1334.
[34] S. Bougrine, H. Cherroun and D. Ziadi, "Prosody-Based Spoken Algerian Arabic Dialect Identification", Procedia Computer Science, vol. 128, pp. 9-17, 2018.
[35] K. S. Rao and S. G. Koolagudi, "Identification of Hindi Dialects and Emotions Using Spectral and Prosodic Features of Speech", Systemics, Cybernetics and Informatics, vol. 9, pp. 24-33, 2011.
[36] M. P. Lewis, Ed., "Ethnologue: Languages of the World, Sixteenth Edition", Dallas: SIL International, 2009.
